#Instructions

This document is a template to help you get started and will mirror the work that you will do in modules 2, 3, 4 and 5 with the Taxi Trip dataset problem.

You should save a copy of this in your Colab and change the name of the file to include your student number.

Within this document there are comments to help you along and some boilerplate code that you can adjust to get you started but the code will be very similar to that found in the practice document.

This document has the following sections and should be submitted with those in place:



*   Title
*   Introduction
*   Module 2: Get the data
*   Module 3: Basic statistics and visualisations
*   Module 4: Regression models
*   Module 5: Using the outcomes 


Enjoy and learn lots.

# Problem: Can we accurately predict the number of collisions for any given day of the week?

##Introduction

You work as a product owner for a car insurance company offering a daily insurance policy for car rentals.   

The company operates in New York and wants to price its insurance to reflect collision risk and associated costs. It wants you to explore a new feature for development that will make better predictions about this. We will use New York traffic collision data to make estimates about the number of collisions on a given day.  

For this you require weather data as there has been a link between weather and traffic collisions. The company is using data given to them by the emergency services.

Note: You will be given a file entitled collisions_and_weather_data.csv, testdata2019.csv. Due to Covid-19, all data since early 2020 has been fairly useless with respect to patterns. The company can see that the data has recently returned to full pre-pandemic levels and you will be provided data from 1st of January 2013 to 31st of December 2018 and the test data will be from 2019.   

Remember, you will have to put these files in your Google Drive.

MY THOUGHTS ON THE ASSUMPTIONS:

- We don't have data on number of journeys. Collisions is likely related to number of journeys, but we can't normalise w.r.t. this, or even test it as a hypothesis, since we don't have that data.

## Module 2: Get the data

This section contains boilerplate code. As long as you have uploaded your CSV files to your Google Drive, you can just run the cells as normal.

In [1]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly as pl
import plotly.graph_objects as go
import plotly.express as px
import datetime as dt
import tensorflow as tf

from scipy import stats
from IPython.display import display
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error



Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd




In [2]:
#set the size of our plots as they are a little small by default.
plt.rcParams["figure.figsize"] = (20,5)

In [10]:
# Custom functions and constants
custom_plasma_scale = [
    [0, px.colors.sequential.Plasma[0]],  # Start of the Plasma scale
    # Map points in between as needed; this example directly jumps to clamping
    [0.8, px.colors.sequential.Plasma[-2]],  # Roughly the 80% mark in the Plasma palette
    [1, px.colors.sequential.Plasma[-2]]  # Clamp the color scale to the color at 80%
]
OICD = 'OICD_grouped_by_week_day'
OICD_YEAR_DAY = 'OICD_grouped_by_year_day'
MED_RES = 'median_residual'
N_COLS = 'NUM_COLLISIONS'

# I have standardized a year to 366 days, so in order for day_of_year to match for both leap years and non-leap years, March onwards need to be adjusted +1 in the non-leap years.
def adjust_for_leap_year(year, month, day_of_year):
    adjustment = 0 if year % 4 == 0 and (year % 100 != 0 or year % 400 == 0) else 1
    return day_of_year + adjustment if month > 2 else day_of_year
    

# This function converts every day of the year to a value in the range 1-2562, i.e. a Cartesian product of the days of the week and days of the year, with all the Sundays first, then Saturdays, etc..
def convert_date_to_one_indexed_cycle_day_grouped_by_day_of_week(year, month, day, day_num):
    
    temp_day_num = day_num 
    date = dt.datetime(year, month, day)
    day_of_year = date.timetuple().tm_yday

    adjusted_day_of_year = adjust_for_leap_year(year, month, day_of_year) # correct so March 1st is always 61 of the year. 

    day_of_total_cycle = adjusted_day_of_year + (temp_day_num * 366)
    return day_of_total_cycle


# This function does the same, but groups by year instead of day of the week.
def convert_date_to_one_indexed_cycle_day_grouped_by_day_of_year(year, month, day, day_num):
    
    temp_day_num = day_num 
    date = dt.datetime(year, month, day)
    day_of_year = date.timetuple().tm_yday

    adjusted_day_of_year = adjust_for_leap_year(year, month, day_of_year) # correct so March 1st is always 61 of the year. 

    # Calculate day of total cycle with the adjusted day_of_year
    day_of_total_cycle = (adjusted_day_of_year + temp_day_num) + (6 * adjusted_day_of_year)

    return day_of_total_cycle



def show_correlation_matrix(df, title):
    corrMatrix = df.corr(numeric_only=True)
    fig = px.imshow(
        corrMatrix,
        width=1000,
        height=1000,
        title=title,
        color_continuous_scale='PiYG',
        zmin=-1,
        zmax=1
        )
    fig.update_yaxes(tickfont=dict(family='Century Gothic', size=14), range=[-1,1])
    fig.update_xaxes(tickfont=dict(family='Century Gothic', size=14))

    alt = "Hello"
    fig.show()


days_titles = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat' ]

# So that I can make music later on. I saved the FFT as a sample and made some ambient EDM.
def output_wav(yf):
    # Create a wave
    ifft_result = ifft(yf)
    
    # Make real first
    real_ifft_result = np.real(ifft_result)
    
    # Normalise to -1 to 1
    normalized_ifft = np.interp(real_ifft_result, (real_ifft_result.min(), real_ifft_result.max()), (-1, +1))
    
    # Ensure it's real...
    audio_signal = np.real(normalized_ifft)
    
    # Rate
    sampling_rate = 96_000 # 96kHz for Serum.
    
    # Write to WAV
    write("collisions_of_NYC.wav", sampling_rate, audio_signal.astype(np.float32))

In [13]:
# Link with your google drive
# from google.colab import drive
# drive.mount('/content/gdrive')

In [27]:
# get our collated taxi trip and weather data from google drive (or in my case my local storage) TODO Swap for the Google Cloud link before submitting.
df = pd.read_csv('D:\\Coding\\JupyterNotebookLBD\\scientificProject\\data\\LBD_New_York_collisions_and_weather_data (1).csv')
# Replace null data values with 
df.replace({9999.9: np.nan, 999.9: np.nan, 99.99: np.nan}, inplace=True)

original_df = df.copy()

print('Initial dataframe, with null data instead of max values')
display(df.head())
display(df.describe())

mean_by_day = df.groupby('day')[N_COLS].mean()
print("Mean by day:")
print(mean_by_day)

# Dropping records missing 'mxpsd' as that shows a weak correlation but has missing data in some rows. (Count: 102 rows dropped)
df = df.dropna(axis=0, subset=['mxpsd']) 
display(df.describe())

# Re-order the day_nums so the correlation is stronger
df['day'] = (df['day'] + 1) % 7 # % 7 already moves Sunday to zero-indexed 0, so adding 1 also moves Saturday.
df.loc[df['day'] < 2, 'day'] = 1 - df['day'] # Swap Saturday and Sunday because the linear experience of time is arbitrary.

# Group by 'year' and calculate the median of 'NUM_COLLISIONS', then subtract it. This allows us to effectively eliminate the correlation to year, controlling for the unknown number of total journeys which is likely to be the causative externality.
df[MED_RES] = df[N_COLS] - df.groupby('year')[N_COLS].transform('median')


df[OICD] = df.apply(lambda row: convert_date_to_one_indexed_cycle_day_grouped_by_day_of_week(row['year'], row['mo'], row['da'], row['day']), axis=1)

df[OICD_YEAR_DAY] = df.apply(lambda row: convert_date_to_one_indexed_cycle_day_grouped_by_day_of_year(row['year'], row['mo'], row['da'], row['day']), axis=1)

print('Dataframe after adding OICD columns:')
display(df.describe())
df.head()


Initial dataframe, with null data instead of max values


Unnamed: 0,day,year,mo,da,collision_date,temp,dewp,slp,visib,wdsp,mxpsd,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS
0,2,2013,1,1,01/01/2013,37.8,23.6,1011.9,10.0,6.1,8.9,19.0,39.9,33.1,0.0,,0,381
1,3,2013,1,2,02/01/2013,27.1,10.5,1016.8,10.0,5.3,9.9,19.0,33.1,21.9,0.0,,0,480
2,4,2013,1,3,03/01/2013,28.4,14.1,1020.6,10.0,3.7,8.0,15.0,32.0,24.1,0.0,,0,549
3,5,2013,1,4,04/01/2013,33.4,18.6,1017.0,10.0,6.5,13.0,24.1,37.0,30.0,0.0,,0,505
4,6,2013,1,5,05/01/2013,36.1,18.7,1020.6,10.0,6.6,12.0,21.0,42.1,32.0,0.0,,0,389


Unnamed: 0,day,year,mo,da,temp,dewp,slp,visib,wdsp,mxpsd,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS
count,2191.0,2191.0,2191.0,2191.0,2191.0,2191.0,2168.0,2159.0,2089.0,2089.0,1376.0,2191.0,2191.0,2191.0,176.0,2191.0,2191.0
mean,4.0,2015.500228,6.523962,15.726609,55.721086,41.12031,1017.225046,8.953682,4.533605,9.311776,20.205523,65.226974,47.875947,0.141031,6.427273,0.079416,602.121862
std,2.000457,1.707859,3.449207,8.800821,17.506851,19.298085,7.205239,1.563377,2.05003,3.114821,4.706593,18.15633,17.152164,0.353569,4.723467,0.270448,102.452173
min,1.0,2013.0,1.0,1.0,6.9,-16.1,992.1,1.7,0.0,2.9,14.0,17.6,-0.9,0.0,1.2,0.0,10.0
25%,2.0,2014.0,4.0,8.0,41.55,26.4,1012.6,8.45,3.1,7.0,17.1,50.0,35.1,0.0,2.0,0.0,533.0
50%,4.0,2016.0,7.0,16.0,56.9,42.6,1017.0,9.8,4.3,8.9,19.0,66.9,48.0,0.0,5.9,0.0,604.0
75%,6.0,2017.0,10.0,23.0,71.9,57.5,1021.725,10.0,5.7,11.1,22.9,82.0,64.0,0.08,9.1,0.0,670.0
max,7.0,2018.0,12.0,31.0,89.1,74.8,1042.1,10.0,15.5,24.1,40.0,98.1,82.9,4.53,22.0,1.0,1161.0


Mean by day:
day
1    603.185304
2    626.808307
3    621.057508
4    635.878594
5    673.884984
6    559.916933
7    494.121406
Name: NUM_COLLISIONS, dtype: float64


Unnamed: 0,day,year,mo,da,temp,dewp,slp,visib,wdsp,mxpsd,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS
count,2089.0,2089.0,2089.0,2089.0,2089.0,2089.0,2084.0,2077.0,2089.0,2089.0,1376.0,2089.0,2089.0,2089.0,172.0,2089.0,2089.0
mean,3.994256,2015.404978,6.364289,15.630445,56.214265,41.482001,1017.179367,8.954357,4.533605,9.311776,20.205523,65.824031,48.315223,0.139694,6.437209,0.082336,600.501675
std,1.999513,1.669737,3.382665,8.829828,17.557594,19.393204,7.149609,1.559558,2.05003,3.114821,4.706593,18.156722,17.229664,0.354351,4.749319,0.274942,99.814277
min,1.0,2013.0,1.0,1.0,6.9,-16.1,992.1,1.7,0.0,2.9,14.0,18.0,-0.9,0.0,1.2,0.0,188.0
25%,2.0,2014.0,3.0,8.0,42.1,26.7,1012.6,8.4,3.1,7.0,17.1,51.1,35.1,0.0,2.0,0.0,533.0
50%,4.0,2015.0,6.0,16.0,58.0,43.4,1016.9,9.8,4.3,8.9,19.0,68.0,50.0,0.0,5.9,0.0,603.0
75%,6.0,2017.0,9.0,23.0,72.3,57.9,1021.7,10.0,5.7,11.1,22.9,82.0,64.0,0.08,9.1,0.0,667.0
max,7.0,2018.0,12.0,31.0,89.1,74.8,1042.1,10.0,15.5,24.1,40.0,98.1,82.9,4.53,22.0,1.0,999.0


Dataframe after adding OICD columns:


Unnamed: 0,day,year,mo,da,temp,dewp,slp,visib,wdsp,mxpsd,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS,median_residual,OICD_grouped_by_week_day,OICD_grouped_by_year_day
count,2089.0,2089.0,2089.0,2089.0,2089.0,2089.0,2084.0,2077.0,2089.0,2089.0,1376.0,2089.0,2089.0,2089.0,172.0,2089.0,2089.0,2089.0,2089.0,2089.0
mean,3.000957,2015.404978,6.364289,15.630445,56.214265,41.482001,1017.179367,8.954357,4.533605,9.311776,20.205523,65.824031,48.315223,0.139694,6.437209,0.082336,600.501675,-3.164672,1277.159885,1254.667305
std,1.997124,1.669737,3.382665,8.829828,17.557594,19.393204,7.149609,1.559558,2.05003,3.114821,4.706593,18.156722,17.229664,0.354351,4.749319,0.274942,99.814277,94.913364,738.35497,724.910997
min,0.0,2013.0,1.0,1.0,6.9,-16.1,992.1,1.7,0.0,2.9,14.0,18.0,-0.9,0.0,1.2,0.0,188.0,-415.0,1.0,7.0
25%,1.0,2014.0,3.0,8.0,42.1,26.7,1012.6,8.4,3.1,7.0,17.1,51.1,35.1,0.0,2.0,0.0,533.0,-64.0,634.0,632.0
50%,3.0,2015.0,6.0,16.0,58.0,43.4,1016.9,9.8,4.3,8.9,19.0,68.0,50.0,0.0,5.9,0.0,603.0,0.0,1277.0,1248.0
75%,5.0,2017.0,9.0,23.0,72.3,57.9,1021.7,10.0,5.7,11.1,22.9,82.0,64.0,0.08,9.1,0.0,667.0,61.0,1918.0,1864.0
max,6.0,2018.0,12.0,31.0,89.1,74.8,1042.1,10.0,15.5,24.1,40.0,98.1,82.9,4.53,22.0,1.0,999.0,392.0,2561.0,2567.0


Unnamed: 0,day,year,mo,da,collision_date,temp,dewp,slp,visib,wdsp,...,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS,median_residual,OICD_grouped_by_week_day,OICD_grouped_by_year_day
0,3,2013,1,1,01/01/2013,37.8,23.6,1011.9,10.0,6.1,...,19.0,39.9,33.1,0.0,,0,381,-177.0,1099,10
1,4,2013,1,2,02/01/2013,27.1,10.5,1016.8,10.0,5.3,...,19.0,33.1,21.9,0.0,,0,480,-78.0,1466,18
2,5,2013,1,3,03/01/2013,28.4,14.1,1020.6,10.0,3.7,...,15.0,32.0,24.1,0.0,,0,549,-9.0,1833,26
3,6,2013,1,4,04/01/2013,33.4,18.6,1017.0,10.0,6.5,...,24.1,37.0,30.0,0.0,,0,505,-53.0,2200,34
4,1,2013,1,5,05/01/2013,36.1,18.7,1020.6,10.0,6.6,...,21.0,42.1,32.0,0.0,,0,389,-169.0,371,36


## Module 3: Basic statistics and visualisations

In [22]:
# df = df.sort_values(["year", "mo", "da"], ascending = (True, True, True)) # order the data by year, month, day in ascending order.
df = df.sort_values([OICD], ascending=(True)) # order by OICD
display(df.head()) # check the data again by viewing the first 5 rows


Unnamed: 0,day,year,mo,da,collision_date,temp,dewp,slp,visib,wdsp,...,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS,median_residual,OICD_grouped_by_week_day,OICD_grouped_by_year_day
1461,0,2017,1,1,01/01/2017,44.8,22.9,1017.9,10.0,5.0,...,21.0,48.0,43.0,0.0,,0,485,-151.0,1,7
1097,0,2016,1,3,03/01/2016,38.4,20.0,1011.7,10.0,6.4,...,21.0,45.0,32.0,0.0,,0,432,-203.0,3,21
733,0,2015,1,4,04/01/2015,45.4,42.3,1010.6,5.8,5.1,...,19.0,55.9,33.1,0.89,,0,381,-222.0,4,28
369,0,2014,1,5,05/01/2014,30.3,21.9,1025.6,6.4,3.8,...,15.0,36.0,27.0,0.0,5.9,1,320,-248.0,5,35
5,0,2013,1,6,06/01/2013,38.3,25.0,1019.5,8.5,5.3,...,17.1,46.0,34.0,0.0,,0,393,-165.0,6,42


In [23]:

# Initial Correlation Overview
show_correlation_matrix(df, "Collision Correlations")

def scatter_x_y(x, y): px.scatter(
    df,
    x=x,
    y=y,
    color='day',
    color_continuous_scale=custom_plasma_scale,  # Use the custom Plasma scale
    title=y + ' / '+ x).show()
    


scatter_x_y(OICD, N_COLS)
scatter_x_y(OICD, MED_RES)
scatter_x_y(OICD_YEAR_DAY, N_COLS)
scatter_x_y(OICD_YEAR_DAY, MED_RES)




As expected, by using the residuals against the year-median collisions, we have strengthened the correlation w.r.t. the periodic behaviour. The last two scatter plots show the clear decrease in linear correlation comparing the raw collisions for the OICD grouped by year, and the same ordering but against the median residuals.

We will now remove the outliers, in order to find a harmonic regression that isn't disturbed by aperiodic black swans.

In [29]:


# Get ZICD and collisions
x = df[OICD]
y = df[MED_RES]

# Do linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)

# Calculate Residuals
predicted_y = slope * x + intercept
residuals = y - predicted_y

# Determine Outlier threshold
residual_std = np.std(residuals)
mean_residual = np.mean(residuals)

# We are going to use an threshold of 2 standard deviations, which we can expect to remove the 5% furthest from the mean, assuming the samples conform to an approximate normal distribution. It is generally a reasonable assumption that car collisions are independent probability events - the exception being very large outlier accidents, but these are the unpredictable events we wish to remove anyway.
SIGMA_LIMIT = 2

outlier_upper = mean_residual + SIGMA_LIMIT * residual_std
outlier_lower = mean_residual - SIGMA_LIMIT * residual_std


cleaned_df = df[(residuals <= outlier_upper) & (residuals >= outlier_lower)]

print('Total before cleaning:')
display(df.count())
print('Total after cleaning:')
display(cleaned_df.count())
print('Our original data set for comparison:')
display(original_df.count())
print('Number of rows culled: ', len(original_df) - len(cleaned_df))



linear_reg_cleaned_fig = px.scatter(
    cleaned_df,
    x=OICD,
    y=MED_RES,
    color='day',
    color_continuous_scale=custom_plasma_scale,  # Use the custom Plasma scale
    title=MED_RES + ' / ' + OICD + ' - Cleaned via ' + str(SIGMA_LIMIT) + 'sigma' + ' linear regression'
)
    
linear_reg_cleaned_fig.show()

show_correlation_matrix(cleaned_df, "Collision Outliers removed")


# Assuming `df` is your initial DataFrame
# Shuffle the DataFrame rows
df_shuffled = cleaned_df.sample(frac=1, random_state=42).reset_index(drop=True)


# Calculate the number of rows to cull
num_to_cull = int(len(df) * 0.2)

# Split the data
test_df = df_shuffled[:num_to_cull]  # This will be your test set (20% of the data)
train_df = df_shuffled[num_to_cull:]  # This will be your training set (80% of the data)
display(df_shuffled.head())
display(train_df.head())



Total before cleaning:


day                         2089
year                        2089
mo                          2089
da                          2089
collision_date              2089
temp                        2089
dewp                        2089
slp                         2084
visib                       2077
wdsp                        2089
mxpsd                       2089
gust                        1376
max                         2089
min                         2089
prcp                        2089
sndp                         172
fog                         2089
NUM_COLLISIONS              2089
median_residual             2089
OICD_grouped_by_week_day    2089
OICD_grouped_by_year_day    2089
dtype: int64

Total after cleaning:


day                         1987
year                        1987
mo                          1987
da                          1987
collision_date              1987
temp                        1987
dewp                        1987
slp                         1982
visib                       1975
wdsp                        1987
mxpsd                       1987
gust                        1295
max                         1987
min                         1987
prcp                        1987
sndp                         160
fog                         1987
NUM_COLLISIONS              1987
median_residual             1987
OICD_grouped_by_week_day    1987
OICD_grouped_by_year_day    1987
dtype: int64

Our original data set for comparison:


day               2191
year              2191
mo                2191
da                2191
collision_date    2191
temp              2191
dewp              2191
slp               2168
visib             2159
wdsp              2089
mxpsd             2089
gust              1376
max               2191
min               2191
prcp              2191
sndp               176
fog               2191
NUM_COLLISIONS    2191
dtype: int64

Number of rows culled:  204


Unnamed: 0,day,year,mo,da,collision_date,temp,dewp,slp,visib,wdsp,...,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS,median_residual,OICD_grouped_by_week_day,OICD_grouped_by_year_day
0,4,2013,8,7,07/08/2013,73.4,60.5,1019.9,10.0,4.6,...,19.0,82.0,64.0,0.0,,0,508,-50.0,1684,1544
1,0,2017,6,4,04/06/2017,63.3,45.2,1015.1,10.0,1.6,...,,73.0,55.0,0.01,,0,468,-168.0,156,1092
2,5,2015,4,16,16/04/2015,58.0,21.0,1029.4,10.0,6.2,...,15.9,72.0,51.1,0.0,,0,560,-43.0,1937,754
3,3,2016,6,7,07/06/2016,76.4,56.0,1000.6,8.7,4.4,...,15.9,84.9,64.9,0.0,,0,709,74.0,1257,1116
4,6,2018,2,16,16/02/2018,55.3,49.2,1004.6,8.4,5.0,...,24.1,62.1,43.0,0.17,,0,582,-45.0,2243,335


Unnamed: 0,day,year,mo,da,collision_date,temp,dewp,slp,visib,wdsp,...,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS,median_residual,OICD_grouped_by_week_day,OICD_grouped_by_year_day
417,4,2014,4,2,02/04/2014,47.3,35.6,1018.6,9.8,1.9,...,,60.1,39.0,0.0,,0,519,-49.0,1557,655
418,4,2014,5,7,07/05/2014,59.1,28.5,1021.0,10.0,3.2,...,17.1,71.1,50.0,0.0,,0,603,35.0,1592,900
419,3,2015,9,22,22/09/2015,66.8,47.6,1024.2,10.0,6.4,...,15.0,73.0,59.0,0.0,,0,571,-32.0,1364,1865
420,6,2015,7,31,31/07/2015,80.5,61.4,1010.5,8.6,3.3,...,15.0,89.1,72.0,1.95,,0,753,150.0,2409,1497
421,2,2016,7,25,25/07/2016,82.3,69.2,1013.2,7.3,3.3,...,27.0,93.9,75.0,0.0,,1,698,63.0,939,1451


It is clear from looking at the scatter plots of both OICD and OICD_grouped_by_year that alongside the weekly cycle of collision numbers, there are other periodic functions. Some obvious human factors to consider are: academic or public holidays. How busy are people in their daily lives and are they sticking to routine or making extra, unusual journeys? It doesn't matter specifically what these are: it matters that people deviate from their regular patterns and this is usually accompanied by increased risk. The weather aspect also is periodic - we normally call these "seasons"! If the weather has a effect on the number of collisions, this will show a predictable periodic aspect, as the weather is not itself random overall.

To investigate these phenomena, we will take a FFT of the cleaned data we have so far, and extract any significant signals from it.  

In [31]:


from scipy.fft import fft, ifft, fftfreq
from scipy.interpolate import interp1d
from scipy.io.wavfile import write

train_df = train_df.sort_values(by=OICD)

display(train_df.head())

clean_x = train_df[OICD]
clean_y = train_df[MED_RES]

# Since this is now an obvious linear correlation w.r.t. day of the week (OICD), we will find this and removed it from the signal
# Do linear regression
clean_slope, clean_intercept, clean_r_value, clean_p_value, clean_std_err = stats.linregress(clean_x,clean_y)

# Calculate Residuals from this regression.
clean_predicted_y = clean_slope * clean_x + clean_intercept
clean_residuals = clean_y - clean_predicted_y


Unnamed: 0,day,year,mo,da,collision_date,temp,dewp,slp,visib,wdsp,...,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS,median_residual,OICD_grouped_by_week_day,OICD_grouped_by_year_day
1306,0,2016,1,3,03/01/2016,38.4,20.0,1011.7,10.0,6.4,...,21.0,45.0,32.0,0.0,,0,432,-203.0,3,21
834,0,2015,1,4,04/01/2015,45.4,42.3,1010.6,5.8,5.1,...,19.0,55.9,33.1,0.89,,0,381,-222.0,4,28
1488,0,2013,1,6,06/01/2013,38.3,25.0,1019.5,8.5,5.3,...,17.1,46.0,34.0,0.0,,0,393,-165.0,6,42
1407,0,2016,1,10,10/01/2016,50.4,45.9,1004.8,5.3,8.2,...,26.0,59.0,39.9,0.77,,0,462,-173.0,10,70
1106,0,2014,1,12,12/01/2014,47.6,33.8,1005.0,9.7,7.2,...,25.1,57.9,37.0,0.42,,0,388,-180.0,12,84


We now have two separate compenents: the simple linear regression w.r.t. OICD, and some noisy-looking periodic components.

In [None]:

px.line(x=clean_x, y=clean_predicted_y, title="Linear regression line").show()
px.scatter(x=clean_x, y=clean_residuals, title="Cleaned residuals w.r.t. linear regression").show()

regular_x = np.arange(clean_x.min(),clean_x.max()+1,1)

f = interp1d(clean_x, clean_residuals, kind='linear')
regular_y = f(regular_x)

regular_line = interp1d(clean_x, clean_predicted_y, kind='linear')
regular_y_line = f(regular_x)

px.line(x=regular_x, y=regular_y_line, title="Periodic function regression").show()

# Plot the interpolated samples
px.line(df, x=regular_x, y=regular_y, title="Interpolated samples").show()

From the periodic signal data, we will take the FFT and plot the signal amplitude against frequency.

In [36]:

# Number of samples
N = len(regular_y)

# Spacing interpolated to be 1/366
T = 1 / 2562


yf = fft(regular_y)
xf = fftfreq(N, T)[0:N//2]
wavelength_in_days = 1/fftfreq(N, T)[1:N//2]

# magnitude = 2.0/N * np.abs(yf[1:N//2])
magnitude_with_first_value = 2.0/N * np.abs(yf[0:N//2])
# transform_plot = px.line(x=wavelength_in_days,y=magnitude, title="FFT Magnitude For Wavelength")
# transform_plot.update_layout(xaxis_title="Wavelength in days", yaxis_title="Magnitude")
# transform_plot.show()

px.line(x=xf, y = magnitude_with_first_value, title="FFT Magnitude for Frequency").show()



<h1>How do we interpret this plot?</h1>
<ol>
<li>It is indeed noisy, but there are some obvious spikes at the low frequency end</li>
<li>The length of our defined cycle is 2562, so dividing this number by any frequency gives us the wavelength in days that signal
<li>These signals are noteworthily significant: 
<ul><li>2562 / 7 = 366, i.e. a year*.</li><li>2562 / 14 = i.e. 6 months</li> <li>2562 / 56 = every 45 days, or 6 weeks. Could be related to academic semesters.</li> <li>2562 / 105 = every 24 days, or 3 weeks.</li>   
</ol>

<p> These are off-the-cuff human, or seasonal, phenomena that I'm suggesting, to give a concrete basis for whether these signals are plausible. Again: it isn't necessary to answer the <strong>why</strong> at this stage, we are looking for the <strong>what</>.</p>

<p> We will now clean this signal using standard time-domain noise threshold approaches, in order to extract a simplified,  <em>smoothed out</em>, model to reinforce the linear regression we have bookmarked.

*(Remember that we are treating every year as if it were a leap year, because people don't plan according to a date being "the 61st day of the year", they plan for it to be March 1st) 

In [39]:
cut_off_frequency = 1200
cut_off_index = int(cut_off_frequency * N / T)

yf[cut_off_index:-cut_off_index] = 0

#  yf is the FFT of the data and xf is the frequency bins
magnitude = np.abs(yf)


# These three operations are defined as functions, so that later I can loop through them incrementing the parameters to find the optimum noise threshold for extracting a signal that fits the data.

# Top level function
def find_r2(threshold_factor):
    predicted_regular_y, prediction_df = create_prediction_df(threshold_factor)

    linear_prediction_df = pd.DataFrame({
        OICD: regular_x,
        'Prediction': predicted_regular_y
    })
    
    linear_mse, r2_linear, mae_linear = compare_pred_to_df(linear_prediction_df)

    mse, r2, mae = compare_pred_to_df(prediction_df)

    return np.array([threshold_factor, mse, r2, mae, linear_mse, r2_linear, mae_linear])

# Function to compare the model to the real data
def compare_pred_to_df(prediction_df):
    # Merge true values with modulated line
    test_predictions = pd.merge(test_df[[OICD, MED_RES]], prediction_df, on=OICD, how='left')
    # Drop rows where 'NUM_COLLISIONS' or 'Prediction' is NaN
    test_predictions_clean = test_predictions.dropna(subset=[MED_RES, 'Prediction'])
    y_true = test_predictions_clean[MED_RES]
    y_pred = test_predictions_clean['Prediction']
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    return mse, r2, mae

# Function to extract the model using a given noise threshold.
def create_prediction_df(threshold_factor):

    # Identify significant frequencies based on a threshold
    threshold = magnitude.max() * threshold_factor  # Example threshold: 10% of max amplitude8
    significant_indices = np.where(magnitude >= threshold)[0]

    # Create a filtered version of the FFT results
    filtered_yf = np.zeros_like(yf)
    filtered_yf[significant_indices] = yf[significant_indices]

    # Inverse FFT to get the significant cyclical signal
    significant_signal = np.real(ifft(filtered_yf))
    predicted_regular_y = clean_slope * regular_x + clean_intercept
 
    # Combine with linear regression predictions
    combined_prediction = predicted_regular_y + significant_signal

    prediction_df = pd.DataFrame({
        OICD: regular_x,
        'Prediction': combined_prediction
    })
    return predicted_regular_y, prediction_df


outputs = []
# From prior experimentation I've picking a starting value of 0.19 for the noise threshold, as the outcome drops below this.
# For clarification: we're dropping from our FFT any signal that has a magnitude less than (strongest * threshold) 
baseline = 0.19

for factor in range(1000):
    outputs.append(find_r2(baseline+factor/10000))
    
r2_values = np.array(outputs)[:,2]
mae_values = np.array(outputs)[:,3]
threshold_used = np.array(outputs)[:,0]
linear_r2 = np.array(outputs)[:,5]
linear_mae = np.array(outputs)[:,6]

# We put all the results in a dataframe, so we can pick a winner.
df_best_threshold = pd.DataFrame({
    'Threshold Used': threshold_used,
    'R2': r2_values,
    'MAE': mae_values,
    'Linear_R2': linear_r2,
    'Linear_MAE': linear_mae
})

fig = go.Figure()

fig_mae = go.Figure()

fig.add_trace(go.Scatter(x=df_best_threshold['Threshold Used'], y=df_best_threshold['R2'], mode='lines', name='R2 vs Threshold'))
fig.add_trace(go.Scatter(x=df_best_threshold['Threshold Used'], y=df_best_threshold['Linear_R2'], mode='lines', name='Linear_R2 vs Threshold'))
fig_mae.add_trace(go.Scatter(x=df_best_threshold['Threshold Used'], y=df_best_threshold['MAE'], mode='lines', name='MAE vs Threshold'))
fig_mae.add_trace(go.Scatter(x=df_best_threshold['Threshold Used'], y=df_best_threshold['Linear_MAE'], mode='lines', name='Linear_MAE vs Threshold'))
fig.show()
fig_mae.show()


<html >
  <head>


    
  </head>
  <body>
    

<p>
My original outcome here was:
</p>
<ul>
<li>Mean Squared Error (MSE): 4170.4539197977165</li>
<li>R-squared: 0.5029024872423475</li>
</ul>
<p>

This was without correcting for leap years in the OICD calculation (i.e. I wasn't ensuring that March 1st was always day 61 of the year), and without removing the general year-on-year increase by using year_median residuals (the MED_RES column).

So, this is a small but meaningful increase in the predictive strength by making these changes.

I also found by trial and error* that 2sigma was the degree of outlier cleaning to get the strongest harmonic regression. Beyond that and black swans stole my lunch; below that, and we lost some outliers that seem to have a periodic quality to them.

Finally, the above plot is the R-squared value of the range of signal thresholds shown on the x-axis. The optimal threshold for the "noise-gate":
<ul>
<li>0.238, which gives an R-squared value of 0.61 and 
<li>MAE of 39.7</li> 
</ul>

We will now merge this periodic signal with the simple linear regression that we removed beforehand, and input that into the dataframe so we can look at its correlation strength with the actual data. We will also find the residuals w.r.t. this model, so we can see what (if any) other data points might be valuable to explore the remaining 40% of feature variance.

*I did consider creating a meta loop to generate the cartesian product of sigma-outlier values and noise-gate thresholds, but on balance that seems to fall into over-engineering/over-fitting.
</p>

  </body>
</html>


In [41]:
predicted_regular_y, df_best_threshold  = create_prediction_df(0.238)
df_best_threshold.head()
df_with_periodic_prediction = pd.merge(df_best_threshold[['Prediction', OICD]], df, on=OICD, how='left')
df_with_periodic_prediction.dropna(axis=0, how='any', subset='temp', inplace=True)
df_with_periodic_prediction['periodic_residual'] = df_with_periodic_prediction['median_residual']-df_with_periodic_prediction['Prediction']
df_with_periodic_prediction.head()

show_correlation_matrix(df_with_periodic_prediction, title='Now with periodic-based residuals')

If I've understood this correctly, then there is <em>no significant additional predictive power</em> gained by considering the weather. That is not to say the weather doesn't influence collisions: it may do, but weather is inherently seasonal, as are many aspects of human behaviour. By controlling for the established periodic cycle of the number of collisions (across the 2562 OICD range), we find no further significant correlation with the weather data. Put another way, if rain is forecast on a specific day in December, that is less important than the general prediction we can already make from the given day in December. We know it rains in December The same is true of the other weather factors.

Three more observations can be made: 
1. The correlation between the periodic-based predication, and the number of collisions is the strongest in the matrix (0.65), suggesting the additional work in of doing the FFT was justified.
2. The periodic residual is moderately correlated with the NUM_COLLISIONS (0.67), which implies the logical scenario that the more collisions there are on a given day, the more likely it is that this is an unusual day.
3. The predicted value has itself a correlation with the weather factors, similar to the original dataset.
 
This last point is especially important: it implies that the weather is not directly a strong cause of collisions. The collisions change according to the time of year (for a number of reasons), and the weather also changes according to the time of the year. Although it is tempting to make mechanistic hypotheses like "wet roads make braking harder", the data doesn't support the assertion that even a perfect weather forecast would be of significant value in predicting the number of collisions. Since we can't even have a perfect weather forecast, we are absolutely justified in ignoring it.





## Module 4: Regression models

In [172]:

from tensorflow import keras
from tensorflow.keras import layers

one_input_data = [df[OICD], df[MED_RES]] # array of only the ZICD and collisions
headers = [OICD, MED_RES] # Titles of input and output
df_one_input = pd.concat(one_input_data, axis=1, keys=headers) # Make the DF from the data and headers
df_one_input.head()



Unnamed: 0,ZICD_grouped_by_week_day,median_residual
0,1099,-177.0
1,1466,-78.0
2,1833,-9.0
3,2200,-53.0
4,371,-169.0


In [173]:
def train_test_split(df_input):
    train_data_output = df_input.sample(frac=0.8, random_state=0)
    test_data_output = df_input.drop(train_data.index)
    return train_data_output, test_data_output

train_data, test_data = train_test_split(df_one_input)

train_data.describe()

Unnamed: 0,ZICD_grouped_by_week_day,median_residual
count,1671.0,1671.0
mean,1286.686415,-2.974267
std,739.770574,95.750142
min,1.0,-415.0
25%,642.0,-63.5
50%,1295.0,0.0
75%,1937.5,62.0
max,2561.0,392.0


In [174]:
scale_factor_labels = 500

def get_features_and_labels(train_data_i, test_data_i):
    train_features_o = train_data_i.copy()
    test_features_o = test_data_i.copy()
    
    train_labels_o = train_features_o.pop(MED_RES)
    test_labels_o = test_features_o.pop(MED_RES)
    
    train_labels_o = train_labels_o / scale_factor_labels
    test_labels_o = test_labels_o / scale_factor_labels
    
    return train_features_o.astype(float), test_features_o.astype(float), train_labels_o.astype(float), test_labels_o.astype(float)

train_features, test_features, train_labels, test_labels = get_features_and_labels(train_data, test_data)

normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.array(train_features))
first = np.array(train_features[:1])

with np.printoptions(precision=2, suppress=True):
    print('First example:', first)
    print()
    print('Normalized: ', normalizer(first).numpy()) 

First example: [[2065.]]

Normalized:  [[1.05]]


In [175]:
zicd = np.array(train_features[OICD])
zicd_normalizer = layers.Normalization(input_shape=[1,], axis=None)
zicd_normalizer.adapt(zicd)

zicd_model = tf.keras.Sequential([
    zicd_normalizer,
    layers.Dense(units=1)
])
zicd_model.summary()

Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 normalization_42 (Normaliz  (None, 1)                 3         
 ation)                                                          
                                                                 
 dense_25 (Dense)            (None, 1)                 2         
                                                                 
Total params: 5 (24.00 Byte)
Trainable params: 2 (8.00 Byte)
Non-trainable params: 3 (16.00 Byte)
_________________________________________________________________


In [176]:
zicd_model.predict(zicd[:10])



array([[-0.39629978],
       [ 0.30585635],
       [-0.20332143],
       [-0.04140289],
       [ 0.61339974],
       [-0.14323844],
       [ 0.29363605],
       [ 0.19332805],
       [-0.643251  ],
       [-0.22012429]], dtype=float32)

In [177]:
zicd_model.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error'
)

In [178]:
%%time
history = zicd_model.fit(
    train_features[OICD],
    train_labels,
    epochs=100,
    verbose=0,
    validation_split=0.2
)

CPU times: total: 7.67 s
Wall time: 5.87 s


In [179]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

Unnamed: 0,loss,val_loss,epoch
95,0.128679,0.124213,95
96,0.125045,0.118899,96
97,0.127821,0.126636,97
98,0.129333,0.117318,98
99,0.128434,0.118627,99


In [180]:
def plot_loss(history):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=history['epoch'], y=history['loss'], mode='lines', name='voss vs epoch'))
    fig.add_trace(go.Scatter(x=history['epoch'], y=history['val_loss'], mode='lines', name='val_loss vs epoch'))
    fig.show()
    
plot_loss(hist)

In [181]:

test_results = {}

test_results['ZICD_model'] = zicd_model.evaluate(
    test_features[OICD],
    test_labels, verbose=0
)


def plot_ZICD(x_input, y_input):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=x_input, mode='markers', y=y_input,  name='Predictions'))
    fig.add_trace(go.Scatter(x=train_features[OICD], mode='markers', y=train_labels, name='Data'))
    fig.show()

x = tf.linspace(0,2561, 2562)
y = zicd_model.predict(x)


    
x_flat = x.numpy()
y_flat = y.flatten()

plot_ZICD(x_flat, y_flat)
    



In [182]:
y_scaled = y_flat*scale_factor_labels

ml_pred_df = pd.DataFrame({
    OICD: x_flat,
    'Prediction': y_scaled
})

compare_pred_to_df(ml_pred_df)

(4398.836141380348, 0.396141461816485)

In [183]:
df.head()

Unnamed: 0,day,year,mo,da,collision_date,temp,dewp,slp,visib,wdsp,...,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS,median_residual,ZICD_grouped_by_week_day,ZICD_grouped_by_year_day
0,3,2013,1,1,01/01/2013,37.8,23.6,1011.9,10.0,6.1,...,19.0,39.9,33.1,0.0,,0,381,-177.0,1099,10
1,4,2013,1,2,02/01/2013,27.1,10.5,1016.8,10.0,5.3,...,19.0,33.1,21.9,0.0,,0,480,-78.0,1466,18
2,5,2013,1,3,03/01/2013,28.4,14.1,1020.6,10.0,3.7,...,15.0,32.0,24.1,0.0,,0,549,-9.0,1833,26
3,6,2013,1,4,04/01/2013,33.4,18.6,1017.0,10.0,6.5,...,24.1,37.0,30.0,0.0,,0,505,-53.0,2200,34
4,1,2013,1,5,05/01/2013,36.1,18.7,1020.6,10.0,6.6,...,21.0,42.1,32.0,0.0,,0,389,-169.0,371,36


In [184]:
df_one_hot = df.copy()
df_one_hot['day'] = df_one_hot['day'].map({0: 'Zero', 1: 'One', 2: 'Two', 3: 'Three', 4: 'Four', 5: 'Five', 6: 'Six'})
df_one_hot = pd.get_dummies(df_one_hot, columns=['day'], prefix='', prefix_sep='')
df_one_hot.head()

Unnamed: 0,year,mo,da,collision_date,temp,dewp,slp,visib,wdsp,mxpsd,...,median_residual,ZICD_grouped_by_week_day,ZICD_grouped_by_year_day,Five,Four,One,Six,Three,Two,Zero
0,2013,1,1,01/01/2013,37.8,23.6,1011.9,10.0,6.1,8.9,...,-177.0,1099,10,False,False,False,False,True,False,False
1,2013,1,2,02/01/2013,27.1,10.5,1016.8,10.0,5.3,9.9,...,-78.0,1466,18,False,True,False,False,False,False,False
2,2013,1,3,03/01/2013,28.4,14.1,1020.6,10.0,3.7,8.0,...,-9.0,1833,26,True,False,False,False,False,False,False
3,2013,1,4,04/01/2013,33.4,18.6,1017.0,10.0,6.5,13.0,...,-53.0,2200,34,False,False,False,True,False,False,False
4,2013,1,5,05/01/2013,36.1,18.7,1020.6,10.0,6.6,12.0,...,-169.0,371,36,False,False,True,False,False,False,False


In [185]:
df_one_hot['mo'] = df_one_hot['mo'].map({1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'})
df_one_hot = pd.get_dummies(df_one_hot, columns=['mo'], prefix='', prefix_sep='')
df_one_hot.head()

Unnamed: 0,year,da,collision_date,temp,dewp,slp,visib,wdsp,mxpsd,gust,...,Dec,Feb,Jan,Jul,Jun,Mar,May,Nov,Oct,Sep
0,2013,1,01/01/2013,37.8,23.6,1011.9,10.0,6.1,8.9,19.0,...,False,False,True,False,False,False,False,False,False,False
1,2013,2,02/01/2013,27.1,10.5,1016.8,10.0,5.3,9.9,19.0,...,False,False,True,False,False,False,False,False,False,False
2,2013,3,03/01/2013,28.4,14.1,1020.6,10.0,3.7,8.0,15.0,...,False,False,True,False,False,False,False,False,False,False
3,2013,4,04/01/2013,33.4,18.6,1017.0,10.0,6.5,13.0,24.1,...,False,False,True,False,False,False,False,False,False,False
4,2013,5,05/01/2013,36.1,18.7,1020.6,10.0,6.6,12.0,21.0,...,False,False,True,False,False,False,False,False,False,False


In [186]:
dnn_many_input_data = [df_one_hot['One'], df_one_hot['Two'], df_one_hot['Three'], df_one_hot['Four'], df_one_hot['Five'], df_one_hot['Six'], df_one_hot['Zero'], df_one_hot['Jan'], df_one_hot['Feb'], df_one_hot['Mar'], df_one_hot['Apr'], df_one_hot['May'], df_one_hot['Jun'], df_one_hot['Jul'], df_one_hot['Aug'], df_one_hot['Sep'], df_one_hot['Oct'], df_one_hot['Nov'], df_one_hot['Dec'], df_one_hot[MED_RES]]

dnn_with_weather = [df_one_hot['One'], df_one_hot['Two'], df_one_hot['Three'], df_one_hot['Four'], df_one_hot['Five'], df_one_hot['Six'], df_one_hot['Zero'], df_one_hot['Jan'], df_one_hot['Feb'], df_one_hot['Mar'], df_one_hot['Apr'], df_one_hot['May'], df_one_hot['Jun'], df_one_hot['Jul'], df_one_hot['Aug'], df_one_hot['Sep'], df_one_hot['Oct'], df_one_hot['Nov'], df_one_hot['Dec'], df_one_hot[MED_RES], df_one_hot['temp'], df_one_hot['dewp'], df_one_hot['slp'], df_one_hot['visib'], df_one_hot['wdsp'], df_one_hot['mxpsd'], df_one_hot['max'], df_one_hot['min'], df_one_hot['prcp'], df_one_hot['fog']]
headers = ['Zero', 'One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', MED_RES]
headers_with_weather = ['Zero', 'One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', MED_RES, 'temp', 'dewp', 'slp', 'visib', 'wdsp', 'mxpsd', 'max', 'min', 'prcp', 'fog']
df_dnn_many_input_data = pd.concat(dnn_many_input_data, axis=1, keys=headers)

df_dnn_weather= pd.concat(dnn_with_weather, axis=1, keys=headers_with_weather)

df_dnn_weather.dropna(axis=1, inplace=True)
df_dnn_weather.head()

Unnamed: 0,Zero,One,Two,Three,Four,Five,Six,Jan,Feb,Mar,...,Dec,median_residual,temp,dewp,wdsp,mxpsd,max,min,prcp,fog
0,False,False,True,False,False,False,False,True,False,False,...,False,-177.0,37.8,23.6,6.1,8.9,39.9,33.1,0.0,0
1,False,False,False,True,False,False,False,True,False,False,...,False,-78.0,27.1,10.5,5.3,9.9,33.1,21.9,0.0,0
2,False,False,False,False,True,False,False,True,False,False,...,False,-9.0,28.4,14.1,3.7,8.0,32.0,24.1,0.0,0
3,False,False,False,False,False,True,False,True,False,False,...,False,-53.0,33.4,18.6,6.5,13.0,37.0,30.0,0.0,0
4,True,False,False,False,False,False,False,True,False,False,...,False,-169.0,36.1,18.7,6.6,12.0,42.1,32.0,0.0,0


In [187]:
# Many input no weather
train_data_many_input, test_data_many_input = train_test_split(df_dnn_many_input_data)
train_features_many_input, test_features_many_input, train_labels_many_input, test_labels_many_input  = get_features_and_labels(train_data_many_input, test_data_many_input)

# Including the weather
train_data_we, test_data_we = train_test_split(df_dnn_weather)
train_features_we, test_features_we, train_labels_we, test_labels_we  = get_features_and_labels(train_data_we, test_data_we)

print(train_features_we)

def build_and_compile_model(norm):
    model = keras.Sequential([
        norm,
        layers.Dense(64, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(1),
    ])
    
    model.compile(loss='mean_absolute_error', optimizer=tf.keras.optimizers.Adam(0.001))
    
    return model

normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.array(train_features_many_input))

first = np.array(train_features_many_input[:1])

with np.printoptions(precision=2, suppress=True):
    print('First example:', first)
    print()
    print('Normalized:', normalizer(first).numpy())
    
     
normalizer_we = tf.keras.layers.Normalization(axis=-1)
print(train_features_we.isna().sum())
print(np.isinf(train_features_we).sum())

sample_data = np.random.rand(100,5)

normalizer_we.adapt(np.array(train_features_we)) # this line is broken

dnn_model = build_and_compile_model(normalizer)
dnn_model.summary()

dnn_model_weather = build_and_compile_model(normalizer_we)
dnn_model_weather.summary()

      Zero  One  Two  Three  Four  Five  Six  Jan  Feb  Mar  ...  Nov  Dec  \
233    0.0  0.0  0.0    0.0   1.0   0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
683    1.0  0.0  0.0    0.0   0.0   0.0  0.0  0.0  0.0  0.0  ...  1.0  0.0   
1681   0.0  0.0  0.0    1.0   0.0   0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
1729   0.0  0.0  1.0    0.0   0.0   0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
810    0.0  0.0  0.0    0.0   0.0   0.0  1.0  0.0  0.0  1.0  ...  0.0  0.0   
...    ...  ...  ...    ...   ...   ...  ...  ...  ...  ...  ...  ...  ...   
59     0.0  0.0  0.0    0.0   0.0   1.0  0.0  0.0  0.0  1.0  ...  0.0  0.0   
1129   0.0  0.0  0.0    0.0   1.0   0.0  0.0  0.0  1.0  0.0  ...  0.0  0.0   
2023   0.0  0.0  1.0    0.0   0.0   0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0   
1088   0.0  0.0  0.0    0.0   0.0   1.0  0.0  0.0  0.0  0.0  ...  0.0  1.0   
1532   0.0  1.0  0.0    0.0   0.0   0.0  0.0  0.0  0.0  1.0  ...  0.0  0.0   

      temp  dewp  wdsp  mxpsd   max   min  prcp  fog  
233   75

In [188]:
%%time
history_dnn = dnn_model.fit(
    train_features_many_input,
    train_labels_many_input,
    epochs=100,
    verbose=0,
    validation_split=0.2
)

CPU times: total: 8.81 s
Wall time: 6.61 s


In [189]:
%%time
history_dnn_weather = dnn_model_weather.fit(
    train_features_we,
    train_labels_we,
    epochs=100,
    verbose=0,
    validation_split=0.2
)

CPU times: total: 9.22 s
Wall time: 6.7 s


In [190]:
hist_dnn = pd.DataFrame(history_dnn.history)
hist_dnn['epoch'] = history_dnn.epoch
plot_loss(hist_dnn)

In [191]:
hist_dnn_we = pd.DataFrame(history_dnn_weather.history)
hist_dnn_we['epoch'] = history_dnn_weather.epoch
plot_loss(hist_dnn_we)

In [215]:
test_results['dnn_model'] = dnn_model.evaluate(test_features_many_input, test_labels_many_input, verbose=0) * scale_factor_labels
pd.DataFrame(test_results, index=['Mean absolute error '+MED_RES]).T

Unnamed: 0,Mean absolute error median_residual
ZICD_model,0.115769
dnn_model,47.376242
dnn_weather,0.11819


In [216]:
test_results['dnn_weather'] = dnn_model_weather.evaluate(test_features_we, test_labels_we, verbose=0) * scale_factor_labels
pd.DataFrame(test_results, index=['Mean absolute error ' + MED_RES]).T

Unnamed: 0,Mean absolute error median_residual
ZICD_model,0.115769
dnn_model,47.376242
dnn_weather,59.0952


# Module 5: Using the outcomes

In this section you want to use the test data to test what kind of money you will potentially make. 

Your company rents cars daily to people in New York City and is struggling in a saturated market. You have noted that you offer a flat rate damage waiver insurance package to all customers and that most customers chose not to take it. This package is something that has the potential to make the company lots of money if marketed properly.

At the moment you offer the package for a fee of 30 dollars per day, with only around 30% of all customers taking it. You rent on average 20,000 vehicles per day and therefore this package makes the company 180,000 dollars. The damage caused by collisions costs on average 500 dollars per collision with 8% of customers encountering a collision of some kind resulting in damage. The total costs from damage come to 800,000 dollars, which is covered by the customers' insurance, but around 10% of this is covered by the company due to fradulent behaviour or customers taking the waiver. This results in a profit of around 100,000 dollars per day for the sale of this package alone. 

This 30 dollars is based on an expected 1,200 collisions per day (based on the maximum).

The goal of this investigation is to accurately predict the number of expected collisions on a given day in order to reduce the price of the on-demand package and therefore give value to the customer. Surveys have shown that a competitive price would result in 80% of respondents taking the damage waiver insurance option – but the price must reflect the associated costs.

In [None]:
df_2019_test_data = pd.read_csv('gdrive/My Drive/testdata2019.csv')

In [None]:
df_2019_test_data = df_2019_test_data.sort_values(["year", "mo", "da"], ascending = (True, True, True))

In [None]:
df_2019_test_data.head()

Unnamed: 0,day,year,mo,da,collision_date,temp,dewp,slp,visib,wdsp,mxpsd,gust,max,min,prcp,sndp,fog,NUM_COLLISIONS
0,2,2019,1,1,2019-01-01,50.5,43.2,1009.8,7.0,999.9,999.9,999.9,57.9,36.0,1.08,999.9,0,430
1,3,2019,1,2,2019-01-02,38.0,23.2,1024.2,10.0,999.9,999.9,999.9,57.9,35.1,0.06,999.9,0,502
2,4,2019,1,3,2019-01-03,41.1,29.4,1015.8,9.9,999.9,999.9,999.9,44.1,35.1,0.0,999.9,0,504
3,5,2019,1,4,2019-01-04,39.7,26.4,1014.8,9.9,999.9,999.9,999.9,46.0,35.1,0.0,999.9,0,598
4,6,2019,1,5,2019-01-05,44.2,41.0,1003.3,5.3,999.9,999.9,999.9,46.9,35.1,0.22,999.9,0,455
