#### CMSE 202 Final Project
### &#9989; Katherine Perry, Eric Ropeta, Emma Herrera, Yuhan Zhu
### &#9989; Section_003
#### &#9989; April 17, 2020

# Modeling Traffic

## The Problem

&emsp; Automobiles are one of the most ubiquitous and important methods of transportation in the world. Thus, any strides we can take to improve the efficiency and safety of traffic are crucially important. Many problems plague the automobile world: traffic jams, accidents, poor road design, pedestrian integration, etc. In 2019 in the United States, almost 40,000 people died due to traffic collisions and 4.4 million people were seriously injured (1). This is a huge number of people in one year and is clearly a significant societal issue. Additionally, each of us has experienced frustrations with accidents or the flow of traffic in our lives, and we seek to understand why certain patterns and problems with traffic flow occur.

&emsp; New York City is the largest city in the US and accordingly has a lot of vehicles on the road. With personal cars, taxis, motorcycles, bikes, buses and more, the city's traffic is notoriously bad. Traffic collisions kill around 200 people in the city each year, and 44,508 crashes resulted in serious injuries in 2018 (2). Clearly, traffic collisions are a serious problem in NYC.

&emsp; Many factors can have a significant effect on the flow of traffic: number of cars on the road, the speed limit, the time of day, the type of intersection, the safety level of the drivers, and more. We plan to model traffic flow and collision characteristics based on different factors. We will specifically utilize New York City as a case study due to an abundance of data and its traffic issues. Our model can be applied to compare different civil engineering strategies and predict areas and times of danger or problematic traffic flow. 

# Question:
### How can we predict when traffic collisions will occur and what factors are most influential?

## Methodology:  
### Overview:
_Model:_
&emsp; In this project, we used an Artificial Neural Network to examine the traffic flow and collisions in New York City. This model is appropriate for the data because we have features and labels for each collision. Using the features (location, traffic flow, year) to predict the labels (time of day and day of the week) will enable us to gain insight into when accidents are most likely to occur in different places.

_Testing:_
&emsp; We will test our artificial neural network by comparing the predicted labels to the actual labels for time of day and day of the week. We will calculate the mean squared error of the predictions, and adjust our model to minimize this error. Then we will compare the error for time of day and day of week in order to find patterns in the collisions and discover which label is most affected or predictable given certain input parameters.
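&emsp; As a minimal sketch of the error measure described above, the mean squared error is the average of the squared differences between the actual and predicted labels. The function name `mean_squared_error_manual` and the sample values are illustrative assumptions and are not taken from our collision data; in the notebook itself we rely on sklearn for this calculation.

```python
import numpy as np

def mean_squared_error_manual(actual, predicted):
    """Average of the squared differences between actual and predicted labels."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

# Hypothetical scaled labels (not taken from the collision dataset)
example_actual = [0.25, 0.5, 0.75]
example_predicted = [0.5, 0.5, 0.5]
example_error = mean_squared_error_manual(example_actual, example_predicted)
print(example_error)
```

A lower value means the predictions sit closer to the actual labels, which is why we adjust the model to minimize it.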

_Python Tools:_
&emsp; We have used several important Python strategies to create this project. We used masking and quantitative manipulation of a dataset in order to discern meaning. The most important was creating an Artificial Neural Network (ANN), which is a learning strategy that attempts to correlate inputs and outputs by using an input layer, a hidden layer, and an output layer. The ANN then learns by updating the weights of the connections between the nodes in each layer, using matrix operations to work out how the error changes with each weight. The end goal is to be able to take an input and use the weights to produce the correct output.

&emsp; This project required importing several Python modules. We utilized many different modules from sklearn in order to manipulate our dataset and implement the neural network. We used pandas to read the csv file and store the collision dataset in a dataframe. We used numpy to manipulate and clean the datasets. We used matplotlib for graphing and displaying figures in a convenient and appealing manner. Finally, we used folium to visualize maps of places like New York City with datapoints and high-quality representations of streets and spatial relationships.

In [32]:
# import necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import folium

### Data

&emsp; We discovered quite a few datasets that have information about traffic volume in different areas. A majority of the available and well-maintained datasets were from New York City. This played a role in our decision to use NYC as a case study for our model and thus focus on only the NYC datasets. The first has information about traffic collisions including their date, time, location in latitude and longitude and street name. The City of New York Open Data website seems like a comprehensive and reliable source and it has accidents as recent as 4/07/2020 (3).  

&emsp;The next source  has data about traffic volume counts on various streets in NYC. It was updated on Feb 7, 2020, so it has pretty recent and accurate data for modern NYC before the lockdown hit. This dataset is also from City of New York Open Data (5). 

### Section 1
&emsp;In the next section of code, we took steps to load in the first dataset, clean it and visualize the data.  After we used pandas to load in the specific columns of the collision dataset we wanted, we displayed the first few rows of the dataset. Then we used the pandas function dropna to eliminate rows of the dataset that had any NaN values for date, time, latitude, longitude, or street name. This was important because we needed all of the information to create labels, maps, and features in the future. Luckily, the dataset was long enough that eliminating certain rows still left us with plenty of data.

&emsp; Then we utilized folium to create a map centered on New York City, at about (40.79° N, 73.97° W). Then, taking a smaller subset of the dataframe, we plotted markers for 200 accidents on the map.

&emsp; It was difficult to tell whether or not this plotting strategy revealed any patterns in the data because there were a lot of markers on the map. When zooming in, however, some roads appeared to have more concentrated accidents. Additionally, adding the time of day to the marker label showed some patterns on streets. For instance, there were clusters in Manhattan that seemed to occur right around rush hour in the morning (8 am) or evening (5 pm).

&emsp; We recognized that it would be important to discern these patterns with more refined strategies, such as a neural network.

In [33]:
# load in specific columns of the collision dataset; this one is from https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/data
collisions = pd.read_csv('nyc_crash_data.csv', usecols = ['CRASH DATE', 'CRASH TIME', 'LATITUDE', 'LONGITUDE', 'ON STREET NAME', 'OFF STREET NAME', 'CROSS STREET NAME'])

In [34]:
#look at the top of the collisions dataset
collisions.head()

Unnamed: 0,CRASH DATE,CRASH TIME,LATITUDE,LONGITUDE,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME
0,09/26/2017,18:18,40.73706,-73.85266,LONG ISLAND EXPRESSWAY,,
1,10/15/2017,2:10,40.853867,-73.93928,,,180 CABRINI BOULEVARD
2,10/04/2017,14:00,40.526604,-74.22892,,,2965 VETERANS ROAD WEST
3,10/17/2017,15:23,40.58571,-73.91276,BELT PARKWAY,,
4,09/26/2017,12:00,40.74485,-73.88448,,,80-50 BAXTER AVENUE


In [35]:
# drop rows with NaN values for certain columns
collisions.dropna(axis = 0, subset = ['CRASH DATE', 'CRASH TIME', 'LATITUDE', 'LONGITUDE', 'ON STREET NAME'], inplace = True)
collisions.index = np.arange(0,len(collisions))
print(collisions.shape)
collisions.head()

(1192049, 7)


Unnamed: 0,CRASH DATE,CRASH TIME,LATITUDE,LONGITUDE,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME
0,09/26/2017,18:18,40.73706,-73.85266,LONG ISLAND EXPRESSWAY,,
1,10/17/2017,15:23,40.58571,-73.91276,BELT PARKWAY,,
2,10/17/2017,20:17,40.665916,-73.92547,EAST NEW YORK AVENUE,BUFFALO AVENUE,
3,10/18/2017,18:30,40.73574,-73.872665,90 STREET,56 AVENUE,
4,10/01/2017,16:30,40.63007,-73.982414,17 AVENUE,48 STREET,


In [36]:
# some basic mapping code
map_nyc = folium.Map(location=[40.79,-73.97])
folium.Marker([40.79,-73.97], popup='Manhattan').add_to(map_nyc)
map_nyc.save("map1.html")
map_nyc

In [37]:
# grab a smaller subset of collisions to plot
select_collisions = collisions.iloc[0:200]

# make a folium map of the Manhattan area
map_nyc = folium.Map(location=[40.79,-73.97])

# put a marker on the map for each collision in the smaller dataset
for i in range(select_collisions.shape[0]):
    long = select_collisions['LONGITUDE'].iloc[i]
    lat = select_collisions['LATITUDE'].iloc[i]
    time = select_collisions['CRASH TIME'].iloc[i]
    folium.Marker([lat, long], popup=time).add_to(map_nyc)
map_nyc.save("map2.html")
map_nyc

### Section 2
&emsp; In this section of code, we added and manipulated columns in the dataset in order to prepare for the Neural Network. We converted the crash date to a day of the week by using the pandas function to_datetime(), and we kept only the year from the date string. Then we replaced the crash time with categories: Morning was 6am-12pm and designated 1; Afternoon was 12pm-6pm and designated 2; Evening was 6pm-12am and designated 3; Night was 12am-6am and designated 4 (the numeric codes are assigned later, in Section 4). We felt that this categorization was the best way to break up the day into meaningful sections where traffic patterns would be similar. This also allowed us to have consistency with the traffic volume dataset.
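&emsp; The time-of-day categories above can be sketched as a small helper. This is only an illustration under the assumption that every crash time is a string like `18:18` or `2:10`; the name `part_of_day` is hypothetical and does not appear in our notebook, which builds the categories with a loop instead.

```python
def part_of_day(crash_time):
    """Map an 'H:MM' or 'HH:MM' string to NIGHT, MORNING, AFTERNOON, or EVENING."""
    hour = int(crash_time.split(':')[0])
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {crash_time!r}")
    # Each part of the day covers six hours, starting with NIGHT at midnight
    return ('NIGHT', 'MORNING', 'AFTERNOON', 'EVENING')[hour // 6]

# Sample times in the same format as the CRASH TIME column
example_parts = [part_of_day(t) for t in ['18:18', '2:10', '14:00', '12:00']]
print(example_parts)
```

With pandas, a helper like this could be applied to a whole column through `Series.map`, which avoids writing the index loop by hand.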

In [38]:
# Don't run this cell more than once!! Takes forever to load 
# convert crash date to day of week
date_time_values = pd.to_datetime(collisions['CRASH DATE'])
weekdays_column = date_time_values.dt.weekday

In [39]:
# add the column to the dataset
collisions['WEEKDAY'] = weekdays_column

In [40]:
# Monday = 0 
# Tuesday = 1
# ...
# Sunday = 6
collisions.head()

Unnamed: 0,CRASH DATE,CRASH TIME,LATITUDE,LONGITUDE,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,WEEKDAY
0,09/26/2017,18:18,40.73706,-73.85266,LONG ISLAND EXPRESSWAY,,,1
1,10/17/2017,15:23,40.58571,-73.91276,BELT PARKWAY,,,1
2,10/17/2017,20:17,40.665916,-73.92547,EAST NEW YORK AVENUE,BUFFALO AVENUE,,1
3,10/18/2017,18:30,40.73574,-73.872665,90 STREET,56 AVENUE,,2
4,10/01/2017,16:30,40.63007,-73.982414,17 AVENUE,48 STREET,,6


In [41]:
# Adding "Part of Day" column to tie it to the volume dataset

time_strings = collisions['CRASH TIME']
time_strings.index = np.arange(0,len(time_strings))

    
time_strings = list(time_strings)
time_strings = [i.replace(':','') for i in time_strings]
time_strings = [int(i[:len(i)-2]) for i in time_strings]
for i in range(len(time_strings)):
    if time_strings[i] in range(0,6):
        time_strings[i] = 'NIGHT'
        
    elif time_strings[i] in range(6,12):
        time_strings[i] = 'MORNING'

    elif time_strings[i] in range(12,18):
        time_strings[i] = 'AFTERNOON'
        
    elif time_strings[i] in range(18,24):
        time_strings[i] = 'EVENING'
        
time_strings = pd.Series(time_strings)
collisions['CRASH TIME'] = time_strings
collisions.head()

Unnamed: 0,CRASH DATE,CRASH TIME,LATITUDE,LONGITUDE,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,WEEKDAY
0,09/26/2017,EVENING,40.73706,-73.85266,LONG ISLAND EXPRESSWAY,,,1
1,10/17/2017,AFTERNOON,40.58571,-73.91276,BELT PARKWAY,,,1
2,10/17/2017,EVENING,40.665916,-73.92547,EAST NEW YORK AVENUE,BUFFALO AVENUE,,1
3,10/18/2017,EVENING,40.73574,-73.872665,90 STREET,56 AVENUE,,2
4,10/01/2017,AFTERNOON,40.63007,-73.982414,17 AVENUE,48 STREET,,6


In [42]:
collisions['CRASH DATE'] = collisions['CRASH DATE'].str[-4:]
collisions.head()

Unnamed: 0,CRASH DATE,CRASH TIME,LATITUDE,LONGITUDE,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,WEEKDAY
0,2017,EVENING,40.73706,-73.85266,LONG ISLAND EXPRESSWAY,,,1
1,2017,AFTERNOON,40.58571,-73.91276,BELT PARKWAY,,,1
2,2017,EVENING,40.665916,-73.92547,EAST NEW YORK AVENUE,BUFFALO AVENUE,,1
3,2017,EVENING,40.73574,-73.872665,90 STREET,56 AVENUE,,2
4,2017,AFTERNOON,40.63007,-73.982414,17 AVENUE,48 STREET,,6


### Section 3: Working with the Traffic Flow Data
&emsp; In the next section, we loaded in the data on traffic volume counts using pandas and numpy and cleaned it to make it easier to use. We did this by adding 4 columns for specific times of day: morning, afternoon, evening, and night. Then we summed the hourly traffic counts for the relevant times of day and updated the columns. This makes it more convenient to use a neural network model for each collision, as we will have matching time categories for the time of collision and the volume of traffic there. We also ensured that the time classification matched that of the time of day of the collision in the previous dataset.

&emsp;Then, we manipulated the volume dataset to be more usable by summing the rows that had the same street, weekday, and year, and by keeping only the relevant rows and columns in the grouped dataframe. We added a weekday column in the same manner as for the collision dataset and made the road names all upper case for consistency.
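&emsp; The hourly-to-time-of-day summation described above can also be sketched with a numpy reshape. This is an illustration only: it assumes the 24 hourly columns are ordered from 12:00-1:00 AM through 11:00 PM-12:00 AM, and the function name `sum_by_part_of_day` is hypothetical rather than something used in our notebook.

```python
import numpy as np

def sum_by_part_of_day(hourly_counts):
    """Collapse (n_rows, 24) hourly counts into (n_rows, 4) six-hour totals.

    The result columns are NIGHT, MORNING, AFTERNOON, EVENING, assuming the
    input columns run from midnight to midnight.
    """
    # Treat missing counts as zero, similar to fillna(0) on the dataframe
    hourly = np.nan_to_num(np.asarray(hourly_counts, dtype=float))
    if hourly.ndim != 2 or hourly.shape[1] != 24:
        raise ValueError("expected an array of shape (n_rows, 24)")
    # Group the 24 hours into 4 blocks of 6 and sum within each block
    return hourly.reshape(-1, 4, 6).sum(axis=2)

# One made-up row whose hourly counts are simply 0, 1, ..., 23
example_totals = sum_by_part_of_day([np.arange(24)])
print(example_totals)
```

Grouping by shape rather than by hard-coded column positions keeps the four totals tied to the six-hour blocks even if extra columns are added in front of the hourly counts.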

In [43]:
#load in the traffic volume counts and merge the dataframes to gain flow insights
# https://data.cityofnewyork.us/Transportation/Traffic-Volume-Counts-2014-2018-/ertz-hr4r
volume = pd.read_csv('Traffic_Volume_Counts__2014-2018_.csv')

#Adding columns to sum the flow values for each time of day
volume["MORNING"] = np.zeros_like(volume['6:00-7:00PM'])
volume["AFTERNOON"] = np.zeros_like(volume['6:00-7:00PM'])
volume["EVENING"] = np.zeros_like(volume['6:00-7:00PM'])
volume["NIGHT"] = pd.Series(volume['6:00-7:00PM'])

volume.iloc[:,7:31] = volume.iloc[:,7:31].fillna(value = 0)
volume["NIGHT"] = volume.iloc[:,7] + volume.iloc[:,8] + volume.iloc[:,9] + volume.iloc[:,10] + volume.iloc[:,11] + volume.iloc[:,12]
volume["MORNING"] = volume.iloc[:,13] + volume.iloc[:,14] + volume.iloc[:,15] + volume.iloc[:,16] + volume.iloc[:,17] + volume.iloc[:,18] 
volume["AFTERNOON"] = volume.iloc[:,19] + volume.iloc[:,20] + volume.iloc[:,21] + volume.iloc[:,22] + volume.iloc[:,23] + volume.iloc[:,24] 
volume["EVENING"] = volume.iloc[:,25] + volume.iloc[:,26] + volume.iloc[:,27] + volume.iloc[:,28] + volume.iloc[:,29] + volume.iloc[:,30] 

volume = volume.drop(['12:00-1:00 AM', '1:00-2:00AM', '2:00-3:00AM', '3:00-4:00AM',
       '4:00-5:00AM', '5:00-6:00AM', '6:00-7:00AM', '7:00-8:00AM',
       '8:00-9:00AM', '9:00-10:00AM', '10:00-11:00AM', '11:00-12:00PM',
       '12:00-1:00PM', '1:00-2:00PM', '2:00-3:00PM', '3:00-4:00PM',
       '4:00-5:00PM', '5:00-6:00PM', '6:00-7:00PM', '7:00-8:00PM',
       '8:00-9:00PM', '9:00-10:00PM', '10:00-11:00PM', '11:00-12:00AM'], axis = 1)

volume.head()


Unnamed: 0,ID,Segment ID,Roadway Name,From,To,Direction,Date,MORNING,AFTERNOON,EVENING,NIGHT
0,2,70376,3 Avenue,East 154 Street,East 155 Street,NB,09/13/2014,1521.0,3269,2392,915.0
1,2,70376,3 Avenue,East 155 Street,East 154 Street,SB,09/13/2014,1513.0,2263,1710,725.0
2,56,176365,Bedford Park Boulevard,Grand Concourse,Valentine Avenue,EB,09/13/2014,984.0,1529,1035,430.0
3,56,176365,Bedford Park Boulevard,Grand Concourse,Valentine Avenue,WB,09/13/2014,1065.0,1478,1023,437.0
4,62,147673,Broadway,West 242 Street,240 Street,SB,09/13/2014,3008.0,4487,2957,1025.0


In [44]:
# Standardize street names so that they're upper case
volume['Roadway Name'] = volume['Roadway Name'].str.upper() 

# convert date to day of the week in a new column
date_time_values = pd.to_datetime(volume['Date'])
weekdays_column = date_time_values.dt.weekday
volume['WEEKDAY'] = weekdays_column
volume.head()

Unnamed: 0,ID,Segment ID,Roadway Name,From,To,Direction,Date,MORNING,AFTERNOON,EVENING,NIGHT,WEEKDAY
0,2,70376,3 AVENUE,East 154 Street,East 155 Street,NB,09/13/2014,1521.0,3269,2392,915.0,5
1,2,70376,3 AVENUE,East 155 Street,East 154 Street,SB,09/13/2014,1513.0,2263,1710,725.0,5
2,56,176365,BEDFORD PARK BOULEVARD,Grand Concourse,Valentine Avenue,EB,09/13/2014,984.0,1529,1035,430.0,5
3,56,176365,BEDFORD PARK BOULEVARD,Grand Concourse,Valentine Avenue,WB,09/13/2014,1065.0,1478,1023,437.0,5
4,62,147673,BROADWAY,West 242 Street,240 Street,SB,09/13/2014,3008.0,4487,2957,1025.0,5


In [45]:
# Use a substring to keep just the year from the date
volume['Date'] = volume['Date'].str[-4:]

# sum traffic volume counts for rows with the same name, year, and weekday
# (a list of column names avoids the pandas warning about tuple-style selection)
volume = volume.groupby(['Roadway Name', 'Date', 'WEEKDAY'])[['MORNING', 'AFTERNOON', 'EVENING', 'NIGHT']].sum().reset_index()
volume.head(10)


Unnamed: 0,Roadway Name,Date,WEEKDAY,MORNING,AFTERNOON,EVENING,NIGHT
0,1 AVENUE,2015,0,10247.0,10182,10299,4123.0
1,1 AVENUE,2015,1,10586.0,10694,11752,4733.0
2,1 AVENUE,2015,2,10682.0,11748,13024,5668.0
3,1 AVENUE,2015,3,10876.0,11413,13358,6086.0
4,1 AVENUE,2015,4,10476.0,12150,13563,7438.0
5,1 AVENUE,2015,5,18126.0,22644,25008,15622.0
6,1 AVENUE,2015,6,15302.0,22128,21422,16411.0
7,1 AVENUE,2016,0,30703.0,34968,28886,9568.0
8,1 AVENUE,2016,1,29717.0,34426,30774,12070.0
9,1 AVENUE,2016,2,32748.0,34481,33617,12572.0


### Section 4: Preparing Training and Testing Data
&emsp;In the next section we finished preparing the data for the Neural Network. First we imported the function train_test_split from sklearn in order to split our dataframe into training and testing data. Due to the extremely large size of our dataset and runtime constraints, we started with small train and test sizes (0.5% and 0.1% of the rows). Once we split into training and testing data, we further split those into labels and vectors. Then, for each of the training and testing sets, we calculated a column of traffic flows by matching each collision's street and year against the volume data and reading off the flow for its time of day. We dropped rows from the data that did not have a match and then ensured lengths of vectors and labels were compatible to successfully complete our dataset merge.
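&emsp; As a hedged alternative to looping over rows, the same kind of lookup could be sketched with a pandas merge. This is not the code we ran: the function name `attach_traffic_flow`, the temporary `STREET KEY` column, and the tiny example frames are all hypothetical. Like the loop in this section, the sketch matches on street and year and keeps the first matching row, so the weekday is not part of the match.

```python
import pandas as pd

PARTS_OF_DAY = ['MORNING', 'AFTERNOON', 'EVENING', 'NIGHT']

def attach_traffic_flow(crashes, volume):
    """Return crashes with a TRAFFIC FLOW column looked up by street, year, and part of day."""
    # Keep the first row per street and year (assumed choice; WEEKDAY is ignored)
    first_rows = volume.drop_duplicates(subset=['Roadway Name', 'Date'])
    # Reshape so that each (street, year, part of day) combination is one row
    long_volume = first_rows.melt(
        id_vars=['Roadway Name', 'Date'],
        value_vars=PARTS_OF_DAY,
        var_name='CRASH TIME',
        value_name='TRAFFIC FLOW',
    ).rename(columns={'Roadway Name': 'STREET KEY', 'Date': 'CRASH DATE'})
    # Strip trailing spaces from street names before matching
    keyed = crashes.assign(**{'STREET KEY': crashes['ON STREET NAME'].str.rstrip()})
    merged = keyed.merge(long_volume, how='left',
                         on=['STREET KEY', 'CRASH DATE', 'CRASH TIME'])
    return merged.drop(columns='STREET KEY')

# Made-up frames with the same column names as the notebook's dataframes
example_crashes = pd.DataFrame({
    'ON STREET NAME': ['BROADWAY   ', 'UNKNOWN ROAD'],
    'CRASH DATE': ['2015', '2015'],
    'CRASH TIME': ['MORNING', 'NIGHT'],
})
example_volume = pd.DataFrame({
    'Roadway Name': ['BROADWAY', 'BROADWAY'],
    'Date': ['2015', '2015'],
    'WEEKDAY': [0, 1],
    'MORNING': [100, 200], 'AFTERNOON': [300, 400],
    'EVENING': [500, 600], 'NIGHT': [50, 60],
})
example_result = attach_traffic_flow(example_crashes, example_volume)
print(example_result)
```

Rows without a match end up with NaN in `TRAFFIC FLOW`, so they can be removed afterwards with `dropna`, just as in the cells of this section.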

In [46]:
# Neural Network data preparation
# import the sklearn function used to split the data into training and testing sets
from sklearn.model_selection import train_test_split

# split whole dataframe into training and testing data
train, test= train_test_split(collisions, test_size = .001, train_size = .005)

# split into features and labels by dropping columns
train_labels = train.drop(['CRASH DATE', 'LATITUDE', 'LONGITUDE', 'ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME'], axis = 1)
test_labels = test.drop(['CRASH DATE', 'LATITUDE', 'LONGITUDE', 'ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME'], axis = 1)
train_vectors = train.drop(['WEEKDAY', 'CRASH TIME', 'CROSS STREET NAME', 'OFF STREET NAME'], axis = 1)
test_vectors = test.drop(['WEEKDAY', 'CRASH TIME', 'CROSS STREET NAME', 'OFF STREET NAME'], axis = 1)

In [47]:
print(train_vectors.shape, train_labels.shape, test_vectors.shape, test_labels.shape)

(5960, 4) (5960, 2) (1193, 4) (1193, 2)


In [48]:
# calculate a column of time-specific traffic flow for the collisions in train_vectors
flows = []
# loop through the training data and see if we have a matching street and year with traffic flow info
# (the first matching row is used, so the weekday is not part of the match)
for i in range(train_vectors.shape[0]):
    road = train_vectors['ON STREET NAME'].iloc[i]
    matches = volume[volume['Roadway Name']==road.rstrip()]
    if len(matches) !=0:
        year = train_vectors['CRASH DATE'].iloc[i]
        match = matches[matches['Date']==year.rstrip()]
        if len(match)!= 0:
            time = train_labels['CRASH TIME'].iloc[i]
            flows.append(match[time].iloc[0])
        else:
            flows.append(np.nan)
    else:
        flows.append(np.nan)

In [49]:
# add the column to vectors and labels to keep lengths consistent
train_vectors['TRAFFIC FLOW'] = flows
train_labels['TRAFFIC FLOW'] = flows

# will drop na values from traffic flow column in both vectors and labels
train_vectors.dropna(axis = 0, subset = ['TRAFFIC FLOW'], inplace = True)
train_labels.dropna(axis = 0, subset = ['TRAFFIC FLOW'], inplace = True)

# drop irrelevant columns for the labels and vectors
train_labels.drop(['TRAFFIC FLOW'], axis = 1, inplace = True)
train_vectors.drop(['ON STREET NAME'], axis = 1, inplace = True)
train_vectors.head()

Unnamed: 0,CRASH DATE,LATITUDE,LONGITUDE,TRAFFIC FLOW
814175,2014,40.67736,-73.886861,10678.0
970512,2014,40.732281,-73.851822,4652.0
871117,2014,40.680315,-73.842019,49800.0
947853,2014,40.681448,-73.946438,3750.0
91212,2018,40.752853,-73.99299,17074.0


In [50]:
# calculate a column of time-specific traffic flow for the collisions in test_vectors
flows = []
# loop through the testing data and see if we have a matching street and year with traffic flow info
# (the first matching row is used, so the weekday is not part of the match)
for i in range(test_vectors.shape[0]):
    road = test_vectors['ON STREET NAME'].iloc[i]
    matches = volume[volume['Roadway Name']==road.rstrip()]
    if len(matches) !=0:
        year = test_vectors['CRASH DATE'].iloc[i]
        match = matches[matches['Date']==year.rstrip()]
        if len(match)!= 0:
            time = test_labels['CRASH TIME'].iloc[i]
            flows.append(match[time].iloc[0])
        else:
            flows.append(np.nan)
    else:
        flows.append(np.nan)

In [51]:
# add the column to vectors and labels to keep lengths consistent
test_vectors['TRAFFIC FLOW'] = flows
test_labels['TRAFFIC FLOW'] = flows

# will drop na values from traffic flow column in both vectors and labels
test_vectors.dropna(axis = 0, subset = ['TRAFFIC FLOW'], inplace = True)
test_labels.dropna(axis = 0, subset = ['TRAFFIC FLOW'], inplace = True)

# drop irrelevant columns for the labels and vectors
test_labels.drop(['TRAFFIC FLOW'], axis = 1, inplace = True)
test_vectors.drop(['ON STREET NAME'], axis = 1, inplace = True)
test_vectors.head()

Unnamed: 0,CRASH DATE,LATITUDE,LONGITUDE,TRAFFIC FLOW
1013635,2014,40.683514,-73.975951,10546.0
769944,2015,40.756266,-73.823514,5016.0
90396,2017,40.680088,-73.94398,5932.0
578039,2017,40.836555,-73.94306,29899.0
391134,2017,40.713554,-73.98086,3158.0


&emsp;Next, we made sure all the values in our dataset were numeric so that the Neural Network could perform mathematical operations such as matrix dot products. We converted the categorical data to numeric codes, used the pandas astype function to make everything floats, and then converted the pandas dataframes to numpy arrays.
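&emsp; The category-to-number conversion can be sketched with a plain dictionary, along with a helper for reading a scaled prediction back as a category. Both function names are hypothetical, and the decoding step assumes the codes are divided by their maximum of 4 before training; this is a sketch under those assumptions rather than the code we ran.

```python
import numpy as np

PART_OF_DAY_CODES = {'MORNING': 1, 'AFTERNOON': 2, 'EVENING': 3, 'NIGHT': 4}
CODE_TO_PART_OF_DAY = {code: name for name, code in PART_OF_DAY_CODES.items()}

def encode_parts_of_day(labels):
    """Turn part-of-day names into float codes 1.0 to 4.0."""
    return np.array([PART_OF_DAY_CODES[label] for label in labels], dtype=float)

def decode_scaled_prediction(value, max_code=4):
    """Map a value in [0, 1] back to the nearest part-of-day name."""
    # Undo the assumed division by max_code, round, and clamp to a valid code
    code = int(np.clip(np.rint(value * max_code), 1, max_code))
    return CODE_TO_PART_OF_DAY[code]

example_codes = encode_parts_of_day(['EVENING', 'MORNING'])
example_name = decode_scaled_prediction(0.75)
print(example_codes, example_name)
```

Decoding like this makes it easier to compare a network output with the original category names instead of only looking at the raw numbers.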

In [52]:
# change time of day to integer representation for train_labels
train_labels.loc[train_labels['CRASH TIME']=='MORNING', 'CRASH TIME'] = 1
train_labels.loc[train_labels['CRASH TIME']=='AFTERNOON', 'CRASH TIME'] = 2
train_labels.loc[train_labels['CRASH TIME']=='EVENING', 'CRASH TIME'] = 3
train_labels.loc[train_labels['CRASH TIME']=='NIGHT', 'CRASH TIME'] = 4
train_labels.head(5)

Unnamed: 0,CRASH TIME,WEEKDAY
814175,3,4
970512,2,3
871117,2,2
947853,1,4
91212,3,1


In [53]:
# change time of day to integer representation for test_labels
test_labels.loc[test_labels['CRASH TIME']=='MORNING', 'CRASH TIME'] = 1
test_labels.loc[test_labels['CRASH TIME']=='AFTERNOON', 'CRASH TIME'] = 2
test_labels.loc[test_labels['CRASH TIME']=='EVENING', 'CRASH TIME'] = 3
test_labels.loc[test_labels['CRASH TIME']=='NIGHT', 'CRASH TIME'] = 4
train_labels.head()

Unnamed: 0,CRASH TIME,WEEKDAY
814175,3,4
970512,2,3
871117,2,2
947853,1,4
91212,3,1


In [54]:
# convert the data to all floats in numpy arrays to prepare for the neural network
train_vectors = train_vectors.astype(float)
test_vectors = test_vectors.astype(float)
train_labels = train_labels.astype(float)
test_labels = test_labels.astype(float)
train_vectors = train_vectors.to_numpy()
test_vectors = test_vectors.to_numpy()
train_labels = train_labels.to_numpy()
test_labels = test_labels.to_numpy()
print(train_vectors.shape, train_labels.shape, test_vectors.shape, test_labels.shape)

(961, 4) (961, 2) (200, 4) (200, 2)


&emsp;We then scaled the training and testing data so that every column was on a comparable scale, dividing each column by its maximum (or by its minimum for the longitude column, whose entries are all negative).
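&emsp; A common variation on this kind of scaling is to compute the scaling values from the training data only and reuse them for the testing data, so both sets are scaled identically. The sketch below shows that idea with min-max scaling; it is a different technique from the divide-by-max approach in our cells, and the function names and sample numbers are hypothetical.

```python
import numpy as np

def fit_min_max(train_features):
    """Return the per-column minimum and maximum of the training features."""
    train_features = np.asarray(train_features, dtype=float)
    return train_features.min(axis=0), train_features.max(axis=0)

def apply_min_max(features, col_min, col_max):
    """Scale features with previously fitted column minima and maxima."""
    features = np.asarray(features, dtype=float)
    # Use a span of 1 for constant columns to avoid dividing by zero
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (features - col_min) / span

# Made-up year and longitude columns
example_train = np.array([[2014.0, -74.0], [2018.0, -73.8]])
example_test = np.array([[2016.0, -73.9]])
col_min, col_max = fit_min_max(example_train)
example_scaled = apply_min_max(example_test, col_min, col_max)
print(example_scaled)
```

Because the minimum is subtracted first, negative columns such as longitude need no special handling in this version.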

In [55]:
# normalize the training data by scaling
train_vectors[:,0]/= np.max(train_vectors[:,0])
train_vectors[:,1]/= np.max(train_vectors[:,1])
train_vectors[:,2]/= np.min(train_vectors[:,2])
train_vectors[:,3]/= np.max(train_vectors[:,3])
maxlabel1 = np.max(train_labels[:,0])
maxlabel2 = np.max(train_labels[:,1])
train_labels[:,0]/= maxlabel1
train_labels[:,1]/= maxlabel2
train_labels

array([[0.75      , 0.66666667],
       [0.5       , 0.5       ],
       [0.5       , 0.33333333],
       ...,
       [0.75      , 0.66666667],
       [0.25      , 0.5       ],
       [1.        , 0.66666667]])

In [56]:
#normalize the testing data by scaling
test_vectors[:,0]/= np.max(test_vectors[:,0])
test_vectors[:,1]/= np.max(test_vectors[:,1])
test_vectors[:,2]/= np.min(test_vectors[:,2])
test_vectors[:,3]/= np.max(test_vectors[:,3])
maxlabel1 = np.max(test_labels[:,0])
maxlabel2 = np.max(test_labels[:,1])
test_labels[:,0]/= maxlabel1
test_labels[:,1]/= maxlabel2
test_labels

array([[0.25      , 0.66666667],
       [0.75      , 0.        ],
       [0.5       , 0.33333333],
       [0.25      , 1.        ],
       [0.5       , 0.        ],
       [0.5       , 0.16666667],
       [0.25      , 0.16666667],
       [0.5       , 0.33333333],
       [0.5       , 0.5       ],
       [0.5       , 0.33333333],
       [0.75      , 0.        ],
       [0.75      , 0.83333333],
       [0.5       , 1.        ],
       [0.5       , 0.        ],
       [0.5       , 0.33333333],
       [0.25      , 0.83333333],
       [0.25      , 0.66666667],
       [0.5       , 0.66666667],
       [0.5       , 1.        ],
       [0.25      , 0.66666667],
       [0.5       , 0.33333333],
       [0.5       , 0.33333333],
       [0.5       , 0.66666667],
       [0.5       , 0.16666667],
       [0.5       , 0.33333333],
       [0.75      , 0.66666667],
       [0.25      , 0.83333333],
       [0.75      , 1.        ],
       [0.75      , 0.16666667],
       [0.25      , 0.33333333],
       [1.

### Section 5: Training and Testing the Neural Network
&emsp; In this part of the code, we adapted a neural network and trainer class from the GitHub project Neural Networks Demystified by Stephen Welch. This was important because it abstracted the complex mathematical operations going on behind the scenes, and allowed us to focus on training, testing, and debugging the actual instances of the Neural Network and trainer.
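&emsp; One way to sanity-check the mathematics behind a network like this is a finite-difference gradient check, which compares the backpropagation gradient with a numerical estimate. The block below is a standalone sketch for a small sigmoid network with 4 inputs, 2 hidden nodes, and 1 output, using random made-up data; the function names are our own illustrations and are not the adapted classes themselves.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(params, n_in, n_hidden):
    """Split a flat parameter vector into the two weight matrices."""
    W1 = params[:n_in * n_hidden].reshape(n_in, n_hidden)
    W2 = params[n_in * n_hidden:].reshape(n_hidden, 1)
    return W1, W2

def cost(params, X, y, n_in, n_hidden):
    """Half the sum of squared errors for the two-layer sigmoid network."""
    W1, W2 = unpack(params, n_in, n_hidden)
    y_hat = sigmoid(sigmoid(X @ W1) @ W2)
    return 0.5 * np.sum((y - y_hat) ** 2)

def analytic_gradient(params, X, y, n_in, n_hidden):
    """Backpropagation gradient of the cost with respect to all weights."""
    W1, W2 = unpack(params, n_in, n_hidden)
    a2 = sigmoid(X @ W1)
    y_hat = sigmoid(a2 @ W2)
    delta3 = -(y - y_hat) * y_hat * (1 - y_hat)   # sigmoid'(z) = s * (1 - s)
    dJdW2 = a2.T @ delta3
    delta2 = (delta3 @ W2.T) * a2 * (1 - a2)
    dJdW1 = X.T @ delta2
    return np.concatenate((dJdW1.ravel(), dJdW2.ravel()))

def numerical_gradient(params, X, y, n_in, n_hidden, eps=1e-4):
    """Central-difference estimate of the same gradient."""
    grad = np.zeros_like(params)
    for p in range(params.size):
        step = np.zeros_like(params)
        step[p] = eps
        grad[p] = (cost(params + step, X, y, n_in, n_hidden)
                   - cost(params - step, X, y, n_in, n_hidden)) / (2 * eps)
    return grad

# Random made-up inputs and labels, only for checking the gradients
rng = np.random.default_rng(0)
X_check = rng.random((5, 4))
y_check = rng.random((5, 1))
params_check = rng.standard_normal(4 * 2 + 2)

analytic = analytic_gradient(params_check, X_check, y_check, 4, 2)
numeric = numerical_gradient(params_check, X_check, y_check, 4, 2)
relative_difference = np.linalg.norm(analytic - numeric) / np.linalg.norm(analytic + numeric)
print(relative_difference)
```

A very small relative difference suggests that the backpropagation formulas and the cost function agree with each other.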

In [57]:
# %load partSix.py
# Adapted from Neural Networks Demystified
#
# Stephen Welch
# @stephencwelch

## ----------------------- Part 5 ---------------------------- ##

class Neural_Network(object):
    def __init__(self, layer_i = 4, layer_o = 1, layer_h = 2):        
        #Define Hyperparameters
        self.inputLayerSize = layer_i
        self.outputLayerSize = layer_o
        self.hiddenLayerSize = layer_h
        
        #Weights (parameters)
        self.W1 = np.random.randn(self.inputLayerSize,self.hiddenLayerSize)
        self.W2 = np.random.randn(self.hiddenLayerSize,self.outputLayerSize)
        
    def forward(self, X):
        #Propagate inputs through network
        self.z2 = np.dot(X, self.W1)
        self.a2 = self.sigmoid(self.z2)
        self.z3 = np.dot(self.a2, self.W2)
        yHat = self.sigmoid(self.z3) 
        return yHat
        
    def sigmoid(self, z):
        #Apply sigmoid activation function to scalar, vector, or matrix
        return 1/(1+np.exp(-z))
    
    def sigmoidPrime(self,z):
        #Gradient of sigmoid
        return np.exp(-z)/((1+np.exp(-z))**2)
    
    def costFunction(self, X, y):
        #Compute cost for given X,y, use weights already stored in class.
        self.yHat = self.forward(X)
        J = 0.5*np.sum((y-self.yHat)**2)
        return J
        
    def costFunctionPrime(self, X, y):
        #Compute derivative with respect to W1 and W2 for a given X and y:
        self.yHat = self.forward(X)
        
        delta3 = np.multiply(-(y-self.yHat), self.sigmoidPrime(self.z3))
        dJdW2 = np.dot(self.a2.T, delta3)
        
        delta2 = np.dot(delta3, self.W2.T)*self.sigmoidPrime(self.z2)
        dJdW1 = np.dot(X.T, delta2)  
        
        return dJdW1, dJdW2
    
    #Helper Functions for interacting with other classes:
    def getParams(self):
        #Get W1 and W2 unrolled into vector:
        params = np.concatenate((self.W1.ravel(), self.W2.ravel()))
        return params
    
    def setParams(self, params):
        #Set W1 and W2 using single parameter vector.
        W1_start = 0
        W1_end = self.hiddenLayerSize * self.inputLayerSize
        self.W1 = np.reshape(params[W1_start:W1_end], (self.inputLayerSize , self.hiddenLayerSize))
        W2_end = W1_end + self.hiddenLayerSize*self.outputLayerSize
        self.W2 = np.reshape(params[W1_end:W2_end], (self.hiddenLayerSize, self.outputLayerSize))
        
    def computeGradients(self, X, y):
        dJdW1, dJdW2 = self.costFunctionPrime(X, y)
        return np.concatenate((dJdW1.ravel(), dJdW2.ravel()))

def computeNumericalGradient(N, X, y):
        paramsInitial = N.getParams()
        numgrad = np.zeros(paramsInitial.shape)
        perturb = np.zeros(paramsInitial.shape)
        e = 1e-4

        for p in range(len(paramsInitial)):
            #Set perturbation vector
            perturb[p] = e
            N.setParams(paramsInitial + perturb)
            loss2 = N.costFunction(X, y)
            N.setParams(paramsInitial - perturb)
            loss1 = N.costFunction(X, y)
            #Compute Numerical Gradient
            numgrad[p] = (loss2 - loss1) / (2*e)
            #Return the value we changed to zero:
            perturb[p] = 0
            
        #Return Params to original value:
        N.setParams(paramsInitial)

        return numgrad 
        
## ----------------------- Part 6 ---------------------------- ##
from scipy import optimize

class trainer(object):
    def __init__(self, N):
        self.N = N
        
    def callbackF(self, params):
        self.N.setParams(params)
        self.J.append(self.N.costFunction(self.X, self.y))   
        
    def costFunctionWrapper(self, params, X, y):
        self.N.setParams(params)
        cost = self.N.costFunction(X, y)
        grad = self.N.computeGradients(X,y)
        return cost, grad
        
    def train(self, X, y):
        #Make an internal variable for the callback function:
        self.X = X
        self.y = y
        #Make empty list to store costs:
        self.J = []
        params0 = self.N.getParams()
        options = {'maxiter': 200, 'disp' : True}
        _res = optimize.minimize(self.costFunctionWrapper, params0, jac=True, method='BFGS', \
                                 args=(X, y), options=options, callback=self.callbackF)

        self.N.setParams(_res.x)
        self.optimizationResults = _res


&emsp;We then instantiated and trained two neural networks. One was trained to predict the time of day of the accident and the other was trained to predict the weekday, both based on location, traffic flow, and year.  After training each, we tested the models using the testing vectors and comparing the predictions to the actual labels.

In [58]:
# code to actually run the neural network for predicting the time of day of accident

# get just the time labels from training data
time_label = train_labels[:,0]
time_label = time_label.reshape(time_label.shape[0], 1)

# make an instance of the neural network and trainer classes
NN = Neural_Network(4, 1, 2)
T = trainer(NN)

#input the train vectors and labels into the trainer and then the neural net to make predictions
T.train(train_vectors,time_label)
y1 = NN.forward(train_vectors)
print("Trained Output",y1)

         Current function value: 23.193660
         Iterations: 200
         Function evaluations: 256
         Gradient evaluations: 256
Trained Output [[0.50047228]
 [0.52389416]
 [0.5       ]
 [0.53316961]
 [0.50000887]
 [0.51601535]
 [0.80740866]
 [0.5107351 ]
 [0.51841673]
 [0.50039331]
 [0.50090418]
 [0.5       ]
 [0.52411129]
 [0.50019109]
 [0.50534936]
 [0.7519886 ]
 [0.50000036]
 [0.50000035]
 [0.50372113]
 [0.50144675]
 [0.50957854]
 [0.50032606]
 [0.58203872]
 [0.55400865]
 [0.50206665]
 [0.50899355]
 [0.50079147]
 [0.50000106]
 [0.71716054]
 [0.60680467]
 [0.50036915]
 [0.5191974 ]
 [0.50001774]
 [0.50232425]
 [0.50026805]
 [0.50002957]
 [0.50854192]
 [0.5       ]
 [0.59382015]
 [0.50015746]
 [0.628273  ]
 [0.50000006]
 [0.50214871]
 [0.7979775 ]
 [0.64826473]
 [0.5       ]
 [0.6151963 ]
 [0.50021428]
 [0.5       ]
 [0.5000031 ]
 [0.50019664]
 [0.50000228]
 [0.5       ]
 [0.5375148 ]
 [0.50000043]
 [0.50002519]
 [0.5       ]
 [0.51725562]
 [0.50000078]
 [0.56688746]
 [0.507

In [59]:
#test the ANN on training data for time using mean squared error
from sklearn.metrics import mean_squared_error as mse
print('Mean Squared error for predicting the time of day', mse(time_label, y1))

Mean Squared error for predicting the time of day 0.048269843588134055


In [60]:
# code to actually run the neural network for predicting the day of week of accident

# get just the day labels from training data
day_label = train_labels[:,1]
day_label = day_label.reshape(day_label.shape[0], 1)

# make an instance of the neural network and trainer classes
NN2 = Neural_Network(4, 1, 2)
T2 = trainer(NN2)

#input the train vectors and labels into the trainer and then the neural net to make predictions
T2.train(train_vectors,day_label)
y1 = NN2.forward(train_vectors)

#test the ANN on training data for weekday using mean squared error
print('Mean Squared error for predicting the day of the week', mse(day_label, y1))

         Current function value: 48.592460
         Iterations: 200
         Function evaluations: 226
         Gradient evaluations: 226
Mean Squared error for predicting the day of the week 0.1011289483940624


In [61]:
# test the first neural network using testing data
test_time = test_labels[:,0]
test_time = test_time.reshape(test_time.shape[0], 1)

# make predictions but do NOT train the network
pred_labels = NN.forward(test_vectors)

# evaluate the ANN on testing data for time of day using mean squared error
print(mse(test_time, pred_labels))
print("Testing Data error for predicting time of day in test data", np.sum((test_time - pred_labels)*(test_time-pred_labels))/len(test_vectors))

0.05613933361967641
Testing Data error for predicting time of day in test data 0.05613933361967641


In [62]:
# test the second neural network using testing data
test_day = test_labels[:,1]
test_day = test_day.reshape(test_day.shape[0], 1)

# make predictions but do NOT train the network
pred_labels = NN2.forward(test_vectors)

# evaluate the ANN on testing data for weekday using mean squared error
print(mse(test_day, pred_labels))
print("Testing Data error for predicting day of week in test data", np.sum((test_day - pred_labels)*(test_day-pred_labels))/len(test_vectors))

0.09575353702088286
Testing Data error for predicting day of week in test data 0.09575353702088286


## Results

&emsp; Our model matched the data reasonably well. For the neural network predicting time of day, the mean squared error was 0.048 on the training data and 0.056 on the testing data. Time of day had four options, normalized to 0.25, 0.5, 0.75, and 1, so this small error represents a fairly successful pattern of prediction based on our training features. For the neural network predicting day of the week, the mean squared error was larger: 0.101 on the training data and 0.096 on the testing data. Day of the week had 7 options, also normalized to a maximum of 1, so this error represents a somewhat less successful prediction.
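
&emsp; One way to put these mean squared errors in context is to compare them with the spacing between the normalized classes and with a constant-prediction baseline. The sketch below is illustrative only: the helper names `class_spacing`, `rmse_from_mse`, and `constant_baseline_mse` are hypothetical, it assumes the labels are normalized as k/n for k = 1..n, and the example labels are made up rather than taken from our dataset.

```python
import math

def class_spacing(n_classes):
    """Gap between adjacent labels, assuming n classes normalized to k/n for k = 1..n."""
    return 1.0 / n_classes

def rmse_from_mse(mse_value):
    """Root mean squared error, which is in the same units as the labels."""
    return math.sqrt(mse_value)

def constant_baseline_mse(labels):
    """MSE obtained by always predicting the mean label (the variance of the labels)."""
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels) / len(labels)

# Training-set MSE values copied from the notebook output, with the number of classes
reported = {
    "time of day": (0.048269843588134055, 4),
    "day of week": (0.1011289483940624, 7),
}
for name, (mse_value, n_classes) in reported.items():
    print(name,
          "RMSE:", round(rmse_from_mse(mse_value), 3),
          "class spacing:", round(class_spacing(n_classes), 3))

# Made-up labels (NOT our data) to show how the baseline is computed
example_labels = [0.25, 0.5, 0.5, 0.75, 1.0, 0.5]
print("baseline MSE on example labels:", round(constant_baseline_mse(example_labels), 4))
```

Running `constant_baseline_mse` on the real training labels would give a reference value: a network is only adding predictive value if its MSE falls below that baseline on the same labels.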

&emsp; As mentioned, our attempts to model the collision patterns with folium mapping were partly successful. Some patterns were visible on certain roads, but because so many collisions had to be mapped in a small space, the map markers were very tightly clustered.

&emsp; We split our data into training and testing sets; the test set was 0.001 of the original dataset and the training set was 0.005 of the original dataset. Both sets shrank further when we dropped rows that did not have traffic flow values.
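
&emsp; A split with these fractions can be sketched as follows. This is a minimal illustration, not our exact code: the function name `sample_split` is hypothetical, it assumes the two sets are drawn at random without overlap, and the row count of 1,000,000 is a round stand-in for the size of the collision dataset.

```python
import random

def sample_split(n_rows, train_frac=0.005, test_frac=0.001, seed=0):
    """Return disjoint lists of random row indices for the training and testing sets."""
    rng = random.Random(seed)
    n_train = round(n_rows * train_frac)
    n_test = round(n_rows * test_frac)
    # sample() draws without replacement, so no row lands in both sets
    picked = rng.sample(range(n_rows), n_train + n_test)
    return picked[:n_train], picked[n_train:]

train_idx, test_idx = sample_split(1_000_000)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, so the training and testing errors can be compared across runs.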

&emsp; In conclusion, we were able to use an Artificial Neural Network to predict when traffic collisions will occur from a given location, year, and flow of traffic. Based on the dataset we used and the comparative testing errors of the two neural networks (day of week vs. time of day), we can infer that the collision factors have a stronger pattern or correlation with time of day than with day of the week.


## Conclusions

&emsp; The comparative errors and pattern above make sense: Monday through Friday are likely similar in terms of traffic flow because they are workdays, while Saturday and Sunday are similar to each other. This makes predicting the day of the week more challenging. The issue was difficult to combat because the differences between days are likely small no matter how much data is used.

&emsp; Another issue is the mean squared error of the neural networks. The errors could likely be improved by increasing the size of the data; however, we had to make a trade-off with runtime. Specifically, the code that we used to calculate and merge the traffic flow data with the collision data had a high time cost that grew faster than linearly as the size of the dataset increased. We started with over 1 million rows in our collision dataset, so looping through all of them was not feasible.
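
&emsp; A runtime that grows faster than linearly is typical when every collision row is compared against every traffic-count row. The sketch below shows one common alternative, assuming (hypothetically) that both datasets can be matched on a shared key such as street name and year. The field names, helper names, and toy rows are invented for illustration and are not our actual merge code.

```python
from collections import defaultdict

def make_key(row):
    """Normalize the street name so small formatting differences still match."""
    return (row["street"].strip().upper(), row["year"])

def build_flow_index(flow_rows):
    """Map (street, year) -> average traffic volume in one pass over the flow data."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in flow_rows:
        entry = totals[make_key(row)]
        entry[0] += row["volume"]
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

def attach_flow(collision_rows, flow_index):
    """Add a 'flow' value to each collision; rows with no match get None."""
    return [{**row, "flow": flow_index.get(make_key(row))} for row in collision_rows]

# Toy rows (NOT real data)
flow_rows = [
    {"street": "Broadway", "year": 2016, "volume": 100.0},
    {"street": "BROADWAY ", "year": 2016, "volume": 200.0},
    {"street": "5 Avenue", "year": 2017, "volume": 80.0},
]
collisions = [
    {"street": "broadway", "year": 2016, "hour": 8},
    {"street": "Canal Street", "year": 2016, "hour": 17},
]
merged = attach_flow(collisions, build_flow_index(flow_rows))
print(merged)
```

Each dataset is traversed once and every lookup is a dictionary access, so the cost grows roughly linearly with the number of rows. Collisions whose `flow` is `None` correspond to the rows without traffic flow values that we dropped.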

&emsp; Future work on this project could include changing the feature vectors used to make the predictions in order to discern which factors are most important. It would also be interesting to see whether a neural network could predict the location of accidents instead of just the time. This neural network strategy could also be applied to other cities if we had data about their street locations and traffic flows, which would broaden the scope of our project's significance and potentially help find problematic areas of traffic.
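
&emsp; One possible way to discern which factors matter most, without retraining, is permutation importance: shuffle one feature column at a time and measure how much the mean squared error increases. The sketch below is a minimal illustration that we did not run on our data. The function `permutation_importance` is written here by hand, and `toy_predict` is a stand-in model, not our trained network.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    base_error = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between this feature and the labels
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            shuffled_error = np.mean((predict(X_shuffled) - y) ** 2)
            increases.append(shuffled_error - base_error)
        importances[col] = np.mean(increases)
    return importances

# Stand-in model (NOT our network): its output depends only on column 0
def toy_predict(X):
    return 0.5 * X[:, [0]]

demo_rng = np.random.default_rng(1)
X_demo = demo_rng.random((200, 4))   # 4 features, like our input vectors
y_demo = toy_predict(X_demo)         # labels the stand-in model fits exactly
scores = permutation_importance(toy_predict, X_demo, y_demo)
print(scores)
```

With our trained networks, the same function could in principle be called as `permutation_importance(NN.forward, test_vectors, test_time)`; a larger score for a column would suggest that the network relies more heavily on that feature.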

### Sources:
1 https://www.nsc.org/road-safety/safety-topics/fatality-estimates  
2 https://nypost.com/2019/01/02/nyc-traffic-injuries-are-up-despite-drop-in-fatalities/  
3 https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/data  
4 https://data.cityofnewyork.us/Transportation/Traffic-Volume-Counts-2014-2018-/ertz-hr4r  
