# Accident Analysis in Metro Nashville: 2013-2017 (Part 2)

### This notebook will:
* read in the cleaned data for accidents in Metro Nashville for 2013-2017,
* derive data points for non-crashes,
* create a dataframe containing bootstrapped samples of crash/non-crash data,
* store that dataframe in a CSV to be used for the forecast model in Part 3.


#### Import packages

In [1]:
import glob
import os
import random
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline 
import datetime
import time as tm
import numpy as np
from time import gmtime, strftime, localtime

pd.options.mode.chained_assignment = None


#### Run script from Part 1 to get functions, variables into this notebook

In [2]:
%run ./'Nashville Accident Analysis -- Cleanup'.ipynb

CPU times: user 1.03 s, sys: 156 ms, total: 1.18 s
Wall time: 1.2 s


#### Read in data from CSV created in Part 1

In [3]:
non_empty_crash_data_df = pd.read_csv('data/Clean_Crash_Data.csv')
non_empty_crash_data_df.drop('Unnamed: 0',axis=1,inplace=True)
non_empty_crash_data_df



Unnamed: 0,Accident Number,Date and Time,Number of Motor Vehicles,Number of Injuries,Number of Fatalities,Property Damage,Hit and Run,Reporting Officer,Collision Type,Collision Type Description,...,Hour,Year,Week,Weekday,Daytime,Rush Hour Morning,Rush Hour Afternoon,Ramp,Intersection,Interstate
0,20130000050,2013-01-01 00:15:00,2.0,3.0,0.0,N,N,414722,4.0,ANGLE,...,0,2013,1,1,0,0,0,0,1,0
1,20130000270,2013-01-01 00:30:00,2.0,0.0,0.0,N,Y,887608,5.0,SIDESWIPE - SAME DIRECTION,...,0,2013,1,1,0,0,0,0,1,1
2,20130000128,2013-01-01 00:43:00,2.0,0.0,0.0,N,Y,716877,5.0,SIDESWIPE - SAME DIRECTION,...,0,2013,1,1,0,0,0,0,1,0
3,20130000123,2013-01-01 00:45:00,2.0,0.0,0.0,N,N,834804,1.0,REAR END,...,0,2013,1,1,0,0,0,0,0,0
4,20130000160,2013-01-01 00:45:00,2.0,1.0,0.0,N,N,717708,1.0,REAR END,...,0,2013,1,1,0,0,0,0,0,0
5,20130000142,2013-01-01 01:00:00,2.0,0.0,0.0,N,N,217537,1.0,REAR END,...,1,2013,1,1,0,0,0,0,0,0
6,20130000138,2013-01-01 01:05:00,1.0,1.0,0.0,N,N,468160,0.0,NOT COLLISION W/MOTOR VEHICLE-TRANSPORT,...,1,2013,1,1,0,0,0,0,1,0
7,20130000206,2013-01-01 01:25:00,2.0,0.0,0.0,N,N,728059,1.0,REAR END,...,1,2013,1,1,0,0,0,0,0,0
8,20130000306,2013-01-01 01:25:00,2.0,0.0,0.0,N,N,VUPD531,1.0,REAR END,...,1,2013,1,1,0,0,0,0,1,0
9,20130000393,2013-01-01 01:26:00,1.0,0.0,0.0,N,Y,834804,0.0,NOT COLLISION W/MOTOR VEHICLE-TRANSPORT,...,1,2013,1,1,0,0,0,0,1,0


We only have crash data, so a model trained on it alone would see only positive examples and could not learn what separates a crash from a non-crash. To fix this, we'll derive non-crash data. Assuming the data we have is the full population of crashes, we can draw random timestamps and locations, pair them, and filter the crash dataframe on each pair. Most of the time there won't be a crash at a randomly chosen time and place, so we record those pairs in a "random non-crashes" dataframe. Occasionally a pair will match one or more recorded accidents; we record those in a "random crashes" dataframe. We then concatenate the two. Because a random time/location pair so rarely matches an accident, we have to draw a very large bootstrap sample to collect a representative set of crashes.

#### Draw 100,000 random locations and timestamps, with replacement.

In [4]:
sample_size = 100000

In [5]:
random_locs = random.choices(list(non_empty_crash_data_df['Location']), k=sample_size)

In [6]:
random_times = random.choices(list(non_empty_crash_data_df['Rounded Date and Time']), k=sample_size)
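
The two draws above are unseeded, so every run produces a different sample. A minimal sketch of one way to make the draw repeatable, using a toy stand-in frame (`toy_df`) and a hypothetical helper (`draw_pairs`), neither of which is part of the original notebook:

```python
import random
import pandas as pd

# Toy stand-in for non_empty_crash_data_df (hypothetical values).
toy_df = pd.DataFrame({
    'Location': ['(36.1, -86.7)', '(36.2, -86.8)', '(36.3, -86.6)'],
    'Rounded Date and Time': ['2013-01-01 00:00:00',
                              '2013-01-01 01:00:00',
                              '2013-01-02 05:00:00'],
})

def draw_pairs(df, k, seed):
    """Draw k locations and k timestamps independently, with replacement."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    locs = rng.choices(list(df['Location']), k=k)
    times = rng.choices(list(df['Rounded Date and Time']), k=k)
    return locs, times

# The same seed yields the same sample on every run.
locs_a, times_a = draw_pairs(toy_df, k=5, seed=0)
locs_b, times_b = draw_pairs(toy_df, k=5, seed=0)
```

Because locations and timestamps are drawn independently, most resulting pairs never occurred together in the crash data, which is what makes them usable as non-crash candidates.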

#### For each location/timestamp pair, filter the crashes dataframe. If there are crashes, add them to a random crashes dataframe; otherwise, add a row to a random non-crashes dataframe containing the time, lat/long, and best guess of the street address for that location.

In [18]:
%%time

feature_cols = ['Rounded Date and Time', 'Location', 'Street Address',
                'Weather Description', 'Precinct', 'Zip', 'City']

# Collect results in lists and build each dataframe once at the end.
# DataFrame.append was removed in pandas 2.0 and copied the frame on every call.
crash_frames = []
non_crash_rows = []

for idx in range(sample_size):
    time_mask = non_empty_crash_data_df['Rounded Date and Time'] == random_times[idx]
    loc_mask = non_empty_crash_data_df['Location'] == random_locs[idx]
    temp_df = non_empty_crash_data_df[time_mask & loc_mask]
    if temp_df.shape[0] > 0:
        # At least one crash was recorded at this time and place.
        crash_frames.append(temp_df.loc[:, feature_cols])
    else:
        # No crash: use the most common weather recorded for that hour and
        # the most common address details recorded for that location.
        at_time = non_empty_crash_data_df[time_mask]
        at_loc = non_empty_crash_data_df[loc_mask]
        non_crash_rows.append({'Rounded Date and Time': random_times[idx],
                               'Location': random_locs[idx],
                               'Street Address': at_loc['Street Address'].mode()[0],
                               'Weather Description': at_time['Weather Description'].mode()[0],
                               'Precinct': at_loc['Precinct'].mode()[0],
                               'Zip': at_loc['Zip'].mode()[0],
                               'City': at_loc['City'].mode()[0]})

random_crashes_df = pd.concat(crash_frames)
random_non_crashes_df = pd.DataFrame(non_crash_rows)



CPU times: user 2h 14min 32s, sys: 58.2 s, total: 2h 15min 30s
Wall time: 2h 16min 49s
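
The loop above took over two hours because it scans the full crash dataframe once per sample. As a rough sketch only (not the method used in this notebook), the hit/miss lookup could be done in a single pass with a left merge and pandas' `indicator` flag. The frames `crashes` and `samples` below are toy stand-ins with made-up values:

```python
import pandas as pd

# Toy stand-in for non_empty_crash_data_df (hypothetical values).
crashes = pd.DataFrame({
    'Rounded Date and Time': ['t1', 't1', 't2', 't3'],
    'Location': ['A', 'A', 'B', 'A'],
    'Weather Description': ['CLEAR', 'RAIN', 'CLEAR', 'CLOUDY'],
})

# Sampled (time, location) pairs, as produced by the random draws.
samples = pd.DataFrame({
    'Rounded Date and Time': ['t1', 't2', 't3', 't2'],
    'Location': ['A', 'A', 'B', 'B'],
})

keys = ['Rounded Date and Time', 'Location']

# A left merge keeps every sampled pair; the `_merge` column marks whether
# the pair matched at least one recorded crash ('both') or none ('left_only').
merged = samples.merge(crashes, on=keys, how='left', indicator=True)

hits = merged[merged['_merge'] == 'both'].drop(columns='_merge')
misses = merged.loc[merged['_merge'] == 'left_only', keys]
```

This sketch covers only the crash/non-crash split. The most-common weather, precinct, zip, city, and street address for each miss would still have to be filled in, for example from per-hour and per-location modes computed once up front.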


In [19]:
random_crashes_df

Unnamed: 0,Rounded Date and Time,Location,Street Address,Weather Description,Precinct,Zip,City
87364,2016-03-18 16:00:00,"(36.140190000000004, -86.7258)",I 24 & UNKNOWN RAMP,CLEAR,HERMIT,37210.0,NASHVILLE
147115,2017-12-22 15:00:00,"(36.09483, -86.70694)",MM 55 0 I 24,CLOUDY,SOUTH,37211.0,NASHVILLE
22516,2013-12-04 04:00:00,"(36.226315146602396, -86.60320828671992)",SAUNDERSVILLE RD & SHUTE LN,NO ADVERSE CONDITIONS,HERMIT,37076.0,HERMITAGE
122540,2017-04-07 17:00:00,"(36.263059999999996, -86.7573)",DICKERSON PKE & WESTCHESTER DR,CLEAR,MADISO,37207.0,NASHVILLE
32545,2014-05-11 15:00:00,"(36.03888, -86.78281)",I 65 S & OLD HICKORY BLVD,NO ADVERSE CONDITIONS,MIDTOW,37027.0,BRENTWOOD
60698,2015-05-18 07:00:00,"(36.140190000000004, -86.7258)",I 24 & UNKNOWN RAMP,CLOUDY,HERMIT,37210.0,NASHVILLE
60715,2015-05-18 07:00:00,"(36.140190000000004, -86.7258)",I 24 & UNKNOWN RAMP,RAIN,HERMIT,37210.0,NASHVILLE
140611,2017-10-18 17:00:00,"(36.080090000000006, -86.72446)",HARDING PL & SEVEN MILE CREEK HARDING MA,CLEAR,SOUTH,37211.0,NASHVILLE
126223,2017-05-15 13:00:00,"(36.17904, -86.7737)",MM 47 7 I 24,CLEAR,EAST,37207.0,NASHVILLE
100151,2016-08-08 00:00:00,"(36.1693663201, -86.6798927725)",LEBANON PKE & MCGAVOCK PKE,CLEAR,HERMIT,37214.0,NASHVILLE


#### For the two new dataframes, add a column for Crash Recorded and add a 1 for all crashes and a 0 for all non-crashes.

In [20]:
random_crashes_df['Crash Recorded'] = 1
random_non_crashes_df['Crash Recorded'] = 0

In [21]:
random_crashes_df.shape

(133, 8)

In [22]:
random_non_crashes_df.shape

(99872, 8)

#### Because there are so many more non-crashes than crashes, upsample the crashes to balance the classes.

In [24]:
#this will upsample:
random_crashes_upsample_df = pd.DataFrame(random.choices(random_crashes_df.values, k=sample_size), columns=list(random_crashes_df.columns))
random_crashes_upsample_df

Unnamed: 0,Rounded Date and Time,Location,Street Address,Weather Description,Precinct,Zip,City,Crash Recorded
0,2015-08-14 15:00:00,"(36.15477, -86.77942)",US HWY 41 & US HWY 31,CLEAR,CENTRA,37203.0,NASHVILLE,1
1,2017-10-18 17:00:00,"(36.080090000000006, -86.72446)",HARDING PL & SEVEN MILE CREEK HARDING MA,CLEAR,SOUTH,37211.0,NASHVILLE,1
2,2017-05-31 17:00:00,"(36.14439, -86.69475)",MM 215 0 I 40,CLEAR,HERMIT,37214.0,NASHVILLE,1
3,2016-10-14 15:00:00,"(36.10742, -86.71966)",I24 W EXT RAMP & I 24,RAIN,SOUTH,37211.0,NASHVILLE,1
4,2017-05-31 17:00:00,"(36.14439, -86.69475)",MM 215 0 I 40,CLEAR,HERMIT,37214.0,NASHVILLE,1
5,2017-11-18 08:00:00,"(36.14895, -86.77942)",MM 209 8 I 40,CLOUDY,CENTRA,37203.0,NASHVILLE,1
6,2015-06-11 16:00:00,"(36.16905, -86.77226999999999)",S 1ST ST & WOODLAND ST,CLEAR,CENTRA,37213.0,NASHVILLE,1
7,2013-02-28 17:00:00,"(36.14723, -86.74364)",I 40 W & FESSLERS LN,NO ADVERSE CONDITIONS,HERMIT,37210.0,NASHVILLE,1
8,2017-11-09 09:00:00,"(36.085650688200005, -86.69982457620002)",CANE RIDGE RD & BELL RD,CLEAR,SOUTH,37013.0,ANTIOCH,1
9,2015-12-17 17:00:00,"(36.15463, -86.77916)",8TH AVS & LAFAYETTE ST,CLEAR,CENTRA,37203.0,NASHVILLE,1
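
An alternative way to upsample, shown here only as a sketch on toy data, is `DataFrame.sample` with `replace=True`, which keeps each column's original dtype and column names. `toy_crashes` is a hypothetical stand-in for `random_crashes_df`:

```python
import pandas as pd

# Toy stand-in for random_crashes_df (hypothetical values).
toy_crashes = pd.DataFrame({
    'Precinct': ['SOUTH', 'EAST', 'HERMIT'],
    'Zip': [37211.0, 37207.0, 37210.0],
    'Crash Recorded': [1, 1, 1],
})

# Sample rows with replacement until the minority class reaches the target
# size; random_state is fixed here only so the sketch is reproducible.
upsampled = toy_crashes.sample(n=10, replace=True, random_state=0)
upsampled = upsampled.reset_index(drop=True)
```

In the notebook the target size would be `sample_size` rather than 10.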


#### ONE-HOT ALL: Concatenate the two dataframes into one, extract features, and one-hot encode the categorical columns.

In [25]:
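def final_data_prep(df1, df2):
    # Stack the non-crash and crash rows; sort=False keeps df1's column order.
    df_out = pd.concat([df1, df2], sort=False)
    df_out['Rounded Date and Time'] = pd.to_datetime(df_out['Rounded Date and Time'])
    # Helpers defined in Part 1, called for their in-place effect:
    # they add the time and address feature columns to df_out.
    extract_time_features(df_out)
    extract_address_features(df_out)
    # Columns to one-hot encode; cast to category so numeric ones are expanded too.
    df_temp = df_out.loc[:,['Weather Description', 'Precinct', 'Zip', 'City', 'Day Of Week', 
                            #'Day Of Month',
                            'Month', 'Hour', 
                            #'Week'
                           ]].astype('category')
    # Binary flags and the target column are passed through unchanged.
    df_temp_2 = df_out.loc[:,['Weekday', 'Daytime', 'Rush Hour Morning', 'Rush Hour Afternoon', 'Ramp',
            'Intersection', 'Interstate','Crash Recorded']]
    df_temp = pd.get_dummies(df_temp)
    df_out = pd.concat([df_temp, df_temp_2],axis=1)
    return df_out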
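
The cast to `'category'` in `final_data_prep` matters because `pd.get_dummies` by default expands only object, string, and categorical columns and leaves numeric ones such as `Hour` as they are. A small sketch on toy data (the frame `toy` is hypothetical) shows the difference:

```python
import pandas as pd

toy = pd.DataFrame({
    'Hour': [0, 7, 17],
    'Weather Description': ['CLEAR', 'RAIN', 'CLEAR'],
})

# Without the cast, the numeric Hour column is passed through untouched
# and only the string column is expanded into indicator columns.
plain = pd.get_dummies(toy)

# After casting to 'category', every column is expanded into indicators.
encoded = pd.get_dummies(toy.astype('category'))
```

Here `plain` still contains a single `Hour` column, while `encoded` has one indicator column per distinct hour (`Hour_0`, `Hour_7`, `Hour_17`) alongside the weather indicators.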

In [26]:
model_data_df = final_data_prep(random_non_crashes_df,random_crashes_upsample_df)



#### Write this out to a CSV that can be used later to train a model.

In [27]:
model_data_df.to_csv('data/Random_Model_Data_All_One_Hot.csv')
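
Note that `to_csv` writes the dataframe index by default, which is why the CSV from Part 1 came back with an `Unnamed: 0` column that had to be dropped. A short sketch on toy data, using an in-memory buffer in place of a file, shows the effect of `index=False`. Whether Part 3 expects the index column is not known here, so this is illustrative only:

```python
import io
import pandas as pd

toy = pd.DataFrame({'Weekday': [1, 0], 'Crash Recorded': [0, 1]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
round_trip_a = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
round_trip_b = pd.read_csv(without_index)
```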

In [28]:
model_data_df

Unnamed: 0,Weather Description_BLOWING SAND/SOIL/DIRT,Weather Description_BLOWING SNOW,Weather Description_CLEAR,Weather Description_CLOUDY,Weather Description_FOG,Weather Description_NO ADVERSE CONDITIONS,Weather Description_OTHER (NARRATIVE),Weather Description_RAIN,Weather Description_SEVERE CROSSWIND,"Weather Description_SLEET, HAIL",...,Hour_22,Hour_23,Weekday,Daytime,Rush Hour Morning,Rush Hour Afternoon,Ramp,Intersection,Interstate,Crash Recorded
0,0,0,1,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,1,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,1,1,0,0,0,1,1,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,1,1,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,1,0,0
5,0,0,1,0,0,0,0,0,0,0,...,0,0,1,1,0,1,0,0,1,0
6,0,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
7,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
8,0,0,1,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,1,0,0
9,0,0,0,0,0,0,0,1,0,0,...,0,0,1,1,0,1,0,1,0,0
