# Data preprocessing: NYC Crashes

In this project, a dataset with information about car crashes in New York is cleaned and preprocessed for Machine Learning. 

Let us begin by importing Pandas and NumPy.

In [21]:
import pandas as pd
import numpy as np

Import the (test) data as a Pandas dataframe and take a first look.

In [7]:
df = pd.read_csv('data_100000.csv')
df.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2017-04-18T00:00:00.000,23:10,STATEN ISLAND,10312.0,40.536728,-74.193344,"(40.536728, -74.193344)",,,243 DARLINGTON AVENUE,...,Unspecified,,,,3654181,Station Wagon/Sport Utility Vehicle,,,,
1,2017-05-06T00:00:00.000,13:00,BRONX,10472.0,40.829052,-73.85038,"(40.829052, -73.85038)",CASTLE HILL AVENUE,BLACKROCK AVENUE,,...,,,,,3665311,Sedan,,,,
2,2017-04-27T00:00:00.000,17:15,QUEENS,11420.0,40.677303,-73.804565,"(40.677303, -73.804565)",135 STREET,FOCH BOULEVARD,,...,Unspecified,,,,3658491,Sedan,Sedan,,,
3,2017-05-09T00:00:00.000,20:10,,,40.624958,-74.145775,"(40.624958, -74.145775)",FOREST AVENUE,RICHMOND AVENUE,,...,Unspecified,Unspecified,,,3666554,Motorcycle,Sedan,Bus,,
4,2017-04-18T00:00:00.000,14:00,BRONX,10456.0,40.828846,-73.90312,"(40.828846, -73.90312)",,,1167 BOSTON ROAD,...,Unspecified,,,,3653269,Sedan,Station Wagon/Sport Utility Vehicle,,,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 29 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   crash_date                     100000 non-null  object 
 1   crash_time                     100000 non-null  object 
 2   borough                        64974 non-null   object 
 3   zip_code                       64966 non-null   float64
 4   latitude                       91965 non-null   float64
 5   longitude                      91965 non-null   float64
 6   location                       91965 non-null   object 
 7   on_street_name                 73991 non-null   object 
 8   off_street_name                47125 non-null   object 
 9   cross_street_name              25967 non-null   object 
 10  number_of_persons_injured      100000 non-null  int64  
 11  number_of_persons_killed       100000 non-null  int64  
 12  number_of_pedestrians_injured  

There are a lot of null values and it is tempting to remove them. But to get a better understanding of which data is important, redundant, or duplicated, I will first clean it up a bit. We don't want to throw away data if it's not necessary.
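Before deciding what to keep, it can help to see how much of each column is actually missing. The snippet below is only a small sketch on a made-up toy frame (the values and the name `toy` are assumptions for illustration, not taken from the crash file); the same two-method chain should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crash data (values are made up for illustration).
toy = pd.DataFrame({
    "borough": ["BRONX", None, "QUEENS", None],
    "latitude": [40.83, 40.62, np.nan, 40.68],
    "collision_id": [1, 2, 3, 4],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a high share of missing values are the first candidates to inspect, not necessarily to drop.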

## Data types
Let us begin by checking that every feature has the correct dtype. 

To start, the **date and time** are just strings now. We can fix that while creating the dataframe, so let's reimport the data and parse them right away. 

In [26]:
# reimport the data
df = pd.read_csv('data_1000.csv', parse_dates=[[0,1]])
df.rename(columns={'crash_date_crash_time': "crash_datetime"}, inplace=True)  # simplify column name
df.head()

Unnamed: 0,crash_datetime,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2017-04-18 23:10:00,STATEN ISLAND,10312.0,40.536728,-74.193344,"(40.536728, -74.193344)",,,243 DARLINGTON AVENUE,0,...,Unspecified,,,,3654181,Station Wagon/Sport Utility Vehicle,,,,
1,2017-05-06 13:00:00,BRONX,10472.0,40.829052,-73.85038,"(40.829052, -73.85038)",CASTLE HILL AVENUE,BLACKROCK AVENUE,,1,...,,,,,3665311,Sedan,,,,
2,2017-04-27 17:15:00,QUEENS,11420.0,40.677303,-73.804565,"(40.677303, -73.804565)",135 STREET,FOCH BOULEVARD,,0,...,Unspecified,,,,3658491,Sedan,Sedan,,,
3,2017-05-09 20:10:00,,,40.624958,-74.145775,"(40.624958, -74.145775)",FOREST AVENUE,RICHMOND AVENUE,,1,...,Unspecified,Unspecified,,,3666554,Motorcycle,Sedan,Bus,,
4,2017-04-18 14:00:00,BRONX,10456.0,40.828846,-73.90312,"(40.828846, -73.90312)",,,1167 BOSTON ROAD,0,...,Unspecified,,,,3653269,Sedan,Station Wagon/Sport Utility Vehicle,,,
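Newer pandas releases deprecate passing a nested list to `parse_dates`, so depending on the installed version the cell above may emit a warning or stop working. A possible alternative, sketched here on two toy rows in the same string format as the raw file, is to read the columns as plain strings and combine them with `pd.to_datetime`. The frame name `raw` is hypothetical.

```python
import pandas as pd

# Two toy rows in the same string format as the raw file.
raw = pd.DataFrame({
    "crash_date": ["2017-04-18T00:00:00.000", "2017-05-06T00:00:00.000"],
    "crash_time": ["23:10", "13:00"],
})

# Keep only the date part (first 10 characters), append the time, then parse.
combined = raw["crash_date"].str[:10] + " " + raw["crash_time"]
raw["crash_datetime"] = pd.to_datetime(combined, format="%Y-%m-%d %H:%M")

# The two original string columns are no longer needed.
raw = raw.drop(columns=["crash_date", "crash_time"])
print(raw.dtypes)
```

Giving an explicit `format` avoids per-row format guessing, which matters more on the full file than on this toy example.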


Another very obvious thing is that `zip_code` is interpreted as a float instead of an int. [The reason](https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na) is that NumPy integer arrays cannot represent missing values, so Pandas falls back to float64 as soon as a column contains a NaN. To keep integers and missing values together we have to use one of the special nullable Pandas types, `pd.Int##Dtype()`. Another option would be storing the zip codes as strings. 

In [37]:
# nullable unsigned 16-bit integer: keeps missing zip codes as <NA>
df["zip_code"] = df["zip_code"].astype(pd.UInt16Dtype())
df.head()

Unnamed: 0,crash_datetime,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2017-04-18 23:10:00,STATEN ISLAND,10312.0,40.536728,-74.193344,"(40.536728, -74.193344)",,,243 DARLINGTON AVENUE,0,...,Unspecified,,,,3654181,Station Wagon/Sport Utility Vehicle,,,,
1,2017-05-06 13:00:00,BRONX,10472.0,40.829052,-73.85038,"(40.829052, -73.85038)",CASTLE HILL AVENUE,BLACKROCK AVENUE,,1,...,,,,,3665311,Sedan,,,,
2,2017-04-27 17:15:00,QUEENS,11420.0,40.677303,-73.804565,"(40.677303, -73.804565)",135 STREET,FOCH BOULEVARD,,0,...,Unspecified,,,,3658491,Sedan,Sedan,,,
3,2017-05-09 20:10:00,,,40.624958,-74.145775,"(40.624958, -74.145775)",FOREST AVENUE,RICHMOND AVENUE,,1,...,Unspecified,Unspecified,,,3666554,Motorcycle,Sedan,Bus,,
4,2017-04-18 14:00:00,BRONX,10456.0,40.828846,-73.90312,"(40.828846, -73.90312)",,,1167 BOSTON ROAD,0,...,Unspecified,,,,3653269,Sedan,Station Wagon/Sport Utility Vehicle,,,
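To make the two options concrete, here is a minimal sketch on a toy series (the names `zips`, `zips_int` and `zips_str` are made up for this example). It shows the nullable integer conversion next to the string alternative mentioned above.

```python
import numpy as np
import pandas as pd

zips = pd.Series([10312.0, np.nan, 11420.0])  # floats, as read from the CSV

# Nullable unsigned 16-bit integers: missing values become <NA>.
# Note: UInt16 tops out at 65535, which is enough for NYC zip codes
# but not for every US zip code.
zips_int = zips.astype("UInt16")

# Alternative: keep zip codes as strings (missing values stay <NA>).
zips_str = zips_int.astype("string")

print(zips_int)
print(zips_str)
```

Strings are arguably the safer choice when the data may later include zip codes outside New York or codes with leading zeros, since a zip code is a label and not a quantity.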


## Duplicates

`longitude` and `latitude` are also stored together as a string (dtype object) in `location`. Before dropping `location`, let us quickly check whether it always agrees with the two separate columns.

In [20]:
# create series with location strings
fake_location=df.apply(lambda x: f"({x['latitude']}, {x['longitude']})", axis=1)

# filter all rows where they are not equal
df[df['location']!=fake_location]

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
24,2018-02-17T00:00:00.000,14:30,MANHATTAN,10019.0,,,,,,450 W 57th St,...,Unspecified,,,,3848714,Sedan,Sedan,,,
25,2019-12-04T00:00:00.000,15:30,STATEN ISLAND,10309.0,,,,,,715 sharrotts rd,...,,,,,4263561,Tractor Truck Diesel,,,,
26,2017-09-12T00:00:00.000,17:59,,,,,,PARK ROW,,,...,,,,,3749038,Bus,,,,
27,2017-05-07T00:00:00.000,15:00,,,,,,CENTER BOULEVARD,,,...,,,,,3668217,Station Wagon/Sport Utility Vehicle,,,,
36,2017-04-25T00:00:00.000,19:30,,,,,,THROGS NECK EXPRESSWAY,HARDING AVENUE,,...,Unspecified,,,,3658008,Sedan,Sedan,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99935,2019-11-21T00:00:00.000,17:00,,,,,,CROSS ISLAND PARKWAY,149 STREET,,...,Unspecified,,,,4245413,Sedan,,,,
99944,2019-11-15T00:00:00.000,13:52,QUEENS,11373.0,,,,HORACE HARDING EXPRESSWAY,QUEENS BOULEVARD,,...,Driver Inattention/Distraction,,,,4241598,Station Wagon/Sport Utility Vehicle,Box Truck,,,
99965,2020-03-04T00:00:00.000,23:47,,,,,,WAKEFIELD AVENUE,BULLARD AVENUE,,...,Unspecified,Unspecified,,,4297737,Sedan,Pick-up Truck,Sedan,,
99982,2019-11-07T00:00:00.000,17:08,,,,,,BRONX WHITESTONE BRIDGE,,,...,Unspecified,,,,4237898,Station Wagon/Sport Utility Vehicle,Sedan,,,


Only the rows where the location is missing (NaN) come out as not equal, because NaN never compares equal to anything. So we can safely drop **location**.
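The string comparison depends on how floats happen to be printed. As a cross-check, the same test can be done numerically by parsing `location` back into two floats and comparing with a tolerance. This is only a sketch on a toy frame; the names `toy`, `parsed`, `same` and `mismatches` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "latitude": [40.536728, np.nan, 40.829052],
    "longitude": [-74.193344, np.nan, -73.85038],
    "location": ["(40.536728, -74.193344)", np.nan, "(40.829052, -73.85038)"],
})

# Split "(lat, lon)" into two float columns; missing locations stay NaN.
parsed = (
    toy["location"]
    .str.strip("()")
    .str.split(",", expand=True)
    .astype(float)
)
parsed.columns = ["lat", "lon"]

# equal_nan=True treats "missing in both places" as agreement.
same = (
    np.isclose(parsed["lat"], toy["latitude"], equal_nan=True)
    & np.isclose(parsed["lon"], toy["longitude"], equal_nan=True)
)
mismatches = toy[~same]
print(len(mismatches))
```

With this variant the rows that are missing in both places no longer show up, so any row left in `mismatches` would be a real disagreement.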

In [23]:
df.drop(columns=["location"], inplace=True)
df.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2017-04-18T00:00:00.000,23:10,STATEN ISLAND,10312.0,40.536728,-74.193344,,,243 DARLINGTON AVENUE,0,...,Unspecified,,,,3654181,Station Wagon/Sport Utility Vehicle,,,,
1,2017-05-06T00:00:00.000,13:00,BRONX,10472.0,40.829052,-73.85038,CASTLE HILL AVENUE,BLACKROCK AVENUE,,1,...,,,,,3665311,Sedan,,,,
2,2017-04-27T00:00:00.000,17:15,QUEENS,11420.0,40.677303,-73.804565,135 STREET,FOCH BOULEVARD,,0,...,Unspecified,,,,3658491,Sedan,Sedan,,,
3,2017-05-09T00:00:00.000,20:10,,,40.624958,-74.145775,FOREST AVENUE,RICHMOND AVENUE,,1,...,Unspecified,Unspecified,,,3666554,Motorcycle,Sedan,Bus,,
4,2017-04-18T00:00:00.000,14:00,BRONX,10456.0,40.828846,-73.90312,,,1167 BOSTON ROAD,0,...,Unspecified,,,,3653269,Sedan,Station Wagon/Sport Utility Vehicle,,,
