# Data preprocessing: NYC Crashes

In this project, a dataset with information about car crashes in New York is cleaned and preprocessed for Machine Learning. 

Let us begin by importing Pandas and NumPy.

In [41]:
import pandas as pd
import numpy as np

Import the (test) data as a Pandas dataframe and take a first look.

In [24]:
df = pd.read_csv('data_1000.csv')
df.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2017-04-18T00:00:00.000,23:10,STATEN ISLAND,10312.0,40.536728,-74.193344,"(40.536728, -74.193344)",,,243 DARLINGTON AVENUE,...,Unspecified,,,,3654181,Station Wagon/Sport Utility Vehicle,,,,
1,2017-05-06T00:00:00.000,13:00,BRONX,10472.0,40.829052,-73.85038,"(40.829052, -73.85038)",CASTLE HILL AVENUE,BLACKROCK AVENUE,,...,,,,,3665311,Sedan,,,,
2,2017-04-27T00:00:00.000,17:15,QUEENS,11420.0,40.677303,-73.804565,"(40.677303, -73.804565)",135 STREET,FOCH BOULEVARD,,...,Unspecified,,,,3658491,Sedan,Sedan,,,
3,2017-05-09T00:00:00.000,20:10,,,40.624958,-74.145775,"(40.624958, -74.145775)",FOREST AVENUE,RICHMOND AVENUE,,...,Unspecified,Unspecified,,,3666554,Motorcycle,Sedan,Bus,,
4,2017-04-18T00:00:00.000,14:00,BRONX,10456.0,40.828846,-73.90312,"(40.828846, -73.90312)",,,1167 BOSTON ROAD,...,Unspecified,,,,3653269,Sedan,Station Wagon/Sport Utility Vehicle,,,


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   crash_date                     1000 non-null   object 
 1   crash_time                     1000 non-null   object 
 2   borough                        603 non-null    object 
 3   zip_code                       603 non-null    float64
 4   latitude                       901 non-null    float64
 5   longitude                      901 non-null    float64
 6   location                       901 non-null    object 
 7   on_street_name                 763 non-null    object 
 8   off_street_name                474 non-null    object 
 9   cross_street_name              236 non-null    object 
 10  number_of_persons_injured      1000 non-null   int64  
 11  number_of_persons_killed       1000 non-null   int64  
 12  number_of_pedestrians_injured  1000 non-null   in

There are a lot of null values and it is tempting to remove them. But in order to get a better understanding of which data is important, redundant or duplicated, I will first clean it up a bit. We don't want to throw away data if it's not necessary.
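To see how much is actually missing before deciding anything, a per-column summary can help. This is only a minimal sketch on a made-up toy frame (the values and the name `toy` are assumptions, not the real data); the same two calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crash data; the values are made up for illustration.
toy = pd.DataFrame({
    "borough": ["BRONX", np.nan, "QUEENS", np.nan],
    "latitude": [40.83, 40.62, np.nan, 40.68],
    "collision_id": [1, 2, 3, 4],
})

# Number and share of missing values per column, worst column first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
}).sort_values("n_missing", ascending=False)

print(missing)
```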

## Data types
Let us begin by checking that every feature has the correct dtype. 

To start, the **date and time** are just strings now. We can fix that while creating the dataframe, so let's reimport the data and parse both columns into a single datetime.

In [26]:
# reimport the data, combining columns 0 and 1 (crash_date, crash_time) into one datetime column
df = pd.read_csv('data_1000.csv', parse_dates=[[0, 1]])
df.rename(columns={'crash_date_crash_time': 'crash_datetime'}, inplace=True)  # simplify column name
df.head()

Unnamed: 0,crash_datetime,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2017-04-18 23:10:00,STATEN ISLAND,10312.0,40.536728,-74.193344,"(40.536728, -74.193344)",,,243 DARLINGTON AVENUE,0,...,Unspecified,,,,3654181,Station Wagon/Sport Utility Vehicle,,,,
1,2017-05-06 13:00:00,BRONX,10472.0,40.829052,-73.85038,"(40.829052, -73.85038)",CASTLE HILL AVENUE,BLACKROCK AVENUE,,1,...,,,,,3665311,Sedan,,,,
2,2017-04-27 17:15:00,QUEENS,11420.0,40.677303,-73.804565,"(40.677303, -73.804565)",135 STREET,FOCH BOULEVARD,,0,...,Unspecified,,,,3658491,Sedan,Sedan,,,
3,2017-05-09 20:10:00,,,40.624958,-74.145775,"(40.624958, -74.145775)",FOREST AVENUE,RICHMOND AVENUE,,1,...,Unspecified,Unspecified,,,3666554,Motorcycle,Sedan,Bus,,
4,2017-04-18 14:00:00,BRONX,10456.0,40.828846,-73.90312,"(40.828846, -73.90312)",,,1167 BOSTON ROAD,0,...,Unspecified,,,,3653269,Sedan,Station Wagon/Sport Utility Vehicle,,,
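As far as I know, recent Pandas releases warn that the nested-list form of `parse_dates` is deprecated. A possible alternative is to read the columns as strings and combine them afterwards. The sketch below uses two rows copied from the preview above; the names `csv_text` and `toy` are just placeholders.

```python
import io
import pandas as pd

# Two rows in the same layout as the first columns of the data.
csv_text = """crash_date,crash_time,borough
2017-04-18T00:00:00.000,23:10,STATEN ISLAND
2017-05-06T00:00:00.000,13:00,BRONX
"""
toy = pd.read_csv(io.StringIO(csv_text))

# Parse the date part, then add the time of day as a timedelta.
date_part = pd.to_datetime(toy["crash_date"])
time_part = pd.to_timedelta(toy["crash_time"] + ":00")  # "23:10" -> "23:10:00"

# Put the combined column first and drop the two string columns.
toy.insert(0, "crash_datetime", date_part + time_part)
toy = toy.drop(columns=["crash_date", "crash_time"])

print(toy.dtypes)
```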


Another very obvious thing is that `zip_code` is interpreted as a float instead of an int. The [reason](https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na) is that NumPy integer arrays cannot hold missing values, so Pandas falls back to float as soon as a column contains NaN. To keep integers we have to use one of the nullable Pandas types `pd.Int##Dtype()` (or the unsigned `pd.UInt##Dtype()`). Another option would be storing the zip codes as strings. 

In [27]:
# nullable unsigned 16-bit integer: large enough for 5-digit zip codes and keeps missing values
df["zip_code"] = df["zip_code"].astype(pd.UInt16Dtype())
df.head()

Unnamed: 0,crash_datetime,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2017-04-18 23:10:00,STATEN ISLAND,10312.0,40.536728,-74.193344,"(40.536728, -74.193344)",,,243 DARLINGTON AVENUE,0,...,Unspecified,,,,3654181,Station Wagon/Sport Utility Vehicle,,,,
1,2017-05-06 13:00:00,BRONX,10472.0,40.829052,-73.85038,"(40.829052, -73.85038)",CASTLE HILL AVENUE,BLACKROCK AVENUE,,1,...,,,,,3665311,Sedan,,,,
2,2017-04-27 17:15:00,QUEENS,11420.0,40.677303,-73.804565,"(40.677303, -73.804565)",135 STREET,FOCH BOULEVARD,,0,...,Unspecified,,,,3658491,Sedan,Sedan,,,
3,2017-05-09 20:10:00,,,40.624958,-74.145775,"(40.624958, -74.145775)",FOREST AVENUE,RICHMOND AVENUE,,1,...,Unspecified,Unspecified,,,3666554,Motorcycle,Sedan,Bus,,
4,2017-04-18 14:00:00,BRONX,10456.0,40.828846,-73.90312,"(40.828846, -73.90312)",,,1167 BOSTON ROAD,0,...,Unspecified,,,,3653269,Sedan,Station Wagon/Sport Utility Vehicle,,,
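For comparison, here is a small sketch of the string option mentioned above, next to the nullable integer. The series is made up (the second value is a hypothetical zip code with a leading zero, which is the main case where strings behave differently from integers).

```python
import numpy as np
import pandas as pd

# Made-up zip codes as Pandas reads them: floats with a missing value.
zip_float = pd.Series([10312.0, 7030.0, np.nan], name="zip_code")

# Option 1: nullable unsigned integer, as used above.
zip_int = zip_float.astype(pd.UInt16Dtype())

# Option 2: strings. Going through the nullable integer first avoids
# text like "10312.0"; zfill restores a leading zero if there is one.
zip_str = zip_int.astype("string").str.zfill(5)

print(zip_int.tolist())
print(zip_str.tolist())
```

Both versions keep the missing value as `<NA>`; which one is more convenient depends on whether the zip code is later treated as a number or as a category label.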


## Duplicates

`longitude` and `latitude` are also stored together as a string (dtype object) in `location`. Before dropping `location`, let us quickly check that it always agrees with the two numeric columns.

In [28]:
# rebuild the location string from the numeric columns
fake_location = df.apply(lambda x: f"({x['latitude']}, {x['longitude']})", axis=1)

# show all rows where the rebuilt string differs from the stored one
df[df['location'] != fake_location]

Unnamed: 0,crash_datetime,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
24,2018-02-17 14:30:00,MANHATTAN,10019,,,,,,450 W 57th St,0,...,Unspecified,,,,3848714,Sedan,Sedan,,,
25,2019-12-04 15:30:00,STATEN ISLAND,10309,,,,,,715 sharrotts rd,0,...,,,,,4263561,Tractor Truck Diesel,,,,
26,2017-09-12 17:59:00,,,,,,PARK ROW,,,0,...,,,,,3749038,Bus,,,,
27,2017-05-07 15:00:00,,,,,,CENTER BOULEVARD,,,0,...,,,,,3668217,Station Wagon/Sport Utility Vehicle,,,,
36,2017-04-25 19:30:00,,,,,,THROGS NECK EXPRESSWAY,HARDING AVENUE,,0,...,Unspecified,,,,3658008,Sedan,Sedan,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
978,2017-04-25 00:00:00,BROOKLYN,11249,,,,WILLIAMSBURG STREET EAST,KENT AVENUE,,0,...,Unspecified,,,,3657456,Sedan,Box Truck,,,
986,2017-05-05 13:20:00,,,,,,69 ROAD,GRAND CENTRAL PARKWAY,,0,...,Unspecified,,,,3663933,Station Wagon/Sport Utility Vehicle,Station Wagon/Sport Utility Vehicle,,,
989,2019-06-24 07:30:00,,,,,,BEACH 100 STREET,SHORE FRONT PARKWAY,,0,...,Unspecified,,,,4157842,Station Wagon/Sport Utility Vehicle,Sedan,,,
991,2018-02-16 09:55:00,MANHATTAN,10019,,,,WEST 57 STREET,12 AVENUE,,0,...,Unspecified,,,,3847917,Sedan,Sedan,,,


Only the rows where the location is NaN differ (a missing `location` never compares equal to the string built from missing coordinates), so we can safely drop **location**.

In [29]:
df.drop(["location"], axis=1, inplace=True)
df.head()

Unnamed: 0,crash_datetime,borough,zip_code,latitude,longitude,on_street_name,off_street_name,cross_street_name,number_of_persons_injured,number_of_persons_killed,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2017-04-18 23:10:00,STATEN ISLAND,10312.0,40.536728,-74.193344,,,243 DARLINGTON AVENUE,0,0,...,Unspecified,,,,3654181,Station Wagon/Sport Utility Vehicle,,,,
1,2017-05-06 13:00:00,BRONX,10472.0,40.829052,-73.85038,CASTLE HILL AVENUE,BLACKROCK AVENUE,,1,0,...,,,,,3665311,Sedan,,,,
2,2017-04-27 17:15:00,QUEENS,11420.0,40.677303,-73.804565,135 STREET,FOCH BOULEVARD,,0,0,...,Unspecified,,,,3658491,Sedan,Sedan,,,
3,2017-05-09 20:10:00,,,40.624958,-74.145775,FOREST AVENUE,RICHMOND AVENUE,,1,0,...,Unspecified,Unspecified,,,3666554,Motorcycle,Sedan,Bus,,
4,2017-04-18 14:00:00,BRONX,10456.0,40.828846,-73.90312,,,1167 BOSTON ROAD,0,0,...,Unspecified,,,,3653269,Sedan,Station Wagon/Sport Utility Vehicle,,,
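The string comparison used before the drop depends on the floats being printed exactly as they appear in `location`. A slightly more robust variant is to parse the string and compare numbers instead. The sketch below runs on a made-up toy frame (the last row is a deliberately inconsistent example), and the regular expression is my assumption about the `"(lat, lon)"` layout.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "latitude": [40.536728, 40.829052, np.nan, 40.7],
    "longitude": [-74.193344, -73.85038, np.nan, -73.9],
    "location": ["(40.536728, -74.193344)", "(40.829052, -73.85038)",
                 np.nan, "(40.7, -73.8)"],
})

# Pull the two numbers out of the "(lat, lon)" string; missing strings give NaN.
parsed = toy["location"].str.extract(r"\(\s*([-\d.]+),\s*([-\d.]+)\s*\)").astype(float)

# equal_nan=True treats a missing location and missing coordinates as a match.
same = np.isclose(
    parsed.to_numpy(),
    toy[["latitude", "longitude"]].to_numpy(),
    equal_nan=True,
).all(axis=1)

mismatches = toy[~same]
print(mismatches)
```

With this version the rows without coordinates no longer show up, so an empty result would directly mean that `location` carries no extra information.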


Now we look at the vehicle types and check how many unique values there are.

In [44]:
pd.unique(df.loc[:,"vehicle_type_code1":"vehicle_type_code_5"].values.ravel())

array(['Station Wagon/Sport Utility Vehicle', nan, 'Sedan', 'Motorcycle',
       'Bus', 'Taxi', 'Box Truck', 'Pick-up Truck', 'PAS', 'Ambulance',
       'Tractor Truck Diesel', 'Bike', 'Garbage or Refuse', 'deliv',
       'unkno', '3-Door', 'Van', 'Dump', 'Convertible', 'tower',
       'Carry All', '4 dr sedan', 'Tanker', 'PK', 'School Bus', 'FOOD',
       'Flat Bed', 'Moped', 'uhaul', 'Motorscooter', 'TRAIL', 'e-350',
       'Armored Truck', 'PASSENGER VEHICLE', 'SMALL COM VEH(4 TIRES) ',
       'Tractor Truck Gasoline', 'RV', 'SPORT UTILITY / STATION WAGON',
       'FDNY', 'Chassis Cab', 'sedan', 'USPS', 'self'], dtype=object)

This is way too extensive. We can start by removing the overlap that is only due to capitalization.

In [60]:
# lower-case all vehicle types; missing values become empty strings first
lower_case = df.loc[:, "vehicle_type_code1":"vehicle_type_code_5"].fillna("").apply(lambda col: col.str.lower())

In [62]:
pd.unique(lower_case.values.ravel())

array(['station wagon/sport utility vehicle', '', 'sedan', 'motorcycle',
       'bus', 'taxi', 'box truck', 'pick-up truck', 'pas', 'ambulance',
       'tractor truck diesel', 'bike', 'garbage or refuse', 'deliv',
       'unkno', '3-door', 'van', 'dump', 'convertible', 'tower',
       'carry all', '4 dr sedan', 'tanker', 'pk', 'school bus', 'food',
       'flat bed', 'moped', 'uhaul', 'motorscooter', 'trail', 'e-350',
       'armored truck', 'passenger vehicle', 'small com veh(4 tires) ',
       'tractor truck gasoline', 'rv', 'sport utility / station wagon',
       'fdny', 'chassis cab', 'usps', 'self'], dtype=object)
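Lower-casing only merges a single pair of labels here (`Sedan` and `sedan`). A possible next step is to strip stray whitespace and map obvious synonyms to one label. The sketch below uses a few raw labels from the output above; the `aliases` table is purely hypothetical, since deciding which labels mean the same thing is a judgment call.

```python
import numpy as np
import pandas as pd

# A few raw labels taken from the output above, plus a missing value.
raw = pd.Series(["Sedan", "sedan", "4 dr sedan", "SMALL COM VEH(4 TIRES) ",
                 "SPORT UTILITY / STATION WAGON", np.nan])

# Hypothetical alias table: keys are cleaned labels, values the label to keep.
aliases = {
    "4 dr sedan": "sedan",
    "sport utility / station wagon": "station wagon/sport utility vehicle",
}

# Strip whitespace, lower-case, then replace known aliases; NaN stays NaN.
cleaned = raw.str.strip().str.lower().replace(aliases)

print(cleaned.tolist())
```

Unlike the `fillna("")` approach, this keeps missing values as NaN, which may or may not be what we want later on.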

Next, we list the unique values of the contributing factors in the same way.

In [45]:
pd.unique(df.loc[:,"contributing_factor_vehicle_1":"contributing_factor_vehicle_5"].values.ravel())

array(['Driver Inattention/Distraction', 'Unspecified', nan,
       'Failure to Yield Right-of-Way', 'Unsafe Lane Changing',
       'Passing or Lane Usage Improper', 'Other Vehicular',
       'Passing Too Closely', 'Backing Unsafely',
       'Traffic Control Disregarded', 'Driver Inexperience',
       'Unsafe Speed', 'Following Too Closely', 'Obstruction/Debris',
       'Turning Improperly',
       'Pedestrian/Bicyclist/Other Pedestrian Error/Confusion',
       'Pavement Slippery', 'Aggressive Driving/Road Rage',
       'Reaction to Uninvolved Vehicle', 'Steering Failure',
       'Oversized Vehicle', 'View Obstructed/Limited',
       'Traffic Control Device Improper/Non-Working', 'Fell Asleep',
       'Glare', 'Passenger Distraction', 'Accelerator Defective',
       'Failure to Keep Right', 'Alcohol Involvement',
       'Outside Car Distraction', 'Brakes Defective',
       'Pavement Defective', 'Driverless/Runaway Vehicle'], dtype=object)
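The list of unique values does not show how often each factor occurs. One way to get the counts over all five columns at once is to melt them into a single column first. This is a sketch on a made-up toy frame with only two of the columns; the names `toy` and `counts` are placeholders.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "contributing_factor_vehicle_1": ["Driver Inattention/Distraction",
                                      "Unsafe Speed", "Unspecified"],
    "contributing_factor_vehicle_2": ["Unspecified", np.nan, "Unspecified"],
})

# melt() stacks the columns into one long "factor" column;
# value_counts() ignores NaN by default and sorts by frequency.
counts = toy.melt(value_name="factor")["factor"].value_counts()

print(counts)
```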