# Working with Missing Data

We'll handle missing data without having to drop rows or columns, using data on motor vehicle collisions released by New York City and published on the [NYC OpenData website](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95). The dataset covers over 1.5 million collisions dating back to 2012, and new data is added continuously.

We'll work with an extract of the full data: crashes from the year 2018.

# Libraries

In [1]:
import pandas as pd

In [2]:
mvc = pd.read_csv("../datasets/nypd_mvc_2018.csv")
mvc.head()

| | unique_key | date | time | borough | location | on_street | cross_street | off_street | pedestrians_injured | cyclist_injured | ... | vehicle_1 | vehicle_2 | vehicle_3 | vehicle_4 | vehicle_5 | cause_vehicle_1 | cause_vehicle_2 | cause_vehicle_3 | cause_vehicle_4 | cause_vehicle_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3869058 | 2018-03-23 | 21:40 | MANHATTAN | (40.742832, -74.00771) | WEST 15 STREET | 10 AVENUE | NaN | 0 | 0 | ... | PASSENGER VEHICLE | NaN | NaN | NaN | NaN | Following Too Closely | Unspecified | NaN | NaN | NaN |
| 1 | 3847947 | 2018-02-13 | 14:45 | BROOKLYN | (40.623714, -73.99314) | 16 AVENUE | 62 STREET | NaN | 0 | 0 | ... | SPORT UTILITY / STATION WAGON | DS | NaN | NaN | NaN | Backing Unsafely | Unspecified | NaN | NaN | NaN |
| 2 | 3914294 | 2018-06-04 | 0:00 | NaN | (40.591755, -73.9083) | BELT PARKWAY | NaN | NaN | 0 | 0 | ... | Station Wagon/Sport Utility Vehicle | Sedan | NaN | NaN | NaN | Following Too Closely | Unspecified | NaN | NaN | NaN |
| 3 | 3915069 | 2018-06-05 | 6:36 | QUEENS | (40.73602, -73.87954) | GRAND AVENUE | VANLOON STREET | NaN | 0 | 0 | ... | Sedan | Sedan | NaN | NaN | NaN | Glare | Passing Too Closely | NaN | NaN | NaN |
| 4 | 3923123 | 2018-06-16 | 15:45 | BRONX | (40.884727, -73.89945) | NaN | NaN | 208 WEST 238 STREET | 0 | 0 | ... | Station Wagon/Sport Utility Vehicle | Sedan | NaN | NaN | NaN | Turning Improperly | Unspecified | NaN | NaN | NaN |


A summary of the columns and their data is below:

* **`unique_key`**: A unique identifier for each collision.
* **`date`, `time`**: Date and time of the collision.
* **`borough`**: The borough, or area of New York City, where the collision occurred.
* **`location`**: Latitude and longitude coordinates for the collision.
* **`on_street`, `cross_street`, `off_street`**: Details of the street or intersection where the collision occurred.
* **`pedestrians_injured`**: Number of pedestrians who were injured.
* **`cyclist_injured`**: Number of people traveling on a bicycle who were injured.
* **`motorist_injured`**: Number of people traveling in a vehicle who were injured.
* **`total_injured`**: Total number of people injured.
* **`pedestrians_killed`**: Number of pedestrians who were killed.
* **`cyclist_killed`**: Number of people traveling on a bicycle who were killed.
* **`motorist_killed`**: Number of people traveling in a vehicle who were killed.
* **`total_killed`**: Total number of people killed.
* **`vehicle_1` through `vehicle_5`**: Type of each vehicle involved in the accident.
* **`cause_vehicle_1` through `cause_vehicle_5`**: Contributing factor for each vehicle in the accident.

## Missing Values

In [3]:
mvc.isna().sum()

unique_key                 0
date                       0
time                       0
borough                20646
location                3885
on_street              13961
cross_street           29249
off_street             44093
pedestrians_injured        0
cyclist_injured            0
motorist_injured           0
total_injured              1
pedestrians_killed         0
cyclist_killed             0
motorist_killed            0
total_killed               5
vehicle_1                355
vehicle_2              12262
vehicle_3              54352
vehicle_4              57158
vehicle_5              57681
cause_vehicle_1          175
cause_vehicle_2         8692
cause_vehicle_3        54134
cause_vehicle_4        57111
cause_vehicle_5        57671
dtype: int64
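
To make the mechanics of `isna().sum()` concrete, here is a minimal sketch on a small made-up DataFrame. The `toy` data is hypothetical and only borrows a few column names from the collisions data for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the collisions dataset.
toy = pd.DataFrame({
    "borough": ["QUEENS", np.nan, "BRONX", np.nan],
    "total_killed": [0, 0, np.nan, 1],
    "time": ["21:40", "14:45", "0:00", "6:36"],
})

# isna() returns a boolean DataFrame of the same shape:
# True where a value is missing, False otherwise.
mask = toy.isna()

# Summing a boolean column counts its True values,
# which gives the number of nulls per column.
null_counts = mask.sum()
print(null_counts)
```

In this toy example, `borough` has two nulls, `total_killed` has one, and `time` has none.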

To give us a better picture of the null values in the data, let's calculate the percentage of null values in each column.

In [5]:
null_counts = mvc.isna().sum()
pd.DataFrame({'null_counts': null_counts, 'null_pct': null_counts / mvc.shape[0] * 100}).T.astype(int)

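| | unique_key | date | time | borough | location | on_street | cross_street | off_street | pedestrians_injured | cyclist_injured | ... | vehicle_1 | vehicle_2 | vehicle_3 | vehicle_4 | vehicle_5 | cause_vehicle_1 | cause_vehicle_2 | cause_vehicle_3 | cause_vehicle_4 | cause_vehicle_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| null_counts | 0 | 0 | 0 | 20646 | 3885 | 13961 | 29249 | 44093 | 0 | 0 | ... | 355 | 12262 | 54352 | 57158 | 57681 | 175 | 8692 | 54134 | 57111 | 57671 |
| null_pct | 0 | 0 | 0 | 35 | 6 | 24 | 50 | 76 | 0 | 0 | ... | 0 | 21 | 93 | 98 | 99 | 0 | 15 | 93 | 98 | 99 |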
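
Note that `astype(int)` truncates the percentages, so a column with a small but nonzero share of nulls (such as `vehicle_1` above) shows as 0. As an alternative sketch, `isna().mean()` gives the fraction of nulls directly, and rounding instead of casting keeps the small values visible. The `toy` DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 200 rows with a known share of missing values.
toy = pd.DataFrame({
    "vehicle_1": ["Sedan"] * 199 + [np.nan],        # 1 of 200 missing
    "vehicle_2": ["Sedan"] * 150 + [np.nan] * 50,   # 50 of 200 missing
})

null_counts = toy.isna().sum()

# The mean of a boolean column is the fraction of True values,
# so multiplying by 100 gives the percentage of nulls.
null_pct = toy.isna().mean() * 100

# Round rather than cast to int, so small percentages are not truncated to 0.
summary = pd.DataFrame({
    "null_counts": null_counts,
    "null_pct": null_pct.round(2),
})
print(summary)
```

Here `vehicle_1` is 0.5% null, which an integer cast would report as 0.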


About a third of the columns have no null values, with the rest ranging from less than 1% to 99%! To make things easier, let's start by looking at the group of columns that relate to people killed in collisions.