# Dealing with Missing Data

### Imports
Let's go ahead and import the libraries we will need. In this particular instance, we will be relying on `pandas`, a popular data analytics library in the Python programming language.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Data Loading
Now that we've imported all the tools we will need, we will load our data into a tabular data representation, known as a DataFrame. In this particular exercise, we will be using data recorded by the Chicago Police Department. The data contains a collection of crimes that have been committed in the city of Chicago since January 1st, 2015. Note that I filtered the data based on this date myself. You can find the data in `data/crimes.csv`.

In [None]:
crimes = pd.read_csv("data/crimes.csv")

In [None]:
crimes.head(5)

### Identifying Columns with Missing Data
Now that we have loaded our data, we will need to run some queries on it to determine which columns contain missing data. Calling `isnull().sum()` counts the missing values in each column.

In [None]:
crimes.isnull().sum()

Yikes! Looks like we've got quite a problem with our longitude and latitude data. We are going to need to fix that!
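Raw counts are hard to judge without knowing how many rows there are, so it can also help to look at the share of missing values per column. Here is a minimal sketch on a toy frame. The name `toy` and its values are made up for illustration and are not taken from `crimes.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crimes table; the values are hypothetical.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Latitude": [41.88, np.nan, 41.75, 41.90],
    "Longitude": [-87.63, np.nan, -87.60, -87.70],
})

# isnull() yields booleans, so the column-wise mean is the fraction missing.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

The same expression should work on `crimes` directly, assuming it has been loaded as above.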

In [None]:
missing_lat_lng = crimes[crimes.Latitude.isnull()]

In [None]:
missing_lat_lng.head(5)

That's a fair amount of data! Let's find out what percentage of the data set is missing latitude and longitude values. That is to say, if we decide to omit these rows from our data set, how much are we losing?

In [None]:
(len(missing_lat_lng.index) / len(crimes.index)) * 100

So we would have to drop a little under 2% of our data. That's not consequential. But let's assume that we do want to retain the missing data and try to fill the values. But first, let's think a little bit about _why_ this might be the case. What about the way the data is collected might result in this? Is this data missing at random or not? Let's find out by extracting the rows where the latitude and longitude values are missing and plotting when they occur!

In [None]:
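# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on a filtered slice
missing_lat_lng = missing_lat_lng.assign(Date=pd.to_datetime(missing_lat_lng['Date']))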

In [None]:
# resample(..., how='count') was removed from pandas; call .count() on the resampler instead
missing_lat_lng.set_index('Date').resample('D').count().plot(y='ID')

So it appears that a couple of crimes get reported without latitude or longitude values throughout the month. There is also a rather interesting spike in crimes without latitude and longitude data through the winter and spring of 2016. This is something that we should take note of and investigate further. For now, let's assume that we are going to fill in our missing values with artificial data, and account for the fact that it is artificial.
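One way to start that investigation is to compare the share of rows with missing coordinates from month to month, since a raw count can rise simply because more crimes were reported. The sketch below uses a toy frame with made-up dates and values. The `missing` column and the `monthly_share` name are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical records: two in January 2016, two in February 2016.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-01-05", "2016-01-20", "2016-02-03", "2016-02-10",
    ]),
    "Latitude": [41.88, 41.75, np.nan, 41.90],
})

# Mark rows with no latitude, then average the flag within each calendar month.
toy["missing"] = toy["Latitude"].isnull()
monthly_share = toy.groupby(toy["Date"].dt.to_period("M"))["missing"].mean()
print(monthly_share)
```

If the share stays roughly flat across months, the gaps look closer to random; a month that stands out suggests something changed in how the data was collected.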

### Filling In With Artificial Data

Let's fill in the missing data with artificial latitude and longitude values. In this specific case, these values will be the latitude and longitude of the city itself, specifically (41.8781136, -87.6297982). We'll also add a column to our data set that represents whether or not the latitude and longitude of the row are artificial.

In [None]:
latitude = 41.8781136
longitude = -87.6297982

In [None]:
crimes['Latitude'] = crimes['Latitude'].fillna(latitude)

In [None]:
crimes['Longitude'] = crimes['Longitude'].fillna(longitude)

In [None]:
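# 'Location' has not been filled yet, so its nulls mark exactly the rows that received artificial coordinates
crimes['Artificial'] = crimes['Location'].isnull()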

In [None]:
crimes['Location'] = crimes['Location'].fillna('(' + str(latitude) +  ',' + str(longitude) + ')')

In [None]:
crimes.isnull().sum()

In [None]:
crimes.head(10)
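The indicator column is what lets us account for the artificial values later on. A minimal sketch on a toy frame (hypothetical values, with the constant names `CITY_LAT` and `CITY_LNG` chosen here for illustration) shows the pattern: flag the gaps before filling them, then filter on the flag whenever a calculation depends on real locations.

```python
import numpy as np
import pandas as pd

# Placeholder coordinates, the same city-level values used above.
CITY_LAT, CITY_LNG = 41.8781136, -87.6297982

toy = pd.DataFrame({
    "Latitude": [41.75, np.nan, 41.95],
    "Longitude": [-87.60, np.nan, -87.70],
})

# Flag first: once the gaps are filled, isnull() can no longer find them.
toy["Artificial"] = toy["Latitude"].isnull()
toy["Latitude"] = toy["Latitude"].fillna(CITY_LAT)
toy["Longitude"] = toy["Longitude"].fillna(CITY_LNG)

# Exclude placeholder rows from any statistic that depends on location.
observed = toy[~toy["Artificial"]]
print(observed["Latitude"].mean())
```

Without the filter, every placeholder row would pull location-based statistics toward the single fill point.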

### Filling In With Approximate Data

This particular section will be left as an exercise to the reader. While the latitude and longitude values are missing for some rows, the block of each criminal occurrence is not. We can use this data and a geocoding API to fill in the missing values with approximate data. Follow the instructions below to complete this task.

Need help while working on this section? Send an email to safia@safia.rocks!

**1. Create a copy of the 'Block' column where the 'X's in the building number are replaced with 0s.**
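If you get stuck, here is one possible approach, sketched on made-up block strings. It assumes each block starts with a building number whose trailing digits are masked with `X`, and it anchors the pattern to the start of the string so that an `X` in a street name is left alone. The sample values and the `zero_fill` helper are hypothetical.

```python
import pandas as pd

# Hypothetical block values in the masked-building-number style.
blocks = pd.Series([
    "026XX S CALIFORNIA AVE",
    "0000X W TERMINAL ST",
    "001XX N DEXTER AVE",
])

def zero_fill(match):
    """Keep the leading digits and swap each masking X for a 0."""
    digits, masked = match.group(1), match.group(2)
    return digits + "0" * len(masked)

# ^ anchors the match to the leading building number only.
block_zeroed = blocks.str.replace(r"^(\d*)(X+)", zero_fill, regex=True)
print(block_zeroed.tolist())
```

On the real data you would apply the same call to `crimes['Block']` and store the result in a new column, leaving the original untouched.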