# Data Cleaning
## Overview
- Load in data used in the analyses
- Understand the structure, granularity, and quality of the data
- Clean up the data for further inspection

## Datasets
1. **Oakland neighborhoods**
    - GeoJSON format data to plot different neighborhoods in Oakland. The idea here is to be able to aggregate any values of interest (e.g., incidents of graffiti) by easily identifiable neighborhoods.
1. **Oakland city service requests**
    - CSV file containing requests (e.g., potholes, illegal dumping, etc.) from residents.
1. **Residential zones within 300 feet of industrial zones**
    - GeoJSON data that shows areas in which residents live very close to industrial zones.

In [1]:
# Standard tools for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Tools specific for geospatial data analysis
from shapely.geometry import shape, mapping, Point, Polygon
import geopandas as gpd

# Tools from the Python Standard Library
import os
import sys
import re

from IPython.display import display
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)

First, let's take a look at files that are available:

In [None]:
DATADIR = '../data/'
!ls $DATADIR

## 1. Oakland neighborhood classification
- A GeoJSON file classifying the areas in Oakland into neighborhoods was pulled from https://data.oaklandnet.com/Property/Oakland-Neighborhoods/7zky-kcq9 on 22 November 2017.
- The downloaded file was renamed, replacing spaces with underscores, and setting all letters to lowercase for convenience.

Let's read in this GeoJSON file:

### Note
It wasn't trivial to get this working! I spent quite some time hacking to get this to come out right (ultimately, it just came down to finding the right versions to use). Reading GeoJSON into GeoPandas DataFrames should be straightforward, but I was met with quite a few errors. After some tracking, I found that the right combination was installing `shapely` and `geopandas` with `pip` instead of from `conda-forge`.

In [None]:
neighborhoods = gpd.read_file(DATADIR + 'oakland_neighborhoods.geojson')

In [None]:
display(neighborhoods.head())
display(neighborhoods.tail())

There's an empty `description` column that we can remove:

In [None]:
del neighborhoods['description']

As we see, this GeoDataFrame contains both points and polygons. The points are used as markers (e.g., on interactive maps), while the polygons give us the actual shapes. Let's pull the points out and store their coordinates in separate `center_lon` and `center_lat` columns, so we have a single point to characterize each neighborhood, if needed.

In [None]:
# Make a temporary DataFrame only containing points
neighborhood_centers = neighborhoods[(neighborhoods.geom_type == 'Point')]

# Remove points from the neighborhood DataFrame
neighborhoods = neighborhoods[~(neighborhoods.geom_type == 'Point')]

# Reset the index so that it starts at 0, discarding the old labels
neighborhoods = neighborhoods.reset_index(drop=True)

Get the longitude and latitude from each center coordinate, and add that to the main neighborhoods DataFrame:

In [None]:
center_lon = neighborhood_centers['geometry'].apply(lambda x: x.coords[0][0])
center_lat = neighborhood_centers['geometry'].apply(lambda x: x.coords[0][1])

In [None]:
neighborhoods['center_lon'] = center_lon
neighborhoods['center_lat'] = center_lat
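
A caveat on the assignment above: pandas aligns a Series to a DataFrame by index label, not by position. Since `neighborhood_centers` keeps its original labels while `neighborhoods` was re-indexed from 0, the centers only land on the right rows if the labels happen to line up, which depends on how the file orders its points and polygons (not verified here). A minimal sketch with made-up names and values showing the difference:

```python
import pandas as pd

# Toy stand-in for the polygon rows after the index reset: labels 0, 1, 2
polygons = pd.DataFrame({"name": ["A", "B", "C"]})

# Toy stand-in for the point rows, which keep their original labels (here 3, 4, 5)
center_lon = pd.Series([-122.27, -122.25, -122.20], index=[3, 4, 5])

# Label-based assignment: no labels match, so every value becomes NaN
polygons["lon_by_label"] = center_lon

# Positional assignment: assumes row i of the points describes row i of the polygons
polygons["lon_by_position"] = center_lon.to_numpy()

print(polygons)
```

If the points and polygons are known to be in the same order, the positional form is the safer choice; otherwise, matching on a shared attribute such as the neighborhood name would be more robust.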

In [None]:
neighborhoods.head()

Let's check these neighborhoods out!

In [None]:
f, ax = plt.subplots()
neighborhoods.plot(ax=ax);

Now save this to file for future use:

In [None]:
# The .shp format saves several files, so let's make a special subdirectory for that
os.makedirs(DATADIR + 'neighborhoods', exist_ok=True)


neighborhoods.to_file(DATADIR + 'neighborhoods/oakland_neighborhoods_clean.shp',
                      driver='ESRI Shapefile')

### Maps

Notice how this just gives us the shapes of our neighborhoods without any context. Later on, I will plot these neighborhoods on top of a map (e.g., satellite images) so we get an idea of where they are located with respect to other features we are familiar with. To do this, we will use `basemap`. It's worth noting that `basemap` is being replaced with `cartopy`, but I'll stick with `basemap` for the time being since I'm somewhat familiar with it :).

We will use the same coordinates from the above plot to get our $x$ and $y$ limits. In `basemap`, these are referred to as the lower-left and upper-right corner longitudes and latitudes: `llcrnrlon`, `llcrnrlat`, `urcrnrlon`, and `urcrnrlat`.
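
As a rough sketch of that mapping: GeoPandas exposes the overall bounding box as `neighborhoods.total_bounds`, ordered `(minx, miny, maxx, maxy)`. The helper below turns such a tuple into the four corner keywords. The function name `bounds_to_corners`, the `pad` argument, and the bounding box values are made up for illustration.

```python
def bounds_to_corners(bounds, pad=0.0):
    """Map (min_lon, min_lat, max_lon, max_lat) to basemap-style corner keywords."""
    min_lon, min_lat, max_lon, max_lat = bounds
    return {
        "llcrnrlon": min_lon - pad,  # lower-left corner longitude
        "llcrnrlat": min_lat - pad,  # lower-left corner latitude
        "urcrnrlon": max_lon + pad,  # upper-right corner longitude
        "urcrnrlat": max_lat + pad,  # upper-right corner latitude
    }

# Made-up bounding box roughly around Oakland, for illustration only
corners = bounds_to_corners((-122.36, 37.70, -122.11, 37.89))
print(corners)
```

The resulting dictionary could then be unpacked into the map constructor along with the projection settings, e.g. `Basemap(**corners, ...)`.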

## 2. City Service Requests
- A CSV file containing service requests to the city was pulled on the afternoon of 11 December 2017. This record is regularly updated and contains the following information of interest:
    1. Request ID
    1. Date and time at which the request was registered in the system
    1. Source of service request (i.e., call, email, website)
    1. Description
    1. Request category (e.g., GRAFFITI)
    1. Request location (address and/or GPS coordinates)
    1. Status (open, closed, cancelled, etc.)
    1. Date and time at which the request was closed in the system
- The purpose for considering this data is to look at the frequency at which requests are made in different neighborhoods, as well as the distribution of times it takes to resolve issues.
- Now that I think about it, this is quite a rich dataset in itself, and could be the basis of this project alone!

In [None]:
service_requests = pd.read_csv(DATADIR + 'Service_requests_received_by_the_Oakland_Call_Center.csv')

In [None]:
service_requests.head()

For now, I'll drop the `REFERREDTO`, `SRX`, `SRY`, `COUNCILDISTRICT`, and `BEAT` columns.

In [None]:
service_requests.drop(columns=['REFERREDTO', 'SRX', 'SRY', 'COUNCILDISTRICT', 'BEAT'], inplace=True)

In [None]:
service_requests.head()

Just from looking at these, we can see that many requests look pretty boring (e.g., reporting a missed trash pickup). Let's explore further to see whether some of the data is more informative.

In [None]:
display(service_requests['REQCATEGORY'].unique())
display(service_requests['SOURCE'].unique())
display(service_requests['STATUS'].unique())

Let's take a look at the unfunded requests:

In [None]:
service_requests[service_requests['STATUS'] == 'UNFUNDED'].head()

### Get GPS Coordinates

We will want the GPS coordinates for easier plotting later on. Let's do some regex-ing to clean those addresses/coordinates up:

In [None]:
service_requests['REQADDRESS'].head()

Here's a regular expression that can be used to pull GPS coordinates from the address column:

In [None]:
# Raw string so the backslashes reach the regex engine unchanged
re.findall(r'[\d.]+, [\-\d.]+', service_requests['REQADDRESS'].iloc[3])

We can wrap this search in a function that returns the GPS coordinates as a tuple. I put this function in a script located at `scripts/oaktext.py` so that it can be reused and tested.*

<sub>* This is a bit of a contrived example for unit testing. </sub>

In [2]:
# Load the module that the scripts are stored in
sys.path.append("../scripts")

# Import the module for getting coordinates
from oaktext import get_coords
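
The contents of `scripts/oaktext.py` are not shown in this notebook, so the following is only a sketch of what such a helper might look like, built on the same regular expression as above. The name `extract_coords`, the `(lat, lon)` ordering, and the example address are assumptions, not the actual implementation of `get_coords`.

```python
import re

# Same pattern as the notebook cell above, with groups added around each number
COORD_PATTERN = re.compile(r"([\d.]+), ([\-\d.]+)")


def extract_coords(address):
    """Return (lat, lon) as floats if the text contains a coordinate pair, else None."""
    if not isinstance(address, str):
        return None  # e.g., NaN for a missing address
    for match in COORD_PATTERN.finditer(address):
        try:
            return float(match.group(1)), float(match.group(2))
        except ValueError:
            continue  # matched stray punctuation rather than two numbers
    return None


# Made-up address string for illustration
print(extract_coords("2100 EXAMPLE ST (37.8044, -122.2712)"))
print(extract_coords("123 MAIN ST"))
```

Returning `None` when nothing matches keeps the function usable with `Series.apply`, and it is also easy to cover with a couple of unit tests.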

In [None]:
get_coords(service_requests['REQADDRESS'].iloc[1])

In [None]:
service_requests['coordinates'] = service_requests['REQADDRESS'].apply(get_coords)

In [None]:
service_requests.head()

In [None]:
service_requests.loc[1]

In [None]:
print('Number of entries with coordinates:', service_requests[service_requests['coordinates'].notnull()].shape[0])
print('Number of entries without coordinates:', service_requests[service_requests['coordinates'].isna()].shape[0])

Let's limit ourselves to the entries that have coordinates for us to plot (geocoding the provided addresses is entirely possible, but beyond the scope of the current analysis). As we see above, we still have a good amount of data to play with. It's worth noting that selecting only the entries with GPS coordinates may introduce a bias in our results, or the effect may be negligible (e.g., the people responsible for data entry may simply not have entered the coordinates...).
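
One way to get a feel for that possible bias is to compare how the request categories are distributed among entries with and without coordinates. A minimal sketch on made-up rows (the `POTHOLE` label and all values are invented; only the column names come from this dataset):

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "REQCATEGORY": ["GRAFFITI", "GRAFFITI", "POTHOLE", "POTHOLE", "POTHOLE", "GRAFFITI"],
    "coordinates": [(37.80, -122.27), None, (37.81, -122.25), (37.79, -122.22), None, None],
})

# Label each row by whether it has coordinates
group = (toy["coordinates"].notna()
         .map({True: "with coords", False: "no coords"})
         .rename("group"))

# Share of each category within each group (every column sums to 1)
shares = pd.crosstab(toy["REQCATEGORY"], group, normalize="columns")
print(shares)
```

If the two columns look very different on the real data, dropping the entries without coordinates would skew any per-category results.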

In [None]:
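# Keep only rows with coordinates; .copy() avoids SettingWithCopyWarning on later assignments
service_requests = service_requests[service_requests['coordinates'].notnull()].copy()
print(service_requests.shape)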

In [None]:
service_requests.head()

And since we are not using the address, let's just drop that column:

In [None]:
service_requests.drop(columns='REQADDRESS', inplace=True)

### Convert dates/times to datetime

Another piece of data that may be of interest to us is how long it took to close the request. For example, we can geospatially map out how long it took to close requests. To do this, let's first make sure the times are all in a common format.

In [None]:
# Plain column assignment replaces the column, so the result has a datetime dtype
service_requests['DATETIMEINIT'] = pd.to_datetime(service_requests['DATETIMEINIT'],
                                                  format="%m/%d/%Y %I:%M:%S %p")
service_requests['DATETIMECLOSED'] = pd.to_datetime(service_requests['DATETIMECLOSED'],
                                                    format="%m/%d/%Y %I:%M:%S %p")
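
For reference, the format string describes a month/day/year date followed by a 12-hour clock time (`%I`) with an AM/PM marker (`%p`). A quick check with the standard library, using made-up timestamps in that layout:

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"

# Made-up timestamps in the layout the format string expects
afternoon = datetime.strptime("12/11/2017 02:30:00 PM", FMT)
just_after_midnight = datetime.strptime("12/11/2017 12:05:00 AM", FMT)

print(afternoon)            # 2017-12-11 14:30:00
print(just_after_midnight)  # 2017-12-11 00:05:00
```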

What's the range of the dates here?

In [None]:
service_requests['DATETIMEINIT'].min(), service_requests['DATETIMEINIT'].max()

Neat, we have about 8 years of data here.

In [None]:
# Whole days between opening and closing (NaN where the request was never closed)
service_requests.loc[:, 'time_to_close'] = (service_requests['DATETIMECLOSED']
                                            - service_requests['DATETIMEINIT']).dt.days

In [None]:
service_requests.time_to_close.head()

In [None]:
service_requests.sort_values(by='time_to_close', ascending=False).head()

We see that sometimes it takes years to close these! However, this only covers requests that have actually been closed. If a request has never been closed, its `DATETIMECLOSED` value is `NaT` (i.e., not a time), so `time_to_close` is missing for it. As a result, we probably want a variable that gives us the time since opening. To do this, we can take the difference between the time the dataset was downloaded and the time the request was opened.

In [None]:
# This dataset was downloaded on 11 Dec. 2017
t_0 = pd.Timestamp(2017, 12, 11)

In [None]:
service_requests.loc[:, 'time_since_init'] = (t_0 - service_requests['DATETIMEINIT']).dt.days
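
To see how `NaT` behaves in these two calculations, here is a small sketch on made-up dates. It uses `.dt.days` to get whole days; the `is_open` column is an extra added for illustration and is not part of the dataset.

```python
import pandas as pd

# Two made-up requests; the second one was never closed
toy = pd.DataFrame({
    "DATETIMEINIT": pd.to_datetime(["2017-01-01", "2017-06-01"]),
    "DATETIMECLOSED": pd.to_datetime(["2017-01-11", None]),
})
snapshot = pd.Timestamp(2017, 12, 11)  # download date used in this notebook

# NaT propagates through the subtraction and shows up as NaN days
toy["time_to_close"] = (toy["DATETIMECLOSED"] - toy["DATETIMEINIT"]).dt.days

# Always defined, since every request has an opening time
toy["time_since_init"] = (snapshot - toy["DATETIMEINIT"]).dt.days

# Hypothetical flag marking requests that are still open
toy["is_open"] = toy["DATETIMECLOSED"].isna()
print(toy)
```

The open request has no `time_to_close`, but its `time_since_init` still tells us how long it has been waiting.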

In [None]:
service_requests.head()

We are in a good state to run further analyses on these data. Let's save it for further inspection:

In [None]:
RESULTSDIR = '../results/'
os.makedirs(RESULTSDIR, exist_ok=True)

In [None]:
service_requests.to_hdf(RESULTSDIR + '01-service_requests.h5', key='service_requests')

## 3. Residential areas within 300 feet of industrial zones
- A GeoJSON file marking the residential zones in Oakland that lie within 300 feet of industrial areas was pulled from https://data.oaklandnet.com/Economic-Development/Residential-Zones-300-ft-of-Industrial-Areas/d3re-jdqr on 22 November 2017.
- The downloaded file was renamed, replacing spaces with underscores, and setting all letters to lowercase for convenience.

Let's read in this GeoJSON file:

In [None]:
residential_industrial = gpd.read_file(DATADIR + 'residential_zones_300_ft_of_industrial_areas.geojson')

In [None]:
residential_industrial.head()

Let's overlay these zones on the neighborhood map:

In [None]:
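# Draw the zones in red on the axes from the neighborhood plot above
residential_industrial.plot(color='r', ax=ax);
# Re-display the updated figure
f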

This shows residential areas, within the blue neighborhoods, that are close (less than 300 feet!) to industrial zones. This will be of particular interest when considering air quality data.