# Lab 1: Data exploration

In HW1, we did some basic data exploration and visualization. In this lab, we will continue digging through the data sets on the open data portal for more data wrangling and exploration practice. 


In addition to general data exploration, we also focus on geographical visualizations.

To start, we'll need to install the following Python libraries:
- geopandas
- folium

Install them with pip/conda:

<code>pip install geopandas
pip install folium
</code>

## Import libraries

In [2]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

As you've probably noticed while doing your homework, different neighborhoods/zipcodes have different distributions of 311 requests. We will attempt to visualize these differences.

Before we start, we'll need the boundaries of the Chicago zipcodes, which we can get from: https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-ZIP-Codes/gdcf-axmw

First let's load our datasets and the zipcode geojson.

In [None]:
size = 20000
# Change these filepaths
df = pd.read_csv('../data/311_Service_Requests_Graffiti_2019.csv', nrows=size)
geo_df = gpd.read_file('../data/chi_boundaries.geojson')

In [None]:
# TODO: you might need to do preprocessing, convert columns to different types (date, string, integer, etc.)
# Code
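
As a starting point for the preprocessing step, here is a minimal sketch on a toy dataframe. The column names `CREATED_DATE` and `CLOSED_DATE`, the date format, and the derived `hours_to_close` column are assumptions made for illustration; check `df.columns` and `df.dtypes` to see what the real CSV contains.

```python
import pandas as pd

# Toy stand-in for the 311 data. Column names and the date format are
# assumptions and may not match the real CSV.
toy = pd.DataFrame({
    "CREATED_DATE": ["01/05/2019 08:30:00 AM", "02/10/2019 01:15:00 PM", "bad value"],
    "CLOSED_DATE": ["01/07/2019 08:30:00 AM", "02/10/2019 05:15:00 PM", None],
    "zip": [60601.0, 60622.0, None],
})

# Parse the date columns; entries that cannot be parsed become NaT.
for col in ["CREATED_DATE", "CLOSED_DATE"]:
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Zipcodes are often read as floats when some are missing.
# Convert them to nullable integers, then to strings such as "60601".
toy["zip"] = toy["zip"].astype("Int64").astype("string")

# Hypothetical derived column: completion time in hours.
toy["hours_to_close"] = (toy["CLOSED_DATE"] - toy["CREATED_DATE"]).dt.total_seconds() / 3600

print(toy.dtypes)
print(toy)
```

Storing zipcodes as strings is one reasonable choice here, since they are labels rather than quantities and string keys tend to be easier to match against boundary files.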


## Choropleth Maps with geopandas
geopandas.read_file takes in a geojson file and creates a GeoDataFrame. You can read more about it from the [geopandas api](http://geopandas.org/data_structures.html). The GeoDataFrame can then be plotted right off the bat:

In [None]:
# Todo: get the first few lines in chi_boundaries.geojson dataset
# Code


In [None]:
geo_df.plot()

This is not too interesting, so let's try to make the plot tell us something about each of the zipcodes. We can do this by setting the column parameter, which shades each zipcode block according to the values in that column. You can supply a [matplotlib colormap](https://matplotlib.org/examples/color/colormaps_reference.html) string for the cmap parameter to get different types of color gradients.

In [None]:
geo_df.plot(column='shape_area', cmap='OrRd')

Now, let's augment this GeoDataFrame with a new column for the number of graffiti 311 requests to see something a bit more interesting. To do this, we'll create a dataframe with the total number of 311 requests that each zipcode received, and merge it into geo_df.

In [None]:
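# Count the non-null values in every column for each zipcode
zip_counts = df.groupby('zip').count()
# Make a smaller dataframe with two columns: "zip" and "count",
# using the per-zipcode STATUS count as the number of requests
zipcounts = pd.DataFrame({'zip': zip_counts.index, 'count': zip_counts['STATUS']})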
print(zipcounts.head())

In [None]:
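# Left-join the counts onto geo_df by matching geo_df's 'zip' column against the index of zipcounts.
# Both frames have a 'zip' column, so the suffixes keep the two names distinct.
# dropna() then removes rows with any missing value, including zipcodes without a matching count.
joined = geo_df.join(zipcounts, on='zip', how='left', lsuffix='l', rsuffix='r').dropna()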
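
If the join comes back empty, a likely cause is that the zipcode keys have different types on the two sides (for example, strings in the boundary file and numbers in the CSV). The sketch below shows one possible variation of the count-and-join step on toy data, using plain pandas dataframes. The names `boundaries` and `requests` are hypothetical stand-ins for geo_df and df, and the values are made up.

```python
import pandas as pd

# Hypothetical stand-ins: `boundaries` plays the role of geo_df,
# `requests` plays the role of df. All values are made up.
boundaries = pd.DataFrame({"zip": ["60601", "60622", "60647"],
                           "shape_area": [1.0, 2.5, 3.0]})
requests = pd.DataFrame({"zip": [60622, 60622, 60647, 60622]})

# size() counts rows per group, regardless of missing values in other columns.
counts = requests.groupby("zip").size().rename("count").reset_index()

# Align the key types before merging: strings on both sides.
counts["zip"] = counts["zip"].astype(str)

# A left merge keeps every boundary row, even those with no requests.
merged = boundaries.merge(counts, on="zip", how="left")

# Zipcodes with no requests get a count of 0 instead of being dropped.
merged["count"] = merged["count"].fillna(0).astype(int)
print(merged)
```

Whether to fill missing counts with 0 or drop those zipcodes is a judgment call: filling keeps every zipcode on the map, while dropping hides areas with no recorded requests.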

In [None]:
# Plot the map color coded by number of graffiti 311 requests
joined.plot(column='count', cmap='OrRd')
plt.title('graffiti requests map')
plt.show()

## Heatmaps with folium

While these visualizations are useful for summarizing where we might expect more graffiti requests to come from, they're a bit coarse because zipcode blocks can be pretty big. To get a finer-grained view of where these requests happen, we'll learn how to plot a heatmap of these 311 requests using folium.

Recall that our 311 data contains the latitude and longitude of each request.

In [1]:
# TODO: get the latitude and longitude data from 311_Service_Requests_Graffiti_2019.csv dataset
# Code
# Drop rows with a missing coordinate, since they cannot be placed on the map
xy = df[['LATITUDE', 'LONGITUDE']].dropna()

In [None]:
import folium
from folium.plugins import HeatMap

In [None]:
print(xy.mean())

In [None]:
hmap = folium.Map(location=[41.87, -87.69], zoom_start=10)
hm_rod = HeatMap(list(zip(xy['LATITUDE'].values, xy['LONGITUDE'].values)), radius=13, blur=20)
hmap.add_child(hm_rod)

Play around with the zoom_start, radius and blur parameters to get a better sense of how they affect the resulting visualizations.
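
Before experimenting with those parameters, it can help to check the inputs to the heatmap. The sketch below uses made-up coordinates and only pandas; it shows one way to drop incomplete points, compute a map center from the data instead of hard-coding it, and build the list of [latitude, longitude] pairs. The variable names are hypothetical.

```python
import pandas as pd

# Made-up coordinates, including rows with a missing value.
coords = pd.DataFrame({
    "LATITUDE": [41.88, 41.90, None, 41.85],
    "LONGITUDE": [-87.63, -87.68, -87.70, None],
})

# Keep only rows where both coordinates are present.
clean = coords.dropna(subset=["LATITUDE", "LONGITUDE"])

# Center the map on the mean of the remaining points.
center = [clean["LATITUDE"].mean(), clean["LONGITUDE"].mean()]

# A list of [lat, lon] pairs, the same shape of data that the
# zip(...) call above builds for HeatMap.
points = clean[["LATITUDE", "LONGITUDE"]].values.tolist()

print(center)
print(points)
```

With the real data, `center` could be passed as the location argument of folium.Map and `points` as the data argument of HeatMap.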

## Data Exploration Tips

Here are some things to look at during data exploration:

1. Distributions of different variables
2. Correlations between variables - you can compute a correlation matrix and turn it into a heatmap
3. Changes and trends over time - how do the data and the entities in the data change over time?
4. Missing values - are there lots of missing values? Is there any pattern to them?
5. Outliers - these can be found using clustering (which we will cover later), but also with simpler methods such as plotting distributions
6. Cross-tabs - describe how the different types of entities differ

It's good to have code that does each of the things above. The exercises below are a start in helping you create that for yourself.
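
As a rough illustration of what such code might look like, the sketch below runs one pandas call for each of the six items on randomly generated toy data. All column names and values are made up, and the 1.5 * IQR rule used for outliers is just one common convention, not the only option.

```python
import numpy as np
import pandas as pd

# Randomly generated toy data; columns and values are made up.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "zip": rng.choice(["60601", "60622", "60647"], size=n),
    "status": rng.choice(["Completed", "Open"], size=n, p=[0.8, 0.2]),
    "hours_to_close": rng.exponential(scale=24, size=n),
    "created": pd.Timestamp("2019-01-01")
               + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
})
# Open requests have no completion time yet.
toy.loc[toy["status"] == "Open", "hours_to_close"] = np.nan

# 1. Distributions: summary statistics of a numeric column
summary = toy["hours_to_close"].describe()

# 2. Correlations: correlation matrix of numeric columns
corr = toy.assign(month=toy["created"].dt.month)[["hours_to_close", "month"]].corr()

# 3. Trends over time: number of requests per calendar month
monthly = toy.set_index("created").resample("MS").size()

# 4. Missing values: share of missing entries per column
missing_share = toy.isna().mean()

# 5. Outliers: values above Q3 + 1.5 * IQR
q1, q3 = toy["hours_to_close"].quantile([0.25, 0.75])
outliers = toy[toy["hours_to_close"] > q3 + 1.5 * (q3 - q1)]

# 6. Cross-tabs: request status by zipcode
xtab = pd.crosstab(toy["zip"], toy["status"])

print(summary, corr, monthly, missing_share, len(outliers), xtab, sep="\n\n")
```

Wrapping each step in a small function that takes a dataframe and column names would make it easier to reuse on other datasets.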


## Exercises:
Now that we've seen how to create some simple geographical visualizations, aggregate the 311 requests by zipcode and visualize request frequency, average request completion time by location, and anything else you find interesting. Some specific questions that might be good to explore:


### Do certain neighborhoods get certain graffiti requests completed faster than others?




In [None]:
#code


### Is there any seasonality to the requests? What about seasonal variations in completion times?

In [None]:
#code

### Are there any outliers in terms of time periods or neighborhoods?

In [None]:
# code

### Which neighborhoods are the most similar in terms of graffiti service requests being reported?

In [None]:
# code

### References
http://pandas.pydata.org/pandas-docs/stable/timeseries.html

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html

http://geopandas.org/mapping.html

http://geopandas.org/mergingdata.html

http://python-visualization.github.io/folium/docs-v0.5.0/index.html

### Examples
http://blog.yhat.com/posts/interactive-geospatial-analysis.html

https://alcidanalytics.com/p/geographic-heatmap-in-python