## Module 5 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class. We'll extend the random forests classifier from the lectures to predict a continuous variable.

Before you attempt any of these activities, make sure to watch the video lectures for this module.

### Classification: NYC evictions
We'll look at the factors that are associated with evictions in New York City. Perhaps a machine learning model can identify the types of places that are vulnerable to eviction, and target renter assistance programs more effectively?

#### Loading in the data

Let's start by loading in the [eviction dataset](https://data.cityofnewyork.us/City-Government/Evictions/6z8x-wfk4) via Socrata.

<div class="alert alert-block alert-info">

<strong>Exercise:</strong> Import the data from Socrata via the API into a pandas DataFrame.
</div>

*Hints*:
- Look back at Week 1 if you need a refresher on using Socrata
- There are about 70,000 rows in the dataset. So remember to add `?$limit=100000` to the end of the URL that you pass to `requests.get()`. Otherwise, you'll just get the first 1,000 rows. (The limit can be anything comfortably above 70000.)
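As a rough sketch of the `$limit` hint: the endpoint below follows Socrata's usual `/resource/<dataset id>.json` pattern, which is an assumption here (check the API button on the dataset page). The helper names `build_url` and `fetch_dataframe` are made up for illustration.

```python
import pandas as pd
import requests

# Assumed JSON endpoint for dataset 6z8x-wfk4 (verify on the dataset page)
BASE_URL = "https://data.cityofnewyork.us/resource/6z8x-wfk4.json"


def build_url(base_url, limit=100000):
    """Append the $limit parameter so we get more than the default 1,000 rows."""
    return f"{base_url}?$limit={limit}"


def fetch_dataframe(url):
    """Download the JSON records and load them into a pandas DataFrame."""
    response = requests.get(url)
    response.raise_for_status()  # stop early if the request failed
    return pd.DataFrame(response.json())


url = build_url(BASE_URL)
print(url)
# evictions = fetch_dataframe(url)  # uncomment to actually download the data
```

Note that everything comes back from the JSON as strings, so numeric columns such as latitude and longitude may need converting, for example with `pd.to_numeric`.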

In [None]:
import requests
import json
import pandas as pd
import geopandas as gpd

# your code here

<div class="alert alert-block alert-info">

<strong>Exercise:</strong> Convert your dataframe to a GeoDataFrame, using the latitude and longitude columns.
</div>

In [None]:
# your code here 

Now let's import some census data. We could use `cenpy` or the Census Bureau API. But to keep things simple so that we can focus on the spatial joins and the machine learning, I downloaded the block group-level 2019 ACS data for New York from the [Census Bureau](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-data.html). To save space, I clipped it to the 5 NYC counties.

It's in your repository, and we can load it in as follows. If you aren't familiar with a GeoPackage (GPKG) format, think of it as a "new and improved shapefile." [Here's a good overview.](https://towardsdatascience.com/why-you-need-to-use-geopackage-files-instead-of-shapefile-or-geojson-7cb24fe56416)

In [None]:
bgs = gpd.read_file('data/nyc_bgs.gpkg')
bgs.head()

Note that the variables aren't particularly carefully selected - I just threw in many of the demographic and housing variables. 

Nor are the variable names particularly informative, but the full names are in a file in the repository.

In [None]:
# note that the file is tab-separated, not comma-separated,
# so we use the sep='\t' argument

col_names = pd.read_csv('data/BG_METADATA_2019.txt', sep='\t', index_col='Short_Name')
col_names.head()

You can look up the definition of any column like this. (I don't recommend renaming the `bgs` columns, because the full names are so long.)

In [None]:
col_names.loc['B01001e1']

#### Spatial join
Now let's do the spatial join. Again, let's follow our three step process.

1. Use a spatial join to add the `GEOID` column to the evictions dataframe. *Hint:* Check your projections.
2. Group by `GEOID` to get a count of evictions per block group. If you have a `Series`, give it a name - maybe `n_evictions`
3. Join those counts back - a tabular join based on the index
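Here is a minimal sketch of the three steps on toy data, not the real evictions data. The names `bgs_toy` and `evictions_toy` and the square "block groups" are made up for illustration. Your real column names and projections will differ.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy stand-ins: two square "block groups" and four eviction points
bgs_toy = gpd.GeoDataFrame(
    {"GEOID": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)
evictions_toy = gpd.GeoDataFrame(
    {"eviction_id": [1, 2, 3, 4]},
    geometry=gpd.points_from_xy([0.2, 0.5, 0.8, 1.5], [0.5, 0.5, 0.5, 0.5]),
    crs="EPSG:4326",
)

# Step 1: match projections, then spatially join to attach GEOID to each point
evictions_toy = evictions_toy.to_crs(bgs_toy.crs)
joined = gpd.sjoin(evictions_toy, bgs_toy[["GEOID", "geometry"]], how="left")

# Step 2: count points per GEOID and name the resulting Series
n_evictions = joined.groupby("GEOID").size().rename("n_evictions")

# Step 3: tabular join on the index; block groups with no points get 0
bgs_toy = bgs_toy.set_index("GEOID").join(n_evictions)
bgs_toy["n_evictions"] = bgs_toy["n_evictions"].fillna(0).astype(int)
print(bgs_toy["n_evictions"])
```

Filling missing counts with 0 is a choice: a block group with no matching points has zero recorded evictions, not missing data.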

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Add a count of evictions per census block group to your <strong>bgs</strong> GeoDataFrame, using the 3-step process above.
</div>

In [None]:
# your code here

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Do a quick-and-dirty map of the number of evictions. This will help identify any data holes.
</div>

In [None]:
# your code here

#### Random forests regressor
Now we have our data set. Let's estimate a random forests model.

In contrast to the examples in the lecture, we are trying to predict a continuous variable - the number of evictions. So a classifier isn't appropriate.

However, there is a similar model: the [random forest regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor). It works almost identically to the classifier. The main difference from a user perspective is assessing model performance - a confusion matrix doesn't work here.

You'll need to follow these steps:
- Choose your x variables. (Your y variable will be `n_evictions`.)
- Drop null values if needed
- Split your dataset into training and testing portions
- Estimate (fit) the model
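The steps above can be sketched on synthetic data as follows. The columns `x1` and `x2` are hypothetical placeholders for whichever census variables you choose. The hyperparameters are arbitrary, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the block group data (x1, x2 are made-up predictors)
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["n_evictions"] = rng.poisson(lam=np.exp(0.5 * df["x1"].to_numpy()))
df.loc[0, "x2"] = np.nan  # add one missing value to show the dropna step

# Choose x and y variables, then drop rows with nulls in any of them
xvars = ["x1", "x2"]
yvar = "n_evictions"
df_clean = df.dropna(subset=xvars + [yvar])

# Split into training and testing portions
X_train, X_test, y_train, y_test = train_test_split(
    df_clean[xvars], df_clean[yvar], test_size=0.25, random_state=1
)

# Fit the model on the training portion, predict on the testing portion
rf = RandomForestRegressor(n_estimators=50, random_state=1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(len(X_train), len(X_test))
```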

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Estimate a random forest regressor model to predict the number of evictions per census block group.</div>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# your code here

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Examine some of your trees in the random forest. What do they tell you?</div>
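One possible way to look inside a fitted forest is through its `estimators_` attribute, which holds the individual decision trees. The sketch below uses synthetic data and a deliberately shallow forest so the printed tree stays readable. The feature names `x1` and `x2` are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic data where only the first feature matters
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 4 * X[:, 0] + rng.normal(scale=0.1, size=100)

# A small, shallow forest so that each tree is easy to read
rf = RandomForestRegressor(n_estimators=5, max_depth=2, random_state=0)
rf.fit(X, y)

# Each element of estimators_ is a fitted DecisionTreeRegressor
first_tree = rf.estimators_[0]
tree_text = export_text(first_tree, feature_names=["x1", "x2"])
print(tree_text)
```

For a graphical version, `sklearn.tree.plot_tree` draws the same structure with matplotlib.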

In [None]:
# your code here

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Experiment with different model hyperparameters and variables. Discuss your rationale and the results with a neighbor.</div>

In [None]:
# your code here

The following questions relate to some of the material in Module 6. You might want to wait until you've watched those lectures, then come back and complete these tasks.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Assess the fit of your model.</div>

Remember, the confusion matrix and accuracy scores don't apply to continuous data. Some ideas for continuous variables are [here](https://stackoverflow.com/questions/50789508/random-forest-regression-how-do-i-analyse-its-performance-python-sklearn). You could also plot actual vs predicted values.
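As a small illustration, here are three common fit measures for continuous predictions, computed on made-up numbers. In your notebook, you'd use your own test-set values and predictions instead. These are only some of the options.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted eviction counts for four block groups
y_true = np.array([3, 0, 2, 7])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

r2 = r2_score(y_true, y_hat)               # share of variance explained
mae = mean_absolute_error(y_true, y_hat)   # average absolute error
rmse = mean_squared_error(y_true, y_hat) ** 0.5  # penalizes large errors more

print(f"R2: {r2:.3f}, MAE: {mae:.3f}, RMSE: {rmse:.3f}")
```

MAE and RMSE are in the same units as the outcome (evictions per block group), which makes them easier to interpret than R2 alone.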

In [None]:
# your code here

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Which variables are most important in your predictions? Plot the forest importances.</div>
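A fitted forest exposes its importances through the `feature_importances_` attribute. Here is a sketch on synthetic data, where the column names `signal` and `noise` are invented so that the expected ranking is obvious.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: y depends strongly on "signal" and not at all on "noise"
rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=300), "noise": rng.normal(size=300)})
y = 5 * X["signal"] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Label the importances with the column names and sort them
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# importances.plot.barh()  # uncomment to plot (requires matplotlib)
```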

In [None]:
# your code here


<div class="alert alert-block alert-info">
<h3>What you should have learned</h3>
<ul>
  <li>More practice with spatial joins and Socrata.</li>
  <li>How to estimate a random forests model for continuous data.</li>
</ul>
</div>