## Homework 3: Machine learning

In this assignment, we'll practice the classification skills from machine learning. We'll use the precinct-level voting data to predict support for Prop 21 (rent control) on the 2020 ballot. For example, we might expect the share of renters to be an important predictor.

We'll also review joins as we prepare the data.

Start by loading the 2020 election results from LA County into a `pandas` dataframe, `voteDf`. (This is exactly the same data as we will use next week in the clustering lectures; I put another copy of the data file in the assignment GitHub folder to make things easier.)

### Policy on ChatGPT / AI

This is the same as in HWs 1 and 2. Please review those guidelines.

Please help me grade by observing the following:
 
* Do not rename this notebook (that messes up the autograder)
* Do not include large sections of output (that makes it hard to find your code). For example, use `df.head()` to show the first few rows, rather than printing an entire dataframe. The same goes for printing long strings.

In [None]:
import pandas as pd

voteDf = 999 # replace with your dataframe

### BEGIN SOLUTION
fn = 'c037_g20_sov_data_by_g20_srprec.csv'
voteDf = pd.read_csv(fn)
### END SOLUTION
print(len(voteDf))
print(voteDf.head())

In [None]:
# Autograding tests - do not edit
assert len(voteDf) == 4313
assert isinstance(voteDf, pd.DataFrame)

To do some prediction, we'll want to add variables from (say) the census or other sources.
For that, we need the lookup file that matches precincts to census blocks and tracts. [You can find it here](https://statewidedatabase.org/d10/g20_geo_conv.html), or just use the file `c037_g20_sr_blk_map.csv` in your GitHub repository. (Note that there are several types of precincts; the ones that we are using here are called `srprec`.) 

Each precinct intersects with many census blocks. The `pctsrprec` column tells you how much of the precinct lies within that block. For example, in the first few rows of `c037_g20_sr_blk_map.csv`, you'll see 49 different rows for precinct `0050003A`, each matching to a different census block, with the `pctsrprec` column adding up to 100.

Our aim is to create a new dataframe with the vote counts (for all of the propositions and other races) aggregated to census tract. This is a multi-stage process, so let's do this step by step.

In this step, you should:
- load in the lookup data into a new dataframe, `lookupDf`
- join the voting dataframe to the lookup dataframe using `srprec`, to create a new dataframe called `joinDf`. This is a 1:many join, since there are many census blocks per precinct. Do an inner join, as the Null values are not going to be useful to us. (In other words, throw away any lookups that don't match a precinct.)
- make sure that `srprec` is the index
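
To see what a 1:many inner join does before running it on the real files, here is a minimal sketch on toy data. The precinct IDs, tract numbers, and vote counts are made up for illustration; only the column names `srprec`, `tract`, `pctsrprec`, and `TOTREG` come from the assignment data.

```python
import pandas as pd

# Toy data (hypothetical values, not the real LA County files)
toy_votes = pd.DataFrame({
    "srprec": ["A", "B", "C"],
    "TOTREG": [100, 200, 300],
})
toy_lookup = pd.DataFrame({
    "srprec": ["A", "A", "B", "D"],
    "tract": [101, 102, 102, 103],
    "pctsrprec": [60.0, 40.0, 100.0, 100.0],
})

# Inner join on the shared key. Precinct C (no lookup rows) and
# lookup precinct D (no vote data) both drop out of the result.
toy_join = toy_votes.set_index("srprec").join(
    toy_lookup.set_index("srprec"), how="inner"
)

print(toy_join)
print(len(toy_join))          # one row per precinct-block match
print(toy_join.TOTREG.sum())  # precinct A is counted twice here
```

Note that precinct A's `TOTREG` is repeated on each of its rows, so summing the raw column after the join overcounts. That is why the `pctsrprec` weights matter in the aggregation step.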

In [None]:
lookupDf = 999 # your code here
joinDf = 999

### BEGIN SOLUTION
fn2 = 'c037_g20_sr_blk_map.csv'

lookupDf = pd.read_csv(fn2)

# inner join to our voting data
joinDf = voteDf.set_index('srprec').join(lookupDf.set_index('srprec'), how='inner')

### END SOLUTION

In [None]:
print(len(lookupDf))
print(len(joinDf))
print(joinDf.county.count())
print(joinDf.TOTREG.sum())
assert joinDf.index.name=='srprec'
assert len(lookupDf)==77704
assert len(joinDf)==77703
assert joinDf.county.count()==77703
assert joinDf.TOTREG.sum()==168427815

Now let's calculate vote shares on Prop 21 and in the presidential race for each census tract. 

This is slightly tricky, because your data frame `joinDf` will have multiple rows per tract (because the precinct geography does not match the census geography). For example, the following code shows you which precincts intersect with tract 119342. 

13.65% of the first precinct listed, `9004204A`, is in tract 119342.

In [None]:
joinDf[joinDf.tract==119342][['tract','pctsrprec']].sort_index(ascending=False)

So to aggregate to tracts, you should:
- for each relevant column, multiply the number of votes by `pctsrprec`, and divide by 100 (because `pctsrprec` is a percentage, not a fraction)
- group by census tract and sum those relevant columns, to create a new dataframe called `tractVotes`. It should have columns `PR_21_N`, `PR_21_Y`, `PRSDEM01`, `PRSREP01`, etc.

This will give us our estimate of votes at the tract level.

*Hint*: You can pass multiple columns to `groupby`. E.g. `df.groupby('groupcol')[['col1','col2','col3']].sum()`
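
The two steps above can be sketched on toy data as follows. The values are hypothetical; the point is only that weighting by `pctsrprec` before summing keeps the total number of votes unchanged.

```python
import pandas as pd

# Toy joined data (hypothetical): one row per precinct-block piece
toy = pd.DataFrame({
    "srprec": ["A", "A", "B"],
    "tract": [101, 102, 102],
    "pctsrprec": [60.0, 40.0, 100.0],
    "PR_21_Y": [50, 50, 30],
    "PR_21_N": [20, 20, 10],
})
vote_cols = ["PR_21_Y", "PR_21_N"]

# Scale each vote column by the share of the precinct in that piece
weighted = toy[vote_cols].multiply(toy["pctsrprec"] / 100, axis=0)
weighted["tract"] = toy["tract"]

# Sum the weighted votes within each tract
toy_tracts = weighted.groupby("tract")[vote_cols].sum()
print(toy_tracts)

# Sanity check: weighted tract totals should match the precinct totals
precinct_total = toy.drop_duplicates("srprec")["PR_21_Y"].sum()
print(precinct_total, toy_tracts["PR_21_Y"].sum())
```

A check like the last two lines is a cheap way to catch a forgotten `/ 100` or a weight applied twice.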

In [None]:
tractVotes = 999 # your code here

### BEGIN SOLUTION

# get list of columns from the original dataframe, excluding the ones that aren't about voting
cols = [col for col in voteDf.columns if col not in ['county','srprec','addist', 'cddist','sddist','bedist']]

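# apportion votes to each precinct-block piece
# (work on a copy so that re-running this cell doesn't apply the weights twice)
weightedDf = joinDf.copy()
for col in cols:
    weightedDf[col] = weightedDf[col] * weightedDf.pctsrprec / 100

# aggregate to census tract
tractVotes = weightedDf.groupby('tract')[cols].sum()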

### END SOLUTION

In [None]:
print(len(tractVotes))
print(tractVotes.PR_21_Y.sum())

# Autograding tests - do not edit
assert len(tractVotes)==2338
assert tractVotes.PR_21_Y.sum().round() == 2021487

Now let's get a dataframe of some relevant census variables, using the Census Bureau API. Check back to the Week 1 example and the first homework.

Create a dataframe, `censusDf`, with ACS 2019 (5 year) tract-level data for LA County, and variables for Tenure (B25003_001E, B25003_002E, B25003_003E) and median household income (B19013_001E). Add a column with the percent of renters, called `pct_renter`.

Rename the median HH income column `median_hh_income`, which is more meaningful.

Why use ACS 2019 rather than a more recent vintage? Well, the census tract boundaries change after each decennial census (i.e., in 2020), and the precinct-to-tract files we used above map to the pre-2020 census boundaries.

As a reminder, here's the Census API [list of tables](https://api.census.gov/data/2019/acs/acs5/variables.html), and [here are examples that you can adapt](https://api.census.gov/data/2019/acs/acs5/examples.html). 

*Hint:* Make sure to restrict your data request by state AND county if you want to keep it to a manageable size! You shouldn't need to request an API key for a small number of queries.

*Hint:* Look at your data if you get the wrong answer to median income! For example, use `censusDf.describe()`, or `censusDf.sort_values(by='median_hh_income').head()`. You might need to replace some values with `np.nan`.
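
As a rough illustration of the cleaning steps, the sketch below uses a small hand-written list in the same shape as a Census API response (a header row followed by data rows, with every value as a string). The two data rows are invented; no network request is made.

```python
import numpy as np
import pandas as pd

# Hypothetical response: header row first, all values returned as strings
sample = [
    ["NAME", "B25003_001E", "B25003_003E", "B19013_001E", "state", "county", "tract"],
    ["Tract 1", "1000", "600", "55000", "06", "037", "101110"],
    ["Tract 2", "0", "0", "-666666666", "06", "037", "980001"],
]
toy_census = pd.DataFrame(sample[1:], columns=sample[0])

# Convert the numeric columns from strings before doing any arithmetic
num_cols = ["B25003_001E", "B25003_003E", "B19013_001E"]
toy_census[num_cols] = toy_census[num_cols].astype(float)

toy_census["pct_renter"] = toy_census["B25003_003E"] / toy_census["B25003_001E"] * 100
toy_census = toy_census.rename(columns={"B19013_001E": "median_hh_income"})

# Large negative values are missing-data codes, not real incomes
toy_census.loc[toy_census["median_hh_income"] < 0, "median_hh_income"] = np.nan

print(toy_census[["tract", "pct_renter", "median_hh_income"]])
print(toy_census["median_hh_income"].mean())  # NaNs are skipped by default
```

The second row also shows that a tract with zero households gives `NaN` for `pct_renter` (0 divided by 0), which is worth remembering when you drop missing values later.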

In [None]:
import requests
import json
import numpy as np
censusDf = 999 # your code here

### BEGIN SOLUTION

r = requests.get('https://api.census.gov/data/2019/acs/acs5?get=NAME,B25003_001E,B25003_002E,B25003_003E,B19013_001E&for=tract&in=state:06&in=county:037')
censusdata = r.json()
censusDf = pd.DataFrame(censusdata[1:], columns=censusdata[0])

# if you tried to calculate the percentage at this point, you'd probably have got an error because the values are strings
for col in ['B25003_001E','B25003_002E','B25003_003E','B19013_001E']:
    censusDf[col] = censusDf[col].astype(float)

censusDf['pct_renter'] = censusDf['B25003_003E'] / censusDf['B25003_001E'] * 100
censusDf.rename(columns={'B19013_001E':'median_hh_income'}, inplace=True)

# if you look at your data, you'll see there are a lot of median incomes of -666666666
# - this is the Census Bureau's code for an estimate that couldn't be computed
# (too few sample observations), so let's replace them with NaNs
censusDf.loc[censusDf.median_hh_income<0,'median_hh_income'] = np.nan

### END SOLUTION

In [None]:
print(len(censusDf))
print(censusDf.pct_renter.mean())
print(censusDf.median_hh_income.mean())

# Autograding tests - do not edit
assert len(censusDf) == 2346
assert censusDf.pct_renter.mean().round() == 53
assert censusDf.median_hh_income.mean().round()==73243

Create a new dataframe, `joinedDf`, with both your voting and census data, through a left join to the voting data. 

*Hint*: It will be easiest to join on the `tract` column (which is your index in `tractVotes`). Since everything is in LA County, you don't need to worry about the `state` or `county` fields.

*Hint*: You'll need to convert the `tract` column in `censusDf` to an integer first.
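
Here is a small sketch of why the type conversion matters, using made-up tract numbers. The API returns `tract` as a zero-padded string, while the tract index built from the lookup file is an integer, so the keys only match after conversion.

```python
import pandas as pd

# Toy tract-level votes (hypothetical), indexed by integer tract
toy_votes = pd.DataFrame(
    {"PR_21_Y": [30.0, 50.0, 12.0]},
    index=pd.Index([101, 102, 103], name="tract"),
)

# Toy census data (hypothetical), with tract as a zero-padded string
toy_census = pd.DataFrame({
    "tract": ["000101", "000102"],
    "pct_renter": [60.0, 45.0],
})
toy_census["tract"] = toy_census["tract"].astype(int)

# DataFrame.join defaults to a left join, so every voting tract is kept
toy_joined = toy_votes.join(toy_census.set_index("tract"))

print(toy_joined)
print(toy_joined.pct_renter.count())  # tracts that found a census match
```

Tract 103 has no census row, so it stays in the result with `NaN` for `pct_renter`. Comparing `len()` with `.count()` tells you how many tracts failed to match.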

In [None]:
joinedDf = 9999

### BEGIN SOLUTION
censusDf['tract'] = censusDf.tract.astype(int)
joinedDf = tractVotes.join(censusDf.set_index('tract'))
### END SOLUTION

In [None]:
print(joinedDf.pct_renter.count())
print(joinedDf.pct_renter.mean())

# Autograding tests - do not edit
assert joinedDf.pct_renter.count() == 2318
assert joinedDf.pct_renter.mean().round() == 53

Let's start with a simple random forests model with the following *x* variables:

* Median HH income
* Percent of HHs that are renters
* Presidential vote (2-party share of Democrat voters, i.e. the percent voting for Biden vs Trump, with other candidates ignored)

And the following *y* variable:
* Whether Prop 21 won (received a majority) in that census tract. This should be `True` if the Yeses got more votes than the Nos.

(Yes, it would be better to predict the vote share in each tract than a binary variable - hold off on that for the challenge problem.)

Create the relevant columns, `pct_biden` and `PR_21_won`, in your `joinedDf` dataframe. 

Then split your dataframe into a training sample (75%) and a test sample (25%). *Hint*: Drop the `NaNs` first.

In [None]:
# your code here
X_train, X_test, y_train, y_test = 999, 999, 999, 999

### BEGIN SOLUTION
joinedDf['pct_biden'] = joinedDf.PRSDEM01 / (joinedDf.PRSDEM01 + joinedDf.PRSREP01) * 100
joinedDf['PR_21_won'] = joinedDf.PR_21_Y > joinedDf.PR_21_N
xvars = ['pct_biden', 'pct_renter', 'median_hh_income']
yvar = 'PR_21_won'

df_to_fit  = joinedDf[xvars+[yvar]].dropna()

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_to_fit[xvars], df_to_fit[yvar], test_size = 0.25)
### END SOLUTION

In [None]:
print(len(X_train))
print(len(X_test))
print(X_train.pct_biden.mean())
print(y_train.mean())

# Autograding tests - do not edit
assert len(X_train) == 1731
assert len(X_train.columns) == 3
assert len(X_test) == 578
assert len(X_train) == len(y_train)
assert len(X_test) == len(y_test)
assert X_train.pct_biden.mean().round() == 74
assert y_train.mean().round(1) == 0.6


Estimate a random forests model, and assign the predicted *y* values from your *test* sample to `y_pred`.

In [None]:
from sklearn.ensemble import RandomForestClassifier 

### BEGIN SOLUTION

# initialize the random forest classifier object
rf = RandomForestClassifier(n_estimators = 50, random_state=1)

# now fit the model
rf.fit(X_train, y_train)

# predict values
y_pred = rf.predict(X_test)

### END SOLUTION


In [None]:
print(len(y_pred))
print(y_pred.mean())

# Autograding tests - do not edit
assert len(y_pred)==len(y_test)
assert y_pred.mean().round(1) == 0.6

Let's look at some measures of fit. Plot the confusion matrix.
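
If it helps to read the plot, the sketch below computes the same four counts numerically for a short, made-up set of labels (not the homework data). With boolean labels, scikit-learn orders the classes as `False`, `True`, with true values on the rows and predictions on the columns.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for eight tracts
toy_true = [True, True, True, False, False, True, False, True]
toy_pred = [True, True, False, False, True, True, False, True]

cm = confusion_matrix(toy_true, toy_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("true negatives:", tn, "false positives:", fp)
print("false negatives:", fn, "true positives:", tp)

# Accuracy is the share of cases on the diagonal
print(accuracy_score(toy_true, toy_pred))
```

Looking at the off-diagonal cells separately shows whether the model tends to overpredict or underpredict wins, which overall accuracy hides.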

In [None]:
# your code here

### BEGIN SOLUTION
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
### END SOLUTION

Finally, plot the importance of each of the 3 predictor variables, in the same way as we did in class.

In [None]:
# your code here

### BEGIN SOLUTION
# more-or-less directly taken from the docs: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
import numpy as np
import matplotlib.pyplot as plt

importances = rf.feature_importances_
std = np.std([
    tree.feature_importances_ for tree in rf.estimators_], axis=0)
forest_importances = pd.Series(importances, index=X_train.columns)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
### END SOLUTION

Comment on your interpretation of the results and the confusion matrix. What do they tell you about:
- your predictive accuracy
- which variables are important
- how you might refine the model

Write a few bullet points here.

# Challenge Problem
Remember, you need to do at least two of these challenge problems this quarter.

This challenge problem is open ended for you to take in a direction that you are most interested in. Here are some suggestions (do 1 or 2 of these):

* Extend the random forests model to predict vote share on Prop 21, rather than a binary yes/no, and using additional variables. See suggestions below. 
* Use a neural network instead. How much does this improve the predictions? Use charts to compare the predictions to the random forests model.
* Examine the geographic distribution of the predictions, through mapping the prediction errors. Where does your model perform best? Does this give you pointers as to how to improve your model?

In all cases, write some brief interpretation in a markdown cell.

*Predicting a continuous variable*

Classification problems are typically binary or categorical - which category do you predict a given observation to fall into. In some cases, however, we might want to predict a continuous variable, such as the percentage of "yes" votes on Prop 21. For this we can use `RandomForestRegressor`, which works very similarly to `RandomForestClassifier`. You can follow exactly the same steps: create the `rf` object, fit the model, and predict using the test sample.

How do you evaluate model performance? Since we have a continuous variable, we can't use the confusion matrix. But we can look at the absolute error (the absolute difference between the predicted value and the true value for each of our test tracts), i.e., `abs(y_pred-y_test)`. You can also do a scatter plot of the predicted values against the true values. The divergence from the 45 degree line is a good indication of how well the model fits.
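
A minimal sketch of this workflow is below. It uses synthetic data generated with NumPy rather than the voting data, and the two features are only stand-ins for predictors like `pct_renter` and `pct_biden`; the coefficients and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): two predictors on a 0-100 scale and a
# continuous outcome that depends on both, plus some noise
rng = np.random.default_rng(0)
n = 400
X = rng.uniform(0, 100, size=(n, 2))
y = 0.3 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(0, 3, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Same steps as the classifier: create the object, fit, predict
reg = RandomForestRegressor(n_estimators=50, random_state=1)
reg.fit(X_tr, y_tr)
pred = reg.predict(X_te)

# Absolute error for each test observation, and its mean
abs_err = np.abs(pred - y_te)
print(abs_err.mean())
print(mean_absolute_error(y_te, pred))  # same number, via scikit-learn
```

For the scatter plot, something like `plt.scatter(y_te, pred)` with a 45 degree reference line drawn over the same range should work, assuming `matplotlib.pyplot` is imported as `plt`.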