# Lab 7: Predicting Voter Turnout

Welcome to Lab 7! This week, we will be constructing models to predict voter turnout. We will be using the same dataset as last week's lab (note that I removed `treat`).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/master/Colab/Lab7/gerber_huber_2014_data.csv")
data.head()

Unnamed: 0,id,voted06,voted08,voted09,voted10,voted11,voted12,voted13,voted14,age,race_afam,race_hispanic,race_other,race_white,female
0,1989703,0,0,0,0,0,0,0,0,26.0,0,0,0,1,1
1,555323,0,0,0,0,0,0,0,0,37.0,0,0,0,1,1
2,915202,1,1,0,1,0,0,0,1,26.0,1,0,0,0,0
3,839095,0,1,0,0,0,0,0,1,46.0,0,0,0,1,1
4,197647,0,0,0,0,0,0,0,0,21.0,0,0,0,1,0


Logistic Regression is a classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (voted, yes, success, etc.) or 0 (did not vote, no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) (probability that a person voted) as a function of X (many different covariates).

(You could use other algorithms to build a model. We are using logistic regression because it is fast and easy.)

Our goal is to predict whether an individual voter *i* voted in the 2014 election as a function of other features we know about them (age, past vote history, race, and gender). With a model of voting in the 2014 election, a campaign in the future (such as in 2018) could better target voters.

To get started, let's see whether people who voted in 2014 differ in any obvious ways from people who did not. Run the code below.

In [None]:
data.groupby('voted14').mean()

voted14,id,voted06,voted08,voted09,voted10,voted11,voted12,voted13,age,race_afam,race_hispanic,race_other,race_white,female
0,1202580.0,0.227614,0.409396,0.0,0.161182,0.0,0.216056,0.0,38.491964,0.049669,0.027084,0.100088,0.823158,0.698015
1,1214615.0,0.396483,0.661373,0.0,0.470221,0.0,0.723908,0.0,45.64962,0.04325,0.017867,0.143222,0.795661,0.766024


**Question 1.** What do you find? Do any individual variables seem to be more or less predictive of voting in 2014? Interpret the above table.


**Answer the question here.**


We are now going to build a simple model. We first need to define our outcome variable (did someone vote in 2014 or not, call this *Y*) and our set of predictor variables (call this a matrix *X*). In this first simple case, we will only include `voted12` and `age` in X.

In [None]:
cols = ['voted12', 'age', 'intercept']
data['intercept'] = 1 # add a column of 1's for the intercept term
X = data[cols]
X.head()

Unnamed: 0,voted12,age,intercept
0,0,26.0,1
1,0,37.0,1
2,0,26.0,1
3,0,46.0,1
4,0,21.0,1


In [None]:
y = data['voted14']
y.head()

Unnamed: 0,voted14
0,0
1,0
2,1
3,1
4,0


We will now build our model. Note that we are now using a new module in Python called `statsmodels`. You can learn more about this module [here](https://www.statsmodels.org/stable/index.html).

In [None]:
import statsmodels.api as sm
# Ignore the warning; nothing to worry about
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.442076
         Iterations 6
                         Results: Logit
Model:              Logit            Method:           MLE       
Dependent Variable: voted14          Pseudo R-squared: 0.193     
Date:               2024-11-06 16:16 AIC:              26284.7836
No. Observations:   29722            BIC:              26309.6825
Df Model:           2                Log-Likelihood:   -13139.   
Df Residuals:       29719            LL-Null:          -16285.   
Converged:          1.0000           LLR p-value:      0.0000    
No. Iterations:     6.0000           Scale:            1.0000    
------------------------------------------------------------------
              Coef.   Std.Err.     z      P>|z|    [0.025   0.975]
------------------------------------------------------------------
voted12       2.1814    0.0314   69.4384  0.0000   2.1198   2.2430
age           0.0145    0.0008   17.2811  0.0000   0.0129   0.

**Question 2.** Can you interpret this output? What do you think this means? (You might want to give [this](https://www.juanshishido.com/logisticcoefficients.html) a read.)

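The output gives the coefficient, standard error, z-score, p-value, and 95% confidence interval for each variable. The coefficients are on the log-odds scale. The coefficient of 2.1814 for `voted12` means that, holding age fixed, someone who voted in 2012 has much higher odds of voting in 2014 than someone who did not (about exp(2.1814) ≈ 8.9 times the odds). Its p-value is displayed as 0.0000, which means it is smaller than the displayed precision, not that it is exactly zero. A small p-value says that a coefficient this large would be very unlikely if the true coefficient were zero; it does not say there is no chance the result is a fluke. Age also has a positive, statistically significant coefficient (0.0145), so each additional year of age is associated with slightly higher odds of voting in 2014, but the per-unit effect is much smaller than that of `voted12`.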
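
One way to make the coefficients easier to read is to exponentiate them into odds ratios. The sketch below hard-codes the two coefficients printed in the summary above; the names `coefs` and `odds_ratios` are illustrative and not part of the lab. With the fitted model in hand, `np.exp(result.params)` should give the same numbers.

```python
import math

# Coefficients copied from the model summary above (log-odds scale).
coefs = {"voted12": 2.1814, "age": 0.0145}

# exp(coefficient) is the multiplicative change in the odds of voting in 2014
# for a one-unit increase in that predictor, holding the other one fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

Read this way, having voted in 2012 multiplies the odds of voting in 2014 by roughly 8.9, while each additional year of age raises the odds by about 1.5%.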

With the below code, we can construct model scores. Remember that the output of this can be understood as the probability that someone votes in 2014 as a function of their age and whether they voted in 2012.

In [None]:
y_pred = result.predict(X)
y_pred.head()

Unnamed: 0,0
0,0.082433
1,0.095372
2,0.082433
3,0.107281
4,0.077096


**Question 3.** Make a plot showing the relationship between your predicted model (`y_pred`) and whether someone actually voted in 2014. In text, make sure you interpret this plot.

*Hint.* You will probably want to use the [`pandas.cut()` function](https://www.geeksforgeeks.org/pandas-cut-method-in-python/), which lets you create bins of `y_pred`. Remember the figures we discussed in the Likely Voter slides. You probably want to make a similar plot, like the figure below. (You don't need to copy it exactly; it is just an example of what you could do.) **This is a challenging question.**

![](https://github.com/joshuakalla/data_science_campaigns/blob/master/Colab/Lab7/sample_plot.png?raw=1)
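
As a rough illustration of the binning idea in the hint, here is a minimal sketch on synthetic data. The series `scores` and `voted` are made-up stand-ins for `y_pred` and `y`, and the ten equal-width bins are an assumption, not a requirement of the lab.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for y_pred and y (NOT the lab data): scores are drawn
# uniformly, and each "voter" turns out with probability equal to their score.
rng = np.random.default_rng(0)
scores = pd.Series(rng.uniform(0, 1, size=5000), name="y_pred")
voted = pd.Series(rng.binomial(1, scores.to_numpy()), name="voted")

# Cut the scores into 10 equal-width bins between 0 and 1.
score_bin = pd.cut(scores, bins=np.linspace(0, 1, 11), include_lowest=True)
score_bin = score_bin.rename("score_bin")

# Within each bin, compare the average score with the observed turnout rate.
calibration = (
    pd.DataFrame({"y_pred": scores, "voted": voted})
    .groupby(score_bin, observed=True)
    .agg(mean_score=("y_pred", "mean"), turnout=("voted", "mean"), n=("voted", "size"))
)
print(calibration)
```

Plotting `mean_score` against `turnout` (for example with `plt.plot`) would give a figure in the spirit of the sample plot: if the scores are informative, turnout should rise from the lowest bin to the highest.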

To more formally assess this model, let's make a 2x2 confusion matrix, like we did in class. The confusion matrix needs to have 4 cells: number of true positives, number of true negatives, number of false positives, number of false negatives. (If you want a reminder on a confusion matrix, see the slides from lecture or read [this](https://www.python-course.eu/confusion_matrix.php).)

For this exercise, we are going to define our threshold as 0.5. That means that if someone's predicted turnout (`y_pred`) is greater than 0.5, we are going to say that we predicted they vote. If their score is less than or equal to 0.5, we are going to say we predicted they did not vote. (Note that this threshold is somewhat arbitrary. We could define different cut-offs.)
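
To make the threshold rule concrete, here is a small sketch on toy values. The arrays `scores` and `actual` are hypothetical and are not drawn from the lab data; the point is only to show how the four cells are counted.

```python
import numpy as np

# Toy scores and outcomes (hypothetical values, not the lab data).
scores = np.array([0.10, 0.45, 0.50, 0.62, 0.80, 0.30, 0.91, 0.55])
actual = np.array([0, 0, 1, 1, 1, 1, 0, 0])

# Strictly greater than the threshold counts as "predicted to vote";
# a score of exactly 0.5 counts as "predicted not to vote".
threshold = 0.5
predicted = (scores > threshold).astype(int)

# Count each cell of the 2x2 confusion matrix.
tp = int(((predicted == 1) & (actual == 1)).sum())  # true positives
tn = int(((predicted == 0) & (actual == 0)).sum())  # true negatives
fp = int(((predicted == 1) & (actual == 0)).sum())  # false positives
fn = int(((predicted == 0) & (actual == 1)).sum())  # false negatives

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

The same pattern should carry over to `y_pred` and `y`, since both are aligned on the same rows.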

**Question 4.** Make the confusion matrix. Calculate the number of true positives, number of true negatives, number of false positives, number of false negatives. Fill in the table.

In [None]:
# insert your code here

Fill in the below table with the correct numbers.

- Actual Negative and Predicted Negative (what do we call this?):
- Actual Negative and Predicted Positive (what do we call this?):
- Actual Positive and Predicted Negative (what do we call this?):
- Actual Positive and Predicted Positive (what do we call this?):

**Question 5.** Based on this table, calculate the model's accuracy, precision, and recall. Interpret this.
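
As a reminder of the standard definitions, the sketch below computes the three metrics from placeholder counts. The numbers are hypothetical and are not the answer to Question 4.

```python
# Hypothetical confusion-matrix counts (placeholders, not the lab's answer).
tp, tn, fp, fn = 40, 30, 10, 20

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: of the people we predicted would vote, the share who actually did.
precision = tp / (tp + fp)

# Recall: of the people who actually voted, the share we predicted would vote.
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```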

In [None]:
# insert your code here

**Put your interpretation here.**

**Question 6.** Can a different model do better? It is now your turn to build a model from scratch. Follow the same steps as above. Select your predictor variables. Build your model. Construct a confusion matrix. Calculate accuracy, precision, and recall. How does this model do? Does it do better or worse than the original model? Would you use it?

In [None]:
# insert your code here

**Put your interpretation here.**

**Question 7.** Find a way to plot the differences between the two models we built.

In [None]:
# insert your code here

## Are we overfitting?

If you don't remember what overfitting means, review your notes from class or give this a read: https://www.ibm.com/topics/overfitting

Let's see how these models do out-of-sample. Are we overfitting?

To test this, you will need to run your model on a new data set (test set). You can then assess the confusion matrix and accuracy/precision/recall of your model on that test set.

Note that a logistic regression is written as:

$p(x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x)}}$, where $e$ is Euler's number, approximately 2.71828. In Python, `np.exp(x)` computes $e^x$ (the constant itself is available as `np.e`).

We will need to manually re-create this logistic regression. From my model, the coefficients are:

- voted12: 2.1814
- age: 0.0145
- intercept: -2.7879

We could therefore recreate our logistic regression by calculating:

$p(\text{voted14}) = \frac{1}{1+e^{-(-2.7879 + 0.0145 \cdot \text{age} + 2.1814 \cdot \text{voted12})}}$

Let's see this in action.

In [None]:
# Calculate y_pred by hand; call it y_pred2
y_pred2 = 1/(1+np.exp(-(-2.787942 + 2.181396 *X["voted12"] + .0145461 * X["age"])))

# Confirm that y_pred is the same as y_pred2
y_preds = pd.concat([pd.DataFrame(y_pred), pd.DataFrame(y_pred2)], axis = 1)
y_preds.columns = ['y_pred', 'y_pred2']
y_preds['diff'] = (y_preds['y_pred'] - y_preds['y_pred2']).abs()  # absolute difference per row

# Print out this data frame
print(y_preds.head())
print(y_preds["diff"].mean()) # Small differences due to rounding, but this is essentially 0
# These columns are the same

I now want you to test if your model is over-fit. Don't cheat by recreating your model. Just use what you have above. I care about the process, not the actual performance.
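
One possible way to reuse the hand calculation on a new data set is to wrap it in a small function. The sketch below is only an illustration: `score_turnout` and `new_voters` are hypothetical names, the rows are made up, and it assumes the new data has `voted12` and `age` columns like the training data.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the hand calculation above.
COEFS = {"intercept": -2.787942, "voted12": 2.181396, "age": 0.0145461}

def score_turnout(df):
    """Return predicted turnout probabilities for a frame with `voted12` and `age`."""
    linear = (
        COEFS["intercept"]
        + COEFS["voted12"] * df["voted12"]
        + COEFS["age"] * df["age"]
    )
    return 1 / (1 + np.exp(-linear))

# Made-up rows standing in for a new data set (assumed column names).
new_voters = pd.DataFrame({"voted12": [0, 1], "age": [26.0, 50.0]})
new_scores = score_turnout(new_voters)
print(new_scores)
```

Because the coefficients are fixed, nothing is re-estimated on the new rows; the scores depend only on the original fit. The first made-up row (age 26, did not vote in 2012) matches the first row of the training data, so its score should agree with the first value of `y_pred`.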

The data you used above came from South Dakota. I want you to see how the model performs in Wisconsin.

In [None]:
test = pd.read_csv("https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/master/Colab/Lab7/test_data.csv")
test.head()

**Question 8.** Evaluate how your model performs out-of-sample using `test_data.csv` and the coefficients from your model. Create a confusion matrix and calculate the accuracy/precision/recall. How does your model perform out-of-sample?

# Congratulations!

You are done with the lab. Before you finish and submit, please fill out this brief evaluation:

- I spent around XXXX hours on this lab.
- This lab was (too easy, too hard, just about the right difficulty).

**To turn in your lab, you will need to submit a PDF through Canvas.**