# Analysis of Voter Turnout in Indiana Pre- and Post- Voter Identification Law
### Authors: Christopher Lefrak, Hannah Li, George Yang, and Kuai Yu
### PSTAT 235

## Introduction

Thirty-five of the fifty states of the U.S. have passed stricter voter ID laws that require or request voters to present a form of identification at the polls. 
The remaining fifteen states do not require voters to present any documentation to vote at the polls. States such as Indiana, Wisconsin, and Tennessee have strict photo ID laws for voters, while states such as Minnesota, Nebraska, North Carolina, and Pennsylvania have no requirements for voter identification. A visualization of the levels of strictness of voter photo identification laws for each state can be seen in the map in **Fig. 1** below.

<figure>
<img src="https://drive.google.com/uc?export=view&id=1-bKVXaRl_j3trfCus6iWYylx3RX4ACPz" style="width:100%">
<figcaption align = "center"> Fig. 1: Strictness Levels of US Voter ID Laws </figcaption>
</figure>

Advantages of implementing stricter voter identification requirements include preventing voter impersonation and thereby increasing public confidence in election processes. A disadvantage is that stricter laws can unnecessarily burden voters and administrators.

## Goals
In this project, we conduct a case study to see whether implementing laws that make voting less convenient can ultimately lead to decreased voter participation. We focus our investigation of voter identification laws on the state of Indiana, which implemented a strict voter identification law in 2008. We seek to estimate how much voter turnout would have decreased or increased without the implementation of the law.

> Project Goals
> - Conduct logistic regression
> - Propensity score 
> - Doubly Robust Estimation
> - Strengthen our PySpark data analysis skills, collaborative skills, and project organization skills

## Voter Data

### Dataset Overview

Our voter data is obtained from the course's VM2Uniform folder. We primarily use the dataset corresponding to Indiana. At a glance, the dataset contains 726 columns and 946,908 rows, with voting records beginning in 2000 and ending in 2021.

### Exploratory data analysis

Let's begin by exploring the correlation between voter turnout and various factors in the 2008 Indiana election.

<figure>
<img src="https://drive.google.com/uc?export=view&id=1hI1piik6eexYMQLtGctS8cxILRM056a4" style="width:80%">

<figcaption align = "center"> Fig. 2: Voter turnout versus income and housing value in the 2008 general election </figcaption>
</figure>

Based on the information presented in the left graph, it appears that there is a positive correlation between income and voter turnout in Indiana in 2008. Specifically, individuals with higher incomes were more likely to vote.

Similarly, the right graph suggests that there is a positive correlation between housing value and voter turnout. In other words, individuals who owned homes with higher values were more likely to participate in the election.

<figure>
<img src="https://drive.google.com/uc?export=view&id=1Yh27OWK0ieuKvNub0CORVTqvIYJnW7tF" style="width:80%">

<figcaption align = "center"> Fig. 3: Voter turnout versus home ownership status and education years in the 2008 general election </figcaption>
</figure>

From the left graph, it appears that voter turnout is higher among homeowners than among renters.

The right graph suggests that there is a positive correlation between education level and voter turnout. Specifically, individuals with higher education levels (measured in years of education) were more likely to vote in the election.

Then, let us shift our focus to Wisconsin to examine how the voter ID law impacted the voter turnout rates in different ethnic groups.

<figure>
<img src="https://drive.google.com/uc?export=view&id=11ksw39LD2tSbkxgx5X-wooZoa2t663ly" style="width:80%">

<figcaption align = "center"> Fig. 4:  Voter turnout rates in different ethnic groups </figcaption>
</figure>

Upon analyzing the graphs, it is evident that there are notable differences in voter behavior following the passing of Wisconsin's voter ID law in 2011. Specifically, the percentage of individuals identified as "Likely African-American" appears to behave differently than other ethnic groups.

For voters who participated only in the general or primary election, there was a significant decline in the percentage of "Likely African-American" individuals, while other groups experienced a sharp increase. However, for those who did not vote in either the primary or general election, there was no noticeable change in the percentages of people in each ISPSA category before and after 2011. The only exception to this was a sharp increase in the percentage of "Likely African-American" individuals after 2011.

### Input Variables

We subsetted the dataset to focus on a narrower set of voter attributes. The table below lists the columns we kept from the original dataset.

| Column Name | Data Type | Values |
| --- | --- | --- |
| Voters_Gender | categorical | 'F', 'M', 'Missing' |
| Voters_BirthDate | --- | format 99/99/9999 |
| Residence_Families_HHCount | numerical | 1 through 10 |
| Residence_HHGender_Description | categorical | 'Cannot Determine', 'Female Only Household', 'Male Only Household', 'Mixed Gender Household' |
| Mailing_Families_HHCount | numerical |  ... |
| Mailing_HHGender_Description | categorical |  'Cannot Determine', 'Female Only Household', 'Male Only Household', 'Mixed Gender Household' |
| Parties_Description | categorical | 'Republican', 'Unknown', 'Other', 'Democratic', 'Non-Partisan' |
| CommercialData_PropertyType | categorical | 'Apartment', 'Mobil Home', 'Residential', 'Unknown', 'Condominium', 'Missing', 'Triplex', 'Duplex' |
| AddressDistricts_Change_Changed_CD | categorical | 'Between 6 Months and 1 Year Ago', 'Has Not Changed Within Last 2 Years', 'Between 1 and 2 Years Ago', 'Within Last 6 Months' |
| AddressDistricts_Change_Changed_SD | categorical | 'Between 6 Months and 1 Year Ago', 'Has Not Changed Within Last 2 Years', 'Between 1 and 2 Years Ago', 'Within Last 6 Months' |
| AddressDistricts_Change_Changed_HD | categorical | 'Between 6 Months and 1 Year Ago', 'Has Not Changed Within Last 2 Years', 'Between 1 and 2 Years Ago', 'Within Last 6 Months' |
| AddressDistricts_Change_Changed_County | categorical | 'Between 6 Months and 1 Year Ago', 'Has Not Changed Within Last 2 Years', 'Between 1 and 2 Years Ago', 'Within Last 6 Months' |
| Residence_Addresses_Density | numerical | ... |
| CommercialData_EstimatedHHIncome | categorical | '175000-199999', '250000+', '1000-14999', '100000-124999', '75000-99999', '125000-149999', '25000-34999', '200000-249999', '50000-74999', '150000-174999', '35000-49999', '15000-24999', 'Unknown' |
| CommercialData_ISPSA | categorical | None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 |
| CommercialData_AreaMedianEducationYears | numerical |  ... |
| CommercialData_AreaMedianHousingValue | numerical | ... |
| CommercialData_AreaPcntHHMarriedCoupleNoChild | categorical | ... |
| CommercialData_AreaPcntHHMarriedCoupleWithChild | categorical | ... |
| CommercialData_AreaPcntHHSpanishSpeaking | categorical | ... |
| CommercialData_AreaPcntHHWithChildren | categorical | ... |
| CommercialData_StateIncomeDecile | categorical | None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 |
| EthnicGroups_EthnicGroup1Desc | categorical |  'East and South Asian', 'European', 'Hispanic and Portuguese', 'Likely African-American', 'Missing', 'Other' |
| CommercialData_DwellingType | categorical | 'Multi-Family Dwelling', 'Missing', 'Single Family Dwelling Unit' |
| CommercialData_PresenceOfChildrenCode | categorical | 'Modeled Likely to have a child', 'Not Likely to have a child', 'Modeled Not as Likely to have a child', 'Known Data', 'Missing' |
| CommercialData_DonatesToCharityInHome | categorical | 'Y', 'U' |
| CommercialData_DwellingUnitSize | categorical | '5-9', '2-Duplex', '50-100', '1-Single Family Dwelling', '10-19', 'Missing', '4', '101+', '3-Triplex', '20-49' |
| CommercialData_ComputerOwnerInHome | categorical | 'Y', 'U' |
| CommercialData_DonatesEnvironmentCauseInHome | categorical | 'Y', 'U' |
| CommercialData_Education | categorical | 'Grad Degree - Extremely Likely', 'Grad Degree - Likely', 'Bach Degree - Extremely Likely', 'HS Diploma - Likely', 'Less than HS Diploma - Ex Like', 'Some College - Likely', 'Vocational Technical Degree - Extremely Likely', 'Some College -Extremely Likely', 'HS Diploma - Extremely Likely', 'Less than HS Diploma - Likely', 'Bach Degree - Likely', 'Missing' |


### Other Variables
The table below shows other control variables that we expect to be highly associated with the response variable.

| Column Name | Data Type | Values |
| --- | --- | --- |
| General_2000 | categorical | Y/null |
| General_2004 | categorical | Y/null |
| PresidentialPrimary_2000 | categorical | Y/null |
| PresidentialPrimary_2004 | categorical | Y/null |


The table below shows the response variable.

| Column Name | Data Type | Values |
| --- | --- | --- |
| General_2008 | categorical | Y/null |

### Other States
In the table below are states that do not have strict voter identification laws.

| State | Abbreviation |
| --- | --- |
| California | CA | 
| Illinois | IL |
| Massachusetts | MA | 
| Maryland | MD | 
| Maine | ME |
| Minnesota | MN | 
| North Carolina | NC | 
| Nebraska | NE |
| New Jersey | NJ | 
| New Mexico | NM | 
| Nevada | NV |
| New York | NY | 
| Oregon | OR | 
| Pennsylvania | PA |
| Vermont | VT | 

**Note:** Throughout this report, when we refer to "other states", we mean all states from this list above.

### Data Cleaning

Many of the columns contain symbols including `$` and `%`, so we remove those symbols.
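As a minimal sketch of this step, the symbol stripping could look like the following in plain Python. The function name `parse_numeric` and the handling of thousands separators are assumptions made for illustration; the report's actual cleaning was done in PySpark.

```python
def parse_numeric(raw):
    """Strip '$', '%', and ',' from a raw string and return a float (None if blank)."""
    if raw is None:
        return None
    # Removing ',' is an assumption about how thousands separators appear in the data
    cleaned = raw.replace("$", "").replace("%", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


print(parse_numeric("$125000"))  # 125000.0
print(parse_numeric("43%"))      # 43.0
```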

Many columns are also missing data. In numerical columns, we impute the missing values with the column mean to minimize changes to the z-scores of the observed data. This is desirable because we standardize our values when we implement our machine learning algorithms. We don't want to throw away valuable data by discarding the nulls entirely, but we also don't want our imputations to artificially influence or inflate the effect that the predictors may have on the response.
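The idea behind mean imputation can be illustrated with a small, self-contained sketch. The helper `impute_mean` and the toy values are hypothetical and are not taken from the dataset.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Toy column with two missing entries (values are made up)
column = [40.0, None, 60.0, 80.0, None]
filled = impute_mean(column)
print(filled)        # [40.0, 60.0, 60.0, 80.0, 60.0]
print(mean(filled))  # 60.0, the same as the mean of the observed values
```

Each imputed entry sits exactly at the column mean, so it receives a z-score of zero after standardization. Note that the column's spread shrinks somewhat, so the z-scores of the observed entries are not left exactly unchanged.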

Many of the categorical variables in the dataset already have an "Unknown" level. As far as we are concerned, if the value is missing, then the value is unknown. The exception is voter participation data, which may be missing for a given election simply because the individual was not old enough to vote in that election.

Our data consists of registered voters as of 2021, so it also contains many individuals who were not old enough to vote in the 2008 general election (they were below the 18 year old age requirement). We removed the rows corresponding to these voters. Additionally, we cleaned the columns for voter participation such as `General_2008` to reflect the votes of those that are eligible to vote at the time. We then converted the `General_2008` variable to be numerical. 
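The eligibility filter and the numeric encoding of the response can be sketched as follows. This assumes that `Voters_BirthDate` is stored as MM/DD/YYYY (the table above only gives the pattern 99/99/9999) and uses November 4, 2008 as the date of the general election; the function names are hypothetical.

```python
from datetime import date, datetime

ELECTION_DAY_2008 = date(2008, 11, 4)  # date of the 2008 U.S. general election


def was_eligible(birthdate_str, election_day=ELECTION_DAY_2008, min_age=18):
    """Return True if someone born on birthdate_str (assumed MM/DD/YYYY) was at least min_age on election_day."""
    born = datetime.strptime(birthdate_str, "%m/%d/%Y").date()
    # Subtract one year if the birthday had not yet occurred by election day
    had_birthday = (election_day.month, election_day.day) >= (born.month, born.day)
    age = election_day.year - born.year - (0 if had_birthday else 1)
    return age >= min_age


def encode_vote(value):
    """Map the raw 'Y'/null participation flag to 1 (voted) or 0 (did not vote)."""
    return 1 if value == "Y" else 0


print(was_eligible("11/04/1990"))  # True: turned 18 on election day
print(was_eligible("11/05/1990"))  # False: still 17 on election day
```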

When fitting our models in PySpark, we used `RFormula` to create a `features` column that is a vector of all relevant predictors (numerical and categorical).

## Logistic Regression

Logistic regression is a statistical method that allows us to estimate the probability that an event occurs, in this case, if an individual voted or not, given a set of independent variables $X_1, X_2,...,X_k$. If the $i$th individual has a probability $\pi_i$ of voting, then the model is that the log odds of voting, $\ln (\frac{\pi_i}{1-\pi_i})$, can be expressed as a linear combination of the covariates $X_{i1}, X_{i2},...,X_{ik}$:

$$
\begin{align}
\ln\left(\frac{\pi_i}{1-\pi_i}\right)&=\beta_0+\beta_1 X_{i1}+\beta_2 X_{i2}+\dots+\beta_k X_{ik}
\end{align}
$$

The coefficients of this function are $\beta_0, \beta_1, \beta_2,\dots, \beta_k$, and they indicate the relative effect of the corresponding variables $X_1,X_2,...,X_k$ on the response variable. The larger the magnitude of the coefficient $\beta_j$, the more a change in $X_j$ changes the log-odds. The coefficients are fit by maximum likelihood: the optimal coefficients are those that maximize the likelihood of the observed outcomes.

Of course, this idea of logistic regression generalizes to more than just voting; it applies whenever the response variable of interest is binary.
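To make the log-odds relationship concrete, the sketch below inverts the equation above to recover $\pi_i$ from a linear combination of covariates. The function name and the coefficient values are hypothetical and are not the fitted values from our model.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Return pi = 1 / (1 + exp(-(beta_0 + sum_j beta_j * x_j)))."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Log-odds of zero correspond to a probability of one half
print(predict_probability(0.0, [1.0, -2.0], [0.0, 0.0]))  # 0.5

# A positive coefficient raises the probability as its feature increases
low = predict_probability(-1.0, [0.8], [0.0])
high = predict_probability(-1.0, [0.8], [2.0])
print(low < high)  # True
```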

## Model 1 - Logistic Regression on Voter Turnout

As a first attempt at identifying whether Indiana's voter turnout in 2008 was affected by the strict voter ID law implemented that year, we fit a logistic regression to the voter turnout variable `General_2008`. In particular, we trained the model to predict whether a given individual voted in the 2008 election. However, the model was only given data from states without voter ID laws to learn from, i.e., all of our working states except for Indiana.

The idea is that, if Indiana's law was key in shaping the voting behavior of Indiana residents in 2008, then our model would have poorer prediction accuracy on Indiana than on a test set from the states without the same identification requirements.

#### Resulting Model

**Fig. 5** shows a line plot of the sorted array of the resulting $\beta$ coefficients against their index in the array. We notice that there are around 120 coefficients, even though we only used around 37 predictors. The extra coefficients arise because many of our 37 predictors are categorical. We need to turn our categorical predictors into numbers, so one categorical variable gets converted into several indicator variables. For example, if a categorical variable has five distinct values, it is converted into four indicators, with the fifth value serving as the reference level that is absorbed into the intercept $\beta_0$.
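The expansion of one categorical predictor into several indicators can be sketched as follows, using the five levels of `Parties_Description` from the table above. The helper `dummy_encode` is hypothetical, and the choice of the last level as the reference is an arbitrary assumption; which level an encoder actually drops depends on the encoder.

```python
def dummy_encode(value, levels):
    """Encode value as len(levels) - 1 indicators; the last level is the all-zero reference."""
    return [1 if value == level else 0 for level in levels[:-1]]


party_levels = ["Republican", "Unknown", "Other", "Democratic", "Non-Partisan"]

print(dummy_encode("Democratic", party_levels))    # [0, 0, 0, 1]
print(dummy_encode("Non-Partisan", party_levels))  # [0, 0, 0, 0] (reference level)
```

Five levels produce four coefficients, which is how roughly 37 predictors can yield roughly 120 coefficients.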

We can see that a majority of these predictors had coefficients close to 0, indicating that many of these predictors are not useful in determining voter turnout.

<figure>
<img src="https://drive.google.com/uc?export=view&id=1nHp-AuG_SA0MpNHA0AmFNKE4dmSbYFzN" style="width:100%">
<figcaption align = "center"> Fig. 5: Logistic Regression Beta Coefficients </figcaption>
</figure>

**Fig. 6** shows all of the predictors with large coefficients, i.e., predictors that play a substantial role in predicting whether or not a given individual voted in the 2008 general election. Since many predictors had coefficients that were nearly zero, the figure only displays predictors whose associated coefficient has a magnitude larger than 0.25. This allows us to read the names of the predictors that have a meaningful association with voter turnout.

<figure>
<img src="https://drive.google.com/uc?export=view&id=1N7jCemVQwY1DYzseHOW690yiw99Jc7l4" style="width:100%">
<figcaption align = "center"> Fig. 6: Logistic Regression Ranked Coefficients </figcaption>
</figure>

Perhaps unsurprisingly, it seems that an individual's party description plays a big role in their likelihood of voting. We see that many progressive or liberal-leaning party descriptions were associated with lower odds of voting in the 2008 election. However, we see that those who identified with the "Women's Equality party" as well as African Americans had increased odds of voting. This could be due to the historical significance of the 2008 election for the civil rights movement, since it involved the first African American major-party presidential nominee.



#### Performance

The performance metrics of the model are summarized in the table below.

| Metric | Other States | Indiana |
| --- | --- | --- |
| Overall Accuracy | 79.2% | 78.6% |
| False Positive Rate | 13.3% | 14.9% |
| False Negative Rate | 29.2% | 28.4% |
| Area Under ROC | 86.2% | 86.0% |

Overall, every metric is very comparable between the control test set and the Indiana data. We verified that the model's performance was similar on the training data itself, indicating that overfitting was not an issue.
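For reference, the first three metrics in the table can be computed from predicted and true labels as in the sketch below. The function name and the toy labels are hypothetical; the reported numbers come from our PySpark evaluation, not from this snippet.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, false positive rate, and false negative rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "false_positive_rate": fp / (fp + tn),  # share of non-voters predicted to vote
        "false_negative_rate": fn / (fn + tp),  # share of voters predicted not to vote
    }


# Toy labels: 1 = voted, 0 = did not vote
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'false_positive_rate': 0.25, 'false_negative_rate': 0.25}
```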

Since the model could predict whether an individual from Indiana voted in the 2008 election just as well as it could predict whether someone voted in another state, it may seem that the voter ID law is unimportant in determining voter turnout.

## Propensity Score
Our goal here is to predict whether a voter lives in a state that has passed the law.

We want to estimate the probability that someone would have voted had the voter identification law not been implemented.

Assumption: People outside of Indiana are representative of the people in Indiana.


> Variables:
> - T = whether they have the law
> - Y = whether they voted in 2008
> - P = predicted T

To compare the voter data with and without the implementation of a voter identification law, we observe that the difference in means $E[Y|T=1]-E[Y|T=0]$ is a biased estimate of the true causal effect. A simple difference in averages cannot measure the causal effect of a voter ID law, because the people in states outside of Indiana can be very different from those in Indiana.

To adjust for this bias, we can use the propensity score estimator. The propensity score is the conditional probability of receiving the treatment, which here is the implementation of the voter identification law. Using this score means that we do not have to condition on the whole of $X$ to achieve independence of the potential outcomes from the treatment, $(Y_1,Y_0) \perp T \mid X$. Instead, it is sufficient to condition on the propensity score $P(x)=P(T=1|X=x)=E[T|X=x]$, which gives $(Y_1,Y_0) \perp T \mid P(x)$.

The propensity score essentially acts as a middleman between $X$ and $T$, summarizing how the covariates $X$ determine the treatment $T$. Initially, we cannot compare treated and non-treated observations. However, we can compare a treated and a non-treated observation if they have the same probability of receiving the treatment, since whether each one received the treatment can then be attributed to randomness. Thus, we hold the propensity score constant to make the treatment assignment appear as if it were random.


### Propensity Weighting and Estimation

We write the difference in means again, but we now condition on $X$: 

$$
\begin{align}
& E[Y|X,T=1]-E[Y|X,T=0]\\
& =E[\frac{Y}{P(x)}|X,T=1]P(x)-E[\frac{Y}{(1-P(x))}|X,T=0](1-P(x))
\end{align}
$$

In other words, the propensity score serves as a way of matching people in Indiana with people in other states who had a similar probability of being in a state with a voter ID law. If two people have the same propensity score, then which of them was actually treated (living in a state with a voter ID law) can be attributed to chance, and we can take the difference in their outcomes.

We can simplify our propensity score weighting estimator to $E\left[Y\frac{T-P(x)}{P(x)(1-P(x))}\right]$, where $P(x)$ and $(1-P(x))$ both must be greater than $0$. Thus, each voter must have some probability of both receiving and not receiving the treatment of the implementation of a voter identification law.
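For completeness, here is one way to see the simplification, using only the fact that $T$ is binary with $P(T=1|X)=P(x)$. Each conditional expectation can be rewritten as an expectation over both treatment groups, and the two terms then combine over a common denominator:

$$
\begin{align}
& E\left[\frac{Y}{P(x)}\Big|X,T=1\right]P(x)-E\left[\frac{Y}{1-P(x)}\Big|X,T=0\right](1-P(x))\\
& =E\left[\frac{YT}{P(x)}\Big|X\right]-E\left[\frac{Y(1-T)}{1-P(x)}\Big|X\right]\\
& =E\left[Y\,\frac{T(1-P(x))-(1-T)P(x)}{P(x)(1-P(x))}\Big|X\right]\\
& =E\left[Y\,\frac{T-P(x)}{P(x)(1-P(x))}\Big|X\right]
\end{align}
$$

Averaging over $X$ then gives the unconditional weighting estimator.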

We estimate the true propensity score $P(x)$ with $\hat{P}(x)$ using logistic regression. 
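A minimal sketch of the weighting estimator is given below, assuming the propensity scores have already been estimated. The function name `ipw_ate` and the toy inputs are hypothetical. The second call illustrates one way an estimate outside the range $[-1, 1]$ can arise: a single treated observation with a very small propensity score receives a very large weight.

```python
def ipw_ate(y, t, p):
    """Propensity-weighted ATE: the sample mean of Y * (T - P) / (P * (1 - P))."""
    total = 0.0
    for yi, ti, pi in zip(y, t, p):
        if not 0.0 < pi < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        total += yi * (ti - pi) / (pi * (1.0 - pi))
    return total / len(y)


y = [1, 1, 0, 1]  # 1 = voted, 0 = did not vote
t = [1, 1, 0, 0]  # 1 = lives in a voter ID state

# With every propensity at 0.5, the estimate equals the plain difference in means
print(ipw_ate(y, t, [0.5, 0.5, 0.5, 0.5]))   # 0.5

# One treated voter with propensity 0.01 gets a weight of about 100
print(ipw_ate(y, t, [0.01, 0.5, 0.5, 0.5]))  # roughly 25, far outside [-1, 1]
```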

Our result gives an estimated average treatment effect $\hat{ATE} = 2.18$. This result is nonsensical. We discuss it further in the Summary and Conclusion section, but in short, because our response variable takes the values 0 and 1, the ATE must lie between -1 and 1. An ATE of 2.18 would mean that living under a voter ID law raises an individual's probability of voting by 218 percentage points, which is obviously ridiculous.

In two further runs of the propensity score estimator, we obtained estimates of -19 and +11. Since the outcome variable is either 0 (did not vote) or 1 (voted), these would imply that the law caused a 1900 percentage point decrease or an 1100 percentage point increase in voter turnout, clearly implausible numbers.

One of the reasons we believe the propensity score estimator fails to capture the true causal effect and gives massive standard errors is that we do not have a very representative pool of individuals in states without voter ID laws. In particular, looking at **Fig. 7** below, we can see that the logistic regression actually does a good job of separating Indiana from the rest of the states in our sample. In ordinary prediction settings, this would be a good outcome. However, when we are trying to determine causal effects, this is not optimal, as it means that we have very few individuals living in states without voter ID laws who are actually comparable to those in Indiana. This lack of overlap is a common problem in propensity score methods and can introduce bias and massive variance, something that has clearly happened here.

## Doubly Robust Estimation

So far we have used logistic regression to directly model voter turnout, and we have used propensity score weighting to estimate $E[Y|X,T=1] - E[Y|X,T=0]$ (i.e., the expected difference in the likelihood of voting in the 2008 general election between an Indiana resident and a resident of another state). However, we can combine the advantages of both techniques into a single estimator known as the *doubly robust estimator*. The estimator for the average treatment effect is given by

$$\hat{ATE} = \frac{1}{N}\sum \bigg( \dfrac{T_i(Y_i - \hat{\mu_1}(X_i))}{\hat{P}(X_i)} + \hat{\mu_1}(X_i) \bigg) - \frac{1}{N}\sum \bigg( \dfrac{(1-T_i)(Y_i - \hat{\mu_0}(X_i))}{1-\hat{P}(X_i)} + \hat{\mu_0}(X_i) \bigg)$$

where $\hat{\mu_0}(x)$ is an estimate of $E[Y| X, T=0]$ (the probability that someone with the given characteristics votes, given they are from another state), $\hat{\mu_1}(x)$ is an estimate of $E[Y| X, T=1]$ (the probability that someone with the given characteristics votes, given they are from Indiana), and $\hat{P}(x)$ is an estimate of the propensity score.

In implementation, we can think of these quantities as follows:

- $\hat{\mu_0}(X_i)$: predicted probabilities of voting for all voters given a model is trained on other states.
- $\hat{\mu_1}(X_i)$: predicted probabilities of voting for all voters given a model is trained on just Indiana.
- $\hat{P}(X_i)$: predicted probabilities of having a voter ID law for all voters given a model is trained on all states.
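Given those three sets of predictions, the estimator above reduces to a few lines of arithmetic. The sketch below is a plain-Python illustration with hypothetical names and toy inputs; it is not the PySpark code used to produce our results.

```python
def doubly_robust_ate(y, t, p, mu0, mu1):
    """Doubly robust ATE from outcomes y, treatments t, propensities p,
    and outcome-model predictions mu0 (control) and mu1 (treated)."""
    n = len(y)
    # Treated-outcome mean: mu1 plus a weighted correction on treated rows
    treated_term = sum(
        ti * (yi - m1) / pi + m1 for yi, ti, pi, m1 in zip(y, t, p, mu1)
    ) / n
    # Control-outcome mean: mu0 plus a weighted correction on control rows
    control_term = sum(
        (1 - ti) * (yi - m0) / (1 - pi) + m0 for yi, ti, pi, m0 in zip(y, t, p, mu0)
    ) / n
    return treated_term - control_term


y = [1, 1, 1, 0]            # 1 = voted
t = [1, 1, 0, 0]            # 1 = Indiana
p = [0.5, 0.5, 0.5, 0.5]    # toy propensity scores
mu1 = [1.0, 1.0, 1.0, 1.0]  # toy predictions from the Indiana-trained model
mu0 = [0.5, 0.5, 0.5, 0.5]  # toy predictions from the other-states model

print(doubly_robust_ate(y, t, p, mu0, mu1))  # 0.5
```

In this toy case both outcome models match the group means exactly, so the correction terms average to zero and the estimate equals the difference in means.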

In implementation, our initial run gave us a value of $\hat{ATE} = 0.047$, which would indicate that someone in Indiana is expected to be 4.7 percentage points *more* likely to vote than someone from another state. Of course, this number is a point estimate, and we are unsure about its statistical significance. What's more, it is of the opposite sign to what we would expect! **Fig. 7** shows the distribution of $\hat{P}(x)$ broken down by whether the state had a voter ID law (Indiana) or not (other states).

One of the challenges with running the doubly robust estimator has to do with label balance. We have many more voters living in non-voter-ID states than in voter-ID states. Consequently, each time we run logistic regression, we downsample the data to balance the labels. Otherwise, the intercept of the propensity model would become increasingly negative as we include more data from non-voter-ID states, and our propensity scores would be pushed towards zero. Thus, we sample from non-voter-ID states to get the same number of observations as Indiana. Ideally, we would bootstrap this procedure and obtain a confidence interval via quantile estimates. We suspect that the doubly robust estimator is imperfectly estimating the true average treatment effect, as the effect is much larger than we expect (shifting voter turnout by 4-5 percentage points in a presidential election would have massive political ramifications) and its sign is opposite to our priors.
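The downsampling step and the bootstrap we did not get to run can be sketched as follows. All names here are hypothetical, and the `estimator` argument stands in for whatever statistic is recomputed on each resample (in our setting, the doubly robust estimate); the sketch uses a simple mean so that it stays self-contained.

```python
import random


def balance_by_downsampling(treated, control, seed=0):
    """Randomly keep as many control rows as there are treated rows."""
    rng = random.Random(seed)
    return treated, rng.sample(control, len(treated))


def bootstrap_ci(rows, estimator, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for estimator(rows)."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = sorted(
        estimator([rows[rng.randrange(n)] for _ in range(n)])  # resample with replacement
        for _ in range(n_boot)
    )
    lower = estimates[round((alpha / 2) * n_boot)]
    upper = estimates[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def sample_mean(rows):
    """Stand-in estimator: the average of the rows."""
    return sum(rows) / len(rows)


treated_rows = list(range(5))        # toy stand-ins for Indiana voters
control_rows = list(range(100, 150)) # toy stand-ins for voters in other states
_, control_sample = balance_by_downsampling(treated_rows, control_rows)
print(len(control_sample))           # 5

votes = [0, 1] * 50                  # toy 0/1 outcomes
print(bootstrap_ci(votes, sample_mean, n_boot=200))
```

On the full dataset each bootstrap replicate would require refitting three models, which is the main reason this remained out of reach for us.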

<figure>
<img src="https://drive.google.com/uc?export=view&id=1Y4Obcx_1xCqewEqJyMOodQ6cWN11GxzQ" style="width:80%" align = "center">
<figcaption align = "center"> Fig. 7: Distribution of Propensity Scores </figcaption>
</figure>

## Summary and Conclusion

Using the propensity score estimator, we obtain an (implausible) average treatment effect of 2.18, or 218 percentage points. Using the doubly robust estimator, we obtain an average treatment effect of 0.047, or 4.7 percentage points.

This is interesting because the result has the opposite sign to what we would expect: it suggests that the law increased voter turnout rather than decreasing it. One way this could arise is through omitted variable bias. In particular, if the 2008 election was, for example, particularly competitive in Indiana compared to other areas with similar explanatory variables, then we would expect the results for Indiana to be skewed upwards.

In addition, note that we focus here on presidential elections for the sake of simplicity. Other studies of this law using smaller datasets have focused instead on the primary election in 2008, immediately after the law was passed.

Other studies have also had a hard time obtaining small enough standard errors. For example, Erikson and Minnite (2009) present a reasonable upper bound on the absolute magnitude of the causal effect as 0.02. They explain that with Current Population Survey (CPS) data, it is extremely difficult to obtain confidence intervals that come near to that using present econometric methods. We had originally hoped that with such a large dataset, with our total sample being in the millions, we would be able to obtain tighter standard errors and therefore use the law of large numbers to estimate effects more precisely than prior studies. Unfortunately, we were unable to accomplish that here.

The benefit of the doubly robust estimator is that it seeks to correct the bias by accurately estimating the counterfactual. However, in our case, the estimator's effectiveness was limited. Firstly, the estimated effect's sign was opposite to what we expected. Secondly, since the doubly robust estimator requires the estimation of three different models, it is expensive to run on large datasets, and we could only run it once.

## Resources
https://www.ncsl.org/elections-and-campaigns/voter-id#undefined 

https://matheusfacure.github.io/python-causality-handbook/11-Propensity-Score.html

https://www.ibm.com/topics/logistic-regression#:~:text=Resources-,What%20is%20logistic%20regression%3F,given%20dataset%20of%20independent%20variables.

Erikson, R. S., & Minnite, L. C. (2009). Modeling problems in the voter identification—voter turnout debate. Election Law Journal, 8(2), 85-101.