# Predicting the 2018 Midterm Elections
*Group 42*

*Elise Penn, Manish Vuyyuru, Yajaira Gonzalez, Victor Sheng*


Fork it on Github!


https://github.com/WildTangles/ac209a_project.git

## Table of Contents
1. [Overview](#Overview)

2. [Motivation](#Motivation)

3. [Description of Data and EDA](#DescriptionOfDataAndEDA)

    1. [FEC Data](#FECData)

    2. [Polling Data](#PollingData)
    
    3. [Demographic Data](#DemographicData)
    
    4. [Geographic Data](#GeographicData)
 
4. [Literature Review/Related Work](#LiteratureReviewRelatedWork)

    1. [Model Development](#ModelDevelopment)
    
    2. [Redistricting](#Redistricting)
    
5. [Modeling Approach](#ModelingApproach)
    
6. [Models Used](#ModelsUsed)

    1. [Baseline 1](#Baseline1)
    
    2. [Baseline 2: Extended](#Baseline2)
    
7. [Variable Selection](#VariableSelection)

8. [Model Extensions](#ModelExtensions)

9. [Changes in Project Goals](#ChangesinProjectGoals)

10. [Results](#Results)

    1. [Evaluation of Datasets](#EvaluationofDatasets)
    
    2. [Evaluation of Models](#EvaluationofModels)
    
11. [Future Work](#FutureWork)

12. [Conclusions and Summary](#ConclusionsandSummary)

13. [References](#References)


# Overview<a name="Overview"></a>

The goal of the project is to predict the winning party in each congressional district's race for the House of Representatives in the 2018 election. 
The House of Representatives has 435 voting members, each elected by the people living in one congressional district. Each congressional district represents ~711,000 constituents, except in states which have fewer than 711,000 residents - these states simply get one representative. 

<br />

In 2018, all three branches of government were controlled by the Republican party. However, in the midterm election directly following a presidential election, the party of the president typically loses seats in the House of Representatives. This fact, combined with the low popularity of the current president and the record-smashing fundraising of the Democrats this year, led many people to predict a “blue wave” in Congress this year. 

<br />

Every 10 years, each state redraws the borders of its congressional districts based on the latest update of the U.S. Census. The borders of the districts are often manipulated to favor whichever party holds more power at the time of redistricting - this is known as “gerrymandering.” Gerrymandering in order to create an advantage for a particular party is typically tolerated by the courts, but if gerrymandering disenfranchises a particular race of people, the courts may order the state to redraw its lines. Thus, redistricting happened in at least one state in almost every year of our study. 

# Motivation<a name="Motivation"></a>

Predictions of elections are not only interesting as a media sensation, but also vital for candidates to strategize their campaigns. By affecting the allocation of resources, these predictions not only report on the elections, but also affect the outcome of elections. (Our model will, fortunately, not affect any elections.) 

Election predictions are challenging because the outcome is affected by so many variables, but there are very few observations (i.e., elections) relative to the number of predictors. They are also challenging because the game changes almost every election. In addition to swings in the political preferences of the population, the party in power has the ability to literally redraw the board to favor itself via redistricting. The combination of these two factors makes predicting elections a particularly compelling and challenging problem.

# Description of Data and EDA<a name="DescriptionOfDataAndEDA"></a>

1. **FEC Data** <br />
Vote counts from prior elections in each district. <br /> *Years available: 2002-2018*
2. **Demographics Data** <br />
Information from the U.S. Census about the demographics of each district. 
Due to time constraints, we used limited demographics data. <br />
*Years available: 2010-2018*
3. **Polling Data** <br />
Polls conducted during the election season. 
Due to time constraints, we used only national aggregates. <br />
*Years available: 2002-2018*
4. **Geographic Data** <br />
Used for modifying above data. District borders were used to impute Demographics and FEC data where necessary. <br />
*Years available: 2002-2018*



## FEC Data<a name="FECData"></a>
FEC data comprises several potential predictors that are provided on a per district per state per year basis. The information can be summarized as either candidate-level information (e.g. names of candidates) or district-level information (e.g. total number of votes). Candidate-level information such as campaign financing/scandals would have been interesting to consider, and there is evidence that they are likely strong predictors of the outcome of a race (see fivethirtyeight). For district-level information, two observations were of particular interest: the number of votes cast for each candidate and the total number of votes cast, per district per state per year. From this information, we computed the winning candidate per district per state per year and, by joining on the candidate-level information (party affiliations of candidates), we arrived at our response variable (winning party per district per state per year).

<br />

Also, from the fraction of votes garnered by each candidate and their political affiliation, we computed several metrics which served as predictors. In particular, for a given response in year $t$, we computed metrics from the results of the election in year $t-2$. The following metrics were computed:

<br />

Let $w_{t}$ be the vote fraction earned by the winner in year $t$, and $l_{t}$ the vote fraction earned by the 2nd place candidate. For a response variable in year $t$,

**metric 1:** $$ w_{t-2} - l_{t-2}$$
**metric 2:** $$ w_{t-2} / l_{t-2}$$

Let $d_{t}, r_{t}$ be the vote fractions earned by the democrat and republican. For a response variable in year $t$,

**metric 3:** $$ d_{t-2} - r_{t-2}$$
**metric 4:** $$ d_{t-2} / r_{t-2}$$
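
The four metrics can be illustrated with a minimal sketch. The function name, the dictionary keys, and the handling of uncontested races (returning infinity for the ratio) are assumptions made for this example, not the code used in the project; it also assumes the Democrat and Republican were the top two candidates.

```python
def _ratio(numerator, denominator):
    # An uncontested race has no runner-up, so the ratio is undefined;
    # this sketch returns infinity in that case (an assumption).
    return numerator / denominator if denominator > 0 else float("inf")


def previous_election_metrics(dem_frac, rep_frac):
    """Return metrics 1-4 for one district from its election in year t-2.

    dem_frac, rep_frac: vote fractions (0 to 1) earned by the Democrat and
    the Republican, assumed here to be the top two candidates.
    """
    winner = max(dem_frac, rep_frac)
    loser = min(dem_frac, rep_frac)
    return {
        "metric_1": winner - loser,               # w - l (unsigned margin)
        "metric_2": _ratio(winner, loser),        # w / l
        "metric_3": dem_frac - rep_frac,          # d - r (signed margin)
        "metric_4": _ratio(dem_frac, rep_frac),   # d / r
    }


# Hypothetical district where the Republican won 55% to 45% in year t-2.
m = previous_election_metrics(dem_frac=0.45, rep_frac=0.55)
```

Metrics 1 and 2 describe how close the race was regardless of party, while metrics 3 and 4 keep the sign (or direction) of the partisan advantage.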

<img src="https://drive.google.com/uc?id=1tonymvR5GXiWQc3KZE7Hz8Z3kZAspB7d" >
<img src="https://drive.google.com/uc?id=1-QC__Diuq2AYcJEKRE80gtiMIRwVDSao" >

**Figure 1:** Margin of victory (see metric 1) for elections with flips (defined herein as an election where the party that wins a district differs from the party that won the previous election) and without flips. Notice that most of the flips are observed when the prior election was a closer race (i.e. lower margin of victory).

<br />


<img src="https://drive.google.com/uc?id=1NIT_9_WtcSplPck1mmu7pnEo_dbP1oV4" >

**Figure 2:** Identifying congressional districts that experienced a change in parties in the last 2 midterm elections. **Left:** Which party won each of the last 3 midterm elections for each congressional district that experienced at least 1 party flip (Blue: Democrat, Red: Republican). **Right:** Heatmap of margin of victory (Democrat vote % - Republican vote %), where blue and red correspond to Democrat and Republican, respectively.


### FEC Cleaning
The 2004-2016 FEC data was compiled by the MIT Election Data and Science Lab. Some of the party names were missing (NaN), so we filled in the parties for these candidates by looking up their profiles on Wikipedia. We also calculated the winner of each election using the number of votes. However, in two states (Louisiana and Georgia), the election goes to a runoff if no candidate achieves 50%. For elections that went to a runoff, our current model could be training on the wrong winner. The results of the elections for 2018 were scraped from Wikipedia and were already sufficiently clean (surprisingly!).


## Polling Data<a name="PollingData"></a>
Because of limited time and limited availability of free polling data online, we used only national-level polls. As polls are one of the top predictors used in actual models (see literature review), we believe that including district-level polls would considerably improve our estimates.

<br />

We used the “General Congressional Vote” poll. This is the simplest poll, which simply asks respondents “If you were to vote today, would you vote for a Democrat or Republican?” These data were scraped from Real Clear Politics (see references), which included estimates from a variety of pollsters. For each year, we scraped the polls in the last 2 weeks leading up to the election and took the average of those polls. Then, we calculated: 
$$ polling\_margin = \% Democrat - \% Republican $$

<br />

We included this national polling margin for each district in the year in which the poll was taken. 
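
As a rough illustration of the averaging step, the sketch below computes the national polling margin from a list of poll records. The record layout `(end date, % Democrat, % Republican)`, the function name, and the example numbers are all hypothetical; the scraped Real Clear Politics data may be structured differently.

```python
from datetime import date, timedelta


def national_polling_margin(polls, election_day, window_days=14):
    """Average (% Democrat - % Republican) over polls in the final window."""
    start = election_day - timedelta(days=window_days)
    margins = [
        dem - rep
        for poll_date, dem, rep in polls
        if start <= poll_date <= election_day
    ]
    if not margins:
        raise ValueError("no polls fall inside the window")
    return sum(margins) / len(margins)


# Hypothetical poll records: (end date, % Democrat, % Republican).
example_polls = [
    (date(2018, 10, 1), 48.0, 42.0),   # outside the 2-week window, ignored
    (date(2018, 10, 28), 50.0, 42.0),
    (date(2018, 11, 4), 49.0, 43.0),
]
margin = national_polling_margin(example_polls, date(2018, 11, 6))
```

Because the margin is a linear function of the two percentages, averaging the per-poll margins gives the same number as averaging each party's percentage first and then subtracting.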

## Demographic Data<a name="DemographicData"></a>

Political science research has widely covered the associations between demographics and voting behavior in the United States. Initially, we considered the idea of using "state" as a categorical predictor to use as a proxy for the demographics of a given region, but quickly realized that we needed more precision, since the demographics and voting behaviors of constituent districts within a state can vary widely. We aimed to gather demographic data at the congressional district level from the Census Bureau for as many years as possible.

<br />

While the classic Decennial Census takes place every 10 years, the Census Bureau introduced the American Community Survey (ACS) in 2005, which now includes many of the detailed questions that were previously only available in the Decennial Census "long-form" questions. The ACS is primarily available in two formats: 1-year estimates and 5-year estimates. The main trade-off between these two formats is currency vs. precision of estimates. Compared to the 1-year estimates, the 5-year estimates are collected over a longer range of time and computed using a larger sample size, reducing the margin of error. The trade-off is that the 5-year estimates are less current than the 1-year estimates, which can be problematic for regions experiencing rapid changes. As a rule of thumb, the Census Bureau recommends using the 5-year estimates when analyzing smaller populations, where precision is more important, while the 1-year estimates may be better utilized for analyzing larger populations. Also, since our problem is primarily a predictive one rather than inferential, we decided that using more current data for each year was more important.

<br />

Our data collection efforts started through a new tool launched by the Census Bureau called My Congressional District, a web-based tool that allows you to view and download data for each congressional district in the United States, with categories ranging from People, Workers, Housing, Socio-Economic, Education, and Business. Unfortunately, while this web interface is intuitive to use, it only allows access to the most recent 2017 data. We used the tool to pick out a subset of characteristics of interest based on existing research (and our own intuition and curiosity) and looked for alternative means to obtain the data.

<br />

While the Census Bureau collects an incredible amount of useful data for a wide range of impactful applications, it is still in the process of making this data easily obtainable, which admittedly, is not an easy task. We spent an extensive amount of time researching how to get access to the range of predictors we were looking for, first trying to make sense of the unintuitive API (which actually only allows access up to 2011 data), looking for Python wrappers to the API, and finally stumbling upon another web interface from the Census called the American Fact Finder.

<br />

With further experimentation, we discovered that many of the demographic factors we were looking for were available through a named report known as "S0201 - Selected Population Profile in the United States". Still, we encountered the issue that the format of the report prior to 2010 (the last Decennial Census) was sufficiently different and also not possible to download as a whole at the congressional district level. As a result, we decided to download all that we could (2010 to 2017) and compile it into one dataset.

<br />

What remained was an extensive data wrangling/cleaning effort, since even though the report format was similar across years, the same demographic characteristic would show up with a different name and identifying code in different years. We had to manually map these names across years, rename them to a common, significantly shortened name, and then concatenate each year's data into a single data frame. Also, the list of available characteristics would change year-to-year, so we had to find ones that were common across all years (even though they have slightly different names!). Finally, we extracted the state and district from a single combined text column, generating a key that we could use to join with our other datasets (FEC and polling).
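
The wrangling steps described above can be sketched as follows. Everything specific in this example is an assumption made for illustration: the raw column labels, the shortened names, the format of the combined geography label, and the use of district number 0 for at-large districts. The actual report labels and the project's code may differ.

```python
import re

# Hypothetical per-year mappings from raw report labels to short common names.
COLUMN_MAPS = {
    2016: {"Median age (years)": "median_age", "Unemployed": "unemployed_pct"},
    2017: {"Median Age": "median_age", "Percent unemployed": "unemployed_pct"},
}

# Assumed label format: "Congressional District 5 (115th Congress), Alabama".
DISTRICT_RE = re.compile(r"Congressional District (\d+|\(at Large\)).*?,\s*(.+)$")


def district_key(label):
    """Turn a combined geography label into a (state, district) join key."""
    match = DISTRICT_RE.search(label)
    if match is None:
        raise ValueError(f"unrecognized geography label: {label!r}")
    district, state = match.groups()
    number = 0 if district == "(at Large)" else int(district)
    return state.strip(), number


def harmonize(rows_by_year):
    """Rename columns per year, keep shared columns, and stack all years."""
    # Only characteristics present in every year's mapping are kept.
    shared = set.intersection(*(set(m.values()) for m in COLUMN_MAPS.values()))
    combined = []
    for year, rows in rows_by_year.items():
        mapping = COLUMN_MAPS[year]
        for row in rows:
            state, district = district_key(row["geography"])
            record = {"year": year, "state": state, "district": district}
            for raw_name, value in row.items():
                short = mapping.get(raw_name)
                if short in shared:
                    record[short] = value
            combined.append(record)
    return combined


# Hypothetical input: one row per district per year.
rows = {
    2016: [{"geography": "Congressional District 5 (115th Congress), Alabama",
            "Median age (years)": 38.1, "Unemployed": 5.0}],
    2017: [{"geography": "Congressional District (at Large) (115th Congress), Wyoming",
            "Median Age": 37.0, "Percent unemployed": 4.0}],
}
combined = harmonize(rows)
```

The resulting `(year, state, district)` fields play the role of the join key mentioned above.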

<br />

To be conservative, we ended up picking a fairly large subset of demographic characteristics (~20), planning to try them all and perform further variable selection to find out if any of them were either not particularly useful or significantly collinear with other predictors. Here are the ones we extracted out of 300+ possible predictors in the "Selected Population Profile in the United States" report:

<img src="https://drive.google.com/uc?id=1xZ0lG0cBLa7sK4WpCUqeDPEeTDSt4aQL" >

<br />

As a whole, the selection of these predictors reflected our interest in investigating whether gender, age/generation, household type, educational level, veteran status, proportion of new vs old residents, proportion of non-US born inhabitants, language, unemployment, urban vs. rural status, and income can help us predict how a congressional district may vote in a House race. In our modeling phase, we further cut down on this list since we were aware that many of these characteristics could be strongly related to one another.

<br />

One unfortunate fact about the "Selected Population Profile in the United States" report that we used was that it doesn't include information about race/ethnicity. These details are scattered across different report formats for different years, which may or may not be available at an aggregated congressional district level. Although we would've loved to use racial demographics, in the interest of time, we chose a few predictors that might serve as a very rough proxy.

<br />

Another primary issue was the fact that we were only able to obtain data going back to 2010. For years before 2010, we decided that we could either use 2010's data for all of those past years or try to impute the values by looking at historical trends or leveraging information regarding redistricting.


## Geographic Data<a name="GeographicData"></a>
### Motivation
District borders are not constant, and changes in district borders are rarely neutral. In most states, district borders are drawn by the state legislature. As a result, congressional districts are usually drawn with specific political goals in mind (i.e., to favor the party in power). Thus, understanding changes in district borders is important for predicting elections. 

<br />

There are two cases when the borders of districts change: 1) every 10 years, the entire country draws new borders to reflect changes in population distribution (this occurred between 2010 and 2012) and 2) if a state’s district borders are declared illegal by the courts, its districts must be redrawn, even though the population remains constant. 

<br />

In general, our model makes the assumption that District IDs are always correlated with the same population of people from year to year. However, this assumption is flawed. There are three ways a district can change between years: 1) the borders may change, 2) the district may move to an entirely new location within the state, such that it represents none of the same people, or 3) the district may cease to exist or start to exist (this case only happens every 10 years based on population change). All 3 of these cases were common in our dataset. More than 14% of our districts experienced redistricting throughout our study period, most of them between 2010 and 2012 when almost every district was redrawn. 

### Methods

If the borders of a district changed, we used a “population-weighted average” of the data we wished to impute:

$$ x_{current} = \sum_{i=1}^{435} w_i \, x_{prev,i} $$ 

Where $i$ indexes each district *in the previous election year*, $x_{prev,i}$ is the value of the quantity in previous district $i$, and $w_i$ is intended to estimate the fraction of the current district's population which comes from previous district $i$, as follows:

$$ w_i = \frac{A_{overlap,i}}{A_{current}} \cdot \frac{1}{A_{prev,i}} $$

Where $A_{current}$ is the area of the current year’s district, $A_{prev,i}$ is the area of previous district $i$, and $A_{overlap,i}$ is the area of overlap between the current district and previous district $i$. $\frac{A_{overlap,i}}{A_{current}}$ represents the fraction of the new district’s area which comes from the old district. $\frac{1}{A_{prev,i}}$ is proportional to the population density of the previous district, because each district has ~711,000 inhabitants; we consider this an acceptable approximation.  
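
A minimal sketch of the weighting is shown below. It takes the overlap areas as given (computing them from district shapefiles is outside this example), and it rescales the weights so they sum to 1. That normalization is an assumption of this sketch and is not stated in the formula above; set `normalize=False` to apply the formula exactly as written. The function name and the example areas are hypothetical.

```python
def impute_from_previous(overlaps, current_area, normalize=True):
    """Population-weighted average of previous-district values.

    overlaps: list of (overlap_area, prev_area, prev_value) tuples, one per
        previous district that intersects the current district.
    current_area: area of the current district, in the same units.
    normalize: rescale the weights to sum to 1 (an assumption of this sketch).
    """
    # w_i = (A_overlap / A_current) * (1 / A_prev)
    weights = [
        (a_overlap / current_area) * (1.0 / a_prev)
        for a_overlap, a_prev, _ in overlaps
    ]
    if normalize:
        total = sum(weights)
        if total == 0:
            raise ValueError("current district overlaps no previous district")
        weights = [w / total for w in weights]
    return sum(w * value for w, (_, _, value) in zip(weights, overlaps))


# Hypothetical current district of area 100 built from two previous districts:
# 60 units from a large (sparse) district with signed margin +0.10, and
# 40 units from a small (dense) district with signed margin -0.20.
imputed_margin = impute_from_previous(
    overlaps=[(60.0, 200.0, 0.10), (40.0, 50.0, -0.20)],
    current_area=100.0,
)
```

In this example the smaller, denser previous district dominates the result even though it contributes less area, which is the intended effect of the $\frac{1}{A_{prev}}$ term.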

<br />

This data was then used in three ways:
1. *To impute the margin of victory when redistricting occurred.* <br />
If the district moved entirely, we wanted this to be reflected in the “previous winner” and “previous winner margin” predictors. We imputed the previous winner wherever more than 10% of the district's area had changed. 
2. *To impute demographics data to pre-2010 years.* <br /> 
Since we only had 5 election years of demographics data, we were limited to using just 4 elections to train our model. Imputing demographics data backward allowed us to use the full 8 election years of FEC data along with some reasonable estimates of demographics. 
3. *To drop districts where redistricting had occurred.*<br /> 
We tested this approach, but it was difficult to determine its success. For 2018, it involved dropping the entire state of Pennsylvania. It was hard to tell if our predictions really improved, or if we had just removed a “swing state” for which we already had poor predictions. Thus, it is not included in this report. 


# Literature Review/Related Work<a name="LiteratureReviewRelatedWork"></a>

## Model Development<a name="ModelDevelopment"></a>
We learned a lot from FiveThirtyEight’s comprehensive description of their model for the 2018 midterm elections [3]. Primarily, we learned which predictors we should emphasize when collecting data for our model. These were the top four predictors according to FiveThirtyEight in order of importance. The following list is a direct quote from [3]:

>* **The incumbent’s margin of victory in his or her previous election**, adjusted for the national political environment and whom the candidate was running against in the prior election.
>* **The generic congressional ballot**
>* **Fundraising**, based on the share of individual contributions for the incumbent and the challenger as of the most recent filing period.
>* **FiveThirtyEight partisan lean**, which is based on how a district voted in the past two presidential elections and (in a new twist) state legislative elections. In our partisan lean formula, 50 percent of the weight is given to the 2016 presidential election, 25 percent to the 2012 presidential election and 25 percent to state legislative elections.

<br />

Demographic data is not discussed in this ranking of predictors because FiveThirtyEight employs a system called CANTOR to handle their demographics. CANTOR classifies districts based on their demographics. When states redistrict, FiveThirtyEight uses CANTOR and a kNN to impute affected districts with the data from the districts closest to them in demographics. 

<br />

Our model largely represents our best attempt to replicate this model where we can, and to find other solutions where we cannot. Because of the limited time frame, the amount of free data we were able to find online and clean was extremely limited. 

<br />

For the incumbent’s margin of victory in the previous election, we used FEC data. Unlike FiveThirtyEight, we did not directly adjust this number for the political environment and opposing candidate. We found the generic congressional ballot averaged over the entire nation, but were unable to find it at the district level. We found fundraising data, but were unable to find data for 2018 in time to process it, so we did not use this variable at all. Our equivalent to partisan lean was which party won in the previous election. And finally, instead of CANTOR, we used demographics directly as a predictor. 

<br />

With these different predictors, we ended up with a very different model than FiveThirtyEight, and a very different order of importance of our predictors. See the results for more details. 

<br />

Because we didn’t have sophisticated enough demographics data (i.e., we didn’t have demographics for very many years, and did not have important predictors like race), we were not able to impute using kNN as FiveThirtyEight does with CANTOR. Instead, we decided to use a “population overlap” model to estimate changes when redistricting occurs. 



## Redistricting <a name="Redistricting"></a>
The inspiration for our redistricting approach came from an unexpected source: regridding satellite data. “Level 2” satellite data products include pixels which may be irregular shapes. In order to make these products usable for the average scientist, they must be placed on a regular grid of equally-sized grid boxes. This “gridded” product is known as “Level 3” satellite data.

<br />

Regridding satellite data and imputing data when congressional districts change are different problems, but they have the same fundamental geometrical solution. The weighting solution was inspired by the algorithm used by the NASA OMI Level 3 data product to put irregular pixels onto a regular grid [5]. Their formula: 
$$ w_j = w_{Ai} \cdot Q_{ij} $$
where $w_j$ is the weight, $w_{Ai}$ is the inverse of the area of the pixel, and $Q_{ij}$ is the fractional overlap between the pixel and the grid cell, is very similar to the population overlap formula used here. Both algorithms use percent overlap. And instead of the size of the pixel, we are worried about population density… which happens to be inversely proportional to the area of the district. In contrast to satellite regridding, we determined that we did not need much precision; thus, we did not account for the curvature of the earth in our calculations. 

<br />

Code, including the packages used and the structure of dictionaries, was also inspired by the Wisconsin Horizontal Interpolation Program for Satellites (WHIPS), an open-source program which employs the NASA regridding algorithm [6]. 



# Modeling Approach<a name="ModelingApproach"></a>

The aim of the project is to predict the district winners for the House of Representatives in the 2018 election. The quantity we are trying to predict is categorical (whether the winner of a district is a Democrat or a Republican), so this is a classification problem. Our modeling approach is to use a set of classification algorithms trained on different datasets and predictors to find the combination that most accurately predicts the district winners for 2018.  

### Baseline

Our baseline approach is to run a set of models with a limited number of predictors and assess the accuracy achieved. Using a small set of predictors reduces model complexity and decreases potential for overfitting. Simpler models also allow us to better understand the relationship between the predictors and the response variable. 

### Extensions

Our modeling extension is to incorporate more features into our baseline dataset and run the predefined set of models on each combination. For this we use a combination of FEC data, national polling data, demographics data, and our approach for redistricting. The goal of the extension model is to explore whether adding more predictors to the baseline dataset captures more information about the districts that change parties in the 2018 election compared to the previous election, while keeping good classification accuracy on the districts whose winner is from the same party as in the previous election. 



## Models Used<a name="ModelsUsed"></a>

The table below shows a summary of the models chosen and the reason for their use.

<img src="https://drive.google.com/uc?id=1y09ctZcQfx7KLnLoJDe5GRE83KgOL5hJ">


## Baseline #1<a name="Baseline1"></a>

The baseline model refers to the use of a subset of predictors from the FEC data. The predictors for the baseline model are election results from the previous year and the state the district belongs to. Below is a summary of the accuracy of each model.


<img src="https://drive.google.com/uc?id=1Dv--K659WwweWNRwpp4C-7ZO_QFqGmlL" >

In the table above we see that even though the logistic regression model achieves ~88% accuracy on the test data, all the predictions that are correct are those for which the winner of the district is from the same party as the winner of the previous election. We call this approach the baseline model because, for all observations in the test data, the predicted party is always the same as the winner of the previous election. The QDA model shows something different: with a lower $R^2$ score and lower test accuracy, QDA is able to correctly predict 20% of the districts in the test data whose party changed from the previous election winner. However, we gain this at the expense of reducing classification accuracy on the districts whose winner is from the same party as in the previous election. 
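
To make the distinction between overall accuracy and accuracy on flipped districts concrete, here is a small sketch that splits the test set by whether the winning party changed. The function name and the five-district example are hypothetical and are not the project's evaluation code.

```python
def flip_aware_accuracy(prev_winner, actual, predicted):
    """Overall accuracy plus accuracy on flipped and non-flipped districts."""

    def accuracy(indices):
        if not indices:
            return float("nan")
        correct = sum(actual[i] == predicted[i] for i in indices)
        return correct / len(indices)

    everything = list(range(len(actual)))
    # A district "flips" when its winner differs from the previous winner.
    flipped = [i for i in everything if actual[i] != prev_winner[i]]
    stayed = [i for i in everything if actual[i] == prev_winner[i]]
    return {
        "overall": accuracy(everything),
        "flipped": accuracy(flipped),
        "non_flipped": accuracy(stayed),
    }


# Hypothetical example: one of five districts flips from R to D.
prev = ["R", "R", "D", "D", "R"]
actual = ["R", "D", "D", "D", "R"]

# A model that always repeats the previous winner.
scores = flip_aware_accuracy(prev, actual, predicted=prev)
```

In this toy example, repeating the previous winner scores 80% overall and 100% on non-flipped districts, but 0% on the flipped district, which mirrors the behavior described for the logistic regression baseline.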

## Baseline #2: Extended<a name="Baseline2"></a>


The Baseline Extended Model attempts to capture more information about the districts whose winning party changes, using a different subset of data from the FEC. We attempted to choose predictors which keep good classification accuracy on the districts whose winner is from the same party as the winner of the previous election. This model uses the margin by which the winner of the previous election beat the losing party and the state the election took place in. 

The purpose of this exercise is to explore the limits of our baseline dataset. 

Below is a table summarizing the results. 

<img src="https://drive.google.com/uc?id=1-XCal5Q-0GpqN0xZlr3kgxakxtMi7lw7" >

With the baseline extended model we see that we are able to increase correct predictions on the districts that flip by 90%; however, this is at the expense of misclassifying more districts whose winning party does not change from the previous election. QDA results are very similar to the previous model, so adding more predictors did not improve the QDA model. However, the rest of the models did see an improvement in the classification accuracy on the districts that do flip. 

### Baseline Extended Takeaway


The majority of the districts vote for the same party that won the previous election, and only a small number of districts see a party switch. Because the loss function rewards the total number of correct predictions, and so few districts switch parties, it is challenging to create a model which accurately predicts which districts will flip. 

<br />

Given that the goal of the project is to predict the winning party for each district in 2018, we still need to do well on the districts that do not switch parties, but we would like to capture at least some of the districts that do switch. Using only FEC data, the best model we can achieve is the boosting model, which achieves 87% overall test accuracy on 2018, correctly predicts 97% of the districts that do not switch parties between elections, and captures only 6% of the districts that do switch parties. 

<br />

Based on these two baseline models, we conclude that more data sources are needed in order to improve the accuracy of the model on districts which flip. 

# **Model Extensions**<a name="ModelExtensions"></a>

In an effort to improve our predictions on the test set while capturing information about the districts whose winning party changes from the last election, we explored a set of extensions that use combinations of different datasets with the models described above. 

Below is a summary of the different extensions we looked at:

<img src="https://drive.google.com/uc?id=1vUPG-ZGY5wgs0lulPwLTwf4Tg8DnK56a">

## Changes in Project Goals<a name="ChangesinProjectGoals"></a>

The original baseline model, which used logistic regression with a small set of predictors, learned that most congressional districts do not change parties in consecutive elections, and so for every district it predicted the winner to be from the same party as in the previous election. This gave us good classification accuracy, as the majority of the districts vote the same way as they did in the previous election; however, it was not able to predict a single district whose winner was from a different party than in the previous election. 

<br />

To improve the classification accuracy on districts whose party changes from the previous election, we tried upsampling the observations where this event happens (which are very few) and using AdaBoost to assign greater weight to districts that were incorrectly classified. Upsampling did not improve the classification accuracy; however, AdaBoost was able to capture some of the districts that voted for a different party than they did in the previous election. For this reason we added AdaBoost to our list of models to try. 
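
For reference, random upsampling of the rare class can be sketched as below. Duplicating flipped rows until the two classes are the same size is an assumption of this example (the target ratio used in the project is not stated), and the function name and row format are hypothetical. Upsampling would normally be applied to the training set only.

```python
import random


def upsample_flips(rows, is_flip, seed=0):
    """Randomly duplicate flipped rows until both classes are the same size."""
    flips = [r for r in rows if is_flip(r)]
    others = [r for r in rows if not is_flip(r)]
    if not flips or len(flips) >= len(others):
        return list(rows)  # nothing to balance
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    extra = rng.choices(flips, k=len(others) - len(flips))
    return others + flips + extra


# Hypothetical training rows: 2 flipped districts out of 10.
train_rows = [{"district": i, "flip": i < 2} for i in range(10)]
balanced = upsample_flips(train_rows, is_flip=lambda r: r["flip"])
```

The duplicated rows add no new information about flipped districts, which may be one reason this kind of resampling gives limited gains when flips are very rare.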



## Variable Selection <a name="VariableSelection"></a>

Of the 24 possible predictors in the baseline model, we handpicked 10 of them as likely good predictors of the election. However, we wanted to make sure that none of the predictors were highly collinear. Collinearity would cause our models to find suboptimal solutions because one of the assumptions in logistic regression is that the variables are independent of one another. 

<img src="https://drive.google.com/uc?id=11rRkedylHiPwSA6I9Ef3DGkUCiJAKg_x" >

**Figure 3:** Collinearity of predictors considered. Color bar represents the correlation coefficient between the two predictors where 1.0 represents perfect correlation and 0.0 represents no correlation. 

<br />

Unsurprisingly, we saw that `margin_signed_minus_prev` (% of votes received by the Democrat - % of votes received by the Republican candidate in the previous election) and `dem_win_prev` (whether a Democrat won the previous election) were highly correlated. Of the two, we chose to use only `margin_signed_minus_prev` because it provided additional information about how decisive the victory was, rather than simply who won. Immediately after we removed `dem_win_prev`, we saw increases in prediction of flipped races (from 0% to 20%+) without much change in the overall accuracy. 
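
A collinearity check of this kind can be sketched with a plain Pearson correlation over every pair of predictors. The 0.8 threshold and the four-row toy data are assumptions made for illustration; only the predictor names come from the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """List predictor pairs whose absolute correlation meets the threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Toy data (made up): four districts, three predictors.
toy = {
    "margin_signed_minus_prev": [-0.3, -0.1, 0.2, 0.4],
    "dem_win_prev": [0, 0, 1, 1],
    "national_poll": [1.0, -2.0, 0.5, -1.5],
}
flagged = correlated_pairs(toy)
```

In the toy data only the `dem_win_prev` / `margin_signed_minus_prev` pair is flagged, since the first is just the sign of the second.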

<br />

The only other notable correlation was between `foreign_to_native_born_ratio` and `civilian_veteran_pct`. We thought this reflected the urban/rural divide rather than any causation, but the correlation coefficient was high enough that we removed `civilian_veteran_pct` to reduce collinearity. 

<br />

Interestingly, `labor_force_unemployed_pct` was moderately correlated with `national_poll`. Since we were only looking at the last 3 years, the economy happened to be recovering from the 2008 recession (thus, decreasing unemployment everywhere) at the same time as the Republicans were gaining support. Whether this is a causal relationship is up for debate. In the end, we removed `labor_force_unemployed_pct` not only because of this correlation, but also because it cannot be imputed to prior years in the same way as other demographics (unemployment changes quickly between years). 



# **Results**<a name="Results"></a>








## Evaluation of Datasets<a name="EvaluationofDatasets"></a>

A model which uses only FEC and polling data for eight elections is shown in Figure 4 below. With only FEC and polling data for eight elections, we were able to attain a high level of overall accuracy, but this was primarily achieved by predicting that the previous election's winner would win again. As a result, each model correctly predicted almost all districts which did not flip, but none of the districts which flipped. Although we are using more years of data, this is still very similar to the baseline model. As discussed above with the baseline models, this indicates that more information is needed to improve the model. 

<br />

**Extension 1: Full FEC Data: 2004-2018** 

<img src="https://drive.google.com/uc?id=1cZcm7n0zkzXXCtL2QBZCQ903r6r_mkeu" >

**Figure 4:** This model uses only the FEC and polling data. States are not dropped if they were redistricted the prior year. The first panel shows accuracy for the training set, the testing set, and for flipped and nonflipped districts in the testing set. The second panel shows logloss. The third panel shows the percentages predicted by the Logistic Regression model. Above 0.5 on the y axis indicates a democrat was predicted, below 0.5 indicates a republican was predicted. Colors indicate the true value. Predictors used:
`national_poll`, `margin_signed_minus_prev`


<br />

Figure 5 below shows a model which uses both FEC and Demographics data, but only for four elections. With only four elections of data, we were able to improve the prediction of districts which flipped from 0% to 40% (in the case of logistic regression). 40% is still not a high rate of prediction. However, more important than whether the model made a correct prediction is whether the model was able to give reasonable probabilities for each result. This is measured by log loss, which was fairly low.

<br />

The FEC & Demographics model using only four years (Figure 5) was by far our best, both in terms of raw prediction and in terms of applying acceptable probabilities to the predictions. This is the only model which correctly predicted that the democrats would win the house in terms of the raw prediction (to get a true estimate of which party it predicted, one would need to run this model multiple times and obtain the distribution of outcomes). 
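
One possible way to obtain such a distribution of outcomes is to treat each district's predicted probability as a coin flip and simulate the House many times. The sketch below is a hedged illustration, not something we ran: the function names and the probabilities are hypothetical, and districts are assumed to be independent, which real elections are not.

```python
import random


def simulate_dem_seats(dem_probs, n_sims=10_000, seed=0):
    """Draw each district as an independent coin flip and count Democratic seats.

    Independence between districts is a simplifying assumption; real races
    share national swings, so the true spread is likely wider.
    """
    rng = random.Random(seed)
    return [sum(rng.random() < p for p in dem_probs) for _ in range(n_sims)]


def majority_probability(seat_counts, n_districts):
    """Fraction of simulations in which Democrats win more than half the seats."""
    needed = n_districts // 2 + 1
    return sum(s >= needed for s in seat_counts) / len(seat_counts)


# Placeholder probabilities, invented for illustration only.
toy_probs = [0.95, 0.90, 0.60, 0.55, 0.45, 0.10, 0.05]
seats = simulate_dem_seats(toy_probs, n_sims=2000)
p_majority = majority_probability(seats, len(toy_probs))
```

With the real model, `dem_probs` would be the 435 predicted probabilities of a Democratic win, and `p_majority` would estimate how often the model expects the democrats to win the House.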

<br />

**Extension 2:  FEC and Demographics Data: 2010-2018**

<img src="https://drive.google.com/uc?id=1pxEzsR2KJSNqY-_7pA8hpFy249Y63ERf" >

**Figure 5:** This model uses FEC and selected demographics data, but only trains on four elections. The first panel shows accuracy for the training set, the testing set, and for flipped and nonflipped districts in the testing set. The second panel shows logloss. The third panel shows the percentages predicted by the Logistic Regression model. Above 0.5 on the y axis indicates a democrat was predicted, below 0.5 indicates a republican was predicted. Colors indicate the true value. Predictors used: 
`national_poll` 
`margin_signed_minus_prev`
`female_pct`
`foreign_to_native_born_ratio`
`age18_24_pct`
`age25_34_pct`

<br />

Figure 6 below shows a model which used imputing to achieve a full dataset. See the section on geographic data for a description of the algorithm used to impute data. Interestingly, this model is similar to the model which used only four years of data. From the LogReg Results panel, we can see that for both of these models, most of the districts which were incorrectly predicted were within 20 percentage points of the 50% line, indicating that the model did not have high confidence in its incorrect predictions.

<br />

**Extension 3:  FEC and Demographics Data and Redistricting: 2004-2018**

<img src="https://drive.google.com/uc?id=1DfU_d5hh-XQI64QT907BTL-3femgVQlG">

**Figure 6:** This model uses FEC and selected demographics data. In addition, demographics data was imputed backward from 2010. Then, when states were redistricted, we imputed `margin_signed_minus_prev`. The first panel shows accuracy for the training set, the testing set, and for flipped and nonflipped districts in the testing set. The second panel shows logloss. The third panel shows the percentages predicted by the Logistic Regression model. Above 0.5 on the y axis indicates a democrat was predicted, below 0.5 indicates a republican was predicted. Colors indicate the true value.
Predictors: 
`national_poll`
`margin_signed_minus_prev`
`female_pct`
`foreign_to_native_born_ratio`
`age18_24_pct`
`age25_34_pct`


<br />

Interestingly, the model which only used four years of data (Figure 5) performed better than the model which imputed results across all years (Figure 6). This is despite the four-year model having less data, and having incorrect prior-election FEC data for any district that was substantially redistricted. Perhaps this added “noise” forces the model to use more of the demographics, which creates a better result. On the other hand, perhaps imputing four years (that is, half our data) of demographics data added too much inaccurate, "estimated" demographics data for the model using imputed data to pick up on demographics trends.

<br />

In all the models, there were a few elections which the model predicted with a high degree of confidence but got wrong. This is the most worrying type of prediction, and it indicates that the model completely missed this race. Most of these were districts which flipped, which we already know are the hardest to predict. However, the fact that the model considered these "certain" indicates that we were missing some key information on these races. We discuss which data we would use in the Future Work section.

## Evaluation of Models<a name="EvaluationofModels"></a>

We used two metrics to evaluate our models: accuracy and log loss. For a probabilistic model such as the one we built, accuracy only tells part of the story. We acknowledge that because our information is incomplete, the best model we can build is one that doesn’t always get the right answer, but which assigns a high degree of uncertainty to its predictions when they are incorrect. Log loss is a metric which is designed to score a model based on the probabilities it assigns to its predictions.
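
To make the difference between the two metrics concrete, here is a minimal sketch of binary log loss, $-\frac{1}{N}\sum_i \left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]$, written out by hand. The function name and the example probabilities are illustrative assumptions; `sklearn.metrics.log_loss` computes the same quantity.

```python
from math import log


def binary_log_loss(y_true, p_dem, eps=1e-15):
    """Mean negative log-likelihood; y_true is 1 for a Democratic win, else 0."""
    total = 0.0
    for y, p in zip(y_true, p_dem):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)


# Two hypothetical models that both miss the same (Democratic) district.
hedged = binary_log_loss([1], [0.45])     # wrong, but close to the 50% line
confident = binary_log_loss([1], [0.02])  # wrong and nearly certain
```

Both predictions count as equally wrong for accuracy, but the confident miss is penalized several times more heavily by log loss.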

<br />

We scored these metrics on four subsets of the data: the full training set, the full testing set, the districts which didn’t flip in the testing set, and the districts which flipped in the testing set. Predictably, all models performed best on the training set, equally or slightly less well on the testing set, extremely well on districts which didn’t flip, and poorly on districts which did flip. This result was uniform across models.
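
The split into flipped and non-flipped districts can be sketched as follows. This is an assumed, simplified version of the scoring (labels are 1 for a democrat and 0 for a republican, and the function names and toy labels are hypothetical), shown here for a model that simply repeats the previous winner.

```python
def accuracy(y_true, y_pred):
    """Fraction of matching labels, or None for an empty subset."""
    if not y_true:
        return None
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_by_flip(prev_winner, winner, predicted):
    """Split test-set accuracy into flipped and non-flipped districts.

    A district counts as flipped when this election's winner differs from
    the previous winner. Labels: 1 = Democrat, 0 = Republican.
    """
    rows = list(zip(prev_winner, winner, predicted))
    flipped = [(w, p) for prev, w, p in rows if prev != w]
    stayed = [(w, p) for prev, w, p in rows if prev == w]
    return {
        "overall": accuracy(winner, predicted),
        "flipped": accuracy(*zip(*flipped)) if flipped else None,
        "nonflipped": accuracy(*zip(*stayed)) if stayed else None,
    }


# Invented labels: a model that always repeats the previous winner.
prev = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]
true = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
scores = accuracy_by_flip(prev, true, predicted=prev)
```

On these toy labels the overall accuracy is 90%, the non-flipped accuracy is 100%, and the flipped accuracy is 0%, which is the pattern that makes overall accuracy misleading on its own.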

<br />

However, LDA, QDA, and LogReg performed consistently better than kNN, Boosting, or Random Forest. The latter models, which allow for more complicated decision boundaries, may either overfit the training data, or be too “good” at finding the best predictor (the winner of the previous election), at the expense of the districts which flipped.

<br />

Interestingly, although the flip accuracy varied quite a lot between the different datasets for LDA, QDA, and LogReg, the flip logloss did not change at all. This may indicate that most of our models were quite similar, but that small variations in information allowed them to place flipped districts slightly more on one side of the 50% line than the other. 

# Future Work<a name="FutureWork"></a>

There are several directions that future work can explore.

<br />

On the data front, the current set of features could be expanded to include finer, candidate-level information such as campaign financing, which is reported in similar work to be a strong predictor (see FiveThirtyEight), and other features could certainly be explored (for example, how common the candidate's name is). Polling at levels below national (e.g. state-level polling) is also reported in similar work as a strong predictor.

<br />

On the modelling front, it would be more appropriate to consider architectures that more faithfully represent the real world. This could again take several directions. For example, Figure 7 below plots the geographical location of districts that flipped between 2010 and 2016. It is clear that not all states are 'created equal' when it comes to flipping districts: the larger states (California, Florida, Texas, etc.) are very likely to contain a district that flips, while many other states appear to have no flipped districts at all. Flips also appear to be relatively clustered geographically. One way to exploit this clustering would be to use a kNN model that identifies nearest neighbors by geographical distance.

<img src="https://drive.google.com/uc?id=1yiyKuMaNu4RSWrBRWaICV-38HxAKWq_i">

**Figure 7:** Geographical distribution of party flips across election years.
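
As a rough sketch of the geographic kNN idea, the snippet below votes over the nearest district centroids using great-circle (haversine) distance. We did not build this model: the function names, `k=3`, the coordinates (approximate city centers rather than real district centroids), and the flip labels are all assumptions for illustration.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def knn_flip_vote(query, centroids, flipped, k=3):
    """Majority vote over the k geographically nearest district centroids."""
    order = sorted(range(len(centroids)), key=lambda i: haversine_km(query, centroids[i]))
    votes = Counter(flipped[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Rough, illustrative coordinates (city centers, not real district centroids),
# with invented flip labels (1 = flipped, 0 = did not flip).
centroids = [(34.05, -118.24), (32.72, -117.16), (37.77, -122.42), (41.88, -87.63), (40.71, -74.01)]
flipped = [1, 1, 0, 0, 0]
prediction = knn_flip_vote((33.8, -117.9), centroids, flipped, k=3)
```

In practice the geographic vote would more likely be one input alongside the FEC and demographics predictors than a standalone model.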

# Conclusions and Summary<a name="ConclusionsandSummary"></a>

At the outset of the project, we naively hoped to create a model which would predict the 2018 elections with a high accuracy score. We soon learned that no matter what datasets we threw at the model, all of them were insufficient to create a model with a better *accuracy score* than our baseline model. Instead, we refocused our efforts to create a model which had more realistic *probabilities* on the outcome. This probabilistic model of the 2018 elections allowed us to use the information we had while also acknowledging the limitations of our data. 

<br />

**Successes:**

We **improved our prediction of flipped districts** over the baseline model by adding demographics data and using only the most meaningful data from prior elections. Demographics allowed us to further constrain if a district was likely to flip. 

We found a way to **address redistricting** without the comprehensive demographics data used by FiveThirtyEight. By estimating the percent of population which moved from one district scheme to the next using the overlap of district boundaries, we found a tolerable way to impute prior election data. 

The probabilistic models enabled us to gain a deeper understanding of how our models were working. In the end, we had only a few data points which were predicted with a high degree of confidence but which were incorrect.
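
The overlap-based imputation mentioned above can be illustrated with a weighted average. This is a simplified sketch and not the code in our repository: it assumes the overlap shares have already been computed from the shapefiles, treats area share as a stand-in for population share, and uses hypothetical district ids and margins.

```python
def impute_prev_margin(overlap_share, old_margin):
    """Weighted average of old-district margins for one new district.

    overlap_share maps old district id -> share of the new district that
    came from it (here assumed to be an area fraction used as a stand-in
    for population). Shares are renormalized so they sum to 1.
    """
    total = sum(overlap_share.values())
    if total == 0:
        raise ValueError("new district overlaps no old district")
    return sum(share * old_margin[d] for d, share in overlap_share.items()) / total


# Hypothetical ids and numbers, for illustration only.
old_margin = {"XX-01": 0.20, "XX-02": -0.10}
new_district_overlap = {"XX-01": 0.75, "XX-02": 0.25}
imputed = impute_prev_margin(new_district_overlap, old_margin)
```

Here a new district drawn three quarters from a +20 district and one quarter from a -10 district is assigned an imputed `margin_signed_minus_prev` of +12.5 points.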

<br />

**Challenges:**

Our final model ran into two fundamental problems: our **imbalanced dataset** and **insufficient data**. 

Because only a small number of districts flipped (i.e., voted for a different party than they did in the previous election), the model could achieve high accuracy by predicting that none or almost none of the districts would flip. This is the classic problem of the imbalanced dataset. Although we tried upsampling the prior dataset, a further step would be to change our performance metric to one specifically designed to combat this problem, such as Cohen’s Kappa.
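
As an illustration of why Cohen’s Kappa helps with imbalance, the sketch below computes it by hand for a model that never predicts a flip. The function name and the labels are invented for the example; `sklearn.metrics.cohen_kappa_score` provides a library implementation.

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions beyond chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum(
        (sum(t == c for t in y_true) / n) * (sum(p == c for p in y_pred) / n)
        for c in labels
    )
    if expected == 1:
        return 0.0  # degenerate case (a choice made here): only one label present
    return (observed - expected) / (1 - expected)


# Invented labels: 1 = district flipped, 0 = it did not. One flip in ten.
truth = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
never_flip = [0] * 10
kappa = cohens_kappa(truth, never_flip)
```

The never-flip model scores 90% accuracy on these labels but a kappa of 0, because its agreement with the truth is no better than what the class frequencies alone would produce.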

We were also plagued by insufficient data, both in terms of the number of years the data covered, and the number of meaningful predictors we had available. One issue that we ran into repeatedly was that running a new model was a 5-minute problem... but cleaning a new dataset is a 5-hour problem! Fortunately, the solution to this problem, more data, does not require any ingenuity, only time. The datasets which we believe would improve our model the most are *district-level polls*, *fundraising* data, and *demographics data*, including race & ethnicity, for the full time period on which we trained our model.


# References<a name="References"></a>


[1] FEC results from (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IG0UN2) and (https://www.fec.gov/introduction-campaign-finance/election-and-voting-information/#election-results)

[2] 2018 House election results from https://en.wikipedia.org/wiki/2018_United_States_House_of_Representatives_elections (accessed 12072018)

[3] Methodology/Importance of Predictors from
https://fivethirtyeight.com/features/2018-house-forecast-methodology/

[4] Shape Files: 
*2002-2014:* (http://cdmaps.polisci.ucla.edu/) and *2016-2018:* (https://www.census.gov/geo/maps-data/data/tiger-line.html)

[5] Inspiration for shapefile algorithm: https://acdisc.gesdisc.eosdis.nasa.gov/data/Aura_OMI_Level3/OMNO2d.003/doc/README.OMNO2.pdf, Section 6

[6] Inspiration for shapefile code: https://nelson.wisc.edu/sage/data-and-models/software.php 

[7] Demographics from: https://www.census.gov/programs-surveys/acs/