<h1 align="center">A Candidate-based Model of (Re)Election in the US House of Representatives</h1>
<h3 align="center">Cameron MacDonald</h3>
<h4 align="center">PPOL564: Data Science 1</h4>
<h4 align="center">Final Project</h4>

<h2 align="center">Problem Statement and Background</h2>

<h3 align="center">When I say 'election model' you probably think of 538.</h3>

![image.png](attachment:image.png)

These sorts of models rely heavily on public-opinion polling.

Of course, these models have some **problems**.

- Fundamental issues:

    - Poll-based models are very responsive to the news cycle, perhaps more so than the median voter.

    - More than that, pollsters have had trouble dealing with sampling bias in their polling. This problem became more pronounced during the pandemic.

    - In lots of cases, these models do not give us a great idea about the result until very close to the election.

So, what's the **solution**?

- This model aims to determine whether a representative will be re-elected using only data that is available when a representative begins their term in office.
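
As a rough illustration of the outcome this model targets, the sketch below builds a binary re-election label from a table of election winners. The data, the column names (`name`, `district`, `year_elected`), and the matching rule (same name wins again two years later) are all assumptions made for this example, not the project's actual pipeline.

```python
import pandas as pd

# Hypothetical toy data: one row per representative per election won.
# Names, districts, and years are made up for illustration.
wins = pd.DataFrame(
    {
        "name": ["A. Smith", "A. Smith", "B. Jones", "C. Lee", "C. Lee", "C. Lee"],
        "district": ["XX-01", "XX-01", "XX-02", "XX-03", "XX-03", "XX-03"],
        "year_elected": [2010, 2012, 2010, 2008, 2010, 2012],
    }
)

# House terms last two years, so a representative elected in year t counts
# as re-elected if the same name also appears as a winner in year t + 2.
won = set(zip(wins["name"], wins["year_elected"]))
wins["reelected"] = [
    int((name, year + 2) in won)
    for name, year in zip(wins["name"], wins["year_elected"])
]

# The last election in the sample has no observed follow-up, so its label
# is unknown rather than 0; drop those rows before modelling.
last_year = wins["year_elected"].max()
labeled = wins[wins["year_elected"] < last_year].reset_index(drop=True)

print(labeled)
```

Matching on name alone is a simplification: a real data set would probably need a stable identifier per representative to handle name collisions and spelling variants.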

<h3> Why? </h3>

- The earlier a political party or consulting firm can get an idea about the likelihood of (re)election, the earlier they can allocate their resources and plan.

<h3> How? </h3>

- The political science literature already identifies certain overriding effects on election outcomes, including:

    - incumbency

    - long-term economic trends

    - specific political features, like midterm elections

<h2 align="center">Methods and Approaches</h2>

<h3 align="center">Considered</h3>

- Originally, the outcome of interest was the number of years an individual would serve in Congress.

- Similarly, we might have considered using data from the first year of the representative's two-year term.

<h3 align="center">Utilized (to date)</h3>

### Tools

- Pandas (for manipulating the data frame)

- BeautifulSoup (for scraping HTML)

- matplotlib (for plotting and investigation)

<h2 align="center">Preliminary Results and Conclusions</h2>

First, we consider the results of data wrangling and how they inform the scope of our investigation.

- One of the major results of the data exploration and wrangling process has been determining the time range for investigation.

- The number of observations increases as we go further back in time, at the cost of the number of variables for each observation.

- We only need to go as far back as about 22 years to reach 10,000 observations. Going back to 1950 yields over 30,000 observations.
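
One way to sanity-check these counts is a back-of-the-envelope calculation. The sketch below assumes that an observation is one House seat in one year and that the chamber has a constant 435 voting seats over the window. Both are assumptions made for this example, as is the reference year; the source does not define what counts as an observation.

```python
SEATS = 435            # assumed constant number of voting House seats
REFERENCE_YEAR = 2021  # hypothetical "present" year; not stated in the text


def seat_years(years_back, seats=SEATS):
    """Observations gained by going back `years_back` years, one per seat per year."""
    return years_back * seats


recent = seat_years(22)                         # the "about 22 years" window
since_1950 = seat_years(REFERENCE_YEAR - 1950)  # everything back to 1950

print(recent, since_1950)
```

Under these assumptions the two windows give 9,570 and 30,885 seat-years, which is in line with the rough figures quoted above.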

### Potential Variables (incomplete list)

- Representative name (string) ✓
- Representative district (discrete) ✓
- Year Elected (discrete) ✓
- Vote share (continuous) ✓
- Avg district population in state based on previous census (continuous) ✓
- Previous election turnout (continuous) 
- Previous election vote share (continuous) ✓
- Party affiliation (dummy) ✓
- President Party (dummy) ✓
- President approval rating (continuous) ✓
- Age (discrete) ✓
- Education (dummies) ✓
- Race (dummies)
- Gender (dummy)
- % Black in district (in year elected) (continuous)
- % Hispanic in district (in year elected) (continuous)
- Income level of district (continuous)
- Previous elected experience (dummy) ✓
- Partisan make-up of prior Congress (discrete) ✓
- Partisan make-up of Congress elected to serve in (discrete) ✓
- Number of the Congress (discrete or maybe dummies)
- Previously elected (dummy) ✓
- Defeated incumbent (dummy)
- Midterm (dummy) ✓
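
To make the variable types above concrete, here is a minimal sketch of how a few of them could be encoded with pandas. The rows, the column names, and the education categories are hypothetical. The midterm rule relies on presidential elections falling in years divisible by four, so House elections in the other even years are midterms.

```python
import pandas as pd

# Hypothetical rows; values are invented for illustration only.
df = pd.DataFrame(
    {
        "year_elected": [2014, 2016, 2018, 2020],
        "party": ["D", "R", "R", "D"],
        "education": ["JD", "BA", "MBA", "JD"],
        "vote_share": [0.55, 0.61, 0.52, 0.68],
    }
)

# Midterm dummy: even years that are not presidential election years.
df["midterm"] = (df["year_elected"] % 4 == 2).astype(int)

# Party affiliation as a single dummy (1 = Democratic in this sketch).
df["party_dem"] = (df["party"] == "D").astype(int)

# Education as a set of dummies, one column per category.
encoded = pd.get_dummies(df, columns=["education"], prefix="edu", dtype=int)
encoded = encoded.drop(columns=["party"])

print(encoded)
```

A single party dummy only works for a two-party coding; independents or third parties would need their own category.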

Now, we look at some early results.

![image.png](attachment:image.png)

Considering data from before 1900, this chart illustrates the average amount of time each representative served.

<h2 align="center">Lessons Learned and Challenges</h2>

### Web-Scraping is Hard!

- Even relatively structured data sources, like wikitables in a series of Wikipedia articles, can be unpredictable.

- Some unsecured websites can have additional issues with SSL verification.

- Most of these problems can be avoided by ignoring especially old (pre-1900) data.
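
The unpredictability mentioned above often shows up as rows that do not match the header, or as footnote markers inside cells. The following is a minimal sketch of defensive parsing with BeautifulSoup and pandas. The HTML string is a made-up stand-in for a scraped page, not a real Wikipedia table, and the column names are hypothetical.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for a scraped page.
html = """
<table class="wikitable">
  <tr><th>District</th><th>Member</th><th>Party</th></tr>
  <tr><td>XX-01</td><td>A. Smith<sup>[1]</sup></td><td>Democratic</td></tr>
  <tr><td colspan="3">Vacant</td></tr>
  <tr><td>XX-02</td><td>B. Jones</td><td>Republican</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = table.find_all("tr")
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for tr in rows[1:]:
    cells = tr.find_all("td")
    # Skip rows whose shape does not match the header (e.g. merged cells).
    if len(cells) != len(header):
        continue
    # Remove footnote markers such as [1] before reading the cell text.
    for sup in tr.find_all("sup"):
        sup.decompose()
    records.append([td.get_text(strip=True) for td in cells])

members = pd.DataFrame(records, columns=header)
print(members)
```

Skipping malformed rows silently is the simplest choice; logging them instead would make it easier to see how much data is being lost. For the SSL problem, `requests.get` accepts a `verify` argument, but turning verification off trades safety for convenience.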

### Existing Data

- The data that does exist is often compiled at the level of states, not districts.

- Some district-level data exists only in file formats that are very difficult to navigate. Where the data is accessible, it is often available only for very recent history (post-2000s).

- In these cases, finding very large data sets is helpful. In some cases, state level data has to serve as a stand-in for data we would rather have at the district level.
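
Using state-level data as a stand-in amounts to a many-to-one join from districts to states. The sketch below shows one way to do that with `DataFrame.merge`. The tables, column names, and income values are invented for illustration.

```python
import pandas as pd

# Hypothetical district-level rows (one per district per year).
districts = pd.DataFrame(
    {
        "district": ["XX-01", "XX-02", "YY-01"],
        "state": ["XX", "XX", "YY"],
        "year": [2010, 2010, 2010],
    }
)

# Hypothetical state-level data (one row per state per year).
state_data = pd.DataFrame(
    {
        "state": ["XX", "YY"],
        "year": [2010, 2010],
        "median_income": [50000, 62000],
    }
)

# Every district in a state inherits that state's value. validate= raises
# an error if the state table unexpectedly has duplicate (state, year) keys.
merged = districts.merge(
    state_data, on=["state", "year"], how="left", validate="many_to_one"
)

print(merged)
```

A left join keeps districts that have no state-level match, leaving missing values there, so gaps in the stand-in data stay visible instead of silently dropping observations.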