### **Introduction**

***Hypothesis***

<p>In this tutorial, we aim to predict the results of the upcoming 2024 presidential election, based on counties' votes in the past 4 elections. We will determine a county's vote based on factors such as demographics, median household income, education, and rural/urban landscape.</p>

> ***Why is this relevant?***
> <p>Election forecasting is one of the most visible real-world applications of data science. Building this model offers insight into political data science and into what is arguably its most popular goal: predicting election outcomes.</p>

### **Step 1: Data Collection**
<p>In this section, we retrieve and process the data for our model. Our analysis is at the county level; the relevant demographic data can be sourced directly from the US Census. We begin by importing the Python libraries we need.</p>

In [2]:
# For retrieving election data using HTTP requests
import requests
from bs4 import BeautifulSoup
# For data processing
import pandas as pd


The factors we hypothesize to affect a county's vote include:
- Demographics
- Urban/rural Classification
- Income
- Education

**Data Sources**


National data for demographics per county: <https://www2.census.gov/programs-surveys/popest/datasets/2020-2021/counties/asrh/cc-est2021-all.csv>
<br>
National data for elections: ***[Dave Leip's Atlas of US Presidential Elections](https://uselectionatlas.org)***

#### **Retrieving Election Data**
<p><em>Leip's Atlas</em> queries states by their FIPS code. Keeping this in mind, we'll start by defining some useful mappings.</p>

In [3]:
# maps states to their corresponding FIPS value for the atlas.
fips = {"Alabama": 1, "Arizona": 4, "Arkansas": 5, "California": 6, "Colorado": 8, "Connecticut" : 9, \
        "Delaware": 10, "DC": 11, "Florida": 12, "Georgia": 13, "Hawaii": 15, "Idaho": 16, "Illinois": 17, \
        "Indiana": 18, "Iowa": 19, "Kansas": 20, "Kentucky": 21, "Maine": 23, "Maryland": 24, \
        "Massachusetts": 25, "Michigan": 26, "Minnesota": 27, "Mississippi": 28, "Missouri": 29, "Montana": 30, \
        "Nebraska": 31, "Nevada": 32, "New Hampshire": 33, "New Jersey": 34, "New Mexico": 35, \
        "New York":36, "North Carolina":37, "North Dakota":38, "Ohio": 39, "Oklahoma": 40, "Oregon": 41, \
        "Pennsylvania": 42, "Rhode Island": 44, "South Carolina": 45, "South Dakota": 46, "Tennessee": 47, \
        "Texas": 48, "Utah": 49, "Vermont": 50, "Virginia": 51, "Washington": 53, "West Virginia": 54, \
        "Wisconsin": 55, "Wyoming": 56}

# map of state abbreviations for use in MapCharts
abbs = {"Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR", "California": "CA", "Colorado": "CO", \
        "Connecticut" : "CT", "Delaware": "DE", "DC": "DC", "Florida": "FL", "Georgia": "GA", "Hawaii": "HI", \
        "Idaho": "ID", "Illinois": "IL", "Indiana": "IN", "Iowa": "IA", "Kansas": "KS", "Kentucky": "KY", \
        "Louisiana": "LA", "Maine": "ME", "Maryland": "MD", "Massachusetts": "MA", "Michigan": "MI", \
        "Minnesota": "MN", "Mississippi": "MS", "Missouri": "MO", "Montana": "MT", "Nebraska": "NE", \
        "Nevada": "NV", "New Hampshire": "NH", "New Jersey": "NJ", "New Mexico": "NM", "New York": "NY", \
        "North Carolina": "NC", "North Dakota": "ND", "Ohio":"OH", "Oklahoma": "OK", "Oregon": "OR", \
        "Pennsylvania": "PA", "Rhode Island": "RI", "South Carolina": "SC", "South Dakota": "SD", "Tennessee": "TN", \
        "Texas": "TX", "Utah": "UT", "Vermont": "VT", "Virginia": "VA", "Washington": "WA", \
        "West Virginia": "WV", "Wisconsin": "WI", "Wyoming": "WY"}

all_states = list(abbs.keys())
atlas_states = list(fips.keys())
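
The two mappings do not cover the same set of states, and a set difference makes the gap explicit. The sketch below uses small excerpts of the dictionaries (named `abbs_excerpt` and `fips_excerpt` purely for illustration, as is the helper `missing_from_atlas`) so that it runs on its own; with the full `abbs` and `fips` the same call would list the states absent from the atlas mapping.

```python
# Shortened excerpts of the mappings above, so the example is self-contained.
abbs_excerpt = {"Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Louisiana": "LA", "Maine": "ME"}
fips_excerpt = {"Alabama": 1, "Arizona": 4, "Maine": 23}

def missing_from_atlas(all_names, atlas_names):
    """Return the names present in all_names but absent from atlas_names, sorted."""
    return sorted(set(all_names) - set(atlas_names))

print(missing_from_atlas(abbs_excerpt, fips_excerpt))  # ['Alaska', 'Louisiana']
```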

The goal is to obtain data for all states for the previous 4 election cycles. To cut down on the inevitable iterative code, we'll define a function to help.

##### **```election_results()```**
<p>Returns the past four presidential election results, by county, for the given state. Each result is encoded as 1 for Democrat, -1 for Republican, and 0 for Independent or Other.</p>

In [4]:
def election_results(state):
    results = {"2008":{}, "2012":{}, "2016":{}, "2020":{}} # year: {county: result} dict
    years = results.keys()
    
    for year in years:
        request = requests.get("https://uselectionatlas.org/RESULTS/datagraph.php?year=" + \
                               str(year) + "&fips=" + str(fips[state]))
        soup = BeautifulSoup(request.content, "html.parser")
        tables = soup.body.find("div", {"class": "info"}).find_all("table")
    
        for table in tables:
            county_name = table.find("td").b.text
            # The table always lists the Democratic candidate first, followed by the Republican candidate, then Independent and Other. Therefore index 0 will correspond to (D), 1 to (R), and so forth.
            percentages = [float(per.text.split("%")[0]) for per in table.find_all("td", "per")]
            max_per = max(percentages)
            if percentages.index(max_per) == 0:
                results[year][county_name] = 1
            elif percentages.index(max_per) == 1:
                results[year][county_name] = -1
            else:
                results[year][county_name] = 0
    return pd.DataFrame(results)
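
The encoding step can be separated from the scraping so that it can be checked without any network access. The sketch below is one possible refactoring, not part of the original function: the helper names `parse_share` and `classify_winner` are hypothetical, and the sample percentages are made up. It assumes, as the comment in the loop states, that shares are ordered Democrat, Republican, then the rest.

```python
def parse_share(text):
    """Convert a table cell such as '54.3%' to the float 54.3."""
    return float(text.split("%")[0])

def classify_winner(percentages):
    """Map vote shares ordered [D, R, other...] to 1, -1, or 0.

    Returns 1 if the Democratic share is largest, -1 if the Republican
    share is largest, and 0 otherwise. Ties go to the earliest entry,
    which matches list.index() in the original loop.
    """
    if not percentages:
        raise ValueError("no vote shares found for this county")
    winner = percentages.index(max(percentages))
    return {0: 1, 1: -1}.get(winner, 0)

# Made-up cell texts, in the order D, R, Other.
shares = [parse_share(cell) for cell in ["41.2%", "56.1%", "2.7%"]]
print(classify_winner(shares))  # -1
```

Raising on an empty list is a design choice in this sketch: it surfaces a table whose layout did not match expectations instead of failing inside `max()`.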

Here's an example call to the `election_results()` function for the state of Idaho:

In [5]:
df = election_results("Idaho")
df.head()

| County | 2008 | 2012 | 2016 | 2020 |
|---|---|---|---|---|
| Ada | -1 | -1 | -1 | -1 |
| Adams | -1 | -1 | -1 | -1 |
| Bannock | -1 | -1 | -1 | -1 |
| Bear Lake | -1 | -1 | -1 | -1 |
| Benewah | -1 | -1 | -1 | -1 |


#### **Processing Election Data**
Now we call `election_results()` for all 49 entries in `atlas_states` (48 states plus DC) to compile a master dataframe of US county votes.

> ***Why 49?***
> <p> <em>Leip's Atlas</em> doesn't have information on Louisiana and Alaska. More on fixing this problem later.</p> 

In [6]:
cols = ["State", "County", "2008", "2012", "2016", "2020"]
state_frames = []

# This loop might take a while to run (one HTTP request per state per year)
for state in atlas_states:
    state_df = election_results(state)
    state_df["County"] = state_df.index
    state_df["State"] = state
    state_frames.append(state_df[cols])

# Concatenate once at the end instead of growing a dataframe inside the loop
elections_df = pd.concat(state_frames)

In [27]:
# Replace the county-name index with a plain 0..n-1 index
elections_df = elections_df.reset_index(drop=True)
elections_df.head()

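| | State | County | 2008 | 2012 | 2016 | 2020 |
|---|---|---|---|---|---|---|
| 0 | Alabama | Autauga | -1.0 | -1.0 | -1.0 | -1.0 |
| 1 | Alabama | Baldwin | -1.0 | -1.0 | -1.0 | -1.0 |
| 2 | Alabama | Barbour | -1.0 | 1.0 | -1.0 | -1.0 |
| 3 | Alabama | Bibb | -1.0 | -1.0 | -1.0 | -1.0 |
| 4 | Alabama | Blount | -1.0 | -1.0 | -1.0 | -1.0 |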
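
The votes now appear as floats (`-1.0`) rather than the integers seen for Idaho. One likely cause, stated here as an assumption and not verified against the scraped data, is that some county names are missing for at least one year. Building a dataframe from a dict of dicts fills such gaps with `NaN`, which forces the column to a float dtype. The toy data below (made-up county names) reproduces the effect and shows pandas' nullable `Int64` dtype as one way to keep integer values.

```python
import pandas as pd

# Made-up results: "Beta" has no entry for 2008.
toy = pd.DataFrame({"2008": {"Alpha": -1}, "2012": {"Alpha": -1, "Beta": 1}})

print(toy.dtypes)        # "2008" becomes float64 because of the NaN; "2012" stays int64
print(toy.isna().sum())  # number of missing results per year

# Nullable integers keep -1/0/1 as integers and mark gaps as <NA>.
toy_int = toy.astype("Int64")
print(toy_int)
```

Counting the missing entries per year, as `isna().sum()` does here, is also a quick way to spot counties whose names changed between election cycles.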


Having obtained our desired election dataframe, we can move on to curating the data for our parameters.

#### **Processing Demographics Data**

In [63]:
# Reading full CSV of Census 2020-2021
race_df_full = pd.read_csv("/Users/goddang/race_by_county_2020.csv", encoding="latin-1")
# Grabbing only the entries for 2020
race_df_full = race_df_full.loc[race_df_full["YEAR"] == 2]
race_df_full.head()

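| | SUMLEV | STATE | COUNTY | STNAME | CTYNAME | YEAR | AGEGRP | TOT_POP | TOT_MALE | TOT_FEMALE | ... | HWAC_MALE | HWAC_FEMALE | HBAC_MALE | HBAC_FEMALE | HIAC_MALE | HIAC_FEMALE | HAAC_MALE | HAAC_FEMALE | HNAC_MALE | HNAC_FEMALE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 50 | 1 | 1 | Alabama | Autauga County | 2 | 0 | 58877 | 28734 | 30143 | ... | 858 | 739 | 114 | 105 | 40 | 34 | 23 | 23 | 19 | 14 |
| 20 | 50 | 1 | 1 | Alabama | Autauga County | 2 | 1 | 3480 | 1812 | 1668 | ... | 84 | 58 | 15 | 10 | 6 | 0 | 8 | 1 | 3 | 0 |
| 21 | 50 | 1 | 1 | Alabama | Autauga County | 2 | 2 | 3683 | 1877 | 1806 | ... | 78 | 70 | 7 | 9 | 5 | 3 | 2 | 1 | 0 | 2 |
| 22 | 50 | 1 | 1 | Alabama | Autauga County | 2 | 3 | 4180 | 2140 | 2040 | ... | 83 | 79 | 9 | 12 | 3 | 3 | 2 | 2 | 1 | 1 |
| 23 | 50 | 1 | 1 | Alabama | Autauga County | 2 | 4 | 3832 | 1933 | 1899 | ... | 71 | 67 | 9 | 11 | 4 | 2 | 5 | 3 | 4 | 3 |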


This dataset contains far more information than we need, so we isolate the relevant part: the percentage of each county's population that is white (labelled `% Caucasian` in the code). Notice that the dataset repeats each county; this is because the population is broken down by age group (`AGEGRP` 1 through 18, with `AGEGRP` 0 holding the all-ages total, as the first row above shows). To obtain one figure per county, we first group by county, then sum the necessary columns and compute the percentage. Because the numerator and denominator are summed over the same rows, the extra total row does not change the ratio.

In [64]:
county_groups = race_df_full.groupby("CTYNAME")
race_df = pd.DataFrame(columns=["Year", "State", "County", "% Caucasian"])
index = 0
for name, group in county_groups:
    total_population = group["TOT_POP"].sum()
    white_male = group["WA_MALE"].sum()
    white_female = group["WA_FEMALE"].sum()
    race_df.at[index, "Year"] = "2020"
    race_df.at[index, "State"] = list(group["STNAME"])[0]
    race_df.at[index, "County"] = name
    race_df.at[index, "% Caucasian"] = ((white_male + white_female) / total_population) * 100
    index += 1
race_df.sort_values(by=['State'], inplace=True)
race_df.head()

| | Year | State | County | % Caucasian |
|---|---|---|---|---|
| 551 | 2020 | Alabama | Elmore County | 78.632347 |
| 351 | 2020 | Alabama | Clay County | 84.931136 |
| 964 | 2020 | Alabama | Lawrence County | 92.193417 |
| 82 | 2020 | Alabama | Autauga County | 75.717173 |
| 356 | 2020 | Alabama | Cleburne County | 95.768262 |
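
County names are not unique across states (Elmore County, for example, exists in both Alabama and Idaho), so grouping on `CTYNAME` alone can pool rows from different states into one entry. A hedged alternative, assuming the column names shown above and that `AGEGRP` 0 is the all-ages total row, is to select that row directly and keep both `STNAME` and `CTYNAME` as the key. The function name `percent_white_by_county` is hypothetical and the numbers in the toy frame are made up.

```python
import pandas as pd

def percent_white_by_county(df, year_label):
    """Return one row per (state, county) with the white share of the population."""
    totals = df.loc[df["AGEGRP"] == 0].copy()  # all-ages total row for each county
    totals["% Caucasian"] = (totals["WA_MALE"] + totals["WA_FEMALE"]) / totals["TOT_POP"] * 100
    totals["Year"] = year_label
    out = totals.rename(columns={"STNAME": "State", "CTYNAME": "County"})
    out = out[["Year", "State", "County", "% Caucasian"]]
    return out.sort_values(["State", "County"]).reset_index(drop=True)

# Toy rows with made-up numbers: two different counties share one name.
toy = pd.DataFrame({
    "STNAME": ["Alabama", "Alabama", "Idaho", "Idaho"],
    "CTYNAME": ["Elmore County"] * 4,
    "AGEGRP": [0, 1, 0, 1],
    "TOT_POP": [1000, 100, 200, 20],
    "WA_MALE": [400, 40, 90, 9],
    "WA_FEMALE": [350, 35, 95, 10],
})
result = percent_white_by_county(toy, "2020")
print(result)
```

This version also avoids the row-by-row `.at` assignments, since the percentage is computed for all counties in one vectorized expression.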


Now, we will repeat this for census data from 2008, 2012, and 2016, and combine the resulting dataframes.
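
One way to combine the per-year results is `pd.concat`. The sketch below assumes each year has already been reduced to a frame with the same four columns as `race_df`; the frame names, state and county labels, and values are placeholders, not census figures.

```python
import pandas as pd

cols = ["Year", "State", "County", "% Caucasian"]

# Placeholder frames standing in for the per-year results (values are made up).
frame_2016 = pd.DataFrame([["2016", "Example State", "Example County", 70.0]], columns=cols)
frame_2020 = pd.DataFrame([["2020", "Example State", "Example County", 71.0]], columns=cols)

# Stack the years and renumber the rows from 0.
race_all_years = pd.concat([frame_2016, frame_2020], ignore_index=True)
print(race_all_years)
```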