# US Presidential Election Analysis: Electoral College, Popular Vote, or Both?

## Objective
This notebook contains the second step in a larger effort to analyze historical US Presidential Election data. It focuses on scraping the electoral college voting results for all available US presidential elections (i.e. from 1789 to the present) from [The American Presidency Project website](https://www.presidency.ucsb.edu/statistics/elections), hosted at the University of California, Santa Barbara. Once the data is scraped, transformed, and validated, it is written to a Postgres Database for subsequent analysis. The following steps are implemented in this notebook:
1. [ ] Initial Setup: Import Modules, Define Functions, and Set Parameter Values
2. [ ] Scrape Electoral College Data from the American Presidency Project (APP) Website
    1. Define Set containing All Presidential Election Years
    2. Define Set containing All US "States" that Vote in Presidential Elections (includes Washington DC)
    3. Scrape the APP Summary web page for Links to each Election Year's Data
    4. Scrape each Election Year's web page to download the two tables containing all Election Data
    5. Parse the Data for All Election Years into a usable, compact format
    6. Validate Accuracy of Parsed Election Data
3. [ ] Transform and Validate Parsed Election Data so that it conforms to a Star Schema design pattern, with Candidate and State Dimension Tables and Electoral Vote Fact table:
    1. Spot Check Parsed Data for Individual Election Years
    2. Transform and Validate Candidate Data
    3. Transform and Validate State Data
    4. Transform and Validate Electoral Votes Data
4. [ ] Write Election Data Tables to Postgres

### Notes
- The APP website contains both Popular Vote and Electoral College results for US Presidential Elections from 1824? to the present. I'll collect both voting datasets and compare the electoral results to those obtained from the National Archives for the election years that overlap between the two data sources.
- Currently I'm only scraping information about the Presidential Candidates. Vice Presidential results are also available, so I can circle back to include that data if the need arises.

### Data Sources
1. [US Presidential Election voting results](https://www.presidency.ucsb.edu/statistics/elections)
2. [US States shapefile](https://www2.census.gov/geo/tiger/TIGER2019/STATE/tl_2019_us_state.zip)

### References
1. [Plotly Article](https://towardsdatascience.com/leap-from-matplotlib-to-plotly-a-hands-on-tutorial-for-beginners-d208cd9e6522)

## 1. Setup

### 1.1 Import Modules

In [9]:
# import modules
from bs4 import BeautifulSoup
from db_tools import DBC
import geopandas as gpd
import getpass
import matplotlib.pyplot as plt
import pandas as pd
import requests

### 1.2 Define Functions

In [34]:
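def get_us_election_years(latest_election_year):
    """Return the set of US presidential election years (assumes latest_election_year is itself an election year)."""
    # The first election is recorded as 1789; from 1792 onward elections occur every four years
    us_election_years = [1789]+list(range(1792, latest_election_year+4, 4))
    print("US Presidential Election Years")
    print(*us_election_years, sep=", ")
    return set(us_election_years)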

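def get_html_data(url, tag_name, tag_id, find_tables=False, all_tables=False):
    """Fetch url and return the tag with the given id, or the table(s) nested inside it."""
    page = requests.get(url, timeout=30)
    page.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(page.content, "html.parser")
    tag_data = soup.find(tag_name, id=tag_id)
    if tag_data is None:
        raise ValueError(f"No <{tag_name}> tag with id '{tag_id}' found at {url}")
    if not find_tables:
        return tag_data
    elif not all_tables:
        return tag_data.find("table")
    else:
        return tag_data.find_all("table")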
    
def scrape_data_links(url_domain, url_base, tag_name, tag_id):
    link_data = get_html_data(url_domain+url_base, tag_name, tag_id)
    return [url_domain+a['href'] for a in link_data.find_all("a")]

def scrape_raw_election_tables(election_links, us_election_years, tag_name, tag_id):
    raw_election_tables = {}
    for link in election_links:
        link_year = int(link.split('/')[-1])
        if link_year in us_election_years:
            raw_election_tables[link_year] = get_html_data(link, tag_name, tag_id, find_tables=True, all_tables=True)
        else:
            print(f"Error: The link year, {link_year}, parsed from the following link does not match a US election year: \n{link}")
    return raw_election_tables
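
`scrape_raw_election_tables` converts the last path segment of each link with `int(...)`, which raises a `ValueError` if a link does not end in a number. The sketch below shows one possible way to make that step tolerant of such links. The helper name `parse_link_year` is hypothetical and not part of the notebook, and whether the summary page ever yields non-year links is an assumption.

```python
from urllib.parse import urlparse

def parse_link_year(link):
    """Return the last path segment of link as an int, or None if it is not numeric.

    Hypothetical helper, not part of the original notebook.
    """
    last_segment = urlparse(link).path.rstrip("/").split("/")[-1]
    return int(last_segment) if last_segment.isdecimal() else None

print(parse_link_year("https://www.presidency.ucsb.edu/statistics/elections/2020"))
print(parse_link_year("https://www.presidency.ucsb.edu/statistics/elections"))
```

A caller could then skip links for which the helper returns `None` and still report years that fall outside `us_election_years`.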

### 1.3 Set Parameters

Define the base URL for The American Presidency Project (APP) and the resource location of the summary page that links to each year's Presidential election data. The shapefile containing US state data can be downloaded from [this link](https://www2.census.gov/geo/tiger/TIGER2019/STATE/).

In [18]:
latest_election_year = 2020
app_url_domain = "https://www.presidency.ucsb.edu"
app_url_base = "/statistics/elections"
usa_state_shp = "/home/fdpearce/Documents/Projects/data/Maps/State_Shapes/tl_2019_us_state/tl_2019_us_state.shp"

## 2. Scrape Electoral College Data from the American Presidency Project (APP) Website

### 2.1 Define Set Containing All Presidential Election Years

Create a set, `us_election_years`, with every year in which a US Presidential Election occurred. This set is used to check the year parsed from each scraped link. See the APP summary page (`app_url_domain` + `app_url_base`) for the complete list of election years. A set is used so that membership checks run in O(1) time on average.

In [4]:
us_election_years = get_us_election_years(latest_election_year)

US Presidential Election Years
1789, 1792, 1796, 1800, 1804, 1808, 1812, 1816, 1820, 1824, 1828, 1832, 1836, 1840, 1844, 1848, 1852, 1856, 1860, 1864, 1868, 1872, 1876, 1880, 1884, 1888, 1892, 1896, 1900, 1904, 1908, 1912, 1916, 1920, 1924, 1928, 1932, 1936, 1940, 1944, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020
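
A few quick sanity checks can confirm that the year set looks as expected. The sketch below rebuilds the same set without the printing (the name `build_election_years` is hypothetical) and checks its size and bounds for `latest_election_year = 2020`.

```python
def build_election_years(latest_election_year):
    # Same construction as get_us_election_years, minus the printing
    return {1789} | set(range(1792, latest_election_year + 4, 4))

years = build_election_years(2020)

# 1789 plus the 58 quadrennial elections from 1792 through 2020
assert len(years) == 59
assert min(years) == 1789 and max(years) == 2020
# Every year other than 1789 falls on the four-year cycle
assert all(year % 4 == 0 for year in years if year != 1789)

print(f"{len(years)} election years from {min(years)} to {max(years)}")
```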


### 2.2 Define Set Containing All US States

Load all the data from the US states shapefile. For now, only the state names are extracted, but later the geometry column can be used to generate maps, extract state features, etc. The names of US territories that don't participate in Presidential Elections are dropped from the set of states, leaving the 50 states plus the District of Columbia.

In [31]:
usa = gpd.read_file(usa_state_shp)
us_state_names = usa['NAME'].values
print(sorted(us_state_names))

['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Commonwealth of the Northern Mariana Islands', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Guam', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'United States Virgin Islands', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']


In [32]:
us_state_names = set(us_state_names)
print(f"Total # of States = {len(us_state_names)}")

Total # of States = 56


In [33]:
territories = ['American Samoa', 'Commonwealth of the Northern Mariana Islands', 'Guam', 'Puerto Rico', 'United States Virgin Islands']
us_state_names.difference_update(territories)
print(f"Total # of States = {len(us_state_names)}")

Total # of States = 51
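
Note that `set.difference_update` silently ignores names that are not in the set, so a misspelled territory name would only show up as an unexpected count. The sketch below is one possible stricter variant that raises instead. The function name `drop_territories` and the sample names are illustrative only.

```python
def drop_territories(names, territories):
    """Return names without territories, raising if a territory name is absent.

    Hypothetical helper, not part of the original notebook.
    """
    names = set(names)
    missing = set(territories) - names
    if missing:
        raise ValueError(f"Territory names not found: {sorted(missing)}")
    return names - set(territories)

# Small illustrative sample rather than the full shapefile contents
sample_names = ["Alabama", "Guam", "District of Columbia", "Puerto Rico"]
print(sorted(drop_territories(sample_names, ["Guam", "Puerto Rico"])))
```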


### 2.3 Scrape the Links to each Election Year's Data

Parse the HTML summary page and extract the link to the page containing the data for each Presidential Election year.

In [30]:
election_links = scrape_data_links(app_url_domain, app_url_base, "section", "block-views-election-maps-block-1")
print(election_links)

['https://www.presidency.ucsb.edu/statistics/elections/2020', 'https://www.presidency.ucsb.edu/statistics/elections/2016', 'https://www.presidency.ucsb.edu/statistics/elections/2012', 'https://www.presidency.ucsb.edu/statistics/elections/2008', 'https://www.presidency.ucsb.edu/statistics/elections/2004', 'https://www.presidency.ucsb.edu/statistics/elections/2000', 'https://www.presidency.ucsb.edu/statistics/elections/1996', 'https://www.presidency.ucsb.edu/statistics/elections/1992', 'https://www.presidency.ucsb.edu/statistics/elections/1988', 'https://www.presidency.ucsb.edu/statistics/elections/1984', 'https://www.presidency.ucsb.edu/statistics/elections/1980', 'https://www.presidency.ucsb.edu/statistics/elections/1976', 'https://www.presidency.ucsb.edu/statistics/elections/1972', 'https://www.presidency.ucsb.edu/statistics/elections/1968', 'https://www.presidency.ucsb.edu/statistics/elections/1964', 'https://www.presidency.ucsb.edu/statistics/elections/1960', 'https://www.presidency
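
`scrape_data_links` builds each link by concatenating `url_domain` with the `href` value, which works while every `href` is a site-relative path. As a minimal sketch, assuming the page might at some point emit absolute URLs as well, `urllib.parse.urljoin` handles both cases. The helper name `build_link` is hypothetical.

```python
from urllib.parse import urljoin

app_url_domain = "https://www.presidency.ucsb.edu"

def build_link(url_domain, href):
    # urljoin resolves relative paths against the domain and
    # leaves absolute URLs unchanged
    return urljoin(url_domain, href)

print(build_link(app_url_domain, "/statistics/elections/2020"))
print(build_link(app_url_domain, "https://www.presidency.ucsb.edu/statistics/elections/2016"))
```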

### 2.4 Scrape the Two Tables Containing each Election Year's Data

The `raw_election_tables` dict has one key for each year with available data. Each value is a list holding the HTML of the two tables that contain that year's election data. The raw tables are kept in their own variable primarily for debugging purposes.

In [35]:
raw_election_tables = scrape_raw_election_tables(election_links, us_election_years, "section", "block-system-main")

In [36]:
print("Data is currently available from the APP website for the following election years:")
print(*raw_election_tables.keys(), sep=", ")

Data is currently available from the APP website for the following election years:
2020, 2016, 2012, 2008, 2004, 2000, 1996, 1992, 1988, 1984, 1980, 1976, 1972, 1968, 1964, 1960, 1956, 1952, 1948, 1944, 1940, 1936, 1932, 1928, 1924, 1920, 1916, 1912, 1908, 1904, 1900, 1896, 1892, 1888, 1884, 1880, 1876, 1872, 1868, 1864, 1860, 1856, 1852, 1848, 1844, 1840, 1836, 1832, 1828, 1824, 1820, 1816, 1812, 1808, 1804, 1800, 1796, 1792, 1789
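
Rather than comparing the printed years by eye, a set difference can report any expected election year that was not scraped. The sketch below uses a small stand-in dict, and the function name `find_missing_years` is hypothetical. In the notebook the arguments would be `us_election_years` and `raw_election_tables`.

```python
def find_missing_years(expected_years, scraped_tables):
    """Return the sorted expected years that have no entry in scraped_tables."""
    return sorted(set(expected_years) - set(scraped_tables))

# Stand-in data for illustration only
expected = {2012, 2016, 2020}
scraped = {2020: ["<table>...</table>"], 2016: ["<table>...</table>"]}
print(find_missing_years(expected, scraped))
```

An empty list means every expected election year has scraped data.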


### 2.5 Parse the Data for All Election Years

In [41]:
print(raw_election_tables[2020][0])

<table>
<tbody>
<tr>
<td align="center" class="x176" colspan="11">
<table cellpadding="2" cellspacing="2" width="700">
<tbody>
<tr>
<td class="x176" colspan="2" rowspan="2"><strong>Party</strong></td>
<td align="center" class="x176" colspan="2"><strong>Nominees</strong></td>
<td align="center" class="x176" colspan="2" rowspan="2"><strong>Electoral Vote</strong></td>
<td align="center" class="x176" colspan="3" rowspan="2"><strong>Popular Vote</strong></td>
</tr>
<tr>
<td><strong>Presidential</strong></td>
<td><strong>Vice Presidential</strong></td>
</tr>
<tr>
<td><em> Democratic</em></td>
<td><img align="middle" alt="election party winner" height="16" src="https://www.presidency.ucsb.edu/sites/default/files/wysiwyg_template_images/ic_check_circle_black2x.png" width="16"/></td>
<td>Joseph R. Biden</td>
<td>Kamala Harris</td>
<td align="center" class="x176">306</td>
<td align="right" class="x176">56.88%</td>
<td align="right" class="x176">81,268,773</td>
<td align="right" class="x176">51.