# US Presidential Election Analysis: Electoral College, Popular Vote, or Both?

## Objective
This notebook is the first step in a larger effort to analyze historical US presidential election data. It scrapes the Electoral College results for every US presidential election from 1892 to the present from the [National Archives website](https://www.archives.gov/electoral-college/results), with the goal of writing the results to a Postgres database for subsequent analysis. Steps marked [X] are implemented in this notebook; steps marked [ ] are still to do:
1. [X] Initial Setup: Import Modules, Define Functions, and Set Parameter Values
2. [X] Scrape Electoral College Data from the National Archives Website
    1. Define Set containing All Presidential Election Years
    2. Define Set containing All US "States" that Vote in Presidential Elections (includes Washington DC)
    3. Scrape the National Archives summary web page for Links to each Election Year's Data
    4. Scrape each Election Year's web page to download the two tables containing all Election Data
    5. Parse the Data for All Election Years into a useable, compact format
    6. Validate Accuracy of Parsed Election Data
3. [ ] Normalize Election Data for Writing to Postgres Database
4. [ ] Write Election Data Tables to Postgres

### Notes
- The National Archives website only contains Electoral College results for US Presidential Elections from 1892 to the present. Another data source will be scraped to get Popular Vote data for each Presidential Election.
- Currently I'm only scraping the Presidential Candidates and their Electoral College vote tallies. Vice Presidential results are also available, so I can circle back to include that data if the need arises.
- FYI: A candidate who wins a majority or plurality of the nationwide popular vote has a good chance of winning in the Electoral College, but there is no guarantee. In the 1824, 1876, 1888, 2000, and 2016 elections, the popular vote winner did not become President (see Reference 1 below).

### Data Sources
1. US Presidential Election Electoral College voting results: https://www.archives.gov/electoral-college/results
2. US States shapefile: https://www2.census.gov/geo/tiger/TIGER2019/STATE/tl_2019_us_state.zip

### References
1. Electoral College History: https://www.archives.gov/electoral-college/history
2. Scraping table data from websites using a single line in Python (Towards Data Science): https://towardsdatascience.com/scraping-table-data-from-websites-using-a-single-line-in-python-ba898d54e2bc

## 1. Setup

### 1.1 Import Modules

In [1]:
# import modules
from bs4 import BeautifulSoup
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import requests

### 1.2 Define Functions

In [2]:
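def get_html_tables(url, div_id="main-col", find_all=False):
    # Download the page; fail loudly on HTTP errors instead of parsing an error page
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, "html.parser")
    # The tables of interest live inside the div with the given id
    div = soup.find("div", id=div_id)
    if div is None:
        raise ValueError(f"No div with id '{div_id}' found at {url}")
    # Return every table in the div, or only the first one
    if find_all:
        return div.find_all("table")
    return div.find("table")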
    
def scrape_election_links(archive_url_domain, archive_url_base):
    link_table = get_html_tables(archive_url_domain+archive_url_base)
    return [archive_url_domain+a['href'] for a in link_table.find_all("a")]    

def scrape_raw_election_tables(election_links, us_election_years):
    raw_election_tables = {}
    for link in election_links:
        link_year = int(link.split('/')[-1])
        if link_year in us_election_years:
            raw_election_tables[link_year] = get_html_tables(link, find_all=True)
        else:
            print(f"Error: The link year, {link_year}, parsed from the following link does not match a US election year: \n{link}")
    return raw_election_tables
    
def parse_election_years(data_tables, state_names):
    parsed_years = []
    for ind, year in enumerate(data_tables.keys()):
        print(f"Working on Election Year = {year} ({ind})")
        parsed_dict = parse_election_year_tables(data_tables[year], state_names)
        parsed_dict['year'] = year
        parsed_years.append(parsed_dict)
    return parsed_years

def parse_election_year_tables(year_tables, state_names):
    parsed_tables = {}
    parsed_tables['t1'] = parse_table1(year_tables[0].find_all('tr'))
    parsed_tables['t2'] = parse_table2(year_tables[1].find_all('tr'), state_names)
    return parsed_tables

def parse_table1(t1_rows):
    cp_row_inds = [0, 1]
    cp_row_headers = ["President", "Main Opponent"]
    candidate_party = []
    for ri, rh in zip(cp_row_inds, cp_row_headers):
        candidate_party.append(parse_t1_candidate_party(t1_rows, ri, rh))
    return candidate_party
    
# Parse one row of Table 1 into a (candidate name, party) tuple
def parse_t1_candidate_party(t1_rows, row_ind, row_header):
    if t1_rows[row_ind].find('th').get_text() == row_header:
        row_data = t1_rows[row_ind].find('td').get_text()
        name, party = row_data.split(' [')
        return (name, party[:-1])
    else:
        print(f"Error: Row{row_ind} does not contain data for {row_header}")

def parse_table2(t2_rows, state_names):
    t2_data = {}
    num_candidates = parse_t2_num_candidates(t2_rows[0])
    t2_data['candidate_state'] = parse_t2_candidate_state(t2_rows[1], num_candidates)
    t2_data['votes_by_state'] = parse_t2_votes_by_state(t2_rows[2:], num_candidates, state_names)
    #t2_data['rows'] = t2_rows
    return t2_data

def parse_t2_num_candidates(header_row):
    # bs4 renamed the `text` argument to `string`; `text` is deprecated
    return int(header_row.find('th', string="For President").get('colspan'))
    
def parse_t2_candidate_state(cs_row, num_candidates):
    cs_cols = cs_row.find_all('td')
    candidate_state = []
    for cs in cs_cols[:num_candidates]:
        if cs.find('br'):
            text = " ".join(cs.stripped_strings)
        else:
            text = cs.get_text()
        if text == "Other":
            candidate = text
            state = None
        else:
            candidate, state = text.split(" of ")
        candidate_state.append((candidate.strip(","), state))
    return candidate_state

def parse_t2_votes_by_state(states_rows, num_candidates, state_names):
    votes_by_state = []
    for sr in states_rows:
        state_cols = sr.find_all('td')
        col_0_text = state_cols[0].get_text().strip(" *")
        if col_0_text in state_names:
            state = col_0_text
            start_ind, end_ind = 1, num_candidates+2
        elif col_0_text in {"Total", "Totals"}:
            state = "Totals"
            start_ind, end_ind = 1, num_candidates+2
        elif sr.find('th', string='Total'):
            state = "Totals"
            start_ind, end_ind = 0, num_candidates+1
        else:
            state = False
        # Only parse vote data and store it if a valid state value is found
        # This helps validate the state name was parsed correctly, and
        # skips the Notes row at the end of some of the tables
        if state:
            state_votes = [state]   
            for sv in state_cols[start_ind:end_ind]:
                # Strip whitespace so a padded '-' cell is still counted as zero votes
                votes = sv.get_text().strip()
                state_votes.append(int(votes) if votes != '-' else 0)
            votes_by_state.append(tuple(state_votes))
    return votes_by_state

def print_election_year_results(parsed_year):
    print(f"Election Year: {parsed_year['year']}")
    print(f"Table 1 Top 2 Candidates + Party: \n{pprint_list_of_tuples(parsed_year['t1'])}")
    print(f"Table 2 Candidates + Home State: \n{pprint_list_of_tuples(parsed_year['t2']['candidate_state'])}")
    print(f"Table 2 Votes by State: \n{pprint_list_of_tuples(parsed_year['t2']['votes_by_state'])}")
    
def pprint_list_of_tuples(list_of_tuples):
    list_of_strings = [str(tup).strip('()') for tup in list_of_tuples]
    return '\t'+'\n\t'.join(list_of_strings)

def make_map_usa(df, col2plot, figsize=(15,8), title=None, fontsize=18, cmap='Blues', edgecolor='k'):
    # Define fixed longitude limit, xlim, and latitude limit, ylim, for USA
    xlim = (-172, -58)
    ylim = (16, 74)
    fig, ax = plt.subplots(1, figsize=figsize)
    df.plot(column=col2plot, ax=ax, cmap=cmap, edgecolor=edgecolor)
    ax.axis('off')
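    sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=df[col2plot].min(), vmax=df[col2plot].max()))
    # Use the public set_array method rather than the private _A attribute, and
    # pass ax explicitly so the colorbar knows which Axes to take space from
    sm.set_array([])
    cb = fig.colorbar(sm, ax=ax)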
    cb.set_label(col2plot, fontsize=fontsize)
    cb.ax.tick_params(labelsize=fontsize)
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    if title:
        ax.set_title(title, fontsize=fontsize+4)
    plt.tight_layout()
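
The two label-parsing helpers above (`parse_t1_candidate_party` and `parse_t2_candidate_state`) rely on `str.split`, which raises an exception when a cell does not have the expected shape. The sketch below is one possible stricter alternative, not the notebook's method: the function names `split_candidate_party` and `split_candidate_state` are hypothetical, and the label formats in the comments are assumptions inferred from the parsing code rather than copied from the website.

```python
import re

# Assumed label formats, inferred from the parsing functions above:
#   Table 1 cell:   "Grover Cleveland [D]"
#   Table 2 header: "Grover Cleveland, of New York" or "Other"
_PARTY_RE = re.compile(r"^(?P<name>.+?)\s*\[(?P<party>[^\]]+)\]\s*$")


def split_candidate_party(cell_text):
    """Return (name, party), or None if the text lacks a bracketed party."""
    match = _PARTY_RE.match(cell_text.strip())
    if match is None:
        return None
    return match.group("name"), match.group("party")


def split_candidate_state(cell_text):
    """Return (candidate, home_state); home_state is None for 'Other'."""
    text = " ".join(cell_text.split())  # collapse line breaks and repeated spaces
    if text == "Other":
        return text, None
    candidate, sep, state = text.partition(" of ")
    if not sep:
        return None  # no " of " separator, so the shape is unexpected
    return candidate.strip(", "), state


print(split_candidate_party("Grover Cleveland [D]"))
print(split_candidate_state("Grover Cleveland,\n of New York"))
```

Returning `None` for an unexpected shape lets the caller decide whether to skip the row or stop, instead of failing partway through a long scrape.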

### 1.3 Set Parameters

In [3]:
# Define the base URL for the National Archives, and the resource location for the summary page
# containing links to Presidential election data for each year. The shape file containing US State
# data can be downloaded from here: https://www2.census.gov/geo/tiger/TIGER2019/STATE/
latest_election_year = 2020
archive_url_domain = "https://www.archives.gov"
archive_url_base = "/electoral-college/results"
usa_state_shp = "/home/fdpearce/Documents/Projects/data/Maps/State_Shapes/tl_2019_us_state/tl_2019_us_state.shp"

## 2. Scrape Electoral College Data from the National Archives Website

### 2.1 Define Set Containing All Presidential Election Years

In [4]:
# Create a list, us_election_years, with every year that a US Presidential Election occurred
# (it is converted to a set in the next cell). It is used to check the year parsed from
# each scraped link. See the archive_url website for the list of years with available data
us_election_years = [1789]+list(range(1792, latest_election_year+4, 4))
print(us_election_years)

[1789, 1792, 1796, 1800, 1804, 1808, 1812, 1816, 1820, 1824, 1828, 1832, 1836, 1840, 1844, 1848, 1852, 1856, 1860, 1864, 1868, 1872, 1876, 1880, 1884, 1888, 1892, 1896, 1900, 1904, 1908, 1912, 1916, 1920, 1924, 1928, 1932, 1936, 1940, 1944, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020]


In [5]:
# Convert to set so membership checks execute efficiently (O(1))
us_election_years = set(us_election_years)
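
As a quick sanity check on the set built above, the same years can be described by a rule: 1789, then every fourth year from 1792 onward. The sketch below compares the two; `is_us_election_year` is a hypothetical helper introduced only for this illustration.

```python
def is_us_election_year(year, latest_election_year=2020):
    """Rule-based check: 1789, then every fourth year from 1792 onward."""
    if year == 1789:
        return True
    return 1792 <= year <= latest_election_year and (year - 1792) % 4 == 0


# Rebuild the set the same way the notebook does and compare it with the rule
latest_election_year = 2020
us_election_years = set([1789] + list(range(1792, latest_election_year + 4, 4)))
rule_years = {y for y in range(1700, 2100) if is_us_election_year(y)}
print(us_election_years == rule_years)
```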

### 2.2 Define Set Containing All US States

In [6]:
# This loads all the data from the shape file. For now, only the state names
# are used, but later the geometry column will be used to generate maps
usa = gpd.read_file(usa_state_shp)
us_state_names = usa['NAME'].values
print(sorted(us_state_names))

['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Commonwealth of the Northern Mariana Islands', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Guam', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'United States Virgin Islands', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']


In [7]:
# Convert array to set and check how many states are included in the shapefile data
us_state_names = set(us_state_names)
print(f"Total # of States = {len(us_state_names)}")

Total # of States = 56


In [8]:
# Drop the territories, which don't participate in Presidential Elections.
# The remaining 51 names are the 50 states plus the District of Columbia
territories = ['American Samoa', 'Commonwealth of the Northern Mariana Islands', 'Guam', 'Puerto Rico', 'United States Virgin Islands']
us_state_names.difference_update(territories)
print(f"Total # of States = {len(us_state_names)}")

Total # of States = 51


### 2.3 Scrape the Links to each Election Year's Data

In [9]:
# Parse the summary page and extract the link to each page containing data for a given Presidential Election Year
election_links = scrape_election_links(archive_url_domain, archive_url_base)
print(election_links)

['https://www.archives.gov/electoral-college/1892', 'https://www.archives.gov/electoral-college/1896', 'https://www.archives.gov/electoral-college/1900', 'https://www.archives.gov/electoral-college/1904', 'https://www.archives.gov/electoral-college/1908', 'https://www.archives.gov/electoral-college/1912', 'https://www.archives.gov/electoral-college/1916', 'https://www.archives.gov/electoral-college/1920', 'https://www.archives.gov/electoral-college/1924', 'https://www.archives.gov/electoral-college/1928', 'https://www.archives.gov/electoral-college/1932', 'https://www.archives.gov/electoral-college/1936', 'https://www.archives.gov/electoral-college/1940', 'https://www.archives.gov/electoral-college/1944', 'https://www.archives.gov/electoral-college/1948', 'https://www.archives.gov/electoral-college/1952', 'https://www.archives.gov/electoral-college/1956', 'https://www.archives.gov/electoral-college/1960', 'https://www.archives.gov/electoral-college/1964', 'https://www.archives.gov/elec
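
`scrape_raw_election_tables` gets the year with `int(link.split('/')[-1])`, which raises a `ValueError` if the summary table ever contains a link that does not end in a year. A minimal sketch of a more forgiving extraction is shown below; `year_from_link` is a hypothetical helper, and the second test link is simply the summary page URL used earlier in this notebook.

```python
import re

# A four-digit number at the very end of the link, with an optional trailing slash
_YEAR_RE = re.compile(r"/(\d{4})/?$")


def year_from_link(link):
    """Return the trailing four-digit year in a link, or None if there is none."""
    match = _YEAR_RE.search(link)
    return int(match.group(1)) if match else None


links = [
    "https://www.archives.gov/electoral-college/1892",
    "https://www.archives.gov/electoral-college/results",  # not a year page
]
print([year_from_link(link) for link in links])
```

A `None` result can then be reported and skipped in the same way the function already reports years that are not in `us_election_years`.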

### 2.4 Scrape the Two Tables Containing each Election Year's Data

In [10]:
# Data tables dict has keys for each year data is available
# Each value is a list with html for the two tables containing election data
# Good to have the raw tables for debugging purposes
raw_election_tables = scrape_raw_election_tables(election_links, us_election_years)

In [11]:
print(raw_election_tables.keys())

dict_keys([1892, 1896, 1900, 1904, 1908, 1912, 1916, 1920, 1924, 1928, 1932, 1936, 1940, 1944, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020])


### 2.5 Parse the Data for All Election Years

In [12]:
parsed_election_years = parse_election_years(raw_election_tables, us_state_names)

Working on Election Year = 1892 (0)
Working on Election Year = 1896 (1)
Working on Election Year = 1900 (2)
Working on Election Year = 1904 (3)
Working on Election Year = 1908 (4)
Working on Election Year = 1912 (5)
Working on Election Year = 1916 (6)
Working on Election Year = 1920 (7)
Working on Election Year = 1924 (8)
Working on Election Year = 1928 (9)
Working on Election Year = 1932 (10)
Working on Election Year = 1936 (11)
Working on Election Year = 1940 (12)
Working on Election Year = 1944 (13)
Working on Election Year = 1948 (14)
Working on Election Year = 1952 (15)
Working on Election Year = 1956 (16)
Working on Election Year = 1960 (17)
Working on Election Year = 1964 (18)
Working on Election Year = 1968 (19)
Working on Election Year = 1972 (20)
Working on Election Year = 1976 (21)
Working on Election Year = 1980 (22)
Working on Election Year = 1984 (23)
Working on Election Year = 1988 (24)
Working on Election Year = 1992 (25)
Working on Election Year = 1996 (26)
Working on 

### 2.6 Validate Accuracy of Parsed Election Data

In [13]:
# This function provides a compact view of the election
# data parsed for a given election year
# The index for each year is printed in the output of the previous cell
year_index = 0
print_election_year_results(parsed_election_years[year_index])

Election Year: 1892
Table 1 Top 2 Candidates + Party: 
	'Grover Cleveland', 'D'
	'Benjamin Harrison', 'R'
Table 2 Candidates + Home State: 
	'Grover Cleveland', 'New York'
	'Benjamin Harrison', 'Indiana'
	'James B. Weaver', 'Iowa'
Table 2 Votes by State: 
	'Alabama', 11, 11, 0, 0
	'Arkansas', 8, 8, 0, 0
	'California', 9, 8, 1, 0
	'Colorado', 4, 0, 0, 4
	'Connecticut', 6, 6, 0, 0
	'Delaware', 3, 3, 0, 0
	'Florida', 4, 4, 0, 0
	'Georgia', 13, 13, 0, 0
	'Idaho', 3, 0, 0, 3
	'Illinois', 24, 24, 0, 0
	'Indiana', 15, 15, 0, 0
	'Iowa', 13, 0, 13, 0
	'Kansas', 10, 0, 0, 10
	'Kentucky', 13, 13, 0, 0
	'Louisiana', 8, 8, 0, 0
	'Maine', 6, 0, 6, 0
	'Maryland', 8, 8, 0, 0
	'Massachusetts', 15, 0, 15, 0
	'Michigan', 14, 5, 9, 0
	'Minnesota', 9, 0, 9, 0
	'Mississippi', 9, 9, 0, 0
	'Missouri', 17, 17, 0, 0
	'Montana', 3, 0, 3, 0
	'Nebraska', 8, 0, 8, 0
	'Nevada', 3, 0, 0, 3
	'New Hampshire', 4, 0, 4, 0
	'New Jersey', 10, 10, 0, 0
	'New York', 36, 36, 0, 0
	'North Carolina', 11, 11, 0, 0
	'North Dakota
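
Reading the rows by eye works for one year, but the same check can be automated. In the output above each tuple appears to be `(state, total, votes for candidate 1, votes for candidate 2, ...)`, so a state's total should usually equal the sum of its candidate columns. The sketch below assumes that layout; `check_vote_rows` is a hypothetical helper, and the `Totals` row in the sample is computed by hand for three states only.

```python
def check_vote_rows(votes_by_state):
    """Return a list of problems found in (state, total, votes...) tuples."""
    problems = []
    state_rows = [row for row in votes_by_state if row[0] != "Totals"]
    totals_rows = [row for row in votes_by_state if row[0] == "Totals"]
    # Each state's total should equal the sum of its per-candidate votes
    for state, total, *candidate_votes in state_rows:
        if total != sum(candidate_votes):
            problems.append(f"{state}: total {total} != sum {sum(candidate_votes)}")
    # The Totals row, if present, should equal the column-wise sums
    for totals in totals_rows:
        column_sums = tuple(sum(col) for col in zip(*(row[1:] for row in state_rows)))
        if totals[1:] != column_sums:
            problems.append(f"Totals row {totals[1:]} != column sums {column_sums}")
    return problems


# Three rows copied from the 1892 output above, plus a Totals row computed
# by hand for just these three states (not the real national totals)
sample = [
    ("California", 9, 8, 1, 0),
    ("Colorado", 4, 0, 0, 4),
    ("Michigan", 14, 5, 9, 0),
    ("Totals", 27, 13, 10, 4),
]
print(check_vote_rows(sample))
```

A reported mismatch is a prompt to look at the source page rather than proof of a parsing error, since an elector can abstain or the source table may carry a footnote.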

In [14]:
# Verify that the # of States that voted for President each year makes sense
# Each count below includes the Totals row, so it is the number of voting states plus one
# I've confirmed that these values are consistent with when each state was
# added to the Union from 1892 to present, plus when DC first voted (1964)
# see https://en.wikipedia.org/wiki/List_of_U.S._states_by_date_of_admission_to_the_Union
print("Year Index, Year Value, # of States Including Totals")
for ind, pyr in enumerate(parsed_election_years):
    print(ind, pyr['year'], len(pyr['t2']['votes_by_state']), sep=", ")

Year Index, Year Value, # of States Including Totals
0, 1892, 45
1, 1896, 46
2, 1900, 46
3, 1904, 46
4, 1908, 47
5, 1912, 49
6, 1916, 49
7, 1920, 49
8, 1924, 49
9, 1928, 49
10, 1932, 49
11, 1936, 49
12, 1940, 49
13, 1944, 49
14, 1948, 49
15, 1952, 49
16, 1956, 49
17, 1960, 51
18, 1964, 52
19, 1968, 52
20, 1972, 52
21, 1976, 52
22, 1980, 52
23, 1984, 52
24, 1988, 52
25, 1992, 52
26, 1996, 52
27, 2000, 52
28, 2004, 52
29, 2008, 52
30, 2012, 52
31, 2016, 52
32, 2020, 52
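
The manual comparison against admission dates can also be written down as a small lookup, so that it reruns whenever the data is scraped again. This is only a sketch: `expected_row_count` is a hypothetical helper, and the breakpoints are hand-entered assumptions (44 states voting in 1892, 45 from 1896, 46 from 1908, 48 from 1912, 50 from 1960, and the District of Columbia added from 1964).

```python
from bisect import bisect_right

# First election year in which each count of voting jurisdictions applies
# (assumption: hand-entered from state admission dates, with the District
# of Columbia first voting in 1964)
_COUNT_STARTS = [1892, 1896, 1908, 1912, 1960, 1964]
_COUNTS = [44, 45, 46, 48, 50, 51]


def expected_row_count(year):
    """Expected rows for an election year: voting jurisdictions plus one Totals row."""
    if year < _COUNT_STARTS[0]:
        raise ValueError("only elections from 1892 onward are covered")
    jurisdictions = _COUNTS[bisect_right(_COUNT_STARTS, year) - 1]
    return jurisdictions + 1


# A few (year, count) pairs copied from the output above
observed = {1892: 45, 1908: 47, 1960: 51, 2020: 52}
mismatches = {yr: n for yr, n in observed.items() if n != expected_row_count(yr)}
print(mismatches)
```

An empty dictionary means every observed count matches the lookup.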


## 3. Write Election Data to Postgres Database

In [None]:
# column_names = ['year', 'state', 'president_candidate_name', 'president_candidate_party', 'president_candidate_state', \
#                 'president_electoral_votes', 'president_electoral_rank', 'president_popular_votes', 'president_popular_rank']
# data_df = pd.DataFrame(columns=column_names)
# data_df.info()
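
Before anything is written to Postgres, the nested structure produced by `parse_election_years` has to be flattened into rows. The sketch below shows one possible long format with one row per state and candidate, reusing a subset of the column names from the commented-out cell above. `flatten_parsed_year` is a hypothetical helper and not part of the notebook, and the sample input contains only two states copied from the 1892 output.

```python
def flatten_parsed_year(parsed_year):
    """Turn one parsed election year into one row per (state, candidate)."""
    candidates = parsed_year["t2"]["candidate_state"]
    rows = []
    for state, _total, *votes in parsed_year["t2"]["votes_by_state"]:
        if state == "Totals":
            continue  # national totals can be recomputed from the state rows
        for (name, home_state), electoral_votes in zip(candidates, votes):
            rows.append({
                "year": parsed_year["year"],
                "state": state,
                "president_candidate_name": name,
                "president_candidate_state": home_state,
                "president_electoral_votes": electoral_votes,
            })
    return rows


# Minimal sample in the same shape as one element of parsed_election_years
sample_year = {
    "year": 1892,
    "t1": [("Grover Cleveland", "D"), ("Benjamin Harrison", "R")],
    "t2": {
        "candidate_state": [
            ("Grover Cleveland", "New York"),
            ("Benjamin Harrison", "Indiana"),
            ("James B. Weaver", "Iowa"),
        ],
        "votes_by_state": [
            ("California", 9, 8, 1, 0),
            ("Colorado", 4, 0, 0, 4),
        ],
    },
}
rows = flatten_parsed_year(sample_year)
print(len(rows))
print(rows[0])
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(rows)`, which keeps the flattening step independent of how the table is eventually written to the database.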