# US Presidential Election Analysis: Electoral College, Popular Vote, or Both?

## Objective
This notebook is the first step in a larger effort to analyze historical US presidential election data. It scrapes the Electoral College results for every US presidential election available from the [National Archives Website](https://www.archives.gov/electoral-college/results) (i.e. from 1892 to the present), and will then write the results to a Postgres database for subsequent analysis. The notebook covers the following steps (checked boxes are complete):
1. [X] Initial Setup: Import Modules, Define Functions, and Set Parameter Values
2. [X] Scrape Electoral College Data from the National Archives Website
    1. Define Set containing All Presidential Election Years
    2. Define Set containing All US "States" that Vote in Presidential Elections (includes Washington DC)
    3. Scrape National Archive Summary web page for Links to each Election Year's Data
    4. Scrape each Election Year's web page to download the two tables containing all Election Data
    5. Parse the Data for All Election Years into a useable, compact format
    6. Validate Accuracy of Parsed Election Data
3. [ ] Validate and Transform Parsed Election Data
    1. Spot Check Parsed Data for Individual Election Years
    2. Validate that each Election Year has the Correct # of States
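    3. Transform and Validate Candidate Data
    4. Transform and Validate State Data
    5. Transform and Validate Votes By State Data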
4. [ ] Write Election Data Tables to Postgres

### Notes
- The National Archives website only contains Electoral College results for US Presidential Elections from 1892 to present. Another data source will be scraped to get Popular Vote data for each Presidential Election.
- Currently I'm only scraping information regarding the Presidential Candidates and their electoral college vote tallies; however, Vice Presidential results are also available, so I can circle back to include that data if the need arises.
- FYI: A candidate who wins a majority or plurality of the nationwide popular vote has a good chance of winning in the Electoral College, but there is no guarantee, as the 1824, 1876, 1888, 2000, and 2016 elections show (see Reference 1 below).

### Data Sources
1. US Presidential Election Electoral College voting results: https://www.archives.gov/electoral-college/results
2. US States shapefile: https://www2.census.gov/geo/tiger/TIGER2019/STATE/tl_2019_us_state.zip

### References
1. Electoral College History: https://www.archives.gov/electoral-college/history
2. https://towardsdatascience.com/scraping-table-data-from-websites-using-a-single-line-in-python-ba898d54e2bc
3. https://searchdatamanagement.techtarget.com/definition/star-schema

## 1. Setup

### 1.1 Import Modules

In [1]:
# import modules
from bs4 import BeautifulSoup
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
import requests

### 1.2 Define Functions

In [369]:
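def get_html_tables(url, div_id="main-col", find_all=False):
    # Download the page and fail fast on HTTP errors instead of parsing an error page
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, "html.parser")
    # Only search for tables inside the div with the given id
    div = soup.find("div", id=div_id)
    if find_all:
        return div.find_all("table")
    else:
        return div.find("table")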
    
def scrape_election_links(archive_url_domain, archive_url_base):
    link_table = get_html_tables(archive_url_domain+archive_url_base)
    return [archive_url_domain+a['href'] for a in link_table.find_all("a")]    

def scrape_raw_election_tables(election_links, us_election_years):
    raw_election_tables = {}
    for link in election_links:
        link_year = int(link.split('/')[-1])
        if link_year in us_election_years:
            raw_election_tables[link_year] = get_html_tables(link, find_all=True)
        else:
            print(f"Error: The link year, {link_year}, parsed from the following link does not match a US election year: \n{link}")
    return raw_election_tables
    
def parse_election_years(data_tables, state_names):
    parsed_years = []
    for ind, year in enumerate(data_tables.keys()):
        print(f"Working on Election Year = {year} ({ind})")
        parsed_dict = parse_election_year_tables(data_tables[year], state_names)
        parsed_dict['year'] = year
        parsed_years.append(parsed_dict)
    return parsed_years

def parse_election_year_tables(year_tables, state_names):
    parsed_tables = {}
    parsed_tables['t1'] = parse_table1(year_tables[0].find_all('tr'))
    parsed_tables['t2'] = parse_table2(year_tables[1].find_all('tr'), state_names)
    return parsed_tables

def parse_table1(t1_rows):
    cp_row_inds = [0, 1]
    cp_row_headers = ["President", "Main Opponent"]
    candidate_party = []
    for ri, rh in zip(cp_row_inds, cp_row_headers):
        candidate_party.append(parse_t1_candidate_party(t1_rows, ri, rh))
    return candidate_party
    
# Parse a Table 1 row to get a Presidential Candidate's Name and Party
def parse_t1_candidate_party(t1_rows, row_ind, row_header):
    if t1_rows[row_ind].find('th').get_text() == row_header:
        row_data = t1_rows[row_ind].find('td').get_text()
        candidate, party = row_data.split(' [')
        return {
            'president_candidate_name': candidate.strip(" *").replace(",", ""),
            'president_candidate_party': party.strip(" *]")
        }
    else:
        print(f"Error: Row{row_ind} does not contain data for {row_header}")

def parse_table2(t2_rows, state_names):
    t2_data = {}
    num_candidates = parse_t2_num_candidates(t2_rows[0])
    t2_data['candidate_state'] = parse_t2_candidate_state(t2_rows[1], num_candidates)
    t2_data['votes_by_state'] = parse_t2_votes_by_state(t2_rows[2:], num_candidates, state_names)
    return t2_data

def parse_t2_num_candidates(header_row):
    # `string=` is the current name of Beautiful Soup's older `text=` argument
    return int(header_row.find('th', string="For President").get('colspan'))
    
def parse_t2_candidate_state(cs_row, num_candidates):
    cs_cols = cs_row.find_all('td')
    candidate_state = []
    for ci, cs in enumerate(cs_cols[:num_candidates]):
        if cs.find('br'):
            text = " ".join(cs.stripped_strings)
        else:
            text = cs.get_text()
        if text == "Other":
            candidate, state = text, None
        else:
            candidate, state = text.split(" of ")
            candidate = candidate.strip(", *").replace(",", "")
            state = state.strip(" *").replace(",", "")
        candidate_state.append({ \
            'president_candidate_name': candidate, \
            'col_ind': ci+1, \
            'president_candidate_state': state \
        })
        #candidate_state.append((candidate.strip(", *"), state.strip(", *")))
    return candidate_state

def parse_t2_votes_by_state(states_rows, num_candidates, state_names):
    votes_by_state = []
    for sr in states_rows:
        state_cols = sr.find_all('td')
        col_0_text = state_cols[0].get_text().strip(" *")
        if col_0_text in state_names:
            state = col_0_text
            start_ind, end_ind = 1, num_candidates+2
        elif col_0_text in {"Total", "Totals"}:
            state = "Totals"
            start_ind, end_ind = 1, num_candidates+2
        elif sr.find('th', string='Total'):
            state = "Totals"
            start_ind, end_ind = 0, num_candidates+1
        else:
            state = False
        # Only parse vote data and store it if a valid state value is found
        # This helps validate the state name was parsed correctly, and
        # skips the Notes row at the end of some of the tables
        if state:
            state_votes = {'state': state}
            for si, sv in enumerate(state_cols[start_ind:end_ind]):
                votes = sv.get_text()
                if si == 0:
                    si = 'total_electoral_votes'
                state_votes[si] = int(votes) if votes != '-' else 0
            votes_by_state.append(state_votes)
            # state_votes = [state]   
            # for si, sv in enumerate(state_cols[start_ind:end_ind]):
            #     votes = sv.get_text()
            #     state_votes.append(int(votes) if votes != '-' else 0)
            # votes_by_state.append(tuple(state_votes))
    return votes_by_state

def print_election_year_results(parsed_year):
    print(f"Election Year: {parsed_year['year']}")
    print(f"Table 1 Top 2 Candidates + Party: \n{pprint_list_of_dicts(parsed_year['t1'])}")
    print(f"Table 2 Candidates + Home State: \n{pprint_list_of_dicts(parsed_year['t2']['candidate_state'])}")
    print(f"Table 2 Votes by State: \n{pprint_list_of_dicts(parsed_year['t2']['votes_by_state'])}")
    
def pprint_list_of_dicts(list_of_dicts):
    list_of_strings = [str(d).strip('{}') for d in list_of_dicts]
    return '\t'+'\n\t'.join(list_of_strings)

def get_name_middle_last(middle_last):
    # Non-string input (e.g. None or NaN when there is no remainder) has no .split()
    try:
        split = middle_last.split()
    except AttributeError:
        split = []
    if len(split) == 1:
        return (None, split[0])
    elif len(split) > 1:
        # Treat the first token as the middle name; any further spaces are part of the last name
        return (split[0], " ".join(split[1:]))
    else:
        return (None, None)

def make_map_usa(df, col2plot, figsize=(15,8), title=None, fontsize=18, cmap='Blues', edgecolor='k'):
    # Define fixed longitude limit, xlim, and latitude limit, ylim, for USA
    xlim = (-172, -58)
    ylim = (16, 74)
    fig, ax = plt.subplots(1, figsize=figsize)
    df.plot(column=col2plot, ax=ax, cmap=cmap, edgecolor=edgecolor)
    ax.axis('off')
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(vmin=df[col2plot].min(), vmax=df[col2plot].max()))
    sm.set_array([])
    cb = fig.colorbar(sm)
    cb.set_label(col2plot, fontsize=fontsize)
    cb.ax.tick_params(labelsize=fontsize)
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
    if title:
        ax.set_title(title, fontsize=fontsize+4)
    plt.tight_layout()

### 1.3 Set Parameters

Define the base URL for the National Archives, and the resource location for the summary page containing links to Presidential election data for each year. The shape file containing US State data can be downloaded from [this link](https://www2.census.gov/geo/tiger/TIGER2019/STATE/).

In [3]:
latest_election_year = 2020
archive_url_domain = "https://www.archives.gov"
archive_url_base = "/electoral-college/results"
usa_state_shp = "/home/fdpearce/Documents/Projects/data/Maps/State_Shapes/tl_2019_us_state/tl_2019_us_state.shp"

## 2. Scrape Electoral College Data from the National Archives Website

### 2.1 Define Set Containing All Presidential Election Years

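Build `us_election_years`, a collection of every year in which a US Presidential Election occurred. It is created as a list first, so the years can be printed in order, and then converted to a set. The set is used to check the year parsed from each scraped link. See the National Archives page at `archive_url_domain + archive_url_base` for the election years that have data. A set is used so that membership checks run in O(1) time on average.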

In [127]:
us_election_years = [1789]+list(range(1792, latest_election_year+4, 4))
print(*us_election_years)

1789 1792 1796 1800 1804 1808 1812 1816 1820 1824 1828 1832 1836 1840 1844 1848 1852 1856 1860 1864 1868 1872 1876 1880 1884 1888 1892 1896 1900 1904 1908 1912 1916 1920 1924 1928 1932 1936 1940 1944 1948 1952 1956 1960 1964 1968 1972 1976 1980 1984 1988 1992 1996 2000 2004 2008 2012 2016 2020


In [5]:
us_election_years = set(us_election_years)
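
As a quick sanity check on the year set, the sketch below rebuilds it from scratch and confirms its size and a few memberships. It assumes `latest_election_year = 2020`, as set in Section 1.3; `build_election_years` is a hypothetical helper name that does not exist elsewhere in this notebook.

```python
def build_election_years(latest_election_year=2020):
    """Return the set of US presidential election years through latest_election_year."""
    # The first election is listed under 1789; later ones follow a 4-year cycle from 1792
    return {1789} | set(range(1792, latest_election_year + 4, 4))


years = build_election_years()
print(len(years))     # number of election years from 1789 through 2020
print(2020 in years)  # membership check against the set
print(2019 in years)
```

If the size or a membership check looks wrong, the `range` bounds are the first thing to inspect.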

### 2.2 Define Set Containing All US States

Load all the data from the USA States shape file. For now, only the state names are extracted, but later the geometry column can be used to generate maps, extract state features, etc. The names of US Territories that don't participate in Presidential Elections are dropped from the variable containing the set of states.

In [6]:
usa = gpd.read_file(usa_state_shp)
us_state_names = usa['NAME'].values
print(sorted(us_state_names))

['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Commonwealth of the Northern Mariana Islands', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Guam', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'United States Virgin Islands', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']


In [7]:
us_state_names = set(us_state_names)
print(f"Total # of States = {len(us_state_names)}")

Total # of States = 56


In [8]:
territories = ['American Samoa', 'Commonwealth of the Northern Mariana Islands', 'Guam', 'Puerto Rico', 'United States Virgin Islands']
us_state_names.difference_update(territories)
print(f"Total # of States = {len(us_state_names)}")

Total # of States = 51


### 2.3 Scrape the Links to each Election Year's Data

Parse the HTML summary page and extract the link to the page containing the data for each Presidential Election Year.

In [9]:
election_links = scrape_election_links(archive_url_domain, archive_url_base)
print(election_links)

['https://www.archives.gov/electoral-college/1892', 'https://www.archives.gov/electoral-college/1896', 'https://www.archives.gov/electoral-college/1900', 'https://www.archives.gov/electoral-college/1904', 'https://www.archives.gov/electoral-college/1908', 'https://www.archives.gov/electoral-college/1912', 'https://www.archives.gov/electoral-college/1916', 'https://www.archives.gov/electoral-college/1920', 'https://www.archives.gov/electoral-college/1924', 'https://www.archives.gov/electoral-college/1928', 'https://www.archives.gov/electoral-college/1932', 'https://www.archives.gov/electoral-college/1936', 'https://www.archives.gov/electoral-college/1940', 'https://www.archives.gov/electoral-college/1944', 'https://www.archives.gov/electoral-college/1948', 'https://www.archives.gov/electoral-college/1952', 'https://www.archives.gov/electoral-college/1956', 'https://www.archives.gov/electoral-college/1960', 'https://www.archives.gov/electoral-college/1964', 'https://www.archives.gov/elec

### 2.4 Scrape the Two Tables Containing each Election Year's Data

`raw_election_tables` is a dict with one key for each election year that has data available. Each value is a list holding the HTML of the two tables that contain that year's election data. The raw tables are kept in their own variable primarily for debugging purposes.

In [10]:
raw_election_tables = scrape_raw_election_tables(election_links, us_election_years)

In [126]:
print("Data is currently available from the National Archives website for the following election years:")
print(*raw_election_tables.keys(), sep=", ")

Data is available from the National Archives website for the following election years:
1892, 1896, 1900, 1904, 1908, 1912, 1916, 1920, 1924, 1928, 1932, 1936, 1940, 1944, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016, 2020
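
`scrape_raw_election_tables` takes the year from the last path segment of each link with `int(link.split('/')[-1])`, which raises a `ValueError` if a link ever ends in something other than a number. A more defensive variant might look like the sketch below. `parse_link_year` is a hypothetical helper, and the non-year URL in the example is made up purely for illustration.

```python
def parse_link_year(link):
    """Return the trailing path segment of link as an int, or None if it is not numeric."""
    # Drop any trailing slash so ".../1892/" and ".../1892" behave the same
    last_segment = link.rstrip("/").split("/")[-1]
    return int(last_segment) if last_segment.isdigit() else None


links = [
    "https://www.archives.gov/electoral-college/1892",
    "https://www.archives.gov/electoral-college/2020/",
    "https://www.archives.gov/electoral-college/about",  # made-up non-year link
]
for link in links:
    print(link, "->", parse_link_year(link))
```

Links that return `None` could then be reported and skipped instead of stopping the whole scrape.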


### 2.5 Parse the Data for All Election Years

In [370]:
parsed_election_years = parse_election_years(raw_election_tables, us_state_names)

Working on Election Year = 1892 (0)
Working on Election Year = 1896 (1)
Working on Election Year = 1900 (2)
Working on Election Year = 1904 (3)
Working on Election Year = 1908 (4)
Working on Election Year = 1912 (5)
Working on Election Year = 1916 (6)
Working on Election Year = 1920 (7)
Working on Election Year = 1924 (8)
Working on Election Year = 1928 (9)
Working on Election Year = 1932 (10)
Working on Election Year = 1936 (11)
Working on Election Year = 1940 (12)
Working on Election Year = 1944 (13)
Working on Election Year = 1948 (14)
Working on Election Year = 1952 (15)
Working on Election Year = 1956 (16)
Working on Election Year = 1960 (17)
Working on Election Year = 1964 (18)
Working on Election Year = 1968 (19)
Working on Election Year = 1972 (20)
Working on Election Year = 1976 (21)
Working on Election Year = 1980 (22)
Working on Election Year = 1984 (23)
Working on Election Year = 1988 (24)
Working on Election Year = 1992 (25)
Working on Election Year = 1996 (26)
Working on 
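
The Table 1 parsing step boils down to a small amount of string handling, shown in isolation below. The sample cell strings are assumptions about what the raw table text looks like (a name followed by a bracketed party code, optionally marked with an asterisk), inferred from the `split(' [')` and `strip` calls in `parse_t1_candidate_party`; the live pages may differ. `split_candidate_party` is a hypothetical name used only for this sketch.

```python
def split_candidate_party(cell_text):
    """Split a 'Name [Party]' cell into a (name, party) tuple."""
    candidate, party = cell_text.split(" [")
    # Strip footnote asterisks and spaces from the name, and drop any commas
    candidate = candidate.strip(" *").replace(",", "")
    # Strip the closing bracket plus any footnote asterisks from the party code
    party = party.strip(" *]")
    return candidate, party


# Assumed examples of raw cell text
for text in ["Franklin D. Roosevelt [D]", "William J. Bryan [D-P]", "Grover Cleveland* [D]"]:
    print(split_candidate_party(text))
```

A cell without a `" ["` separator would make the tuple unpacking fail, which is one reason to spot check the parsed output in Section 3.1.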

## 3. Validate and Transform Parsed Election Data
The data transformation done in this section ultimately creates three tables, based on a star schema design (see Reference 3 above):
1. Candidate dimension table built in Section 3.3
2. State dimension table built in Section 3.4
3. Votes by Year fact table built in Section 3.5

Several validation steps are performed along the way.

### 3.1 Spot Check Parsed Data for Individual Election Years
Print a compact view of the data parsed for a given election year. The year index for each election is shown in parentheses in the output of Section 2.5 and in the first column of the table in Section 3.2. The output also serves as a useful reference for the parsed data structure of a single election year when developing later sections.

In [371]:
year_index = 13
print_election_year_results(parsed_election_years[year_index])

Election Year: 1944
Table 1 Top 2 Candidates + Party: 
	'president_candidate_name': 'Franklin D. Roosevelt', 'president_candidate_party': 'D'
	'president_candidate_name': 'Thomas E. Dewey', 'president_candidate_party': 'R'
Table 2 Candidates + Home State: 
	'president_candidate_name': 'Franklin D. Roosevelt', 'col_ind': 1, 'president_candidate_state': 'New York'
	'president_candidate_name': 'Thomas E. Dewey', 'col_ind': 2, 'president_candidate_state': 'New York'
Table 2 Votes by State: 
	'state': 'Alabama', 'total_electoral_votes': 11, 1: 11, 2: 0
	'state': 'Arizona', 'total_electoral_votes': 4, 1: 4, 2: 0
	'state': 'Arkansas', 'total_electoral_votes': 9, 1: 9, 2: 0
	'state': 'California', 'total_electoral_votes': 25, 1: 25, 2: 0
	'state': 'Colorado', 'total_electoral_votes': 6, 1: 0, 2: 6
	'state': 'Connecticut', 'total_electoral_votes': 8, 1: 8, 2: 0
	'state': 'Delaware', 'total_electoral_votes': 3, 1: 3, 2: 0
	'state': 'Florida', 'total_electoral_votes': 8, 1: 8, 2: 0
	'state': 'Geo

### 3.2 Validate that each Election Year has the Correct # of States
Verify that the # of States that voted for President each year makes sense. Each count below includes the Totals row, so the number of voting states (plus Washington DC from 1964 onward) is one less than the value shown. I've confirmed that these values are consistent with when each state was added to the Union from 1892 to present, plus when Washington DC was first allowed to vote (1964). See [this link](https://en.wikipedia.org/wiki/List_of_U.S._states_by_date_of_admission_to_the_Union) for details on when each state joined the Union and was allowed to cast electoral votes.

In [372]:
print("Year Index, Year Value, # of States Including Totals")
for ind, pyr in enumerate(parsed_election_years):
    print(ind, pyr['year'], len(pyr['t2']['votes_by_state']), sep=", ")

Year Index, Year Value, # of States Including Totals
0, 1892, 45
1, 1896, 46
2, 1900, 46
3, 1904, 46
4, 1908, 47
5, 1912, 49
6, 1916, 49
7, 1920, 49
8, 1924, 49
9, 1928, 49
10, 1932, 49
11, 1936, 49
12, 1940, 49
13, 1944, 49
14, 1948, 49
15, 1952, 49
16, 1956, 49
17, 1960, 51
18, 1964, 52
19, 1968, 52
20, 1972, 52
21, 1976, 52
22, 1980, 52
23, 1984, 52
24, 1988, 52
25, 1992, 52
26, 1996, 52
27, 2000, 52
28, 2004, 52
29, 2008, 52
30, 2012, 52
31, 2016, 52
32, 2020, 52
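
The manual comparison against admission dates can also be automated. The sketch below is one way to do it, under these assumptions: 44 states voted in 1892, the first presidential election for each later addition is as listed in `FIRST_ELECTION`, and every parsed year contains exactly one Totals row. `expected_row_count` and `observed` are hypothetical names; the `observed` values are copied from the output above.

```python
STATES_VOTING_IN_1892 = 44

# First presidential election in which each later addition cast electoral votes
FIRST_ELECTION = {
    "Utah": 1896,
    "Oklahoma": 1908,
    "New Mexico": 1912,
    "Arizona": 1912,
    "Alaska": 1960,
    "Hawaii": 1960,
    "District of Columbia": 1964,
}


def expected_row_count(year):
    """Expected number of parsed rows for an election year, including the Totals row."""
    added = sum(1 for first_year in FIRST_ELECTION.values() if first_year <= year)
    return STATES_VOTING_IN_1892 + added + 1


# A few (year, row count) pairs taken from the output above
observed = {1892: 45, 1896: 46, 1908: 47, 1912: 49, 1960: 51, 1964: 52, 2020: 52}
mismatches = {y: n for y, n in observed.items() if n != expected_row_count(y)}
print("Mismatched years:", mismatches)
```

An empty `mismatches` dict means every checked year has the expected number of rows.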


### 3.3 Transform and Validate Candidate Data
This section creates the Candidate dimension table that will be written to Postgres in subsequent steps. I start with the Candidate State data from Table 2, as that gives the complete list of presidential candidates who received electoral votes. The exception is the 2016 election, where two columns are labeled "Other" due to the unprecedented number of presidential candidates who received electoral votes. I'd need to circle back and parse the Notes section of Table 2 to get the names of all candidates who received at least one electoral vote in 2016, and it may simply be easier to enter that info manually at a later date.

#### **Extract and Validate Table 2 Data**

In [464]:
t2_states_df = pd.json_normalize(parsed_election_years, ['t2', 'candidate_state'], ['year'])

In [465]:
t2_states_df.head()

Unnamed: 0,president_candidate_name,col_ind,president_candidate_state,year
0,Grover Cleveland,1,New York,1892
1,Benjamin Harrison,2,Indiana,1892
2,James B. Weaver,3,Iowa,1892
3,William McKinley,1,Ohio,1896
4,William J. Bryan,2,Nebraska,1896


In [466]:
t2_states_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 4 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   president_candidate_name   79 non-null     object
 1   col_ind                    79 non-null     int64 
 2   president_candidate_state  77 non-null     object
 3   year                       79 non-null     object
dtypes: int64(1), object(3)
memory usage: 2.6+ KB


In [467]:
t2_states_df[t2_states_df['president_candidate_state'].isna()]

Unnamed: 0,president_candidate_name,col_ind,president_candidate_state,year
74,Other,2,,2016
76,Other,4,,2016


#### **Create Candidates DataFrame**

In [544]:
candidates_df = t2_states_df[['president_candidate_name', 'president_candidate_state']].drop_duplicates().reset_index(drop=True)

In [545]:
num_can_st = len(t2_states_df[['president_candidate_name', 'president_candidate_state']])
print(f"Original number of Candidate-State combinations = {num_can_st}")
num_unique_can_st = len(candidates_df)
print(f"Unique number of Candidate-State combinations = {num_unique_can_st}")
num_unique_can = len(candidates_df.drop_duplicates(subset='president_candidate_name'))
print(f"Unique number of Candidates = {num_unique_can}")

Original number of Candidate-State combinations = 79
Unique number of Candidate-State combinations = 57
Unique number of Candidates = 56


#### **Update Column Names**

In [546]:
# Remove the 'president_candidate_' prefix from the column names
candidates_df.columns = candidates_df.columns.str.replace("president_candidate_", "")

#### **Aggregate Candidates with Multiple State Affiliations**
Combine all state affiliations into a single column, 'state', separated by hyphens. This aggregation also yields a DataFrame at the correct grain, i.e. one row per candidate. Note that this produces the desired result without the need for sorting, as I want the first state affiliation to be the primary one.

In [547]:
candidates_df.groupby('name').size()

name
Adlai Stevenson          1
Albert Gore Jr.          1
Alfred E. Smith          1
Alfred M. Landon         1
Alton B. Parker          1
Barack Obama             1
Barry M. Goldwater       1
Benjamin Harrison        1
Calvin Coolidge          1
Charles E. Hughes        1
Donald J. Trump          1
Donald Trump             1
Dwight D. Eisenhower     1
Franklin D. Roosevelt    1
George Bush              1
George C. Wallace        1
George McGovern          1
George W. Bush           1
Gerald R. Ford           1
Grover Cleveland         1
Harry F. Byrd            1
Harry S. Truman          1
Herbert C. Hoover        1
Hillary Clinton          1
Hubert H. Humphrey       1
J. Strom Thurmond        1
James B. Weaver          1
James M. Cox             1
Jimmy Carter             1
John Edwards             1
John F. Kennedy          1
John F. Kerry            1
John Hospers             1
John McCain              1
John W. Davis            1
Joseph R. Biden Jr.      1
Lloyd Bentsen          
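
The `groupby` output above lists both "Donald J. Trump" and "Donald Trump", so the same person currently ends up as two rows in the candidate table. One possible fix is to map known aliases to a canonical name before deduplicating. The sketch below is an assumption about how that could be done: `NAME_ALIASES` and `normalize_name` are hypothetical, and choosing "Donald J. Trump" as the canonical spelling is arbitrary.

```python
# Hypothetical alias table: maps alternate spellings to one canonical name
NAME_ALIASES = {
    "Donald Trump": "Donald J. Trump",
}


def normalize_name(name):
    """Return the canonical spelling for a candidate name, or the name unchanged."""
    return NAME_ALIASES.get(name, name)


names = ["Donald J. Trump", "Donald Trump", "Hillary Clinton"]
# dict.fromkeys removes duplicates while keeping first-seen order
unique_names = list(dict.fromkeys(normalize_name(n) for n in names))
print(unique_names)
```

With a DataFrame, the same mapping could be applied with `candidates_df['name'].map(normalize_name)` before the `groupby`.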

In [548]:
# Richard M. Nixon is the only candidate with more than one State association
candidates_df[candidates_df['name']=="Richard M. Nixon"]

Unnamed: 0,name,state
27,Richard M. Nixon,California
31,Richard M. Nixon,New York


In [549]:
candidates_df = candidates_df.groupby('name')['state'].agg(state=lambda x: "-".join(i if i is not None else "" for i in x)).reset_index()

In [550]:
candidates_df

Unnamed: 0,name,state
0,Adlai Stevenson,Illinois
1,Albert Gore Jr.,Tennessee
2,Alfred E. Smith,New York
3,Alfred M. Landon,Kansas
4,Alton B. Parker,New York
5,Barack Obama,Illinois
6,Barry M. Goldwater,Arizona
7,Benjamin Harrison,Indiana
8,Calvin Coolidge,Massachusetts
9,Charles E. Hughes,New York


#### **Parse State Column to Create Primary, Secondary State Columns**

In [551]:
# Split the hyphen-joined state string into primary and secondary state columns
candidates_df[['state', 'state_2']] = candidates_df['state'].str.split("-", n=1, expand=True)
candidates_df.loc[candidates_df['state']=="", 'state'] = None
candidates_df

Unnamed: 0,name,state,state_2
0,Adlai Stevenson,Illinois,
1,Albert Gore Jr.,Tennessee,
2,Alfred E. Smith,New York,
3,Alfred M. Landon,Kansas,
4,Alton B. Parker,New York,
5,Barack Obama,Illinois,
6,Barry M. Goldwater,Arizona,
7,Benjamin Harrison,Indiana,
8,Calvin Coolidge,Massachusetts,
9,Charles E. Hughes,New York,


#### **Use Full Name to Create Columns for First, Middle, Last, and Suffix of Name**

In [552]:
# Take first split on candidate name to get first name (that's the easy one, lol)
candidates_df[['name_first', 'name_remainder']] = candidates_df['name'].str.split(n=1, expand=True)

In [553]:
# Remove a trailing Jr. suffix (and any comma before it) from name_remainder prior to parsing middle/last name
candidates_df['name_remainder'] = candidates_df['name_remainder'].str.replace(r",? Jr\.?$", "", regex=True)

In [554]:
# Use get_name_middle_last function to split name_remainder into middle and last name
candidates_df[['name_middle', 'name_last']] = pd.DataFrame(candidates_df['name_remainder'].map(get_name_middle_last).tolist(), index=candidates_df.index)
candidates_df.drop('name_remainder', axis=1, inplace=True)

In [555]:
# Finally add column containing suffix "Jr." if present in full name
candidates_df['name_suffix'] = candidates_df['name'].map(lambda x: "Jr." if x.endswith("Jr.") else None)
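
The name-splitting steps above can be read as a single function. The sketch below restates them in plain Python for individual strings so the rules are easy to test on edge cases; `split_full_name` is a hypothetical helper that mirrors, but does not replace, the DataFrame operations in this section.

```python
import re


def split_full_name(full_name):
    """Split a full name into (first, middle, last, suffix) using the rules above."""
    suffix = "Jr." if full_name.endswith("Jr.") else None
    # Remove a trailing "Jr." (and any comma before it) before splitting on spaces
    parts = re.sub(r",? Jr\.?$", "", full_name).split()
    if not parts:
        return None, None, None, suffix
    first, rest = parts[0], parts[1:]
    if not rest:
        middle, last = None, None
    elif len(rest) == 1:
        middle, last = None, rest[0]
    else:
        # First remaining token is the middle name; everything after it is the last name
        middle, last = rest[0], " ".join(rest[1:])
    return first, middle, last, suffix


for name in ["Albert Gore Jr.", "Franklin D. Roosevelt", "J. Strom Thurmond"]:
    print(name, "->", split_full_name(name))
```

Note that a name such as "J. Strom Thurmond" is split with "J." as the first name and "Strom" as the middle name, which may or may not be the desired result.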

#### **Reorder Columns to Complete Table 2 Data Transformation**

In [557]:
# Reorder columns so names are all together
candidates_df = candidates_df.iloc[:, [0, 3, 4, 5, 6, 1, 2]]
candidates_df

Unnamed: 0,name,name_first,name_middle,name_last,name_suffix,state,state_2
0,Adlai Stevenson,Adlai,,Stevenson,,Illinois,
1,Albert Gore Jr.,Albert,,Gore,Jr.,Tennessee,
2,Alfred E. Smith,Alfred,E.,Smith,,New York,
3,Alfred M. Landon,Alfred,M.,Landon,,Kansas,
4,Alton B. Parker,Alton,B.,Parker,,New York,
5,Barack Obama,Barack,,Obama,,Illinois,
6,Barry M. Goldwater,Barry,M.,Goldwater,,Arizona,
7,Benjamin Harrison,Benjamin,,Harrison,,Indiana,
8,Calvin Coolidge,Calvin,,Coolidge,,Massachusetts,
9,Charles E. Hughes,Charles,E.,Hughes,,New York,


#### **Extract and Validate Table 1 Candidate Data**

In [397]:
t1_df = pd.json_normalize(parsed_election_years, 't1', ['year'])

In [398]:
t1_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 3 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   president_candidate_name   66 non-null     object
 1   president_candidate_party  66 non-null     object
 2   year                       66 non-null     object
dtypes: object(3)
memory usage: 1.7+ KB


In [399]:
t1_df.head()

Unnamed: 0,president_candidate_name,president_candidate_party,year
0,Grover Cleveland,D,1892
1,Benjamin Harrison,R,1892
2,William McKinley,R,1896
3,William J. Bryan,D-P,1896
4,William McKinley,R,1900


In [400]:
t1_df['president_candidate_party'].value_counts()

R      32
D      31
D-P     2
P       1
Name: president_candidate_party, dtype: int64

In [401]:
num_can_pa = len(t1_df[['president_candidate_name', 'president_candidate_party']])
print(f"Original number of Candidate-Party combinations = {num_can_pa}")
num_unique_can_pa = len(t1_df.drop_duplicates(subset=['president_candidate_name', 'president_candidate_party']))
print(f"Unique number of Candidate-Party combinations = {num_unique_can_pa}")
num_unique_can = len(t1_df.drop_duplicates(subset='president_candidate_name'))
print(f"Unique number of Candidates = {num_unique_can}")

Original number of Candidate-Party combinations = 66
Unique number of Candidate-Party combinations = 47
Unique number of Candidates = 45


In [402]:
# William J. Bryan has two party affiliations: Primary = "D" and Secondary = "P"
t1_df[t1_df['president_candidate_name']=="William J. Bryan"]

Unnamed: 0,president_candidate_name,president_candidate_party,year
3,William J. Bryan,D-P,1896
5,William J. Bryan,D-P,1900
9,William J. Bryan,D,1908


In [403]:
# Only William J. Bryan has a split party designation
t1_df[t1_df['president_candidate_party'].str.contains("-")]

Unnamed: 0,president_candidate_name,president_candidate_party,year
3,William J. Bryan,D-P,1896
5,William J. Bryan,D-P,1900


In [404]:
# Theodore Roosevelt is the only candidate to change parties: Primary = "R" and Secondary = "P"
t1_df[t1_df['president_candidate_name']=='Theodore Roosevelt']

Unnamed: 0,president_candidate_name,president_candidate_party,year
6,Theodore Roosevelt,R,1904
11,Theodore Roosevelt,P,1912


#### **Create Candidates Party DataFrame**
This Candidate table will be joined to the one above to add party affiliation if available.

In [442]:
candidates_party_df = t1_df[['president_candidate_name', 'president_candidate_party']].drop_duplicates().reset_index(drop=True)

#### **Update Column Names**

In [443]:
# Remove the 'president_candidate_' prefix from the column names
candidates_party_df.columns = candidates_party_df.columns.str.replace("president_candidate_", "")
candidates_party_df.head()

Unnamed: 0,name,party
0,Grover Cleveland,D
1,Benjamin Harrison,R
2,William McKinley,R
3,William J. Bryan,D-P
4,Theodore Roosevelt,R


#### **Aggregate Candidates with Multiple Party Affiliations**
Combine all party affiliations into a single column, 'party', separated by hyphens. This aggregation also yields a DataFrame at the correct grain, i.e. one row per candidate. Note that this produces the desired result without the need for sorting as I want the first party affiliation to be the primary one. May need to revisit this assumption in the future...

In [445]:
candidates_party_df = candidates_party_df.groupby('name')['party'].agg(party="-".join).reset_index()

#### **Parse Party Column to Create Primary, Secondary Party Columns**

In [447]:
# Split the hyphen-joined party string into primary and secondary party columns
candidates_party_df[['party', 'party_2']] = candidates_party_df['party'].str.split("-", n=1, expand=True)
# Keep only the first character of the remainder (e.g. "P-D" -> "P"); missing values become None
candidates_party_df['party_2'] = candidates_party_df['party_2'].map(lambda x: x[0] if isinstance(x, str) and x else None)
candidates_party_df

Unnamed: 0,name,party,party_2
0,Adlai Stevenson,D,
1,Albert Gore Jr.,D,
2,Alfred E. Smith,D,
3,Alfred M. Landon,R,
4,Alton B. Parker,D,
5,Barack Obama,D,
6,Barry M. Goldwater,R,
7,Benjamin Harrison,R,
8,Bob Dole,R,
9,Calvin Coolidge,R,
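
For William J. Bryan, the aggregation joins "D-P" and "D" into "D-P-D", and the split above then keeps "D" as the primary party and the first character of "P-D" as the secondary one. That works because every party code here is a single letter. The sketch below shows an alternative that collects distinct codes in first-seen order instead, which would also cope with longer codes. It is an illustration only, and `primary_secondary_party` is a hypothetical helper.

```python
def primary_secondary_party(party_values):
    """Return (primary, secondary) party codes from a candidate's party strings."""
    seen = []
    for value in party_values:
        # A value may itself be a split designation such as "D-P"
        for code in value.split("-"):
            if code not in seen:
                seen.append(code)
    primary = seen[0] if seen else None
    secondary = seen[1] if len(seen) > 1 else None
    return primary, secondary


print(primary_secondary_party(["D-P", "D"]))  # William J. Bryan's distinct party values
print(primary_secondary_party(["R", "P"]))    # Theodore Roosevelt's distinct party values
print(primary_secondary_party(["D"]))
```

Like the notebook's approach, this treats the first party encountered as the primary one, so it depends on the rows being in chronological order.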


#### **Validate Grain is Correct: One Candidate per Row**

In [451]:
print(f"Q: Does each row correspond to a unique candidate name? A: {len(candidates_party_df) == len(candidates_party_df.drop_duplicates(subset='name'))}")

Q: Does each row correspond to a unique candidate name? A: True


#### **Validate Party Distributions are Correct**

In [452]:
candidates_party_df['party'].value_counts(dropna=False)

D    23
R    22
Name: party, dtype: int64

In [453]:
candidates_party_df['party_2'].value_counts(dropna=False)

NaN    43
P       2
Name: party_2, dtype: int64

#### **Construct Final Candidate Table**
Final Candidate table is created by left-joining the Candidates Party table to the Candidates State table
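
The join itself is not implemented yet. A minimal sketch of what it might look like is shown below, using three 1892 candidates from the outputs above as sample rows. The variable names are placeholders rather than the notebook's final ones, and `validate="one_to_one"` is an optional safeguard I've added to make pandas raise an error if either table has more than one row per candidate.

```python
import pandas as pd

# Sample rows standing in for candidates_df (built from Table 2)
candidates_state = pd.DataFrame({
    "name": ["Grover Cleveland", "Benjamin Harrison", "James B. Weaver"],
    "state": ["New York", "Indiana", "Iowa"],
})
# Sample rows standing in for candidates_party_df (built from Table 1, top two candidates only)
candidates_party = pd.DataFrame({
    "name": ["Grover Cleveland", "Benjamin Harrison"],
    "party": ["D", "R"],
})

# Left join keeps every Table 2 candidate, with a missing party where Table 1 has no entry
candidates_final = candidates_state.merge(
    candidates_party, on="name", how="left", validate="one_to_one"
)
print(candidates_final)
```

Because Table 1 only lists the President and the Main Opponent, candidates such as James B. Weaver come through the join without a party.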

### 3.4 Transform and Validate State Data

### 3.5 Transform and Validate Votes By State Data

In [104]:
t2_votes_df = pd.json_normalize(parsed_election_years, ['t2', 'votes_by_state'], ['year'])

In [105]:
t2_votes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1649 entries, 0 to 1648
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   state                  1649 non-null   object 
 1   total_electoral_votes  1649 non-null   int64  
 2   1                      1649 non-null   int64  
 3   2                      1649 non-null   int64  
 4   3                      604 non-null    float64
 5   4                      52 non-null     float64
 6   year                   1649 non-null   object 
dtypes: float64(2), int64(3), object(2)
memory usage: 90.3+ KB


In [106]:
t2_votes_df.head(10)

Unnamed: 0,state,total_electoral_votes,1,2,3,4,year
0,Alabama,11,11,0,0.0,,1892
1,Arkansas,8,8,0,0.0,,1892
2,California,9,8,1,0.0,,1892
3,Colorado,4,0,0,4.0,,1892
4,Connecticut,6,6,0,0.0,,1892
5,Delaware,3,3,0,0.0,,1892
6,Florida,4,4,0,0.0,,1892
7,Georgia,13,13,0,0.0,,1892
8,Idaho,3,0,0,3.0,,1892
9,Illinois,24,24,0,0.0,,1892


In [97]:
t2_votes_df = pd.melt(t2_votes_df, id_vars=['year', 'state', 'total_electoral_votes'], value_vars=[1, 2, 3, 4],
                      var_name='col_ind', value_name='president_electoral_votes')

In [98]:
t2_votes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6596 entries, 0 to 6595
Data columns (total 5 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   year                       6596 non-null   object 
 1   state                      6596 non-null   object 
 2   total_electoral_votes      6596 non-null   int64  
 3   col_ind                    6596 non-null   object 
 4   president_electoral_votes  3954 non-null   float64
dtypes: float64(1), int64(1), object(3)
memory usage: 257.8+ KB


In [103]:
t2_votes_df.head(10)

Unnamed: 0,year,state,total_electoral_votes,col_ind,president_electoral_votes
0,1892,Alabama,11,1,11.0
1,1892,Arkansas,8,1,8.0
2,1892,California,9,1,8.0
3,1892,Colorado,4,1,0.0
4,1892,Connecticut,6,1,6.0
5,1892,Delaware,3,1,3.0
6,1892,Florida,4,1,4.0
7,1892,Georgia,13,1,13.0
8,1892,Idaho,3,1,0.0
9,1892,Illinois,24,1,24.0
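
After melting, each row is identified by `year`, `state`, and `col_ind`, and `col_ind` matches the `col_ind` stored with each candidate in the Table 2 candidate data. The sketch below shows, in plain Python on the parsed structure for 1892, how that key could be used to attach candidate names to vote counts. Treating `col_ind` as the join key is my reading of the parsing code, and `votes_long` is a hypothetical helper.

```python
def votes_long(year, candidate_state, votes_by_state):
    """Return one row per (state, candidate) with that candidate's electoral votes."""
    name_by_col = {c["col_ind"]: c["president_candidate_name"] for c in candidate_state}
    rows = []
    for state_votes in votes_by_state:
        for col_ind, name in name_by_col.items():
            # Skip candidate columns that are absent from this row
            if col_ind in state_votes:
                rows.append({
                    "year": year,
                    "state": state_votes["state"],
                    "candidate": name,
                    "electoral_votes": state_votes[col_ind],
                })
    return rows


# Sample values copied from the 1892 outputs above
candidates_1892 = [
    {"president_candidate_name": "Grover Cleveland", "col_ind": 1},
    {"president_candidate_name": "Benjamin Harrison", "col_ind": 2},
    {"president_candidate_name": "James B. Weaver", "col_ind": 3},
]
votes_1892 = [
    {"state": "Alabama", "total_electoral_votes": 11, 1: 11, 2: 0, 3: 0},
    {"state": "Colorado", "total_electoral_votes": 4, 1: 0, 2: 0, 3: 4},
]
long_rows = votes_long(1892, candidates_1892, votes_1892)
for row in long_rows:
    print(row)
```

In pandas, the equivalent would be a merge of the melted votes with `t2_states_df` on `year` and `col_ind`, after dropping the rows where `president_electoral_votes` is missing.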


## 4. Write Election Data to Postgres Database

In [111]:
# column_names = ['year', 'state', 'president_candidate_name', 'president_candidate_party', 'president_candidate_state', \
#                 'president_electoral_votes', 'president_electoral_rank', 'president_popular_votes', 'president_popular_rank']
# data_df = pd.DataFrame(columns=column_names)
# data_df.info()
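
The write step is not implemented yet. To make the star schema from Section 3 concrete, the sketch below creates the two dimension tables and the fact table and loads one row into each. It uses an in-memory SQLite database from the standard library as a stand-in for Postgres so that it runs without a server; the table names, column names, and keys are all assumptions and will likely change in the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in; the real target is Postgres
conn.executescript("""
CREATE TABLE candidate (
    candidate_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL,
    party TEXT,
    state TEXT
);
CREATE TABLE state (
    state_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE votes_by_year (
    year INTEGER NOT NULL,
    state_id INTEGER NOT NULL REFERENCES state(state_id),
    candidate_id INTEGER NOT NULL REFERENCES candidate(candidate_id),
    electoral_votes INTEGER NOT NULL,
    PRIMARY KEY (year, state_id, candidate_id)
);
""")

# One sample row per table, using 1892 values from the outputs above
conn.execute("INSERT INTO candidate (name, party, state) VALUES (?, ?, ?)",
             ("Grover Cleveland", "D", "New York"))
conn.execute("INSERT INTO state (name) VALUES (?)", ("Alabama",))
conn.execute("""
INSERT INTO votes_by_year (year, state_id, candidate_id, electoral_votes)
SELECT 1892, s.state_id, c.candidate_id, 11
FROM state AS s, candidate AS c
WHERE s.name = 'Alabama' AND c.name = 'Grover Cleveland'
""")

# Join the fact table back to both dimensions
row = conn.execute("""
SELECT v.year, s.name, c.name, v.electoral_votes
FROM votes_by_year AS v
JOIN state AS s ON s.state_id = v.state_id
JOIN candidate AS c ON c.candidate_id = v.candidate_id
""").fetchone()
print(row)
```

The fact table's composite primary key enforces the intended grain of one row per year, state, and candidate.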