# U.S. Presidential Election: Data Cleaning and Feature Extraction

* **Author:** Brian P. Josey
* **Date Created:** 2021-01-07
* **Date Modified:** 2021-01-07
* **Language:** Python 3.8.3

In this notebook, I will generate a Python function that pulls data from multiple sources and combines them into a single dataframe and CSV file that I can draw from in later analyses. First, I will combine only the election results at the county level, and I will later incorporate demographic data from the U.S. Census Bureau and the Johns Hopkins COVID-19 tracking project.

For the first round, I want the following data for each election year for which I have results (2008, 2012, 2016, 2020):

* County name
* State
* FIPS Code
* Total number of votes cast
* Votes for DNC candidate
* Votes for GOP candidate
* Percent of votes for DNC candidate
* Percent of votes for GOP candidate
* Margin of votes (% GOP - % DNC)

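Altogether, this will generate a dataframe with 28 features: four identifying columns (county, state, state abbreviation, and FIPS code) plus six vote-based columns for each of the four elections.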
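
As a rough illustration of the derived features listed above, here is a minimal sketch on made-up numbers. The helper name `add_vote_features` and the toy counties are hypothetical and are not part of the notebook's data. The sketch assumes, as the cells below do, that the total is the two-party sum of Democratic and Republican votes.

```python
import pandas as pd

def add_vote_features(df, suffix):
    """Add two-party total, percentages, and margin for one election.

    Hypothetical helper: assumes columns votes_dem_<suffix> and
    votes_gop_<suffix> already exist in df.
    """
    out = df.copy()
    dem = out[f"votes_dem_{suffix}"]
    gop = out[f"votes_gop_{suffix}"]
    out[f"total_votes_{suffix}"] = dem + gop
    out[f"per_dem_{suffix}"] = dem / out[f"total_votes_{suffix}"] * 100
    out[f"per_gop_{suffix}"] = gop / out[f"total_votes_{suffix}"] * 100
    # Positive margin means the GOP candidate led; negative means the Democrat led
    out[f"margin_{suffix}"] = out[f"per_gop_{suffix}"] - out[f"per_dem_{suffix}"]
    return out

# Made-up vote counts for two hypothetical counties (not real results)
toy = pd.DataFrame({"fips": ["00001", "00002"],
                    "votes_dem_20": [600, 250],
                    "votes_gop_20": [400, 750]})
toy = add_vote_features(toy, "20")
print(toy[["fips", "total_votes_20", "per_dem_20", "per_gop_20", "margin_20"]])
```

With this sign convention, a county that splits 60/40 for the Democratic candidate has a margin of about -20, and one that splits 75/25 for the GOP candidate has a margin of about +50.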

In [1]:
# Import essential libraries
import numpy as np
import pandas as pd
from urllib.request import urlopen
import json

# Data visualization and plotting
import matplotlib.pyplot as plt
import plotly.express as px

# Filter warnings
import warnings
warnings.filterwarnings('ignore')

## Loading Election Results and Feature Engineering

My election data are saved across three different spreadsheets: one each containing the 2016 and 2020 results, and a third that combines the 2008 and 2012 results. In order to combine the data into a single dataframe, I will first create and update a dataframe for each election. Then I will join them together to create a single dataframe.
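
The "one dataframe per election, then join" plan can be sketched on toy data as follows. This is only an illustration under stated assumptions: the frames, their values, and the helper `pad_fips` are hypothetical, and the sketch uses an outer join so that no county is dropped, which differs from the right joins used later in this notebook.

```python
from functools import reduce
import pandas as pd

# Made-up per-election frames (values are illustrative, not real results)
frame_16 = pd.DataFrame({"fips": [1001, 51013], "margin_16": [49.5, -59.0]})
frame_20 = pd.DataFrame({"fips": [1001, 51013], "margin_20": [44.4, -63.6]})

def pad_fips(df):
    """Hypothetical helper: store FIPS as a five-character, zero-padded string."""
    out = df.copy()
    out["fips"] = out["fips"].astype(int).astype(str).str.zfill(5)
    return out

frames = [pad_fips(f) for f in (frame_16, frame_20)]

# Fold the list of frames into one, joining on the shared FIPS key each time
combined = reduce(
    lambda left, right: pd.merge(left, right, how="outer", on="fips"),
    frames,
)
print(combined)
```

Storing the FIPS code as a zero-padded string matters because codes such as 1001 would otherwise lose their leading zero and fail to match the county GeoJSON identifiers.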

In [2]:
# Election Results 2008
# Load raw data into a dataframe
results_2008 = pd.read_csv("data/2016-results.csv",
                          usecols=['State','ST','Fips','County','Democrats 08 (Votes)','Republicans 08 (Votes)'])
# Rename columns
results_2008 = results_2008.rename(columns={"State":"state","ST":"abrev","Fips":"fips","County":"county",
                                           "Democrats 08 (Votes)":"votes_dem_08",
                                           "Republicans 08 (Votes)":"votes_gop_08"})

# Left-pad FIPS codes with zeros to five characters
results_2008['fips']=results_2008['fips'].map("{:05}".format)

# Create total votes, percentages, and margin
results_2008['total_votes_08'] = results_2008['votes_dem_08']+results_2008['votes_gop_08']
results_2008['per_dem_08'] = results_2008['votes_dem_08']/results_2008['total_votes_08']*100
results_2008['per_gop_08'] = results_2008['votes_gop_08']/results_2008['total_votes_08']*100
results_2008['margin_08'] = results_2008['per_gop_08']-results_2008['per_dem_08']

# Replace NaNs with zeroes
results_2008=results_2008.fillna(0)

# Sanity check
#results_2008.head(10)

In [3]:
# Election Results 2012
# Load raw data into a dataframe
results_2012 = pd.read_csv("data/2016-results.csv",
                          usecols=['Fips','Democrats 12 (Votes)','Republicans 12 (Votes)'])

# Rename columns
results_2012 = results_2012.rename(columns={'Fips':'fips',"Democrats 12 (Votes)":"votes_dem_12",
                                           "Republicans 12 (Votes)":"votes_gop_12"})

# Left-pad FIPS codes with zeros to five characters
results_2012['fips']=results_2012['fips'].map("{:05}".format)

# Create total votes, percentages, and margin
results_2012['total_votes_12'] = results_2012['votes_dem_12']+results_2012['votes_gop_12']
results_2012['per_dem_12'] = results_2012['votes_dem_12']/results_2012['total_votes_12']*100
results_2012['per_gop_12'] = results_2012['votes_gop_12']/results_2012['total_votes_12']*100
results_2012['margin_12'] = results_2012['per_gop_12']-results_2012['per_dem_12']

# Replace NaNs with zeroes
results_2012=results_2012.fillna(0)

# Sanity check
#results_2012.head(10)

In [4]:
# Election Results 2016
# Load raw data into a dataframe
results_2016 = pd.read_csv("US County Level Election (tonmcg)/2016_US_County_Level_Presidential_Results.csv",
                          usecols=['votes_dem','votes_gop','combined_fips'])

# Rename columns
results_2016 = results_2016.rename(columns={'votes_dem':'votes_dem_16','votes_gop':'votes_gop_16',
                                           'combined_fips':'fips'})

# Left-pad FIPS codes with zeros to five characters
results_2016['fips']=results_2016['fips'].map("{:05}".format)

# Create total votes, percentages, and margin
results_2016['total_votes_16'] = results_2016['votes_dem_16']+results_2016['votes_gop_16']
results_2016['per_dem_16'] = results_2016['votes_dem_16']/results_2016['total_votes_16']*100
results_2016['per_gop_16'] = results_2016['votes_gop_16']/results_2016['total_votes_16']*100
results_2016['margin_16'] = results_2016['per_gop_16']-results_2016['per_dem_16']

# Replace NaNs with zeroes
results_2016=results_2016.fillna(0)

# Sanity check
results_2016.head(10)

In [None]:
# Election Results 2020
# Load raw data into a dataframe
results_2020 = pd.read_csv("US County Level Election (tonmcg)/2020_US_County_Level_Presidential_Results.csv",
                          usecols=['votes_dem','votes_gop','county_fips'])

# Rename columns
results_2020 = results_2020.rename(columns={'votes_dem':'votes_dem_20','votes_gop':'votes_gop_20',
                                           'county_fips':'fips'})

# Left-pad FIPS codes with zeros to five characters
results_2020['fips']=results_2020['fips'].map("{:05}".format)

# Create total votes, percentages, and margin
results_2020['total_votes_20'] = results_2020['votes_dem_20']+results_2020['votes_gop_20']
results_2020['per_dem_20'] = results_2020['votes_dem_20']/results_2020['total_votes_20']*100
results_2020['per_gop_20'] = results_2020['votes_gop_20']/results_2020['total_votes_20']*100
results_2020['margin_20'] = results_2020['per_gop_20']-results_2020['per_dem_20']

# Replace NaNs with zeroes
results_2020=results_2020.fillna(0)

# Sanity check
#results_2020.head(10)

In [None]:
# Load map of counties
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

# Plot the 2020 margin by county (blue = Democratic lead, red = GOP lead)
fig = px.choropleth(results_2020, geojson=counties, locations="fips", color="margin_20",
                    color_continuous_scale="bluered",
                    range_color=(-100, 100),
                    scope="usa",
                    labels={"margin_20": "Margin"},
                    )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
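# Join the per-election frames on FIPS code, one election at a time.
# Each right join keeps every row of the newer election's frame, so the
# final result contains exactly the counties present in the 2020 data.
merged_08_12 = pd.merge(results_2008, results_2012, how="right", on=["fips"])
merged_16 = pd.merge(merged_08_12, results_2016, how="right", on=["fips"])
merged = pd.merge(merged_16, results_2020, how="right", on=["fips"])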

merged



## Closer Look at a State

Analyses by others have led me to believe that county-level margins in the 2020 election are strongly correlated with the margins in the 2016 election. To test this hypothesis, I will draw a scatter plot for a state with a relatively large number of counties, which also happens to be my state of residence: Virginia.
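
To make "correlated" concrete, here is a minimal sketch of the Pearson correlation coefficient on made-up margins. The five toy counties are hypothetical and are not drawn from the election data. The sketch compares the result of `Series.corr`, which uses the Pearson method by default, against the coefficient computed from its definition.

```python
import numpy as np
import pandas as pd

# Made-up margins for five hypothetical counties (not real results)
toy = pd.DataFrame({"margin_16": [-40.0, -10.0, 5.0, 30.0, 55.0],
                    "margin_20": [-45.0, -14.0, 2.0, 31.0, 54.0]})

# pandas: Series.corr computes the Pearson coefficient by default
r_pandas = toy["margin_20"].corr(toy["margin_16"])

# Same quantity from its definition:
# sum of products of deviations, divided by the product of their norms
x = toy["margin_20"].to_numpy()
y = toy["margin_16"].to_numpy()
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(r_pandas, r_manual)
```

A value close to +1 would indicate that counties with large GOP margins in one election tend to have large GOP margins in the other; it says nothing by itself about whether margins shifted uniformly between the two years.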

In [None]:
virginia_results = merged[merged["state"] == "Virginia"]

virginia_results

In [None]:
plt.scatter(virginia_results["margin_20"], virginia_results["margin_16"],
            c=virginia_results["margin_20"], alpha=0.3, cmap='coolwarm')
plt.xlabel("Margin 2020")
plt.ylabel("Margin 2016")
plt.title("Virginia Results")
plt.colorbar()

In [None]:
virginia_results['margin_20'].corr(virginia_results['margin_16'])

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Fit a simple linear regression to overlay on the scatter plot
model = LinearRegression(fit_intercept=True)

X = virginia_results["margin_20"].values.reshape(-1,1)
Y = virginia_results["margin_16"].values.reshape(-1,1)

model.fit(X, Y)

# Evaluate the fitted line on an evenly spaced grid of 2020 margins
xfit = np.linspace(-80, 80, 1000)
yfit = model.predict(xfit[:,np.newaxis])

plt.scatter(X, Y, c=virginia_results["margin_20"], alpha=0.5, cmap='coolwarm')
plt.plot(xfit, yfit, color='black', alpha=0.3)
plt.xlabel("Margin 2020")
plt.ylabel("Margin 2016")
plt.title("Virginia Results")
plt.colorbar()

plt.show()