## 2020 US Presidential Elections

## 1. Introduction

The US presidential election has always attracted worldwide interest, not only because the US is considered a superpower but also because it has a significant bearing on international affairs. People around the globe tend to view America through the figure of its President. The 2020 election in particular has drawn heightened attention, as it matters not only for the United States but will also shape the international landscape. 

This election is between the **<span style='color:Red'> incumbent President Donald Trump </span>** and the **<span style='color:Blue'>former Vice President Joe Biden </span>** as the major candidates. 

Our project aims to forecast the 2020 election results using survey data and by simulating 1000 paths for the election.


## 2. Motivation

This work aims to collect polling data from various sources and simulate the election results to find the likely winner for the 2020 US election. The motivation for this project is to predict -

- The final electoral score
- Democratic candidate Biden's chance of winning the election
- Battleground states

## 3. Methodology
### 3.1 <u>Data Collection, Cleaning and Preparation of DataFrame</u>
#### 3.1.1 Preparing the Dataframe
The first step was to create a dataframe that included the following information about the 2020 election:

* <b> State </b>
* <b> Poll date: </b>  Date the poll was conducted.
* <b> Sample Size: </b> Number of respondents in the poll. In the source, this number is followed by a tag for either LV (likely voters) or RV (registered voters). For the purposes of this project, this distinction has been ignored.
* <b> Pollster Name:</b> The name of the polling organization, that is, the organization that contributed the most intellectual property to the methodology and execution of the poll.
* <b> Pollster Grade: </b> A letter grade from A+ to F that reflects a pollster's Predictive Plus-Minus score. An “A/B” provisional rating means that the pollster has shown strong initial results, a “B/C” rating means it has average initial results, and a “C/D” rating means below-average initial results.
* <b> Pollster Bias: </b> A pollster's historical average statistical bias toward Democratic or Republican candidates, reverted to a mean of zero based on the number of polls in the database.
* <b> Reported Proportion for Biden: </b> Reported proportion for Biden as per the polling organization in each state
* <b> Reported Proportion for Trump: </b> Reported proportion for Trump as per the polling organization in each state

The data for this project was collected from the following sources from January 2020 to October 2020: 

- <a href='https://www.270towin.com/2020-polls-biden-trump/' target='_blank'>270towin.com</a>
- <a href='https://projects.fivethirtyeight.com/pollster-ratings/' target='_blank'>fivethirtyeight.com</a>

The 270towin website displays the Polling Average for each state. The FiveThirtyEight website provides information about the pollster ratings. 

The relevant information from these sites was used in the analysis that follows.

#### <u>Points to note:</u>
a. When the reported proportions for Biden and Trump do not sum to 100%, the remainder was split equally between the two candidates so that the total is 100%. The columns 'Reported Proportion for Biden' and 'Reported Proportion for Trump' hold these adjusted values. <br>
b. Two new columns were created:
* Estimated proportion of the vote that Biden wins, $\hat{p}$
* Standard deviation $\sigma$ of $\hat{p}$

c. Because pollster names differ between the two sources, '270towin' and 'fivethirtyeight', the Bias and Grade details for some pollsters were missing after merging the data. Such polls were assigned a Grade of '0' and a Bias of '0'. <br>
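
As a worked illustration of the adjustment and the two derived columns, the sketch below splits the unallocated share evenly and then computes $\hat{p}$ together with its binomial standard deviation $\sigma = \sqrt{\hat{p}(1-\hat{p})/n}$, where $n$ is the sample size. The function name `adjust_and_estimate` and the raw 38/58 split are hypothetical and chosen only so that the adjusted values come out at 40/60 with a sample size of 1072.

```python
import math

def adjust_and_estimate(biden_pct, trump_pct, sample_size):
    """Split the unallocated share evenly, then return (biden, trump, p_hat, sigma)."""
    other = 100.0 - biden_pct - trump_pct      # share reported for neither candidate
    biden = biden_pct + other / 2              # adjusted percentages now sum to 100
    trump = trump_pct + other / 2
    p_hat = biden / 100                        # estimated proportion for Biden
    sigma = math.sqrt(p_hat * (1 - p_hat) / sample_size)  # binomial standard deviation
    return biden, trump, p_hat, sigma

# Hypothetical raw poll: 38% Biden, 58% Trump, 4% other, 1072 respondents
biden, trump, p_hat, sigma = adjust_and_estimate(38.0, 58.0, 1072)
print(biden, trump, p_hat, round(sigma, 6))
```

The standard deviation shrinks as the sample size grows, so larger polls contribute a tighter estimate of $\hat{p}$.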

A snapshot of the merged data is shown below.

In [6]:
import pandas as pd
import glob
import numpy as np
import geopandas as gpd
# from pollster_name_changes import pollster_changes
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib.lines import Line2D
from scipy import stats
import matplotlib.patches as patches
import matplotlib.collections as coll
import matplotlib.image as mpimg

pollster_changes = {'CNBC/Change Research' : 'Change Research',
                    'Public Policy' : 'Public Policy Polling',
                    'Axios / SurveyMonkey' : 'SurveyMonkey',
                    'NY Times / Siena College' : 'Siena College/The New York Times Upshot',
                    'Quinnipiac' : 'Quinnipiac University',
                    'YouGov/CBS News' : 'YouGov',
                    'Ipsos/Reuters' : 'Ipsos',
                    'Fox News' : 'Fox News/Beacon Research/Shaw & Co. Research',
                    'Baldwin Wallace Univ.' : 'Baldwin Wallace University',
                    'NBC News/Marist' : 'Marist College',
                    'Mason-Dixon' : 'Mason-Dixon Polling & Strategy',
                    'CNN/SSRS' : 'CNN/Opinion Research Corp.',
                    'Marquette Law' : 'Marquette University Law School',
                    'East Carolina Univ.' : 'East Carolina University',
                    'Benenson / GS Strategy' : 'Benenson Strategy Group',
                    'Florida Atlantic Univ.' : 'Florida Atlantic University',
                    'Rasmussen Reports' : 'Rasmussen Reports/Pulse Opinion Research',
                    'UT Tyler' : 'University of Texas at Tyler',
                    'VCU' : 'Virginia Commonwealth University',
                    'Landmark Comm.' : 'Landmark Communications',
                    'UMass Lowell' : 'University of Massachusetts Lowell',
                    'Remington Research' : 'Remington Research Group',
                    'Susquehanna' : 'Susquehanna Polling & Research Inc.',
                    'WPA Intelligence' : 'WPA Intelligence (WPAi)',
                    '1892' : '1892 Polling',
                    'ABC News / Wash. Post' : 'ABC News/The Washington Post',
                    'Christopher Newport Univ.' : 'Christopher Newport University',
                    'East Tennessee State' : 'East Tennessee State University',
                    'Emer' : 'Emerson College',
                    'Fabrizio Lee' : 'Fabrizio, Lee & Associates',
                    'Fairleigh Dickinson' : 'Fairleigh Dickinson University (PublicMind)',
                    'Franklin & Marshall' : 'Franklin & Marshall College',
                    'GQR Research' : 'GQR Research (GQRR)',
                    'Garin-Hart-Yang' : 'Garin-Hart-Yang Research Group',
                    'Gonzales Research' : 'Gonzales Research & Marketing Strategies Inc.',
                    'HighGround' : 'HighGround Inc.',
                    'InsiderAdvantage' : 'Opinion Savvy/InsiderAdvantage',
                    'MRG' : 'MRG Research',
                    'MassINC' : 'MassINC Polling Group',
                    'Meeting Street Insights' : 'Meeting Street Research',
                    'Mitchell Research' : 'Mitchell Research & Communications',
                    'Montana State U.' : 'Montana State University Billings',
                    'Morning Call / Muhlenberg' : 'Muhlenberg College',
                    'PPIC' : 'Public Policy Institute of California',
                    'Research America' : 'Research America Inc.',
                    'Rutgers-Eagleton' : 'Rutgers University',
                    'SLU/YouGov' : 'Saint Leo University',
                    'Selzer & Company' : 'Selzer & Co.',
                    'Sooner Poll' : 'SoonerPoll.com',
                    'Sooner Survey' : 'SoonerPoll.com',
                    'St. Leo University' : 'Saint Leo University',
                    'THPF/Rice Univ.' : 'Rice University',
                    'TargetSmart' : 'TargetSmart/William & Mary',
                    'UC Berkeley' : 'University of California, Berkeley',
                    'UMass Amherst/WCVB' : 'University of Massachusetts Amherst',
                    'USC' : 'USC Dornsife/Los Angeles Times',
                    'Univ. of Colorado' : 'University of Colorado',
                    'Univ. of New Hampshire' : 'University of New Hampshire',
                    'Univ. of Georgia' : 'University of Georgia',
                    'Univ. of North Florida' : 'University of North Florida',
                    'Univ. of Texas / Texas Tribune' : 'University of Texas at Tyler',
                    'Univ. of Wisconsin-Madison' : 'University of Wisconsin (Badger Poll)',
                    'Univision/ASU' : 'Arizona State University',
                    'Univision/CMAS UH' : 'Univision/University of Houston/Latino Decisions',
                    'Yahoo/YouGov': 'YouGov',
                    }

# Change this to the path where you place the entire zip
path = "D:/Veena/Bentley/Fall20/MA 705/Project/Final/Final/Dataset/"

# Pollster variables
pollster_csv = pd.read_csv(path + 'Input/pollster-ratings.csv', header = 0)
pollster_df = pollster_csv[['Pollster', '538 Grade', 'Bias']]

# Electoral votes
elector = pd.DataFrame()
elector = pd.read_csv(path + "Input/Electoral.txt", sep='\t',engine='python')

# Master dataframe
df = pd.DataFrame()

for filename in glob.iglob(path + 'Input/StatePolls/*.txt', recursive = True): 
   
    # Getting name of State
    name=filename.split('/')[-1]    
    name=name.split('\\')[-1]
    state = name[:-4]
    df1 = pd.read_csv(filename, sep='\t',engine='python')
    # Adding column for state
    df1['State']=state
    # Removing %
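    # regex=False treats '%' as a literal character in every pandas version
    df1.Other = df1.Other.str.replace('%', '', regex=False)
    df1.Trump = df1.Trump.str.replace('%', '', regex=False)
    df1.Biden = df1.Biden.str.replace('%', '', regex=False)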
    # Converting to type float
    df1["Other"] = df1["Other"].astype(float)
    df1["Biden"] = df1["Biden"].astype(float)
    df1["Trump"] = df1["Trump"].astype(float)
    # Distributing the Other equally between Biden and Trump
    df1["Biden"] = df1["Biden"]+(df1["Other"]/2)
    df1["Trump"] = df1["Trump"]+(df1["Other"]/2)
    # Getting sample size and converting to float
    df1.Sample = df1.Sample.str.split()
    df1.Sample = df1.Sample.str[0]
    df1.Sample = df1.Sample.str.replace(',', '', regex=False)
    df1["Sample"] = df1["Sample"].astype(float)
    # Estimated p hat Biden wins
    df1['p'] = (df1["Biden"]/100)
    # Estimated Std Deviation
    df1['s'] = np.sqrt(((df1['p']*(1-df1['p']))/(df1['Sample'])))
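    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, df1], ignore_index=True)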
    
# The steps below run once, after all state files have been read

# Keep only polls conducted in 2020 (dates are formatted M/DD/YYYY)
year = df.Date.str.split('/').str[2].astype(int)
df = df[year == 2020].copy()

grade = []
bias = []

# Merge with Pollster data
# Note: only sources listed in pollster_changes are looked up;
# every other source is assigned a Grade and Bias of "0"
for index, row in df.iterrows():
    # look up the FiveThirtyEight name for this poll source
    poll_name = pollster_changes.get(row['Source'])
    # locate the matching row in the pollster data
    poll_info = pollster_df.loc[pollster_df.Pollster == poll_name, :]

    if poll_name and not poll_info.empty:
        # add Grade and Bias for this poll
        grade.append(poll_info['538 Grade'].iloc[0])
        bias.append(poll_info['Bias'].iloc[0])
    else:
        grade.append("0")
        bias.append("0")

df['Grade'] = grade
df['Bias'] = bias
    
    
df = df[['State','Source','Date','Sample','Biden','Trump','p','s', 'Grade', 'Bias']]
df.columns = ['State','Poll Source','Poll Date','Sample Size','Reported Proportion for Biden','Reported Proportion for Trump', 'Estimated p hat Biden wins','Estimated std dev', 'Grade', 'Bias']

df.to_csv (path + 'Output/Merged_dataframe.csv', index = False, header=True)
df.head()

| | State | Poll Source | Poll Date | Sample Size | Reported Proportion for Biden | Reported Proportion for Trump | Estimated p hat Biden wins | Estimated std dev | Grade | Bias |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Auburn U. Montgomery | 10/05/2020 | 1072.0 | 40.0 | 60.0 | 0.4 | 0.014963 | 0 | 0 |
| 1 | Alabama | Axios / SurveyMonkey | 10/01/2020 | 1354.0 | 40.0 | 60.0 | 0.4 | 0.013314 | D- | D +5.6 |
| 2 | Alabama | Morning Consult | 8/04/2020 | 609.0 | 39.0 | 61.0 | 0.39 | 0.019765 | 0 | 0 |
| 3 | Alabama | AUM Poll | 7/10/2020 | 567.0 | 43.0 | 57.0 | 0.43 | 0.020791 | 0 | 0 |
| 4 | Alabama | Mason-Dixon | 2/11/2020 | 625.0 | 40.0 | 60.0 | 0.4 | 0.019596 | B+ | R +0.7 |


#### 3.1.2 <u>Handling Data from Multiple Poll Sources and Handling Bias</u>
a. With data from multiple poll sources, and Grade information missing for most of them, we averaged the reported proportion for Biden and the estimated standard deviation over all polls in each state.

b. The average reported proportion was then corrected for pollster bias. In the code below, the state's average Democratic bias is subtracted from it.
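
One way to turn bias strings such as `D +5.6` or `R +0.7` into numbers is sketched below. The helper names `parse_bias` and `correct_for_bias` and the sign convention (positive means a lean toward Democrats) are assumptions made for illustration. This is a sketch of one possible signed correction, not necessarily the exact computation behind the results in this report.

```python
def parse_bias(bias):
    """Return a signed bias in percentage points (assumed convention):
    positive leans Democratic, negative leans Republican,
    0.0 for the '0' placeholder given to unrated pollsters."""
    text = str(bias).strip()
    if "+" not in text:
        return 0.0
    party, points = text.split("+", 1)
    party = party.strip()
    value = float(points)
    if party == "D":
        return value
    if party == "R":
        return -value
    raise ValueError(f"Unrecognized bias string: {bias!r}")

def correct_for_bias(reported_biden_pct, bias):
    """Remove the pollster's historical lean from Biden's reported percentage."""
    return reported_biden_pct - parse_bias(bias)

print(correct_for_bias(40.0, "D +5.6"))   # pollster leans Democratic: Biden adjusted down
print(correct_for_bias(40.0, "R +0.7"))   # pollster leans Republican: Biden adjusted up
print(correct_for_bias(40.0, "0"))        # unrated pollster: no adjustment
```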

#### 3.1.3 <u>Merge the Electoral votes for each State into our Dataframe</u>
We obtained the 2020 electoral votes for each state from <a href='https://state.1keydata.com/state-electoral-votes.php' target='_blank'>this</a> link.

The electoral votes for each state were merged into our dataframe, which holds the average reported proportion for Biden and the average standard deviation for every state.

Here is a snapshot of the merged dataframe.

In [7]:
# Extracting rows without blank values
data = pd.read_csv(path + "Output/Merged_dataframe.csv")
# .copy() so that the new columns added below do not trigger SettingWithCopyWarning
new_df = data.dropna(axis = 'index', how = 'any').copy()

# Splitting the Bias column using the '+' separator
bias_eliminate = new_df.Bias.str.split('+')

# Segregating data as democrat or republican based on 'D' or 'R' which is included in the Bias column
bias_d = []
for bias in bias_eliminate.array:
    if(bias[0].strip() == 'D'):
        democrat_bias = bias[1]
    else:
        democrat_bias = 0
    bias_d.append(democrat_bias)
new_df['Democrat Bias'] = bias_d

bias_r = []
for bias in bias_eliminate.array:
    if(bias[0].strip() == 'R'):
        republican_bias = bias[1]
    else:
        republican_bias = 0
    bias_r.append(republican_bias)
new_df['Republican Bias'] = bias_r
        
# Converting Democrat Bias to float
# (Republican Bias is recorded above but is not used in the correction below)
new_df['Democrat Bias'] = new_df['Democrat Bias'].astype(float)

# Calculating the mean proportion for Biden grouped by State
avg = new_df.groupby('State')['Reported Proportion for Biden'].mean()
avg_dem_bias = new_df.groupby('State')['Democrat Bias'].mean()

avg_biden = avg - avg_dem_bias

avg_sd = new_df.groupby('State')['Estimated std dev'].mean()

df2 = pd.DataFrame(data = avg_sd)
df2 = df2.reset_index()
df2.columns=['State','Avg Std Dev']

df2.State = df2.State.str.lower()
df2.State = df2.State.str.title()
df3 = pd.DataFrame(data = avg_biden)
df3 = df3.reset_index()
df3.columns=['State','Avg_Proportion']

df3.State = df3.State.str.lower()
df3.State = df3.State.str.title()
total = pd.merge(df2, df3, on= 'State')

total.columns=['State','Avg Std Dev','Avg_Proportion']
total=total.set_index('State')

#  Sorting by state alphabetically 
total = total.sort_values(by=['State'])
total = total.reset_index()

# total.to_csv(path + "Output/Final_frame.csv", index = False)

# Objective here is to compute the final electoral score for Biden and Trump
# Get the electoral votes and avg polls for Biden for each state in one dataframe; if avg > 50.5, electoral votes go to Democrats

# Get number of rows from elector dataframe
rows=elector.shape[0]
# Convert from 4 columns to 2 columns
elector=elector.values.reshape(rows*2,2)
# Converting to dataframe
elector = pd.DataFrame(data = elector)
# Drop rows with NaN
elector = elector.dropna(axis=0, how='any')
# Naming columns
elector.columns = ['State','Electoral Votes']

# Removing spaces between names of states for ex: New Hampshire to Newhampshire
elector.State = elector.State.str.replace(' ', '')
elector.State = elector.State.str.lower()
elector.State = elector.State.str.title()
elector= elector.sort_values(by=['State'])

total.State = total.State.str.replace(' ', '')
total.State = total.State.str.lower()
total.State = total.State.str.title()
total = total.sort_values(by=['State'])

# Merging elector df with total
elec_avg_df = pd.merge(total, elector, on= 'State')
elec_avg_df.to_csv(path + 'Output/Final_dataframe.csv', index = False, header=True)

elec_avg_df.head()

| | State | Avg Std Dev | Avg_Proportion | Electoral Votes |
|---|---|---|---|---|
| 0 | Alabama | 0.018382 | 39.316667 | 9 |
| 1 | Alaska | 0.019314 | 46.6 | 3 |
| 2 | Arizona | 0.019471 | 51.38871 | 11 |
| 3 | Arkansas | 0.015753 | 38.766667 | 6 |
| 4 | California | 0.013226 | 64.386667 | 55 |


### 3.2 <u>Predicting the Winner for each State</u>

* We assumed winner-take-all for all states.


* For each state, we decided the electoral votes would go to Biden if the Average Reported Proportion for Biden is greater than 50.5%.


* Since the poll data for Washington DC was unavailable, we assumed DC to be safely Democratic based on previous election data.


* Using all the above information, the final electoral scores for Biden and Trump were calculated. _Refer to 4.1 in Results section below_


* Depending on the range of the average reported proportion for Biden, states were classified as Safe Republican, Lean Republican, Battleground, Lean Democrat, or Safe Democrat. _Refer to 4.2 in Results section below_
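
The winner-take-all rule with the 50.5% cutoff can be sketched as follows. The function name `electoral_score` is hypothetical, and the input is limited to the five states shown in the Section 3.1.3 snapshot, so the totals are illustrative rather than the full electoral score (DC's votes are not included here).

```python
import pandas as pd

# Average Biden share (%) and electoral votes for five states (Section 3.1.3 snapshot)
states = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "Avg_Proportion": [39.316667, 46.6, 51.38871, 38.766667, 64.386667],
    "Electoral Votes": [9, 3, 11, 6, 55],
})

def electoral_score(df, threshold=50.5):
    """Winner-take-all: a state's votes go to Biden when his average exceeds threshold."""
    biden_wins = df["Avg_Proportion"] > threshold
    biden = int(df.loc[biden_wins, "Electoral Votes"].sum())
    trump = int(df.loc[~biden_wins, "Electoral Votes"].sum())
    return biden, trump

biden_votes, trump_votes = electoral_score(states)
print(biden_votes, trump_votes)
```

In this subset only Arizona and California clear the cutoff, so Biden receives 11 + 55 votes and Trump receives the remaining 9 + 3 + 6.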


### 3.3 <u>Simulation</u> 
This project simulates the election 1000 times for each of the 50 states. The simulation assumes that the reported proportion for Biden in each state follows a Normal distribution, with *mean* equal to the state's average reported proportion and *standard deviation* equal to the average standard deviation across all of that state's polls from January 2020 to October 2020 (Section 3.1.3).

**Key Steps in the simulation**

* For a trial in which the simulated proportion in a state was greater than 50.5%, Biden was considered the winner of that state (See table below).
* Electoral votes for states in which Biden won were summed to get the total electoral votes for Biden in each trial. If this sum exceeded 270, Biden was declared the winner of the trial.
* The probability that Biden wins the election was calculated as the fraction of the 1000 runs that Biden won.
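
The steps above can also be written in vectorized form, as in the minimal sketch below. The function name `simulate`, the fixed seed, and the `votes_to_win` parameter are assumptions for illustration. The inputs are the five states from the Section 3.1.3 snapshot, so an illustrative cutoff of 60 votes stands in for 270.

```python
import numpy as np

def simulate(mean, sd, votes, n_sims=1000, threshold=0.505, votes_to_win=270, seed=0):
    """Draw n_sims Normal proportions per state and tally Biden's electoral votes."""
    rng = np.random.default_rng(seed)
    mean, sd, votes = map(np.asarray, (mean, sd, votes))
    # One row per state, one column per trial
    sims = rng.normal(loc=mean[:, None], scale=sd[:, None], size=(mean.size, n_sims))
    biden_wins_state = sims >= threshold
    # Sum the electoral votes of the states Biden wins in each trial
    biden_votes = (biden_wins_state * votes[:, None]).sum(axis=0)
    # Fraction of trials in which Biden's total exceeds the cutoff
    p_biden = float(np.mean(biden_votes > votes_to_win))
    return biden_votes, p_biden

# Alabama, Alaska, Arizona, Arkansas, California
mean = np.array([0.39316667, 0.466, 0.5138871, 0.38766667, 0.64386667])
sd = np.array([0.018382, 0.019314, 0.019471, 0.015753, 0.013226])
votes = np.array([9, 3, 11, 6, 55])

# With only five states 270 is unreachable, so a cutoff of 60 is used for illustration
biden_votes, p_biden = simulate(mean, sd, votes, votes_to_win=60)
print(biden_votes[:10], p_biden)
```

Drawing all trials in one call avoids a Python loop over trials; the per-trial logic is otherwise the same as in the steps listed above.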

In [8]:
def simulate_election(proportion, distribution, n):
    """
    Simulate n runs of the election.
    Returns 3 dataframes (proportions, winner, votes), each with
    index-State and one column per trial.
    """
    proportions = {}
    winner = {}
    votes = {}

    for k in range(n):
        # one simulated Biden proportion per state
        sim_prop = distribution.rvs()
        # 1 indicates Biden is the winner
        # 0 indicates Trump is the winner
        sim_winner = np.where(sim_prop >= 0.505, 1, 0)
        # electoral college votes that Biden wins in each state
        sim_votes = np.where(sim_winner == 1, proportion.Votes, 0)

        trial_str = "Trial" + str(k+1)
        proportions[trial_str] = sim_prop
        winner[trial_str] = sim_winner
        votes[trial_str] = sim_votes

    # build each dataframe once instead of adding 1000 columns one at a time
    index = proportion.index
    return (pd.DataFrame(proportions, index = index),
            pd.DataFrame(winner, index = index),
            pd.DataFrame(votes, index = index))

proportion = pd.read_csv(path + "Output/Final_dataframe.csv")
proportion.columns = ['State','StdDev','Proportion', 'Votes']
proportion = proportion.set_index('State')

# divide proportion by 100 to convert from a percentage to a fraction
distribution = stats.norm(proportion.Proportion/100, proportion.StdDev)

# 1000 simulations
num_sims = 1000

# get proportions, winner and votes from function
proportions, winner, votes = simulate_election(proportion, distribution, num_sims)
total_votes = votes.sum()

# count across the row (axis=1) for all 1s - 1 indicates Biden
# total_electoral_votes = num_sims
state_proportion_sum_Biden = winner.sum(axis = 1)
state_proportion_sum_Trump = num_sims - state_proportion_sum_Biden

df = pd.DataFrame(state_proportion_sum_Biden)
df['Proportion_Trump'] = state_proportion_sum_Trump
df = df.reset_index()
df.columns = ['State', 'Proportion_Biden', 'Proportion_Trump']
df = df.set_index('State')

# States in which Biden wins between 300 & 700, out of 1000 runs, are Battleground states
battleground_states = df.loc[(df['Proportion_Biden'] >= num_sims/10 *3) & (df['Proportion_Biden'] <= num_sims/10 * 7)]

proportions.head()

| State | Trial1 | Trial2 | Trial3 | Trial4 | Trial5 | Trial6 | Trial7 | Trial8 | Trial9 | Trial10 | ... | Trial991 | Trial992 | Trial993 | Trial994 | Trial995 | Trial996 | Trial997 | Trial998 | Trial999 | Trial1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | 0.44887 | 0.394103 | 0.378148 | 0.408746 | 0.416078 | 0.397757 | 0.399514 | 0.397676 | 0.406673 | 0.386808 | ... | 0.445586 | 0.418075 | 0.35928 | 0.396892 | 0.397432 | 0.399132 | 0.388761 | 0.389021 | 0.371577 | 0.396951 |
| Alaska | 0.473512 | 0.484316 | 0.468258 | 0.448705 | 0.477588 | 0.438578 | 0.453591 | 0.48541 | 0.458287 | 0.479514 | ... | 0.46318 | 0.467061 | 0.453931 | 0.445685 | 0.47158 | 0.471335 | 0.456622 | 0.464305 | 0.485847 | 0.480236 |
| Arizona | 0.525178 | 0.522557 | 0.51121 | 0.507087 | 0.520034 | 0.491799 | 0.536095 | 0.53853 | 0.490075 | 0.514036 | ... | 0.494326 | 0.51178 | 0.530752 | 0.507321 | 0.542621 | 0.515825 | 0.518812 | 0.519995 | 0.520891 | 0.508788 |
| Arkansas | 0.351482 | 0.383844 | 0.375533 | 0.397241 | 0.400598 | 0.400377 | 0.381492 | 0.361296 | 0.385483 | 0.375744 | ... | 0.433404 | 0.381605 | 0.37891 | 0.410995 | 0.375911 | 0.408873 | 0.414496 | 0.369763 | 0.404249 | 0.406411 |
| California | 0.635785 | 0.651694 | 0.638381 | 0.630663 | 0.630013 | 0.659414 | 0.63272 | 0.645059 | 0.647932 | 0.651155 | ... | 0.649425 | 0.638502 | 0.652907 | 0.649726 | 0.644727 | 0.644935 | 0.665973 | 0.636811 | 0.627856 | 0.625843 |


## 4. Results
### 4.1 <u>Final Electoral Score</u>
Based on the methodology described in Section 3.2, the final electoral college votes are: <br>
* Biden - 319

* Trump - 219

![alt text](score.png "Final electoral score")

### 4.2 <u>US Map showing likely winner for each state</u>

The following US map depicts the likely winner for each state and the battleground states. The map is based on the polling averages (Section 3.2), not on the simulated results.

![alt text](USMap.png "US Map")

### 4.3 <u>Probability Biden wins</u>

Biden is the winner of every trial in which he won more than 270 electoral votes. The visualization below has 1000 cells, each representing one simulation of the election, with red cells indicating Trump as the winner and blue cells indicating Biden as the winner.


In [10]:
print("In this simulation, Biden wins " + str(np.sum(total_votes > 270)) + " out of " + str(num_sims) + " runs.")

In this simulation, Biden wins 929 out of 1000 runs.


![alt text](simulation.png "Simulations")

In the figure below, each bar at 'x' on the x-axis shows the number of trials in which Biden won 'x' electoral votes.

![alt text](bar.png "Bar chart")

### 4.4 <u>Battleground states based on simulated results</u>
Simulation results were used to determine "Battleground" states. A state in which Biden won between 30% and 70% of the 1000 simulations was labelled a *Battleground state*. The battleground states are listed below -

In [11]:
for state in battleground_states.index:
    print(state, end =" ")

Arizona Florida Northcarolina 

## 5. Conclusion

Based on our analysis and simulation results, our final pre-election forecast is that the Democrats have the upper hand in this election and that Joe Biden will most likely be the 46th President of the United States of America.

### _References_

* https://projects.fivethirtyeight.com/pollster-ratings/
* https://www.270towin.com/2020-polls-biden-trump/
* https://theconversation.com/as-the-world-watches-us-election-the-appeal-of-america-is-diminished-148495
* https://state.1keydata.com/state-electoral-votes.php
* https://projects.economist.com/us-2020-forecast/president
