# Forecasting The 2020 US Presidential Election With Prophet

With about a month to go before the 2020 United States Presidential Election on November 3, all eyes are on the barrage of polls and forecasts for the highly volatile race for the White House. Though a wide range of datasets and even model source code is available from data and media outlets tracking the US election, it is not practical, in my view, for most data practitioners to build their own model from scratch, given the domain expertise required.

For one, the White House race is not decided by the popular vote but rather by the electoral college, a unique system where the successful candidate has to stitch together a winning coalition from various states that would give him at least 270 "electoral votes" out of a possible total of 538.

Second, not all states are equally influential in the election outcome. The results often come down to how Americans vote in a few critical "battleground states" like Florida, Pennsylvania, Wisconsin and Ohio. The shifting voter sentiment in many of these states poses a huge challenge even for US-based experts, let alone those outside the country.

A more practical alternative, in my view, is to leverage the forecasts by reputable outlets and apply a separate layer of data analysis on top of them.

The notebooks in this repo detail the data extraction process for this approach, and the subsequent time series analysis using (FB) Prophet. I hope to introduce alternative approaches for time series analysis in future notebooks, such as XGBoost.

## MEDIUM POST

Background and related links [here](https://medium.com/@chinhonchua/forecasting-the-2020-us-presidential-election-with-fb-prophet-36ab84f1a75a)

# 1. DATA EXTRACTION AND PROCESSING

US data and media outlets have been publishing a range of election forecasts, from Trump's and Biden's chances of winning the election and their respective shares of the national vote to the number of Electoral Votes (EV) each candidate might win.

This post focuses only on the EV forecasts, for two reasons. First, the EV count is the only thing that matters on Nov 3. If Biden wins the popular vote but fails to win at least 270 EVs, as happened to Hillary Clinton in the 2016 election, then Trump wins re-election even though more Americans voted for his challenger.
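The 270-vote rule can be written down in a few lines. This is only a minimal sketch: the names `EV_TO_WIN` and `projected_winner` are my own, and the inputs are 538's Oct 1 mean projections from the toplines file used later in this notebook.

```python
TOTAL_EV = 538                  # total electoral votes available
EV_TO_WIN = TOTAL_EV // 2 + 1   # simple majority of the electoral college


def projected_winner(ev_by_candidate):
    """Return the candidate at or above the majority threshold, else None."""
    for candidate, ev in ev_by_candidate.items():
        if ev >= EV_TO_WIN:
            return candidate
    return None  # nobody reaches a majority, e.g. a 269-269 split


print(EV_TO_WIN)
print(projected_winner({"Trump": 202.5146, "Biden": 335.4854}))
print(projected_winner({"Trump": 269, "Biden": 269}))
```

The popular vote never enters the function, which is the point of the paragraph above.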

Second, the EV forecasts by FiveThirtyEight and The Economist already take into account their respective assessments of the outcomes in key battleground states, albeit in aggregate. In contrast, forecasts of Trump's and Biden's national vote share are not conclusive about which candidate wins or how he performs in the key battleground states.

Both FiveThirtyEight and The Economist have been publishing their respective predicted EV counts for several months, and their forecasts can be downloaded [here](https://data.fivethirtyeight.com/) and [here](https://cdn.economistdatateam.com/us-2020-forecast/data/president/economist_model_output.zip). Both outlets release daily updates of their polling data and model output files. Here are the names of the original CSV files:

* From 538: presidential_national_toplines_2020.csv

* From The Economist: electoral_college_votes_over_time.csv

I've renamed both files slightly below, prefixing each with the outlet's name, to remove ambiguity.

The Economist's forecasts go back to March 1, while the earliest 538 forecast for EV count is June 1. For consistency, I'll set the common baseline at June 1 for both sets of forecasts. I've set the current cut-off date at Oct 1, but will run updated analysis closer to polling day on Nov 3.
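As a rough illustration of the common window, here is one way to keep only rows between the June 1 baseline and the Oct 1 cut-off. The `toy` frame and its values are made up for the example; only the two dates come from the text above.

```python
import pandas as pd

START = pd.Timestamp("2020-06-01")   # common baseline for both outlets
END = pd.Timestamp("2020-10-01")     # current cut-off date

# Hypothetical stand-in for a forecast file; values are illustrative only
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-03-01", "2020-06-01", "2020-10-01", "2020-10-02"]
        ),
        "median_ev": [253.0, 210.0, 199.0, 198.0],
    }
)

# Series.between is inclusive on both ends by default
in_window = toy[toy["date"].between(START, END)]
print(in_window)
```

Applying the same two bounds to both files would keep the cut-off fixed even when a freshly downloaded file contains later dates.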

In [1]:
import numpy as np
import pandas as pd
import warnings

pd.set_option('display.max_columns', 40)
warnings.filterwarnings('ignore')

In [2]:
raw_538 = pd.read_csv("../data/538_presidential_national_toplines_2020.csv")

raw_economist = pd.read_csv("../data/economist_electoral_college_votes_over_time.csv")

In [3]:
raw_538.shape, raw_economist.shape

((123, 40), (430, 7))

# 1.1 EXTRACT 538'S EV PROJECTIONS FROM JUNE 1 - OCT 1

538's CSV file of topline forecasts is probably the most useful I've seen. If you are keen to run time series projections on the national vote share or a candidate's chance of winning, you can easily slice a different piece of the data and run it through Prophet.
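For instance, slicing out Biden's chance of winning the electoral college (`ecwin_chal`) might look like the sketch below. It uses three rows copied from the toplines file shown below rather than the full CSV, and only prepares the two-column frame with `ds` (dates) and `y` (values) that Prophet expects. The variable names are my own.

```python
import pandas as pd

# Three rows copied from the 538 toplines file
toplines = pd.DataFrame(
    {
        "modeldate": ["10/1/2020", "9/30/2020", "9/29/2020"],
        "ecwin_chal": [0.79935, 0.78185, 0.77935],
    }
)
toplines["modeldate"] = pd.to_datetime(toplines["modeldate"], format="%m/%d/%Y")

# Rename to Prophet's expected column names and sort oldest-first
biden_win_prob = (
    toplines.rename(columns={"modeldate": "ds", "ecwin_chal": "y"})
    .sort_values("ds")
    .reset_index(drop=True)
)
print(biden_win_prob)
```

Swapping `ecwin_chal` for `national_voteshare_chal` would give the vote-share series instead.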

In [4]:
raw_538.head()

Unnamed: 0,cycle,branch,model,modeldate,candidate_inc,candidate_chal,candidate_3rd,ecwin_inc,ecwin_chal,ecwin_3rd,ec_nomajority,popwin_inc,popwin_chal,popwin_3rd,ev_inc,ev_chal,ev_3rd,ev_inc_hi,ev_chal_hi,ev_3rd_hi,ev_inc_lo,ev_chal_lo,ev_3rd_lo,national_voteshare_inc,national_voteshare_chal,national_voteshare_3rd,nat_voteshare_other,national_voteshare_inc_hi,national_voteshare_chal_hi,national_voteshare_3rd_hi,nat_voteshare_other_hi,national_voteshare_inc_lo,national_voteshare_chal_lo,national_voteshare_3rd_lo,nat_voteshare_other_lo,national_turnout,national_turnout_hi,national_turnout_lo,timestamp,simulations
0,2020,President,polls-plus,10/1/2020,Trump,Biden,,0.196175,0.79935,,0.004475,0.09855,0.90145,,202.5146,335.4854,,308,428.0,,110.0,230,,45.80369,52.91401,,1.282302,49.33895,56.42828,,1.959087,42.30087,49.3783,,0.689478,141000000.0,151000000.0,131000000.0,20:45:04 1 Oct 2020,40000
1,2020,President,polls-plus,9/30/2020,Trump,Biden,,0.213425,0.78185,,0.004725,0.1119,0.8881,,206.2862,331.7138,,312,427.0,,111.0,226,,45.9699,52.74551,,1.284587,49.56242,56.32423,,1.963979,42.41258,49.1458,,0.689635,141000000.0,151000000.0,131000000.0,20:24:04 30 Sep 2020,40000
2,2020,President,polls-plus,9/29/2020,Trump,Biden,,0.216025,0.77935,,0.004625,0.110875,0.889125,,206.9328,331.0672,,312,425.0,,113.0,226,,45.96296,52.75171,,1.285323,49.54276,56.31634,,1.964764,42.41872,49.16298,,0.69029,141000000.0,151000000.0,131000000.0,20:44:03 29 Sep 2020,40000
3,2020,President,polls-plus,9/28/2020,Trump,Biden,,0.2196,0.775725,,0.004675,0.11455,0.88545,,207.7895,330.2105,,315,425.0,,113.0,223,,45.98885,52.72159,,1.289563,49.60487,56.3167,,1.970574,42.40921,49.10035,,0.693078,141000000.0,151000000.0,131000000.0,22:13:03 28 Sep 2020,40000
4,2020,President,polls-plus,9/27/2020,Trump,Biden,,0.2197,0.7756,,0.0047,0.105025,0.894975,,206.6381,331.3619,,313,427.0,,111.0,225,,45.81521,52.8941,,1.29069,49.45233,56.5064,,1.972729,42.21351,49.25216,,0.693352,141000000.0,151000000.0,131000000.0,20:00:04 27 Sep 2020,40000


In [5]:
raw_538['modeldate'] = pd.to_datetime(raw_538['modeldate'])

In [6]:
# narrowing down to EV forecasts only

cols1 = ["modeldate", "candidate_inc", "ev_inc"]

cols2 = ["modeldate", "candidate_chal", "ev_chal"]


trump_538 = raw_538[cols1].copy()
biden_538 = raw_538[cols2].copy()


In [7]:
# renaming cols for clarity

trump_538 = trump_538.rename(
    columns={
        "modeldate": "Forecast_Date",
        "candidate_inc": "Candidate",
        "ev_inc": "538's Projection of Trump's EV",
    }
)


In [8]:
# renaming cols for clarity

biden_538 = biden_538.rename(
    columns={
        "modeldate": "Forecast_Date",
        "candidate_chal": "Candidate",
        "ev_chal": "538's Projection of Biden's EV",
    }
)

In [9]:
trump_538.head()

Unnamed: 0,Forecast_Date,Candidate,538's Projection of Trump's EV
0,2020-10-01,Trump,202.5146
1,2020-09-30,Trump,206.2862
2,2020-09-29,Trump,206.9328
3,2020-09-28,Trump,207.7895
4,2020-09-27,Trump,206.6381


In [10]:
biden_538.head()

Unnamed: 0,Forecast_Date,Candidate,538's Projection of Biden's EV
0,2020-10-01,Biden,335.4854
1,2020-09-30,Biden,331.7138
2,2020-09-29,Biden,331.0672
3,2020-09-28,Biden,330.2105
4,2020-09-27,Biden,331.3619


In [11]:
# check that both DFs are in the same shape

trump_538.shape, biden_538.shape

((123, 3), (123, 3))

# 1.2 EXTRACT ECONOMIST'S EV PROJECTIONS FROM JUNE 1 - OCT 1

The Economist's CSV file is in a different format, naturally. I'll only be using the median EV forecasts.
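The main difference is that The Economist's file is in "long" format, with one row per date per party, whereas 538's has one row per date. The notebook handles this by filtering on candidate, but a pivot is another option. The sketch below uses the first four rows of the file (median column only); `long_df` and `wide_df` are names I've made up.

```python
import pandas as pd

# First four rows of the Economist file, median column only
long_df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"]
        ),
        "party": ["democratic", "republican", "democratic", "republican"],
        "median_ev": [285.0, 253.0, 289.0, 249.0],
    }
)

# One row per date, one column per party
wide_df = long_df.pivot(index="date", columns="party", values="median_ev")
print(wide_df)
```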

In [12]:
raw_economist.head()

Unnamed: 0,date,party,lower_95_ev,lower_60_ev,median_ev,upper_60_ev,upper_95_ev
0,2020-03-01,democratic,144.0,209.0,285.0,356.0,423.0
1,2020-03-01,republican,115.0,182.0,253.0,329.0,394.0
2,2020-03-02,democratic,146.0,212.0,289.0,357.0,421.0
3,2020-03-02,republican,117.0,181.0,249.0,326.0,392.0
4,2020-03-03,democratic,144.0,212.0,288.0,357.0,423.0


In [13]:
raw_economist['date'] = pd.to_datetime(raw_economist['date'])

In [14]:
# filtering out forecasts earlier than June 1
# for consistency with 538's baseline

economist_ev = (
    raw_economist[raw_economist["date"] >= "2020-06-01"]
    .sort_values(by="date", ascending=False)
    .reset_index(drop=True)  # drop=True avoids carrying the old index along as a stray column
)


In [15]:
economist_ev["Candidate"] = np.where(economist_ev["party"] == "democratic", "Biden", "Trump")
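One caveat with `np.where` here: any party value other than "democratic" is labelled "Trump". That is fine for this file, which only has two parties, but a dictionary lookup would surface surprises instead of hiding them. A small sketch, where the third party value is hypothetical and added only to show the behaviour:

```python
import pandas as pd

PARTY_TO_CANDIDATE = {"democratic": "Biden", "republican": "Trump"}

# "libertarian" is a hypothetical value, not present in the Economist file
parties = pd.Series(["democratic", "republican", "libertarian"])

# Values missing from the dictionary become NaN rather than "Trump"
candidates = parties.map(PARTY_TO_CANDIDATE)
unmapped = parties[candidates.isna()].unique()

print(candidates)
print(unmapped)
```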

In [16]:
cols3 = ["date", "Candidate", "median_ev"]

trump_economist = economist_ev[economist_ev["Candidate"] == "Trump"][cols3].copy()
biden_economist = economist_ev[economist_ev["Candidate"] == "Biden"][cols3].copy()

In [17]:
# renaming cols for clarity

trump_economist = trump_economist.rename(
    columns={
        "date": "Forecast_Date",
        "Candidate": "Candidate",
        "median_ev": "Economist's Projection of Trump's EV",
    }
)

biden_economist = biden_economist.rename(
    columns={
        "date": "Forecast_Date",
        "Candidate": "Candidate",
        "median_ev": "Economist's Projection of Biden's EV",
    }
)

In [18]:
# confirm that latest forecast is indeed Oct 1

trump_economist.head()

Unnamed: 0,Forecast_Date,Candidate,Economist's Projection of Trump's EV
0,2020-10-01,Trump,199.0
2,2020-09-30,Trump,201.0
4,2020-09-29,Trump,204.0
6,2020-09-28,Trump,206.0
8,2020-09-27,Trump,207.0


In [19]:
# checking that both DFs are in the same shape

trump_economist.shape, biden_economist.shape

((123, 3), (123, 3))

# 2.0 MERGE DATAFRAMES FROM 538 AND ECONOMIST; AGGREGATE FORECASTS

I haven't been able to find another source of EV projections, so we'll just aggregate two sources for now. But this format can be easily extended to include more data sources, if they release their model output online.
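To give a sense of how the format could extend, here is a sketch that averages across however many source columns are present. `Source_C` is a hypothetical third outlet with invented values; the other figures are the Sep 30 and Oct 1 Trump projections from the two files.

```python
import pandas as pd

wide = pd.DataFrame(
    {
        "Forecast_Date": pd.to_datetime(["2020-10-01", "2020-09-30"]),
        "538": [202.5146, 206.2862],
        "Economist": [199.0, 201.0],
        "Source_C": [205.0, float("nan")],  # hypothetical source, one day missing
    }
)

# Every column except the date is treated as a forecast source
source_cols = [c for c in wide.columns if c != "Forecast_Date"]

# Row-wise mean; NaN values are skipped by default
wide["Average_Projected_EV"] = wide[source_cols].mean(axis=1)
print(wide)
```

Note the design difference: adding the columns and dividing by a fixed count yields NaN whenever one source is missing for a date, while `mean(axis=1)` averages whatever is available. Which behaviour is preferable depends on how much you trust a partial average.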

In [20]:
trump_ev = trump_538.merge(trump_economist, on="Forecast_Date", how="left").drop(
    columns=["Candidate_x", "Candidate_y"]
)

biden_ev = biden_538.merge(biden_economist, on="Forecast_Date", how="left").drop(
    columns=["Candidate_x", "Candidate_y"]
)
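Because this is a left merge, any date missing from The Economist's side would silently produce NaN. A possible safeguard, sketched here on toy frames, uses pandas' `validate` and `indicator` options. The right-hand frame deliberately omits Sep 30 to show the effect, and the column names `ev_538` and `ev_economist` are placeholders.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "Forecast_Date": pd.to_datetime(["2020-10-01", "2020-09-30"]),
        "ev_538": [202.5146, 206.2862],
    }
)
right = pd.DataFrame(
    {
        "Forecast_Date": pd.to_datetime(["2020-10-01"]),  # Sep 30 left out on purpose
        "ev_economist": [199.0],
    }
)

merged = left.merge(
    right,
    on="Forecast_Date",
    how="left",
    validate="one_to_one",  # raises MergeError if a date is duplicated on either side
    indicator=True,         # adds a `_merge` column recording where each row came from
)

# Dates present in the left frame but absent from the right
missing = merged[merged["_merge"] == "left_only"]
print(missing)
```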


In [21]:
# aggregate the 2 forecasts

trump_ev["Average_Projected_EV"] = (
    trump_ev["538's Projection of Trump's EV"]
    + trump_ev["Economist's Projection of Trump's EV"]
) / 2

biden_ev["Average_Projected_EV"] = (
    biden_ev["538's Projection of Biden's EV"]
    + biden_ev["Economist's Projection of Biden's EV"]
) / 2


In [22]:
trump_ev.head()

Unnamed: 0,Forecast_Date,538's Projection of Trump's EV,Economist's Projection of Trump's EV,Average_Projected_EV
0,2020-10-01,202.5146,199.0,200.7573
1,2020-09-30,206.2862,201.0,203.6431
2,2020-09-29,206.9328,204.0,205.4664
3,2020-09-28,207.7895,206.0,206.89475
4,2020-09-27,206.6381,207.0,206.81905


In [23]:
biden_ev.head()

Unnamed: 0,Forecast_Date,538's Projection of Biden's EV,Economist's Projection of Biden's EV,Average_Projected_EV
0,2020-10-01,335.4854,339.0,337.2427
1,2020-09-30,331.7138,337.0,334.3569
2,2020-09-29,331.0672,334.0,332.5336
3,2020-09-28,330.2105,332.0,331.10525
4,2020-09-27,331.3619,331.0,331.18095


In [24]:
trump_ev.shape, biden_ev.shape

((123, 4), (123, 4))
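One more cheap check is whether the two candidates' averages add up to the 538 available EVs. The sketch below uses the three most recent averages from the tables above. They happen to sum to 538, but that isn't guaranteed in general, since The Economist's figures are medians, so treat a failure as a prompt to look closer rather than as an error.

```python
import pandas as pd

# Average_Projected_EV for Oct 1, Sep 30 and Sep 29, copied from the tables above
trump_avg = pd.Series([200.7573, 203.6431, 205.4664])
biden_avg = pd.Series([337.2427, 334.3569, 332.5336])

total = trump_avg + biden_avg

# Allow for floating-point noise when comparing against 538
sums_to_538 = ((total - 538).abs() < 1e-6).all()
print(total.round(4).tolist(), sums_to_538)
```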

In [25]:
# outputting files for time series projections in next notebook (uncomment to write)

#trump_ev.to_csv("../data/trump_ev.csv", index=False)

#biden_ev.to_csv("../data/biden_ev.csv", index=False)