# Forecasting The 2020 US Presidential Election With Prophet And XGB

# -- Updated with new data up to Oct 19 2020 

With weeks to go before the 2020 United States Presidential Election on November 3, all eyes are on the barrage of polls and forecasts for the highly volatile race for the White House. Though a wide range of datasets, and even model source code, is available from data and media outlets tracking the US election, it is not practical, in my view, for most data practitioners to build their own model from scratch, given the domain expertise required.

For one, the White House race is not decided by the popular vote but rather by the electoral college, a unique system where the successful candidate has to stitch together a winning coalition from various states that would give him at least 270 "electoral votes" out of a possible total of 538.
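The 270 figure is simply the smallest strict majority of the 538 electoral votes. A minimal sketch of that arithmetic (the function name is my own, purely for illustration):

```python
TOTAL_ELECTORAL_VOTES = 538  # total stated in the paragraph above


def votes_needed_to_win(total: int = TOTAL_ELECTORAL_VOTES) -> int:
    """Return the smallest number of electoral votes that is a strict majority."""
    return total // 2 + 1


threshold = votes_needed_to_win()
print(threshold)  # 270
```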

Second, not all states are equally influential in the election outcome. The results often come down to how Americans vote in a few critical "battleground states" like Florida, Pennsylvania, Wisconsin and Ohio. The shifting voter sentiments in many of these states pose a huge challenge even for US-based experts, let alone those outside the country.

A more practical alternative, in my view, is to leverage the forecasts by reputable outlets and apply a separate layer of data analysis on top of them.

The notebooks in this repo will detail the data extraction process for this approach, and the subsequent time series analysis using (FB) Prophet and XGB.

## MEDIUM POSTS

Background and related links:
* [Part 2](https://chuachinhon.medium.com/for-trump-no-comfort-in-forecasts-or-twitter-in-final-stretch-of-2020-us-presidential-election-186e655e9bf5)

* [Part 1](https://medium.com/@chinhonchua/forecasting-the-2020-us-presidential-election-with-fb-prophet-36ab84f1a75a)

# 1. DATA EXTRACTION AND PROCESSING

US data and media outlets have been publishing a range of election forecasts, from Trump's and Biden's chances of winning the election and their respective shares of the national vote to the number of Electoral Votes (EV) each candidate might win.

This post will focus only on the forecasts for the EV count and chances of winning the Electoral College (EC). I'm focused on the EV count/EC win probabilities - instead of the popular vote count or national vote share - for two main reasons.

First, the EV count is the only thing that matters on Nov 3. If Biden wins the popular vote but fails to win at least 270 EVs, as happened to Hillary Clinton in the 2016 election, then Trump wins re-election even if more Americans voted for his challenger.

Second, the EV forecasts by FiveThirtyEight and The Economist already factor in their respective assessments of the outcomes in key battleground states, albeit from an overall perspective. In contrast, forecasts of Trump's and Biden's national vote share are not conclusive about which candidate wins or how he performs in the key battleground states.

Both FiveThirtyEight and The Economist have been publishing their respective predicted EV counts for several months, and their forecasts can be downloaded [here](https://data.fivethirtyeight.com/) and [here](https://cdn.economistdatateam.com/us-2020-forecast/data/president/economist_model_output.zip). Both outlets release daily updates of their polling data and model output files. Here are the names of the original CSV files:

* From 538: presidential_national_toplines_2020.csv

* From The Economist (EV count): electoral_college_votes_over_time.csv

* From The Economist (EC win probabilities): electoral_college_probability_over_time.csv

I renamed the three files slightly before loading them below, to remove ambiguity.

The Economist's forecasts go back to March 1, while the earliest 538 forecast for EV count is June 1. For consistency, I'll set the common baseline at June 1 for both sets of forecasts. I've set the new cut-off date at Oct 19, and will run an updated analysis closer to polling day on Nov 3.
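As a quick sanity check on the window, here is a small sketch that counts the daily forecasts we should expect per candidate. It assumes the window runs from June 1 through Oct 19 (the date in this post's update note and in the data filenames) with exactly one forecast per day:

```python
import pandas as pd

BASELINE = "2020-06-01"  # earliest 538 EV forecast, per the text above
CUTOFF = "2020-10-19"    # assumed last forecast date, taken from the filenames

# One entry per calendar day, both endpoints included
window = pd.date_range(BASELINE, CUTOFF, freq="D")
expected_rows = len(window)
print(expected_rows)  # 141
```

If a filtered DataFrame has a different number of rows per candidate, a date is missing or duplicated.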

In [1]:
import numpy as np
import pandas as pd
import warnings

pd.set_option('display.max_columns', 40)
warnings.filterwarnings('ignore')

In [2]:
raw_538 = pd.read_csv("../data/538_19102020.csv")

raw_economist = pd.read_csv("../data/economist_19102020.csv")

raw_economist_chances = pd.read_csv("../data/economist_chances_19102020.csv")

In [3]:
raw_538.shape, raw_economist.shape, raw_economist_chances.shape

((141, 40), (466, 7), (466, 3))

# 1.1 EXTRACT 538'S PROJECTIONS FROM JUNE 1 - OCT 19

538's CSV file on the topline forecasts is probably the most useful I've seen. If you are keen to run time series projections on the national vote share, you can easily slice a different piece of the data and run it on Prophet. 

In [4]:
raw_538.head()

Unnamed: 0,cycle,branch,model,modeldate,candidate_inc,candidate_chal,candidate_3rd,ecwin_inc,ecwin_chal,ecwin_3rd,ec_nomajority,popwin_inc,popwin_chal,popwin_3rd,ev_inc,ev_chal,ev_3rd,ev_inc_hi,ev_chal_hi,ev_3rd_hi,ev_inc_lo,ev_chal_lo,ev_3rd_lo,national_voteshare_inc,national_voteshare_chal,national_voteshare_3rd,nat_voteshare_other,national_voteshare_inc_hi,national_voteshare_chal_hi,national_voteshare_3rd_hi,nat_voteshare_other_hi,national_voteshare_inc_lo,national_voteshare_chal_lo,national_voteshare_3rd_lo,nat_voteshare_other_lo,national_turnout,national_turnout_hi,national_turnout_lo,timestamp,simulations
0,2020,President,polls-plus,10/19/2020,Trump,Biden,,0.120425,0.875,,0.004575,0.040675,0.959325,,190.336,347.664,,279,426.0,,112.0,259,,45.17442,53.57301,,1.25257,48.1834,56.56211,,1.900372,42.18653,50.55529,,0.683699,143000000.0,153000000.0,133000000.0,21:55:04 19 Oct 2020,40000
1,2020,President,polls-plus,10/18/2020,Trump,Biden,,0.121775,0.87355,,0.004675,0.0418,0.9582,,191.2572,346.7427,,279,425.0,,113.0,259,,45.17662,53.57382,,1.249555,48.21889,56.60146,,1.899508,42.14919,50.52079,,0.679204,143000000.0,153000000.0,133000000.0,21:11:03 18 Oct 2020,40000
2,2020,President,polls-plus,10/17/2020,Trump,Biden,,0.12235,0.87365,,0.004,0.04375,0.95625,,191.3031,346.6969,,280,426.0,,112.0,258,,45.17821,53.57002,,1.251769,48.253,56.63504,,1.903847,42.1105,50.48,,0.679659,143000000.0,153000000.0,133000000.0,19:17:03 17 Oct 2020,40000
3,2020,President,polls-plus,10/16/2020,Trump,Biden,,0.129225,0.8677,,0.003075,0.045975,0.954025,,192.213,345.787,,283,426.0,,112.0,255,,45.22214,53.52393,,1.253931,48.32899,56.62143,,1.907859,42.12756,50.40108,,0.680276,143000000.0,153000000.0,133000000.0,21:08:11 16 Oct 2020,40000
4,2020,President,polls-plus,10/15/2020,Trump,Biden,,0.1295,0.867025,,0.003475,0.04715,0.95285,,192.0065,345.9936,,283,426.0,,112.0,255,,45.22849,53.52165,,1.249865,48.36061,56.64082,,1.905437,42.1101,50.37309,,0.675187,143000000.0,153000000.0,133000000.0,20:38:03 15 Oct 2020,40000
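To illustrate the point about slicing a different piece of the data: Prophet expects a DataFrame with a date column named `ds` and a value column named `y`. The sketch below reshapes Biden's national vote share into that layout. The three rows are copied by hand from the output above; in the notebook you would start from `raw_538` instead of the hypothetical `sample` frame.

```python
import pandas as pd

# Three rows copied from the head() output above (stand-in for raw_538)
sample = pd.DataFrame(
    {
        "modeldate": ["10/19/2020", "10/18/2020", "10/17/2020"],
        "national_voteshare_chal": [53.57301, 53.57382, 53.57002],
    }
)

prophet_input = (
    sample.assign(ds=pd.to_datetime(sample["modeldate"], format="%m/%d/%Y"))
    .rename(columns={"national_voteshare_chal": "y"})
    .loc[:, ["ds", "y"]]       # keep only the two columns Prophet needs
    .sort_values("ds")         # oldest forecast first
    .reset_index(drop=True)
)
print(prophet_input)
```

Passing an explicit `format` to `pd.to_datetime` avoids any day/month ambiguity in the `MM/DD/YYYY` strings.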


In [5]:
raw_538['modeldate'] = pd.to_datetime(raw_538['modeldate'])

In [6]:
# narrowing down to EV forecasts only

cols1 = ["modeldate", "candidate_inc", "ev_inc", "ecwin_inc"]

cols2 = ["modeldate", "candidate_chal", "ev_chal", "ecwin_chal"]


trump_538 = raw_538[cols1].copy()
biden_538 = raw_538[cols2].copy()


In [7]:
# renaming cols for clarity

trump_538 = trump_538.rename(
    columns={
        "modeldate": "Forecast_Date",
        "candidate_inc": "Candidate",
        "ev_inc": "Trump's EV Forecast (538)",
        "ecwin_inc": "Trump's Chance of Winning (538)"
    }
)


In [8]:
# renaming cols for clarity

biden_538 = biden_538.rename(
    columns={
        "modeldate": "Forecast_Date",
        "candidate_chal": "Candidate",
        "ev_chal": "Biden's EV Forecast (538)",
        "ecwin_chal": "Biden's Chance of Winning (538)"
    }
)

In [9]:
trump_538.head()

Unnamed: 0,Forecast_Date,Candidate,Trump's EV Forecast (538),Trump's Chance of Winning (538)
0,2020-10-19,Trump,190.336,0.120425
1,2020-10-18,Trump,191.2572,0.121775
2,2020-10-17,Trump,191.3031,0.12235
3,2020-10-16,Trump,192.213,0.129225
4,2020-10-15,Trump,192.0065,0.1295


In [10]:
biden_538.head()

Unnamed: 0,Forecast_Date,Candidate,Biden's EV Forecast (538),Biden's Chance of Winning (538)
0,2020-10-19,Biden,347.664,0.875
1,2020-10-18,Biden,346.7427,0.87355
2,2020-10-17,Biden,346.6969,0.87365
3,2020-10-16,Biden,345.787,0.8677
4,2020-10-15,Biden,345.9936,0.867025


In [11]:
# check that both DFs are in the same shape

trump_538.shape, biden_538.shape

((141, 4), (141, 4))
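Cells 6 to 8 repeat the same select-and-rename steps for each candidate. One possible way to fold them into a single function is sketched below. `extract_candidate` is a hypothetical helper, not part of the original notebook, and the demo rows are copied from the `raw_538.head()` output above.

```python
import pandas as pd


def extract_candidate(raw: pd.DataFrame, suffix: str, name: str) -> pd.DataFrame:
    """Pull one candidate's EV forecast and win probability from 538's toplines.

    `suffix` is "inc" (incumbent) or "chal" (challenger), matching the CSV's
    column names; `name` is only used to label the renamed columns.
    """
    cols = ["modeldate", f"candidate_{suffix}", f"ev_{suffix}", f"ecwin_{suffix}"]
    return raw[cols].rename(
        columns={
            "modeldate": "Forecast_Date",
            f"candidate_{suffix}": "Candidate",
            f"ev_{suffix}": f"{name}'s EV Forecast (538)",
            f"ecwin_{suffix}": f"{name}'s Chance of Winning (538)",
        }
    )


# Two rows copied from the output above, standing in for raw_538
demo = pd.DataFrame(
    {
        "modeldate": pd.to_datetime(["2020-10-19", "2020-10-18"]),
        "candidate_inc": ["Trump", "Trump"],
        "candidate_chal": ["Biden", "Biden"],
        "ev_inc": [190.336, 191.2572],
        "ev_chal": [347.664, 346.7427],
        "ecwin_inc": [0.120425, 0.121775],
        "ecwin_chal": [0.875, 0.87355],
    }
)

trump_demo = extract_candidate(demo, "inc", "Trump")
biden_demo = extract_candidate(demo, "chal", "Biden")
print(biden_demo)
```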

# 1.2 EXTRACT ECONOMIST'S PROJECTIONS FROM JUNE 1 - OCT 19

The Economist's CSV files are in a different format, naturally. I'll only be using the median EV forecasts.

In [12]:
raw_economist.head()

Unnamed: 0,date,party,lower_95_ev,lower_60_ev,median_ev,upper_60_ev,upper_95_ev
0,2020-03-01,democratic,144.0,209.0,285.0,356.0,423.0
1,2020-03-01,republican,115.0,182.0,253.0,329.0,394.0
2,2020-03-02,democratic,146.0,212.0,289.0,357.0,421.0
3,2020-03-02,republican,117.0,181.0,249.0,326.0,392.0
4,2020-03-03,democratic,144.0,212.0,288.0,357.0,423.0


In [13]:
raw_economist_chances.head()

Unnamed: 0,date,party,win_prob
0,2020-03-01,democratic,0.559
1,2020-03-01,republican,0.436
2,2020-03-02,democratic,0.573
3,2020-03-02,republican,0.423
4,2020-03-03,democratic,0.578


In [14]:
raw_economist['date'] = pd.to_datetime(raw_economist['date'])

raw_economist_chances['date'] = pd.to_datetime(raw_economist_chances['date'])

In [15]:
# filtering out forecasts earlier than June 1
# for consistency with 538's baseline

economist_ev = (
    raw_economist[raw_economist["date"] >= "2020-06-01"]
    .sort_values(by="date", ascending=False)
    .reset_index(drop=True)  # discard the old row labels instead of keeping them as a column
)


economist_chances = (
    raw_economist_chances[raw_economist_chances["date"] >= "2020-06-01"]
    .sort_values(by="date", ascending=False)
    .reset_index(drop=True)  # discard the old row labels instead of keeping them as a column
)

In [16]:
economist_ev["Candidate"] = np.where(economist_ev["party"] == "democratic", "Biden", "Trump")

economist_chances["Candidate"] = np.where(economist_chances["party"] == "democratic", "Biden", "Trump")

In [17]:
cols3 = ["date", "Candidate", "median_ev"]
cols4 = ["date", "Candidate", "win_prob"]

trump_economist1 = economist_ev[economist_ev["Candidate"] == "Trump"][cols3].copy()

biden_economist1 = economist_ev[economist_ev["Candidate"] == "Biden"][cols3].copy()

trump_economist2 = economist_chances[economist_chances["Candidate"] == "Trump"][
    cols4
].copy()

biden_economist2 = economist_chances[economist_chances["Candidate"] == "Biden"][
    cols4
].copy()


In [18]:
# concatenating the two Economist forecasts we need into 1 DF (rows are aligned on the index)
trump_economist = pd.concat(
    [trump_economist1, trump_economist2], axis=1, join="inner", sort=True
)

trump_economist = trump_economist.loc[:, ~trump_economist.columns.duplicated()]

biden_economist = pd.concat(
    [biden_economist1, biden_economist2], axis=1, join="inner", sort=True
)

biden_economist = biden_economist.loc[:, ~biden_economist.columns.duplicated()]


In [19]:
# renaming cols for clarity

trump_economist = trump_economist.rename(
    columns={
        "date": "Forecast_Date",
        "Candidate": "Candidate",
        "median_ev": "Trump's EV Forecast (Economist)",
        "win_prob": "Trump's Chance of Winning (Economist)",
    }
)

biden_economist = biden_economist.rename(
    columns={
        "date": "Forecast_Date",
        "Candidate": "Candidate",
        "median_ev": "Biden's EV Forecast (Economist)",
        "win_prob": "Biden's Chance of Winning (Economist)"
    }
)

In [20]:
# confirm that latest forecast is indeed Oct 19

trump_economist.head()

Unnamed: 0,Forecast_Date,Candidate,Trump's EV Forecast (Economist),Trump's Chance of Winning (Economist)
0,2020-10-19,Trump,188.0,0.073
2,2020-10-18,Trump,198.0,0.094
4,2020-10-17,Trump,197.0,0.089
6,2020-10-16,Trump,195.0,0.087
8,2020-10-15,Trump,195.0,0.086


In [21]:
biden_economist.head()

Unnamed: 0,Forecast_Date,Candidate,Biden's EV Forecast (Economist),Biden's Chance of Winning (Economist)
1,2020-10-19,Biden,350.0,0.924
3,2020-10-18,Biden,340.0,0.903
5,2020-10-17,Biden,341.0,0.908
7,2020-10-16,Biden,343.0,0.911
9,2020-10-15,Biden,343.0,0.911


In [22]:
# checking that both DFs are in the same shape

trump_economist.shape, biden_economist.shape

((141, 4), (141, 4))
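The `pd.concat(..., axis=1)` step in cell 18 pairs rows by index label, so it relies on both Economist frames ending up in the same row order after sorting. A possible alternative, sketched here on toy frames with values taken from the outputs above, is to merge on the `date` and `Candidate` keys so that row order no longer matters. The `ev` and `chances` names are stand-ins for `economist_ev` and `economist_chances`.

```python
import pandas as pd

ev = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-10-19", "2020-10-19", "2020-10-18", "2020-10-18"]),
        "Candidate": ["Trump", "Biden", "Trump", "Biden"],
        "median_ev": [188.0, 350.0, 198.0, 340.0],
    }
)

# Deliberately shuffled relative to `ev` to show that order is irrelevant
chances = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-10-18", "2020-10-19", "2020-10-19", "2020-10-18"]),
        "Candidate": ["Biden", "Biden", "Trump", "Trump"],
        "win_prob": [0.903, 0.924, 0.073, 0.094],
    }
)

# validate="one_to_one" raises if a (date, Candidate) pair is duplicated
combined = ev.merge(chances, on=["date", "Candidate"], how="inner", validate="one_to_one")

trump_demo = combined[combined["Candidate"] == "Trump"].reset_index(drop=True)
print(trump_demo)
```

This also removes the need to drop duplicated columns afterwards, since the key columns appear only once in the merged result.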

# 2.0 MERGE DATAFRAMES FROM 538 AND ECONOMIST; AGGREGATE FORECASTS

I haven't been able to find a third source of public forecasts, so we'll just aggregate two sources for now. But this format can be easily extended to include more data sources, if they release their model output online.

In [23]:
trump = trump_538.merge(trump_economist, on="Forecast_Date", how="left").drop(
    columns=["Candidate_x", "Candidate_y"]
)

biden = biden_538.merge(biden_economist, on="Forecast_Date", how="left").drop(
    columns=["Candidate_x", "Candidate_y"]
)
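Because the merge uses `how="left"`, any date present in 538's data but absent from The Economist's would silently produce `NaN` values. A hedged sketch of how one might surface such gaps with pandas' `validate` and `indicator` options follows; the missing Oct 18 row is invented purely to trigger the check, and the short column names are placeholders.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "Forecast_Date": pd.to_datetime(["2020-10-19", "2020-10-18"]),
        "ev_538": [190.336, 191.2572],
    }
)

# Hypothetical gap: the second source lacks Oct 18
right = pd.DataFrame(
    {
        "Forecast_Date": pd.to_datetime(["2020-10-19"]),
        "ev_economist": [188.0],
    }
)

checked = left.merge(
    right,
    on="Forecast_Date",
    how="left",
    validate="one_to_one",  # raises if either side repeats a date
    indicator=True,         # adds a "_merge" column describing each row's origin
)

missing = checked[checked["_merge"] == "left_only"]
print(missing[["Forecast_Date"]])
```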


In [24]:
# aggregate the 2 forecasts for EV forecasts

trump["Average_Projected_EV"] = (
    trump["Trump's EV Forecast (538)"]
    + trump["Trump's EV Forecast (Economist)"]
) / 2

biden["Average_Projected_EV"] = (
    biden["Biden's EV Forecast (538)"]
    + biden["Biden's EV Forecast (Economist)"]
) / 2


In [25]:
# aggregate the 2 forecasts for chances of winning EC

trump["Average_Chance_of_Winning (%)"] = 100 * (
    trump["Trump's Chance of Winning (538)"]
    + trump["Trump's Chance of Winning (Economist)"]
) / 2

biden["Average_Chance_of_Winning (%)"] = 100 * (
    biden["Biden's Chance of Winning (538)"]
    + biden["Biden's Chance of Winning (Economist)"]
) / 2
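If more forecast sources become available, the hard-coded `(a + b) / 2` would need rewriting each time. One way to generalise it, sketched below on two rows from the Trump data, is to select the columns by name pattern and take a row-wise mean. Note one behavioural difference: `mean(axis=1)` skips missing values by default, whereas adding the columns directly yields `NaN` if either source is missing.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "Trump's EV Forecast (538)": [190.336, 191.2572],
        "Trump's EV Forecast (Economist)": [188.0, 198.0],
    }
)

# Pick up every EV forecast column, however many sources there are
ev_cols = [col for col in demo.columns if "EV Forecast" in col]

demo["Average_Projected_EV"] = demo[ev_cols].mean(axis=1)
print(demo["Average_Projected_EV"].round(4).tolist())  # [189.168, 194.6286]
```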

In [26]:
# re-arranging cols for clarity
cols5 = [
    "Forecast_Date",
    "Biden's EV Forecast (538)",
    "Biden's EV Forecast (Economist)",
    "Average_Projected_EV",
    "Biden's Chance of Winning (538)",
    "Biden's Chance of Winning (Economist)",
    "Average_Chance_of_Winning (%)",
]

biden = biden[cols5].copy()


In [27]:
# re-arranging cols for clarity
cols6 = [
    "Forecast_Date",
    "Trump's EV Forecast (538)",
    "Trump's EV Forecast (Economist)",
    "Average_Projected_EV",
    "Trump's Chance of Winning (538)",
    "Trump's Chance of Winning (Economist)",
    "Average_Chance_of_Winning (%)",
]

trump = trump[cols6].copy()

In [28]:
trump.head()

Unnamed: 0,Forecast_Date,Trump's EV Forecast (538),Trump's EV Forecast (Economist),Average_Projected_EV,Trump's Chance of Winning (538),Trump's Chance of Winning (Economist),Average_Chance_of_Winning (%)
0,2020-10-19,190.336,188.0,189.168,0.120425,0.073,9.67125
1,2020-10-18,191.2572,198.0,194.6286,0.121775,0.094,10.78875
2,2020-10-17,191.3031,197.0,194.15155,0.12235,0.089,10.5675
3,2020-10-16,192.213,195.0,193.6065,0.129225,0.087,10.81125
4,2020-10-15,192.0065,195.0,193.50325,0.1295,0.086,10.775


In [29]:
biden.head()

Unnamed: 0,Forecast_Date,Biden's EV Forecast (538),Biden's EV Forecast (Economist),Average_Projected_EV,Biden's Chance of Winning (538),Biden's Chance of Winning (Economist),Average_Chance_of_Winning (%)
0,2020-10-19,347.664,350.0,348.832,0.875,0.924,89.95
1,2020-10-18,346.7427,340.0,343.37135,0.87355,0.903,88.8275
2,2020-10-17,346.6969,341.0,343.84845,0.87365,0.908,89.0825
3,2020-10-16,345.787,343.0,344.3935,0.8677,0.911,88.935
4,2020-10-15,345.9936,343.0,344.4968,0.867025,0.911,88.90125


In [30]:
trump.shape, biden.shape

((141, 7), (141, 7))
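One more check worth considering: on any given date, the two candidates' averaged EV forecasts should add up to roughly 538. The sketch below uses the first three rows of the two outputs above and allows a small tolerance, since the published 538 figures are rounded.

```python
TOTAL_EV = 538
TOLERANCE = 0.01  # assumed allowance for rounding in the published figures

# Average_Projected_EV for Oct 19, 18 and 17, copied from the outputs above
trump_avg_ev = [189.168, 194.6286, 194.15155]
biden_avg_ev = [348.832, 343.37135, 343.84845]

totals = [t + b for t, b in zip(trump_avg_ev, biden_avg_ev)]
all_consistent = all(abs(total - TOTAL_EV) < TOLERANCE for total in totals)
print(all_consistent)  # True
```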

In [31]:
# outputting files for time series projections in next notebook

#trump.to_csv("../data/trump_19102020.csv", index=False)

#biden.to_csv("../data/biden_19102020.csv", index=False)
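When the saved files are read back in the next notebook, `Forecast_Date` will come back as plain strings unless it is parsed again. A small round-trip sketch, using a temporary directory and a hypothetical file name rather than the `../data/` paths above:

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame(
    {
        "Forecast_Date": pd.to_datetime(["2020-10-19", "2020-10-18"]),
        "Average_Projected_EV": [189.168, 194.6286],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trump_demo.csv")  # hypothetical file name
    demo.to_csv(path, index=False)
    # parse_dates restores the datetime dtype that CSV cannot store
    restored = pd.read_csv(path, parse_dates=["Forecast_Date"])

print(restored.dtypes)
```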