# Forecasting The 2020 US Presidential Election With XGB - Trump's chance of winning the EC

# -- Updated with new data up to Oct 19 2020 

This notebook deals with forecasts for Trump's probabilities of winning the electoral college. 

XGB is the go-to model for many classification and regression tasks, but it is less commonly used for time series. I've seen a few online examples of how it can be applied to time series forecasts, so I thought I'd try implementing it here to see how the results compare with Prophet.

The workflow will be familiar to anyone who has worked with XGB. The only notable differences are the lack of a grid search for optimal hyperparameters and a manual train-test split.

As Jason Brownlee points out in this [very helpful post](https://machinelearningmastery.com/xgboost-for-time-series-forecasting/), a time series model should not be evaluated with methods that randomize the dataset during evaluation, such as k-fold cross validation. The reason is that the model must be **strictly trained on past data** in order to predict the future, and not vice versa.

I used his walk-forward validation code to test a few configurations and settled on the set below after achieving a mean absolute error of about 1.44 to 1.47 EVs over 30 predictions. I won't incorporate that code here, but his tutorial has all the details if you wish to try it out.
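
For a rough idea of what walk-forward validation does, here is a minimal sketch. It is not Brownlee's code: the function names are hypothetical, the numbers are made up, and a naive "repeat the last value" forecaster stands in for XGB so the example runs without any dependencies. Note that the series must be ordered oldest first, which is the reverse of how the dataset in this notebook is stored.

```python
def walk_forward_mae(series, n_test, fit_predict):
    """Score one-step-ahead forecasts over the last n_test points.

    series      : values ordered oldest -> newest
    fit_predict : callable that takes the history so far and returns one forecast
    """
    history = list(series[:-n_test])
    errors = []
    for actual in series[-n_test:]:
        prediction = fit_predict(history)      # the model only ever sees past values
        errors.append(abs(actual - prediction))
        history.append(actual)                 # reveal the true value, then step forward
    return sum(errors) / len(errors)


def last_value(history):
    """Hypothetical stand-in model: forecast the most recent observation."""
    return history[-1]


# Made-up daily values, oldest first (illustration only, not the real data)
demo = [12.0, 11.5, 11.0, 11.2, 10.8, 10.6, 10.8, 9.7]
mae = walk_forward_mae(demo, n_test=3, fit_predict=last_value)
print(f"Walk-forward MAE: {mae:.2f}")
```

Swapping `last_value` for a function that refits a regressor on `history` at every step gives the general shape of the approach described in the tutorial.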

For the same reason, the train-test split can't be done with the usual scikit-learn tool, `train_test_split`, which shuffles the rows by default. I manually set Sept 22 2020 as the cut-off date for the train-test split. As we get closer to the election, this date will clearly have to move so that we maintain a roughly 80:20 split.

I split up the Trump-Biden forecasts to make the notebook easier to follow/replicate. Biden's forecasts using XGB are in notebook 3.3a/b.

## Resources:

I referenced these two posts substantially:

* [Hourly Time Series Forecasting using XGBoost](https://www.kaggle.com/robikscube/tutorial-time-series-forecasting-with-xgboost)

* [How to Use XGBoost for Time Series Forecasting](https://machinelearningmastery.com/xgboost-for-time-series-forecasting/)

## MEDIUM POST

Background and related links:
* [Part 2](https://chuachinhon.medium.com/for-trump-no-comfort-in-forecasts-or-twitter-in-final-stretch-of-2020-us-presidential-election-186e655e9bf5)

* [Part 1](https://medium.com/@chinhonchua/forecasting-the-2020-us-presidential-election-with-fb-prophet-36ab84f1a75a)

In [1]:
import numpy as np
import pandas as pd
import warnings

from xgboost import XGBRegressor 

warnings.filterwarnings('ignore')

In [2]:
# I've included this dataset in the repo.
# It was prepared from the latest 538 and Economist forecasts, in the same way as in
# notebook 1.1_data_extract_19102020; load only the 2 columns that we need

trump = pd.read_csv("../data/trump_19102020.csv")[["Forecast_Date", "Average_Chance_of_Winning (%)"]]


In [3]:
trump.shape

(141, 2)

In [4]:
trump.head()

Unnamed: 0,Forecast_Date,Average_Chance_of_Winning (%)
0,2020-10-19,9.67125
1,2020-10-18,10.78875
2,2020-10-17,10.5675
3,2020-10-16,10.81125
4,2020-10-15,10.775


# 1. PREPARE DATA

## 1.1 CREATE NEW FEATURES FROM DATES

Nominally, we only have the forecast dates to go on as "predictors". From dates, however, we can extract a range of new features, right down to hours and minutes. Since the forecasts here change on a daily basis, features such as day of the week and week of the year make more sense.

In [5]:
trump["Forecast_Date"] = pd.to_datetime(trump["Forecast_Date"], errors='coerce')

trump["dayofweek"] = trump["Forecast_Date"].dt.dayofweek
trump["quarter"] = trump["Forecast_Date"].dt.quarter
trump["month"] = trump["Forecast_Date"].dt.month
trump["dayofmonth"] = trump["Forecast_Date"].dt.day
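# .dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the same ISO week number
trump["weekofyear"] = trump["Forecast_Date"].dt.isocalendar().week.astype("int64")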
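
Since the same five features are needed again later for the forecast dates, the extraction can be wrapped in a small helper. This is only a sketch: `add_date_features` and `FEATURES` are hypothetical names that do not appear elsewhere in the notebook, and it assumes pandas 1.1 or later for `isocalendar()`.

```python
import pandas as pd

FEATURES = ["dayofweek", "dayofmonth", "month", "quarter", "weekofyear"]


def add_date_features(df, date_col="Forecast_Date"):
    """Return a copy of df with calendar features derived from date_col."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["dayofweek"] = dates.dt.dayofweek      # Monday=0 ... Sunday=6
    out["quarter"] = dates.dt.quarter
    out["month"] = dates.dt.month
    out["dayofmonth"] = dates.dt.day
    out["weekofyear"] = dates.dt.isocalendar().week.astype("int64")  # ISO week number
    return out


# Two dates used purely for illustration
demo = add_date_features(pd.DataFrame({"Forecast_Date": ["2020-11-04", "2020-11-01"]}))
print(demo[FEATURES])
```

With a helper like this, `trump = add_date_features(trump)` would replace the five assignments above, and the same call could be reused on any frame that has a `Forecast_Date` column.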

## 1.2 MANUAL TRAIN-TEST SPLIT

Pick a different date if you want to use a smaller test set.

In [6]:
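# .copy() so that adding a forecast column to `test` later doesn't write to a slice of `trump`
train = trump[~(trump["Forecast_Date"] >= "2020-09-22")].copy()

test = trump[trump["Forecast_Date"] >= "2020-09-22"].copy()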

In [7]:
X_train = train[["dayofweek", "dayofmonth", "month", "quarter", "weekofyear"]]

X_test = test[["dayofweek", "dayofmonth", "month", "quarter", "weekofyear"]]

In [8]:
y_train = train["Average_Chance_of_Winning (%)"].values

y_test = test["Average_Chance_of_Winning (%)"].values

In [9]:
# this gives us an 80:20 split

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((113, 5), (113,), (28, 5), (28,))
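
Rather than picking the cut-off by hand each time new data arrives, the date can be derived from the target ratio. The sketch below is one possible way to do that; it runs on a synthetic stand-in of 141 consecutive daily dates ending Oct 19 2020, which is an assumption about the shape of the real dataset, and `train_share` is a hypothetical name.

```python
import pandas as pd

# Synthetic stand-in for the real data: 141 consecutive daily dates ending Oct 19 2020
dates = pd.Series(pd.date_range(end="2020-10-19", periods=141, freq="D"))

train_share = 0.8
ordered = dates.sort_values().reset_index(drop=True)

# The date at the 80% position becomes the first day of the test set
cutoff = ordered.iloc[round(len(ordered) * train_share)]

n_train = int((dates < cutoff).sum())
n_test = int((dates >= cutoff).sum())
print(cutoff.date(), n_train, n_test)
```

Under that assumption the sketch lands on Sept 22 2020 with 113 training rows and 28 test rows, the same split as above. The split stays chronological, so no future rows leak into training.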

# 2. LOAD MODEL & CHECK TEST RESULTS

The parameters below were arrived at after a few trials with Jason Brownlee's walk-forward validation method, as discussed above.

In [10]:
xgb = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.01,
    gamma=0.01,
    objective="reg:squarederror",
    max_depth=3,
    n_jobs=-1,
    random_state=42,
)

xgb.fit(
    X_train,
    y_train,
    verbose=False,
)


XGBRegressor(gamma=0.01, learning_rate=0.01, n_estimators=1000, n_jobs=-1,
             objective='reg:squarederror', random_state=42)

In [11]:
test["XGB_Forecasts"] = xgb.predict(X_test)

trump_pred = pd.concat([test, train], sort=False).drop(
    ["dayofweek", "dayofmonth", "month", "quarter", "weekofyear"], axis=1
)


In [12]:
trump_pred.head(20)

Unnamed: 0,Forecast_Date,Average_Chance_of_Winning (%),XGB_Forecasts
0,2020-10-19,9.67125,18.638954
1,2020-10-18,10.78875,18.181906
2,2020-10-17,10.5675,18.570286
3,2020-10-16,10.81125,18.556946
4,2020-10-15,10.775,18.60738
5,2020-10-14,10.78875,18.653614
6,2020-10-13,10.73875,19.504313
7,2020-10-12,11.21625,19.660921
8,2020-10-11,11.23625,19.137009
9,2020-10-10,11.17125,19.478369
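
The table suggests the test-set forecasts sit well above the actual averages. A quick way to put a number on that is the mean absolute error and the mean signed error (bias). The sketch below uses only the ten rows printed above, copied by hand; in the notebook the same two lines could be run on `y_test` and `test["XGB_Forecasts"]` for the full test set.

```python
import numpy as np

# The ten rows shown above: actual averages vs. XGB test-set forecasts (in %)
actual = np.array([9.67125, 10.78875, 10.5675, 10.81125, 10.775,
                   10.78875, 10.73875, 11.21625, 11.23625, 11.17125])
predicted = np.array([18.638954, 18.181906, 18.570286, 18.556946, 18.60738,
                      18.653614, 19.504313, 19.660921, 19.137009, 19.478369])

mae = np.mean(np.abs(predicted - actual))   # average size of the miss
bias = np.mean(predicted - actual)          # positive means the model overshoots
print(f"MAE: {mae:.2f} points, mean bias: {bias:+.2f} points")
```

On these ten rows both figures come out at roughly 8 percentage points, which means every forecast shown overshoots the actual value.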


# 3. FORECASTS FOR CHANCE OF WINNING TO NOV 4

## 3.1 EXPOSE MODEL TO FULL DATA

First, let's expose the model to the full set of available data.

In [13]:
X = trump[["dayofweek", "dayofmonth", "month", "quarter", "weekofyear"]]

y = trump["Average_Chance_of_Winning (%)"].values

In [14]:
xgb.fit(X, y)

XGBRegressor(gamma=0.01, learning_rate=0.01, n_estimators=1000, n_jobs=-1,
             objective='reg:squarederror', random_state=42)

## 3.2 GENERATE NEW DF WITH DATES UP TO NOV 4

In [15]:
forecast = (trump["Forecast_Date"] + pd.Timedelta(16, unit="days")).to_frame()

forecast["Average_Chance_of_Winning (%)"] = None 

forecast = forecast[forecast["Forecast_Date"] >= "2020-10-01"]

In [16]:
#forecast["Forecast_Date"] = pd.to_datetime(forecast["Forecast_Date"], errors='coerce')

forecast["dayofweek"] = forecast["Forecast_Date"].dt.dayofweek
forecast["quarter"] = forecast["Forecast_Date"].dt.quarter
forecast["month"] = forecast["Forecast_Date"].dt.month
forecast["dayofmonth"] = forecast["Forecast_Date"].dt.day
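# .dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the same ISO week number
forecast["weekofyear"] = forecast["Forecast_Date"].dt.isocalendar().week.astype("int64")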

In [17]:
forecast.head()

Unnamed: 0,Forecast_Date,Average_Chance_of_Winning (%),dayofweek,quarter,month,dayofmonth,weekofyear
0,2020-11-04,,2,4,11,4,45
1,2020-11-03,,1,4,11,3,45
2,2020-11-02,,0,4,11,2,45
3,2020-11-01,,6,4,11,1,44
4,2020-10-31,,5,4,10,31,44
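
Shifting the existing dates by 16 days and filtering from Oct 1 appears to keep some dates that are already in the dataset (Oct 1 to Oct 19, assuming daily data). If only the unseen dates are wanted, an alternative is to build them directly with `pd.date_range`. This is a sketch of that alternative, not what the notebook does; `last_observed`, `horizon_end` and `future` are hypothetical names.

```python
import pandas as pd

last_observed = pd.Timestamp("2020-10-19")   # latest Forecast_Date in the dataset
horizon_end = pd.Timestamp("2020-11-04")     # last date we want a forecast for

# One row per day, starting the day after the last observation
future = pd.DataFrame({
    "Forecast_Date": pd.date_range(
        start=last_observed + pd.Timedelta(days=1),
        end=horizon_end,
        freq="D",
    )
})

print(len(future), future["Forecast_Date"].min().date(), future["Forecast_Date"].max().date())
```

The resulting frame runs oldest first, so it would need `sort_values("Forecast_Date", ascending=False)` to match the newest-first ordering used in the rest of the notebook.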


## 3.3 GENERATE FORECASTS 

In [18]:
X_forecast = forecast[["dayofweek", "dayofmonth", "month", "quarter", "weekofyear"]]

In [19]:
forecast["XGB_Trump_Forecast"] = xgb.predict(X_forecast)

In [20]:
forecast_pred = forecast[["Forecast_Date", "XGB_Trump_Forecast"]].copy()

In [21]:
forecast_pred.head()

Unnamed: 0,Forecast_Date,XGB_Trump_Forecast
0,2020-11-04,11.284355
1,2020-11-03,11.998125
2,2020-11-02,11.911569
3,2020-11-01,10.739668
4,2020-10-31,10.795016


In [22]:
# forecast_pred.to_csv("../data/XGB_Trump_chances_19102020.csv", index=False)