# Forecasting The 2020 US Presidential Election With XGB - Biden's forecast

XGBoost (XGB) is the go-to model for many classification and regression tasks, but it is less commonly used for time series. I've seen a few online examples of how it can be used for time series forecasting, so I thought I'd try implementing it here to see how the results compare with Prophet.

The workflow will be familiar to anyone who has worked with XGB. The only notable differences are the lack of a grid search for optimal hyperparameters and a manual train-test split.

As Jason Brownlee points out in this [very helpful post](https://machinelearningmastery.com/xgboost-for-time-series-forecasting/), a time series model should not be evaluated with methods that shuffle the dataset, such as k-fold cross-validation. The model must be **strictly trained on past data** in order to predict the future, never the other way around.

I used his walk-forward validation code to test a few configurations and went with the set below after achieving a mean absolute error (MAE) of about 1.44 to 1.47 electoral votes (EVs) over 30 predictions. I won't incorporate that code here, but his tutorial has all the details if you wish to try it out.
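For readers who only want the gist of walk-forward validation, here is a minimal sketch in plain Python. It is not Brownlee's code: `walk_forward_mae` and `persistence_forecast` are hypothetical names, the series is made up, and the "model" is a naive last-value forecaster standing in for XGB. The point is only that each prediction sees nothing but earlier observations.

```python
from statistics import mean


def persistence_forecast(history):
    """Placeholder model: predict that the next value equals the last one seen."""
    return history[-1]


def walk_forward_mae(series, n_test, forecast_fn=persistence_forecast):
    """Walk forward over the last n_test points of a series ordered oldest first.

    At each step the forecaster only sees observations that precede the
    point being predicted; the true value is then appended to the history.
    """
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    history = list(series[:-n_test])
    errors = []
    for actual in series[-n_test:]:
        predicted = forecast_fn(history)  # uses past data only
        errors.append(abs(actual - predicted))
        history.append(actual)  # reveal the true value, then step forward
    return mean(errors)


# Made-up values, oldest first, purely to show the mechanics.
toy_series = [300.0, 302.0, 301.0, 305.0, 307.0, 306.0]
toy_mae = walk_forward_mae(toy_series, n_test=3)
print(toy_mae)
```

To use this idea with a real model, `forecast_fn` would refit on `history` and return a one-step-ahead prediction. The dataset in this notebook is sorted newest first, so it would need to be reversed before being passed in.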

For the same reason, the train-test split can't be done with scikit-learn's usual `train_test_split`, which shuffles rows by default. I manually set Sept 14, 2020 as the cut-off date for the split. As we get closer to the election, this date will need to move so that the split stays at roughly 80:20.
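One way to keep the split near 80:20 as new daily forecasts arrive is to derive the cut-off from the dates themselves instead of hard-coding it. The sketch below assumes one row per date; `chronological_cutoff` is a hypothetical helper and the dates are synthetic, not the notebook's data.

```python
from datetime import date, timedelta


def chronological_cutoff(dates, train_fraction=0.8):
    """Return the first date of the test set for a time-ordered split."""
    ordered = sorted(dates)
    n_train = int(len(ordered) * train_fraction)
    if not 0 < n_train < len(ordered):
        raise ValueError("train_fraction leaves an empty train or test set")
    return ordered[n_train]


# Synthetic daily dates for illustration only.
toy_dates = [date(2020, 1, 1) + timedelta(days=i) for i in range(10)]
cutoff = chronological_cutoff(toy_dates)

# Everything before the cut-off is training data, the rest is test data.
train_dates = [d for d in toy_dates if d < cutoff]
test_dates = [d for d in toy_dates if d >= cutoff]
print(cutoff, len(train_dates), len(test_dates))
```

The returned date can then be used in place of the hard-coded string in the boolean masks that build `train` and `test`.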

I split up the Trump-Biden forecasts to make the notebook easier to follow/replicate. Trump's forecasts using XGB are in notebook 3.0.

## Resources:

I referenced these two posts substantially:

* [Hourly Time Series Forecasting using XGBoost](https://www.kaggle.com/robikscube/tutorial-time-series-forecasting-with-xgboost)

* [How to Use XGBoost for Time Series Forecasting](https://machinelearningmastery.com/xgboost-for-time-series-forecasting/)

## MEDIUM POST

Further background and related links [here](https://medium.com/@chinhonchua/forecasting-the-2020-us-presidential-election-with-fb-prophet-36ab84f1a75a)

In [1]:
import numpy as np
import pandas as pd
import warnings

from xgboost import XGBRegressor 

warnings.filterwarnings('ignore')

In [2]:
# I've included this dataset in the repo
# It combines the latest forecasts from 538 and The Economist, prepared the same way as in notebook 1.0
# Load only the 2 columns that we need

biden = pd.read_csv("../data/biden_ev08102020.csv")[["Forecast_Date", "Average_Projected_EV"]]


In [3]:
biden.shape

(130, 2)

In [4]:
biden.head()

|   | Forecast_Date | Average_Projected_EV |
|---|---|---|
| 0 | 2020-10-08 | 346.50355 |
| 1 | 2020-10-07 | 344.31025 |
| 2 | 2020-10-06 | 339.83245 |
| 3 | 2020-10-05 | 338.372 |
| 4 | 2020-10-04 | 336.872 |


# 1. PREPARE DATA

## 1.1 CREATE NEW FEATURES FROM DATES

Nominally, we only have the forecast dates to go on as "predictors". From dates we can extract a range of new features, right down to hours and minutes. In this case, though, day of the week, week of the year and similar calendar features make more sense, as the forecasts change on a daily basis.

In [5]:
biden["Forecast_Date"] = pd.to_datetime(biden["Forecast_Date"], errors='coerce')

biden["dayofweek"] = biden["Forecast_Date"].dt.dayofweek
biden["quarter"] = biden["Forecast_Date"].dt.quarter
biden["month"] = biden["Forecast_Date"].dt.month
biden["dayofmonth"] = biden["Forecast_Date"].dt.day
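# Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the same ISO week number
biden["weekofyear"] = biden["Forecast_Date"].dt.isocalendar().week.astype(int)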
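To make the meaning of each calendar feature concrete, here is a small standard-library sketch for a single date. `calendar_features` is a hypothetical helper, not part of the notebook; it mirrors the pandas conventions used above (Monday is day 0, and the week number is the ISO 8601 week).

```python
from datetime import date


def calendar_features(d):
    """Derive the same five calendar features from a single date."""
    return {
        "dayofweek": d.weekday(),          # Monday=0 ... Sunday=6
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "dayofmonth": d.day,
        "weekofyear": d.isocalendar()[1],  # ISO 8601 week number
    }


election_day = calendar_features(date(2020, 11, 3))
print(election_day)
```

For Nov 3, 2020 this yields a Tuesday (`dayofweek` 1) in quarter 4 and ISO week 45, which matches the row for that date in the forecast frame built in section 3.2.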

## 1.2 MANUAL TRAIN-TEST SPLIT

Pick a different date if you want to use a smaller test set.

In [6]:
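# .copy() avoids a SettingWithCopyWarning when a forecast column is added to `test` later
train = biden[biden["Forecast_Date"] < "2020-09-14"].copy()

test = biden[biden["Forecast_Date"] >= "2020-09-14"].copy()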

In [7]:
X_train = train[["dayofweek", "dayofmonth", "month", "quarter", "weekofyear"]]

X_test = test[["dayofweek", "dayofmonth", "month", "quarter", "weekofyear"]]

In [8]:
y_train = train["Average_Projected_EV"].values

y_test = test["Average_Projected_EV"].values

In [9]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((105, 5), (105,), (25, 5), (25,))

# 2. LOAD MODEL & CHECK TEST RESULTS

The parameters below were arrived at after a few trials with Jason Brownlee's walk-forward validation method, as discussed above.

In [10]:
xgb = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.01,
    gamma=0.01,
    objective="reg:squarederror",
    max_depth=3,
    n_jobs=-1,
    random_state=42,
)

xgb.fit(
    X_train,
    y_train,
    verbose=False,
)


XGBRegressor(gamma=0.01, learning_rate=0.01, n_estimators=1000, n_jobs=-1,
             objective='reg:squarederror', random_state=42)

In [11]:
test["XGB_Forecasts"] = xgb.predict(X_test)

biden_pred = pd.concat([test, train], sort=False).drop(
    ["dayofweek", "dayofmonth", "month", "quarter", "weekofyear"], axis=1
)


In [12]:
biden_pred.head(20)

|   | Forecast_Date | Average_Projected_EV | XGB_Forecasts |
|---|---|---|---|
| 0 | 2020-10-08 | 346.50355 | 328.566803 |
| 1 | 2020-10-07 | 344.31025 | 328.650543 |
| 2 | 2020-10-06 | 339.83245 | 327.594727 |
| 3 | 2020-10-05 | 338.372 | 326.714722 |
| 4 | 2020-10-04 | 336.872 | 327.683228 |
| 5 | 2020-10-03 | 333.8888 | 327.347504 |
| 6 | 2020-10-02 | 333.99465 | 327.767242 |
| 7 | 2020-10-01 | 337.20045 | 326.070953 |
| 8 | 2020-09-30 | 334.3569 | 326.835144 |
| 9 | 2020-09-29 | 332.5336 | 327.034576 |


## NOTE:

XGB's forecasts in mid-to-late September were pretty close to the aggregated forecasts from 538 and The Economist. But from Sept 29 they start to drift: in the rows shown above, the aggregate climbs from about 332.5 to 346.5 EVs while XGB's forecasts stay between roughly 326 and 329. Let's push ahead to see how well it does in the forecasts for the month ahead, up to Nov 4.
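To put a rough number on that drift, the sketch below computes the mean absolute error over the ten rows displayed above (Sept 29 to Oct 8). The values are copied from that output; this covers only part of the 25-row test set, so treat it as an illustration, not a full evaluation.

```python
from statistics import mean

# Copied from the displayed rows, newest first (Oct 8 back to Sept 29).
actual = [346.50355, 344.31025, 339.83245, 338.372, 336.872,
          333.8888, 333.99465, 337.20045, 334.3569, 332.5336]
predicted = [328.566803, 328.650543, 327.594727, 326.714722, 327.683228,
             327.347504, 327.767242, 326.070953, 326.835144, 327.034576]

# Signed errors: positive means XGB forecast fewer EVs than the aggregate.
errors = [a - p for a, p in zip(actual, predicted)]
mae = mean([abs(e) for e in errors])

print(f"MAE over the displayed rows: {mae:.2f} EVs")
print("XGB below the aggregate on every row:", all(e > 0 for e in errors))
```

On these rows the error works out to a little over 10 EVs, and every forecast sits below the aggregate, so the gap is a consistent undershoot and not random noise.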

# 3. FORECASTS TO NOV 4

## 3.1 EXPOSE MODEL TO FULL DATA

First, let's expose the model to the full set of available data by refitting it on all 130 rows.

In [13]:
X = biden[["dayofweek", "dayofmonth", "month", "quarter", "weekofyear"]]

y = biden["Average_Projected_EV"].values

In [14]:
xgb.fit(X, y)

XGBRegressor(gamma=0.01, learning_rate=0.01, n_estimators=1000, n_jobs=-1,
             objective='reg:squarederror', random_state=42)

## 3.2 GENERATE NEW DF WITH DATES UP TO NOV 4

In [15]:
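# Shift every observed date forward by 27 days so the latest one (Oct 8) lands on Nov 4
forecast = (biden["Forecast_Date"] + pd.Timedelta(27, unit="days")).to_frame()

# Placeholder column: there are no aggregated projections for these dates yet
forecast["Average_Projected_EV"] = None

# Keep Oct 1 onwards; note that Oct 1-8 overlap with dates already in the data
forecast = forecast[forecast["Forecast_Date"] >= "2020-10-01"].copy()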
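An alternative way to build the future dates, sketched here under the assumption that the last observed forecast date is Oct 8, 2020 (as in the `head()` output earlier), is to generate them directly with `pd.date_range`. The names `last_observed`, `horizon_end` and `future` are hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd

last_observed = pd.Timestamp("2020-10-08")  # assumption: latest date in the data
horizon_end = pd.Timestamp("2020-11-04")

# One row per day, starting the day after the last observation.
future_dates = pd.date_range(
    last_observed + pd.Timedelta(days=1), horizon_end, freq="D"
)

# Reverse so the newest date comes first, matching the dataset's ordering.
future = pd.DataFrame({"Forecast_Date": future_dates[::-1]})

print(future.head())
print(len(future))
```

This version contains only dates with no observed value yet (Oct 9 to Nov 4), whereas the shift-and-filter approach above also keeps Oct 1 to Oct 8.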

In [16]:
#forecast["Forecast_Date"] = pd.to_datetime(forecast["Forecast_Date"], errors='coerce')

forecast["dayofweek"] = forecast["Forecast_Date"].dt.dayofweek
forecast["quarter"] = forecast["Forecast_Date"].dt.quarter
forecast["month"] = forecast["Forecast_Date"].dt.month
forecast["dayofmonth"] = forecast["Forecast_Date"].dt.day
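# Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the same ISO week number
forecast["weekofyear"] = forecast["Forecast_Date"].dt.isocalendar().week.astype(int)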

In [17]:
forecast.head()

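|   | Forecast_Date | Average_Projected_EV | dayofweek | quarter | month | dayofmonth | weekofyear |
|---|---|---|---|---|---|---|---|
| 0 | 2020-11-04 | None | 2 | 4 | 11 | 4 | 45 |
| 1 | 2020-11-03 | None | 1 | 4 | 11 | 3 | 45 |
| 2 | 2020-11-02 | None | 0 | 4 | 11 | 2 | 45 |
| 3 | 2020-11-01 | None | 6 | 4 | 11 | 1 | 44 |
| 4 | 2020-10-31 | None | 5 | 4 | 10 | 31 | 44 |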


## 3.3 GENERATE FORECASTS 

In [18]:
X_forecast = forecast[["dayofweek", "dayofmonth", "month", "quarter", "weekofyear"]]

In [19]:
forecast["XGB_Biden_Forecast"] = xgb.predict(X_forecast)

In [20]:
forecast_pred = forecast[["Forecast_Date", "XGB_Biden_Forecast"]].copy()

In [21]:
forecast_pred.head(40)

|   | Forecast_Date | XGB_Biden_Forecast |
|---|---|---|
| 0 | 2020-11-04 | 341.965027 |
| 1 | 2020-11-03 | 339.595764 |
| 2 | 2020-11-02 | 338.287415 |
| 3 | 2020-11-01 | 340.249023 |
| 4 | 2020-10-31 | 343.027008 |
| 5 | 2020-10-30 | 343.306244 |
| 6 | 2020-10-29 | 345.164124 |
| 7 | 2020-10-28 | 344.214081 |
| 8 | 2020-10-27 | 342.172699 |
| 9 | 2020-10-26 | 341.547058 |


## NOTE:

XGB's forecasts are generally more stable for Biden, but the final results for Nov 3/4 are again very close to those from Prophet.

See notebook 3.0 for Trump's forecasts using XGB.

In [22]:
# forecast_pred.to_csv("../data/XGB_Biden_08102020.csv", index=False)