### XGBoost for Timeseries

#### Boosting

Ensemble models are a standard tool for predictive modeling, and boosting is one technique for building them.

Boosting fits a sequence of models, where each successive model is trained to reduce the error left by the models before it.

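There are several variants of this idea. One is gradient boosting, where each new model is fit to the negative gradient of the loss function with respect to the current ensemble's predictions.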
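
To make the idea concrete, here is a minimal sketch of gradient boosting for squared error, where the negative gradient is simply the residual. It uses scikit-learn's `DecisionTreeRegressor` as the base learner. The function names `fit_boosted_trees` and `predict_boosted_trees` and the synthetic data are assumptions for illustration only. This is not how XGBoost is implemented internally, since XGBoost also uses second-order gradient information and regularization.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Hypothetical helper: fit shallow trees one after another on the residuals."""
    base = float(np.mean(y))          # start from a constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred          # negative gradient of squared error (up to a constant)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)  # small corrective step
        trees.append(tree)
    return base, trees


def predict_boosted_trees(base, trees, X, learning_rate=0.1):
    """Sum the constant start value and the scaled contribution of every tree."""
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred


# Synthetic data, made up for this sketch.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

base, trees = fit_boosted_trees(X_demo, y_demo)
mse_base = np.mean((y_demo - base) ** 2)
mse_boost = np.mean((y_demo - predict_boosted_trees(base, trees, X_demo)) ** 2)
print(mse_base, mse_boost)
```

The training error of the boosted ensemble should come out lower than that of the constant starting prediction, because every round removes part of the remaining residual.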

#### XGBoost

https://xgboost.readthedocs.io/

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.

XGBoost is an ensemble of decision trees where each new tree corrects errors made by the trees that are already part of the model. Trees are added until a fixed number is reached (`n_estimators`) or, if early stopping is enabled, until the validation score stops improving.

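Requirements to use XGBoost for time series:
- Evaluate the model with walk-forward validation instead of k-fold cross-validation. k-fold ignores temporal order, so the model would be trained on observations that come after the ones it is tested on, which gives optimistically biased results.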
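
The difference from k-fold can be illustrated with scikit-learn's `TimeSeriesSplit`, which produces expanding-window splits where every training index precedes every test index. This is only a sketch on a made-up 12-point series. The walk-forward procedure used later in this notebook steps one observation at a time rather than in blocks.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# A stand-in series of 12 time-ordered observations (made up for this sketch).
series = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(series))

for train_idx, test_idx in folds:
    # The training window grows, and the test window always lies after it.
    print("train:", train_idx, "test:", test_idx)
```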



In [None]:
#!pipenv install scikit-learn xgboost --skip-lock

In [None]:
from IPython.core.debugger import set_trace

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import time


In [None]:
df = pd.read_csv("data/MSFT-1Y-Hourly.csv")

In [None]:
df.head(5)

In [None]:
df = df[["close"]].copy()

In [None]:
df.head(5)

#### Transform this to a supervised learning problem.

In [None]:
df["target"] = df.close.shift(-1)

In [None]:
df.dropna(inplace=True)

In [None]:
df.head(5)
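
The transformation above uses only the current close as the single feature. As a hedged sketch of how the same idea extends to several lagged values, the hypothetical helper `make_supervised` below builds lag columns with `shift` and keeps the target as the last column, which matches the `[:, :-1]` / `[:, -1]` split used later in this notebook. The toy prices are made up.

```python
import pandas as pd


def make_supervised(series, n_lags=3):
    """Hypothetical helper: build lag features plus a next-step target."""
    frame = pd.DataFrame({"close": series})
    for lag in range(1, n_lags + 1):
        # Value observed `lag` steps before the current row.
        frame[f"close_lag_{lag}"] = frame["close"].shift(lag)
    # Next step's close is what we want to predict; keep it as the last column.
    frame["target"] = frame["close"].shift(-1)
    # Rows at the start lack lags and the final row lacks a target.
    return frame.dropna()


prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
demo = make_supervised(prices, n_lags=2)
print(demo)
```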

#### Train test split

In [None]:
def train_test_split(data, perc):
    data = data.values
    n = int(len(data) * (1 - perc))
    return data[:n], data[n:]

In [None]:
train, test = train_test_split(df, 0.2)

In [None]:
print(len(df))
print(len(train))
print(len(test))

We'll use the XGBRegressor class to make a prediction. XGBRegressor is an implementation of the scikit-learn API for XGBoost regression.

We'll take the train set and test input row as input, fit a model, and make a prediction.

In [None]:
X = train[:, :-1]
y = train[:, -1]

In [None]:
y

In [None]:
from xgboost import XGBRegressor

model = XGBRegressor(objective="reg:squarederror", n_estimators=1000)
model.fit(X, y)

In [None]:
test[0]

In [None]:
np.array(test[0,0]).reshape(1, -1)

In [None]:
val = np.array(test[0, 0]).reshape(1, -1)

pred = model.predict(val)
print(pred[0])

#### Predict
Train on the train set and predict one sample at a time.

In [None]:
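def xgb_predict(train, test):
    """Fit an XGBRegressor on `train` and predict a single test input."""
    train = np.array(train)
    # All columns except the last are features; the last column is the target.
    X, y = train[:, :-1], train[:, -1]
    model = XGBRegressor(objective="reg:squarederror", n_estimators=1000)
    model.fit(X, y)

    # predict() expects a 2D array of shape (n_samples, n_features).
    val = np.array(test).reshape(1, -1)
    pred = model.predict(val)
    # The training arrays are returned alongside the prediction for inspection.
    return train, X, y, pred[0]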

In [None]:
t, x, y, pred = xgb_predict(train, test[0, 0])

print(y)

#### Walk-forward validation

We are making a one-step-ahead prediction, which for this dataset means one hour ahead. We start by predicting the first record in the test dataset.

Afterwards we add the real observation from the test set to the train set, refit the model, then predict the next step in the test dataset.

We'll evaluate the model with the RMSE metric.
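
For reference, RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same unit as the target (here, the closing price). A minimal sketch with NumPy follows. The helper name `rmse` and the sample values are made up for illustration.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equally sized sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Errors are -1, 2 and 0, so the mean squared error is 5/3.
example = rmse([100.0, 102.0, 101.0], [101.0, 100.0, 101.0])
print(example)
```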

In [None]:
from sklearn.metrics import mean_squared_error


def validate(data, perc):
    predictions = []

    train, test = train_test_split(data, perc)

    history = [x for x in train]

    for i in range(len(test)):
        test_X, test_y = test[i, :-1], test[i, -1]

        # xgb_predict returns (train, X, y, pred); keep only the prediction.
        pred = xgb_predict(history, test_X)[-1]
        predictions.append(pred)

        history.append(test[i])

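    # Take the square root explicitly: the `squared` argument was deprecated
    # and later removed in recent scikit-learn releases.
    error = np.sqrt(mean_squared_error(test[:, -1], predictions))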

    return error, test[:, -1], predictions

In [None]:
%%time
rmse, y, pred = validate(df, 0.2)

print(rmse)