## Overfitting Exercise
In this exercise, we'll build a model that, as you'll see, dramatically overfits the training data. This will allow you to see what overfitting can "look like" in practice.

In [1]:
import os
import pandas as pd 
import numpy as np 
import math
import matplotlib.pyplot as plt

For this exercise, we'll use gradient boosted trees. In order to implement this model, we'll use the XGBoost package.

In [2]:
! pip install xgboost


In [3]:
import xgboost as xgb

ModuleNotFoundError: No module named 'xgboost'

Here, we define a few helper functions.

In [None]:
# number of rows in a DataFrame
def nrow(df):
    return len(df.index)

# number of columns in a DataFrame
def ncol(df):
    return len(df.columns)

# flatten one level of nesting in a list of lists/arrays
def flatten(nested):
    return [item for sublist in nested for item in sublist]

# combine multiple arrays into a single list
def c(*args):
    return flatten(args)
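
As a quick check of how these helpers behave, the sketch below restates them and applies them to a toy DataFrame and two small sequences. The names `toy_df`, `shape`, and `combined`, and all the values, are made up for illustration and are not part of the exercise.

```python
import numpy as np
import pandas as pd

# helpers restated so this snippet runs on its own
def nrow(df):
    return len(df.index)

def ncol(df):
    return len(df.columns)

def flatten(nested):
    return [item for sublist in nested for item in sublist]

def c(*args):
    return flatten(args)

# toy DataFrame with 3 rows and 2 columns (made-up values)
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
shape = (nrow(toy_df), ncol(toy_df))

# `c` concatenates any mix of arrays and lists into one flat Python list
combined = c(np.array([0.1, 0.2]), [0.3])
print(shape, combined)
```

Later in the notebook, `c` is used the same way to join the train and test predictions into a single sequence.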

In this exercise, we're going to try to predict the returns of the S&P 500 ETF. This may be a futile endeavor, since many experts consider the S&P 500 to be essentially unpredictable, but it will serve well for the purpose of this exercise. The following cell loads the data.

In [None]:
df = pd.read_csv("SPYZ.csv")

As you can see, the data file has four columns: `Date`, `Close`, `Volume` and `Return`.

In [None]:
df.head()

In [None]:
n = nrow(df)

Next, we'll form our predictors/features. In the cells below, we create four types of features. We also use a parameter, `K`, to set the number of each type of feature to build. With `K` set to 25, 4 × 25 = 100 features will be created. This should already seem like a lot of features, and alert you to the potential that the model will be overfit.

In [None]:
predictors = []

# we'll create a new DataFrame to hold the data that we'll use to train the model
# we'll create it from the `Return` column in the original DataFrame, but rename that column `y`
model_df = pd.DataFrame(data = df['Return']).rename(columns = {"Return" : "y"})

# IMPORTANT: this sets how many of each of the following four predictors to create
K = 25

Now, you write the code to create the four types of predictors.

In [None]:
for L in range(1,K+1): 
    # this predictor is just the return L days ago, where L goes from 1 to K
    # these predictors will be named `R1`, `R2`, etc.
    pR = "".join(["R",str(L)]) 
    predictors.append(pR)
    for i in range(K+1,n): 
        # TODO: fill in the code to assign the return from L days before to the ith row of this predictor in `model_df`
        model_df.loc[i, pR] = None

    # this predictor is the return L days ago, squared, where L goes from 1 to K
    # these predictors will be named `Rsq1`, `Rsq2`, etc.
    pR2 = "".join(["Rsq",str(L)])
    predictors.append(pR2)
    for i in range(K+1,n): 
        # TODO: fill in the code to assign the squared return from L days before to the ith row of this predictor 
        # in `model_df`
        model_df.loc[i, pR2] = None

    # this predictor is the log volume L days ago, where L goes from 1 to K
    # these predictors will be named `V1`, `V2`, etc.
    pV = "".join(["V",str(L)])
    predictors.append(pV)
    for i in range(K+1,n): 
        # TODO: fill in the code to assign the log of the volume from L days before to the ith row of this predictor 
        # in `model_df`
        # Add 1 to the volume before taking the log
        model_df.loc[i, pV] = None

    # this predictor is the product of the return and the log volume from L days ago, where L goes from 1 to K
    # these predictors will be named `RV1`, `RV2`, etc.
    pRV = "".join(["RV",str(L)])
    predictors.append(pRV)
    for i in range(K+1,n): 
        # TODO: fill in the code to assign the product of the return and the log volume from L days before to the
        # ith row of this predictor in `model_df`
        model_df.loc[i, pRV] = None
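
To make the four feature types concrete, here is a minimal sketch on toy data for a single lag. It uses `Series.shift`, a vectorized alternative to the row-by-row loop the exercise asks you to fill in, so it is not necessarily how the solution notebook does it. The `toy` and `lagged` DataFrames and all their values are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# toy data standing in for the exercise's `df` (values are made up)
toy = pd.DataFrame({
    "Return": [0.01, -0.02, 0.03, 0.00, -0.01],
    "Volume": [100, 200, 150, 300, 250],
})

L = 2  # lag in days

lagged = pd.DataFrame({"y": toy["Return"]})
# return L days ago: shift(L) moves row i-L down to row i
lagged[f"R{L}"] = toy["Return"].shift(L)
# squared return L days ago
lagged[f"Rsq{L}"] = toy["Return"].shift(L) ** 2
# log of (volume + 1) L days ago
lagged[f"V{L}"] = np.log(toy["Volume"].shift(L) + 1)
# product of the lagged return and the lagged log volume
lagged[f"RV{L}"] = lagged[f"R{L}"] * lagged[f"V{L}"]

print(lagged)
```

The first `L` rows of every lagged column are NaN, because there is no observation `L` days earlier to look up.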

Let's take a look at the predictors we've created.

In [None]:
model_df.iloc[100:105,:]

Next, we create a DataFrame that holds the recent volatility of the ETF's returns, as measured by the standard deviation of a sliding window of the past 20 days' returns.

In [None]:
vol_df = pd.DataFrame(data = df[['Return']])

for i in range(K+1,n): 
    # TODO: create the code to assign the standard deviation of the return from the time period starting 
    # 20 days before day i, up to the day before day i, to the ith row of `vol_df`
    vol_df.loc[i, 'vol'] = None
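
The sliding-window volatility described above can also be sketched with pandas' rolling functions. This is one possible approach on synthetic data, not necessarily the one used in the solution notebook; `toy_returns`, `rolling_vol`, and `manual` are hypothetical names, and the returns are random numbers.

```python
import numpy as np
import pandas as pd

# synthetic daily returns (made up, seeded for reproducibility)
rng = np.random.default_rng(0)
toy_returns = pd.Series(rng.normal(0.0, 0.01, size=60))

window = 20

# rolling(window).std() at row i covers rows i-window+1 .. i;
# shifting by one row makes the value at row i cover rows i-window .. i-1,
# so day i's own return is never used in its own volatility estimate
rolling_vol = toy_returns.rolling(window).std().shift(1)

# the same quantity computed directly for a single day
i = 30
manual = toy_returns.iloc[i - window:i].std()

print(rolling_vol.iloc[i], manual)
```

Note that pandas' `std` uses the sample standard deviation (`ddof=1`), whereas `np.std` defaults to `ddof=0`, so the two give slightly different numbers on the same window.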

Let's take a quick look at the result.

In [None]:
vol_df.iloc[100:105,:]

Now that we have our data, we can start thinking about training a model.

In [None]:
# for training, we'll drop the first K rows, for which the predictors' values are NaNs
# (row K itself is still NaN because the loops above start at K+1; XGBoost treats NaN as missing)
model = model_df.iloc[K:n,:]

In the cell below, first split the data into train and test sets, and then split off the targets from the predictors.

In [None]:
# Split data into train and test sets
train_size = 2.0/3.0
breakpoint = round(nrow(model) * train_size)

# TODO: fill in the code to split off the chunk of data up to the breakpoint as the training set, and
# assign the rest as the test set.
training_data = None
test_data = None

# TODO: Split training data and test data into targets (Y) and predictors (X), for the training set and the test set
X_train = None
Y_train = None
X_test = None
Y_test = None
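
For time-series data the split should be chronological: the earliest rows form the training set and the latest rows form the test set, with no shuffling. The sketch below shows the pattern on a toy frame; `toy_model`, `toy_predictors`, and the other `toy_` names are hypothetical and the values are made up.

```python
import numpy as np
import pandas as pd

# toy frame with a target `y` and two predictor columns (made-up values)
toy_model = pd.DataFrame({
    "y": np.arange(9, dtype=float),
    "R1": np.arange(9, dtype=float) * 10,
    "V1": np.arange(9, dtype=float) * 100,
})
toy_predictors = ["R1", "V1"]

# first two thirds of the rows for training, the rest for testing
split = round(len(toy_model) * 2.0 / 3.0)

toy_train = toy_model.iloc[:split]
toy_test = toy_model.iloc[split:]

# separate predictors (X) from the target (y)
X_toy_train, y_toy_train = toy_train[toy_predictors], toy_train["y"]
X_toy_test, y_toy_test = toy_test[toy_predictors], toy_test["y"]

print(len(toy_train), len(toy_test))
```

Because every training row precedes every test row, the test set behaves like genuinely unseen future data.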

Great, now that we have our data, let's train the model.

In [None]:
# DMatrix is an internal data structure used by XGBoost that is optimized for both memory efficiency 
# and training speed. 
dtrain = xgb.DMatrix(X_train, Y_train)

# Train the XGBoost model
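# `verbosity: 0` suppresses XGBoost's log output (the older `silent` flag is deprecated in favor of `verbosity`)
param = { 'max_depth':20, 'verbosity':0 }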
num_round = 20
xgModel = xgb.train(param, dtrain, num_round)

Now let's predict the returns for the S&P 500 ETF in both the train and test periods. If the model is successful, how should the train and test errors compare? What would be a key sign that the model has overfit the training data?

TODO: Before you run the next cell, write down what you expect to see if the model is overfit.

In [None]:
# Make the predictions on both the training data and the test data
preds_train = xgModel.predict(xgb.DMatrix(X_train))
preds_test = xgModel.predict(xgb.DMatrix(X_test))

Let's quickly look at the mean squared error of the predictions on the training and testing sets.

In [None]:
# TODO: Calculate the mean squared error on the training set
msetrain = None

In [None]:
# TODO: Calculate the mean squared error on the test set
msetest = None

Looks like the mean squared error on the test set is an order of magnitude greater than on the training set. Not a good sign. Now let's do some quick calculations to gauge how this would translate into performance. 
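
For reference, mean squared error is the average of the squared differences between realized and predicted values. The sketch below defines a small helper and applies it to made-up numbers chosen to mimic an overfit model; the `mse` helper and all `toy_` names and values are assumptions for illustration, not output from this notebook's model.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# made-up values: predictions nearly match the training targets...
toy_y_train = [0.01, -0.02, 0.03]
toy_pred_train = [0.01, -0.02, 0.02]

# ...but are far from the test targets
toy_y_test = [0.02, -0.01, 0.01]
toy_pred_test = [-0.01, 0.02, -0.02]

toy_mse_train = mse(toy_y_train, toy_pred_train)
toy_mse_test = mse(toy_y_test, toy_pred_test)

print(toy_mse_train, toy_mse_test, toy_mse_test / toy_mse_train)
```

A test error many times larger than the training error, as in these toy numbers, is the pattern to look for when checking for overfitting.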

In [None]:
# combine prediction arrays into a single list
predictions = c(preds_train, preds_test)
responses = c(Y_train, Y_test)

# as a holding size, we'll take predicted return divided by return variance
# this is mean-variance optimization with a single asset
vols = vol_df.loc[K:n,'vol']
position_size = predictions / vols ** 2

# TODO: Calculate pnl. Pnl in each time period is holding * realized return.
performance = None

# plot simulated performance
plt.plot(np.cumsum(performance))
plt.ylabel('Simulated Performance')
plt.axvline(x=breakpoint, c = 'r')
plt.show()
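
The sizing rule in the cell above follows from single-asset mean-variance optimization: maximizing w·μ − ½·w²·σ² over the holding w gives w = μ/σ², where μ is the predicted return and σ the volatility (a risk-aversion coefficient of 1 is an assumption here). The sketch below applies that rule to made-up numbers and accumulates the resulting PnL; all `toy_` names and values are hypothetical.

```python
import numpy as np

# made-up predicted returns, realized returns, and volatilities
toy_predictions = np.array([0.002, -0.001, 0.003, 0.001])
toy_realized = np.array([0.010, 0.005, -0.020, 0.004])
toy_vols = np.array([0.010, 0.010, 0.020, 0.020])

# holding = predicted return divided by return variance
toy_position = toy_predictions / toy_vols ** 2

# PnL in each period = holding * realized return
toy_pnl = toy_position * toy_realized

# cumulative PnL is what the performance plot displays
toy_cum_pnl = np.cumsum(toy_pnl)

print(toy_position, toy_pnl, toy_cum_pnl)
```

Dividing by the variance means the same predicted return produces a smaller holding when recent volatility is higher.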

Our simulated returns accumulate throughout the training period, but they are absolutely flat in the testing period. The model has no predictive power whatsoever in the out-of-sample period.

Can you think of a few reasons our simulation of performance is unrealistic?

In [None]:
# TODO: Answer the above question.

If you need a little assistance, check out the [solution](overfitting_exercise_solution.ipynb).