# H1N1 Cases to Predict S&P 500

### Introduction

__With COVID-19 gripping everyone's lives and finances, we have all watched the financial markets dive and unemployment skyrocket. The question everyone is asking is when we will bounce back, and whether we are on that track yet.__

__More questions come from the volatility of the market, and the one on everyone's mind in the financial world is: have we hit rock bottom? To analyze this question, I thought it best to look back at past pandemics that gripped the world. Although the world has never seen such a deadly virus in modern history, I want to know whether something from the past can help us with our future.__

__I investigate Swine Flu, a respiratory disease of pigs caused by a type A influenza virus. When this virus is detected in a person, it is called a Variant Influenza Virus, as was seen in the 2009 Influenza Pandemic. Most commonly, human infections with variant viruses occur in people exposed to infected pigs. The virus can also be transmitted by touching an infected surface or person and then touching your face. The third route of transmission is through the air, in droplets or dust. The United States of America has faced the virus since 2009.__

In this analysis I used __XGBoost__ as my machine learning model and measured its error with mean squared error (MSE) and mean absolute error (MAE). __XGBoost__ is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the __Gradient Boosting__ framework and provides parallel tree boosting to solve data science problems quickly and accurately.
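As a quick illustration of the two error metrics, here is a minimal sketch that computes them by hand. The function names `mse` and `mae` and the toy numbers are my own assumptions for illustration; the notebook itself uses `mean_squared_error` and `mean_absolute_error` from scikit-learn.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: the average of the absolute differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy values (not taken from the dataset)
actual = [900.0, 910.0, 920.0]
predicted = [905.0, 900.0, 920.0]

print('MSE =', mse(actual, predicted))  # (25 + 100 + 0) / 3
print('MAE =', mae(actual, predicted))  # (5 + 10 + 0) / 3
```

Because MSE squares each difference, it penalizes large misses more heavily than MAE, which is why I report both.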

__Gradient boosting__ is a method that goes through cycles to iteratively add models into an ensemble. It begins by initializing the ensemble with a single model, whose predictions can be pretty naive. However, subsequent additions to the ensemble will address those errors.

__Steps of Gradient Boosting:__

- First, use the current ensemble to generate predictions for each observation in the dataset. To make a prediction, we add up the predictions from all models in the ensemble.
- The predictions are then used to calculate a loss function, which gives the error values.
- Next, we use the loss function to fit a new model that will be added to the ensemble. Specifically, the new model's parameters are chosen so that adding it to the ensemble reduces the loss.
- Finally, we add the new model to the ensemble and repeat until the model is accurate enough.
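The steps above can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions (squared-error loss, one-split "stumps" as the weak models, hypothetical function names `fit_stump` and `gradient_boost`); it is not how XGBoost is implemented internally, which adds regularization and many optimizations.

```python
import numpy as np

def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals (squared loss)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda values: np.where(values <= threshold, left_value, right_value)

def gradient_boost(x, y, n_estimators=50, learning_rate=0.1):
    """Build an ensemble by repeatedly fitting stumps to the current errors."""
    base = y.mean()                          # naive initial model
    ensemble = []
    predictions = np.full(len(y), base)
    for _ in range(n_estimators):
        residuals = y - predictions          # errors of the current ensemble
        stump = fit_stump(x, residuals)      # new model fitted to those errors
        ensemble.append(stump)
        predictions = predictions + learning_rate * stump(x)

    def predict(values):
        return base + learning_rate * sum(s(values) for s in ensemble)
    return predict

# Toy data (not from the dataset)
x = np.arange(20, dtype=float)
y = np.sin(x / 3.0)

naive_mse = np.mean((y - y.mean()) ** 2)
model = gradient_boost(x, y)
boosted_mse = np.mean((y - model(x)) ** 2)
print('Naive MSE:', naive_mse, 'Boosted MSE:', boosted_mse)
```

Here `learning_rate` shrinks each new stump's contribution, which mirrors the role of the `learning_rate` parameter in XGBoost.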
    
__XGBoost Parameters:__

- n_estimators -> Specifies how many times to go through the modeling cycle
- max_depth -> Maximum depth of a tree; a large value can cause overfitting
- learning_rate -> Shrinkage factor that avoids overfitting of the model
- subsample -> Subsample ratio of the training instances; the sampling occurs once in every boosting iteration
- colsample_bytree -> Subsample ratio of columns when constructing each tree
- colsample_bylevel -> Subsample ratio of columns for each level
- gamma -> Minimum loss reduction required to make a further partition on a leaf node of the tree

#### Import Necessary Libraries

In [1]:
%%capture
from tqdm import tqdm_notebook as tqdm
tqdm().pandas()

%matplotlib notebook
import math
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import datetime as dt
from pylab import rcParams
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm_notebook
from xgboost import XGBRegressor
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib import style
import matplotlib.dates as mdates
from sklearn.model_selection import train_test_split
import Common_Functions as cf
import time
import mplcursors

from matplotlib.backends.backend_agg import FigureCanvasAgg 
plt.get_backend()

test_size = 0.2
valid_size = 0.2
N = 21

n_estimators = 100
max_depth = 3
learning_rate = 0.1
min_child_weight = 1
subsample = 1
colsample_bytree = 1
colsample_bylevel = 1
gamma = 0
model_seed = 100

style.use('seaborn-darkgrid')

### Load and Format Dataset

The dataset has already been merged on the Swine Flu case dates: cases are summed per date and joined with the S&P 500 (GSPC) OHLC data for the same date. I have formatted the dataset for readability and analysis, and converted the date column to a datetime data type.

In [2]:
SWINE_GSPC = pd.read_csv('SWINE_GSPC.csv', parse_dates = True, index_col = 0)
SWINE_GSPC.head()

Unnamed: 0,season,year,week,age_0_4,age_5_24,age_25_49,age_25_64,age_50_64,age_65,ilitotal,...,num_of_providers,unweighted_ili,weighted_ili,Date,Open,High,Low,Close,Adj Close,Volume
0,2008-09,2009,1,3245,2548,,2606.0,,557,8956,...,1539,0.015457,0.014714,2009-01-08,905.72998,910.0,896.809998,909.72998,909.72998,4991550000
1,2008-09,2009,2,3142,3345,,2329.0,,473,9289,...,1512,0.016422,0.016155,2009-01-15,841.98999,851.590027,817.039978,843.73999,843.73999,7807350000
2,2008-09,2009,3,3602,4597,,2821.0,,436,11456,...,1566,0.019652,0.019533,2009-01-22,839.73999,839.73999,811.289978,827.5,827.5,5843830000
3,2008-09,2009,4,4039,6602,,3215.0,,468,14324,...,1554,0.023437,0.024291,2009-01-29,868.890015,868.890015,844.150024,845.140015,845.140015,5067060000
4,2008-09,2009,5,4606,8882,,4240.0,,540,18268,...,1580,0.028606,0.029669,2009-02-05,831.75,850.549988,819.909973,845.849976,845.849976,6624030000


In [3]:
SWINE_GSPC.columns = [str(col).upper().replace(' ', '_') for col in SWINE_GSPC.columns]
SWINE_GSPC.head()

Unnamed: 0,SEASON,YEAR,WEEK,AGE_0_4,AGE_5_24,AGE_25_49,AGE_25_64,AGE_50_64,AGE_65,ILITOTAL,...,NUM_OF_PROVIDERS,UNWEIGHTED_ILI,WEIGHTED_ILI,DATE,OPEN,HIGH,LOW,CLOSE,ADJ_CLOSE,VOLUME
0,2008-09,2009,1,3245,2548,,2606.0,,557,8956,...,1539,0.015457,0.014714,2009-01-08,905.72998,910.0,896.809998,909.72998,909.72998,4991550000
1,2008-09,2009,2,3142,3345,,2329.0,,473,9289,...,1512,0.016422,0.016155,2009-01-15,841.98999,851.590027,817.039978,843.73999,843.73999,7807350000
2,2008-09,2009,3,3602,4597,,2821.0,,436,11456,...,1566,0.019652,0.019533,2009-01-22,839.73999,839.73999,811.289978,827.5,827.5,5843830000
3,2008-09,2009,4,4039,6602,,3215.0,,468,14324,...,1554,0.023437,0.024291,2009-01-29,868.890015,868.890015,844.150024,845.140015,845.140015,5067060000
4,2008-09,2009,5,4606,8882,,4240.0,,540,18268,...,1580,0.028606,0.029669,2009-02-05,831.75,850.549988,819.909973,845.849976,845.849976,6624030000


In [4]:
SWINE_GSPC = SWINE_GSPC.rename(columns = {'ILITOTAL':'CASES'})
SWINE_GSPC = SWINE_GSPC.drop(['SEASON', 'YEAR', 'WEEK', 'AGE_0_4', 'AGE_5_24', 'AGE_25_49', 'AGE_25_64', 'AGE_50_64', 'AGE_65', 'TOTAL_PATIENTS', 'NUM_OF_PROVIDERS', 'UNWEIGHTED_ILI', 'WEIGHTED_ILI'], axis = 1)
SWINE_GSPC.head()

Unnamed: 0,CASES,DATE,OPEN,HIGH,LOW,CLOSE,ADJ_CLOSE,VOLUME
0,8956,2009-01-08,905.72998,910.0,896.809998,909.72998,909.72998,4991550000
1,9289,2009-01-15,841.98999,851.590027,817.039978,843.73999,843.73999,7807350000
2,11456,2009-01-22,839.73999,839.73999,811.289978,827.5,827.5,5843830000
3,14324,2009-01-29,868.890015,868.890015,844.150024,845.140015,845.140015,5067060000
4,18268,2009-02-05,831.75,850.549988,819.909973,845.849976,845.849976,6624030000


In [5]:
SWINE_GSPC.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 101 entries, 0 to 100
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   CASES      101 non-null    int64  
 1   DATE       101 non-null    object 
 2   OPEN       101 non-null    float64
 3   HIGH       101 non-null    float64
 4   LOW        101 non-null    float64
 5   CLOSE      101 non-null    float64
 6   ADJ_CLOSE  101 non-null    float64
 7   VOLUME     101 non-null    int64  
dtypes: float64(5), int64(2), object(1)
memory usage: 7.1+ KB


In [6]:
SWINE_GSPC.loc[:, 'DATE'] = pd.to_datetime(SWINE_GSPC['DATE'], format = '%Y-%m-%d')

In [7]:
SWINE_GSPC.dtypes

CASES                 int64
DATE         datetime64[ns]
OPEN                float64
HIGH                float64
LOW                 float64
CLOSE               float64
ADJ_CLOSE           float64
VOLUME                int64
dtype: object

### Graphing Adjusted Close and Volume of GSPC

I have graphed the adjusted close and volume of GSPC over time. The price drops right before April 2009 while trading volume rises over the same period, which I attribute to more people selling or going short on their shares of S&P 500 companies.

In [8]:
rcParams['figure.figsize'] = 8, 8

plt.figure()
ax1 = plt.subplot2grid((6,1), (0,0), rowspan = 5, colspan = 1)
ax2 = plt.subplot2grid((6,1), (5,0), rowspan = 1, colspan = 1, sharex = ax1)

ax1.plot(SWINE_GSPC['DATE'], SWINE_GSPC['ADJ_CLOSE'], 'b-')
ax1.set_ylabel('USD', fontweight = 'bold')
ax1.set_title('Adjusted Close Price and Volume Over Time', fontsize = 14, fontweight = 'bold')
ax1.grid(True)

ax2.bar(SWINE_GSPC['DATE'], SWINE_GSPC['VOLUME'], color = 'blue')
ax2.grid(True)
ax2.set_ylabel('Volume', fontweight = 'bold')

plt.xlabel('Date', fontweight = 'bold')
plt.xticks(rotation = 45)

mplcursors.cursor(hover = True)

plt.show()


### Feature Engineering

For better analysis I have added the high-low range and the open-close range of each observation's prices. Lag columns are created for the previous 21 observations (N = 21; the data is weekly, so this covers 21 weeks rather than 21 days), giving the XGBoost model a constant feed of recent history for its predictions. The mean and standard deviation of each column are also calculated to help with scaling in the next steps.
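To show what a lag column contains, here is a minimal sketch that builds the same kind of `<COLUMN>_LAG_<k>` features with `DataFrame.shift`. The helper name `add_lags` is hypothetical and this is an alternative to the merge-based loop used in this notebook, not a copy of it; the toy values are the first four rows of the dataset shown earlier, with prices rounded.

```python
import pandas as pd

def add_lags(df, cols, n_lags):
    """Return df with <col>_LAG_<k> columns for k = 1..n_lags.

    The first n_lags rows are dropped because they have no complete history.
    """
    lagged = {
        '{}_LAG_{}'.format(col, lag): df[col].shift(lag)
        for lag in range(1, n_lags + 1)
        for col in cols
    }
    out = pd.concat([df, pd.DataFrame(lagged, index=df.index)], axis=1)
    return out.iloc[n_lags:]

toy = pd.DataFrame({'CASES': [8956, 9289, 11456, 14324],
                    'ADJ_CLOSE': [909.73, 843.74, 827.50, 845.14]})

lagged_toy = add_lags(toy, ['CASES', 'ADJ_CLOSE'], n_lags=2)
print(lagged_toy)
```

Each remaining row now carries the values of the one and two previous observations alongside its own, which is the structure the model is trained on.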

In [9]:
SWINE_GSPC['RANGE_HL'] = SWINE_GSPC['HIGH'] - SWINE_GSPC['LOW']
SWINE_GSPC['RANGE_OC'] = SWINE_GSPC['OPEN'] - SWINE_GSPC['CLOSE']
SWINE_GSPC.head()

Unnamed: 0,CASES,DATE,OPEN,HIGH,LOW,CLOSE,ADJ_CLOSE,VOLUME,RANGE_HL,RANGE_OC
0,8956,2009-01-08,905.72998,910.0,896.809998,909.72998,909.72998,4991550000,13.190002,-4.0
1,9289,2009-01-15,841.98999,851.590027,817.039978,843.73999,843.73999,7807350000,34.550049,-1.75
2,11456,2009-01-22,839.73999,839.73999,811.289978,827.5,827.5,5843830000,28.450012,12.23999
3,14324,2009-01-29,868.890015,868.890015,844.150024,845.140015,845.140015,5067060000,24.739991,23.75
4,18268,2009-02-05,831.75,850.549988,819.909973,845.849976,845.849976,6624030000,30.640015,-14.099976


In [10]:
SWINE_GSPC['ORDER'] = list(range(len(SWINE_GSPC)))

merging_keys = ['ORDER']

lag_cols = ['CASES', 'OPEN', 'HIGH', 'LOW', 'CLOSE', 'ADJ_CLOSE', 'VOLUME', 'RANGE_HL', 'RANGE_OC']
lag_cols

['CASES',
 'OPEN',
 'HIGH',
 'LOW',
 'CLOSE',
 'ADJ_CLOSE',
 'VOLUME',
 'RANGE_HL',
 'RANGE_OC']

In [11]:
days_lagging = [day+1 for day in range(N)]

for lag in tqdm_notebook(days_lagging):
    train_lag = SWINE_GSPC[merging_keys + lag_cols].copy()
    
    train_lag['ORDER'] = train_lag['ORDER'] + lag
    
    foo = lambda num: '{}_LAG_{}'.format(num, lag) if num in lag_cols else num
    train_lag = train_lag.rename(columns = foo)
    
    SWINE_GSPC = pd.merge(SWINE_GSPC, train_lag, on = merging_keys, how = 'left')
    
del train_lag

SWINE_GSPC = SWINE_GSPC[N:]

SWINE_GSPC.head()


Unnamed: 0,CASES,DATE,OPEN,HIGH,LOW,CLOSE,ADJ_CLOSE,VOLUME,RANGE_HL,RANGE_OC,...,RANGE_OC_LAG_20,CASES_LAG_21,OPEN_LAG_21,HIGH_LAG_21,LOW_LAG_21,CLOSE_LAG_21,ADJ_CLOSE_LAG_21,VOLUME_LAG_21,RANGE_HL_LAG_21,RANGE_OC_LAG_21
21,17105,2009-06-04,932.48999,942.469971,929.320007,942.460022,942.460022,5352890000,13.149964,-9.970032,...,-1.75,8956.0,905.72998,910.0,896.809998,909.72998,909.72998,4991550000.0,13.190002,-4.0
22,12392,2009-06-11,939.039978,956.22998,939.039978,944.890015,944.890015,5500840000,17.190002,-5.850037,...,12.23999,9289.0,841.98999,851.590027,817.039978,843.73999,843.73999,7807350000.0,34.550049,-1.75
23,10780,2009-06-18,910.859985,921.929993,907.940002,918.369995,918.369995,4684010000,13.989991,-7.51001,...,23.75,11456.0,839.73999,839.73999,811.289978,827.5,827.5,5843830000.0,28.450012,12.23999
24,8801,2009-06-25,899.450012,921.419983,896.27002,920.26001,920.26001,4911240000,25.149963,-20.809998,...,-14.099976,14324.0,868.890015,868.890015,844.150024,845.140015,845.140015,5067060000.0,24.739991,23.75
25,7422,2009-07-02,921.23999,921.23999,896.419983,896.419983,896.419983,3931000000,24.820007,24.820007,...,-5.280029,18268.0,831.75,850.549988,819.909973,845.849976,845.849976,6624030000.0,30.640015,-14.099976


In [12]:
SWINE_GSPC.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80 entries, 21 to 100
Columns: 200 entries, CASES to RANGE_OC_LAG_21
dtypes: datetime64[ns](1), float64(196), int64(3)
memory usage: 125.6 KB


In [13]:
for col in lag_cols:
    SWINE_GSPC = cf.get_mov_avg_std(SWINE_GSPC, col)
    
SWINE_GSPC.head()

Unnamed: 0,CASES,DATE,OPEN,HIGH,LOW,CLOSE,ADJ_CLOSE,VOLUME,RANGE_HL,RANGE_OC,...,CLOSE_MEAN,CLOSE_STD,ADJ_CLOSE_MEAN,ADJ_CLOSE_STD,VOLUME_MEAN,VOLUME_STD,RANGE_HL_MEAN,RANGE_HL_STD,RANGE_OC_MEAN,RANGE_OC_STD
21,17105,2009-06-04,932.48999,942.469971,929.320007,942.460022,942.460022,5352890000,13.149964,-9.970032,...,,,,,,,,,,
22,12392,2009-06-11,939.039978,956.22998,939.039978,944.890015,944.890015,5500840000,17.190002,-5.850037,...,942.460022,,942.460022,,5352890000.0,,13.149964,,-9.970032,
23,10780,2009-06-18,910.859985,921.929993,907.940002,918.369995,918.369995,4684010000,13.989991,-7.51001,...,943.675018,1.718265,943.675018,1.718265,5426865000.0,104616400.0,15.169983,2.856738,-7.910034,2.913276
24,8801,2009-06-25,899.450012,921.419983,896.27002,920.26001,920.26001,4911240000,25.149963,-20.809998,...,935.240011,14.660296,935.240011,14.660296,5179247000.0,435220400.0,14.776652,2.131808,-7.776693,2.072904
25,7422,2009-07-02,921.23999,921.23999,896.419983,896.419983,896.419983,3931000000,24.820007,24.820007,...,931.495011,14.120303,931.495011,14.120303,5112245000.0,379782500.0,17.36998,5.470935,-11.035019,6.732858
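The helper `cf.get_mov_avg_std` lives in `Common_Functions`, which is not shown here. The sketch below is my assumption of what such a helper might do, based only on the output above: a rolling mean and sample standard deviation over the previous rows, excluding the current one. The function name `rolling_mean_std` is hypothetical, and the toy values are the first four `ADJ_CLOSE` values from the table.

```python
import pandas as pd

def rolling_mean_std(df, col, window):
    """Add <col>_MEAN and <col>_STD computed from up to `window` previous rows."""
    out = df.copy()
    previous = out[col].shift(1)  # exclude the current row from its own statistics
    out[col + '_MEAN'] = previous.rolling(window, min_periods=1).mean()
    out[col + '_STD'] = previous.rolling(window, min_periods=1).std()
    return out

toy = pd.DataFrame({'ADJ_CLOSE': [942.460022, 944.890015, 918.369995, 920.26001]})
toy_stats = rolling_mean_std(toy, 'ADJ_CLOSE', window=21)
print(toy_stats)
```

Under this assumption the first row has no mean and the second row has no standard deviation, which matches the NaN pattern in the table above.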


### Splitting Train, Validation, and Test Dataset

I have used 60% of the data for training, the next 20% for validation and tuning, and the final 20% for testing the optimized XGBoost model.

In [14]:
valid_len = int(valid_size * len(SWINE_GSPC))
test_len = int(test_size * len(SWINE_GSPC))
train_len = len(SWINE_GSPC) - (valid_len + test_len)
print('Train Set Len: ' + str(train_len))
print('Test Set Len: ' + str(test_len))
print('Valid Set Len: ' + str(valid_len))

train = SWINE_GSPC[:train_len]
valid = SWINE_GSPC[train_len:(train_len + valid_len)]
test = SWINE_GSPC[(train_len + valid_len):]
print('Train Set Shape: ' + str(train.shape))
print('Valid Set Shape: ' + str(valid.shape))
print('Test Set Shape: ' + str(test.shape))

Train Set Len: 48
Test Set Len: 16
Valid Set Len: 16
Train Set Shape: (48, 218)
Valid Set Shape: (16, 218)
Test Set Shape: (16, 218)


### Scaling Data

In time-series stock data, prices generally trend upward, so later dates tend to have higher prices than earlier ones. To level the playing field for analysis, I have scaled the data: the training set is standardized with a `StandardScaler` fit on the training rows, while each validation and test row is scaled with its own mean and standard deviation columns computed in the previous step. Unnecessary columns are also left out, which makes the data easier to read while implementing XGBoost.
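The row-wise scaling relies on `cf.scale_row` from `Common_Functions`, which is not shown here. As a minimal sketch, assuming it applies the usual standardization `(value - mean) / std`, the scaling and its inverse could look like this. The names `scale_values` and `unscale` and the toy numbers are hypothetical.

```python
import numpy as np

def scale_values(values, mean, std):
    """Standardize values with a given mean and standard deviation."""
    return (np.asarray(values, dtype=float) - mean) / std

def unscale(scaled, mean, std):
    """Invert the standardization to get back to the original units."""
    return np.asarray(scaled, dtype=float) * std + mean

# Toy window of lagged prices (not from the dataset)
window = np.array([845.0, 850.0, 855.0])
mean, std = window.mean(), window.std(ddof=1)  # sample standard deviation

scaled = scale_values(window, mean, std)
restored = unscale(scaled, mean, std)
print(scaled, restored)
```

The inverse step matters later: the model predicts in scaled units, so its output has to be multiplied by the standard deviation and shifted by the mean before it can be compared with prices in USD.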

In [15]:
scaling_cols = ['ADJ_CLOSE']

for num in range(1, N+1):
    scaling_cols.append('CASES_LAG_'+str(num))
    scaling_cols.append('OPEN_LAG_'+str(num))
    scaling_cols.append('HIGH_LAG_'+str(num))
    scaling_cols.append('LOW_LAG_'+str(num))
    scaling_cols.append('ADJ_CLOSE_LAG_'+str(num))
    scaling_cols.append('VOLUME_LAG_'+str(num))
    scaling_cols.append('RANGE_HL_LAG_'+str(num))
    scaling_cols.append('RANGE_OC_LAG_'+str(num))
    
scaler = StandardScaler()
train_scale = scaler.fit_transform(train[scaling_cols])
train_df_scaled = pd.DataFrame(train_scale, columns = scaling_cols)
train_df_scaled[['DATE']] = train.reset_index()[['DATE']]
train_df_scaled

Unnamed: 0,ADJ_CLOSE,CASES_LAG_1,OPEN_LAG_1,HIGH_LAG_1,LOW_LAG_1,ADJ_CLOSE_LAG_1,VOLUME_LAG_1,RANGE_HL_LAG_1,RANGE_OC_LAG_1,CASES_LAG_2,...,RANGE_OC_LAG_20,CASES_LAG_21,OPEN_LAG_21,HIGH_LAG_21,LOW_LAG_21,ADJ_CLOSE_LAG_21,VOLUME_LAG_21,RANGE_HL_LAG_21,RANGE_OC_LAG_21,DATE
0,-1.526854,0.18082,-1.9146,-1.885863,-1.921814,-1.848752,0.446733,0.793655,-1.089707,0.061293,...,0.045461,-0.697154,-0.265754,-0.333533,-0.256126,-0.252987,-0.728319,-1.048351,-0.093303,2009-06-04
1,-1.497055,-0.155029,-1.466373,-1.497335,-1.437568,-1.424792,0.151464,-0.509549,-0.774468,0.16476,...,1.034064,-0.67604,-0.853188,-0.880828,-0.975654,-0.859477,1.580792,2.061316,0.067053,2009-06-11
2,-1.822272,-0.454302,-1.392103,-1.335429,-1.324747,-1.395877,0.264611,0.095617,-0.441443,-0.173042,...,1.847419,-0.53864,-0.873924,-0.991862,-1.02752,-1.008732,-0.029402,1.173253,1.064107,2009-06-18
3,-1.799094,-0.556663,-1.711634,-1.739018,-1.685726,-1.711438,-0.360073,-0.38372,-0.575621,-0.474055,...,-0.82725,-0.356792,-0.605274,-0.718729,-0.731121,-0.846609,-0.666396,0.633135,1.884416,2009-06-25
4,-2.091446,-0.682329,-1.841011,-1.745019,-1.821181,-1.688949,-0.186295,1.287959,-1.650677,-0.577012,...,-0.203988,-0.10672,-0.94756,-0.890573,-0.949767,-0.840084,0.610405,1.492081,-0.81312,2009-07-02
5,-2.25994,-0.769895,-1.593936,-1.747137,-1.81944,-1.97262,-0.93595,1.238534,2.037658,-0.703408,...,0.802988,0.108796,-0.964518,-1.031778,-1.056654,-0.938056,0.489389,1.023297,-0.18453,2009-07-09
6,-1.547947,-0.841776,-2.047039,-2.139901,-2.028018,-2.136112,-0.617677,-1.069772,-0.081738,-0.791483,...,1.082823,0.004494,-1.351594,-1.386897,-1.336546,-1.455029,-0.108857,0.02314,0.831056,2009-07-16
7,-1.111996,-0.862477,-1.492679,-1.479803,-1.459273,-1.445258,-0.195931,-0.00624,-0.822964,-0.863783,...,1.986631,0.031632,-1.55573,-1.557054,-1.564573,-1.694996,1.410729,1.059693,1.113283,2009-07-23
8,-0.983724,-0.877019,-1.221679,-1.062565,-1.159578,-1.02225,0.46407,1.437751,-1.764647,-0.884604,...,-1.940224,-0.095751,-2.085563,-2.223722,-2.230432,-2.340913,1.334694,1.448407,2.024819,2009-07-30
9,-0.857047,-0.886416,-0.972903,-0.859476,-0.895634,-0.897787,0.673257,0.616892,-0.836704,-0.89923,...,1.149956,-0.268848,-1.969256,-1.808073,-1.898224,-1.714204,1.186575,2.544645,-1.935609,2009-08-06


In [16]:
valid_df_scaled = valid[['DATE']]
for col in tqdm_notebook(lag_cols):
    feat_list = [col + '_LAG_' + str(shift) for shift in range(1, N+1)]
    temp = valid.apply(lambda row: cf.scale_row(row[feat_list], row[col+'_MEAN'], row[col+'_STD']), axis = 1)
    valid_df_scaled = pd.concat([valid_df_scaled, temp], axis = 1)
        
    
valid_df_scaled.head()


Unnamed: 0,DATE,CASES_LAG_1,CASES_LAG_2,CASES_LAG_3,CASES_LAG_4,CASES_LAG_5,CASES_LAG_6,CASES_LAG_7,CASES_LAG_8,CASES_LAG_9,...,RANGE_OC_LAG_12,RANGE_OC_LAG_13,RANGE_OC_LAG_14,RANGE_OC_LAG_15,RANGE_OC_LAG_16,RANGE_OC_LAG_17,RANGE_OC_LAG_18,RANGE_OC_LAG_19,RANGE_OC_LAG_20,RANGE_OC_LAG_21
69,2010-05-21,-1.243436,-0.749069,0.449259,0.27156,1.271686,1.299024,6.716559,9.928807,13.922474,...,-1.320265,-0.999466,-1.194896,0.036128,0.813988,-0.123497,-1.380991,-0.138286,-1.454189,-0.233278
70,2010-05-28,-1.049704,-0.803147,-0.233967,1.145704,0.941114,2.092589,2.124065,8.361439,12.059798,...,-0.362805,-0.481973,-0.267579,-0.398187,0.42452,0.944372,0.317841,-0.522557,0.307957,-0.571476
71,2010-06-04,-1.476756,-0.171279,-0.007659,0.370059,1.285634,1.149864,1.914003,1.934891,6.074123,...,-1.362587,-0.661156,-0.792929,-0.555857,-0.70028,0.209451,0.784293,0.091488,-0.837806,0.080558
72,2010-06-11,-1.284623,-0.835161,0.471412,0.635169,1.013204,1.929547,1.793663,2.558444,2.579349,...,-0.549918,-1.309688,-0.704355,-0.818075,-0.613482,-0.738119,0.046977,0.543063,-0.054825,-0.856804
73,2010-06-18,-1.438852,-0.383828,-0.059785,0.882201,1.000264,1.272811,1.933458,1.835491,2.386867,...,-0.044842,-0.27358,-0.96521,-0.414166,-0.517687,-0.331444,-0.444903,0.269782,0.721377,0.17711


In [17]:
test_df_scaled = test[['DATE']]
for col in tqdm_notebook(lag_cols):
    feat_list = [col + '_LAG_' + str(shift) for shift in range(1, N+1)]
    temp = test.apply(lambda row: cf.scale_row(row[feat_list], row[col+'_MEAN'], row[col+'_STD']), axis = 1)
    test_df_scaled = pd.concat([test_df_scaled, temp], axis = 1)
        
    
test_df_scaled.head()


Unnamed: 0,DATE,CASES_LAG_1,CASES_LAG_2,CASES_LAG_3,CASES_LAG_4,CASES_LAG_5,CASES_LAG_6,CASES_LAG_7,CASES_LAG_8,CASES_LAG_9,...,RANGE_OC_LAG_12,RANGE_OC_LAG_13,RANGE_OC_LAG_14,RANGE_OC_LAG_15,RANGE_OC_LAG_16,RANGE_OC_LAG_17,RANGE_OC_LAG_18,RANGE_OC_LAG_19,RANGE_OC_LAG_20,RANGE_OC_LAG_21
85,2010-09-10,1.501813,0.034984,0.279872,-0.994545,-0.822124,-1.389364,-1.139478,-0.429803,0.712174,...,0.277034,-0.590827,4.262356,1.936241,-1.901754,2.887459,2.276546,2.724186,-0.643364,2.491229
86,2010-09-17,1.403013,0.554712,-0.469882,-0.298825,-1.189018,-1.06858,-1.464803,-1.290255,-0.79454,...,0.379092,0.415847,-0.485064,4.55294,2.13824,-1.845915,3.125684,2.491505,2.956192,-0.539603
87,2010-09-24,1.005531,0.972683,0.033621,-1.100596,-0.911238,-1.896673,-1.76335,-2.201965,-2.008742,...,1.299186,0.455375,0.49425,-0.458635,4.870011,2.316008,-1.897993,3.360417,2.689653,3.181148
88,2010-10-01,1.527395,0.088595,0.068726,-0.499314,-1.185402,-1.070859,-1.66695,-1.586303,-1.851621,...,0.272902,1.969003,1.058,1.099971,0.071211,5.824162,3.066791,-1.48276,4.194363,3.470188
89,2010-10-08,0.784579,1.281078,-0.445446,-0.469289,-1.150922,-1.974211,-1.836763,-2.552057,-2.455282,...,5.109881,-0.07081,1.711073,0.753996,0.79809,-0.282701,5.761212,2.864384,-1.915265,4.048984


### Build XGBoost

I have built the list of features that will be fed into the machine learning model. I also built the X and y data for the train, validation, and test datasets, for both the scaled and unscaled data. I then fit the model on the training set and evaluated it on the training and validation sets.

In [18]:
features_list = []
for num in range(1, N+1):
    features_list.append('CASES_LAG_'+str(num))
    features_list.append('OPEN_LAG_'+str(num))
    features_list.append('HIGH_LAG_'+str(num))
    features_list.append('LOW_LAG_'+str(num))
    features_list.append('ADJ_CLOSE_LAG_'+str(num))
    features_list.append('VOLUME_LAG_'+str(num))
    features_list.append('RANGE_HL_LAG_'+str(num))
    features_list.append('RANGE_OC_LAG_'+str(num))
    
target = 'ADJ_CLOSE'

X_train = train[features_list]
y_train = train[target]
X_valid = valid[features_list]
y_valid = valid[target]
X_test = test[features_list]
y_test = test[target]

In [19]:
X_train_scaled = train_df_scaled[features_list]
y_train_scaled = train_df_scaled[target]
X_valid_scaled = valid_df_scaled[features_list]
X_test_scaled = test_df_scaled[features_list]

In [20]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

plt.plot(train['DATE'], train['ADJ_CLOSE'], 'b-')
plt.plot(valid['DATE'], valid['ADJ_CLOSE'], 'y-')
plt.plot(test['DATE'], test['ADJ_CLOSE'], 'g-')
plt.legend(['Train Data', 'Valid Data', 'Test Data'])
plt.ylabel('USD', fontweight = 'bold')
plt.title('Adjusted Close Price Over Time Without Scaling', fontsize = 14, fontweight = 'bold')
plt.grid(True)

plt.xticks(rotation = 45)

plt.xlabel('Date', fontweight = 'bold')

mplcursors.cursor(hover = True)

plt.show()


In [21]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

plt.plot(train_df_scaled['DATE'], train_df_scaled['ADJ_CLOSE'], 'b-')
plt.legend(['Train Data'])
plt.ylabel('Scaled Adjusted Close', fontweight = 'bold')
plt.title('Adjusted Close Price Over Time With Scaling', fontsize = 14, fontweight = 'bold')
plt.grid(True)

plt.xlabel('Date', fontweight = 'bold')
plt.xticks(rotation = 45)

mplcursors.cursor(hover = True)

plt.show()


In [22]:
model = XGBRegressor(seed = model_seed,
                    n_estimators = n_estimators,
                    max_depth = max_depth,
                    learning_rate = learning_rate,
                    min_child_weight = min_child_weight,
                    subsample = subsample,
                    colsample_bytree = colsample_bytree,
                    colsample_bylevel = colsample_bylevel,
                    gamma = gamma)

model.fit(X_train_scaled, y_train_scaled)

XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints=None,
             learning_rate=0.1, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=100, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, seed=100, subsample=1,
             tree_method=None, validate_parameters=False, verbosity=None)

In [23]:
predict_train_scaled = model.predict(X_train_scaled)
predict_train = predict_train_scaled * math.sqrt(scaler.var_[0]) + scaler.mean_[0]

predit_train_df = pd.DataFrame({'PREDICT_VALUE':predict_train,
                               'DATE':train['DATE']})

print('MSE on Train Set =', mean_squared_error(y_train, predict_train))
print('MAE on Train Set =', mean_absolute_error(y_train, predict_train))

MSE on Train Set = 0.11400265287900403
MAE on Train Set = 0.24925482560220993


In [24]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

plt.plot(train['DATE'], train['ADJ_CLOSE'], 'b-')
plt.plot(valid['DATE'], valid['ADJ_CLOSE'], 'y-')
plt.plot(test['DATE'], test['ADJ_CLOSE'], 'g-')
plt.plot(predit_train_df['DATE'], predit_train_df['PREDICT_VALUE'], 'r-')
plt.legend(['Train Data', 'Valid Data', 'Test Data', 'Train Data Predictions'])
plt.ylabel('USD', fontweight = 'bold')
plt.title('Adjusted Close Price Over Time Without Scaling', fontsize = 14, fontweight = 'bold')
plt.grid(True)

plt.xlabel('Date', fontweight = 'bold')
plt.xticks(rotation = 45)

mplcursors.cursor(hover = True)

plt.show()


In [25]:
predict_valid_scaled = model.predict(X_valid_scaled)

# Work on an explicit copy so the new columns are not written to a slice of SWINE_GSPC
valid = valid.copy()
valid['PREDICT_VALID'] = predict_valid_scaled
valid['PREDICT'] = valid['PREDICT_VALID'] * valid['ADJ_CLOSE_STD'] + valid['ADJ_CLOSE_MEAN']

predit_valid_df = pd.DataFrame({'PREDICT_VALUE':valid['PREDICT'],
                               'Y_VALID':y_valid,
                               'DATE':valid['DATE']})

MSE_Original = mean_squared_error(y_valid, valid['PREDICT'])
MAE_Original = mean_absolute_error(y_valid, valid['PREDICT'])

print('MSE on Validation Set =', mean_squared_error(y_valid, valid['PREDICT']))
print('MAE on Validation Set =', mean_absolute_error(y_valid, valid['PREDICT']))

MSE on Validation Set = 1867.3881765173078
MAE on Validation Set = 37.83621087039508


In [26]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

plt.plot(train['DATE'], train['ADJ_CLOSE'], 'b-')
plt.plot(valid['DATE'], valid['ADJ_CLOSE'], 'y-')
plt.plot(test['DATE'], test['ADJ_CLOSE'], 'g-')
plt.plot(predit_valid_df['DATE'], predit_valid_df['PREDICT_VALUE'], 'r-')
plt.legend(['Train Data', 'Valid Data', 'Test Data', 'Valid Data Predictions'])
plt.ylabel('USD', fontweight = 'bold')
plt.title('Adjusted Close Price Over Time Without Scaling', fontsize = 14, fontweight = 'bold')
plt.grid(True)

plt.xlabel('Date', fontweight = 'bold')
plt.xticks(rotation = 45)

mplcursors.cursor(hover = True)

plt.show();


### Preliminary Analysis

The predictions are clearly well off: the MSE is about 0.11 on the training set but about 1867 on the validation set, which points to overfitting. On the positive side, the model captures the general trend, which suggests it is on the right path, but it falls short of being accurate. In the world of finance we need efficiency and accuracy. To improve the model, I next tuned the hyperparameters of the machine learning model.

### Tuning of XGBoost

First I needed to see which features are the most important, or most influential, for the machine learning model. I then took two XGBoost parameters at a time, manually optimized the model over them, and stored the optimum values to use for further tuning.

- XGBoost Parameters:
    - n_estimators -> Specifies how many times to go through the modeling cycle
    - max_depth -> Maximum depth of a tree, large value can cause overfitting
    - learning_rate -> Shrinkage factor that avoids overfitting of the model
    - subsample -> Subsample ratio of the training instances, it will occur once in every boosting iteration
    - colsample_bytree -> Subsample ratio of columns when constructing each tree
    - colsample_bylevel -> Subsample ratio of columns for each level
    - gamma -> Minimum loss reduction required to make a further partition on a leaf node of the tree

In [27]:
imp = list(zip(train[features_list], model.feature_importances_))
imp.sort(key = lambda tup: tup[1])
imp[-10:]

[('ADJ_CLOSE_LAG_1', 0.013077019),
 ('CASES_LAG_21', 0.014406298),
 ('CASES_LAG_5', 0.02385864),
 ('OPEN_LAG_13', 0.024067378),
 ('OPEN_LAG_9', 0.03774989),
 ('HIGH_LAG_19', 0.059592523),
 ('ADJ_CLOSE_LAG_5', 0.06372984),
 ('LOW_LAG_13', 0.08353842),
 ('HIGH_LAG_4', 0.20916526),
 ('OPEN_LAG_3', 0.36135226)]

In [28]:
imp = list(zip(train[features_list], model.feature_importances_))
imp.sort(key = lambda tup: tup[1])

# Keep the ten most important features (sorted in ascending order of importance)
imp_10 = imp[-10:]

imp_df = pd.DataFrame(imp_10, columns = ['FEATURE', 'IMPORTANCE'])

imp_df

Unnamed: 0,FEATURE,IMPORTANCE
0,ADJ_CLOSE_LAG_1,0.013077
1,CASES_LAG_21,0.014406
2,CASES_LAG_5,0.023859
3,OPEN_LAG_13,0.024067
4,OPEN_LAG_9,0.03775
5,HIGH_LAG_19,0.059593
6,ADJ_CLOSE_LAG_5,0.06373
7,LOW_LAG_13,0.083538
8,HIGH_LAG_4,0.209165
9,OPEN_LAG_3,0.361352


In [29]:
rcParams['figure.figsize'] = 10, 5

plt.figure()

plt.barh(imp_df['FEATURE'], imp_df['IMPORTANCE'], color = 'blue')
plt.legend(['Feature Importance'])
plt.ylabel('Feature')
plt.xlabel('Importance')
plt.title('Importance of Features Used to Train and Predict Stock Value')

mplcursors.cursor(hover = True)

plt.show()


In [30]:
n_estimators_parameter = 'n_estimators'
n_estimators_list = range(10, 500, 10)

max_depth_parameter = 'max_depth'
max_depth_list = [2, 3, 4, 5, 6, 7, 8, 9]

error_rates = {n_estimators_parameter: [], max_depth_parameter: [], 'MSE':[], 'MAE':[]}
start = time.time()

for param in tqdm_notebook(n_estimators_list):
    for param2 in max_depth_list:
        mse, mae, predicted_values = cf.train_pred_eval_model(X_train_scaled,
                                                             y_train_scaled,
                                                             X_valid_scaled,
                                                             y_valid,
                                                             valid['ADJ_CLOSE_MEAN'],
                                                             valid['ADJ_CLOSE_STD'],
                                                             seed = model_seed,
                                                             n_estimators = param,
                                                             max_depth = param2,
                                                             learning_rate = learning_rate,
                                                             min_child_weight = min_child_weight,
                                                             subsample = subsample,
                                                             colsample_bytree = colsample_bytree,
                                                             colsample_bylevel = colsample_bylevel,
                                                             gamma = gamma)
        error_rates[n_estimators_parameter].append(param)
        error_rates[max_depth_parameter].append(param2)
        error_rates['MSE'].append(mse)
        error_rates['MAE'].append(mae)
        
error_rate_df1 = pd.DataFrame(error_rates)
end = time.time()
print('Minutes Taken to Execute: ' + str((end - start)/60.0))
error_rate_df1

Minutes Taken to Execute: 5.8500207980473835


Unnamed: 0,n_estimators,max_depth,MSE,MAE
0,10,2,1757.074245,35.335146
1,10,3,1987.317527,38.336418
2,10,4,2071.028946,39.550892
3,10,5,1977.892669,38.450976
4,10,6,1978.159095,38.454953
...,...,...,...,...
387,490,5,1684.308435,36.151272
388,490,6,1760.765519,36.713392
389,490,7,1707.661236,36.286771
390,490,8,1705.349819,36.278865


In [31]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

legend_list = []

for p in range(len(max_depth_list)):
    param = error_rate_df1[error_rate_df1[max_depth_parameter] == max_depth_list[p]]
    plt.plot(param[n_estimators_parameter], param['MAE'], marker = 's',)
    legend_list.append(max_depth_parameter + '_' + str(max_depth_list[p]))
    
plt.xlabel(n_estimators_parameter, fontweight = 'bold')
plt.ylabel('MAE', fontweight = 'bold')
plt.title('n_estimators and max_depth Parameter Tuning', fontsize = 14, fontweight = 'bold')
plt.legend(legend_list, loc = 'upper right')

mplcursors.cursor(hover = True)

plt.show();


In [32]:
optimum = error_rate_df1[error_rate_df1['MSE'] == error_rate_df1['MSE'].min()]

print('Minimum MSE Value:', error_rate_df1['MSE'].min())
print('Optimum n_estimators Parameter:', optimum['n_estimators'].values[0])
print('Optimum max_depth Parameter:', optimum['max_depth'].values[0])

Minimum MSE Value: 1356.5503608383842
Optimum n_estimators Parameter: 240
Optimum max_depth Parameter: 2


In [33]:
optimum = error_rate_df1[error_rate_df1['MAE'] == error_rate_df1['MAE'].min()]

n_estimators_optimum = optimum['n_estimators'].values[0]
max_depth_optimum = optimum['max_depth'].values[0]

print('Minimum MAE Value:', error_rate_df1['MAE'].min())
print('Optimum n_estimators Parameter:', optimum['n_estimators'].values[0])
print('Optimum max_depth Parameter:', optimum['max_depth'].values[0])

Minimum MAE Value: 32.304372955166556
Optimum n_estimators Parameter: 240
Optimum max_depth Parameter: 2
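The tuning cells in this notebook all follow the same pattern: vary two parameters over a grid while holding the others fixed, record MSE and MAE for each combination, then keep the pair with the lowest MAE. A minimal sketch of that pattern as a reusable helper is shown here. The helper names `tune_pair` and `best_by` are hypothetical, and `toy_evaluate` is a made-up stand-in for model training so the sketch runs on its own; it is not the notebook's `cf.train_pred_eval_model`.

```python
import itertools

import pandas as pd


def tune_pair(evaluate, name_a, values_a, name_b, values_b, **fixed_params):
    """Evaluate every (a, b) combination and return one row per combination.

    `evaluate` is a hypothetical callable that accepts hyperparameters as
    keyword arguments and returns an (mse, mae) tuple.
    """
    rows = []
    for a, b in itertools.product(values_a, values_b):
        params = {**fixed_params, name_a: a, name_b: b}
        mse, mae = evaluate(**params)
        rows.append({name_a: a, name_b: b, 'MSE': mse, 'MAE': mae})
    return pd.DataFrame(rows)


def best_by(results, metric):
    """Return the row (as a Series) with the smallest value of `metric`."""
    return results.loc[results[metric].idxmin()]


# Toy stand-in for model training: by construction the error is smallest
# at n_estimators=240 and max_depth=2. These are not real model errors.
def toy_evaluate(n_estimators, max_depth, learning_rate):
    mae = abs(n_estimators - 240) / 10 + abs(max_depth - 2) + learning_rate
    return mae ** 2, mae


results = tune_pair(toy_evaluate,
                    'n_estimators', range(10, 500, 10),
                    'max_depth', [2, 3, 4, 5, 6, 7, 8, 9],
                    learning_rate=0.1)
best = best_by(results, 'MAE')
print(int(best['n_estimators']), int(best['max_depth']))
```

Tuning parameters two at a time keeps each grid small, but it assumes the pairs do not interact strongly; a value chosen in an early stage is never revisited once later parameters change.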


In [34]:
learning_rate_parameter = 'learning_rate'
learning_rate_list = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3]

min_child_weight_parameter = 'min_child_weight'
min_child_weight_list = range(5, 21, 1)

error_rates = {learning_rate_parameter: [], min_child_weight_parameter: [], 'MSE':[], 'MAE':[]}
start = time.time()

for param in tqdm_notebook(learning_rate_list):
    for param2 in min_child_weight_list:
        mse, mae, predicted_values = cf.train_pred_eval_model(X_train_scaled,
                                                             y_train_scaled,
                                                             X_valid_scaled,
                                                             y_valid,
                                                             valid['ADJ_CLOSE_MEAN'],
                                                             valid['ADJ_CLOSE_STD'],
                                                             seed = model_seed,
                                                             n_estimators = n_estimators_optimum,
                                                             max_depth = max_depth_optimum,
                                                             learning_rate = param,
                                                             min_child_weight = param2,
                                                             subsample = subsample,
                                                             colsample_bytree = colsample_bytree,
                                                             colsample_bylevel = colsample_bylevel,
                                                             gamma = gamma)
        error_rates[learning_rate_parameter].append(param)
        error_rates[min_child_weight_parameter].append(param2)
        error_rates['MSE'].append(mse)
        error_rates['MAE'].append(mae)
        
error_rate_df2 = pd.DataFrame(error_rates)
end = time.time()
print('Minutes Taken to Execute: ' + str((end - start)/60.0))
error_rate_df2

Minutes Taken to Execute: 1.1199437538782755


Unnamed: 0,learning_rate,min_child_weight,MSE,MAE
0,0.001,5,2224.685369,38.452212
1,0.001,6,2224.685369,38.452212
2,0.001,7,2225.057734,38.466177
3,0.001,8,2192.848576,38.367775
4,0.001,9,2216.338525,38.235513
...,...,...,...,...
107,0.300,16,990.407069,25.920693
108,0.300,17,1131.946349,26.565838
109,0.300,18,1100.154306,26.865732
110,0.300,19,2040.004118,34.733193


In [35]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

legend_list = []

for p in range(len(min_child_weight_list)):
    param = error_rate_df2[error_rate_df2[min_child_weight_parameter] == min_child_weight_list[p]]
    plt.plot(param[learning_rate_parameter], param['MAE'], marker = 's',)
    legend_list.append(min_child_weight_parameter + '_' + str(min_child_weight_list[p]))
    
plt.xlabel(learning_rate_parameter, fontweight = 'bold')
plt.ylabel('MAE', fontweight = 'bold')
plt.title('learning_rate and min_child_weight Parameter Tuning', fontsize = 10, fontweight = 'bold')
plt.legend(legend_list, loc = 'upper right', bbox_to_anchor = (1.1, 1.1))

mplcursors.cursor(hover = True)

plt.show();


In [36]:
optimum = error_rate_df2[error_rate_df2['MSE'] == error_rate_df2['MSE'].min()]

print('Minimum MSE Value:', error_rate_df2['MSE'].min())
print('Optimum learning_rate Parameter:', optimum['learning_rate'].values[0])
print('Optimum min_child_weight Parameter:', optimum['min_child_weight'].values[0])

Minimum MSE Value: 967.3895661349792
Optimum learning_rate Parameter: 0.3
Optimum min_child_weight Parameter: 14


In [37]:
optimum = error_rate_df2[error_rate_df2['MAE'] == error_rate_df2['MAE'].min()]

learning_rate_optimum = optimum['learning_rate'].values[0]
min_child_weight_optimum = optimum['min_child_weight'].values[0]

print('Minimum MAE Value:', error_rate_df2['MAE'].min())
print('Optimum learning_rate Parameter:', optimum['learning_rate'].values[0])
print('Optimum min_child_weight Parameter:', optimum['min_child_weight'].values[0])

Minimum MAE Value: 25.920693417864697
Optimum learning_rate Parameter: 0.3
Optimum min_child_weight Parameter: 16
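The two criteria here select different `min_child_weight` values (14 by MSE, 16 by MAE), and the notebook carries the MAE choice forward. One reason the metrics can disagree is that MSE squares each error, so a single large miss is penalized far more heavily than under MAE. The sketch below illustrates this with made-up residuals for two hypothetical models; none of the numbers come from the notebook.

```python
import numpy as np


def mse(errors):
    """Mean squared error of an array of residuals."""
    return float(np.mean(np.square(errors)))


def mae(errors):
    """Mean absolute error of an array of residuals."""
    return float(np.mean(np.abs(errors)))


# Made-up residuals for two hypothetical models (illustrative only).
steady = np.array([30.0, -30.0, 30.0, -30.0])   # consistently moderate errors
spiky = np.array([5.0, -5.0, 5.0, -90.0])       # mostly small errors, one large miss

print('steady:', mse(steady), mae(steady))
print('spiky: ', mse(spiky), mae(spiky))
```

In this toy case the `spiky` model has the lower MAE but the higher MSE, so the "optimum" depends on which metric is used for selection.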


In [38]:
subsample_parameter = 'subsample'
subsample_list = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

gamma_parameter = 'gamma'
gamma_list = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

error_rates = {subsample_parameter: [], gamma_parameter: [], 'MSE':[], 'MAE':[]}
start = time.time()

for param in tqdm_notebook(subsample_list):
    for param2 in gamma_list:
        mse, mae, predicted_values = cf.train_pred_eval_model(X_train_scaled,
                                                             y_train_scaled,
                                                             X_valid_scaled,
                                                             y_valid,
                                                             valid['ADJ_CLOSE_MEAN'],
                                                             valid['ADJ_CLOSE_STD'],
                                                             seed = model_seed,
                                                             n_estimators = n_estimators_optimum,
                                                             max_depth = max_depth_optimum,
                                                             learning_rate = learning_rate_optimum,
                                                             min_child_weight = min_child_weight_optimum,
                                                             subsample = param,
                                                             colsample_bytree = colsample_bytree,
                                                             colsample_bylevel = colsample_bylevel,
                                                             gamma = param2)
        error_rates[subsample_parameter].append(param)
        error_rates[gamma_parameter].append(param2)
        error_rates['MSE'].append(mse)
        error_rates['MAE'].append(mae)
        
error_rate_df3 = pd.DataFrame(error_rates)
end = time.time()
print('Minutes Taken to Execute: ' + str((end - start)/60.0))
error_rate_df3

Minutes Taken to Execute: 0.4890357534090678


Unnamed: 0,subsample,gamma,MSE,MAE
0,0.1,0.1,1517.032721,32.068035
1,0.1,0.2,1517.032721,32.068035
2,0.1,0.3,1517.032721,32.068035
3,0.1,0.4,1517.032721,32.068035
4,0.1,0.5,1517.032721,32.068035
...,...,...,...,...
95,1.0,0.6,1041.395885,27.428001
96,1.0,0.7,1041.395885,27.428001
97,1.0,0.8,1011.409775,27.150964
98,1.0,0.9,1005.263047,27.169149


In [39]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

legend_list = []

for p in range(len(gamma_list)):
    param = error_rate_df3[error_rate_df3[gamma_parameter] == gamma_list[p]]
    plt.plot(param[subsample_parameter], param['MAE'], marker = 's',)
    legend_list.append(gamma_parameter + '_' + str(gamma_list[p]))
    
plt.xlabel(subsample_parameter, fontweight = 'bold')
plt.ylabel('MAE', fontweight = 'bold')
plt.title('subsample and gamma Parameter Tuning', fontsize = 14, fontweight = 'bold')
plt.legend(legend_list, loc = 'upper right')

mplcursors.cursor(hover = True)

plt.show();


In [40]:
optimum = error_rate_df3[error_rate_df3['MSE'] == error_rate_df3['MSE'].min()]

print('Minimum MSE Value:', error_rate_df3['MSE'].min())
print('Optimum subsample Parameter:', optimum['subsample'].values[0])
print('Optimum gamma Parameter:', optimum['gamma'].values[0])

Minimum MSE Value: 1004.7907728874891
Optimum subsample Parameter: 1.0
Optimum gamma Parameter: 0.4


In [41]:
optimum = error_rate_df3[error_rate_df3['MAE'] == error_rate_df3['MAE'].min()]

subsample_optimum = optimum['subsample'].values[0]
gamma_optimum = optimum['gamma'].values[0]

print('Minimum MAE Value:', error_rate_df3['MAE'].min())
print('Optimum subsample Parameter:', optimum['subsample'].values[0])
print('Optimum gamma Parameter:', optimum['gamma'].values[0])

Minimum MAE Value: 26.898652156840754
Optimum subsample Parameter: 1.0
Optimum gamma Parameter: 0.4


In [42]:
colsample_bytree_parameter = 'colsample_bytree'
colsample_bytree_list = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

colsample_bylevel_parameter = 'colsample_bylevel'
colsample_bylevel_list = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

error_rates = {colsample_bytree_parameter: [], colsample_bylevel_parameter: [], 'MSE':[], 'MAE':[]}
start = time.time()

for param in tqdm_notebook(colsample_bytree_list):
    for param2 in colsample_bylevel_list:
        mse, mae, predicted_values = cf.train_pred_eval_model(X_train_scaled,
                                                             y_train_scaled,
                                                             X_valid_scaled,
                                                             y_valid,
                                                             valid['ADJ_CLOSE_MEAN'],
                                                             valid['ADJ_CLOSE_STD'],
                                                             seed = model_seed,
                                                             n_estimators = n_estimators_optimum,
                                                             max_depth = max_depth_optimum,
                                                             learning_rate = learning_rate_optimum,
                                                             min_child_weight = min_child_weight_optimum,
                                                             subsample = subsample_optimum,
                                                             colsample_bytree = param,
                                                             colsample_bylevel = param2,
                                                             gamma = gamma_optimum)
        error_rates[colsample_bytree_parameter].append(param)
        error_rates[colsample_bylevel_parameter].append(param2)
        error_rates['MSE'].append(mse)
        error_rates['MAE'].append(mae)
        
error_rate_df4 = pd.DataFrame(error_rates)
end = time.time()
print('Minutes Taken to Execute: ' + str((end - start)/60.0))
error_rate_df4

Minutes Taken to Execute: 0.14172851641972858


Unnamed: 0,colsample_bytree,colsample_bylevel,MSE,MAE
0,0.5,0.5,1360.378916,31.576816
1,0.5,0.6,1276.652125,30.349075
2,0.5,0.7,1393.434574,31.623297
3,0.5,0.8,1401.511374,31.651357
4,0.5,0.9,1368.833445,31.336181
5,0.5,1.0,1515.091259,32.373683
6,0.6,0.5,1604.147497,31.385153
7,0.6,0.6,1256.450029,29.846854
8,0.6,0.7,1224.201996,29.253495
9,0.6,0.8,1396.390381,30.590389


In [43]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

legend_list = []

for p in range(len(colsample_bylevel_list)):
    param = error_rate_df4[error_rate_df4[colsample_bylevel_parameter] == colsample_bylevel_list[p]]
    plt.plot(param[colsample_bytree_parameter], param['MAE'], marker = 's',)
    legend_list.append(colsample_bylevel_parameter + '_' + str(colsample_bylevel_list[p]))
    
plt.xlabel(colsample_bytree_parameter, fontweight = 'bold')
plt.ylabel('MAE', fontweight = 'bold')
plt.title('colsample_bytree and colsample_bylevel Parameter Tuning', fontsize = 14, fontweight = 'bold')
plt.legend(legend_list, loc = 'upper right')

mplcursors.cursor(hover = True)

plt.show();


In [44]:
optimum = error_rate_df4[error_rate_df4['MSE'] == error_rate_df4['MSE'].min()]

MSE_Optimum = error_rate_df4['MSE'].min()

print('Minimum MSE Value:', error_rate_df4['MSE'].min())
print('Optimum colsample_bytree Parameter:', optimum['colsample_bytree'].values[0])
print('Optimum colsample_bylevel Parameter:', optimum['colsample_bylevel'].values[0])

Minimum MSE Value: 942.0543299147914
Optimum colsample_bytree Parameter: 0.9
Optimum colsample_bylevel Parameter: 0.6


In [45]:
optimum = error_rate_df4[error_rate_df4['MAE'] == error_rate_df4['MAE'].min()]

MAE_Optimum = error_rate_df4['MAE'].min()

colsample_bytree_optimum = optimum['colsample_bytree'].values[0]
colsample_bylevel_optimum = optimum['colsample_bylevel'].values[0]

print('Minimum MAE Value:', error_rate_df4['MAE'].min())
print('Optimum colsample_bytree Parameter:', optimum['colsample_bytree'].values[0])
print('Optimum colsample_bylevel Parameter:', optimum['colsample_bylevel'].values[0])

Minimum MAE Value: 25.881364067309576
Optimum colsample_bytree Parameter: 0.9
Optimum colsample_bylevel Parameter: 0.6


### Overview of Optimized Values

I have graphed the optimized values to show how each parameter changed from its original value and how tuning affected the error values.

In [46]:
tuned_params_df = pd.DataFrame({'PARAMETER':['n_estimators', 'max_depth', 'learning_rate', 'min_child_weight', 'subsample', 'colsample_bytree', 'colsample_bylevel', 'gamma'],
                               'ORIGINAL':[n_estimators, max_depth, learning_rate, min_child_weight, subsample, colsample_bytree, colsample_bylevel, gamma],
                               'OPTIMUM':[n_estimators_optimum, max_depth_optimum, learning_rate_optimum, min_child_weight_optimum, subsample_optimum, colsample_bytree_optimum, colsample_bylevel_optimum, gamma_optimum]})

MSE_values_df = pd.DataFrame({'PARAMETER':['MSE'],
                                    'ORIGINAL':[MSE_Original],
                                    'OPTIMUM':[MSE_Optimum]})

MAE_values_df = pd.DataFrame({'PARAMETER':['MAE'],
                                    'ORIGINAL':[MAE_Original],
                                    'OPTIMUM':[MAE_Optimum]})

tuned_params_df

Unnamed: 0,PARAMETER,ORIGINAL,OPTIMUM
0,n_estimators,100.0,240.0
1,max_depth,3.0,2.0
2,learning_rate,0.1,0.3
3,min_child_weight,1.0,16.0
4,subsample,1.0,1.0
5,colsample_bytree,1.0,0.9
6,colsample_bylevel,1.0,0.6
7,gamma,0.0,0.4


In [47]:
MSE_values_df

Unnamed: 0,PARAMETER,ORIGINAL,OPTIMUM
0,MSE,1867.388177,942.05433


In [48]:
MAE_values_df

Unnamed: 0,PARAMETER,ORIGINAL,OPTIMUM
0,MAE,37.836211,25.881364
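To put the two tables above on a common scale, the improvement can be expressed as a relative reduction. This is a small worked calculation using the displayed validation errors; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(original, optimum):
    """Fraction by which `optimum` is lower than `original`."""
    return (original - optimum) / original


# Values copied from the MSE and MAE tables above.
mse_drop = relative_reduction(1867.388177, 942.05433)
mae_drop = relative_reduction(37.836211, 25.881364)

print(f'MSE reduced by {mse_drop:.1%}')
print(f'MAE reduced by {mae_drop:.1%}')
```

On the validation set this works out to a reduction of about 49.6% in MSE and about 31.6% in MAE.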


In [49]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

vals_parameters = [tuned_params_df['ORIGINAL'], tuned_params_df['OPTIMUM']]
X_parameters = tuned_params_df['PARAMETER']
legend_list = ['Original Value', 'Optimized Value']
n = len(vals_parameters)
_X = np.arange(len(X_parameters))
for i in range(len(vals_parameters)):
    plt.bar(_X - 0.8/2. + i/float(n)*0.8, vals_parameters[i], width = 0.8/float(n), align = 'edge')
plt.xticks(_X, X_parameters, rotation=45)
plt.title('Original and Optimized Parameters For XGBoost', fontsize = 14, fontweight = 'bold')
plt.ylabel('Value', fontweight = 'bold')
plt.xlabel('Parameters', fontweight = 'bold')
plt.legend(legend_list)

cursor = mplcursors.cursor(hover=True)
@cursor.connect("add")
def on_add(sel):
    x, y, width, height = sel.artist[sel.target.index].get_bbox().bounds
    sel.annotation.set(text=f"{height}", position=(0, 20))
    sel.annotation.xy = (x + width / 2, y + height)

plt.show();

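The bar positions in the plotting loop come from splitting a slot of width 0.8 around each tick evenly between the series. A short sketch of that arithmetic, with illustrative variable names and the same 0.8 slot width, is:

```python
import numpy as np

n_groups = 8        # one slot per parameter, as in tuned_params_df
n_series = 2        # original and optimized values
slot_width = 0.8    # total width shared by the bars in one slot

centers = np.arange(n_groups)
bar_width = slot_width / n_series

# Left edge of series i in each slot (the bars are drawn with align='edge').
left_edges = [centers - slot_width / 2 + i * bar_width for i in range(n_series)]

print(bar_width)
print(left_edges[0][:3], left_edges[1][:3])
```

With two series, the first bar spans from 0.4 to the left of each tick up to the tick, and the second spans from the tick to 0.4 on its right, so each pair is centered on its label.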

In [50]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

vals_MSE = [MSE_values_df['ORIGINAL'], MSE_values_df['OPTIMUM']]
X_MSE = MSE_values_df['PARAMETER']
legend_list = ['Original Value', 'Optimized Value']
n = len(vals_MSE)
_X = np.arange(len(X_MSE))
for i in range(len(vals_MSE)):
    plt.bar(_X - 0.8/2. + i/float(n)*0.8, vals_MSE[i], width = 0.8/float(n), align = 'edge')
plt.xticks(_X, X_MSE, rotation=45)
plt.title('Original and Optimized MSE Values For XGBoost', fontsize = 14, fontweight = 'bold')
plt.ylabel('Value', fontweight = 'bold')
plt.xlabel('Error Calculation', fontweight = 'bold')
plt.legend(legend_list)

cursor = mplcursors.cursor(hover=True)
@cursor.connect("add")
def on_add(sel):
    x, y, width, height = sel.artist[sel.target.index].get_bbox().bounds
    sel.annotation.set(text=f"{height}", position=(0, 20))
    sel.annotation.xy = (x + width / 2, y + height)

plt.show();


In [51]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

vals_MAE = [MAE_values_df['ORIGINAL'], MAE_values_df['OPTIMUM']]
X_MAE = MAE_values_df['PARAMETER']
legend_list = ['Original Value', 'Optimized Value']
n = len(vals_MAE)
_X = np.arange(len(X_MAE))
for i in range(len(vals_MAE)):
    plt.bar(_X - 0.8/2. + i/float(n)*0.8, vals_MAE[i], width = 0.8/float(n), align = 'edge')
plt.xticks(_X, X_MAE, rotation=45)
plt.title('Original and Optimized MAE Values For XGBoost', fontsize = 14, fontweight = 'bold')
plt.ylabel('Value', fontweight = 'bold')
plt.xlabel('Error Calculation', fontweight = 'bold')
plt.legend(legend_list)

cursor = mplcursors.cursor(hover=True)
@cursor.connect("add")
def on_add(sel):
    x, y, width, height = sel.artist[sel.target.index].get_bbox().bounds
    sel.annotation.set(text=f"{height}", position=(0, 20))
    sel.annotation.xy = (x + width / 2, y + height)

plt.show();


### Testing XGBoost Model on Test Dataset

Since the optimum parameters reduced both error values on the validation set, I now use the tuned model to make predictions on the test dataset.

In [52]:
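# NOTE: this call passes the validation features and targets (X_valid_scaled, y_valid)
# together with the test set's mean and standard deviation. To evaluate on the test set,
# the scaled test features and y_test would likely need to be passed here instead.
mse, mae, predicted_test = cf.train_pred_eval_model(X_train_scaled,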
                                                             y_train_scaled,
                                                             X_valid_scaled,
                                                             y_valid,
                                                             test['ADJ_CLOSE_MEAN'],
                                                             test['ADJ_CLOSE_STD'],
                                                             seed = model_seed,
                                                             n_estimators = n_estimators_optimum,
                                                             max_depth = max_depth_optimum,
                                                             learning_rate = learning_rate_optimum,
                                                             min_child_weight = min_child_weight_optimum,
                                                             subsample = subsample_optimum,
                                                             colsample_bytree = colsample_bytree_optimum,
                                                             colsample_bylevel = colsample_bylevel_optimum,
                                                             gamma = gamma_optimum)


predit_test_df = pd.DataFrame({'PREDICT_VALUE':predicted_test,
                               'Y_TEST':y_test,
                               'DATE':test['DATE']})

print('MSE Value:', mse)
print('MAE Value:', mae)

MSE Value: 7769.77162002861
MAE Value: 75.99604051868995
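MSE is in squared units, which makes it hard to compare with MAE directly. Taking the square root gives the root mean squared error (RMSE), which is in the same units as the adjusted close price. RMSE is not computed in the original notebook; the following is a small worked calculation on the two values printed above.

```python
import math

# Test-set errors copied from the output above.
test_mse = 7769.77162002861
test_mae = 75.99604051868995

# RMSE is the square root of MSE, so it shares the units of the target and of MAE.
test_rmse = math.sqrt(test_mse)

print(f'RMSE: {test_rmse:.2f}, MAE: {test_mae:.2f}')
```

The RMSE comes out at roughly 88, somewhat above the MAE of about 76, which suggests that a number of the errors are noticeably larger than the typical one.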


In [53]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

plt.plot(train['DATE'], train['ADJ_CLOSE'], 'b-')
plt.plot(valid['DATE'], valid['ADJ_CLOSE'], 'y-')
plt.plot(test['DATE'], test['ADJ_CLOSE'], 'g-')
plt.plot(predit_test_df['DATE'], predit_test_df['PREDICT_VALUE'], 'r-')
plt.legend(['Train Data', 'Valid Data', 'Test Data', 'Test Data Predictions'])
plt.ylabel('USD', fontweight = 'bold')
plt.title('Adjusted Close Price and Test Predictions Over Time Without Scaling', fontsize = 14, fontweight = 'bold')
plt.xticks(rotation = 45)
plt.grid(True)

plt.xlabel('Date', fontweight = 'bold')

mplcursors.cursor(hover = True)

plt.show();


### Conclusion/Analysis

__Even with the tuned XGBoost model, I am still having difficulty producing accurate predictions. On the test set, the mean squared error is about 7769 and the mean absolute error is about 76. The predictions still capture some of the overall trend, but the accuracy is low. Tuning reduced the validation MSE from about 1867 to 942 and the validation MAE from about 37.8 to 25.9, but the much higher test error shows that this improvement did not carry over to the test data.__

__Next Steps__:

- Change the data that we are feeding into the model for better accuracy.
    - One inconsistency to fix is the OHLC data that is fed into the model; it should be removed if we want to focus only on Swine Flu cases and their effect on the stock market.
- Use news and Twitter sentiment analysis to see how news on Swine Flu can supplement the analysis and predictions of the machine learning model.
- Use hospitalization and death rates along with case counts to better account for the devastation of Swine Flu.
- Use 10-Q (quarterly reports), 10-K (annual reports), cash flow statements, and any other financial statements to get a better understanding of how the stay-at-home orders and the freeze in the economy have affected companies.
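As a starting point for the first item, the OHLC lags can be dropped by filtering the feature names on their prefix. This is a minimal sketch that assumes the case-count features all start with `CASES_LAG_`, as the names in the importance table suggest. The short `features_list` below is a hypothetical subset for illustration, not the notebook's full list.

```python
# Hypothetical subset of the notebook's feature names (illustrative only).
features_list = ['ADJ_CLOSE_LAG_1', 'CASES_LAG_21', 'CASES_LAG_5', 'OPEN_LAG_13',
                 'HIGH_LAG_4', 'LOW_LAG_13', 'OPEN_LAG_3']

# Keep only the lagged case-count features and drop the price-based lags.
cases_features = [name for name in features_list if name.startswith('CASES_LAG_')]

print(cases_features)
```

The resulting list could then be used in place of the full feature list when building the training, validation, and test matrices, so that the model sees only the Swine Flu case history.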
    
Although the model is not perfect, there are great opportunities to improve it and make it much more accurate. As more data is released to the public over the year, and with other data incorporated into the model, I believe we can be better prepared for future pandemics.