# Ebola Cases to Predict S&P 500

### Introduction

__With COVID-19 gripping everyone's lives and finances, we have all watched the financial markets dive and unemployment skyrocket. The question everyone has is when we will bounce back, and whether we are on that track yet.__

__More questions come from the volatility of the market, and the question on everyone's mind in the financial world is: have we hit rock bottom? To analyze this question, I thought it best to look back at past pandemics that gripped the world. Although the world has never seen such a deadly virus in modern history, I want to know whether something from the past can help us with our future.__

__I investigate Ebola, a rare and deadly disease in people and non-human primates. The Ebola virus is found mainly in sub-Saharan Africa, and people can contract it through direct contact with the blood, bodily fluids, and tissues of infected animals. It can also spread to other people through direct contact with anyone who is sick with or has died of Ebola. The United States of America faced the virus from 2014 to 2016.__

In this analysis I have used __XGBoost__ as my machine learning model and measured its accuracy with mean squared error and mean absolute error. __XGBoost__ is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the __Gradient Boosting__ framework and provides parallel tree boosting to solve data science problems in a fast and accurate way.

__Gradient boosting__ is a method that goes through cycles to iteratively add models into an ensemble. It begins by initializing the ensemble with a single model, whose predictions can be pretty naive. However, subsequent additions to the ensemble will address those errors.

__Steps of Gradient Boosting:__
    - First, use the current ensemble to generate predictions for each observation in the dataset. To make a prediction, we add up the predictions from all models in the ensemble.
    - The predictions are then used to calculate a loss function, which measures the error between the predictions and the actual values.
    - Next, we use the loss function to fit a new model that will be added to the ensemble. Specifically, the new model's parameters are chosen so that adding it to the ensemble reduces the loss.
    - Finally, we add the new model to the ensemble and repeat until the loss stops improving or the chosen number of cycles is reached.
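
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than what XGBoost does internally: it assumes squared-error loss (so the quantity each new model fits is simply the residual), uses one-split "stumps" as the weak models, and the names `fit_stump` and `gradient_boost` are made up for this example.

```python
# Minimal gradient boosting for squared-error loss using one-split "stumps".
# Illustrative only: XGBoost adds regularization, second-order gradients,
# column sampling, and much more.

def fit_stump(xs, residuals):
    """Find the single threshold on xs that best splits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Return a predict function built by repeatedly fitting stumps to residuals."""
    base = sum(ys) / len(ys)  # naive initial model: predict the mean
    stumps = []

    def predict(x):
        # The ensemble prediction is the sum of all member predictions
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_estimators):
        # For squared error, the negative gradient of the loss is the residual
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    """Mean squared error of a predict function on (xs, ys)."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

weak = gradient_boost(xs, ys, n_estimators=1)
strong = gradient_boost(xs, ys, n_estimators=50)
print(mse(weak, xs, ys), mse(strong, xs, ys))
```

Comparing the two printed errors shows the effect of adding more models to the ensemble: each cycle corrects part of what the previous cycles got wrong.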
    
__XGBoost Parameters:__
    - n_estimators -> Number of boosting rounds, i.e. how many times to go through the modeling cycle
    - max_depth -> Maximum depth of a tree; a large value can cause overfitting
    - learning_rate -> Shrinkage factor applied to each new tree's contribution; smaller values help avoid overfitting
    - subsample -> Fraction of the training instances sampled for each tree; sampling occurs once in every boosting iteration
    - colsample_bytree -> Subsample ratio of columns when constructing each tree
    - colsample_bylevel -> Subsample ratio of columns for each level
    - gamma -> Minimum loss reduction required to make a further partition on a leaf node of the tree

#### Import Necessary Libraries

In [1]:
%%capture
from tqdm.notebook import tqdm
tqdm().pandas()

%matplotlib notebook
import math
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import datetime as dt
from pylab import rcParams
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from tqdm.notebook import tqdm as tqdm_notebook
from xgboost import XGBRegressor
import matplotlib.animation as animation
from matplotlib import style
import matplotlib.dates as mdates
from sklearn.model_selection import train_test_split
import Common_Functions as cf
import time
import mplcursors

from matplotlib.backends.backend_agg import FigureCanvasAgg 
plt.get_backend()

test_size = 0.2
valid_size = 0.2
N = 21

n_estimators = 100
max_depth = 3
learning_rate = 0.1
min_child_weight = 1
subsample = 1
colsample_bytree = 1
colsample_bylevel = 1
gamma = 0
model_seed = 100

style.use('seaborn-darkgrid')

### Load and Format Dataset

The dataset has already been merged on the dates of the Ebola cases: the cases are summed per date and joined with that day's OHLC (open, high, low, close) data for the S&P 500. I have taken the dataset and formatted it for greater readability and analysis, and converted the date column to the datetime data type.

In [2]:
EBOLA_GSPC = pd.read_csv('EBOLA_GSPC.csv', parse_dates = True, index_col = 0)
EBOLA_GSPC.head()

Unnamed: 0,Date,value,Open,High,Low,Close,Adj Close,Volume
0,2014-10-03,1.0,1948.119995,1971.189941,1948.119995,1967.900024,1967.900024,3560970000
1,2014-10-08,1.0,1935.550049,1970.359985,1925.25,1968.890015,1968.890015,4441890000
2,2014-10-10,1.0,1925.630005,1936.97998,1906.050049,1906.130005,1906.130005,4550540000
3,2014-10-15,2.0,1874.180054,1874.180054,1820.660034,1862.48999,1862.48999,6090800000
4,2014-10-17,3.0,1864.910034,1898.160034,1864.910034,1886.76001,1886.76001,4482120000


In [3]:
EBOLA_GSPC.columns = [str(col).upper().replace(' ', '_') for col in EBOLA_GSPC.columns]
EBOLA_GSPC.head()

Unnamed: 0,DATE,VALUE,OPEN,HIGH,LOW,CLOSE,ADJ_CLOSE,VOLUME
0,2014-10-03,1.0,1948.119995,1971.189941,1948.119995,1967.900024,1967.900024,3560970000
1,2014-10-08,1.0,1935.550049,1970.359985,1925.25,1968.890015,1968.890015,4441890000
2,2014-10-10,1.0,1925.630005,1936.97998,1906.050049,1906.130005,1906.130005,4550540000
3,2014-10-15,2.0,1874.180054,1874.180054,1820.660034,1862.48999,1862.48999,6090800000
4,2014-10-17,3.0,1864.910034,1898.160034,1864.910034,1886.76001,1886.76001,4482120000


In [4]:
EBOLA_GSPC = EBOLA_GSPC.rename(columns = {'VALUE':'CASES'})
EBOLA_GSPC.head()

Unnamed: 0,DATE,CASES,OPEN,HIGH,LOW,CLOSE,ADJ_CLOSE,VOLUME
0,2014-10-03,1.0,1948.119995,1971.189941,1948.119995,1967.900024,1967.900024,3560970000
1,2014-10-08,1.0,1935.550049,1970.359985,1925.25,1968.890015,1968.890015,4441890000
2,2014-10-10,1.0,1925.630005,1936.97998,1906.050049,1906.130005,1906.130005,4550540000
3,2014-10-15,2.0,1874.180054,1874.180054,1820.660034,1862.48999,1862.48999,6090800000
4,2014-10-17,3.0,1864.910034,1898.160034,1864.910034,1886.76001,1886.76001,4482120000


In [5]:
EBOLA_GSPC.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 238 entries, 0 to 237
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   DATE       238 non-null    object 
 1   CASES      238 non-null    float64
 2   OPEN       238 non-null    float64
 3   HIGH       238 non-null    float64
 4   LOW        238 non-null    float64
 5   CLOSE      238 non-null    float64
 6   ADJ_CLOSE  238 non-null    float64
 7   VOLUME     238 non-null    int64  
dtypes: float64(6), int64(1), object(1)
memory usage: 16.7+ KB


In [6]:
# Assign the column directly; in newer pandas versions `.loc[:, 'DATE'] = ...` can keep the object dtype
EBOLA_GSPC['DATE'] = pd.to_datetime(EBOLA_GSPC['DATE'], format = '%Y-%m-%d')

In [7]:
EBOLA_GSPC.dtypes

DATE         datetime64[ns]
CASES               float64
OPEN                float64
HIGH                float64
LOW                 float64
CLOSE               float64
ADJ_CLOSE           float64
VOLUME                int64
dtype: object

### Graphing Adjusted Close and Volume of GSPC

I have graphed the adjusted close and volume of GSPC over time. We can see a significant drop in price in September of 2015, with an increase in trading volume during the same period as more people went short on the stocks of companies in the S&P 500.

In [8]:
rcParams['figure.figsize'] = 8, 8

plt.figure()
ax1 = plt.subplot2grid((6,1), (0,0), rowspan = 5, colspan = 1)
ax2 = plt.subplot2grid((6,1), (5,0), rowspan = 1, colspan = 1, sharex = ax1)

ax1.plot(EBOLA_GSPC['DATE'], EBOLA_GSPC['ADJ_CLOSE'], 'b-')
ax1.set_ylabel('USD', fontweight = 'bold')
ax1.set_title('Adjusted Close Price and Volume Over Time', fontsize = 14, fontweight = 'bold')
ax1.grid(True)

ax2.bar(EBOLA_GSPC['DATE'], EBOLA_GSPC['VOLUME'], color = 'blue')
ax2.grid(True)
ax2.set_ylabel('Volume', fontweight = 'bold')

plt.xlabel('Date', fontweight = 'bold')
plt.xticks(rotation = 45)

mplcursors.cursor(hover = True)

plt.show()

### Feature Engineering

For better analysis I have added the daily range between the high and low prices as well as the range between the open and close prices. Lag columns are created for the previous 21 observations (N = 21), giving the XGBoost model a constant feed of recent history for better predictions. The mean and standard deviation of each column are also calculated to help with scaling in the next steps.

In [9]:
EBOLA_GSPC['RANGE_HL'] = EBOLA_GSPC['HIGH'] - EBOLA_GSPC['LOW']
EBOLA_GSPC['RANGE_OC'] = EBOLA_GSPC['OPEN'] - EBOLA_GSPC['CLOSE']
EBOLA_GSPC.head()

Unnamed: 0,DATE,CASES,OPEN,HIGH,LOW,CLOSE,ADJ_CLOSE,VOLUME,RANGE_HL,RANGE_OC
0,2014-10-03,1.0,1948.119995,1971.189941,1948.119995,1967.900024,1967.900024,3560970000,23.069946,-19.780029
1,2014-10-08,1.0,1935.550049,1970.359985,1925.25,1968.890015,1968.890015,4441890000,45.109985,-33.339966
2,2014-10-10,1.0,1925.630005,1936.97998,1906.050049,1906.130005,1906.130005,4550540000,30.929931,19.5
3,2014-10-15,2.0,1874.180054,1874.180054,1820.660034,1862.48999,1862.48999,6090800000,53.52002,11.690064
4,2014-10-17,3.0,1864.910034,1898.160034,1864.910034,1886.76001,1886.76001,4482120000,33.25,-21.849976


In [10]:
EBOLA_GSPC['ORDER'] = range(len(EBOLA_GSPC))

merging_keys = ['ORDER']

lag_cols = ['CASES', 'OPEN', 'HIGH', 'LOW', 'CLOSE', 'ADJ_CLOSE', 'VOLUME', 'RANGE_HL', 'RANGE_OC']
lag_cols

['CASES',
 'OPEN',
 'HIGH',
 'LOW',
 'CLOSE',
 'ADJ_CLOSE',
 'VOLUME',
 'RANGE_HL',
 'RANGE_OC']

In [11]:
days_lagging = [day+1 for day in range(N)]

for lag in tqdm_notebook(days_lagging):
    train_lag = EBOLA_GSPC[merging_keys + lag_cols].copy()
    
    train_lag['ORDER'] = train_lag['ORDER'] + lag
    
    foo = lambda num: '{}_LAG_{}'.format(num, lag) if num in lag_cols else num
    train_lag = train_lag.rename(columns = foo)
    
    EBOLA_GSPC = pd.merge(EBOLA_GSPC, train_lag, on = merging_keys, how = 'left')
    
del train_lag

EBOLA_GSPC = EBOLA_GSPC[N:]

EBOLA_GSPC.head()

Unnamed: 0,DATE,CASES,OPEN,HIGH,LOW,CLOSE,ADJ_CLOSE,VOLUME,RANGE_HL,RANGE_OC,...,RANGE_OC_LAG_20,CASES_LAG_21,OPEN_LAG_21,HIGH_LAG_21,LOW_LAG_21,CLOSE_LAG_21,ADJ_CLOSE_LAG_21,VOLUME_LAG_21,RANGE_HL_LAG_21,RANGE_OC_LAG_21
21,2014-12-23,4.0,2081.47998,2086.72998,2079.77002,2082.169922,2082.169922,3043950000,6.95996,-0.689942,...,-33.339966,1.0,1948.119995,1971.189941,1948.119995,1967.900024,1967.900024,3560970000.0,23.069946,-19.780029
22,2014-12-24,4.0,2083.25,2087.560059,2081.860107,2081.879883,2081.879883,1416980000,5.699952,1.370117,...,19.5,1.0,1935.550049,1970.359985,1925.25,1968.890015,1968.890015,4441890000.0,45.109985,-33.339966
23,2014-12-26,4.0,2084.300049,2092.699951,2084.300049,2088.77002,2088.77002,1735230000,8.399902,-4.469971,...,11.690064,1.0,1925.630005,1936.97998,1906.050049,1906.130005,1906.130005,4550540000.0,30.929931,19.5
24,2014-12-29,4.0,2087.629883,2093.550049,2085.75,2090.570068,2090.570068,2452360000,7.800049,-2.940185,...,-21.849976,2.0,1874.180054,1874.180054,1820.660034,1862.48999,1862.48999,6090800000.0,53.52002,11.690064
25,2014-12-30,4.0,2088.48999,2088.48999,2079.530029,2080.350098,2080.350098,2440280000,8.959961,8.139892,...,14.180054,3.0,1864.910034,1898.160034,1864.910034,1886.76001,1886.76001,4482120000.0,33.25,-21.849976
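
The merge-based loop above can also be written with `Series.shift`. The sketch below is an alternative formulation, not the code used in this notebook; `add_lag_features` is a name made up for this example, and the toy values are rounded from the first rows of the dataset. Note that a lag counts rows, not calendar days, so gaps between dates are not taken into account.

```python
import pandas as pd

def add_lag_features(df, cols, n_lags):
    """Append <col>_LAG_<k> columns holding the value from k rows earlier."""
    lagged = {
        f"{col}_LAG_{k}": df[col].shift(k)
        for k in range(1, n_lags + 1)
        for col in cols
    }
    out = pd.concat([df, pd.DataFrame(lagged, index=df.index)], axis=1)
    # The first n_lags rows have incomplete history, so drop them
    return out.iloc[n_lags:]

toy = pd.DataFrame({"CASES": [1.0, 1.0, 1.0, 2.0, 3.0],
                    "ADJ_CLOSE": [1967.90, 1968.89, 1906.13, 1862.49, 1886.76]})
lagged = add_lag_features(toy, ["CASES", "ADJ_CLOSE"], n_lags=2)
print(lagged)
```

Building all lag columns first and concatenating once avoids repeated merges, which matters little at this size but keeps the frame from being rebuilt on every iteration.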


In [12]:
EBOLA_GSPC.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217 entries, 21 to 237
Columns: 200 entries, DATE to RANGE_OC_LAG_21
dtypes: datetime64[ns](1), float64(197), int64(2)
memory usage: 340.8 KB


In [13]:
for col in lag_cols:
    EBOLA_GSPC = cf.get_mov_avg_std(EBOLA_GSPC, col)
    
EBOLA_GSPC.head()

Unnamed: 0,DATE,CASES,OPEN,HIGH,LOW,CLOSE,ADJ_CLOSE,VOLUME,RANGE_HL,RANGE_OC,...,CLOSE_MEAN,CLOSE_STD,ADJ_CLOSE_MEAN,ADJ_CLOSE_STD,VOLUME_MEAN,VOLUME_STD,RANGE_HL_MEAN,RANGE_HL_STD,RANGE_OC_MEAN,RANGE_OC_STD
21,2014-12-23,4.0,2081.47998,2086.72998,2079.77002,2082.169922,2082.169922,3043950000,6.95996,-0.689942,...,,,,,,,,,,
22,2014-12-24,4.0,2083.25,2087.560059,2081.860107,2081.879883,2081.879883,1416980000,5.699952,1.370117,...,2082.169922,,2082.169922,,3043950000.0,,6.95996,,-0.689942,
23,2014-12-26,4.0,2084.300049,2092.699951,2084.300049,2088.77002,2088.77002,1735230000,8.399902,-4.469971,...,2082.024903,0.205089,2082.024903,0.205089,2230465000.0,1150442000.0,6.329956,0.89096,0.340087,1.456682
24,2014-12-29,4.0,2087.629883,2093.550049,2085.75,2090.570068,2090.570068,2452360000,7.800049,-2.940185,...,2084.273275,3.896995,2084.273275,3.896995,2065387000.0,862270500.0,7.019938,1.350974,-1.263265,2.961956
25,2014-12-30,4.0,2088.48999,2088.48999,2079.530029,2080.350098,2080.350098,2440280000,8.959961,8.139892,...,2085.847473,4.476246,2085.847473,4.476246,2162130000.0,730144300.0,7.214966,1.169999,-1.682495,2.559649
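
The helper `cf.get_mov_avg_std` lives in a separate module that is not shown here. Judging only from the output above (no mean in the first row, no standard deviation in the first two rows), it appears to compute a moving mean and sample standard deviation over earlier rows only. The sketch below is a hypothetical reconstruction under that assumption; the function name and the `window` default are guesses.

```python
import pandas as pd

def add_mov_avg_std(df, col, window=21):
    """Hypothetical stand-in for cf.get_mov_avg_std (an assumption, not the original)."""
    out = df.copy()
    rolling = out[col].rolling(window=window, min_periods=1)
    # shift(1) so that each row only sees values from earlier rows
    out[col + "_MEAN"] = rolling.mean().shift(1)
    out[col + "_STD"] = rolling.std().shift(1)
    return out

toy = pd.DataFrame({"ADJ_CLOSE": [2082.169922, 2081.879883, 2088.77002, 2090.570068]})
result = add_mov_avg_std(toy, "ADJ_CLOSE")
print(result)
```

Excluding the current row from its own statistics matters here: the mean and standard deviation are later used to scale features for predicting that same row, so including it would leak the target into the inputs.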


### Splitting Train, Validation, and Test Dataset

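I have used 60% of the data for training, the next 20% for validation and tuning, and the final 20% for testing the optimized XGBoost model. The split is chronological, so the model is always evaluated on dates later than those it was trained on.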

In [14]:
valid_len = int(valid_size * len(EBOLA_GSPC))
test_len = int(test_size * len(EBOLA_GSPC))
train_len = len(EBOLA_GSPC) - (valid_len + test_len)
print('Train Set Len: ' + str(train_len))
print('Test Set Len: ' + str(test_len))
print('Valid Set Len: ' + str(valid_len))

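# Copy each slice so that columns can be added later without a SettingWithCopyWarning
train = EBOLA_GSPC[:train_len].copy()
valid = EBOLA_GSPC[train_len:(train_len + valid_len)].copy()
test = EBOLA_GSPC[(train_len + valid_len):].copy()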
print('Train Set Shape: ' + str(train.shape))
print('Valid Set Shape: ' + str(valid.shape))
print('Test Set Shape: ' + str(test.shape))

Train Set Len: 131
Test Set Len: 43
Valid Set Len: 43
Train Set Shape: (131, 218)
Valid Set Shape: (43, 218)
Test Set Shape: (43, 218)
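
The split arithmetic can be checked in isolation. The sketch below repeats the same calculation on a plain list; the function names are made up for this example.

```python
def chronological_split_sizes(n_rows, valid_size=0.2, test_size=0.2):
    """Return (train_len, valid_len, test_len) for an ordered dataset."""
    valid_len = int(valid_size * n_rows)
    test_len = int(test_size * n_rows)
    train_len = n_rows - (valid_len + test_len)
    return train_len, valid_len, test_len

def chronological_split(rows, valid_size=0.2, test_size=0.2):
    """Slice rows into train, validation, and test sets without shuffling."""
    train_len, valid_len, _ = chronological_split_sizes(len(rows), valid_size, test_size)
    train_rows = rows[:train_len]
    valid_rows = rows[train_len:train_len + valid_len]
    test_rows = rows[train_len + valid_len:]
    return train_rows, valid_rows, test_rows

sizes = chronological_split_sizes(217)
train_rows, valid_rows, test_rows = chronological_split(list(range(217)))
print(sizes, train_rows[-1], valid_rows[0], test_rows[0])
```

Because `int()` truncates, any rows left over from rounding end up in the training set, which is why 217 rows give 131 for training rather than exactly 60%.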


### Scaling Data

When looking at time-series stock data, prices generally trend upward, so later dates tend to have higher prices than earlier ones. To level the playing field for analysis, I have scaled the data: the training set is standardized with a StandardScaler fit on the training rows, while each validation and test row is scaled using its own moving mean and standard deviation. Unnecessary columns were also removed, which improves readability while implementing XGBoost.

In [15]:
scaling_cols = ['ADJ_CLOSE']

for num in range(1, N+1):
    scaling_cols.append('CASES_LAG_'+str(num))
    scaling_cols.append('OPEN_LAG_'+str(num))
    scaling_cols.append('HIGH_LAG_'+str(num))
    scaling_cols.append('LOW_LAG_'+str(num))
    scaling_cols.append('ADJ_CLOSE_LAG_'+str(num))
    scaling_cols.append('VOLUME_LAG_'+str(num))
    scaling_cols.append('RANGE_HL_LAG_'+str(num))
    scaling_cols.append('RANGE_OC_LAG_'+str(num))
    
scaler = StandardScaler()
train_scale = scaler.fit_transform(train[scaling_cols])
train_df_scaled = pd.DataFrame(train_scale, columns = scaling_cols)
train_df_scaled[['DATE']] = train.reset_index()[['DATE']]
train_df_scaled

Unnamed: 0,ADJ_CLOSE,CASES_LAG_1,OPEN_LAG_1,HIGH_LAG_1,LOW_LAG_1,ADJ_CLOSE_LAG_1,VOLUME_LAG_1,RANGE_HL_LAG_1,RANGE_OC_LAG_1,CASES_LAG_2,...,RANGE_OC_LAG_20,CASES_LAG_21,OPEN_LAG_21,HIGH_LAG_21,LOW_LAG_21,ADJ_CLOSE_LAG_21,VOLUME_LAG_21,RANGE_HL_LAG_21,RANGE_OC_LAG_21,DATE
0,-0.101101,0.0,-0.512110,-0.546392,-0.183542,-0.216644,-0.129069,-1.056338,-0.640823,0.0,...,-2.213631,-5.895903,-2.460733,-2.329724,-2.139190,-2.104276,0.087716,0.357301,-1.285692,2014-12-23
1,-0.110109,0.0,-0.130785,-0.273415,0.118782,-0.103788,-0.720534,-1.334086,-0.057579,0.0,...,1.328659,-5.895903,-2.712124,-2.347392,-2.572128,-2.084195,1.510652,2.569295,-2.188959,2014-12-24
2,0.103897,0.0,-0.075461,-0.244984,0.179018,-0.112806,-3.676258,-1.472960,0.082621,0.0,...,0.805096,-5.895903,-2.910519,-3.057957,-2.935591,-3.357234,1.686152,1.146149,1.330865,2014-12-26
3,0.159806,0.0,-0.042641,-0.068939,0.249338,0.101413,-3.098092,-1.175381,-0.314832,0.0,...,-1.443364,-3.863368,-3.939487,-4.394788,-4.552059,-4.242438,4.174108,3.413347,0.810623,2014-12-29
4,-0.157624,0.0,0.061437,-0.039823,0.291126,0.157378,-1.795278,-1.241495,-0.210721,0.0,...,0.972020,-1.830833,-4.124881,-3.884323,-3.714389,-3.750139,1.575634,1.378997,-1.423577,2014-12-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
126,1.049354,0.0,1.287314,1.305202,1.383697,1.329806,-0.353705,-1.091588,-0.107939,0.0,...,-0.672434,0.201702,0.997963,0.872927,0.914852,0.777911,1.523445,-0.734643,0.769323,2015-07-21
127,0.892190,0.0,1.309192,1.156894,1.145642,1.047812,-0.175994,-0.658445,0.556969,0.0,...,-0.048306,0.201702,0.826767,1.048125,0.972591,1.038769,-0.769918,-0.214748,-0.657537,2015-07-22
128,0.519473,0.0,1.017256,0.815073,0.990016,0.890491,0.460543,-1.163245,0.265687,0.0,...,1.031668,0.201702,1.039959,1.008955,1.112485,1.066149,-0.671111,-1.141092,-0.037367,2015-07-23
129,-0.179372,0.0,0.890666,0.758905,0.662327,0.517403,0.603590,-0.090814,0.806729,0.0,...,0.534251,0.201702,1.049758,0.946585,0.898385,0.749312,-0.652875,-0.300066,1.035758,2015-07-24


In [16]:
valid_df_scaled = valid[['DATE']]
for col in tqdm_notebook(lag_cols):
    feat_list = [col + '_LAG_' + str(shift) for shift in range(1, N+1)]
    temp = valid.apply(lambda row: cf.scale_row(row[feat_list], row[col+'_MEAN'], row[col+'_STD']), axis = 1)
    valid_df_scaled = pd.concat([valid_df_scaled, temp], axis = 1)
        
    
valid_df_scaled.head()

Unnamed: 0,DATE,CASES_LAG_1,CASES_LAG_2,CASES_LAG_3,CASES_LAG_4,CASES_LAG_5,CASES_LAG_6,CASES_LAG_7,CASES_LAG_8,CASES_LAG_9,...,RANGE_OC_LAG_12,RANGE_OC_LAG_13,RANGE_OC_LAG_14,RANGE_OC_LAG_15,RANGE_OC_LAG_16,RANGE_OC_LAG_17,RANGE_OC_LAG_18,RANGE_OC_LAG_19,RANGE_OC_LAG_20,RANGE_OC_LAG_21
152,2015-07-28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-5.141856,-1.901872,2.828776,-3.389644,-0.918253,-1.490688,-3.186217,-1.951283,4.283152,-1.508105
153,2015-07-29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.47186,-1.726508,-0.408974,1.514737,-1.013974,-0.008986,-0.241766,-0.93125,-0.429067,2.106156
154,2015-07-30,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.577198,-1.12046,-1.346908,-0.175276,1.535405,-0.71328,0.180417,-0.026585,-0.639717,-0.193144
155,2015-07-31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.144643,-0.452423,-1.021948,-1.259343,-0.031071,1.762309,-0.595083,0.341817,0.124808,-0.517964
156,2015-08-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.69247,0.396839,-0.37242,-1.106197,-1.412057,0.170449,2.481042,-0.556224,0.650879,0.371284


In [17]:
test_df_scaled = test[['DATE']]
for col in tqdm_notebook(lag_cols):
    feat_list = [col + '_LAG_' + str(shift) for shift in range(1, N+1)]
    temp = test.apply(lambda row: cf.scale_row(row[feat_list], row[col+'_MEAN'], row[col+'_STD']), axis = 1)
    test_df_scaled = pd.concat([test_df_scaled, temp], axis = 1)
        
    
test_df_scaled.head()

Unnamed: 0,DATE,CASES_LAG_1,CASES_LAG_2,CASES_LAG_3,CASES_LAG_4,CASES_LAG_5,CASES_LAG_6,CASES_LAG_7,CASES_LAG_8,CASES_LAG_9,...,RANGE_OC_LAG_12,RANGE_OC_LAG_13,RANGE_OC_LAG_14,RANGE_OC_LAG_15,RANGE_OC_LAG_16,RANGE_OC_LAG_17,RANGE_OC_LAG_18,RANGE_OC_LAG_19,RANGE_OC_LAG_20,RANGE_OC_LAG_21
195,2015-09-28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.746387,2.76347,-5.278032,2.440789,-0.581525,-4.179518,5.78017,1.092654,-0.859242,-5.590611
196,2015-09-29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.326931,-1.384989,0.732071,-3.042848,0.580595,-0.838169,-2.527173,2.1482,-0.05226,-0.968537
197,2015-09-30,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.066004,-1.031462,-1.085561,0.887147,-2.630378,0.745999,-0.576026,-2.149864,2.206716,0.156296
198,2015-10-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.938119,0.213944,-0.470948,-0.509326,0.890105,-1.605212,0.789975,-0.147864,-1.264338,1.826201
199,2015-10-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.686574,-0.883823,0.258848,-0.42046,-0.458525,0.929496,-1.545477,0.830183,-0.10001,-1.207382
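
The validation and test rows above are scaled row by row through `cf.scale_row`, whose source is not shown in this notebook. A reasonable guess is a z-score using the mean and standard deviation passed in for that row, with predictions later mapped back by the inverse formula. The sketch below is written under that assumption; the zero-deviation guard is also an assumption.

```python
def scale_row(values, mean, std):
    """Hypothetical stand-in for cf.scale_row: z-score values with a given mean and std."""
    if std == 0:
        # Assumption: a constant window maps to zeros instead of dividing by zero
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]

def unscale(scaled_value, mean, std):
    """Invert the z-score, as is done when converting predictions back to prices."""
    return scaled_value * std + mean

lags = [2080.0, 2090.0, 2100.0]
mean, std = 2090.0, 10.0
scaled = scale_row(lags, mean, std)
restored = [unscale(value, mean, std) for value in scaled]
print(scaled, restored)
```

Scaling each row by its own recent history means the model sees prices relative to their local level, so a row from late 2015 looks comparable to a row from early 2015 even if the index has moved.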


### Build XGBoost

I have made the list of features that will be fed into the machine learning model. I also built the X and y data for the train, validation, and test datasets, for both the scaled and unscaled data. I then ran the model on the train and validation datasets.

In [18]:
features_list = []
for num in range(1, N+1):
    features_list.append('CASES_LAG_'+str(num))
    features_list.append('OPEN_LAG_'+str(num))
    features_list.append('HIGH_LAG_'+str(num))
    features_list.append('LOW_LAG_'+str(num))
    features_list.append('ADJ_CLOSE_LAG_'+str(num))
    features_list.append('VOLUME_LAG_'+str(num))
    features_list.append('RANGE_HL_LAG_'+str(num))
    features_list.append('RANGE_OC_LAG_'+str(num))
    
target = 'ADJ_CLOSE'

X_train = train[features_list]
y_train = train[target]
X_valid = valid[features_list]
y_valid = valid[target]
X_test = test[features_list]
y_test = test[target]

In [19]:
X_train_scaled = train_df_scaled[features_list]
y_train_scaled = train_df_scaled[target]
X_valid_scaled = valid_df_scaled[features_list]
X_test_scaled = test_df_scaled[features_list]

In [20]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

plt.plot(train['DATE'], train['ADJ_CLOSE'], 'b-')
plt.plot(valid['DATE'], valid['ADJ_CLOSE'], 'y-')
plt.plot(test['DATE'], test['ADJ_CLOSE'], 'g-')
plt.legend(['Train Data', 'Valid Data', 'Test Data'])
plt.ylabel('USD', fontweight = 'bold')
plt.title('Adjusted Close Price Over Time Without Scaling', fontsize = 14, fontweight = 'bold')
plt.grid(True)

plt.xticks(rotation = 45)

plt.xlabel('Date', fontweight = 'bold')

mplcursors.cursor(hover = True)

plt.show()

In [21]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

plt.plot(train_df_scaled['DATE'], train_df_scaled['ADJ_CLOSE'], 'b-')
plt.legend(['Train Data'])
plt.ylabel('Scaled Adjusted Close', fontweight = 'bold')
plt.title('Adjusted Close Price Over Time With Scaling', fontsize = 14, fontweight = 'bold')
plt.grid(True)

plt.xlabel('Date', fontweight = 'bold')
plt.xticks(rotation = 45)

mplcursors.cursor(hover = True)

plt.show()

In [22]:
model = XGBRegressor(seed = model_seed,
                    n_estimators = n_estimators,
                    max_depth = max_depth,
                    learning_rate = learning_rate,
                    min_child_weight = min_child_weight,
                    subsample = subsample,
                    colsample_bytree = colsample_bytree,
                    colsample_bylevel = colsample_bylevel,
                    gamma = gamma)

model.fit(X_train_scaled, y_train_scaled)

XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints=None,
             learning_rate=0.1, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=100, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, seed=100, subsample=1,
             tree_method=None, validate_parameters=False, verbosity=None)

In [23]:
predict_train_scaled = model.predict(X_train_scaled)
predict_train = predict_train_scaled * math.sqrt(scaler.var_[0]) + scaler.mean_[0]

predit_train_df = pd.DataFrame({'PREDICT_VALUE':predict_train,
                               'DATE':train['DATE']})

print('MSE on Train Set =', mean_squared_error(y_train, predict_train))
print('MAE on Train Set =', mean_absolute_error(y_train, predict_train))

MSE on Train Set = 0.4637238731847599
MAE on Train Set = 0.5192833699904496


In [24]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

plt.plot(train['DATE'], train['ADJ_CLOSE'], 'b-')
plt.plot(valid['DATE'], valid['ADJ_CLOSE'], 'y-')
plt.plot(test['DATE'], test['ADJ_CLOSE'], 'g-')
plt.plot(predit_train_df['DATE'], predit_train_df['PREDICT_VALUE'], 'r-')
plt.legend(['Train Data', 'Valid Data', 'Test Data', 'Train Data Predictions'])
plt.ylabel('USD', fontweight = 'bold')
plt.title('Adjusted Close Price Over Time Without Scaling', fontsize = 14, fontweight = 'bold')
plt.grid(True)

plt.xlabel('Date', fontweight = 'bold')
plt.xticks(rotation = 45)

mplcursors.cursor(hover = True)

plt.show()

In [25]:
predict_valid_scaled = model.predict(X_valid_scaled)
valid['PREDICT_VALID'] = predict_valid_scaled
valid['PREDICT'] = valid['PREDICT_VALID'] * valid['ADJ_CLOSE_STD'] + valid['ADJ_CLOSE_MEAN']

predit_valid_df = pd.DataFrame({'PREDICT_VALUE':valid['PREDICT'],
                               'Y_VALID':y_valid,
                               'DATE':valid['DATE']})

MSE_Original = mean_squared_error(y_valid, valid['PREDICT'])
MAE_Original = mean_absolute_error(y_valid, valid['PREDICT'])

print('MSE on Validation Set =', mean_squared_error(y_valid, valid['PREDICT']))
print('MAE on Validation Set =', mean_absolute_error(y_valid, valid['PREDICT']))

MSE on Validation Set = 1338.7060567923972
MAE on Validation Set = 25.111605657229084
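
The two metrics react differently to large misses, which is why they can disagree. The sketch below computes both by hand on made-up numbers; `mse` and `mae` are local helper names, not the scikit-learn functions used above.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up prices: one exact prediction, one small miss, one large miss
y_true = [2000.0, 2010.0, 2020.0]
y_pred = [1990.0, 2010.0, 2050.0]

print(mse(y_true, y_pred))
print(mae(y_true, y_pred))
```

MSE is expressed in squared units, so the validation MSE of about 1338.7 corresponds to a root mean squared error of roughly 36.6 index points, noticeably above the MAE of about 25.1. That gap indicates a few large errors rather than uniformly sized ones.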


In [26]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

plt.plot(train['DATE'], train['ADJ_CLOSE'], 'b-')
plt.plot(valid['DATE'], valid['ADJ_CLOSE'], 'y-')
plt.plot(test['DATE'], test['ADJ_CLOSE'], 'g-')
plt.plot(predit_valid_df['DATE'], predit_valid_df['PREDICT_VALUE'], 'r-')
plt.legend(['Train Data', 'Valid Data', 'Test Data', 'Valid Data Predictions'])
plt.ylabel('USD', fontweight = 'bold')
plt.title('Adjusted Close Price Over Time Without Scaling', fontsize = 14, fontweight = 'bold')
plt.grid(True)

plt.xlabel('Date', fontweight = 'bold')
plt.xticks(rotation = 45)

mplcursors.cursor(hover = True)

plt.show();

### Preliminary Analysis

The predictions are clearly off, but with a much lower error than on the COVID-19 data. This suggests that the high error on COVID-19 is due to the small amount of data available for that analysis. The gap between the training and validation errors (MAE of about 0.5 versus about 25) shows there is still room for improvement, but the positive aspect is that the model is able to follow the trend and, to a large degree, the actual prices. To make the model better, I next tuned the hyperparameters of the machine learning model.

### Tuning of XGBoost

First I needed to see which features are the most important, or the most influential, for the machine learning model. I then took two XGBoost parameters at a time, manually optimized the model over them, and stored the optimum values to use in further tuning of the machine learning model.

- XGBoost Parameters:
    - n_estimators -> Number of boosting rounds, i.e. how many times to go through the modeling cycle
    - max_depth -> Maximum depth of a tree; a large value can cause overfitting
    - learning_rate -> Shrinkage factor applied to each new tree's contribution; smaller values help avoid overfitting
    - subsample -> Fraction of the training instances sampled for each tree; sampling occurs once in every boosting iteration
    - colsample_bytree -> Subsample ratio of columns when constructing each tree
    - colsample_bylevel -> Subsample ratio of columns for each level
    - gamma -> Minimum loss reduction required to make a further partition on a leaf node of the tree

In [27]:
imp = list(zip(train[features_list], model.feature_importances_))
imp.sort(key = lambda tup: tup[1])
imp[-10:]

[('RANGE_OC_LAG_11', 0.01431336),
 ('LOW_LAG_6', 0.014691141),
 ('RANGE_HL_LAG_18', 0.014848984),
 ('HIGH_LAG_8', 0.017172486),
 ('OPEN_LAG_2', 0.017911443),
 ('RANGE_OC_LAG_9', 0.02093333),
 ('HIGH_LAG_16', 0.031234646),
 ('LOW_LAG_7', 0.057078507),
 ('LOW_LAG_1', 0.2127553),
 ('ADJ_CLOSE_LAG_1', 0.24982767)]

In [28]:
imp = list(zip(train[features_list], model.feature_importances_))
imp.sort(key = lambda tup: tup[1])

# Keep the ten most important features, ordered from least to most important
imp_10 = imp[-10:]

imp_df = pd.DataFrame(imp_10, columns = ['FEATURE', 'IMPORTANCE'])

imp_df

Unnamed: 0,FEATURE,IMPORTANCE
0,RANGE_OC_LAG_11,0.014313
1,LOW_LAG_6,0.014691
2,RANGE_HL_LAG_18,0.014849
3,HIGH_LAG_8,0.017172
4,OPEN_LAG_2,0.017911
5,RANGE_OC_LAG_9,0.020933
6,HIGH_LAG_16,0.031235
7,LOW_LAG_7,0.057079
8,LOW_LAG_1,0.212755
9,ADJ_CLOSE_LAG_1,0.249828


In [29]:
rcParams['figure.figsize'] = 10, 5

plt.figure()

plt.barh(imp_df['FEATURE'], imp_df['IMPORTANCE'], color = 'blue')
plt.legend(['Feature Importance'])
plt.ylabel('Feature')
plt.xlabel('Importance')
plt.title('Importance of Features Used to Train and Predict Stock Value')

mplcursors.cursor(hover = True)

plt.show()

In [30]:
n_estimators_parameter = 'n_estimators'
n_estimators_list = range(10, 500, 10)

max_depth_parameter = 'max_depth'
max_depth_list = [2, 3, 4, 5, 6, 7, 8, 9]

error_rates = {n_estimators_parameter: [], max_depth_parameter: [], 'MSE':[], 'MAE':[]}
start = time.time()

for param in tqdm_notebook(n_estimators_list):
    for param2 in max_depth_list:
        mse, mae, predicted_values = cf.train_pred_eval_model(X_train_scaled,
                                                             y_train_scaled,
                                                             X_valid_scaled,
                                                             y_valid,
                                                             valid['ADJ_CLOSE_MEAN'],
                                                             valid['ADJ_CLOSE_STD'],
                                                             seed = model_seed,
                                                             n_estimators = param,
                                                             max_depth = param2,
                                                             learning_rate = learning_rate,
                                                             min_child_weight = min_child_weight,
                                                             subsample = subsample,
                                                             colsample_bytree = colsample_bytree,
                                                             colsample_bylevel = colsample_bylevel,
                                                             gamma = gamma)
        error_rates[n_estimators_parameter].append(param)
        error_rates[max_depth_parameter].append(param2)
        error_rates['MSE'].append(mse)
        error_rates['MAE'].append(mae)
        
error_rate_df1 = pd.DataFrame(error_rates)
end = time.time()
print('Minutes Taken to Execute: ' + str((end - start)/60.0))
error_rate_df1

Minutes Taken to Execute: 4.996700926621755


| | n_estimators | max_depth | MSE | MAE |
|---|---|---|---|---|
| 0 | 10 | 2 | 1294.268421 | 22.473907 |
| 1 | 10 | 3 | 1378.487886 | 23.210359 |
| 2 | 10 | 4 | 1330.312078 | 23.449485 |
| 3 | 10 | 5 | 1380.079766 | 23.427015 |
| 4 | 10 | 6 | 1369.773118 | 23.430599 |
| ... | ... | ... | ... | ... |
| 387 | 490 | 5 | 1202.279360 | 24.111151 |
| 388 | 490 | 6 | 1255.746703 | 24.462287 |
| 389 | 490 | 7 | 1203.823247 | 24.093558 |
| 390 | 490 | 8 | 1261.730699 | 24.555656 |


In [31]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

legend_list = []

for p in range(len(max_depth_list)):
    param = error_rate_df1[error_rate_df1[max_depth_parameter] == max_depth_list[p]]
    plt.plot(param[n_estimators_parameter], param['MAE'], marker = 's',)
    legend_list.append(max_depth_parameter + '_' + str(max_depth_list[p]))
    
plt.xlabel(n_estimators_parameter, fontweight = 'bold')
plt.ylabel('MAE', fontweight = 'bold')
plt.title('n_estimators and max_depth Parameter Tuning', fontsize = 14, fontweight = 'bold')
plt.legend(legend_list, loc = 'upper right')

mplcursors.cursor(hover = True)

plt.show();

In [32]:
optimum = error_rate_df1[error_rate_df1['MSE'] == error_rate_df1['MSE'].min()]

print('Minimum MSE Value:', error_rate_df1['MSE'].min())
print('Optimum n_estimators Parameter:', optimum['n_estimators'].values[0])
print('Optimum max_depth Parameter:', optimum['max_depth'].values[0])

Minimum MSE Value: 1160.6852240107796
Optimum n_estimators Parameter: 30
Optimum max_depth Parameter: 8


In [33]:
optimum = error_rate_df1[error_rate_df1['MAE'] == error_rate_df1['MAE'].min()]

n_estimators_optimum = optimum['n_estimators'].values[0]
max_depth_optimum = optimum['max_depth'].values[0]

print('Minimum MAE Value:', error_rate_df1['MAE'].min())
print('Optimum n_estimators Parameter:', optimum['n_estimators'].values[0])
print('Optimum max_depth Parameter:', optimum['max_depth'].values[0])

Minimum MAE Value: 22.269726428987518
Optimum n_estimators Parameter: 20
Optimum max_depth Parameter: 2
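
The two cells above pick different optima: the lowest MSE is at n_estimators = 30 and max_depth = 8, while the lowest MAE is at n_estimators = 20 and max_depth = 2. One likely reason is that MSE squares each error, so a few large misses weigh far more than they do under MAE. The sketch below uses made-up numbers (not data from this analysis) and two hypothetical prediction sets, `pred_a` and `pred_b`, to illustrate how the two metrics can rank models differently.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only; these are assumptions, not S&P 500 data.
y_true = [100.0, 100.0, 100.0, 100.0]
pred_a = [110.0, 110.0, 110.0, 110.0]  # consistently off by 10
pred_b = [100.0, 100.0, 100.0, 130.0]  # exact three times, off by 30 once

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(f"Model {name}: MSE = {mse(y_true, pred)}, MAE = {mae(y_true, pred)}")

# MAE prefers model B (smaller average miss), while MSE prefers model A
# (no single large miss), so the "optimum" depends on the chosen metric.
```

Since the later cells carry forward the MAE optimum, the tuning in this notebook is effectively minimizing the typical miss rather than penalizing the largest misses.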


In [34]:
learning_rate_parameter = 'learning_rate'
learning_rate_list = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3]

min_child_weight_parameter = 'min_child_weight'
min_child_weight_list = range(5, 21, 1)

error_rates = {learning_rate_parameter: [], min_child_weight_parameter: [], 'MSE':[], 'MAE':[]}
start = time.time()

for param in tqdm_notebook(learning_rate_list):
    for param2 in min_child_weight_list:
        mse, mae, predicted_values = cf.train_pred_eval_model(X_train_scaled,
                                                             y_train_scaled,
                                                             X_valid_scaled,
                                                             y_valid,
                                                             valid['ADJ_CLOSE_MEAN'],
                                                             valid['ADJ_CLOSE_STD'],
                                                             seed = model_seed,
                                                             n_estimators = n_estimators_optimum,
                                                             max_depth = max_depth_optimum,
                                                             learning_rate = param,
                                                             min_child_weight = param2,
                                                             subsample = subsample,
                                                             colsample_bytree = colsample_bytree,
                                                             colsample_bylevel = colsample_bylevel,
                                                             gamma = gamma)
        error_rates[learning_rate_parameter].append(param)
        error_rates[min_child_weight_parameter].append(param2)
        error_rates['MSE'].append(mse)
        error_rates['MAE'].append(mae)
        
error_rate_df2 = pd.DataFrame(error_rates)
end = time.time()
print('Minutes Taken to Execute: ' + str((end - start)/60.0))
error_rate_df2

Minutes Taken to Execute: 0.18971495628356932


| | learning_rate | min_child_weight | MSE | MAE |
|---|---|---|---|---|
| 0 | 0.001 | 5 | 2665.785207 | 30.498422 |
| 1 | 0.001 | 6 | 2665.785207 | 30.498422 |
| 2 | 0.001 | 7 | 2665.785207 | 30.498422 |
| 3 | 0.001 | 8 | 2665.785207 | 30.498422 |
| 4 | 0.001 | 9 | 2665.785207 | 30.498422 |
| ... | ... | ... | ... | ... |
| 107 | 0.300 | 16 | 907.963192 | 21.266820 |
| 108 | 0.300 | 17 | 921.598224 | 22.068870 |
| 109 | 0.300 | 18 | 1063.780899 | 22.175666 |
| 110 | 0.300 | 19 | 1004.293272 | 21.958988 |


In [35]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

legend_list = []

for p in range(len(min_child_weight_list)):
    param = error_rate_df2[error_rate_df2[min_child_weight_parameter] == min_child_weight_list[p]]
    plt.plot(param[learning_rate_parameter], param['MAE'], marker = 's',)
    legend_list.append(min_child_weight_parameter + '_' + str(min_child_weight_list[p]))
    
plt.xlabel(learning_rate_parameter, fontweight = 'bold')
plt.ylabel('MAE', fontweight = 'bold')
plt.title('learning_rate and min_child_weight Parameter Tuning', fontsize = 10, fontweight = 'bold')
plt.legend(legend_list, loc = 'upper right', bbox_to_anchor = (1.1, 1.1))

mplcursors.cursor(hover = True)

plt.show();

In [36]:
optimum = error_rate_df2[error_rate_df2['MSE'] == error_rate_df2['MSE'].min()]

print('Minimum MSE Value:', error_rate_df2['MSE'].min())
print('Optimum learning_rate Parameter:', optimum['learning_rate'].values[0])
print('Optimum min_child_weight Parameter:', optimum['min_child_weight'].values[0])

Minimum MSE Value: 762.9703193211196
Optimum learning_rate Parameter: 0.2
Optimum min_child_weight Parameter: 16


In [37]:
optimum = error_rate_df2[error_rate_df2['MAE'] == error_rate_df2['MAE'].min()]

learning_rate_optimum = optimum['learning_rate'].values[0]
min_child_weight_optimum = optimum['min_child_weight'].values[0]

print('Minimum MAE Value:', error_rate_df2['MAE'].min())
print('Optimum learning_rate Parameter:', optimum['learning_rate'].values[0])
print('Optimum min_child_weight Parameter:', optimum['min_child_weight'].values[0])

Minimum MAE Value: 19.519763135058824
Optimum learning_rate Parameter: 0.1
Optimum min_child_weight Parameter: 16


In [38]:
subsample_parameter = 'subsample'
subsample_list = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

gamma_parameter = 'gamma'
gamma_list = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

error_rates = {subsample_parameter: [], gamma_parameter: [], 'MSE':[], 'MAE':[]}
start = time.time()

for param in tqdm_notebook(subsample_list):
    for param2 in gamma_list:
        mse, mae, predicted_values = cf.train_pred_eval_model(X_train_scaled,
                                                             y_train_scaled,
                                                             X_valid_scaled,
                                                             y_valid,
                                                             valid['ADJ_CLOSE_MEAN'],
                                                             valid['ADJ_CLOSE_STD'],
                                                             seed = model_seed,
                                                             n_estimators = n_estimators_optimum,
                                                             max_depth = max_depth_optimum,
                                                             learning_rate = learning_rate_optimum,
                                                             min_child_weight = min_child_weight_optimum,
                                                             subsample = param,
                                                             colsample_bytree = colsample_bytree,
                                                             colsample_bylevel = colsample_bylevel,
                                                             gamma = param2)
        error_rates[subsample_parameter].append(param)
        error_rates[gamma_parameter].append(param2)
        error_rates['MSE'].append(mse)
        error_rates['MAE'].append(mae)
        
error_rate_df3 = pd.DataFrame(error_rates)
end = time.time()
print('Minutes Taken to Execute: ' + str((end - start)/60.0))
error_rate_df3

Minutes Taken to Execute: 0.13516678810119628


| | subsample | gamma | MSE | MAE |
|---|---|---|---|---|
| 0 | 0.1 | 0.1 | 2086.478097 | 27.995020 |
| 1 | 0.1 | 0.2 | 2086.478097 | 27.995020 |
| 2 | 0.1 | 0.3 | 2086.478097 | 27.995020 |
| 3 | 0.1 | 0.4 | 2086.478097 | 27.995020 |
| 4 | 0.1 | 0.5 | 2086.478097 | 27.995020 |
| ... | ... | ... | ... | ... |
| 95 | 1.0 | 0.6 | 774.144662 | 19.519763 |
| 96 | 1.0 | 0.7 | 774.144662 | 19.519763 |
| 97 | 1.0 | 0.8 | 774.144662 | 19.519763 |
| 98 | 1.0 | 0.9 | 774.144662 | 19.519763 |


In [39]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

legend_list = []

for p in range(len(gamma_list)):
    param = error_rate_df3[error_rate_df3[gamma_parameter] == gamma_list[p]]
    plt.plot(param[subsample_parameter], param['MAE'], marker = 's',)
    legend_list.append(gamma_parameter + '_' + str(gamma_list[p]))
    
plt.xlabel(subsample_parameter, fontweight = 'bold')
plt.ylabel('MAE', fontweight = 'bold')
plt.title('subsample and gamma Parameter Tuning', fontsize = 14, fontweight = 'bold')
plt.legend(legend_list, loc = 'upper right')

mplcursors.cursor(hover = True)

plt.show();

In [40]:
optimum = error_rate_df3[error_rate_df3['MSE'] == error_rate_df3['MSE'].min()]

print('Minimum MSE Value:', error_rate_df3['MSE'].min())
print('Optimum subsample Parameter:', optimum['subsample'].values[0])
print('Optimum gamma Parameter:', optimum['gamma'].values[0])

Minimum MSE Value: 774.1446622198038
Optimum subsample Parameter: 1.0
Optimum gamma Parameter: 0.1


In [41]:
optimum = error_rate_df3[error_rate_df3['MAE'] == error_rate_df3['MAE'].min()]

subsample_optimum = optimum['subsample'].values[0]
gamma_optimum = optimum['gamma'].values[0]

print('Minimum MAE Value:', error_rate_df3['MAE'].min())
print('Optimum subsample Parameter:', optimum['subsample'].values[0])
print('Optimum gamma Parameter:', optimum['gamma'].values[0])

Minimum MAE Value: 19.519763135058824
Optimum subsample Parameter: 1.0
Optimum gamma Parameter: 0.1


In [42]:
colsample_bytree_parameter = 'colsample_bytree'
colsample_bytree_list = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

colsample_bylevel_parameter = 'colsample_bylevel'
colsample_bylevel_list = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

error_rates = {colsample_bytree_parameter: [], colsample_bylevel_parameter: [], 'MSE':[], 'MAE':[]}
start = time.time()

for param in tqdm_notebook(colsample_bytree_list):
    for param2 in colsample_bylevel_list:
        mse, mae, predicted_values = cf.train_pred_eval_model(X_train_scaled,
                                                             y_train_scaled,
                                                             X_valid_scaled,
                                                             y_valid,
                                                             valid['ADJ_CLOSE_MEAN'],
                                                             valid['ADJ_CLOSE_STD'],
                                                             seed = model_seed,
                                                             n_estimators = n_estimators_optimum,
                                                             max_depth = max_depth_optimum,
                                                             learning_rate = learning_rate_optimum,
                                                             min_child_weight = min_child_weight_optimum,
                                                             subsample = subsample_optimum,
                                                             colsample_bytree = param,
                                                             colsample_bylevel = param2,
                                                             gamma = gamma_optimum)
        error_rates[colsample_bytree_parameter].append(param)
        error_rates[colsample_bylevel_parameter].append(param2)
        error_rates['MSE'].append(mse)
        error_rates['MAE'].append(mae)
        
error_rate_df4 = pd.DataFrame(error_rates)
end = time.time()
print('Minutes Taken to Execute: ' + str((end - start)/60.0))
error_rate_df4

Minutes Taken to Execute: 0.056838361422220866


| | colsample_bytree | colsample_bylevel | MSE | MAE |
|---|---|---|---|---|
| 0 | 0.5 | 0.5 | 1152.492836 | 22.293075 |
| 1 | 0.5 | 0.6 | 1052.293998 | 21.489352 |
| 2 | 0.5 | 0.7 | 1120.720562 | 22.122107 |
| 3 | 0.5 | 0.8 | 900.444645 | 20.277486 |
| 4 | 0.5 | 0.9 | 911.052249 | 20.71868 |
| 5 | 0.5 | 1.0 | 910.774839 | 20.835464 |
| 6 | 0.6 | 0.5 | 1133.949257 | 21.840435 |
| 7 | 0.6 | 0.6 | 1022.532009 | 21.599109 |
| 8 | 0.6 | 0.7 | 969.698089 | 20.979934 |
| 9 | 0.6 | 0.8 | 945.023818 | 21.223766 |


In [43]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

legend_list = []

for p in range(len(colsample_bylevel_list)):
    param = error_rate_df4[error_rate_df4[colsample_bylevel_parameter] == colsample_bylevel_list[p]]
    plt.plot(param[colsample_bytree_parameter], param['MAE'], marker = 's',)
    legend_list.append(colsample_bylevel_parameter + '_' + str(colsample_bylevel_list[p]))
    
plt.xlabel(colsample_bytree_parameter, fontweight = 'bold')
plt.ylabel('MAE', fontweight = 'bold')
plt.title('colsample_bytree and colsample_bylevel Parameter Tuning', fontsize = 14, fontweight = 'bold')
plt.legend(legend_list, loc = 'upper right')

mplcursors.cursor(hover = True)

plt.show();

In [44]:
optimum = error_rate_df4[error_rate_df4['MSE'] == error_rate_df4['MSE'].min()]

MSE_Optimum = error_rate_df4['MSE'].min()

print('Minimum MSE Value:', error_rate_df4['MSE'].min())
print('Optimum colsample_bytree Parameter:', optimum['colsample_bytree'].values[0])
print('Optimum colsample_bylevel Parameter:', optimum['colsample_bylevel'].values[0])

Minimum MSE Value: 774.1446622198038
Optimum colsample_bytree Parameter: 1.0
Optimum colsample_bylevel Parameter: 1.0


In [45]:
optimum = error_rate_df4[error_rate_df4['MAE'] == error_rate_df4['MAE'].min()]

MAE_Optimum = error_rate_df4['MAE'].min()

colsample_bytree_optimum = optimum['colsample_bytree'].values[0]
colsample_bylevel_optimum = optimum['colsample_bylevel'].values[0]

print('Minimum MAE Value:', error_rate_df4['MAE'].min())
print('Optimum colsample_bytree Parameter:', optimum['colsample_bytree'].values[0])
print('Optimum colsample_bylevel Parameter:', optimum['colsample_bylevel'].values[0])

Minimum MAE Value: 19.294963433367688
Optimum colsample_bytree Parameter: 0.9
Optimum colsample_bylevel Parameter: 0.8
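
The four tuning cells above repeat the same pattern: loop over two parameter lists, evaluate the model for every pair, collect the MSE and MAE, and keep the pair with the lowest MAE. That pattern could be factored into one helper, sketched below under some assumptions. `grid_search_pair` and `toy_evaluate` are hypothetical names that do not exist in this notebook, and `toy_evaluate` is only a stand-in that returns made-up errors; in practice it would be a thin wrapper around `cf.train_pred_eval_model` that returns `(mse, mae)`.

```python
from itertools import product


def grid_search_pair(evaluate, name_a, values_a, name_b, values_b, fixed=None):
    """Evaluate every (a, b) pair and return all rows plus the lowest-MAE row."""
    fixed = dict(fixed or {})
    rows = []
    for a, b in product(values_a, values_b):
        params = {**fixed, name_a: a, name_b: b}
        mse, mae = evaluate(**params)
        rows.append({name_a: a, name_b: b, "MSE": mse, "MAE": mae})
    # min() returns the first minimum, so ties resolve to the earliest pair,
    # matching the `.values[0]` selection used in the cells above.
    best = min(rows, key=lambda row: row["MAE"])
    return rows, best


def toy_evaluate(learning_rate, min_child_weight, **_ignored):
    """Hypothetical stand-in for model training; errors are invented."""
    mae = abs(learning_rate - 0.1) * 100 + abs(min_child_weight - 16)
    return mae ** 2, mae


rows, best = grid_search_pair(
    toy_evaluate,
    "learning_rate", [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3],
    "min_child_weight", range(5, 21),
    fixed={"n_estimators": 20, "max_depth": 2},
)
print(len(rows), "combinations evaluated")
print("Lowest-MAE pair:", best["learning_rate"], best["min_child_weight"])
```

Note that this approach, like the cells above, tunes two parameters at a time while holding the others fixed, so it may miss combinations that a joint search over all parameters would find.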


### Overview of Optimized Values

I have graphed the optimized values to show what the new parameter values are and how tuning the parameters affects the error values.

In [46]:
tuned_params_df = pd.DataFrame({'PARAMETER':['n_estimators', 'max_depth', 'learning_rate', 'min_child_weight', 'subsample', 'colsample_bytree', 'colsample_bylevel', 'gamma'],
                               'ORIGINAL':[n_estimators, max_depth, learning_rate, min_child_weight, subsample, colsample_bytree, colsample_bylevel, gamma],
                               'OPTIMUM':[n_estimators_optimum, max_depth_optimum, learning_rate_optimum, min_child_weight_optimum, subsample_optimum, colsample_bytree_optimum, colsample_bylevel_optimum, gamma_optimum]})

MSE_values_df = pd.DataFrame({'PARAMETER':['MSE'],
                                    'ORIGINAL':[MSE_Original],
                                    'OPTIMUM':[MSE_Optimum]})

MAE_values_df = pd.DataFrame({'PARAMETER':['MAE'],
                                    'ORIGINAL':[MAE_Original],
                                    'OPTIMUM':[MAE_Optimum]})

tuned_params_df

| | PARAMETER | ORIGINAL | OPTIMUM |
|---|---|---|---|
| 0 | n_estimators | 100.0 | 20.0 |
| 1 | max_depth | 3.0 | 2.0 |
| 2 | learning_rate | 0.1 | 0.1 |
| 3 | min_child_weight | 1.0 | 16.0 |
| 4 | subsample | 1.0 | 1.0 |
| 5 | colsample_bytree | 1.0 | 0.9 |
| 6 | colsample_bylevel | 1.0 | 0.8 |
| 7 | gamma | 0.0 | 0.1 |


In [47]:
MSE_values_df

| | PARAMETER | ORIGINAL | OPTIMUM |
|---|---|---|---|
| 0 | MSE | 1338.706057 | 774.144662 |


In [48]:
MAE_values_df

| | PARAMETER | ORIGINAL | OPTIMUM |
|---|---|---|---|
| 0 | MAE | 25.111606 | 19.294963 |


In [49]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

vals_parameters = [tuned_params_df['ORIGINAL'], tuned_params_df['OPTIMUM']]
X_parameters = tuned_params_df['PARAMETER']
legend_list = ['Original Value', 'Optimized Value']
n = len(vals_parameters)
_X = np.arange(len(X_parameters))
for i in range(len(vals_parameters)):
    plt.bar(_X - 0.8/2. + i/float(n)*0.8, vals_parameters[i], width = 0.8/float(n), align = 'edge')
plt.xticks(_X, X_parameters, rotation=45)
plt.title('Original and Optimized Parameters For XGBoost', fontsize = 14, fontweight = 'bold')
plt.ylabel('Value', fontweight = 'bold')
plt.xlabel('Parameters', fontweight = 'bold')
plt.legend(legend_list)

cursor = mplcursors.cursor(hover=True)
@cursor.connect("add")
def on_add(sel):
    x, y, width, height = sel.artist[sel.target.index].get_bbox().bounds
    sel.annotation.set(text=f"{height}", position=(0, 20))
    sel.annotation.xy = (x + width / 2, y + height)

plt.show();

In [50]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

vals_MSE = [MSE_values_df['ORIGINAL'], MSE_values_df['OPTIMUM']]
X_MSE = MSE_values_df['PARAMETER']
legend_list = ['Original Value', 'Optimized Value']
n = len(vals_MSE)
_X = np.arange(len(X_MSE))
for i in range(len(vals_MSE)):
    plt.bar(_X - 0.8/2. + i/float(n)*0.8, vals_MSE[i], width = 0.8/float(n), align = 'edge')
plt.xticks(_X, X_MSE, rotation=45)
plt.title('Original and Optimized MSE Values For XGBoost', fontsize = 14, fontweight = 'bold')
plt.ylabel('Value', fontweight = 'bold')
plt.xlabel('Error Calculation', fontweight = 'bold')
plt.legend(legend_list)

cursor = mplcursors.cursor(hover=True)
@cursor.connect("add")
def on_add(sel):
    x, y, width, height = sel.artist[sel.target.index].get_bbox().bounds
    sel.annotation.set(text=f"{height}", position=(0, 20))
    sel.annotation.xy = (x + width / 2, y + height)

plt.show();

In [51]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

vals_MAE = [MAE_values_df['ORIGINAL'], MAE_values_df['OPTIMUM']]
X_MAE = MAE_values_df['PARAMETER']
legend_list = ['Original Value', 'Optimized Value']
n = len(vals_MAE)
_X = np.arange(len(X_MAE))
for i in range(len(vals_MAE)):
    plt.bar(_X - 0.8/2. + i/float(n)*0.8, vals_MAE[i], width = 0.8/float(n), align = 'edge')
plt.xticks(_X, X_MAE, rotation=45)
plt.title('Original and Optimized MAE Values For XGBoost', fontsize = 14, fontweight = 'bold')
plt.ylabel('Value', fontweight = 'bold')
plt.xlabel('Error Calculation', fontweight = 'bold')
plt.legend(legend_list)

cursor = mplcursors.cursor(hover=True)
@cursor.connect("add")
def on_add(sel):
    x, y, width, height = sel.artist[sel.target.index].get_bbox().bounds
    sel.annotation.set(text=f"{height}", position=(0, 20))
    sel.annotation.xy = (x + width / 2, y + height)

plt.show();

### Testing XGBoost Model on Test Dataset

The optimum parameters had a positive effect on the validation error values, so I now use the tuned model to make predictions on the test dataset.

In [52]:
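# NOTE: this call passes the validation features and targets (X_valid_scaled, y_valid)
# together with the test-set mean and standard deviation. If the test-set features and
# targets were intended here, the errors printed below may not reflect test performance.
mse, mae, predicted_test = cf.train_pred_eval_model(X_train_scaled,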
                                                             y_train_scaled,
                                                             X_valid_scaled,
                                                             y_valid,
                                                             test['ADJ_CLOSE_MEAN'],
                                                             test['ADJ_CLOSE_STD'],
                                                             seed = model_seed,
                                                             n_estimators = n_estimators_optimum,
                                                             max_depth = max_depth_optimum,
                                                             learning_rate = learning_rate_optimum,
                                                             min_child_weight = min_child_weight_optimum,
                                                             subsample = subsample_optimum,
                                                             colsample_bytree = colsample_bytree_optimum,
                                                             colsample_bylevel = colsample_bylevel_optimum,
                                                             gamma = gamma_optimum)


predit_test_df = pd.DataFrame({'PREDICT_VALUE':predicted_test,
                               'Y_TEST':y_test,
                               'DATE':test['DATE']})

print('MSE Value:', mse)
print('MAE Value:', mae)

MSE Value: 15988.902853475369
MAE Value: 117.77868045114168


In [53]:
rcParams['figure.figsize'] = 8, 8

plt.figure()

plt.plot(train['DATE'], train['ADJ_CLOSE'], 'b-')
plt.plot(valid['DATE'], valid['ADJ_CLOSE'], 'y-')
plt.plot(test['DATE'], test['ADJ_CLOSE'], 'g-')
plt.plot(predit_test_df['DATE'], predit_test_df['PREDICT_VALUE'], 'r-')
plt.legend(['TRAIN DATA', 'VALID DATA', 'TEST DATA', 'PREDICTIONS'])
plt.ylabel('USD', fontweight = 'bold')
plt.title('Adjusted Close Price and Volume Over Time Without Scaling', fontsize = 14, fontweight = 'bold')
plt.xticks(rotation = 45)
plt.grid(True)

plt.xlabel('Date', fontweight = 'bold')

mplcursors.cursor(hover = True)

plt.show();

### Conclusion/Analysis

__With the tuned XGBoost model, the predictions on the test dataset are far less accurate than on the validation dataset, so the model may be overfitting or underfitting. After tuning, the test mean squared error is about 15,989 and the test mean absolute error is about 117.78, compared with a validation mean absolute error of 19.29. It might be a good idea to run the model again with the original parameter values.__
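
To put these numbers side by side, the short sketch below computes how much tuning reduced the validation errors and how much larger the test MAE is than the tuned validation MAE. The values are copied from the outputs reported above; the helper name `pct_reduction` is my own and is not part of the notebook.

```python
def pct_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# Values copied from the MSE/MAE tables and the test-set output above.
valid_mse_original, valid_mse_tuned = 1338.706057, 774.144662
valid_mae_original, valid_mae_tuned = 25.111606, 19.294963
test_mse, test_mae = 15988.902853475369, 117.77868045114168

mse_drop = pct_reduction(valid_mse_original, valid_mse_tuned)
mae_drop = pct_reduction(valid_mae_original, valid_mae_tuned)
mae_gap = test_mae / valid_mae_tuned  # how many times larger the test MAE is

print(f"Validation MSE reduction from tuning: {mse_drop:.1f}%")
print(f"Validation MAE reduction from tuning: {mae_drop:.1f}%")
print(f"Test MAE is {mae_gap:.1f}x the tuned validation MAE")
```

A gap this large between validation and test error suggests that the gains from tuning did not carry over to the test period, which is why rerunning with the original parameters is a reasonable check.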

__Next Steps__:
    - Change the data that we are feeding into the model for better accuracy.
        - One inconsistency to fix is removing the OHLC data that is fed into the model, if we want to focus only on Ebola cases and their effect on the stock market.
    - Use news and Twitter sentiment analysis to see how news about Ebola can supplement the analysis and predictions of the machine learning model.
    - Use hospitalization and death rates along with case counts to better account for the devastation of Ebola.
    - Use 10-Q (quarterly reports), 10-K (annual reports), cash flow statements, and any other financial statements to get a better understanding of how the stay-at-home orders and the freeze in the economy have affected companies.
    
Although the model is not perfect, there are great opportunities to improve it and make it much more accurate, allowing us to be better prepared for future pandemics.