# Notebook Instructions

1. If you are new to Jupyter notebooks, please go through this introductory manual <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank">here</a>.
1. Any changes made in this notebook will be lost after you close the browser window. **You can download the notebook to save your work on your PC.**
1. Before running this notebook on your local PC:<br>
i.  You need to set up a Python environment and the relevant packages on your local PC. To do so, go through the section on "**Run Codes Locally on Your Machine**" in the course.<br>
ii. You need to **download the zip file available in the last unit** of this course. The zip file contains the data files and/or Python modules that might be required to run this notebook.

# Hyperparameter tuning

Hyperparameters are not learned by the model during training; the user must specify them before training begins. In this notebook, we will find the best hyperparameters for the random forest model created in the previous section using the random search and grid search cross-validation techniques.

Let's start with the steps below, which you already know:
1. Import the data
2. Define predictor variables and a target variable
3. Split the data into train and test datasets

In [1]:
import pandas as pd
data = pd.read_csv('../data_modules/AAPL_2008_2018.csv')

# Returns
data['ret1'] = data.Adj_Close.pct_change()
data['ret5'] = data.ret1.rolling(5).sum()
data['ret10'] = data.ret1.rolling(10).sum()
data['ret20'] = data.ret1.rolling(20).sum()
data['ret40'] = data.ret1.rolling(40).sum()

# Standard Deviation
data['std5'] = data.ret1.rolling(5).std()
data['std10'] = data.ret1.rolling(10).std()
data['std20'] = data.ret1.rolling(20).std()
data['std40'] = data.ret1.rolling(40).std()

# Future returns
data['retFut1'] = data.ret1.shift(-1)

# Define predictor variables (X) and a target variable (y)
data = data.dropna()
predictor_list = ['ret1', 'ret5', 'ret10', 'ret20',
                  'ret40', 'std5', 'std10', 'std20', 'std40']
X = data[predictor_list]
y = data.retFut1

# Split the data chronologically: first 80% for training, last 20% for testing
train_length = int(len(data)*0.80)
X_train = X[:train_length]
X_test = X[train_length:]
y_train = y[:train_length]
y_test = y[train_length:]

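The key hyperparameters of the random forest model are:
- n_estimators: the number of trees in the forest,
- max_features: the number of features to consider at every split,
- max_depth: the maximum depth of each tree,
- min_samples_leaf: the minimum number of samples required at each leaf node,
- bootstrap: whether each tree is trained on a bootstrap sample of the training data (True) or on the whole training dataset (False).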

Below, we define a range of candidate values for each of these hyperparameters.

In [2]:
import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=10, stop=20, num=5)]

# Fraction of features to consider at every split (a float is read as a fraction of all features)
max_features = [round(x, 2) for x in np.linspace(start=0.3, stop=1.0, num=5)]

# Max depth of the tree
max_depth = [int(round(x, 2)) for x in np.linspace(start=2, stop=10, num=5)]

# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(start=300, stop=600, num=10)]

# Method of selecting training subset for training each tree
bootstrap = [True, False]

# Save these parameters in a dictionary
param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap
              }

# Print the dictionary
param_grid

{'n_estimators': [10, 12, 15, 17, 20],
 'max_features': [0.3, 0.48, 0.65, 0.82, 1.0],
 'max_depth': [2, 4, 6, 8, 10],
 'min_samples_leaf': [300, 333, 366, 400, 433, 466, 500, 533, 566, 600],
 'bootstrap': [True, False]}
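
To get a feel for the size of this search space, we can count how many distinct combinations the dictionary describes. The snippet below is a minimal sketch that uses ParameterGrid from sklearn.model_selection; the name example_grid is introduced here only for illustration and simply repeats the values printed above.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the dictionary printed above
# (example_grid is an illustrative name, not part of the notebook)
example_grid = {
    'n_estimators': [10, 12, 15, 17, 20],
    'max_features': [0.3, 0.48, 0.65, 0.82, 1.0],
    'max_depth': [2, 4, 6, 8, 10],
    'min_samples_leaf': [300, 333, 366, 400, 433, 466, 500, 533, 566, 600],
    'bootstrap': [True, False],
}

# ParameterGrid enumerates every combination of the listed values
n_combinations = len(ParameterGrid(example_grid))
print(n_combinations)  # 5 * 5 * 5 * 10 * 2 = 2500
```

Trying all 2500 combinations with cross-validation would be slow, which is why sampling only some of them is attractive.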

## Random Search
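The RandomizedSearchCV class from the sklearn.model_selection module is used to find the best hyperparameter values. Instead of trying every combination, it samples a fixed number of combinations from the values we defined and evaluates each one with cross-validation.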

In [3]:
from sklearn.model_selection import RandomizedSearchCV

# Uncomment the line below to see details about the RandomizedSearchCV class
# help(RandomizedSearchCV)

In [4]:
# Create the base model to tune
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor()

RandomizedSearchCV takes the following parameters as input:

1. estimator: The base estimator model for which the best hyperparameter values are found.
2. param_distributions: Dictionary of parameter names and the list of values to try for each.
3. n_iter: Number of parameter combinations that are sampled and evaluated.
4. random_state: The random seed used when sampling the combinations.
5. cv: Cross-validation splitting strategy. This can be an integer number of folds, a cross-validation generator, or an iterable of train/test splits.

In [5]:
# Random search of parameters by searching across 50 different combinations
rf_random = RandomizedSearchCV(estimator=random_forest,
                               param_distributions=param_grid,
                               n_iter=50,
                               random_state=42,
                               cv=5
                               )

# Fit the model to find the best hyperparameter values
rf_random.fit(X_train, y_train)
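
A note on the cv parameter: when cv is an integer and the estimator is a regressor, scikit-learn uses a plain K-fold split without shuffling, so some folds are validated on observations that come before part of the training data. For time-ordered data, one common alternative is to pass a TimeSeriesSplit object as cv. The sketch below illustrates this on synthetic data; X_demo, y_demo, demo_grid and rf_time_aware are hypothetical names, and this is not the approach used in the rest of this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-in for X_train / y_train (assumption: not the AAPL data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = rng.normal(size=300)

# Each split trains on an initial segment and validates on the block after it
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X_demo):
    assert train_idx.max() < val_idx.min()

# Small illustrative search space to keep the example quick
demo_grid = {'n_estimators': [10, 20],
             'max_depth': [2, 4],
             'min_samples_leaf': [10, 20]}

rf_time_aware = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=demo_grid,
    n_iter=4,
    random_state=42,
    cv=tscv,  # pass the splitter instead of an integer
)
rf_time_aware.fit(X_demo, y_demo)
print(rf_time_aware.best_params_)
```

In this notebook, the same idea would amount to replacing cv=5 with cv=TimeSeriesSplit(n_splits=5); whether that improves the selected model is something to check on your own data.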

The best hyperparameter values found for the random forest model are shown below.

In [6]:
rf_random.best_params_

{'n_estimators': 17,
 'min_samples_leaf': 333,
 'max_features': 0.65,
 'max_depth': 4,
 'bootstrap': False}
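
Besides best_params_, a fitted search object stores the cross-validation score of every combination it tried in its cv_results_ attribute. The sketch below, which runs on synthetic data with hypothetical names (X_demo, y_demo, search, ranking), shows one way to turn these results into a ranked table. For a regressor, the default score is the R-squared of the predictions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption: stands in for X_train / y_train)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions={'n_estimators': [10, 20], 'max_depth': [2, 4, 6]},
    n_iter=4,
    random_state=42,
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per sampled combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
ranking = results[columns].sort_values('rank_test_score')
print(ranking)
```

Applied to this notebook, pd.DataFrame(rf_random.cv_results_) would give the same kind of table for the 50 sampled combinations.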

In this step, we take the model with the best hyperparameter values, fix its random seed, and fit it on the train dataset.

In [8]:
# Assign the best model to best_random_forest
best_random_forest = rf_random.best_estimator_

# Set random_state to 42 so that the fit is reproducible
best_random_forest.random_state = 42

# Fit the best random forest model on the train dataset
best_random_forest.fit(X_train, y_train)

## Grid Search

Similarly, we can find the best model using the grid search cross-validation technique. Because this method tries every possible combination, it quickly becomes time-consuming, so we have defined fewer hyperparameter values here, for illustration purposes only. You may choose to specify more values for each hyperparameter.

In [9]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=10, stop=20, num=3)]

# Number of features to consider at every split
max_features = [round(x, 2) for x in np.linspace(start=0.3, stop=1.0, num=3)]

# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(start=300, stop=600, num=3)]

# Method of selecting training subset for training each tree
bootstrap = [True, False]

# Create the parameter grid
param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'min_samples_leaf': min_samples_leaf,
              'bootstrap': bootstrap
              }

param_grid

{'n_estimators': [10, 15, 20],
 'max_features': [0.3, 0.65, 1.0],
 'min_samples_leaf': [300, 450, 600],
 'bootstrap': [True, False]}
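
As a rough cost estimate, grid search fits one model per combination per cross-validation fold. The sketch below counts the fits for the grid printed above, assuming 5 folds; grid_example, n_candidates and n_fits are names introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the reduced grid printed above (illustrative copy)
grid_example = {
    'n_estimators': [10, 15, 20],
    'max_features': [0.3, 0.65, 1.0],
    'min_samples_leaf': [300, 450, 600],
    'bootstrap': [True, False],
}

n_candidates = len(ParameterGrid(grid_example))  # 3 * 3 * 3 * 2
cv_folds = 5                                     # assumed number of folds
n_fits = n_candidates * cv_folds                 # one fit per candidate per fold

print(n_candidates, n_fits)  # 54 270
```

For comparison, the random search above with n_iter=50 and cv=5 performs 250 cross-validation fits regardless of how many values each list contains.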

The code below finds the best hyperparameter values.

In [10]:
from sklearn.model_selection import GridSearchCV

# Uncomment the line below to see details about the GridSearchCV class
# help(GridSearchCV)

# Grid search of parameters by searching all the possible combinations
rf_grid = GridSearchCV(estimator=random_forest,
                       param_grid=param_grid, cv=5
                       )

# Fit the model to find the best hyperparameter values
rf_grid.fit(X_train, y_train)

# Best hyperparameter values
rf_grid.best_params_

{'bootstrap': False,
 'max_features': 0.65,
 'min_samples_leaf': 300,
 'n_estimators': 10}

## Practice

You can try it yourself and analyze how the random forest models created through RandomizedSearchCV and GridSearchCV perform on the test dataset.
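
As a starting point, here is a minimal sketch of such a comparison. It runs on synthetic data so that it is self-contained; all names in it (X_all, y_all, demo_grid, searches, scores) are hypothetical, and the choice of mean squared error and R-squared as metrics is an assumption rather than something prescribed by the course.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data (assumption: stands in for the predictors and future returns)
rng = np.random.default_rng(2)
X_all = rng.normal(size=(400, 3))
y_all = 0.3 * X_all[:, 0] + rng.normal(scale=0.2, size=400)

# Chronological 80/20 split, as in the notebook
split = int(len(X_all) * 0.80)
X_tr, X_te = X_all[:split], X_all[split:]
y_tr, y_te = y_all[:split], y_all[split:]

demo_grid = {'n_estimators': [10, 20], 'min_samples_leaf': [5, 20]}
base_model = RandomForestRegressor(random_state=42)

searches = {
    'random search': RandomizedSearchCV(estimator=base_model,
                                        param_distributions=demo_grid,
                                        n_iter=3, random_state=42, cv=3),
    'grid search': GridSearchCV(estimator=base_model,
                                param_grid=demo_grid, cv=3),
}

scores = {}
for name, search in searches.items():
    search.fit(X_tr, y_tr)
    # predict() uses the best estimator, refit on the full training data
    pred = search.predict(X_te)
    scores[name] = {'mse': mean_squared_error(y_te, pred),
                    'r2': r2_score(y_te, pred)}
    print(name, search.best_params_, scores[name])
```

In this notebook, the equivalent calls would be rf_random.predict(X_test) and rf_grid.predict(X_test), compared against y_test.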