# **Tuning Hyperparameters of a Machine Learning Model**

Chanin Nantasenamat

<i>Data Professor YouTube channel, http://youtube.com/dataprofessor </i>

In this Jupyter notebook, we tune the hyperparameters of a classification model built with the random forest algorithm, using the scikit-learn package in Python.

## **1. Make synthetic dataset**

### **1.1. Generate the dataset**

In [1]:
from sklearn.datasets import make_classification

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10, n_redundant=0, random_state=1)

### **1.2. Let's examine the data dimension**

We can see that there are 200 rows (samples) and 10 columns (features) for the **X** variable, and 200 class labels (a one-dimensional array) for the **Y** variable.

In [2]:
X.shape, Y.shape

((200, 10), (200,))
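
Beyond the shape, it can be useful to check how the samples are distributed between the two classes. The following is a minimal sketch that repeats the `make_classification` call from section 1.1 and counts the labels; the variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as in section 1.1
X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)

# Count how many samples carry each class label (index 0 -> class 0, index 1 -> class 1)
class_counts = np.bincount(Y)
print(class_counts)
```

A roughly even split between the two counts means that plain accuracy is a reasonable first metric for this dataset.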

## **2. Data split (80/20 ratio)**

### **2.1. Data split**

A ratio of 80/20 is used for data splitting such that 80% goes to the training subset and 20% to the testing subset.

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

### **2.2. Let's examine the data dimension**

Here we see that the **training set** has 160 rows and 10 columns for the **X** variable and 160 class labels for the **Y** variable.

In [3]:
X_train.shape, Y_train.shape

((160, 10), (160,))

The **testing set** has 40 rows and 10 columns for the **X** variable and 40 class labels for the **Y** variable.

In [4]:
X_test.shape, Y_test.shape

((40, 10), (40,))
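
The split in section 2.1 is random, so the exact rows in each subset (and every score computed later) will change from run to run. The sketch below shows one possible way to make the split reproducible and to keep the class proportions similar in both subsets. The seed value `42` is an arbitrary choice made for this example, and `stratify=Y` is an addition that the original notebook does not use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic dataset as in section 1.1
X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)

# random_state fixes the shuffle (42 is an arbitrary illustrative seed);
# stratify=Y keeps the class ratio similar in the training and testing subsets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y)

print(X_train.shape, X_test.shape)
```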

## **3. Building a simple machine learning model using Random Forest**

In the following code cells, we first build a baseline random forest model. Later, in section 4, we will explore how to tune its hyperparameters (specifically **n_estimators** and **max_features**).

We first start by importing the necessary libraries and assigning the random forest classifier to the **rf** variable.

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(max_features=5, n_estimators=100)

Now we apply the random forest classifier to build a classification model by calling **rf.fit()** on the training data (i.e. **X_train** and **Y_train**).

In [6]:
rf.fit(X_train, Y_train)

RandomForestClassifier(max_features=5)

The **rf.score()** function will be used to calculate the accuracy score of the RF model in predicting the *test data* (**X_test**).

In [7]:
rf.score(X_test, Y_test)

0.775
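
For a classifier, `score()` returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label. The sketch below checks this by computing the fraction by hand. It rebuilds the data and model so that it runs on its own; the `random_state` values are assumptions added for reproducibility, so the number it prints will generally differ from the 0.775 shown above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)  # seed is an illustrative assumption

rf = RandomForestClassifier(max_features=5, n_estimators=100, random_state=0)
rf.fit(X_train, Y_train)

# Fraction of test samples where the prediction equals the true label
manual_accuracy = np.mean(rf.predict(X_test) == Y_test)
built_in_accuracy = rf.score(X_test, Y_test)

print(manual_accuracy, built_in_accuracy)
```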

The following two code cells also calculate the accuracy of the RF model on the test data (**X_test**), but in two steps, using the **rf.predict()** and **accuracy_score()** functions.

In [8]:
Y_pred = rf.predict(X_test)

In [9]:
# scikit-learn's signature is accuracy_score(y_true, y_pred)
accuracy_score(Y_test, Y_pred)

0.775

The advantage of the latter approach is that you have access to the predicted class labels.

In [10]:
Y_pred, Y_test

(array([1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
        0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1]),
 array([1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0,
        0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1]))
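
Having the predicted labels makes it possible to look beyond a single accuracy number. As a small sketch, the two arrays printed above are copied in below as literals and passed to `confusion_matrix` to see which kind of error the model makes. The variable names `cm` and `accuracy_from_cm` are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Predicted and true labels, copied from the output above
Y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
                   0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1])
Y_test = np.array([1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0,
                   0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(Y_test, Y_pred, labels=[0, 1])
print(cm)

# The diagonal holds the correct predictions, so this reproduces the accuracy
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm)
```

The diagonal sum divided by the total gives back the 0.775 reported by `rf.score()`, while the off-diagonal cells show how the errors split between false positives and false negatives.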

## **4. Hyperparameter Tuning**

Now we will tune the hyperparameters of the random forest model. The hyperparameters that we will tune are **max_features** and **n_estimators**.

Note: Some of the code is adapted from a [scikit-learn example](http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html).

Firstly, we will import the necessary modules.

**GridSearchCV** from scikit-learn will be used to perform the hyperparameter tuning. It fits the estimator once for every combination of parameter values and every cross-validation fold, then keeps the best combination. A **GridSearchCV** object also behaves like a classifier: it provides ***fit***, ***score*** and ***predict***, as well as ***predict_proba***, ***decision_function***, ***transform*** and ***inverse_transform*** whenever the underlying estimator implements them.

Secondly, we define the variables that are required as input to **GridSearchCV()**.


In [11]:
from sklearn.model_selection import GridSearchCV
import numpy as np

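# Candidate values: max_features 1..5, n_estimators 10..200 in steps of 10
max_features_range = np.arange(1, 6, 1)
n_estimators_range = np.arange(10, 210, 10)
param_grid = dict(max_features=max_features_range, n_estimators=n_estimators_range)

rf = RandomForestClassifier()

# Evaluate every parameter combination with 5-fold cross-validation
grid = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)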

In [12]:

grid.fit(X_train, Y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_features': array([1, 2, 3, 4, 5]),
                         'n_estimators': array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120, 130,
       140, 150, 160, 170, 180, 190, 200])})
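
To get a feel for the cost of this search, the number of candidate combinations can be counted before fitting. The sketch below uses scikit-learn's `ParameterGrid` on the same `param_grid`; the names `n_combinations` and `n_fits` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same ranges as in the cell that defines the grid
max_features_range = np.arange(1, 6, 1)
n_estimators_range = np.arange(10, 210, 10)
param_grid = dict(max_features=max_features_range, n_estimators=n_estimators_range)

# Every pairing of the 5 max_features values with the 20 n_estimators values
n_combinations = len(ParameterGrid(param_grid))

# With cv=5, each combination is fitted on 5 different train/validation folds
# (the final refit on the whole training set is not counted here)
n_fits = n_combinations * 5

print(n_combinations, n_fits)
```

The grid therefore contains 100 combinations, which means 500 random forest fits during cross-validation.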

In [13]:
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))

The best parameters are {'max_features': 4, 'n_estimators': 20} with a score of 0.93
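
The score reported by `best_score_` is the mean cross-validated accuracy on the training subset, not the accuracy on the held-out test set. Because `GridSearchCV` refits the best combination on the whole training set by default, the fitted object can be scored on the test data directly. The sketch below illustrates this with a deliberately small grid so that it runs quickly; the grid values and the `random_state` seeds are assumptions made for this example and are not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)  # illustrative seed

# Small illustrative grid (4 combinations) instead of the full 100
small_grid = {"max_features": [2, 4], "n_estimators": [20, 50]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=5)
grid.fit(X_train, Y_train)

# grid.score() delegates to the refitted best estimator
test_accuracy = grid.score(X_test, Y_test)

print(grid.best_params_)
print("CV accuracy:", round(grid.best_score_, 3))
print("Test accuracy:", round(test_accuracy, 3))
```

Comparing the two numbers gives a rough check on whether the cross-validated score is optimistic for unseen data.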


## **5. Dataframe of Grid search parameters and their Accuracy scores**

Next, we collect the grid search parameters and their mean cross-validated accuracy scores into a dataframe.

In [14]:
import pandas as pd

grid_results = pd.concat(
    [pd.DataFrame(grid.cv_results_["params"]),
     pd.DataFrame(grid.cv_results_["mean_test_score"], columns=["Accuracy"])],
    axis=1)
grid_results.head()

|  | max_features | n_estimators | Accuracy |
|---|---|---|---|
| 0 | 1 | 10 | 0.725 |
| 1 | 1 | 20 | 0.85625 |
| 2 | 1 | 30 | 0.85625 |
| 3 | 1 | 40 | 0.88125 |
| 4 | 1 | 50 | 0.88125 |
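
The `cv_results_` dictionary holds more than the mean score. As a sketch, the cell below builds a dataframe that also includes the standard deviation across folds and the rank of each combination, sorted so that the best one comes first. It uses a small illustrative grid and fixed seeds (both assumptions made for this example) so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)  # illustrative seed

# Small illustrative grid so the sketch finishes quickly
small_grid = {"max_features": [2, 4], "n_estimators": [20, 50]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=5)
grid.fit(X_train, Y_train)

# Keep the parameter columns plus mean, spread and rank of the CV accuracy
columns = ["param_max_features", "param_n_estimators",
           "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(grid.cv_results_)[columns].sort_values("rank_test_score")

print(results)
```

Looking at `std_test_score` alongside the mean helps to judge whether the difference between two combinations is larger than the fold-to-fold variation.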


## **6. Preparing data for making contour plots**

Prior to making contour plots, we will have to reshape the data into a compatible format that will be recognized by the contour plot functions.

First, we use pandas' **groupby()** function to group the data by the two hyperparameters, **max_features** and **n_estimators**, and take the mean accuracy of each group.

In [15]:
grid_contour = grid_results.groupby(['max_features','n_estimators']).mean()
grid_contour

| max_features | n_estimators | Accuracy |
|---|---|---|
| 1 | 10 | 0.72500 |
| 1 | 20 | 0.85625 |
| 1 | 30 | 0.85625 |
| 1 | 40 | 0.88125 |
| 1 | 50 | 0.88125 |
| ... | ... | ... |
| 5 | 160 | 0.91875 |
| 5 | 170 | 0.92500 |
| 5 | 180 | 0.92500 |
| 5 | 190 | 0.91250 |


## **Pivoting the data**

Data is reshaped by pivoting the data into an m by n matrix where rows and columns correspond to the **max_features** and **n_estimators**, respectively.

In [16]:
grid_reset = grid_contour.reset_index()
grid_reset.columns = ['max_features', 'n_estimators', 'Accuracy']
# Keyword arguments are required by pandas 2.0 and later
grid_pivot = grid_reset.pivot(index='max_features', columns='n_estimators')
grid_pivot

Accuracy by max_features (rows) and n_estimators (columns):

| max_features \ n_estimators | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 | 120 | 130 | 140 | 150 | 160 | 170 | 180 | 190 | 200 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.725 | 0.85625 | 0.85625 | 0.88125 | 0.88125 | 0.9125 | 0.9125 | 0.89375 | 0.8875 | 0.9 | 0.90625 | 0.9 | 0.9125 | 0.9125 | 0.9125 | 0.89375 | 0.9 | 0.9125 | 0.90625 | 0.90625 |
| 2 | 0.825 | 0.8875 | 0.8875 | 0.90625 | 0.9125 | 0.9 | 0.9125 | 0.90625 | 0.9125 | 0.90625 | 0.90625 | 0.90625 | 0.90625 | 0.90625 | 0.9125 | 0.9125 | 0.90625 | 0.91875 | 0.91875 | 0.90625 |
| 3 | 0.85 | 0.90625 | 0.89375 | 0.9125 | 0.91875 | 0.91875 | 0.91875 | 0.9125 | 0.9125 | 0.9 | 0.91875 | 0.91875 | 0.91875 | 0.91875 | 0.9125 | 0.925 | 0.9125 | 0.9125 | 0.9125 | 0.9125 |
| 4 | 0.88125 | 0.93125 | 0.8875 | 0.9125 | 0.91875 | 0.91875 | 0.91875 | 0.91875 | 0.91875 | 0.91875 | 0.91875 | 0.91875 | 0.925 | 0.91875 | 0.91875 | 0.9125 | 0.925 | 0.9125 | 0.91875 | 0.9125 |
| 5 | 0.9125 | 0.90625 | 0.91875 | 0.91875 | 0.91875 | 0.9125 | 0.925 | 0.925 | 0.91875 | 0.91875 | 0.91875 | 0.925 | 0.925 | 0.925 | 0.925 | 0.91875 | 0.925 | 0.925 | 0.9125 | 0.925 |


Finally, we assign the pivoted data into the respective ***x***, ***y*** and ***z*** variables.

In [17]:
x = grid_pivot.columns.levels[1].values
y = grid_pivot.index.values
z = grid_pivot.values
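
Contour and surface plots expect `z[i, j]` to be the value at `y[i]` and `x[j]`, so it is worth confirming that the three arrays line up. The sketch below does this on a toy table whose six accuracy values are copied from the pivoted output above (max_features 3 to 5, n_estimators 10 and 20). Unlike the cell above, it passes `values="Accuracy"` to `pivot`, which gives single-level columns; this is a simplification made for the example. It then locates the highest point of the grid.

```python
import numpy as np
import pandas as pd

# Toy long-format results: a small corner of the full grid
toy = pd.DataFrame({
    "max_features": [3, 3, 4, 4, 5, 5],
    "n_estimators": [10, 20, 10, 20, 10, 20],
    "Accuracy": [0.85, 0.90625, 0.88125, 0.93125, 0.9125, 0.90625],
})

# Rows become max_features, columns become n_estimators
pivot = toy.pivot(index="max_features", columns="n_estimators", values="Accuracy")

x = pivot.columns.values  # n_estimators values
y = pivot.index.values    # max_features values
z = pivot.values          # z[i, j] is the accuracy at y[i], x[j]

# One row per y value and one column per x value
assert z.shape == (len(y), len(x))

# Position of the maximum accuracy in the grid
row, col = np.unravel_index(np.argmax(z), z.shape)
print(y[row], x[col], z[row, col])
```

On this toy corner of the grid, the peak sits at max_features 4 and n_estimators 20, which agrees with the best parameters reported by the grid search.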

## **2D Contour Plot**

Now comes the fun part: we will visualize the landscape of the two hyperparameters that we are tuning and their influence on the accuracy score.

In [18]:
import plotly.graph_objects as go

# X and Y axes labels
layout = go.Layout(
    xaxis=go.layout.XAxis(
        title=go.layout.xaxis.Title(text='n_estimators')
    ),
    yaxis=go.layout.YAxis(
        title=go.layout.yaxis.Title(text='max_features')
    )
)

fig = go.Figure(data = [go.Contour(z=z, x=x, y=y)], layout=layout )

fig.update_layout(title='Hyperparameter tuning', autosize=False,
                  width=500, height=500,
                  margin=dict(l=65, r=50, b=65, t=90))

fig.show()

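*Note: this cell requires the `plotly` package. In the run captured here it was not installed, so the cell raised `ModuleNotFoundError: No module named 'plotly'` and no figure was rendered. Installing it (for example with `pip install plotly`) and re-running the cell displays the contour plot.*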

## **3D Surface Plot**

Let's add an extra dimension to the plot to obtain a 3D surface plot. The cool thing about this plot is that you can rotate it interactively.

In [19]:
import plotly.graph_objects as go


fig = go.Figure(data= [go.Surface(z=z, y=y, x=x)], layout=layout )
fig.update_layout(title='Hyperparameter tuning',
                  scene = dict(
                    xaxis_title='n_estimators',
                    yaxis_title='max_features',
                    zaxis_title='Accuracy'),
                  autosize=False,
                  width=800, height=800,
                  margin=dict(l=65, r=50, b=65, t=90))
fig.show()


---