## Hyperparameter tuning

## YouTube tutorial

### **1. Make synthetic dataset**

### **1.1. Generate the dataset**

In [1]:
from sklearn.datasets import make_classification

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10, n_redundant=0, random_state=1)

In [2]:
X.shape, Y.shape

((200, 10), (200,))

### **2. Data split (80/20 ratio)**

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [4]:
X_train.shape, Y_train.shape

((160, 10), (160,))

In [5]:
X_test.shape, Y_test.shape

((40, 10), (40,))
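
The split above is unseeded, so the exact rows in each subset change from run to run. The following is a minimal sketch of a reproducible, class-balanced variant. The seed value `42` and the use of `stratify` are assumptions added for illustration; they are not part of the original tutorial.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic dataset as in section 1
X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)

# random_state=42 is an arbitrary seed (assumption, for reproducibility);
# stratify=Y keeps the class ratio roughly equal in both subsets.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=42)

print(X_train.shape, X_test.shape)
print(np.bincount(Y_train), np.bincount(Y_test))  # class counts per subset
```

With a fixed seed, later accuracy scores can be compared across runs without the split itself adding noise.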

### **3. Building a simple machine learning model using Random Forest**

In the following code cells, we first build a random forest model. Later, we explore how to tune its hyperparameters (e.g. **n_estimators** and **max_features**).

We start by importing the necessary libraries and assigning the random forest classifier to the **rf** variable.

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(max_features=5, n_estimators=100)

In [7]:
rf.fit(X_train, Y_train)

In [8]:
rf.score(X_test, Y_test)

0.825

The following code cell also calculates the accuracy of the RF model on the test data (X_test), but does so in 2 steps using the **rf.predict()** and **accuracy_score()** functions.

In [9]:
Y_pred = rf.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
accuracy_score(Y_test, Y_pred)

0.825
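
To make the equivalence of the two approaches concrete, here is a small self-contained sketch. The seeds (`random_state=0`) are assumptions added so the check is repeatable; the tutorial's own cells are unseeded.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)
# Seeds below are illustrative assumptions, not from the tutorial
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(max_features=5, n_estimators=100, random_state=0)
rf.fit(X_train, Y_train)

# One step: for classifiers, score() returns the mean accuracy
one_step = rf.score(X_test, Y_test)

# Two steps: predict first, then compare against the true labels
two_step = accuracy_score(Y_test, rf.predict(X_test))

print(one_step == two_step)
```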

### **4. Hyperparameter Tuning**

Now we will tune the hyperparameters of the random forest model. The hyperparameters that we will tune are **max_features** and **n_estimators**.

Note: Some of the code is adapted from [scikit-learn](http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html)

Firstly, we will import the necessary modules.

The **GridSearchCV()** class from scikit-learn will be used to perform the hyperparameter tuning. Once fitted, a **GridSearchCV()** object behaves like a classifier: it provides ***fit***, ***score*** and ***predict***, and also ***predict_proba***, ***decision_function***, ***transform*** and ***inverse_transform*** when the underlying estimator implements them.

Secondly, we define the variables that GridSearchCV() needs as input.


In [10]:
from sklearn.model_selection import GridSearchCV
import numpy as np

max_features_range = np.arange(1,6,1)
n_estimators_range = np.arange(10,210,10)
param_grid = dict(max_features=max_features_range, n_estimators=n_estimators_range)

rf = RandomForestClassifier()

grid = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)
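
Before fitting, it can help to estimate how much work the search will do. The sketch below counts the candidate combinations with scikit-learn's `ParameterGrid` helper, using the same ranges as the cell above; the variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same ranges as in the tutorial cell
param_grid = dict(max_features=np.arange(1, 6, 1),
                  n_estimators=np.arange(10, 210, 10))

n_candidates = len(ParameterGrid(param_grid))  # every combination in the grid
cv = 5                                         # number of folds, as in the tutorial
n_fits = n_candidates * cv                     # model fits during the search

print(n_candidates, n_fits)
```

With the default `refit=True`, one extra fit on the whole training set follows the search.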

In [11]:
grid.fit(X_train, Y_train)

In [12]:
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))

The best parameters are {'max_features': 2, 'n_estimators': 90} with a score of 0.89
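
The score reported above is a cross-validated average computed within the training data. A natural follow-up is to check the tuned model on the held-out test set. The sketch below does this under some assumptions: it uses a much smaller grid than the tutorial's to stay fast, and fixed seeds so the run is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)
# Seeds are illustrative assumptions
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

# Reduced grid (assumption) so the sketch runs quickly
param_grid = {"max_features": [1, 3, 5], "n_estimators": [10, 50]}
grid = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                    param_grid=param_grid, cv=5)
grid.fit(X_train, Y_train)

# With refit=True (the default), the best configuration is retrained
# on all of X_train and stored in best_estimator_.
best_rf = grid.best_estimator_

# grid.score() delegates to the refitted best estimator
test_accuracy = grid.score(X_test, Y_test)

print(grid.best_params_)
print("CV score: %0.3f, test score: %0.3f" % (grid.best_score_, test_accuracy))
```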


### **5. Dataframe of Grid search parameters and their Accuracy scores**

Next, we collect the grid search parameters and their mean cross-validated accuracy scores in a dataframe.

In [13]:
import pandas as pd

params_df = pd.DataFrame(grid.cv_results_["params"])
scores_df = pd.DataFrame(grid.cv_results_["mean_test_score"], columns=["Accuracy"])
grid_results = pd.concat([params_df, scores_df], axis=1)
grid_results.head()

   max_features  n_estimators  Accuracy
0             1            10   0.78125
1             1            20   0.83750
2             1            30   0.82500
3             1            40   0.83125
4             1            50   0.83125
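
The `cv_results_` dictionary holds more than the mean score. As a hedged sketch, the example below loads all of it into a dataframe and lists the top-ranked configurations together with the spread across folds. The small grid and the seeds are assumptions chosen to keep the example quick.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)

# Reduced grid and fixed seed (assumptions for illustration)
param_grid = {"max_features": [1, 3, 5], "n_estimators": [10, 50]}
grid = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                    param_grid=param_grid, cv=5)
grid.fit(X, Y)

# One row per candidate; columns include per-parameter values,
# mean and standard deviation of the fold scores, and the rank.
results = pd.DataFrame(grid.cv_results_)
cols = ["param_max_features", "param_n_estimators",
        "mean_test_score", "std_test_score", "rank_test_score"]
top = results[cols].sort_values("rank_test_score").head(3)

print(top)
```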


### **6. Preparing data for making contour plots**

Before making contour plots, we have to reshape the data into a format that the contour plot functions accept.

Firstly, we will use pandas' **groupby()** function to group the data by the 2 hyperparameters: **max_features** and **n_estimators**. Each combination occurs only once in the grid, so taking the mean leaves the accuracy values unchanged and simply indexes them by the two hyperparameters.

In [14]:
grid_contour = grid_results.groupby(['max_features','n_estimators']).mean()
grid_contour

max_features  n_estimators  Accuracy
           1            10   0.78125
           1            20   0.83750
           1            30   0.82500
           1            40   0.83125
           1            50   0.83125
         ...           ...       ...
           5           160   0.87500
           5           170   0.88125
           5           180   0.87500
           5           190   0.88125


### **Pivoting the data**

The data is pivoted into an m by n matrix whose rows and columns correspond to **max_features** and **n_estimators**, respectively.

In [15]:
grid_reset = grid_contour.reset_index()
grid_reset.columns = ['max_features', 'n_estimators', 'Accuracy']
grid_pivot = grid_reset.pivot(index='max_features', columns='n_estimators', values='Accuracy')

In [16]:
x = grid_pivot.columns.values
y = grid_pivot.index.values
z = grid_pivot.values
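
To see how the pivoted arrays line up, here is a toy sketch. The accuracy values are hypothetical and made up for illustration; they are not results from the grid search above.

```python
import pandas as pd

# Hypothetical toy scores (not taken from the tutorial's grid search)
toy = pd.DataFrame({
    "max_features": [1, 1, 1, 2, 2, 2],
    "n_estimators": [10, 20, 30, 10, 20, 30],
    "Accuracy": [0.70, 0.75, 0.80, 0.72, 0.78, 0.81],
})

toy_pivot = toy.pivot(index="max_features", columns="n_estimators",
                      values="Accuracy")

x = toy_pivot.columns.values  # n_estimators values, horizontal axis
y = toy_pivot.index.values    # max_features values, vertical axis
z = toy_pivot.values          # z[i, j] is the score at (y[i], x[j])

print(toy_pivot)
print(z.shape == (len(y), len(x)))
```

The contour plot expects exactly this layout: one row of `z` per `y` value and one column per `x` value.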

### **2D Contour Plot**

Now comes the fun part: we will visualize the landscape of the 2 hyperparameters that we are tuning and their influence on the accuracy score.

In [17]:
import plotly.graph_objects as go

# X and Y axes labels
layout = go.Layout(
    xaxis=go.layout.XAxis(
        title=go.layout.xaxis.Title(text='n_estimators')
    ),
    yaxis=go.layout.YAxis(
        title=go.layout.yaxis.Title(text='max_features')
    )
)

fig = go.Figure(data=[go.Contour(z=z, x=x, y=y)], layout=layout)

fig.update_layout(title='Hyperparameter tuning', autosize=False,
                  width=500, height=500,
                  margin=dict(l=65, r=50, b=65, t=90))

fig.show()

## Hyperparameter hands-on

In [18]:
# Step 1: Load the Iris Dataset
from sklearn.datasets import load_iris
import pandas as pd

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Convert to DataFrame for better readability
iris_df = pd.DataFrame(X, columns=iris.feature_names)
print(iris_df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2


In [19]:
# Step 2: Choose a Model
from sklearn.ensemble import RandomForestClassifier

# Initialize the RandomForestClassifier
rf = RandomForestClassifier(random_state=42)

# Step 3: Hyperparameter Tuning Techniques
# 3.1 Grid Search
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 4, 6, None],
    'min_samples_split': [2, 4, 6]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit GridSearchCV
grid_search.fit(X, y)

# Best parameters and score
print("Best parameters found: ", grid_search.best_params_)
print("Best score found: ", grid_search.best_score_)

Best parameters found:  {'max_depth': 4, 'min_samples_split': 2, 'n_estimators': 50}
Best score found:  0.9666666666666668


In [20]:
# 3.2 Random Search
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define the parameter distribution
param_dist = {
    'n_estimators': np.arange(50, 200),
    'max_depth': [2, 4, 6, None],
    'min_samples_split': np.arange(2, 7)
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist, n_iter=100, cv=5, n_jobs=-1, random_state=42)

# Fit RandomizedSearchCV
random_search.fit(X, y)

# Best parameters and score
print("Best parameters found: ", random_search.best_params_)
print("Best score found: ", random_search.best_score_)

Best parameters found:  {'n_estimators': 51, 'min_samples_split': 4, 'max_depth': 6}
Best score found:  0.9666666666666668
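
The random search above samples from explicit lists of values. `RandomizedSearchCV` also accepts distribution objects that provide an `rvs` method, such as those in `scipy.stats`. The sketch below is one possible variant under stated assumptions: it uses `scipy.stats.randint` for the integer hyperparameters and a small `n_iter=10` to keep the run short. Neither choice comes from the original notebook.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# randint(low, high) samples integers from low up to high - 1
param_dist = {
    "n_estimators": randint(50, 200),
    "max_depth": [2, 4, 6, None],       # lists are sampled uniformly
    "min_samples_split": randint(2, 7),
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,        # assumption: small budget so the sketch runs quickly
    cv=5,
    random_state=42,  # makes the sampled candidates repeatable
)
random_search.fit(X, y)

print("Best parameters found: ", random_search.best_params_)
print("Best score found: ", random_search.best_score_)
```

Unlike grid search, the cost here is set by `n_iter` rather than by the size of the search space, so wide ranges can be explored with a fixed budget.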
