<a href="https://colab.research.google.com/github/dotmanjohn/kosh/blob/master/Hyperparameter_Tuning_of_RF_Classifier_on_a_ISOLET_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a Random Forest Classifier with Hyperparameter Tuning on the ISOLET Dataset
You are working for a technology company and they are planning to launch a new voice assistant product. You have been tasked with building a classification model that will recognize the letters spelled out by a user based on the signal frequencies captured. Each sound can be captured and represented as a signal composed of multiple frequencies.

We will carry out hyperparameter tuning and select the set of hyperparameters that gives the best performance, meaning a high accuracy score with the smallest gap between the training and testing sets (the least overfitting model).

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [None]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter04/Dataset/phpB0xrNj.csv'
df = pd.read_csv(file_url)
df.head()

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16,f17,f18,f19,f20,f21,f22,f23,f24,f25,f26,f27,f28,f29,f30,f31,f32,f33,f34,f35,f36,f37,f38,f39,f40,...,f579,f580,f581,f582,f583,f584,f585,f586,f587,f588,f589,f590,f591,f592,f593,f594,f595,f596,f597,f598,f599,f600,f601,f602,f603,f604,f605,f606,f607,f608,f609,f610,f611,f612,f613,f614,f615,f616,f617,class
0,-0.4394,-0.093,0.1718,0.462,0.6226,0.4704,0.3578,0.0478,-0.1184,-0.231,-0.2958,-0.2704,-0.262,-0.217,-0.0874,-0.0564,0.0254,0.0958,0.4226,0.6648,0.9184,0.9718,0.9324,0.707,0.6986,0.755,0.8816,1.0,0.938,0.845,0.7268,0.5578,-0.433,-0.1982,0.127,0.3666,0.4496,0.4258,0.2646,-0.0368,...,1,-1,-1.0,-1.0,-1.0,0.1334,-1,-0.077,0.0512,0.2564,0.5642,0.4872,0.077,0.4358,0.7436,0.5128,0.6666,0.641,0.6154,1.0,0.8206,0.641,0.359,0.6924,0.4358,0.1538,0.4616,0.6154,0.3334,0.3334,0.4102,0.2052,0.3846,0.359,0.5898,0.3334,0.641,0.5898,-0.4872,'1'
1,-0.4348,-0.1198,0.2474,0.4036,0.5026,0.6328,0.4948,0.0338,-0.052,-0.1302,-0.0964,-0.2084,-0.0494,-0.0494,-0.2942,0.0704,0.0546,0.1302,0.5652,0.6848,0.776,0.9558,0.8542,0.7474,0.6094,0.7708,0.8282,1.0,0.9974,0.948,0.7422,0.5678,-0.2196,0.109,0.5892,0.8768,1.0,0.9936,0.7852,0.3712,...,-1,-1,-1.0,-1.0,-1.0,-1.0,-1,0.0228,-0.091,0.2728,0.8636,0.75,0.4318,0.7272,0.659,0.409,0.7728,1.0,0.7272,0.4772,0.4772,0.4772,0.659,0.1818,0.4318,0.3864,0.841,0.8864,0.25,0.2272,0.0,0.2954,0.2046,0.4772,0.0454,0.2046,0.4318,0.4546,-0.091,'1'
2,-0.233,0.2124,0.5014,0.5222,-0.3422,-0.584,-0.7168,-0.6342,-0.8614,-0.8318,-0.7228,-0.6312,-0.4986,-0.708,-0.6666,-0.5428,-0.413,-0.3776,-0.0472,0.1356,0.6136,0.8024,1.0,0.9794,0.9352,0.8732,0.944,0.9588,0.6962,0.4838,0.3982,0.2064,-0.327,0.0134,0.362,0.3218,-0.4558,-0.8096,-0.7748,-0.7238,...,-1,1,-0.8,-1.0,-0.6,-0.8334,-1,-0.4286,-0.254,-0.365,-0.0952,-0.0794,0.0318,-0.2064,0.0634,0.1112,0.1746,0.238,0.1904,0.508,0.5396,0.0318,-0.0158,0.7142,1.0,0.4126,-0.0794,-0.0476,0.0,0.0952,-0.1112,-0.0476,-0.1746,0.0318,-0.0476,0.1112,0.254,0.1588,-0.4762,'2'
3,-0.3808,-0.0096,0.2602,0.2554,-0.429,-0.6746,-0.6868,-0.665,-0.841,-0.9614,-0.7374,-0.7084,-0.6772,-0.6338,-0.6482,-0.624,-0.3976,-0.5662,-0.2168,0.0458,0.3832,0.6168,0.8988,1.0,0.9156,0.8796,0.9132,0.7132,0.759,0.7278,0.5856,0.506,-0.371,-0.0868,0.4114,0.3438,-0.1816,-0.5964,-0.6888,-0.6686,...,-1,1,-1.0,-1.0,-1.0,-0.8334,-1,-0.2374,-0.5396,0.1798,0.2086,0.0792,0.036,0.3238,0.3956,0.41,0.2662,0.5252,0.367,0.9136,1.0,0.41,0.1224,0.5252,0.4388,0.0216,-0.0792,0.3812,0.2806,0.0648,-0.0504,-0.036,-0.1224,0.1366,0.295,0.0792,-0.0072,0.0936,-0.151,'2'
4,-0.3412,0.0946,0.6082,0.6216,-0.1622,-0.3784,-0.4324,-0.4358,-0.4966,-0.5406,-0.5472,-0.544,-0.4494,-0.2332,-0.2332,-0.1148,0.0068,0.0778,0.4864,0.9054,0.956,0.7602,0.777,0.7636,0.8818,1.0,0.9426,0.7162,0.5472,0.4122,0.277,0.2364,-0.4684,-0.1394,0.421,0.4316,-0.3106,-0.5448,-0.5132,-0.6368,...,1,-1,-1.0,-1.0,-1.0,1.0,-1,0.25,0.5,0.0624,0.3438,0.25,0.25,0.625,0.25,0.5312,0.4376,0.4688,0.5626,0.5938,0.3438,0.5626,0.25,1.0,0.9376,0.3438,0.2812,-0.0312,0.4376,0.2812,0.1562,0.3124,0.25,-0.0938,0.1562,0.3124,0.3124,0.2188,-0.25,'3'


In [None]:
df.shape

(7797, 618)

In [None]:
# Extract the class target variable into a new variable called y using the .pop() method:
y = df.pop('class')

Split the data into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=188)
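
As a quick sanity check on the split sizes, the arithmetic can be sketched in plain Python. This assumes that scikit-learn rounds the test share up to a whole number of rows and gives the remainder to the training set when only test_size is passed; the variable names are illustrative and not part of the notebook.

```python
import math

n_samples = 7797   # number of rows reported by df.shape
test_size = 0.3    # fraction passed to train_test_split

# Assumption: the test share is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the split holds 5457 training rows and 2340 testing rows.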

Instantiate RandomForestClassifier with random_state=42 and then fit the model with the training set

In [None]:
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

Predict the outcome of the training and testing sets with the .predict() method and save the results in two variables called train_preds and test_preds

In [None]:
train_preds = rf_model.predict(X_train)
test_preds = rf_model.predict(X_test)

Calculate the accuracy score on the training and testing sets, save the results in two variables called train_acc and test_acc respectively, and print their values

In [None]:
train_acc = accuracy_score(y_train, train_preds)
test_acc = accuracy_score(y_test, test_preds)

In [None]:
print(train_acc)
print(test_acc)

1.0
0.9320512820512821


Our model achieved a perfect accuracy score of 1 on the training set but 0.93 on the testing set. This means our model is overfitting and is not general enough. The ideal situation would be for the model to achieve a very similar, high-accuracy score on both the training and testing sets.
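
The size of the overfitting can be made explicit as the difference between the two scores. A minimal sketch, using the values printed above; the helper name overfit_gap is hypothetical and only serves this illustration.

```python
def overfit_gap(train_acc: float, test_acc: float) -> float:
    """Return the difference between training and testing accuracy."""
    return train_acc - test_acc


# Accuracy scores printed above for rf_model.
gap = overfit_gap(1.0, 0.9320512820512821)
print(round(gap, 4))  # 0.0679
```

A gap close to zero together with a high testing score is what we are aiming for.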

As a result, we will tune the hyperparameters of our model to find values that give a high accuracy score without overfitting.

In [None]:
# Instantiate another RandomForestClassifier with random_state=42 and n_estimators=20, and then fit the model with the training set:
rf_model2 = RandomForestClassifier(random_state=42, n_estimators=20)
rf_model2.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds2 and test_preds2

In [None]:
train_preds2 = rf_model2.predict(X_train)
test_preds2 = rf_model2.predict(X_test)

# Calculate the accuracy score for the training and testing sets and save the results in two new variables called train_acc2 and test_acc2:
train_acc2 = accuracy_score(y_train, train_preds2)
test_acc2 = accuracy_score(y_test, test_preds2)

# Print the accuracy scores: train_acc2 and test_acc2:
print(train_acc2)
print(test_acc2)

1.0
0.9226495726495727


The accuracy score on the testing set decreased to 0.92, so the gap between the training and testing scores is now larger than it was for rf_model

Instantiate another RandomForestClassifier with random_state=42 and n_estimators=50, and then fit the model with the training set

In [None]:
rf_model3 = RandomForestClassifier(random_state=42, n_estimators=50)
rf_model3.fit(X_train, y_train)

# Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds3 and test_preds3:
train_preds3 = rf_model3.predict(X_train)
test_preds3 = rf_model3.predict(X_test)

# Calculate the accuracy score for the training and testing sets and save the results in two new variables called train_acc3 and test_acc3:
train_acc3 = accuracy_score(y_train, train_preds3)
test_acc3 = accuracy_score(y_test, test_preds3)

# Print the accuracy scores: train_acc3 and test_acc3:
print(train_acc3)
print(test_acc3)

1.0
0.926923076923077


This output shows that the model is still overfitting and that rf_model gives a better result than rf_model2 and rf_model3. Hence we will keep the default value of 100 as our value for n_estimators
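
The fit, predict, and score steps repeated above can be wrapped in a small helper and run in a loop. The sketch below is one possible way to do it, not the notebook's method: fit_and_score, X_demo, y_demo, and the other names are hypothetical, and it runs on synthetic data from make_classification so that it is self-contained. Its numbers are therefore not comparable to the ISOLET results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own (not the ISOLET data).
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     n_informative=10, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=188)


def fit_and_score(**params):
    """Fit a forest with the given hyperparameters; return (train_acc, test_acc)."""
    model = RandomForestClassifier(random_state=42, **params)
    model.fit(X_tr, y_tr)
    return (accuracy_score(y_tr, model.predict(X_tr)),
            accuracy_score(y_te, model.predict(X_te)))


# Try the same n_estimators values as in the cells above.
results = {n: fit_and_score(n_estimators=n) for n in (20, 50, 100)}
for n, (train_acc, test_acc) in results.items():
    print(n, round(train_acc, 3), round(test_acc, 3))
```

To use it on the real data, the helper would be pointed at X_train, X_test, y_train, and y_test instead of the synthetic arrays.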

Now we will proceed to tuning max_depth

In [None]:
# Instantiate RandomForestClassifier with random_state=42, n_estimators=100, max_depth=5 and then fit the model with the training set:
rf_model4 = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=5)
rf_model4.fit(X_train, y_train)

# Make predictions on the training and testing sets with .predict() and save the results into two new variables called train_preds4 and test_preds4:
train_preds4 = rf_model4.predict(X_train)
test_preds4 = rf_model4.predict(X_test)
# Calculate the accuracy score for the training and testing sets and save the results in two new variables called train_acc4 and test_acc4:
train_acc4 = accuracy_score(y_train, train_preds4)
test_acc4 = accuracy_score(y_test, test_preds4)
# Print the accuracy scores: train_acc4 and test_acc4:
print(train_acc4)
print(test_acc4)

0.8704416345977644
0.8311965811965812


In [None]:
# Instantiate RandomForestClassifier with random_state=42, n_estimators=100, max_depth=10 and then fit the model with the training set:
rf_model5 = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=10)
rf_model5.fit(X_train, y_train)

train_preds5 = rf_model5.predict(X_train)
test_preds5 = rf_model5.predict(X_test)
train_acc5 = accuracy_score(y_train, train_preds5)
test_acc5 = accuracy_score(y_test, test_preds5)
print(train_acc5)
print(test_acc5)

0.9864394355873191
0.9260683760683761


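max_depth=5 gives a smaller gap between the training and testing scores (about 0.04, versus about 0.06 for max_depth=10), although its testing accuracy is lower (0.83 versus 0.93). Since our goal is to reduce overfitting, we will take max_depth=5 as our value for max_depth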
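
One way to see why a shallow depth costs accuracy: a binary tree of depth d has at most 2^d leaves. The arithmetic sketch below assumes 26 classes, one per letter of the alphabet, which the notebook does not state explicitly; the variable names are illustrative.

```python
n_classes = 26  # assumption: one class per letter of the alphabet

# Upper bound on the number of leaves of a binary tree for each depth tried.
leaves = {max_depth: 2 ** max_depth for max_depth in (5, 10)}

for max_depth, max_leaves in leaves.items():
    print(max_depth, max_leaves, round(max_leaves / n_classes, 1))
```

With max_depth=5 each tree has at most 32 leaves, barely more than the assumed number of classes, so a single tree has little room to separate them and the forest has to rely on averaging many such trees.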

Now we will proceed to tuning min_samples_leaf

In [None]:
# Instantiate RandomForestClassifier with random_state=42, n_estimators=100, max_depth=5, and min_samples_leaf=10, and then fit the model with the training set:
rf_model6 = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=5, min_samples_leaf=10)
rf_model6.fit(X_train, y_train)

train_preds6 = rf_model6.predict(X_train)
test_preds6 = rf_model6.predict(X_test)
train_acc6 = accuracy_score(y_train, train_preds6)
test_acc6 = accuracy_score(y_test, test_preds6)
print(train_acc6)
print(test_acc6)

0.8697086311159978
0.8290598290598291


In [None]:
# Instantiate RandomForestClassifier with random_state=42, n_estimators=100, max_depth=5, and min_samples_leaf=50, and then fit the model with the training set:
rf_model7 = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=5, min_samples_leaf=50)
rf_model7.fit(X_train, y_train)

train_preds7 = rf_model7.predict(X_train)
test_preds7 = rf_model7.predict(X_test)
train_acc7 = accuracy_score(y_train, train_preds7)
test_acc7 = accuracy_score(y_test, test_preds7)
print(train_acc7)
print(test_acc7)

0.862561847168774
0.8282051282051283


min_samples_leaf=50 reduces the overfitting slightly (a gap of about 0.034, versus about 0.041 for min_samples_leaf=10), hence we take it as our value

Finally we will proceed to tuning max_features
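
Before trying values, it may help to see what a float max_features means in terms of feature counts. The sketch below assumes that scikit-learn treats a float as a fraction of the features and truncates the result to an integer, and that the square-root rule (the 'auto' setting shown in the model output above, for classifiers) is truncated the same way. The names are illustrative.

```python
import math

n_features = 617  # 618 columns in df.shape minus the class column

# Assumption: a float max_features is a fraction of the features, truncated
# to an integer; the square-root rule is truncated the same way.
per_split = {fraction: max(1, int(fraction * n_features))
             for fraction in (0.5, 0.3)}
per_split["sqrt"] = max(1, int(math.sqrt(n_features)))

for setting, count in per_split.items():
    print(setting, count)
```

Under these assumptions, max_features=0.5 lets each split consider 308 features and max_features=0.3 lets it consider 185, compared with 24 for the square-root rule.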

In [None]:
# Instantiate RandomForestClassifier with random_state=42, n_estimators=100, max_depth=5, min_samples_leaf=50, and max_features=0.5, and then fit the model with the training set:
rf_model8 = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=5, min_samples_leaf=50, max_features=0.5)
rf_model8.fit(X_train, y_train)

train_preds8 = rf_model8.predict(X_train)
test_preds8 = rf_model8.predict(X_test)
train_acc8 = accuracy_score(y_train, train_preds8)
test_acc8 = accuracy_score(y_test, test_preds8)
print(train_acc8)
print(test_acc8)

0.7425325270295033
0.7175213675213675


In [None]:
# Instantiate RandomForestClassifier with random_state=42, n_estimators=100, max_depth=5, min_samples_leaf=50, and max_features=0.3, and then fit the model with the training set:
rf_model9 = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=5, min_samples_leaf=50, max_features=0.3)
rf_model9.fit(X_train, y_train)

train_preds9 = rf_model9.predict(X_train)
test_preds9 = rf_model9.predict(X_test)
train_acc9 = accuracy_score(y_train, train_preds9)
test_acc9 = accuracy_score(y_test, test_preds9)
print(train_acc9)
print(test_acc9)

0.7311709730621221
0.697008547008547


This final set of hyperparameters does not improve on the results we found with n_estimators=100, max_depth=5, min_samples_leaf=50, max_features=0.5: both accuracy scores are lower and the gap between them is larger.
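
To compare all nine models at a glance, the scores printed in the cells above can be collected and ranked by the gap between training and testing accuracy. This is a plain-Python sketch; the dictionary and variable names are illustrative.

```python
# (train_acc, test_acc) as printed in the cells above.
scores = {
    "rf_model": (1.0, 0.9320512820512821),
    "rf_model2": (1.0, 0.9226495726495727),
    "rf_model3": (1.0, 0.926923076923077),
    "rf_model4": (0.8704416345977644, 0.8311965811965812),
    "rf_model5": (0.9864394355873191, 0.9260683760683761),
    "rf_model6": (0.8697086311159978, 0.8290598290598291),
    "rf_model7": (0.862561847168774, 0.8282051282051283),
    "rf_model8": (0.7425325270295033, 0.7175213675213675),
    "rf_model9": (0.7311709730621221, 0.697008547008547),
}

# Gap between training and testing accuracy for each model.
gaps = {name: train - test for name, (train, test) in scores.items()}
smallest_gap = min(gaps, key=gaps.get)
best_test = max(scores, key=lambda name: scores[name][1])

# Print the models from the smallest gap to the largest.
for name in sorted(gaps, key=gaps.get):
    print(f"{name:10s} test={scores[name][1]:.4f} gap={gaps[name]:.4f}")
print(smallest_gap, best_test)  # rf_model8 rf_model
```

The ranking shows the trade-off: rf_model8 has the smallest gap, while the default rf_model has the highest testing accuracy.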

We built several RandomForestClassifier models that predict the letters spoken from audio signals. We tried several values for the hyperparameters n_estimators, max_depth, min_samples_leaf, and max_features. The best combination of hyperparameters we came up with is n_estimators=100, max_depth=5, min_samples_leaf=50, and max_features=0.5.

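We achieved a final accuracy score of 0.74 for the training set and 0.72 for the testing set. The gap between the two shrank from about 0.07 for the default model to about 0.03, but the testing accuracy is also well below the 0.93 of the default model. The model is still slightly overfitting and could still be improved.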

**Our selected model is rf_model8**
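
The manual search above could also be automated. The sketch below uses scikit-learn's GridSearchCV as one possible approach, which is not what this notebook did. It runs on synthetic data (X_demo and y_demo are hypothetical stand-ins for the ISOLET features and labels), and the grid values and n_estimators=20 are chosen only to keep it fast. Note that GridSearchCV selects the combination with the highest mean cross-validated score, not the smallest gap between training and testing accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs on its own (not the ISOLET data).
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=10, n_classes=3,
                                     random_state=0)

# A small illustrative grid over two of the hyperparameters tuned above.
param_grid = {"max_depth": [5, 10], "min_samples_leaf": [1, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_estimators=20),
    param_grid,
    cv=3,
    scoring="accuracy",
    return_train_score=True,  # keeps training scores so the gap can be inspected
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The training and validation scores for every combination are available in search.cv_results_, so the gap criterion used in this notebook could still be applied afterwards.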