# **Lab 4: Random Forest & ExtraTrees**

In the fourth lecture, you were introduced to Random Forest and ExtraTrees. In this lab, we will see how to train such models using sklearn.


## Exercise 2: Binary Classification with Automated Hyperparameter Tuning

We are going to use a modified version of the Adult dataset from the UCI Machine Learning Repository, which contains income census data for 30718 Americans (https://archive.ics.uci.edu/ml/datasets/Adult).




The steps are:
1.   Load Dataset
2.   Hyperparameter Tuning with Grid Search
3.   Hyperparameter Tuning with Random Search


### 1. Load Dataset

**[1.1]** Let's install the specific versions of the packages to be used

In [None]:
#!pip install numpy==1.18.5
#!pip install pandas==1.0.5
#!pip install scikit-learn=="0.22.2.post1"
#!pip install matplotlib==3.2.2
#!pip install altair==4.1.0

**[1.2]** Task: Import the pandas and numpy package

In [None]:
# Placeholder for student's code (2 lines of code)
# Task: Import the pandas and numpy package

In [1]:
# Solution
import pandas as pd
import numpy as np

# Load the raw data and drop the columns not used in this exercise
df = pd.read_csv('../data/raw/adult.csv')
df = df.drop(['fnlwgt', 'education', 'capital-gain', 'capital-loss', 'native-country', 'race'], axis=1)
df.head()

Unnamed: 0,age,workclass,educational-num,marital-status,occupation,relationship,gender,hours-per-week,income
0,25,Private,7,Never-married,Machine-op-inspct,Own-child,Male,40,<=50K
1,38,Private,9,Married-civ-spouse,Farming-fishing,Husband,Male,50,<=50K
2,28,Local-gov,12,Married-civ-spouse,Protective-serv,Husband,Male,40,>50K
3,44,Private,10,Married-civ-spouse,Machine-op-inspct,Husband,Male,40,>50K
4,18,?,10,Never-married,?,Own-child,Female,30,<=50K


In [3]:
df_cleaned = df.copy()
cat_cols = ['workclass','marital-status','occupation','relationship','gender']
df_cleaned = pd.get_dummies(df_cleaned, columns=cat_cols)
df_cleaned

Unnamed: 0,age,educational-num,hours-per-week,income,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,...,occupation_Tech-support,occupation_Transport-moving,relationship_Husband,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,gender_Female,gender_Male
0,25,7,40,<=50K,False,False,False,False,True,False,...,False,False,False,False,False,True,False,False,False,True
1,38,9,50,<=50K,False,False,False,False,True,False,...,False,False,True,False,False,False,False,False,False,True
2,28,12,40,>50K,False,False,True,False,False,False,...,False,False,True,False,False,False,False,False,False,True
3,44,10,40,>50K,False,False,False,False,True,False,...,False,False,True,False,False,False,False,False,False,True
4,18,10,30,<=50K,True,False,False,False,False,False,...,False,False,False,False,False,True,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,12,38,<=50K,False,False,False,False,True,False,...,True,False,False,False,False,False,False,True,True,False
48838,40,9,40,>50K,False,False,False,False,True,False,...,False,False,True,False,False,False,False,False,False,True
48839,58,9,40,<=50K,False,False,False,False,True,False,...,False,False,False,False,False,False,True,False,True,False
48840,22,9,20,<=50K,False,False,False,False,True,False,...,False,False,False,False,False,True,False,False,False,True
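To make the one-hot encoding step easier to follow, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are made up for illustration and are not part of the Adult dataset.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (made-up values).
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "gender": ["Male", "Female", "Male"],
})

# Each category becomes its own indicator column named <column>_<category>;
# columns not listed in `columns` are left unchanged.
toy_encoded = pd.get_dummies(toy, columns=["gender"])
print(toy_encoded.columns.tolist())
# ['age', 'gender_Female', 'gender_Male']
```

This is why `df_cleaned` has many more columns than `df`: every distinct value of the five categorical columns gets its own indicator column.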


**[1.3]** Task: Load the features (X) and target (y) variables for the training, validation and testing sets

In [None]:
# Placeholder for student's code (6 lines of code)
# Task: Load the features (X) and target (y) variables for the training, validation and testing sets

In [4]:
X = df_cleaned.drop('income', axis=1)
y = df_cleaned['income']

In [5]:
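from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as the testing set
X_data, X_test, y_data, y_test = train_test_split(X, y, test_size=0.2, random_state=8)
# Class balance of the testing target (only displayed when it is the last expression in a cell)
y_test.value_counts(normalize=True)
# Split the remaining 80% into training (64% of all rows) and validation (16% of all rows)
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=8)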

**[1.4]** Task: Display the dimensions (shape) of the features for the training, validation and testing sets

In [None]:
# Placeholder for student's code (3 lines of code)
# Task: Display the dimensions (shape) of the features for the training, validation and testing sets

In [6]:
# Solution
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(31258, 42)
(7815, 42)
(9769, 42)
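The three row counts above follow from applying a 20% hold-out twice. The sketch below reproduces them on a plain index array instead of the real data; the name `n_rows` is an assumption, taken as the sum of the three printed row counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed total number of rows: 31258 + 7815 + 9769 from the shapes printed above.
n_rows = 48842
idx = np.arange(n_rows)

# First split: 20% for testing. Second split: 20% of the remainder for validation.
rest, test = train_test_split(idx, test_size=0.2, random_state=8)
train, val = train_test_split(rest, test_size=0.2, random_state=8)

print(len(train), len(val), len(test))
# 31258 7815 9769
print([round(len(part) / n_rows, 2) for part in (train, val, test)])
# [0.64, 0.16, 0.2]
```

So the training, validation and testing sets hold roughly 64%, 16% and 20% of the rows.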


### 2. Hyperparameter Tuning with Grid Search

**[2.1]** Task: Import GridSearchCV from sklearn.model_selection


In [None]:
# Placeholder for student's code (1 line of code)
# Task: Import GridSearchCV from sklearn.model_selection

In [7]:
# Solution
from sklearn.model_selection import GridSearchCV

**[2.2]** Let's create a dictionary containing the grid search parameters


In [8]:
hyperparams_grid = {
    'n_estimators': np.arange(10, 100, 20),
    'max_depth': np.arange(5, 30, 5),
    'min_samples_leaf': np.arange(2, 20, 4)
    }
hyperparams_grid

{'n_estimators': array([10, 30, 50, 70, 90]),
 'max_depth': array([ 5, 10, 15, 20, 25]),
 'min_samples_leaf': array([ 2,  6, 10, 14, 18])}

**[2.3]** Task: Import RandomForestClassifier from sklearn.ensemble and instantiate it as rf with random_state=8

In [None]:
# Placeholder for student's code (2 lines of code)
# Task: Import RandomForestClassifier from sklearn.ensemble and instantiate it as rf with random_state=8

In [9]:
# Solution
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=8)

**[2.4]** Task: Instantiate a GridSearchCV with the hyperparameter grid and the random forest model

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Instantiate a GridSearchCV with the hyperparameter grid and the random forest model

In [10]:
# Solution
grid_search_rf = GridSearchCV(rf, hyperparams_grid, cv=2, verbose=1)

**[2.5]** Task: Fit the GridSearchCV on the training set

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Fit the GridSearchCV on the training set

In [11]:
# Solution
grid_search_rf.fit(X_train, y_train)

Fitting 2 folds for each of 125 candidates, totalling 250 fits
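The numbers in this message can be checked by hand: the grid has 5 values for each of the 3 hyperparameters, and every combination is fitted once per fold. A minimal sketch using `ParameterGrid`, which enumerates the same combinations a grid search iterates over (the variable names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in [2.2].
grid = {
    "n_estimators": np.arange(10, 100, 20),
    "max_depth": np.arange(5, 30, 5),
    "min_samples_leaf": np.arange(2, 20, 4),
}

n_candidates = len(ParameterGrid(grid))  # 5 * 5 * 5 combinations
n_folds = 2                              # cv=2 in [2.4]
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
# 125 250
```

Because the number of candidates is a product, adding one more value to any hyperparameter adds 25 candidates (50 fits) here, which is why grid search becomes expensive quickly.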



**[2.6]** Task: Display the best set of hyperparameters

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Display the best set of hyperparameters

In [12]:
# Solution
grid_search_rf.best_params_

{'max_depth': np.int64(20),
 'min_samples_leaf': np.int64(6),
 'n_estimators': np.int64(50)}
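Besides `best_params_`, a fitted search also stores the cross-validation score of every candidate in `cv_results_`. The sketch below shows one way to inspect it; it uses a small synthetic dataset and a reduced grid (both assumptions, chosen so that it runs in a few seconds) rather than the lab's data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the lab data (assumption, for speed).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=8)

small_grid = {"n_estimators": [10, 30], "max_depth": [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=8), small_grid, cv=2)
search.fit(X_demo, y_demo)

# One row per candidate, ranked by mean cross-validated accuracy.
results = pd.DataFrame(search.cv_results_)
ranked = results.sort_values("rank_test_score")
print(ranked[["params", "mean_test_score", "rank_test_score"]])

# best_score_ is the mean cross-validated score of the best candidate.
print(search.best_score_)
```

Looking at the full ranking is useful to see whether the best candidate is clearly ahead or only marginally better than its neighbours.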

**[2.7]** Task: Display the accuracy score on all 3 sets

In [None]:
# Placeholder for student's code (3 lines of code)
# Task: Display the accuracy score on all 3 sets

In [13]:
# Solution
print(grid_search_rf.score(X_train, y_train))
print(grid_search_rf.score(X_val, y_val))
print(grid_search_rf.score(X_test, y_test))

0.853477509757502
0.838131797824696
0.8419490224178524
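Calling `.score()` on a fitted search object evaluates the best model, which by default is refitted on the whole training set after the search. With no `scoring` argument, the metric for a classifier is accuracy. A minimal sketch on synthetic data (an assumption, used only to keep the example fast) illustrating this equivalence:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the lab data (assumption, for speed).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=8)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=8)

search = GridSearchCV(RandomForestClassifier(random_state=8), {"max_depth": [3, 5]}, cv=2)
search.fit(X_tr, y_tr)

# Both lines measure the accuracy of the refitted best estimator.
score_via_search = search.score(X_te, y_te)
score_via_estimator = accuracy_score(y_te, search.best_estimator_.predict(X_te))
print(score_via_search == score_via_estimator)
```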


### 3. Hyperparameter Tuning with Random Search

**[3.1]** Task: Import randint from scipy.stats

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Import randint from scipy.stats

In [14]:
# Solution
from scipy.stats import randint

**[3.2]** Let's define the distributions from which the hyperparameter values will be randomly sampled

In [15]:
hyperparams_dist = {
    'n_estimators': randint(10, 100),
    'max_depth': randint(5, 30),
    'min_samples_leaf': randint(2, 20)
    }
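Each `randint(low, high)` object is a discrete uniform distribution over the integers from `low` up to `high - 1`, so the upper bound is excluded. A small sketch that draws samples from one of these distributions to check the range (the sample size of 1000 is arbitrary):

```python
from scipy.stats import randint

# Same distribution as 'n_estimators' above.
dist = randint(10, 100)

# Draw 1000 integers; fixing random_state makes the draw repeatable.
samples = dist.rvs(size=1000, random_state=8)
print(samples.min(), samples.max())
```

Unlike the grid in section 2, which only tries 5 fixed values per hyperparameter, random search can land on any integer in these ranges.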

**[3.3]** Task: Import RandomizedSearchCV and KFold from sklearn.model_selection

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Import RandomizedSearchCV and KFold from sklearn.model_selection

In [16]:
# Solution
from sklearn.model_selection import RandomizedSearchCV, KFold

**[3.4]** Task: Instantiate a KFold with 5 splits

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Instantiate a KFold with 5 splits

In [17]:
# Solution
kf_cv = KFold(n_splits=5)
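Without `shuffle=True`, `KFold` cuts the rows into consecutive blocks, and each block serves as the held-out fold once. The sketch below shows this on 10 dummy rows (the array `rows` is illustrative only):

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)
kf = KFold(n_splits=5)

# Collect the held-out indices of each of the 5 folds.
folds = [test_idx.tolist() for _, test_idx in kf.split(rows)]
print(folds)
# [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Note that passing an integer such as `cv=2` to a search with a classifier uses stratified folds, whereas an explicit `KFold` object does not stratify by class.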

**[3.5]** Task: Instantiate a RandomizedSearchCV with the random forest model, the hyperparameter distributions, the KFold object and random_state=8

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Instantiate a RandomizedSearchCV with the random forest model, the hyperparameter distributions, the KFold object and random_state=8

In [18]:
# Solution
random_search_rf = RandomizedSearchCV(rf, hyperparams_dist, random_state=8, cv=kf_cv, verbose=1)

**[3.6]** Task: Fit the RandomizedSearchCV on the training set

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Fit the RandomizedSearchCV on the training set

In [19]:
# Solution
random_search_rf.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
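The 10 candidates come from the default `n_iter=10` of `RandomizedSearchCV`: only 10 combinations are sampled from the distributions, and each is fitted on the 5 folds. A minimal sketch using `ParameterSampler` to show what such a sample looks like (the variable names are illustrative, and the printed combinations are not claimed to be the ones tried in the lab run):

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Same distributions as in [3.2].
dist = {
    "n_estimators": randint(10, 100),
    "max_depth": randint(5, 30),
    "min_samples_leaf": randint(2, 20),
}

# Sample 10 hyperparameter combinations, as a random search with n_iter=10 would.
candidates = list(ParameterSampler(dist, n_iter=10, random_state=8))
n_fits = len(candidates) * 5  # 5 folds from the KFold object

for params in candidates[:3]:
    print(params)
print(len(candidates), n_fits)
# 10 50
```

Compared with the 250 fits of the grid search, the cost here is controlled directly through `n_iter` rather than by the size of the search space.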


**[3.7]** Task: Display the best set of hyperparameters

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Display the best set of hyperparameters

In [20]:
# Solution
random_search_rf.best_params_

{'max_depth': 20, 'min_samples_leaf': 4, 'n_estimators': 69}

**[3.8]** Task: Display the accuracy score on all 3 sets

In [None]:
# Placeholder for student's code (3 lines of code)
# Task: Display the accuracy score on all 3 sets

In [21]:
# Solution
print(random_search_rf.score(X_train, y_train))
print(random_search_rf.score(X_val, y_val))
print(random_search_rf.score(X_test, y_test))

0.8596199372960522
0.8410748560460652
0.8440986794963661
