# Chapter 7 - Ensemble Learning

# Exercise 1

* This notebook works through the exercise from Chapter 7 of the "Hands-on ML" book.
* The exercise is as follows:
    * Load the MNIST dataset (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing).
    * Then train various classifiers, such as a random forest classifier, an extra-trees classifier, and an SVM classifier. 
    * Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. 
    * Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

## Plan
* Experiments in previous notebooks showed that models trained on the augmented dataset perform much better than those trained on the original dataset, so that's what I want to use for this exercise.
* The plan is as follows:
    * Split the data into training set, validation set and test set. 
    * Augment the entire dataset (instead of just the training set as before) and save it as CSV for future use.     
    * Train the following classifiers, (we've trained them in previous notebooks)
        * Logistic Regression
        * SVC
        * Random Forest
        * KNN
        * Extra Trees Classifier
        * Gradient Boosting Classifier
    * We'll just try `Hard Voting` for now, since `Soft Voting` needs class probabilities and computing those for `SVM` can be time-consuming.
    * For the first version I'll train them with default params, then maybe retrain them with the best params I got in previous notebooks.
    * Compare the performance against the individual classifiers and the best classifier from the previous notebook.
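
To make the hard/soft distinction in the plan concrete, here is a minimal NumPy sketch on made-up probabilities (the three classifiers and their numbers are hypothetical, not results from this notebook). Hard voting counts each model's top class, while soft voting averages the class probabilities, so a single very confident model can flip the outcome.

```python
import numpy as np

# Hypothetical predicted probabilities, shape (n_classifiers, n_samples, n_classes).
# Two models lean weakly toward class 1; one model is very confident in class 0.
probas = np.array([
    [[0.45, 0.55]],
    [[0.45, 0.55]],
    [[0.95, 0.05]],
])
n_classes = probas.shape[2]

# Hard voting: each classifier casts one vote for its most probable class.
votes = probas.argmax(axis=2)  # shape (n_classifiers, n_samples)
hard_pred = np.array([
    np.bincount(sample_votes, minlength=n_classes).argmax()
    for sample_votes in votes.T
])

# Soft voting: average the probabilities across classifiers, then take the argmax.
soft_pred = probas.mean(axis=0).argmax(axis=1)

print("hard:", hard_pred, "soft:", soft_pred)
```

In this toy case the two schemes disagree: the majority picks class 1, but the averaged probabilities favour class 0.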



## Import Libraries

In [5]:
from sklearn.preprocessing import Binarizer, OneHotEncoder, MinMaxScaler, StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform
from pathlib import Path
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


from sklearn.metrics import ConfusionMatrixDisplay, f1_score, roc_auc_score, roc_curve, accuracy_score
from sklearn.dummy import DummyClassifier


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import joblib
import json
import gdown
import os
import sys

## Import Custom Scripts

In [None]:
# Get the directory of the notebook
notebook_dir = os.getcwd()

# Adjust this if the 'api' folder is not directly under the notebook directory
project_root = os.path.abspath(os.path.join(notebook_dir, ".."))  

# Add to Python path
sys.path.append(project_root)

from api.utils.common import augment_dataset, download_data_from_gdrive


ModuleNotFoundError: No module named 'api'

## Download Data

In [None]:
## download file from google drive
file_id = '1-4Z4OJb1Q9g0L5y5XZf5rQ4z4zD9WpJ6'
raw_mnist_data_path = Path("../data/")

In [44]:
data_dir = project_root + '/data/'

In [45]:
download_data_from_gdrive(data_dir=Path(data_dir))

Retrieving folder contents


Processing file 115anHiDBal3ihxnyC9xVclJOZPkRedS0 augmented_train_X.csv
Processing file 115eSiSJW4CG9FzTNrl14GAhY37sl9AST augmented_train_Y.csv
Processing file 116w163kTklxezTnWdD6taquAHQl_dOmK mnist_test_set.csv
Processing file 115BUCOT_NNbrP_yrHVFmYPLInYBAM1qm mnist_train_set.csv
Processing file 118Z6JMbaScMP9EmoOrTbVEyXFBi5xv5N raw_mnist_data.csv
Processing file 115pvnBtRXTGpCY7BHVxW_tPm3hSiH5tv user_input.csv
Processing file 115gS3mb1K8SH3R5vYVZvRS37eZvN_6CK user_prediction_request.csv


Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From (original): https://drive.google.com/uc?id=115anHiDBal3ihxnyC9xVclJOZPkRedS0
From (redirected): https://drive.google.com/uc?id=115anHiDBal3ihxnyC9xVclJOZPkRedS0&confirm=t&uuid=01e9ce2e-64c9-4e2f-b643-b99c5358810d
To: /Users/gaurangdave/workspace/mnist_digits_recognition/data/augmented_train_X.csv
100%|██████████| 511M/511M [00:14<00:00, 35.5MB/s] 
Downloading...
From: https://drive.google.com/uc?id=115eSiSJW4CG9FzTNrl14GAhY37sl9AST
To: /Users/gaurangdave/workspace/mnist_digits_recognition/data/augmented_train_Y.csv
100%|██████████| 560k/560k [00:00<00:00, 12.4MB/s]
Downloading...
From: https://drive.google.com/uc?id=116w163kTklxezTnWdD6taquAHQl_dOmK
To: /Users/gaurangdave/workspace/mnist_digits_recognition/data/mnist_test_set.csv
100%|██████████| 25.6M/25.6M [00:00<00:00, 35.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=115BUCOT_NNbrP_yrHVFmYPLInYBAM

Data downloaded successfully!



Download completed


## Read Data

In [46]:
raw_data = pd.read_csv(data_dir + 'raw_mnist_data.csv')

In [47]:
raw_data.head()

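| | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | pixel10 | ... | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 | pixel784 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 |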


In [48]:
raw_data.shape

(70000, 785)

## Train, Validation & Test Split

In [49]:
## split the data into variables and target
X = raw_data.drop(columns=['class'])
y = raw_data['class']


In [50]:
## split the data into train, validation and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

X_train.shape, X_val.shape, X_test.shape

((44800, 784), (11200, 784), (14000, 784))
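
The shapes above follow from applying `test_size=0.2` twice: 20% of 70,000 rows are held out for testing, then 20% of the remaining 56,000 for validation. The sketch below reproduces that arithmetic in plain Python (the helper name is made up for illustration). It also shows that passing absolute counts instead of fractions would give the 50,000/10,000/10,000 split suggested in the exercise. It assumes that a fractional `test_size` is rounded up, which is how I understand `train_test_split` to behave.

```python
import math

def two_stage_split_sizes(n_samples, test_size, val_size):
    """Return (n_train, n_val, n_test) for two chained hold-out splits."""
    def held_out(n, size):
        # An int is taken as an absolute row count, a float as a fraction of n
        # (assumed to be rounded up).
        return size if isinstance(size, int) else math.ceil(size * n)

    n_test = held_out(n_samples, test_size)
    n_val = held_out(n_samples - n_test, val_size)
    return n_samples - n_test - n_val, n_val, n_test

# Fractions, as used in this notebook.
print(two_stage_split_sizes(70000, 0.2, 0.2))
# Absolute counts, matching the split suggested by the exercise.
print(two_stage_split_sizes(70000, 10000, 10000))
```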

## Augment Data

In [None]:
augmented_X, augmented_y = augment_dataset(X_train, y_train)

Augmenting Data: 100%|██████████| 44800/44800 [00:00<00:00, 53074.09it/s]


converting lists to DataFrame and Series


In [None]:
augmented_X.shape, augmented_y.shape

((224000, 784), (224000,))
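
The 5x growth (44,800 to 224,000 rows) is consistent with keeping each original image and adding four copies shifted by one pixel, but the implementation of `augment_dataset` is not shown in this notebook, so that is an assumption. A minimal NumPy sketch of such a shift-based augmentation, with hypothetical function names, could look like this:

```python
import numpy as np

def shift_image(flat_image, dx, dy, side=28):
    """Shift a flattened side x side image by (dx, dy) pixels, padding with zeros."""
    img = np.asarray(flat_image).reshape(side, side)
    shifted = np.zeros_like(img)
    # Source and destination windows for rows (dy) and columns (dx).
    src_rows = slice(max(0, -dy), side - max(0, dy))
    dst_rows = slice(max(0, dy), side - max(0, -dy))
    src_cols = slice(max(0, -dx), side - max(0, dx))
    dst_cols = slice(max(0, dx), side - max(0, -dx))
    shifted[dst_rows, dst_cols] = img[src_rows, src_cols]
    return shifted.reshape(-1)

def augment_with_shifts(X, y, side=28):
    """Return the original images plus four one-pixel shifts (5x the rows)."""
    shifts = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    X_aug = np.concatenate([
        np.array([shift_image(row, dx, dy, side) for row in X])
        for dx, dy in shifts
    ])
    y_aug = np.tile(y, len(shifts))  # labels repeat once per shift
    return X_aug, y_aug

# Tiny demo on random "images"; the real data would be X_train.to_numpy().
X_demo = np.random.default_rng(0).integers(0, 256, size=(3, 784))
y_demo = np.array([5, 0, 4])
X_aug_demo, y_aug_demo = augment_with_shifts(X_demo, y_demo)
print(X_aug_demo.shape, y_aug_demo.shape)
```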

In [None]:
## save the augmented data
augmented_data = pd.concat([augmented_X, augmented_y], axis=1)
augmented_data.to_csv(data_dir + 'augmented_ensemle_learning_mnist_data.csv', index=False)

## Train Models using default params

In [None]:
model_dir = project_root + '/models/ensemble/'


### Logistic Regression

In [21]:
from sklearn.linear_model import LogisticRegression

# initialize LogisticRegression
logistic_regression = LogisticRegression(n_jobs=-1, random_state=42, max_iter=10000)

# create pipeline
logistic_regression_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("logistic_regression", logistic_regression)
])

In [23]:
## fit the model
logistic_regression_pipeline.fit(augmented_X, augmented_y)

In [25]:
## save the model
saved_location = joblib.dump(logistic_regression_pipeline, model_dir + 'logistic_regression_model.pkl')

### SVC

In [None]:
svc_base = SVC(probability=True, random_state=42)  # Enable predict_proba (needed later for soft voting)

# create pipeline
default_svc_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("svc", svc_base)
])

In [27]:
default_svc_pipeline.fit(augmented_X, augmented_y)

In [29]:
## save the model
saved_location = joblib.dump(default_svc_pipeline, model_dir + 'svc_model.pkl')

### Random Forest

In [28]:
rfc = RandomForestClassifier(random_state=42, n_jobs=-1)

# Create a pipeline
rfc_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("randomforest", rfc)
])

In [30]:
rfc_pipeline.fit(augmented_X, augmented_y)

In [33]:
## save the model
saved_location = joblib.dump(rfc_pipeline, model_dir + 'random_forest_model.pkl')

### KNN

In [32]:
knn = KNeighborsClassifier(n_jobs=-1)

# Create a pipeline
knn_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("knn", knn)
])

In [34]:
knn_pipeline.fit(augmented_X, augmented_y)

In [35]:
## save the model
saved_location = joblib.dump(knn_pipeline, model_dir + 'knn_model.pkl')

### Extra Trees Classifier

In [37]:
extra_trees = ExtraTreesClassifier(n_jobs=-1, random_state=42)

# Create a pipeline
extra_trees_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("extra_trees", extra_trees)
])

extra_trees_pipeline.fit(augmented_X, augmented_y)

In [38]:
## save the model
saved_location = joblib.dump(extra_trees_pipeline, model_dir + 'extra_trees_model.pkl')

### Gradient Boosting Classifier

In [39]:
## Gradient Boosting Classifier
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Create a pipeline
gb_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("gb", gb)
])

gb_pipeline.fit(augmented_X, augmented_y)

In [40]:
## save the model
saved_location = joblib.dump(gb_pipeline, model_dir + 'gradient_boosting_model.pkl')

### Ensemble with Voting Classifier (Hard Voting)


In [12]:
## read all saved details in case the notebook is restarted
model_dir = project_root + '/models/ensemble/'
data_dir = project_root + '/data/'

## read the augmented data from augmented_ensemle_learning_mnist_data.csv
augmented_data = pd.read_csv(data_dir + 'augmented_ensemle_learning_mnist_data.csv')

## split the data into variables and target
augmented_X = augmented_data.drop(columns=['0'])
augmented_y = augmented_data['0']


## read estimators from saved models
logistic_regression = joblib.load(model_dir + 'logistic_regression_model.pkl')
svc = joblib.load(model_dir + 'svc_model.pkl')
random_forest = joblib.load(model_dir + 'random_forest_model.pkl')
knn = joblib.load(model_dir + 'knn_model.pkl')
extra_trees = joblib.load(model_dir + 'extra_trees_model.pkl')
gradient_boosting = joblib.load(model_dir + 'gradient_boosting_model.pkl')  

In [12]:
voting_clf = VotingClassifier(
    estimators=[
        ("LogReg", logistic_regression),
        ("SVC", svc),
        ("RF", random_forest),
        ("KNN", knn),
        ("ExtraTrees", extra_trees),
        ("GB", gradient_boosting)
    ],
    voting="hard"  # Use "soft" if all models support probability predictions
)

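* `fit` is required: scikit-learn's `VotingClassifier` cannot predict until it has been fitted, even when the estimators passed to it are already trained.
* `fit` does not reuse the pre-trained pipelines. It clones each estimator and trains the clones on the data passed in, so the call below retrains all six models on the non-augmented `X_train`. Fitting on a small dummy or empty dataset would therefore not work, because the ensemble would only learn from that data.
* The validation scores below therefore come from models trained on `X_train`, not on the augmented data.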


In [13]:
voting_clf.fit(X_train, y_train)

In [14]:
## save the model
saved_location = joblib.dump(voting_clf, model_dir + 'voting_classifier_model.pkl')

### Prediction

In [15]:
## predict on the validation set
y_val_pred = voting_clf.predict(X_val)

In [16]:
# Compute metrics
accuracy = accuracy_score(y_val, y_val_pred)
weighted_f1 = f1_score(y_val, y_val_pred, average='weighted')
# Compute per-class F1 scores
per_class_f1_scores = f1_score(y_val, y_val_pred, average=None)
per_class_f1_dict = {f"Class_{i}": score for i, score in enumerate(per_class_f1_scores)}

In [17]:
## print the metrics
print(f"Accuracy: {accuracy}")
print(f"Weighted F1: {weighted_f1}")
print("Per-class F1 scores: ", per_class_f1_dict)


Accuracy: 0.9721428571428572
Weighted F1: 0.9720869370807289
Per-class F1 scores:  {'Class_0': 0.9777000437254044, 'Class_1': 0.9890710382513661, 'Class_2': 0.96800360522758, 'Class_3': 0.9669565217391304, 'Class_4': 0.975191700496166, 'Class_5': 0.9796708615682478, 'Class_6': 0.9770642201834863, 'Class_7': 0.9712793733681462, 'Class_8': 0.9649122807017544, 'Class_9': 0.9482185273159145}
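
For reference, the weighted F1 above is the mean of the per-class F1 scores, weighted by how many true samples each class has. The NumPy sketch below recomputes both quantities on a tiny made-up label set (the labels are illustrative, not from MNIST). It is meant to show the arithmetic, not to replace `f1_score`.

```python
import numpy as np

def per_class_f1(y_true, y_pred):
    """Return (classes, per-class F1, per-class support)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    scores, supports = [], []
    for c in classes:
        tp = np.sum((y_true == c) & (y_pred == c))  # correctly predicted as c
        fp = np.sum((y_true != c) & (y_pred == c))  # wrongly predicted as c
        fn = np.sum((y_true == c) & (y_pred != c))  # missed samples of c
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
        supports.append(np.sum(y_true == c))
    return classes, np.array(scores), np.array(supports)

# Made-up labels for three classes.
demo_true = [0, 0, 1, 1, 2, 2]
demo_pred_labels = [0, 1, 1, 1, 2, 0]

demo_classes, demo_f1, demo_support = per_class_f1(demo_true, demo_pred_labels)
demo_weighted_f1 = np.average(demo_f1, weights=demo_support)
print(dict(zip(demo_classes.tolist(), demo_f1.round(4).tolist())), round(float(demo_weighted_f1), 4))
```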


### Ensemble with Voting Classifier (Soft Voting)


Observations (hard voting):
* The hard-voting ensemble reaches an `Accuracy` of 0.9721 and a weighted `F1 Score` of 0.9721 on the validation set.

In [18]:
soft_voting_clf = VotingClassifier(
    estimators=[
        ("LogReg", logistic_regression),
        ("SVC", svc),
        ("RF", random_forest),
        ("KNN", knn),
        ("ExtraTrees", extra_trees),
        ("GB", gradient_boosting)
    ],
    voting="soft"  # Average predicted class probabilities; every estimator must support predict_proba
)

In [19]:
soft_voting_clf.fit(X_train, y_train)

In [20]:
## save the model
saved_location = joblib.dump(soft_voting_clf, model_dir + 'soft_voting_classifier_model.pkl')

In [21]:
## predict on the validation set
y_val_pred = soft_voting_clf.predict(X_val)

In [22]:
# Compute metrics
accuracy = accuracy_score(y_val, y_val_pred)
weighted_f1 = f1_score(y_val, y_val_pred, average='weighted')
# Compute per-class F1 scores
per_class_f1_scores = f1_score(y_val, y_val_pred, average=None)
per_class_f1_dict = {f"Class_{i}": score for i, score in enumerate(per_class_f1_scores)}

In [23]:
## print the metrics
print(f"Accuracy: {accuracy}")
print(f"Weighted F1: {weighted_f1}")
print("Per-class F1 scores: ", per_class_f1_dict)

Accuracy: 0.9727678571428572
Weighted F1: 0.9727278884463526
Per-class F1 scores:  {'Class_0': 0.9802544975866608, 'Class_1': 0.9852140077821012, 'Class_2': 0.9717668488160291, 'Class_3': 0.9698558322411533, 'Class_4': 0.9755213055303718, 'Class_5': 0.9739130434782609, 'Class_6': 0.9744292237442922, 'Class_7': 0.9722463139635733, 'Class_8': 0.967741935483871, 'Class_9': 0.9538606403013182}


Observations:
* Soft voting (accuracy 0.9728) is only marginally better than hard voting (0.9721) on the validation set.
* The voting classifier significantly outperforms most of the individual default estimators.
* The closest model was the default `SVC`, but it's still not better than the tuned `SVM` that we ended up using in prod, which had an accuracy and F1 score of 0.9897.
* The next step is to tune all the models and build an ensemble from the best versions.
    * For that we'll tune all the models using GridSearch except SVC, since we already have the best params for it and training is time-consuming.

## Hyper-param Tuning

* To save training time and stay focused on learning, we'll train the models with the best params found in the previous notebook.
* Even though the best params were found on non-augmented data, they are likely to perform well on augmented data too.
* We can add hyper-param tuning as a future enhancement.

### Logistic Regression

In [26]:
## Below are the hyperparameters for the models
#  Preprocessing: `normalize`
#  Solver : `newton-cg`
#  C : `0.1`
#  Penalty : `l2`
# initialize LogisticRegression
logistic_regression = LogisticRegression(solver="newton-cg", C=0.1, penalty="l2", n_jobs=-1, random_state=42, max_iter=10000)

# create pipeline
tuned_logistic_regression_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("logisticregression", logistic_regression)
])

In [27]:
## start timer
import time
start = time.time()
tuned_logistic_regression_pipeline.fit(augmented_X, augmented_y)
end = time.time()
print(f"Training took {end - start} seconds")

Training took 71.50335788726807 seconds
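
The start/stop timing pattern recurs in the cells that follow. As an optional alternative, a small context manager could wrap it. This is only a sketch: the `timed` helper and the `timings` dictionary are hypothetical names, and `time.perf_counter` replaces `time.time` because it is intended for measuring elapsed intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took and optionally record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label} took {elapsed:.1f} seconds")

# Demo with a short sleep standing in for a call such as pipeline.fit(...).
timings = {}
with timed("demo sleep", timings):
    time.sleep(0.01)
```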


In [28]:
## save the model
saved_location = joblib.dump(tuned_logistic_regression_pipeline, model_dir + 'tuned_logistic_regression_model.pkl')

In [29]:
## predict on the validation set
y_val_pred = tuned_logistic_regression_pipeline.predict(X_val)
# Compute metrics
accuracy = accuracy_score(y_val, y_val_pred)
weighted_f1 = f1_score(y_val, y_val_pred, average='weighted')
# Compute per-class F1 scores
per_class_f1_scores = f1_score(y_val, y_val_pred, average=None)
per_class_f1_dict = {f"Class_{i}": score for i, score in enumerate(per_class_f1_scores)}
## print the metrics
print(f"Accuracy: {accuracy}")
print(f"Weighted F1: {weighted_f1}")
print("Per-class F1 scores: ", per_class_f1_dict)


Accuracy: 0.9199107142857142
Weighted F1: 0.9196107770174043
Per-class F1 scores:  {'Class_0': 0.959825327510917, 'Class_1': 0.9521276595744681, 'Class_2': 0.9101019462465245, 'Class_3': 0.9083665338645418, 'Class_4': 0.9291479820627803, 'Class_5': 0.8826108134437408, 'Class_6': 0.940045766590389, 'Class_7': 0.9346204475647214, 'Class_8': 0.8726765799256505, 'Class_9': 0.8973172987974098}


### SVM

In [24]:
## training svc on the augmented data using best hyperparameters
svc = SVC(C=10, gamma="scale", kernel="rbf", random_state=42, probability=True)

# create pipeline

print("Creating pipeline...")
svc_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("svc", svc)
], verbose=True)

Creating pipeline...


In [25]:
svc_pipeline.fit(augmented_X, augmented_y.values.ravel())

[Pipeline] ............ (step 1 of 2) Processing scaler, total=   1.4s
[Pipeline] .............. (step 2 of 2) Processing svc, total=225.1min


In [26]:
## save the model
saved_location = joblib.dump(svc_pipeline, model_dir + 'tuned_svc_model.pkl')

In [31]:
## predict on the validation set
y_val_pred = svc_pipeline.predict(X_val)
# Compute metrics
accuracy = accuracy_score(y_val, y_val_pred)
weighted_f1 = f1_score(y_val, y_val_pred, average='weighted')
# Compute per-class F1 scores
per_class_f1_scores = f1_score(y_val, y_val_pred, average=None)
per_class_f1_dict = {f"Class_{i}": score for i, score in enumerate(per_class_f1_scores)}
## print the metrics
print(f"Accuracy: {accuracy}")
print(f"Weighted F1: {weighted_f1}")
print("Per-class F1 scores: ", per_class_f1_dict)


Accuracy: 0.98875
Weighted F1: 0.9887464218354948
Per-class F1 scores:  {'Class_0': 0.9911347517730497, 'Class_1': 0.9949159170903402, 'Class_2': 0.9882459312839059, 'Class_3': 0.9855579868708971, 'Class_4': 0.9873303167420815, 'Class_5': 0.9889049686444766, 'Class_6': 0.9876543209876543, 'Class_7': 0.9886759581881533, 'Class_8': 0.9876316994961063, 'Class_9': 0.9864549276039234}


### Random Forest

In [33]:
## below are the best hyperparameters for the random forest classifier
## Best Parameters: {'preprocessing__kw_args': {'method': 'normalize'}, 'randomforest__max_depth': None, 'randomforest__min_samples_leaf': 1, 'randomforest__min_samples_split': 2, 'randomforest__n_estimators': 200}

rfc = RandomForestClassifier(max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=200, random_state=42, n_jobs=-1)

tuned_rfc_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("randomforest", rfc)
])

start = time.time()
tuned_rfc_pipeline.fit(augmented_X, augmented_y)
end = time.time()
print(f"Training took {end - start} seconds")

## save the model
saved_location = joblib.dump(tuned_rfc_pipeline, model_dir + 'tuned_random_forest_model.pkl')

## predict on the validation set
y_val_pred = tuned_rfc_pipeline.predict(X_val)
# Compute metrics
accuracy = accuracy_score(y_val, y_val_pred)
weighted_f1 = f1_score(y_val, y_val_pred, average='weighted')
# Compute per-class F1 scores
per_class_f1_scores = f1_score(y_val, y_val_pred, average=None)
per_class_f1_dict = {f"Class_{i}": score for i, score in enumerate(per_class_f1_scores)}
## print the metrics
print(f"Accuracy: {accuracy}")
print(f"Weighted F1: {weighted_f1}")
print("Per-class F1 scores: ", per_class_f1_dict)


Training took 27.962707996368408 seconds
Accuracy: 0.9800892857142857
Weighted F1: 0.9800985350706908
Per-class F1 scores:  {'Class_0': 0.984561093956771, 'Class_1': 0.9909626719056974, 'Class_2': 0.9791477787851315, 'Class_3': 0.977292576419214, 'Class_4': 0.97955474784189, 'Class_5': 0.9864472410454985, 'Class_6': 0.9813041495668035, 'Class_7': 0.9807860262008734, 'Class_8': 0.9758101323596531, 'Class_9': 0.9632728963272896}


### KNN

In [34]:
## Best KNN Params are 
## {'knn__algorithm': 'auto', 'knn__n_neighbors': 5, 'knn__p': 2, 'knn__weights': 'distance'}

knn = KNeighborsClassifier(algorithm='auto', n_neighbors=5, p=2, weights='distance', n_jobs=-1)

tuned_knn_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("knn", knn)
])

start = time.time()
tuned_knn_pipeline.fit(augmented_X, augmented_y)
end = time.time()

print(f"Training took {end - start} seconds")

## save the model
saved_location = joblib.dump(tuned_knn_pipeline, model_dir + 'tuned_knn_model.pkl')

## predict on the validation set
y_val_pred = tuned_knn_pipeline.predict(X_val)
# Compute metrics
accuracy = accuracy_score(y_val, y_val_pred)
weighted_f1 = f1_score(y_val, y_val_pred, average='weighted')
# Compute per-class F1 scores
per_class_f1_scores = f1_score(y_val, y_val_pred, average=None)
per_class_f1_dict = {f"Class_{i}": score for i, score in enumerate(per_class_f1_scores)}
## print the metrics
print(f"Accuracy: {accuracy}")
print(f"Weighted F1: {weighted_f1}")
print("Per-class F1 scores: ", per_class_f1_dict)

Training took 1.853855848312378 seconds
Accuracy: 0.97875
Weighted F1: 0.9787129524087814
Per-class F1 scores:  {'Class_0': 0.9846153846153847, 'Class_1': 0.9841637697952877, 'Class_2': 0.9803383630544125, 'Class_3': 0.9812472743131269, 'Class_4': 0.9803921568627451, 'Class_5': 0.9806949806949807, 'Class_6': 0.9809264305177112, 'Class_7': 0.9813449023861172, 'Class_8': 0.9661971830985916, 'Class_9': 0.9656453110492108}


### Extra Trees Classifier

* We don't have best params for `Extra Trees Classifier`, so we'll need to run a hyperparameter search for it (a randomized search, to keep it quick).
* We'll see whether the search finishes quickly; otherwise we'll skip this model.

In [1]:
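# Re-import what this cell needs: it ran as In [1] on a fresh kernel, where the
# earlier imports were gone (hence the NameError in the output below).
# augmented_X, augmented_y and model_dir must be reloaded the same way.
import time
import joblib
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

param_grid = {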
    "extra_trees__n_estimators": [100, 200, 300],
    "extra_trees__max_depth": [None, 10, 20, 30],
    "extra_trees__min_samples_split": [2, 5, 10],
    "extra_trees__min_samples_leaf": [1, 2, 4],
    "extra_trees__bootstrap": [True, False]
}

extra_trees = ExtraTreesClassifier(n_jobs=-1, random_state=42)

# Create a pipeline
extra_trees_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("extra_trees", extra_trees)
])

extra_trees_grid_search = RandomizedSearchCV(extra_trees_pipeline, param_distributions=param_grid, cv=3, verbose=2, n_jobs=-1, n_iter=5, random_state=42)


start = time.time()
extra_trees_grid_search.fit(augmented_X, augmented_y)
end = time.time()
print(f"Training took {end - start} seconds")

## print the best parameters
print("Best Parameters: ", extra_trees_grid_search.best_params_)
## print the best score
print("Best Score: ", extra_trees_grid_search.best_score_)

## save the model
saved_location = joblib.dump(extra_trees_grid_search, model_dir + 'tuned_extra_trees_model.pkl')

NameError: name 'ExtraTreesClassifier' is not defined
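
For a sense of what the randomized search above would cost once it runs, the sketch below counts the candidate combinations in the same parameter grid and the number of model fits implied by `n_iter=5` and `cv=3`. It is plain Python arithmetic. The extra final fit assumes the default `refit=True` of `RandomizedSearchCV`.

```python
from math import prod

# Same search space as in the cell above.
param_grid = {
    "extra_trees__n_estimators": [100, 200, 300],
    "extra_trees__max_depth": [None, 10, 20, 30],
    "extra_trees__min_samples_split": [2, 5, 10],
    "extra_trees__min_samples_leaf": [1, 2, 4],
    "extra_trees__bootstrap": [True, False],
}

n_combinations = prod(len(values) for values in param_grid.values())
n_iter, cv = 5, 3

# Each sampled candidate is fitted once per fold, plus one final refit
# of the best candidate on the full training data (assuming refit=True).
n_cv_fits = n_iter * cv
n_total_fits = n_cv_fits + 1

print(f"{n_combinations} combinations in the grid, {n_iter} sampled")
print(f"{n_cv_fits} cross-validation fits + 1 refit = {n_total_fits} fits")
```

So the search tries only 5 of 216 possible settings, which keeps it fast but means the reported best params are a rough estimate.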