# Michael Navarro: Competency 4 - Project 1

## Project Instructions

Task 1: Complete Exercise 8 from Chapter 7 in Textbook:
>* Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing).
>* Then train various classifiers, such as a RandomForest classifier, an Extra-Trees classifier, and an SVM. 
>* Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier. 
>* Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

Task 2: 
>* Repeat the previous task on a different dataset (you can pick one of the datasets used in previous projects or pick a new one).

## Notebook Imports & Globals

In [1]:
#----------------------------------------------------------------------------------------
#                                       Imports 
#----------------------------------------------------------------------------------------
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from scipy import stats

#----------------------------------- sklearn imports -----------------------------------
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import fetch_openml

#----------------------------------------------------------------------------------------
#                                   Global Constants
#----------------------------------------------------------------------------------------

## Task 1

### Load MNIST Data

In [2]:
# Load MNIST dataset from OpenML
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)

### Split Data
* 50K - Training
* 10K - Validation
* 10K - Testing

In [3]:
# Hold out the test set (10K)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000, random_state=42)

# Split the remainder into training (50K) and validation (10K) sets
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)
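
The two-stage split above can be sanity-checked without loading any data. The following is a minimal sketch in plain Python, assuming a simple shuffle-and-slice scheme; `two_stage_split` is a hypothetical helper written for illustration and is not how `train_test_split` is implemented.

```python
import random

def two_stage_split(n_samples, test_size, val_size, seed=42):
    """Return disjoint (train, val, test) index lists from a shuffled range."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # First carve off the test indices, then the validation indices
    test_idx = indices[:test_size]
    val_idx = indices[test_size:test_size + val_size]
    # Whatever is left is used for training
    train_idx = indices[test_size + val_size:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = two_stage_split(70_000, 10_000, 10_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 50000 10000 10000
```

The point of the check is that the three subsets are disjoint and add up to the full 70,000 instances.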

## Train Classifiers

### Setup Modeling Variables

In [4]:
# Setup Model Parameters
random_forest_clf   = RandomForestClassifier(n_estimators=100, random_state=42)
extra_trees_clf     = ExtraTreesClassifier(n_estimators=100, random_state=42)
svm_clf             = LinearSVC(max_iter=100, tol=20, random_state=42)
mlp_clf             = MLPClassifier(random_state=42)

In [5]:
# List of models and their respective names for iterative purposes
estimators      = [random_forest_clf, extra_trees_clf, svm_clf, mlp_clf]
estimator_names = ['Random Forest Classifier', 'Extra Trees Classifier', 'Linear SVC', 'MLP Classifier']

### Train Models

In [6]:
# Train Models
idx = 0
for estimator in estimators:
    print(f'Training: {estimator_names[idx]}')
    # Train model
    estimator.fit(X_train, y_train)
    idx += 1
# end for

Training: Random Forest Classifier
Training: Extra Trees Classifier
Training: Linear SVC
Training: MLP Classifier


In [7]:
# Display Scores of each model
idx = 0
for estimator in estimators:
    print(f'Model: {estimator_names[idx]: <25} Score: {estimator.score(X_val, y_val): .4f}')
    idx += 1
# end for

Model: Random Forest Classifier  Score:  0.9692
Model: Extra Trees Classifier    Score:  0.9715
Model: Linear SVC                Score:  0.8590
Model: MLP Classifier            Score:  0.9577


The weakest model was Linear SVC (0.8590), while the strongest was the Extra Trees Classifier (0.9715), narrowly ahead of the Random Forest Classifier (0.9692).
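
The ranking can also be read off programmatically rather than by eye. This is a small sketch; the `val_scores` dictionary is a hypothetical container whose values are copied from the validation output above.

```python
# Validation scores copied from the output of cell In [7]
val_scores = {
    'Random Forest Classifier': 0.9692,
    'Extra Trees Classifier': 0.9715,
    'Linear SVC': 0.8590,
    'MLP Classifier': 0.9577,
}

# max/min over the keys, compared by their score
best_name = max(val_scores, key=val_scores.get)
worst_name = min(val_scores, key=val_scores.get)

print(f'Best:  {best_name} ({val_scores[best_name]:.4f})')
print(f'Worst: {worst_name} ({val_scores[worst_name]:.4f})')
```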

### Ensemble Testing

#### Setup Voting Ensemble Classifier

In [8]:
named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("extra_trees_clf", extra_trees_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
]

voting_clf = VotingClassifier(named_estimators)

voting_clf.fit(X_train, y_train)
print() # Silence output




In [9]:
# Display Ensemble score 
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score: {ens_score}')

Ensemble Score: 0.9703


#### Performance Tuning

In [10]:
# "Turn off" two lowest performing models
voting_clf.set_params(svm_clf=None)
voting_clf.set_params(mlp_clf=None)

# Remove Trained Estimators
# Deleting index 2 shifts the remaining items left, so MLP is then at index 2 as well
del voting_clf.estimators_[2]   # SVM
del voting_clf.estimators_[2]   # MLP

# Show remaining estimators
print(voting_clf.estimators_)

[RandomForestClassifier(random_state=42), ExtraTreesClassifier(random_state=42)]


In [11]:
# Display Ensemble score 
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score: {ens_score}')

Ensemble Score: 0.9713


Eliminating the two lowest performing models raised the ensemble score from 0.9703 to 0.9713, but that is still slightly below the best individual model (Extra Trees, 0.9715).
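
One possible explanation for this result is that with only two voters, every disagreement under hard voting is a tie. The scikit-learn documentation states that ties are resolved by ascending class label, so in a tie neither model's "opinion" wins on merit. The sketch below illustrates that rule in plain Python; `hard_vote` is a hypothetical helper, not the library's implementation.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over predicted labels; ties go to the smallest label."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

# Two voters that disagree: always a tie, resolved by label order
print(hard_vote([3, 8]))     # 3
print(hard_vote([8, 3]))     # 3
# Two voters that agree
print(hard_vote([8, 8]))     # 8
# A third voter breaks the tie
print(hard_vote([3, 8, 8]))  # 8
```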

#### Voting: Soft Vs. Hard 

In [12]:
voting_clf.voting = "soft"

# Display Ensemble score 
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score (Soft): {ens_score}')

Ensemble Score (Soft): 0.9719


Switching the voting to "soft" yields the highest validation score so far (0.9719).
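
Soft voting averages the class probabilities predicted by each model and picks the class with the highest mean, so a confident model can outweigh a hesitant one. The following is a minimal sketch of that idea in plain Python; `soft_vote` and the probability values are hypothetical and chosen only for illustration.

```python
def soft_vote(probabilities):
    """Average per-class probabilities across models and return (label, mean)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: mean[c])
    return label, mean

# Two hypothetical models scoring one sample over three classes
model_a = [0.55, 0.45, 0.00]   # weakly prefers class 0
model_b = [0.10, 0.90, 0.00]   # strongly prefers class 1

label, mean = soft_vote([model_a, model_b])
print(label)  # 1
```

Under hard voting the same two models would simply disagree (class 0 versus class 1), whereas averaging lets the more confident prediction decide.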

#### Comparison With Individual Models (Validation Set)

In [13]:
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score: {ens_score}')

# Display Scores of each model
idx = 0
for estimator in voting_clf.estimators_:
    tmp_score = estimator.score(X_val, y_val)
    
    tmp_operator = '<' if (tmp_score < ens_score) else '>'
         
    print(f'Model: {estimator_names[idx]: <25} Score: {tmp_score: .4f} {tmp_operator} {ens_score}')
    idx += 1
# end for

Ensemble Score: 0.9719
Model: Random Forest Classifier  Score:  0.9692 < 0.9719
Model: Extra Trees Classifier    Score:  0.9715 < 0.9719


By eliminating the two lowest performing models and using soft voting, the ensemble scores higher on the validation set than either individual model.

## Task 2

To reinforce the methodology from the first task, I chose another image-based dataset in the MNIST family: Fashion-MNIST. I originally found it on Kaggle, but after further research I found that it is also available on OpenML. The dataset is well documented and maintained, and it also has 70,000 entries. Using it let me see how exactly the same methodology yields different results on a different kind of image data.

#### Load Data

In [14]:
# Load Fashion-MNIST dataset from OpenML
fashion_mnist = fetch_openml('Fashion-MNIST', version=1, as_frame=False)
fashion_mnist.target = fashion_mnist.target.astype(np.uint8)

#### Split Data

In [15]:
# Hold out the test set (10K)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    fashion_mnist.data, fashion_mnist.target, test_size=10000, random_state=42)

# Split the remainder into training (50K) and validation (10K) sets
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)

#### Train Models

In [16]:
# Train Models
idx = 0
for estimator in estimators:                    # Estimators defined in Task 1
    print(f'Training: {estimator_names[idx]}')  # Estimator names defined in Task 1
    # Train model
    estimator.fit(X_train, y_train)
    idx += 1
# end for

Training: Random Forest Classifier
Training: Extra Trees Classifier
Training: Linear SVC
Training: MLP Classifier




In [17]:
# Display Scores of each model
idx = 0
for estimator in estimators:
    print(f'Model: {estimator_names[idx]: <25} Score: {estimator.score(X_val, y_val): .4f}')
    idx += 1
# end for

Model: Random Forest Classifier  Score:  0.8845
Model: Extra Trees Classifier    Score:  0.8844
Model: Linear SVC                Score:  0.8127
Model: MLP Classifier            Score:  0.8574


#### Ensemble Model

In [18]:
voting_clf = VotingClassifier(named_estimators) # named_estimators defined in Task 1

voting_clf.fit(X_train, y_train)
print() # Silence output






In [19]:
# Display Ensemble score 
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score: {ens_score}')

Ensemble Score: 0.8839


#### Performance Tuning

In [20]:
# "Turn off" lowest performing models
voting_clf.set_params(svm_clf=None)

# Remove Trained Estimator
del voting_clf.estimators_[2]   # SVM

# Show remaining estimators
print(voting_clf.estimators_)

[RandomForestClassifier(random_state=42), ExtraTreesClassifier(random_state=42), MLPClassifier(random_state=42)]


In [21]:
# Display Ensemble score 
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score: {ens_score}')

Ensemble Score: 0.889


Eliminating the lowest performing model (SVM) yields the best result thus far.

In [22]:
voting_clf.voting = "soft"

# Display Ensemble score 
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score (Soft): {ens_score}')

Ensemble Score (Soft): 0.8864


This time the "soft" ensemble score (0.8864) was lower than the "hard" score (0.889), so we revert to "hard" voting.

In [23]:
# Set voting back to 'hard'
voting_clf.voting = "hard"

# Display Ensemble score 
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score (Hard): {ens_score}')

Ensemble Score (Hard): 0.889


In [24]:
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score: {ens_score}')

# Names of the estimators still in the ensemble (Linear SVC was removed above)
remaining_names = [name for name in estimator_names if name != 'Linear SVC']

# Display Scores of each model
for name, estimator in zip(remaining_names, voting_clf.estimators_):
    tmp_score = estimator.score(X_val, y_val)

    tmp_operator = '<' if (tmp_score < ens_score) else '>'

    print(f'Model: {name: <25} Score: {tmp_score: .4f} {tmp_operator} {ens_score}')
# end for

Ensemble Score: 0.889
Model: Random Forest Classifier  Score:  0.8845 < 0.889
Model: Extra Trees Classifier    Score:  0.8844 < 0.889
Model: MLP Classifier            Score:  0.8574 < 0.889


Using "hard" voting, our ensemble score is better than any single model score.
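
Because the ensemble no longer contains Linear SVC at this point, any positional lookup into the original four-entry `estimator_names` list would attach the wrong label to the third remaining model. The sketch below shows the pitfall and one way around it, using plain strings as hypothetical stand-ins for the fitted estimators.

```python
names = ['Random Forest Classifier', 'Extra Trees Classifier',
         'Linear SVC', 'MLP Classifier']
models = ['rf', 'et', 'svc', 'mlp']  # stand-ins for fitted estimators

# Drop the third model by position, leaving the name list untouched
remaining_models = models[:2] + models[3:]

# Positional lookup into the full name list mislabels the last entry
by_index = [(names[i], m) for i, m in enumerate(remaining_models)]

# Filtering names and models together keeps each pair aligned
pairs = [(n, m) for n, m in zip(names, models) if n != 'Linear SVC']

print(by_index[2])  # ('Linear SVC', 'mlp')
print(pairs[2])     # ('MLP Classifier', 'mlp')
```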

#### What If: Remove MLP

In [25]:
# "Turn off" lowest performing models
voting_clf.set_params(mlp_clf=None)

# Remove Trained Estimator
del voting_clf.estimators_[2]   # MLP

# Show remaining estimators
print(voting_clf.estimators_)

[RandomForestClassifier(random_state=42), ExtraTreesClassifier(random_state=42)]


In [26]:
# Display Ensemble score 
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score: {ens_score}')

Ensemble Score: 0.8831


Removing MLP reduced the hard-voting ensemble score from 0.889 to 0.8831, which is below both remaining individual models.
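
A common intuition for why a third, different model can help hard voting is the idealized case of three voters whose errors are independent: the majority is correct whenever at least two of them are. The sketch below computes that probability. It is only an illustration under a strong independence assumption that classifiers trained on the same data do not satisfy, which is consistent with the much smaller gains observed in this notebook; `majority_accuracy_3` is a hypothetical helper.

```python
def majority_accuracy_3(p):
    """Probability that at least 2 of 3 independent voters are correct,
    when each voter is correct with probability p."""
    all_three = p ** 3
    exactly_two = 3 * p ** 2 * (1 - p)
    return all_three + exactly_two

# Each voter correct 88% of the time (an illustrative value)
print(round(majority_accuracy_3(0.88), 4))  # 0.9603
```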

In [27]:
voting_clf.voting = "soft"

# Display Ensemble score 
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score (Soft): {ens_score}')

Ensemble Score (Soft): 0.8856


Interestingly, with MLP removed, the "soft" voting score (0.8856) is higher than the "hard" voting score (0.8831).

In [28]:
ens_score = voting_clf.score(X_val, y_val)
print(f'Ensemble Score: {ens_score}')

# tmp_names = estimator_names.remove('Linear SVC')

# Display Scores of each model
idx = 0
for estimator in voting_clf.estimators_:
    tmp_score = estimator.score(X_val, y_val)
    
    tmp_operator = '<' if (tmp_score < ens_score) else '>'
         
    print(f'Model: {estimator_names[idx]: <25} Score: {tmp_score: .4f} {tmp_operator} {ens_score}')
    idx += 1
# end for

Ensemble Score: 0.8856
Model: Random Forest Classifier  Score:  0.8845 < 0.8856
Model: Extra Trees Classifier    Score:  0.8844 < 0.8856


While the ensemble score is still higher than either individual model, the margin shrank after removing MLP (0.8856 versus 0.889 with MLP included).

## Closing Thoughts

I found it very interesting that Linear SVC performed the worst on both datasets. However, the most interesting result came from removing MLP in the second task, which turned the tables: the hard-voting ensemble score was worse than when MLP was included, and the "soft" voting score was higher than the "hard" voting score. This is the exact opposite of what happened when MLP was included in the ensemble.