# Exercise

# #1

 Load the MNIST data and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, a Logistic Regression classifier, an SVM, and an MLPClassifier (we haven't covered it yet, but it's a simple multi-layer neural network). 
 
Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?
 
Last, attempt to use XGBoost. Does it improve the accuracy?

# Solution

# #1

In [1]:
from sklearn.datasets import fetch_openml
import numpy as np

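# Fetch MNIST (70,000 images of 28x28 pixels, flattened to 784 features)
mnist = fetch_openml('mnist_784', version=1)

# The labels come back as strings, so cast them to integers
# (np.int was removed in NumPy 1.24; use a concrete dtype instead)
mnist.target = mnist.target.astype(np.uint8)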

In [2]:
from sklearn.model_selection import train_test_split

#Load the MNIST data and split it into a training set, a validation set, and 
#a test set (e.g., use 50,000 instances for training, 10,000 for 
#validation, and 10,000 for testing)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)

In [3]:
#define the classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", probability=True, random_state=42)
mlp_clf = MLPClassifier(random_state=42)

In [4]:
models = [rnd_clf, log_clf, svm_clf, mlp_clf]
for model in models:
    print("Training the", model.__class__.__name__)
    model.fit(X_train, y_train)

Training the RandomForestClassifier
Training the LogisticRegression


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Training the SVC
Training the MLPClassifier


In [5]:
[model.score(X_val, y_val) for model in models]

[0.9692, 0.9186, 0.9788, 0.9629]

Logistic Regression scores clearly lower than the other three classifiers (0.9186 on the validation set). It is not obvious yet whether it belongs in the ensemble, but we will come back to that later. Let's first define a Voting Classifier.
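The convergence warning printed during training suggests either raising `max_iter` or scaling the data. Below is a minimal sketch of both, assuming a `StandardScaler` plus `LogisticRegression` pipeline and using scikit-learn's small bundled digits dataset as a stand-in so that it runs quickly. The names `scaled_log_clf`, `X_tr`, `X_va` and the value `max_iter=1000` are illustrative choices, not part of the original solution, and whether this would lift the MNIST validation score above is not something this notebook measures.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small 8x8 digits dataset bundled with scikit-learn (stand-in for MNIST)
X, y = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Scale the features first, then fit Logistic Regression with a larger
# iteration budget than the default (max_iter=1000 is an assumed value)
scaled_log_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_log_clf.fit(X_tr, y_tr)

val_score = scaled_log_clf.score(X_va, y_va)
print("Validation accuracy:", val_score)
```

Wrapping the scaler in a pipeline means the scaling statistics are learned from the training data only and then reused at prediction time.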

In [6]:
from sklearn.ensemble import VotingClassifier

named_estimators = [
    ("rnd_clf", rnd_clf),
    ("log_clf", log_clf),
    ("svm_clf", svm_clf),
    ("mlp_clf", mlp_clf),
]

voting_clf = VotingClassifier(named_estimators)
voting_clf.fit(X_train, y_train)
voting_clf.score(X_val, y_val)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.9737

In [7]:
# let's print the validation score of each individual estimator
[estimator.score(X_val, y_val) for estimator in voting_clf.estimators_]

[0.9692, 0.9186, 0.9788, 0.9629]

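Logistic Regression is clearly dragging the ensemble down: the voting classifier scores 0.9737, below the SVM on its own (0.9788). We can drop it either by deleting it from the fitted estimators with del or by fitting the ensemble again without it.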

In [8]:
del voting_clf.estimators_[1]  # second estimator (Logistic Regression)

# print the estimators to make sure it's deleted
voting_clf.estimators_

[RandomForestClassifier(random_state=42),
 SVC(probability=True, random_state=42),
 MLPClassifier(random_state=42)]

In [9]:
#check the score again
voting_clf.score(X_val, y_val)  #yay increase a little!

0.9784

By default, the voting classifier uses hard voting. Let's try soft voting. We could fit again, but it is simpler to set the voting attribute to "soft". This works because hard and soft voting differ only in how the predictions of the already-fitted models are combined.
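To make the difference concrete, here is a small pure-Python sketch of the two combination rules applied to one sample. The probabilities are made up for illustration, and `hard_vote` and `soft_vote` are hypothetical helper names, not scikit-learn functions. Soft voting needs class probabilities from every estimator, which is why the SVC above was created with `probability=True`.

```python
from collections import Counter

# Hypothetical class probabilities from three already-fitted classifiers
# for a single sample with classes 0, 1, 2 (made-up numbers)
probas = [
    [0.90, 0.10, 0.00],  # classifier A: very confident in class 0
    [0.40, 0.60, 0.00],  # classifier B: leans towards class 1
    [0.45, 0.55, 0.00],  # classifier C: leans towards class 1
]

def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def hard_vote(probas):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [argmax(p) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the class probabilities, then pick the most likely class."""
    n_classifiers = len(probas)
    n_classes = len(probas[0])
    mean = [sum(p[k] for p in probas) / n_classifiers
            for k in range(n_classes)]
    return argmax(mean)

hard_prediction = hard_vote(probas)
soft_prediction = soft_vote(probas)
print("hard:", hard_prediction, "soft:", soft_prediction)
```

In this toy case the two rules disagree: two of three classifiers vote for class 1, but the one confident classifier pulls the averaged probabilities towards class 0.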

In [10]:
voting_clf.voting = "soft"
voting_clf.score(X_val, y_val)

0.977

Hmm... it seems hard voting wins (0.9784 vs. 0.977).

Finally, let's evaluate on the test set, which we should never touch until the final evaluation.

In [11]:
voting_clf.voting = "hard"
print("Voting classifier score:", voting_clf.score(X_test, y_test))
print("Each classifier score: ")
[estimator.score(X_test, y_test) for estimator in voting_clf.estimators_]

Voting classifier score: 0.9747
Each classifier score: 

[0.9645, 0.976, 0.9603]

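Hmm... on the test set the voting classifier (0.9747) beats the Random Forest (0.9645) and the MLP (0.9603), but it does not beat the best individual estimator, the SVM (0.976). The ensemble brings little, if any, benefit here.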
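The exercise asks how much better the ensemble performs than the individual classifiers. A short sketch of that comparison, using only the test-set accuracies printed above (the variable names are illustrative; the estimator order follows the list left after Logistic Regression was removed):

```python
# Test-set accuracies reported in the output above
voting_score = 0.9747
individual_scores = {
    "RandomForestClassifier": 0.9645,
    "SVC": 0.976,
    "MLPClassifier": 0.9603,
}

# Difference between the ensemble and each estimator, in percentage points
for name, score in individual_scores.items():
    diff_points = (voting_score - score) * 100
    print(f"{name}: {diff_points:+.2f} points")

# Gap between the ensemble and the best individual estimator
best_name = max(individual_scores, key=individual_scores.get)
gap_to_best = round((voting_score - individual_scores[best_name]) * 100, 2)
print("Best individual:", best_name, "| gap to ensemble:", gap_to_best)
```

A negative gap means the ensemble is below the best single model.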

Let's try XGBoost.
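The next cell relies on early stopping with a patience of 2 rounds. As a simplified sketch of that rule (not XGBoost's actual implementation), the hypothetical function below scans a list of validation errors and reports where training would stop. The error values are made up and are not taken from the log.

```python
def early_stopping_round(errors, patience):
    """Return (best_round, stop_round) for a sequence of validation errors.

    Training stops at the first round where the error has not improved
    for `patience` consecutive rounds. If that never happens, stop_round
    is the last round.
    """
    best_error = float("inf")
    best_round = -1
    for rnd, err in enumerate(errors):
        if err < best_error:
            # New best validation error: remember it and keep training
            best_error, best_round = err, rnd
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, rnd
    return best_round, len(errors) - 1

# Made-up validation errors for illustration only
errors = [0.15, 0.11, 0.09, 0.10, 0.08, 0.085, 0.09, 0.07]
best_round, stop_round = early_stopping_round(errors, patience=2)
print("best round:", best_round, "stopped at round:", stop_round)
```

With a small patience the loop can stop just before a later improvement (the last value in the list is never reached), so a larger patience may be worth trying.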

In [12]:
import xgboost

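xgb_clf = xgboost.XGBClassifier()

# Stop early once the validation error has not improved for 2 rounds.
# Note: XGBoost 2.0+ expects early_stopping_rounds in the XGBClassifier
# constructor rather than in fit().
xgb_clf.fit(X_train, y_train,
            eval_set=[(X_val, y_val)], early_stopping_rounds=2)
print("Score: ", xgb_clf.score(X_test, y_test))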


[0]	validation_0-merror:0.15470
Will train until validation_0-merror hasn't improved in 2 rounds.
[1]	validation_0-merror:0.11460
[2]	validation_0-merror:0.09720
[3]	validation_0-merror:0.08680
[4]	validation_0-merror:0.07850
[5]	validation_0-merror:0.07580
[6]	validation_0-merror:0.07070
[7]	validation_0-merror:0.06700
[8]	validation_0-merror:0.06270
[9]	validation_0-merror:0.05950
[10]	validation_0-merror:0.05900
[11]	validation_0-merror:0.05450
[12]	validation_0-merror:0.05210
[13]	validation_0-merror:0.04930
[14]	validation_0-merror:0.04810
[15]	validation_0-merror:0.04720
[16]	validation_0-merror:0.04530
[17]	validation_0-merror:0.04360
[18]	validation_0-merror:0.04310
[19]	validation_0-merror:0.04100
[20]	validation_0-merror:0.03910
[21]	validation_0-merror:0.03810
[22]	validation_0-merror:0.03720
[23]	validation_0-merror:0.03650
[24]	validation_0-merror:0.03530
[25]	validation_0-merror:0.03400
[26]	validation_0-merror:0.03430
[27]	validation_0-merror:0.03310
[28]	validation_0-me