# Chapter 7 Exercises

1. _If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?_<br>
<br>
Yes, they may have different precision per instance, so comparing their per instance predictions, such as taking the mode or average, could be more precise.
<br>
<br>
1. _What is the difference between hard and soft voting classifiers?_<br>
<br>
Hard classifiers pick one class, soft gives a confidence score per class<br><br>
1. _Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, Random Forests, or stacking ensembles?_<br>
<br>
bagging, pasting (and random forests, which are trained with those techniques): yes<br>
boosting, no, it needs sequential training<br>
stacking ensembles: yes for the cross-validation predictions on each predictor, but no after that, since those predictions are inputs to the blender for training
<br><br>
1. _What is the benefit of out-of-bag evaluation?_<br>
<br>
you can use them similar to a validation set, without needing to hold one out, by gathering predictions from estimators for which an instance is oob
<br><br>
1. _What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?_<br>
<br>
the split is at a random feature threshold, instead of searching for the best<br>
it trades more bias for less variance, so it may help avoid overfitting<br>
faster<br>
<br><br>
1. _If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how?_<br>
<br>
increase the number of estimators or reduce regularization of the base estimator.
<br><br>
1. _If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?_<br>
<br>
decrease
<br><br>
1. _Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?_

In [2]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import StratifiedShuffleSplit

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist["data"], mnist["target"]

In [3]:
sss = StratifiedShuffleSplit(n_splits=1, random_state=0, test_size=10000)
for train_index, test_index in sss.split(X, y):
    X_train_val, X_test = X[train_index], X[test_index]
    y_train_val, y_test = y[train_index], y[test_index]

for train_index, test_index in sss.split(X_train_val, y_train_val):
    X_train, X_val = X[train_index], X[test_index]
    y_train, y_val = y[train_index], y[test_index]
    
print(y.shape)
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(70000,)
(50000,)
(10000,)
(10000,)


In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score

#svc = make_pipeline(StandardScaler(), LinearSVC(random_state=0))
sgd = SGDClassifier(random_state=0)
dtree = DecisionTreeClassifier(random_state=0)
xtree = ExtraTreesClassifier(random_state=0)

clfs = [dtree, xtree, sgd]
for clf in clfs:
    clf.fit(X_train, y_train)
    y_val_pred = clf.predict(X_val)
    print(f1_score(y_val_pred, y_val, average="macro"))

0.8723170024854632
0.970646993269115
0.8748138917263514


In [5]:
import numpy as np
from scipy.stats import mode

def ensemble_predict(clfs, X, len_y):
    y_preds = np.zeros((len_y, len(clfs)))

    for i, clf in enumerate(clfs):
        y_p = clf.predict(X)
        for j, p in enumerate(y_p):
            y_preds[j, i] = p

    prediction = np.zeros(len_y)
    for i, z in enumerate(prediction):
        modes, counts = mode(y_preds[i])
        prediction[i] = modes[0]

    return prediction
    
y_val_ensem_predict = ensemble_predict(clfs, X_val, len(y_val))

In [6]:
y_val_int = list(map(int, y_val))
f1_score(y_val_ensem_predict, y_val_int, average="macro")

0.9452319972731414

In [7]:
np.mean([0.8723170024854632, 0.970646993269115, 0.8748138917263514])

0.9059259624936432

In [8]:
y_test_ensem_predict = ensemble_predict(clfs, X_test, len(y_test))
y_test_int = list(map(int, y_test))
f1_score(y_test_ensem_predict, y_test_int, average="macro")

0.9853954065635522

9. _Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier? Now try again using a StackingClassifier instead: do you get better performance? If so, why?_

In [9]:
def concat_predict(clfs, X, len_y):
    y_preds = np.zeros((len_y, len(clfs)))

    for i, clf in enumerate(clfs):
        y_p = clf.predict(X)
        for j, p in enumerate(y_p):
            y_preds[j, i] = p

    return y_preds

y_val_concat_p = concat_predict(clfs, X_val, len(y_val))

In [11]:
y_val_concat_p.shape

(10000, 3)

In [25]:
from sklearn.linear_model import LogisticRegression
#svc_stack = make_pipeline(StandardScaler(), LinearSVC(random_state=0))
#svc_stack.fit(y_val_concat_p, y_val)
stack = LogisticRegression(random_state=0, max_iter=1000)
#stack = ExtraTreesClassifier(random_state=0)

stack.fit(y_val_concat_p, y_val)
y_test_concat_p = concat_predict(clfs, X_test, len(y_test))
y_test_p_stack = stack.predict(y_test_concat_p)
f1_score(y_test_p_stack, y_test, average="macro")

0.973076766749875

(slightly worse than the voting classifier from Ex. 8)

In [24]:
from sklearn.ensemble import StackingClassifier
sgd = SGDClassifier(random_state=0)
dtree = DecisionTreeClassifier(random_state=0)
xtree = ExtraTreesClassifier(random_state=0)
estimators = [
     ('dt', dtree),
     ('xt', xtree),
     ('sgd', sgd)
]

clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(random_state=0, max_iter=1000))
#clf.fit(X_val, y_val)
clf.fit(X_train, y_train)
y_test_p_stack2 = clf.predict(X_test)
f1_score(y_test_p_stack2, y_test, average="macro")

0.8873102473559408

no, even worse (however, in general it is a good idea to use, as you can fold the validation set into training since it uses CV, and it uses more accurate stacking methods, if available for the estimators)