Exercise from the book Hands-On Machine Learning.

Chapter 6: Decision Trees
https://github.com/cacaprog/handson-ml3/blob/main/06_decision_trees.ipynb

Exercise 7 - Train and fine-tune a Decision Tree for the moons dataset

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

In [1]:
# a. Generate a moons dataset
from sklearn.datasets import make_moons

X_moons, y_moons = make_moons(n_samples=10000, noise=0.4, random_state=42)

In [21]:
X_moons

array([[ 0.9402914 ,  0.12230559],
       [ 0.12454026, -0.42477546],
       [ 0.26198823,  0.50841438],
       ...,
       [-0.24177973,  0.20957199],
       [ 0.90679645,  0.54958215],
       [ 2.08837082, -0.05050728]], shape=(10000, 2))

In [22]:
y_moons

array([1, 0, 0, ..., 1, 0, 1], shape=(10000,))

In [3]:
# b. Split it into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X_moons, y_moons,
                                                    test_size=0.2,
                                                    random_state=42)

In [4]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(8000, 2)
(2000, 2)
(8000,)
(2000,)


In [7]:
# c. Use grid search with cross-validation (with the help of the GridSearchCV class) 
# to find good hyperparameter values for a DecisionTreeClassifier. 
# Hint: try various values for max_leaf_nodes

params = {
    'max_leaf_nodes': list(range(2, 100)),
    'max_depth': list(range(1, 7)),
    'min_samples_split': [2, 3, 4]
}

model = DecisionTreeClassifier(random_state=42)
grid_search_cv = GridSearchCV(model, params, cv=3)
grid_search_cv.fit(X_train, y_train)

In [8]:
grid_search_cv.best_estimator_
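
The output of the cell above did not survive in this export. As a minimal sketch of how the search result can be inspected, the snippet below runs a much smaller grid on a smaller moons dataset. The names X_demo, y_demo, demo_params and demo_search, as well as the reduced sizes and grid values, are assumptions chosen only so that it runs quickly; they are not the values used in the exercise.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset and grid (assumed values, not the exercise's)
X_demo, y_demo = make_moons(n_samples=500, noise=0.4, random_state=42)
demo_params = {"max_leaf_nodes": [2, 4, 8, 16]}

demo_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           demo_params, cv=3)
demo_search.fit(X_demo, y_demo)

# best_params_ is a dict with the winning combination,
# best_score_ is its mean cross-validated accuracy
print(demo_search.best_params_)
print(demo_search.best_score_)
```

On the real grid_search_cv object, the same two attributes show which hyperparameter values were selected and how well they scored during cross-validation.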

In [9]:
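# d. Train it on the full training set using these hyperparameters, and measure your model's 
# performance on the test set. You should get roughly 85% to 87% accuracy.
# By default (refit=True), GridSearchCV retrains the best model on the whole training set,
# so grid_search_cv.predict() already uses that final model.
y_pred = grid_search_cv.predict(X_test)
accuracy_score(y_test, y_pred)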

0.8595

Exercise 8 - Grow a forest

In [13]:
# a. Continuing the previous exercise, generate 1,000 subsets of the training set, 
# each containing 100 instances selected randomly. 
# Hint: you can use Scikit-Learn's ShuffleSplit class for this.

from sklearn.model_selection import ShuffleSplit
from sklearn.base import clone
from scipy.stats import mode
import numpy as np

In [11]:
n_trees = 1000
n_instances = 100

mini_sets = []

rs = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances,
                  random_state=42)

for mini_train_idx, mini_test_idx in rs.split(X_train):
    X_mini_train = X_train[mini_train_idx]
    y_mini_train = y_train[mini_train_idx]
    mini_sets.append((X_mini_train, y_mini_train))
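
The trick in the cell above is that ShuffleSplit is asked for a "test" portion of len(X_train) - n_instances, so the remaining "train" portion has exactly n_instances rows. A small sketch on a toy array makes this easier to check; X_toy, n_subsets, subset_size and the other names here are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: 20 instances with 2 features each (illustrative only)
X_toy = np.arange(40).reshape(20, 2)
n_subsets, subset_size = 5, 4

# Everything except subset_size rows goes to the "test" side,
# so each "train" side holds exactly subset_size rows
splitter = ShuffleSplit(n_splits=n_subsets,
                        test_size=len(X_toy) - subset_size,
                        random_state=42)

index_sets = [train_idx for train_idx, _ in splitter.split(X_toy)]
subsets = [X_toy[idx] for idx in index_sets]

print(len(subsets), subsets[0].shape)
```

Within one subset the rows are drawn without replacement, but different subsets may overlap, since each split is an independent shuffle.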

In [14]:
# b. Train one Decision Tree on each subset, using the best hyperparameter values found above.
# Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller 
# sets, these Decision Trees will likely perform worse than the first Decision Tree, 
# achieving only about 80% accuracy.

forest = [clone(grid_search_cv.best_estimator_) for _ in range(n_trees)]

accuracy_scores = []

for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
    tree.fit(X_mini_train, y_mini_train)

    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

np.mean(accuracy_scores)

np.float64(0.8056605)

In [15]:
# c. Now comes the magic. For each test set instance, generate the predictions of the 
# 1,000 Decision Trees, and keep only the most frequent prediction. 
# This gives you majority-vote predictions over the test set.
Y_pred = np.empty([n_trees, len(X_test)], dtype=np.uint8)

for tree_idx, tree in enumerate(forest):
    Y_pred[tree_idx] = tree.predict(X_test)

y_pred_majority_votes, n_votes = mode(Y_pred, axis=0) 
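
To see what mode(..., axis=0) does here, consider a tiny vote matrix with one row per tree and one column per test instance. This is only a sketch with made-up votes; the names votes, majority and counts are hypothetical. The reshape(-1) calls are there because the shape of the result differs between SciPy versions (older releases keep a leading dimension of size 1).

```python
import numpy as np
from scipy.stats import mode

# 3 trees (rows) voting on 3 test instances (columns); made-up values
votes = np.array([[0, 1, 1],
                  [1, 1, 0],
                  [0, 1, 0]], dtype=np.uint8)

result = mode(votes, axis=0)

# Flatten so the code behaves the same across SciPy versions
majority = np.asarray(result.mode).reshape(-1)
counts = np.asarray(result.count).reshape(-1)

print(majority)  # most frequent class per instance: [0 1 0]
print(counts)    # number of trees that voted for it: [2 3 2]
```

Each column is reduced to its most frequent value, which is the majority-vote prediction for that instance.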

In [16]:
# d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy
# than your first model (about 0.5 to 1.5% higher). 
accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))

0.873

Congratulations, you have trained a Random Forest classifier!