In [1]:
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import time

7. Train and fine-tune a decision tree for the moons dataset by following these steps


a. Use make_moons(n_samples=10000, noise=0.4) to generate a moons dataset.

In [2]:
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

b. Use train_test_split() to split the dataset into a training set and a test set.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

c. Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyperparameter values for a DecisionTreeClassifier.

In [4]:
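# Candidate values: 6 depths x 98 leaf limits x 3 split thresholds
params = {
    "max_depth": list(range(1, 7)),
    "max_leaf_nodes": list(range(2, 100)),
    "min_samples_split": [2, 3, 4],
}
# 3-fold cross-validation; refit=True (the default) retrains the best
# candidate on the whole training set once the search finishes
grid_search_cv = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=params,
    cv=3,
    # verbose=2  # uncomment to log progress during the search
)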
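
To get a feel for why this search takes a while, the size of the grid can be counted directly. The sketch below rebuilds the same three value lists and enumerates them with itertools.product; the variable names are illustrative only, and the fit count assumes cv=3 as in the cell above.

```python
from itertools import product

# Same candidate values as the param grid in the cell above
max_depth = list(range(1, 7))
max_leaf_nodes = list(range(2, 100))
min_samples_split = [2, 3, 4]

# Every combination of the three lists is one candidate model
n_candidates = sum(
    1 for _ in product(max_depth, max_leaf_nodes, min_samples_split)
)

# Each candidate is trained once per cross-validation fold
n_folds = 3
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 1764 5292
```

That is 1,764 candidates and 5,292 cross-validation fits, plus one final refit of the best candidate on the full training set.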

d. Train it on the full training set using these hyperparameters, and measure your model's performance on the test set. You should get roughly 85% to 87% accuracy.

In [5]:
start = time.perf_counter()
grid_search_cv.fit(X_train, y_train)
print("Time elapsed: {:.4f} seconds".format(time.perf_counter()-start))

Time elapsed: 33.7010 seconds


In [6]:
grid_search_cv.best_estimator_
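
Besides best_estimator_, a fitted GridSearchCV also exposes the winning hyperparameters and their mean cross-validated score. The following is a minimal sketch on a deliberately smaller dataset and grid so that it runs in a moment; X_small, y_small, and small_search are hypothetical names, and the reduced grid is an assumption for illustration, not the grid used above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset and grid than in the exercise, purely to keep this fast
X_small, y_small = make_moons(n_samples=1000, noise=0.4, random_state=42)

small_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 6], "max_leaf_nodes": [4, 8, 16]},
    cv=3,
)
small_search.fit(X_small, y_small)

# Dict of the winning hyperparameter values
print(small_search.best_params_)
# Mean cross-validated accuracy of that winning combination
print(round(small_search.best_score_, 4))
```

The same two attributes are available on grid_search_cv once it has been fitted.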

In [7]:
y_pred = grid_search_cv.predict(X_test)
accuracy_score(y_test, y_pred)

0.8595

8. Grow a forest by following these steps

a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn's ShuffleSplit class for this.

In [8]:
from sklearn.model_selection import ShuffleSplit
from sklearn.base import clone
from scipy.stats import mode

In [9]:
n_trees = 1000
n_instances = 100

mini_sets = []

rs = ShuffleSplit(n_splits=n_trees, test_size=len(X_train)-n_instances, random_state=42)

for mini_train_index, mini_test_index in rs.split(X_train):
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    mini_sets.append((X_mini_train, y_mini_train))
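
As a possible alternative to computing test_size by hand, ShuffleSplit also accepts train_size directly and treats the rest of the data as the test portion. The sketch below checks this on a stand-in array; X_demo, splitter, and the choice of 5 splits are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Stand-in for X_train: 8,000 rows with 2 features
X_demo = np.arange(8000 * 2).reshape(8000, 2)

# train_size=100 requests 100-row subsets; the remainder becomes the test part
splitter = ShuffleSplit(n_splits=5, train_size=100, random_state=42)
subset_indices = [train_idx for train_idx, _ in splitter.split(X_demo)]

# Each subset has exactly 100 rows, with no row repeated inside a subset
subset_sizes = [len(idx) for idx in subset_indices]
all_unique = all(len(np.unique(idx)) == len(idx) for idx in subset_indices)

print(subset_sizes, all_unique)
```

Rows are drawn without replacement inside each subset, although the same row can appear in several different subsets.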

b. Train one decision tree on each subset, using the best hyperparameter values
found in the previous exercise. Evaluate these 1,000 decision trees on the test
set. Since they were trained on smaller sets, these decision trees will likely
perform worse than the first decision tree, achieving only about 80% accuracy.

In [10]:
forest = [clone(grid_search_cv.best_estimator_) for _ in range(n_trees)]
accuracy_scores = list()

for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
    tree.fit(X_mini_train, y_mini_train)
    # DecisionTreeClassifier().score(X, y) internally calls accuracy_score()
    accuracy_scores.append(tree.score(X_test, y_test))

np.mean(accuracy_scores)

0.8056605

c. Now comes the magic. For each test set instance, generate the predictions of
the 1,000 decision trees, and keep only the most frequent prediction (you can
use SciPy's mode() function for this). This approach gives you majority-vote
predictions over the test set.

In [11]:
Y_pred = np.empty([n_trees, len(X_test)], dtype=np.uint8)

for tree_index, tree in enumerate(forest):
    Y_pred[tree_index] = tree.predict(X_test)

In [12]:
y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)
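
Because the moons labels are binary (0 or 1), the majority vote can also be sketched with plain NumPy: the mean of each column is the share of trees voting for class 1. The tiny Y_demo matrix below is made-up data for illustration (3 trees, 3 test instances), not output from the forest above.

```python
import numpy as np

# Rows are trees, columns are test instances (toy values)
Y_demo = np.array(
    [
        [0, 1, 1],
        [1, 1, 0],
        [0, 1, 0],
    ],
    dtype=np.uint8,
)

# Share of trees that voted for class 1, per test instance
vote_share = Y_demo.mean(axis=0)

# Class 1 wins only with a strict majority; ties fall back to class 0
majority = (vote_share > 0.5).astype(np.uint8)

print(majority)  # [0 1 0]
```

This shortcut only works for 0/1 labels; mode() handles any number of classes.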

d. Evaluate these predictions on the test set: you should obtain a slightly higher
accuracy than your first model (about 0.5 to 1.5% higher).

In [13]:
accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))

0.873
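
One way to see why voting helps is a simplified binomial calculation: if every voter were correct with the same probability and made errors independently, the chance that a majority is correct would grow quickly with the number of voters. The function below is a sketch under that independence assumption; majority_correct_probability is a hypothetical helper, and 11 voters at 80% accuracy are illustrative numbers.

```python
from math import comb


def majority_correct_probability(n_voters, p_correct):
    """Probability that more than half of n independent voters are correct."""
    smallest_majority = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(smallest_majority, n_voters + 1)
    )


p_single = 0.8
p_ensemble = majority_correct_probability(11, p_single)

# Under the independence assumption the ensemble beats any single voter
print(round(p_ensemble, 4))
```

The trees in this exercise are far from independent, since they share hyperparameters and draw from the same training set, which is consistent with the much smaller gain observed above (0.8595 to 0.873).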