## Chapter 6 - Decision Trees

### Question 7 - training and fine-tuning a decision tree on the moons dataset

a - using make_moons to make the dataset

In [1]:
import numpy as np
import pandas as pd

In [2]:
from sklearn.datasets import make_moons

In [9]:
X , y = make_moons(n_samples=10000 , noise=0.4 , random_state=42)

b - using train_test_split to split the dataset into training set and test set

In [10]:
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X ,y , test_size= 0.2 , random_state=42)

c - using grid search with cross-validation to find the best hyperparameter values for a decision tree classifier

In [11]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = [{
    'max_depth' : [5,10,15,20] , 'max_leaf_nodes' : [10,20,30,40] 
}]

tree_clf = DecisionTreeClassifier()

In [12]:
grid_search = GridSearchCV(tree_clf , param_grid , cv = 5 , scoring='accuracy' , return_train_score=True)

grid_search.fit(X_train , y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid=[{'max_depth': [5, 10, 15, 20],
                          'max_leaf_nodes': [10, 20, 30, 40]}],
             return_train_score=True, scoring='accuracy')

In [13]:
print(grid_search.best_score_ , grid_search.best_params_)

0.858625 {'max_depth': 10, 'max_leaf_nodes': 20}
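
`GridSearchCV` uses `refit=True` by default, so once the search finishes, the winning configuration has already been retrained on all the data passed to `fit` and is exposed as `best_estimator_`. The sketch below illustrates this on a smaller dataset and grid than the ones used above, chosen only to keep it fast. The names `X_demo`, `demo_search`, and `best_tree` are hypothetical, and `random_state=42` on the tree is an added assumption for reproducibility.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset than in the notebook, only to keep this sketch fast.
X_demo, y_demo = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),  # random_state is an added assumption
    {"max_depth": [5, 10], "max_leaf_nodes": [10, 20]},
    cv=3,
    scoring="accuracy",
)
demo_search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already trained on all of X_tr.
best_tree = demo_search.best_estimator_
test_accuracy = best_tree.score(X_te, y_te)
print(demo_search.best_params_, test_accuracy)
```

Using `best_estimator_` directly avoids retyping the hyperparameters by hand, which is where a copy mistake could otherwise slip in.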


d - training the best classifier on the full training set and measuring its performance on the test set

In [14]:
tree_clf_best = DecisionTreeClassifier(max_depth=10 , max_leaf_nodes= 20)

tree_clf_best.fit(X_train , y_train)

DecisionTreeClassifier(max_depth=10, max_leaf_nodes=20)

In [15]:
y_pred = tree_clf_best.predict(X_test)

In [17]:
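from sklearn.metrics import accuracy_score , confusion_matrix

# scikit-learn metrics expect (y_true, y_pred), in that order
print(accuracy_score(y_test , y_pred))

# rows are true classes, columns are predicted classes
confusion_matrix(y_test , y_pred)

0.87


array([[896, 117],
       [143, 844]], dtype=int64)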

### Question 8 - growing a forest

a - making 1000 subsets of the training set, each containing 100 randomly selected instances

In [18]:
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=1000 , train_size=100 , random_state=42)

multi_tree_dataset = []

for trainid , testid in rs.split(X_train):
    multi_tree_dataset.append((X_train[trainid] , y_train[trainid]))
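
To make the sampling behaviour concrete, here is a minimal sketch on a toy array (`X_toy`, `splitter`, and `subset_sizes` are hypothetical names, and only 3 splits are drawn instead of 1000). It checks that each `ShuffleSplit` training subset holds 100 distinct indices, that is, instances are drawn without replacement inside a subset, while different subsets are free to overlap.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy stand-in for X_train: 1000 rows, one feature.
X_toy = np.arange(1000).reshape(-1, 1)

splitter = ShuffleSplit(n_splits=3, train_size=100, random_state=42)

subset_sizes = []
for train_idx, _ in splitter.split(X_toy):
    # Record (number of indices, number of distinct indices) for each subset.
    subset_sizes.append((len(train_idx), len(np.unique(train_idx))))

print(subset_sizes)
```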

b - training one decision tree on each subset using the best hyperparameter values from the last question, and evaluating these 1000 trees on the test set

In [43]:
accuracy_scores = []
trees = []

for (X_train_multi , y_train_multi) in multi_tree_dataset:
    new_tree_clf_best = DecisionTreeClassifier(max_depth=10 , max_leaf_nodes=20)
    new_tree_clf_best.fit(X_train_multi , y_train_multi)
    trees.append(new_tree_clf_best)
    # evaluate the tree that was just trained on this subset, not the earlier single tree
    y_pred_multi = new_tree_clf_best.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test , y_pred_multi))
    
np.mean(accuracy_scores)

0.8234999999999998

c - generating the predictions of each of the 1000 trees on the test set, and keeping the most frequent prediction for each test instance using mode()

In [44]:
y_pred = []

for tree in trees:
    y_pred.append((tree.predict(X_test)))

In [45]:
from scipy.stats import mode

y_pred_final , votes = mode(y_pred , axis=0)
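
For binary labels, the majority vote that `mode` computes can also be written with plain NumPy, which may help show what is happening. The sketch below uses a small hypothetical vote matrix (`votes_demo`, 5 trees by 4 test instances) rather than the real predictions.

```python
import numpy as np

# Hypothetical predictions: 5 trees (rows) voting on 4 test instances (columns).
votes_demo = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])

# For labels in {0, 1}, the column mean is the share of trees voting for class 1.
share_of_ones = votes_demo.mean(axis=0)

# Predict class 1 only when strictly more than half of the trees vote for it.
majority = (share_of_ones > 0.5).astype(int)
print(majority)  # [0 1 1 0]
```

With an even number of trees a tie is possible. The strict `> 0.5` comparison resolves ties to class 0, which matches `scipy.stats.mode` returning the smallest of the tied values.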

d - evaluating these predictions on the test set: accuracy rises by 0.25 percentage points over the single tree from the last question (0.8725 vs. 0.87)

In [46]:
accuracy_score(y_test , y_pred_final.reshape(-1))

0.8725