# Setup

In [1]:
# handle math and data
import numpy as np
import pandas as pd

# to plot nice figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# handle files
import os
import joblib

# Get Moon Data

We will first generate 10,000 instances of the moons dataset using sklearn's `make_moons` function.

In [157]:
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X.shape, y.shape

((10000, 2), (10000,))

In [158]:
X[:3], y[:3]

(array([[ 0.9402914 ,  0.12230559],
        [ 0.12454026, -0.42477546],
        [ 0.26198823,  0.50841438]]),
 array([1, 0, 0]))

In [8]:
np.unique(y)

array([0, 1])

OK, we can see that there are two features and that the labels are 0s and 1s.
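As a quick sanity check on the labels, the sketch below counts how many instances fall into each class. It regenerates the data with the same arguments as above so that it runs on its own; `class_counts` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons

# Regenerate the same dataset as above so this snippet is self-contained.
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

# np.bincount counts the occurrences of each non-negative integer label.
class_counts = np.bincount(y)
print(class_counts)
```

`make_moons` splits `n_samples` evenly between the two moons, so both classes should contain 5,000 instances.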

## Split Train and Test Sets

In [159]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, y_train.shape

((8000, 2), (8000,))

Nice, now let's set aside the test set and work with the 8,000 instances in the training set.

# Fine-Tune D-Tree Model

We don't need to do model selection since we are only going to use a decision tree. We can jump straight to fine-tuning with `GridSearchCV`.

In [160]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# using our hint
param_grid = {'max_leaf_nodes': list(range(2, 100))}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           param_grid, cv=3, verbose=1)

grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 98 candidates, totalling 294 fits


GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=42),
             param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                                            13, 14, 15, 16, 17, 18, 19, 20, 21,
                                            22, 23, 24, 25, 26, 27, 28, 29, 30,
                                            31, ...]},
             verbose=1)

Now let's check the best estimator we got.

In [161]:
grid_search.best_estimator_

DecisionTreeClassifier(max_leaf_nodes=17, random_state=42)

In [162]:
grid_search.best_score_

0.8555001986342105

The mean cross-validated accuracy is about 85.6%, which is within our bounds! Now let's train our model and evaluate it on the test set!
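To see how the validation accuracy changes across the grid rather than only the winner, `cv_results_` can be inspected. The sketch below uses a smaller dataset and a coarser grid so it runs quickly; `X_demo`, `demo_grid` and `demo_search` are illustrative names and values, not the ones from the run above.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Smaller, illustrative setup (assumed values, not the notebook's full run).
X_demo, y_demo = make_moons(n_samples=1000, noise=0.4, random_state=42)
demo_grid = {'max_leaf_nodes': [2, 4, 8, 16, 32]}

demo_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# Mean validation accuracy for each candidate, in the order of the grid.
mean_scores = demo_search.cv_results_['mean_test_score']
for n_leaves, score in zip(demo_grid['max_leaf_nodes'], mean_scores):
    print(n_leaves, round(score, 4))

# best_score_ is the highest of these mean scores.
print(demo_search.best_params_, demo_search.best_score_)
```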

## Train Model

In [163]:
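# GridSearchCV refits the best model on the whole training set by default (refit=True),
# so best_estimator_ is already trained; fitting it again is harmless but redundant
dtree_clf = grid_search.best_estimator_

dtree_clf.fit(X_train, y_train)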

DecisionTreeClassifier(max_leaf_nodes=17, random_state=42)

In [164]:
from sklearn.metrics import accuracy_score

# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, dtree_clf.predict(X_test))

0.8695

Great! This is between 0.85 and 0.87.
<br>
Next, let's try creating our own Random Forest!

# Creating a Forest

Our plan is to train 1000 decision tree classifiers on 1000 smaller subsets of the training data, then classify any new instance with all 1000 trees and return the most common class among the 1000 predictions.
<br><br>
We will need to do the following:
- Generate 1000 random datasets, each containing 100 instances
- Train one d-tree on each of the 1000 datasets using the hyperparameters we found during fine-tuning
    - Evaluate the accuracy of each d-tree on the test set and average all 1000 accuracies (probably won't be very good)
- Create a function that makes a prediction on an instance using all 1000 d-trees
    - Gets all 1000 predictions using the d-trees
    - Finds the most common prediction and returns it
    - Create a function that can do the same for an array of instances
- Evaluate our forest on the test set
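
The voting step in the plan above can be sketched for a single instance with the standard library alone. `majority_vote` and the example `votes` are hypothetical names and values used only for illustration; this is not the function used on the real forest.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label in a sequence of per-tree predictions."""
    # most_common(1) returns a list with one (label, count) pair.
    label, _count = Counter(predictions).most_common(1)[0]
    return label

# Hypothetical votes from five trees for a single instance.
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # prints 1
```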

## Generate 1000 Smaller Datasets

In [165]:
from sklearn.model_selection import ShuffleSplit


n_trees = 1000
n_instances = 100
s_split = ShuffleSplit(n_splits=n_trees, train_size=n_instances,
                       random_state=42)

mini_sets = []

for mini_train_index, _ in s_split.split(X_train, y_train):
    # ignore test_indices
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    
    mini_sets.append((X_mini_train, y_mini_train))
    
mini_sets[0][0].shape, mini_sets[0][1].shape

((100, 2), (100,))

OK, `mini_sets` is now a list of 1000 `(X, y)` tuples, each holding 100 instances: `X` (input features) has shape `(100, 2)` and `y` (target labels) has shape `(100,)`.
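
To make the behaviour of `ShuffleSplit` more concrete, here is a small sketch on toy data. The sizes (10 instances, 3 splits of 4) and the names `X_toy`, `toy_split` and `toy_indices` are assumptions chosen for readability.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: 10 instances with a single feature (illustrative only).
X_toy = np.arange(10).reshape(-1, 1)

toy_split = ShuffleSplit(n_splits=3, train_size=4, random_state=42)

# Keep only the train indices of each split, as in the loop above.
toy_indices = [train_idx for train_idx, _ in toy_split.split(X_toy)]
for train_idx in toy_indices:
    print(np.sort(train_idx))
```

Within one split no index is repeated, but the splits are drawn independently of each other, so the same training instance can show up in several mini sets.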

## Train 1000 D-Trees

Now we need to iterate through the datasets and train one d-tree per dataset, then evaluate each tree's accuracy on the test set.

In [182]:
from sklearn.base import clone

# make sure to build the list with []: using () produces a generator object instead of a list of trees
# with a generator I was getting 50% accuracy because it was returning only 0s!
forest = [clone(grid_search.best_estimator_) for _ in range(len(mini_sets))]

accuracy_scores = []

for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
    tree.fit(X_mini_train, y_mini_train)
    
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

np.mean(accuracy_scores)

0.8054499999999999

On average, each tree on its own reaches only about 80% accuracy. Let's use the entire forest to predict.

## Predict Function

In [183]:
from scipy.stats import mode

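def forest_predict(forest, X):
    # one row of predictions per tree, one column per instance
    Y_pred = np.empty([len(forest), len(X)], dtype=np.uint8)

    for tree_index, tree in enumerate(forest):
        # predict on the X passed in, not on the global X_test
        Y_pred[tree_index] = tree.predict(X)

    # majority vote across trees (axis 0) for each instance
    return mode(Y_pred, axis=0)[0].reshape([-1])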

Y_pred = forest_predict(forest, X_test)
Y_pred.shape

(2000,)
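
Because the labels here are only 0 and 1, the vote can also be taken without `scipy` by thresholding the mean vote. This is an alternative sketch, not the method used above; the `votes` array is made up for illustration.

```python
import numpy as np

# Hypothetical predictions: 3 trees (rows) voting on 4 instances (columns).
votes = np.array([[1, 0, 1, 0],
                  [1, 0, 0, 0],
                  [0, 1, 1, 0]], dtype=np.uint8)

# For 0/1 labels, class 1 wins when more than half of the trees vote for it.
majority = (votes.mean(axis=0) > 0.5).astype(np.uint8)
print(majority)  # prints [1 0 1 0]
```

With an even number of trees, an exact tie resolves to class 0 in this sketch because of the strict `>` comparison.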

Let's check the accuracy!

In [184]:
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, Y_pred)

0.872
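
The forest beats the roughly 80% average of its individual trees. One way to build intuition for why voting helps is to compute how often a majority of voters is right when each voter is correct with probability `p_correct` and all voters are independent. That independence is an idealised assumption: the trees here were all trained on subsets of the same training set, so their errors are correlated and the real gain is much smaller than this calculation suggests. `majority_accuracy` is a hypothetical helper written for this illustration.

```python
from math import comb

def majority_accuracy(n_voters, p_correct):
    """Probability that a strict majority of independent voters is correct."""
    needed = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n_voters, k) * p_correct**k * (1 - p_correct)**(n_voters - k)
               for k in range(needed, n_voters + 1))

# Odd voter counts avoid ties; 0.8 mirrors the per-tree accuracy seen above.
for n in (1, 3, 11, 101):
    print(n, round(majority_accuracy(n, 0.8), 4))
```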