# Exercise 2
* Grow a forest by following these steps:
    * Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn’s ShuffleSplit class for this.
    * Train one decision tree on each subset, using the best hyperparameter values found in the previous exercise. Evaluate these 1,000 decision trees on the test set. Since they were trained on smaller sets, these decision trees will likely perform worse than the first decision tree, achieving only about 80% accuracy.
    * Now comes the magic. For each test set instance, generate the predictions of the 1,000 decision trees, and keep only the most frequent prediction (you can use SciPy’s mode() function for this). This approach gives you majority-vote predictions over the test set.
    * Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a random forest classifier!

## Import Libraries

In [12]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from scipy.stats import randint
from sklearn.metrics import classification_report
from sklearn.tree import export_graphviz
import graphviz

import pandas as pd
import numpy as np

In [3]:
## create the moons dataset
X_moons, y_moons = make_moons(n_samples=10000, noise=0.4, random_state=42)

In [4]:
## reading the features
X_moons

array([[ 0.9402914 ,  0.12230559],
       [ 0.12454026, -0.42477546],
       [ 0.26198823,  0.50841438],
       ...,
       [-0.24177973,  0.20957199],
       [ 0.90679645,  0.54958215],
       [ 2.08837082, -0.05050728]])

In [5]:
y_moons

array([1, 0, 0, ..., 1, 0, 1])

Observations:
* The dataset has 2 features, which we'll call $x_1$ and $x_2$, and the target variable is $y$.
* Let's combine them into a DataFrame for easier data exploration (if needed).

In [6]:
data = pd.concat([pd.DataFrame(X_moons, columns=["x1","x2"]),pd.DataFrame(y_moons,columns=["y"])], axis=1)
data.head()

| | x1 | x2 | y |
|---|---|---|---|
| 0 | 0.940291 | 0.122306 | 1 |
| 1 | 0.12454 | -0.424775 | 0 |
| 2 | 0.261988 | 0.508414 | 0 |
| 3 | -0.495238 | 0.072589 | 0 |
| 4 | -0.879413 | 0.549373 | 0 |


In [7]:
data.shape

(10000, 3)

In [36]:
## Splitting Train/Test
training_data, testing_data = train_test_split(data, test_size=0.2, random_state=42)

In [37]:
training_data.shape

(8000, 3)

In [None]:
training_data.head()

| | x1 | x2 | y |
|---|---|---|---|
| 9254 | -0.564135 | 0.292837 | 0 |
| 1561 | -1.160335 | 0.965126 | 0 |
| 1670 | -0.065988 | -0.151911 | 1 |
| 6087 | -0.386136 | 0.411831 | 0 |
| 6669 | 0.053037 | 0.373754 | 1 |


In [43]:
testing_data.shape

(2000, 3)

## Generating Subsets

In [58]:
X_train = training_data.drop(columns=["y"])
y_train = training_data["y"]


X_test = testing_data.drop(columns=["y"])
y_test = testing_data["y"]


In [49]:
## initializing shuffle split
n_trees = 1000
n_instances = 100

ss = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances,
                  random_state=42)
subsets = []

for mini_train_index, mini_test_index in ss.split(X_train):
    X_mini_train = X_train.iloc[mini_train_index]
    y_mini_train = y_train.iloc[mini_train_index]
    subsets.append((X_mini_train, y_mini_train))
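As a quick sanity check on the splitter, here is a minimal sketch that confirms each split yields exactly 100 training indices. It uses a placeholder array (`X_dummy`, a hypothetical stand-in for `X_train`) because only the number of rows matters to `ShuffleSplit`, and it keeps `n_splits` small. It also shows that passing `train_size` directly should be an equivalent way to request 100-instance subsets.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_total = 8000      # size of the training set in this notebook
n_instances = 100   # instances per subset
n_splits = 5        # kept small here; the notebook uses 1,000

# Placeholder data: only the number of rows matters to ShuffleSplit
X_dummy = np.zeros((n_total, 2))

# Same configuration as the notebook: everything except 100 rows goes to the "test" side
ss = ShuffleSplit(n_splits=n_splits, test_size=n_total - n_instances, random_state=42)
subset_sizes = [len(train_idx) for train_idx, _ in ss.split(X_dummy)]

# Alternative: ask for the subset size directly through train_size
ss_alt = ShuffleSplit(n_splits=n_splits, train_size=n_instances, random_state=42)
alt_sizes = [len(train_idx) for train_idx, _ in ss_alt.split(X_dummy)]

# Within one subset, indices are drawn without replacement
first_train_idx, _ = next(ss.split(X_dummy))
n_unique = len(np.unique(first_train_idx))

print(subset_sizes, alt_sizes, n_unique)
```

Note that instances are unique within a subset but can reappear across subsets, since each split reshuffles the full training set.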


In [54]:
len(subsets)

1000

## Training The Forest
* To train the forest, we need 1,000 instances of `DecisionTreeClassifier`, each using the best hyperparameters found in the previous exercise:

```
Best Parameters {'criterion': 'gini', 'max_depth': 8, 'max_leaf_nodes': 25, 'splitter': 'best'}
```

In [56]:
forest = [DecisionTreeClassifier(random_state=42, criterion="gini", max_depth=8, max_leaf_nodes=25, splitter="best") for _ in range(n_trees)]
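An alternative way to build the list of untrained trees, sketched here as an option rather than a correction, is to define one template estimator and copy it with `sklearn.base.clone`. The name `template` is my own; the hyperparameters are the ones listed above.

```python
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

n_trees = 1000

# Template holding the best hyperparameters from the previous exercise
template = DecisionTreeClassifier(
    criterion="gini", max_depth=8, max_leaf_nodes=25, splitter="best", random_state=42
)

# clone() returns a new, unfitted estimator with the same parameters
forest = [clone(template) for _ in range(n_trees)]

# Every element is a separate object, so fitting one tree does not affect the others
n_distinct = len({id(tree) for tree in forest})
print(len(forest), n_distinct)
```

This keeps the hyperparameters in one place, which helps if they need to change later.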

* Next, we fit each decision tree on its own subset and record its predictions on the test set.

In [59]:
## list for all predictions from the forest
forest_predictions = []

## list for trained forest
trained_forest = []

for tree, (X,y) in zip(forest,subsets):
    tree.fit(X,y)
    trained_forest.append(tree)
    tree_prediction = tree.predict(X_test)
    forest_predictions.append(tree_prediction)
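The exercise also asks for the accuracy of each individual tree on the test set, which the loop above does not compute. The following is a minimal, self-contained sketch of that step. It rebuilds the data with NumPy arrays instead of the DataFrame, and it uses only 50 trees (my assumption, to keep it quick) rather than 1,000.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

n_trees = 50        # assumption: reduced from 1,000 to keep the sketch fast
n_instances = 100

ss = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances, random_state=42)

# Fit one small tree per subset and score it on the full test set
tree_accuracies = []
for train_idx, _ in ss.split(X_train):
    tree = DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42)
    tree.fit(X_train[train_idx], y_train[train_idx])
    tree_accuracies.append(accuracy_score(y_test, tree.predict(X_test)))

tree_accuracies = np.array(tree_accuracies)
print(f"mean: {tree_accuracies.mean():.3f}, "
      f"min: {tree_accuracies.min():.3f}, max: {tree_accuracies.max():.3f}")
```

The mean of `tree_accuracies` is the number to compare against the roughly 80% the exercise text anticipates.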
    

In [60]:
forest_predictions

[array([0, 1, 0, ..., 0, 0, 1]),
 array([1, 1, 1, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 ...]

In [61]:
trained_forest

[DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 ...]

* We now have the forest predictions, which form an $n_{\text{trees}} \times n_{\text{test}}$ matrix, where $n_{\text{test}}$ is the number of test instances.
* For each column (one per test instance), we take the mode as the forest's prediction.

In [90]:
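from scipy.stats import mode

# Stack the per-tree predictions into an (n_trees, n_test) array
temp = np.array(forest_predictions)

# Majority vote: the most frequent prediction in each column
frequent_forest_predictions = []
for i in range(temp.shape[1]):
    prediction, _ = mode(temp[:, i])
    frequent_forest_predictions.append(prediction)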
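Because the labels here are binary (0 or 1), the column-wise vote can also be computed without a Python loop. This is a NumPy-only sketch on a toy prediction matrix (5 trees, 4 test instances, values made up for illustration). Sending ties to class 0 is my own assumption about how to break them.

```python
import numpy as np

# Toy stand-in for the (n_trees, n_test) prediction matrix: 5 trees, 4 test instances
predictions = np.array([
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
])

n_trees = predictions.shape[0]

# Count the votes for class 1 in each column (one column per test instance)
votes_for_one = predictions.sum(axis=0)

# Class 1 wins only with a strict majority; ties fall back to class 0
majority_vote = (2 * votes_for_one > n_trees).astype(int)

print(majority_vote)  # [1 1 0 1]
```

This shortcut only works for 0/1 labels; with more than two classes a general mode computation is still needed.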


In [91]:
frequent_forest_predictions

[1,
 1,
 0,
 0,
 0,
 ...]


In [92]:
print(classification_report(y_test, frequent_forest_predictions))

              precision    recall  f1-score   support

           0       0.87      0.87      0.87      1013
           1       0.87      0.87      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



Observations:
* The book suggests the forest should score about 0.5 to 1.5% higher than the first model, but I don't see an improvement of that size.
* There is some improvement, but it is not significant. I wonder whether a larger forest (more trees) would give better results.
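One way to probe the forest-size question is to track the majority-vote accuracy as trees are added, next to a single tree trained on the full training set. The sketch below is self-contained and makes a few assumptions of mine: it caps the forest at 300 trees to stay quick, breaks ties toward class 0, and picks the checkpoint sizes arbitrarily. I have not recorded its output here, so treat it as an experiment to run, not a result.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

max_trees = 300     # assumption: smaller than 1,000 to keep the sketch quick
n_instances = 100
params = dict(max_depth=8, max_leaf_nodes=25, random_state=42)

# Baseline: one tree trained on the full training set
single_tree = DecisionTreeClassifier(**params).fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, single_tree.predict(X_test))

# Train the small trees and collect their test predictions, one row per tree
ss = ShuffleSplit(n_splits=max_trees, test_size=len(X_train) - n_instances, random_state=42)
all_predictions = np.array([
    DecisionTreeClassifier(**params).fit(X_train[idx], y_train[idx]).predict(X_test)
    for idx, _ in ss.split(X_train)
])

# Running count of votes for class 1 after 1, 2, ..., max_trees trees
cumulative_votes = np.cumsum(all_predictions, axis=0)

# Majority-vote accuracy using only the first n trees (ties go to class 0)
forest_accuracy = {}
for n in (1, 10, 50, 100, max_trees):
    majority = (2 * cumulative_votes[n - 1] > n).astype(int)
    forest_accuracy[n] = accuracy_score(y_test, majority)

print(f"single tree on full training set: {baseline_accuracy:.4f}")
for n, acc in forest_accuracy.items():
    print(f"forest of {n:>4} trees: {acc:.4f}")
```

If the accuracy flattens well before the last checkpoint, adding more trees is unlikely to help much, and the subset size or hyperparameters would be the next things to vary.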