# Exercise 2
* Grow a forest by following these steps:
    * Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn’s ShuffleSplit class for this.
    * Train one decision tree on each subset, using the best hyperparameter values found in the previous exercise. Evaluate these 1,000 decision trees on the test set. Since they were trained on smaller sets, these decision trees will likely perform worse than the first decision tree, achieving only about 80% accuracy.
    * Now comes the magic. For each test set instance, generate the predictions of the 1,000 decision trees, and keep only the most frequent prediction (you can use SciPy’s mode() function for this). This approach gives you majority-vote predictions over the test set.
    * Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a random forest classifier!

## Import Libraries

In [12]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from scipy.stats import randint
from sklearn.metrics import classification_report
from sklearn.tree import export_graphviz
import graphviz



import pandas as pd
import numpy as np

In [3]:
## create the moons dataset
X_moons, y_moons = make_moons(n_samples=10000, noise=0.4, random_state=42)

In [4]:
## reading the features
X_moons

array([[ 0.9402914 ,  0.12230559],
       [ 0.12454026, -0.42477546],
       [ 0.26198823,  0.50841438],
       ...,
       [-0.24177973,  0.20957199],
       [ 0.90679645,  0.54958215],
       [ 2.08837082, -0.05050728]])

In [5]:
y_moons

array([1, 0, 0, ..., 1, 0, 1])

Observations:
* We have 2 features in this dataset; let's call them $x_1$ and $x_2$. The target variable is $y$.
* Let's combine them into a DataFrame for easier data exploration (if needed).

In [6]:
data = pd.concat([pd.DataFrame(X_moons, columns=["x1","x2"]),pd.DataFrame(y_moons,columns=["y"])], axis=1)
data.head()

         x1        x2  y
0  0.940291  0.122306  1
1  0.124540 -0.424775  0
2  0.261988  0.508414  0
3 -0.495238  0.072589  0
4 -0.879413  0.549373  0


In [7]:
data.shape

(10000, 3)

In [36]:
## Splitting Train/Test
training_data, testing_data = train_test_split(data, test_size=0.2, random_state=42)

In [37]:
training_data.shape

(8000, 3)

In [39]:
training_data.head()

            x1        x2  y
9254 -0.564135  0.292837  0
1561 -1.160335  0.965126  0
1670 -0.065988 -0.151911  1
6087 -0.386136  0.411831  0
6669  0.053037  0.373754  1


In [43]:
testing_data.shape

(2000, 3)

## Generating Subsets

In [58]:
X_train = training_data.drop(columns=["y"])
y_train = training_data["y"]


X_test = testing_data.drop(columns=["y"])
y_test = testing_data["y"]


In [None]:
## initializing shuffle split
n_trees = 1000
n_instances = 100
ss = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances,
                random_state=42)
subsets = []

for mini_train_index, mini_test_index in ss.split(X_train):
    X_mini_train = X_train.iloc[mini_train_index]
    y_mini_train = y_train.iloc[mini_train_index]
    subsets.append((X_mini_train, y_mini_train))

In [54]:
len(subsets)

1000
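
To make the `ShuffleSplit` trick above easier to see, here is a minimal sketch on a toy array. The array `X_toy` and the sizes are made up for illustration. It shows that setting `test_size` to "everything except the subset" leaves training indices of exactly the requested size, and that passing `train_size` directly should give subsets of the same size.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy stand-in for the training set: 20 rows, 2 features (hypothetical data).
X_toy = np.arange(40).reshape(20, 2)
n_instances_demo = 4  # rows per subset
n_subsets_demo = 5    # number of subsets (one per tree)

# Same trick as in the notebook: everything outside the subset goes to "test".
ss = ShuffleSplit(n_splits=n_subsets_demo,
                  test_size=len(X_toy) - n_instances_demo,
                  random_state=42)
subset_indices = [train_idx for train_idx, _ in ss.split(X_toy)]

# Alternative: ask for the subset size directly through train_size.
ss_direct = ShuffleSplit(n_splits=n_subsets_demo,
                         train_size=n_instances_demo,
                         random_state=42)
direct_indices = [train_idx for train_idx, _ in ss_direct.split(X_toy)]

# Each subset holds distinct row indices (sampling without replacement
# within a subset), but different subsets may overlap.
for idx in subset_indices:
    print(sorted(idx.tolist()))
```

Note that rows are drawn without replacement within a subset, so this is closer to "pasting" than to classic bootstrap bagging.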

## Training The Forest
* To train the forest we need 1,000 instances of `DecisionTreeClassifier`, each configured with the best parameters found in the previous exercise:

```
Best Parameters {'criterion': 'gini', 'max_depth': 8, 'max_leaf_nodes': 25, 'splitter': 'best'}
```

In [56]:
forest = [DecisionTreeClassifier(random_state=42, criterion="gini", max_depth=8, max_leaf_nodes=25, splitter="best") for _ in range(n_trees)]

* Now we fit each decision tree on its own subset and then collect its predictions for the test data.

In [59]:
## list for all predictions from the forest
forest_predictions = []

## list for trained forest
trained_forest = []

for tree, (X,y) in zip(forest,subsets):
    tree.fit(X,y)
    trained_forest.append(tree)
    tree_prediction = tree.predict(X_test)
    forest_predictions.append(tree_prediction)
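
The exercise also asks for the individual trees to be evaluated on the test set, which the loop above does not do. The following is a self-contained sketch of that step, not the notebook's own code: it rebuilds the data with the same settings, uses a scaled-down forest of 50 trees (an assumption to keep it fast), and the names `X_tr`, `X_te`, `tree_accuracies` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the data so the sketch runs on its own (same settings as above).
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

n_trees_demo, n_instances_demo = 50, 100  # scaled down from 1,000 trees
ss = ShuffleSplit(n_splits=n_trees_demo,
                  test_size=len(X_tr) - n_instances_demo,
                  random_state=42)

# Train one tree per subset and record its accuracy on the test set.
tree_accuracies = []
for subset_idx, _ in ss.split(X_tr):
    tree = DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42)
    tree.fit(X_tr[subset_idx], y_tr[subset_idx])
    tree_accuracies.append(accuracy_score(y_te, tree.predict(X_te)))

tree_accuracies = np.array(tree_accuracies)
print(f"mean={tree_accuracies.mean():.4f}  "
      f"min={tree_accuracies.min():.4f}  max={tree_accuracies.max():.4f}")
```

The mean of these per-tree accuracies is the number to compare against the "about 80%" quoted in the exercise statement.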
    

In [60]:
forest_predictions

[array([0, 1, 0, ..., 0, 0, 1]),
 array([1, 1, 1, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 1]),
 array([0, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 1]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 1, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([0, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 1, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 1]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 1]),
 array([1, 1, 0, ..., 0, 0, 1]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 1]),
 array([1, 1, 0, ..., 0, 0, 1]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 1]),
 array([1, 1, 0, ..., 0, 0, 1]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([1,

In [61]:
trained_forest

[DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
 DecisionTreeClassifier(max_depth=8, m

* The forest predictions now form a matrix of shape $n_{\text{trees}} \times n_{\text{test}}$, where $n_{\text{test}}$ is the length of the test data.
* For each column (one test instance), we take the mode of that column as the majority-vote prediction.
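
As a tiny worked example of the column-wise mode, here is a sketch on a hand-made vote matrix (3 hypothetical trees, 4 hypothetical test instances). Wrapping the result in `np.asarray(...).ravel()` is a defensive choice, since the shape returned by `scipy.stats.mode` differs between SciPy versions.

```python
import numpy as np
from scipy.stats import mode

# Hypothetical votes: 3 trees (rows) x 4 test instances (columns).
votes = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
])

# Mode down each column = majority vote per test instance.
# ravel() flattens the result whether SciPy returns shape (4,) or (1, 4).
majority = np.asarray(mode(votes, axis=0).mode).ravel()
print(majority)  # [0 1 0 0]

# For 0/1 labels and an odd number of trees, thresholding the mean
# vote at 0.5 gives the same answer without SciPy.
majority_via_mean = (votes.mean(axis=0) > 0.5).astype(int)
print(majority_via_mean)  # [0 1 0 0]
```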

In [90]:
from scipy.stats import mode
frequent_forest_predictions = []
temp = np.array(forest_predictions)
for i in range(temp.shape[1]):
    prediction,_ = mode(temp[::,i])
    frequent_forest_predictions.append(prediction)


In [91]:
frequent_forest_predictions

[1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,


In [92]:
print(classification_report(y_test, frequent_forest_predictions))

              precision    recall  f1-score   support

           0       0.87      0.87      0.87      1013
           1       0.87      0.87      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



Observations:
* The book suggests the majority vote should be about 0.5 to 1.5% more accurate than the first model, but I don't see that improvement here.
* There is some improvement, but it is not significant. I wonder whether larger forests (more trees, or more instances per tree) would give better results.
* To experiment easily with different numbers of trees and subset sizes, let's convert the code above into functions and call them with different arguments.
    * We could also try a grid search, but for now we'll keep things manual.
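
One thing worth keeping in mind: `classification_report` rounds to two decimals, so a gain of 0.5 to 1.5% can be hard to see in it. A rough sketch of a more direct comparison follows. It is self-contained and scaled down to 200 trees (an assumption for speed), and the baseline here is a single tree refit on the full training set with the same hyperparameters, which may not match the first model from the previous exercise exactly.

```python
import numpy as np
from scipy.stats import mode
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

tree_params = dict(criterion="gini", max_depth=8, max_leaf_nodes=25, random_state=42)

# Baseline: one tree trained on the full training set.
single_tree = DecisionTreeClassifier(**tree_params).fit(X_tr, y_tr)
single_acc = accuracy_score(y_te, single_tree.predict(X_te))

# Forest: 200 trees, each trained on a random subset of 100 instances.
ss = ShuffleSplit(n_splits=200, test_size=len(X_tr) - 100, random_state=42)
votes = np.array([
    DecisionTreeClassifier(**tree_params).fit(X_tr[idx], y_tr[idx]).predict(X_te)
    for idx, _ in ss.split(X_tr)
])
forest_pred = np.asarray(mode(votes, axis=0).mode).ravel()
forest_acc = accuracy_score(y_te, forest_pred)

# Four decimals make sub-percent differences visible.
print(f"single tree : {single_acc:.4f}")
print(f"forest vote : {forest_acc:.4f}")
print(f"difference  : {forest_acc - single_acc:+.4f}")
```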

In [116]:
from scipy.stats import mode
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
import numpy as np

def generate_subsets(n_trees, n_instances, X_train, y_train):
    print(f"Trees : {n_trees}, Instances : {n_instances}")
    ss = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances, random_state=42)
    subsets = []
    X_train_np, y_train_np = X_train.to_numpy(), y_train.to_numpy()

    for mini_train_index, _ in ss.split(X_train_np):
        X_mini_train = X_train_np[mini_train_index]
        y_mini_train = y_train_np[mini_train_index]
        subsets.append((X_mini_train, y_mini_train))

    return subsets

def generate_trained_forest(n_trees, n_instances, X_train, y_train):
    subsets = generate_subsets(n_trees=n_trees, n_instances=n_instances, X_train=X_train, y_train=y_train)
    forest = [DecisionTreeClassifier(random_state=42, criterion="gini",
                                     max_depth=8, max_leaf_nodes=25, splitter="best") for _ in range(n_trees)]
    trained_forest = []
    for tree, (X, y) in zip(forest, subsets):
        tree.fit(X, y)
        trained_forest.append(tree)
    
    return trained_forest

def get_majority_predictions(forest_predictions):
    majority_predictions, _ = mode(forest_predictions, axis=0)
    return majority_predictions.ravel()

def forest_prediction(n_trees, n_instances, X_train, y_train, X_testing_data):
    forest_predictions = []
    trained_forest = generate_trained_forest(n_trees=n_trees, n_instances=n_instances, X_train=X_train, y_train=y_train)
    
    for tree in trained_forest:
        tree_prediction = tree.predict(X_testing_data.to_numpy())
        forest_predictions.append(tree_prediction)
    
    forest_predictions = np.array(forest_predictions)
    return get_majority_predictions(forest_predictions)

In [120]:
def run_forest_experiment(n_trees, n_instances):
    # Uses the module-level training_data / testing_data DataFrames.
    X_train = training_data.drop(columns=["y"])
    y_train = training_data["y"]
    X_test = testing_data.drop(columns=["y"])
    y_test = testing_data["y"]
    forest_predictions = forest_prediction(n_trees=n_trees, n_instances=n_instances,
                                           X_train=X_train, y_train=y_train, X_testing_data=X_test)
    # NOTE: in this label, "forest size" is the number of trees and
    # "number of forests" is really the number of instances per subset (n_instances).
    print(f"***** Results for forest size {n_trees} and number of forests {n_instances} *****")
    print(classification_report(y_test, forest_predictions))

In [121]:
run_forest_experiment(n_trees=100, n_instances=1000)

Trees : 100, Instances : 1000
***** Results for forest size 100 and number of forests 1000 *****
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      1013
           1       0.87      0.87      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



* This run was just a check of the refactored code, using 100 trees with 1,000 instances each.

### Increasing Forest Size

In [122]:
run_forest_experiment(n_trees=150, n_instances=1000)

Trees : 150, Instances : 1000
***** Results for forest size 150 and number of forests 1000 *****
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      1013
           1       0.87      0.87      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



In [123]:
run_forest_experiment(n_trees=200, n_instances=1000)

Trees : 200, Instances : 1000
***** Results for forest size 200 and number of forests 1000 *****
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      1013
           1       0.87      0.87      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



In [124]:
run_forest_experiment(n_trees=500, n_instances=1000)

Trees : 500, Instances : 1000
***** Results for forest size 500 and number of forests 1000 *****
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      1013
           1       0.87      0.87      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



In [125]:
run_forest_experiment(n_trees=800, n_instances=1000)

Trees : 800, Instances : 1000
***** Results for forest size 800 and number of forests 1000 *****
              precision    recall  f1-score   support

           0       0.88      0.87      0.87      1013
           1       0.87      0.87      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



Observations:
* Interesting: even after increasing the number of trees 8 times (from 100 to 800), the scores do not change significantly.
* The training and prediction time did decrease a bit.
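
Instead of retraining a forest for every size, one option is to train the largest forest once and score the majority vote of its first `k` trees. The sketch below does this under a few assumptions that differ from the runs above: it is self-contained, uses 100 instances per tree and at most 301 trees, and only odd values of `k` so that a 0/1 vote cannot tie.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

n_max_trees, subset_size = 301, 100  # assumed demo sizes
ss = ShuffleSplit(n_splits=n_max_trees,
                  test_size=len(X_tr) - subset_size,
                  random_state=42)

# votes[i, j] = prediction of tree i for test instance j (labels are 0/1).
votes = np.array([
    DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42)
    .fit(X_tr[idx], y_tr[idx])
    .predict(X_te)
    for idx, _ in ss.split(X_tr)
])

# Running number of "class 1" votes after the first k trees.
ones_so_far = np.cumsum(votes, axis=0)

accuracy_by_size = {}
for k in (1, 11, 51, 101, 301):  # odd sizes, so a 0/1 vote can never tie
    majority = (ones_so_far[k - 1] > k / 2).astype(int)
    accuracy_by_size[k] = float(np.mean(majority == y_te))
    print(f"{k:>4} trees: accuracy = {accuracy_by_size[k]:.4f}")
```

If the accuracy flattens out after a few dozen trees, that would be consistent with the observation that going from 100 to 800 trees changes little.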

In [126]:
run_forest_experiment(n_trees=1200, n_instances=1000)

Trees : 1200, Instances : 1000
***** Results for forest size 1200 and number of forests 1000 *****
              precision    recall  f1-score   support

           0       0.88      0.87      0.87      1013
           1       0.87      0.88      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



### Increasing Number of Instances per Tree

In [114]:
run_forest_experiment(n_trees=100, n_instances=1500)

***** Results for forest size 100 and number of forests 1500 *****
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      1013
           1       0.87      0.87      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



In [127]:
run_forest_experiment(n_trees=100, n_instances=2000)

Trees : 100, Instances : 2000
***** Results for forest size 100 and number of forests 2000 *****
              precision    recall  f1-score   support

           0       0.88      0.86      0.87      1013
           1       0.86      0.88      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



Observation:
* Increasing the number of instances per tree (labelled "number of forests" in the output) did change the numbers a bit, but nothing significant.

In [128]:
run_forest_experiment(n_trees=200, n_instances=3000)

Trees : 200, Instances : 3000
***** Results for forest size 200 and number of forests 3000 *****
              precision    recall  f1-score   support

           0       0.87      0.86      0.87      1013
           1       0.86      0.87      0.86       987

    accuracy                           0.86      2000
   macro avg       0.86      0.86      0.86      2000
weighted avg       0.86      0.86      0.86      2000



In [129]:
run_forest_experiment(n_trees=100, n_instances=5000)

Trees : 100, Instances : 5000
***** Results for forest size 100 and number of forests 5000 *****
              precision    recall  f1-score   support

           0       0.88      0.86      0.87      1013
           1       0.86      0.88      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



In [130]:
run_forest_experiment(n_trees=800, n_instances=3000)

Trees : 800, Instances : 3000
***** Results for forest size 800 and number of forests 3000 *****
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      1013
           1       0.86      0.87      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



Observation:
* I am not seeing any big improvement in the numbers. Assuming the implementation is correct, the only other explanation I can think of is that the hyperparameters are not optimized for such small subsets, or that the algorithm itself is not optimized.
* Let's try a grid search that combines the decision tree parameters with the forest parameters.

### GridSearchCV

In [133]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from scipy.stats import mode
import numpy as np

class RandomForestBuilder(BaseEstimator, ClassifierMixin):
    def __init__(self, n_trees=100, n_instances=100, max_depth=None, max_leaf_nodes=None, criterion="gini", splitter="best", random_state=42):
        self.n_trees = n_trees
        self.n_instances = n_instances
        self.max_depth = max_depth
        self.max_leaf_nodes = max_leaf_nodes
        self.criterion = criterion
        self.splitter = splitter
        self.random_state = random_state
        self.forest = []

    def fit(self, X, y):
        # Create subsets
        ss = ShuffleSplit(n_splits=self.n_trees, test_size=len(X) - self.n_instances, random_state=self.random_state)
        self.forest = []
        
        for mini_train_index, _ in ss.split(X):
            X_mini_train = X.iloc[mini_train_index]
            y_mini_train = y.iloc[mini_train_index]
            
            # Train a decision tree on each subset
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                max_leaf_nodes=self.max_leaf_nodes,
                criterion=self.criterion,
                splitter=self.splitter,
                random_state=self.random_state
            )
            tree.fit(X_mini_train, y_mini_train)
            self.forest.append(tree)
        
        # Define classes_
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        # Collect predictions from all trees
        forest_predictions = np.array([tree.predict(X) for tree in self.forest])
        majority_predictions, _ = mode(forest_predictions, axis=0)
        return majority_predictions.ravel()

    def score(self, X, y):
        # Use accuracy as the default scoring metric
        predictions = self.predict(X)
        return np.mean(predictions == y)

In [136]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    "n_trees": [10, 50, 100, 500, 1000],      # Test different numbers of trees
    "n_instances": [50, 100, 200, 500, 1000],  # Test different subset sizes
    "max_depth": [20, 50, 100, None],          # Test tree depth
    "max_leaf_nodes": [500, 750, 100],         # Test leaf node sizes
    "criterion": ["gini", "entropy"],          # Test splitting criteria
}

# Initialize the custom estimator
forest_builder = RandomForestBuilder()

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=forest_builder,
    param_grid=param_grid,
    scoring="f1_weighted",  # Optimize for F1-score
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,  # Use all available cores
)

# Fit the grid search
grid_search.fit(X_train, y_train)

# Print best parameters and score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best F1-Score: {grid_search.best_score_}")

Best Parameters: {'criterion': 'gini', 'max_depth': 20, 'max_leaf_nodes': 500, 'n_instances': 200, 'n_trees': 100}
Best F1-Score: 0.8633551200830955
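
For a sense of how much work this search does, the grid can be counted with scikit-learn's `ParameterGrid`. The snippet below copies the `param_grid` from the cell above; the 5 folds match `cv=5`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell above.
param_grid = {
    "n_trees": [10, 50, 100, 500, 1000],
    "n_instances": [50, 100, 200, 500, 1000],
    "max_depth": [20, 50, 100, None],
    "max_leaf_nodes": [500, 750, 100],
    "criterion": ["gini", "entropy"],
}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 5 * 4 * 3 * 2
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 600 3000
```

Each of those fits trains an entire forest (up to 1,000 trees), and `GridSearchCV` adds one final refit on the full training set by default, so the search is expensive relative to the small gain it reports.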


In [137]:
## Training on the whole training set with the optimized params
optimized_forest = RandomForestBuilder(
    criterion= 'gini', 
    max_depth= 20,
    max_leaf_nodes= 500, 
    n_instances= 200, 
    n_trees= 100
)

In [138]:
optimized_forest.fit(X_train, y_train)

In [139]:
optimized_predictions = optimized_forest.predict(X_test)

In [141]:
print(classification_report(y_test, optimized_predictions))

              precision    recall  f1-score   support

           0       0.88      0.87      0.88      1013
           1       0.87      0.87      0.87       987

    accuracy                           0.87      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.87      0.87      0.87      2000



Observations:
* Slight improvement in the `f1` score, but not significant.
* I think it's safe to say that, given the dataset and the model we are trying to train, this is the best performance we can get.
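
One way to cross-check the hand-rolled forest is scikit-learn's `BaggingClassifier`, configured to draw 100 instances per tree without replacement. This is a sketch of a related technique rather than an exact replica of the code above: when the base estimator supports `predict_proba`, `BaggingClassifier` averages predicted class probabilities instead of taking a hard majority vote. The 200 trees and the rebuilt data are assumptions to keep the example self-contained and quick.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

bagging = BaggingClassifier(
    # Base tree passed positionally, since its keyword name changed
    # between scikit-learn versions.
    DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=42),
    n_estimators=200,   # number of trees (scaled down for speed)
    max_samples=100,    # an int means exactly 100 instances per tree
    bootstrap=False,    # sample without replacement, like the ShuffleSplit subsets
    random_state=42,
)
bagging.fit(X_tr, y_tr)

bagging_acc = accuracy_score(y_te, bagging.predict(X_te))
print(f"BaggingClassifier accuracy: {bagging_acc:.4f}")
```

If this lands close to the 0.87 seen throughout the notebook, that would support the conclusion that the plateau comes from the noise in the data rather than from a bug in the custom implementation.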
