# Random Forests

Random Forest is a powerful **Ensemble Model** that is widely used for both classification and regression tasks. It combines the predictions of multiple decision trees using **Bagging** (Bootstrap Aggregation), which reduces variance (and therefore overfitting) and improves the accuracy of the model.
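
To see why averaging helps, and why decorrelating the trees matters, consider a simplified setting (an assumption made here for illustration, not something derived in this notebook): the $T$ tree predictions $\hat{y}^1, \dots, \hat{y}^T$ each have variance $\sigma^2$, and every pair of them has correlation $\rho$. Then

$$ \text{Var}\left(\frac{1}{T}\sum_{t=1}^{T}\hat{y}^t\right) = \rho\,\sigma^2 + \frac{1-\rho}{T}\,\sigma^2 $$

The second term shrinks as $T$ grows, while the first is limited by how correlated the trees are. This is the motivation for training each tree on different samples and different features.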

## Bootstrap Sampling
Given a dataset $D = \{(x^1, y^1), (x^2, y^2), ..., (x^m, y^m)\}$ with $m$ samples and $n$ features for each datapoint $x^i$:
- Sample $T$ datasets $D_1, D_2, ..., D_T$ with replacement, each of size $m$.
- Each dataset is restricted to a random subset of $k$ of the $n$ features ($k < n$) to reduce correlation among the trees.
- Each dataset $D_i$ is used to train a decision tree.

The probability that a specific sample is **not** selected in a bootstrapped dataset is:
$$ P(\text{not selected}) = \left(1 - \frac{1}{m}\right)^m \approx \frac{1}{e} \approx 0.368 $$

Thus, about 36.8% of the samples are **Out-of-Bag** (OOB) for any given tree and can be used as a validation set for that tree.
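
The sampling steps above can be checked empirically with a minimal NumPy sketch. The dataset size `m`, the feature count `n`, and the seed are hypothetical values picked only for illustration; nothing here uses the Iris data.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, only for reproducibility
m, n = 10_000, 16                # hypothetical number of samples and features
k = int(np.sqrt(n))              # number of features kept for one tree

# Bootstrap: draw m row indices with replacement
indices = rng.choice(m, size=m, replace=True)

# Rows that were never drawn are out-of-bag for this tree
oob = np.setdiff1d(np.arange(m), indices)

# Feature subset: k distinct column indices
features = rng.choice(n, size=k, replace=False)

oob_fraction = oob.size / m
print(f"OOB fraction: {oob_fraction:.3f}  (1/e = {np.exp(-1):.3f})")
```

For large `m` the printed fraction should land close to $1/e$; for small datasets it fluctuates noticeably from tree to tree.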

For each dataset, a decision tree is then built from its limited set of features: the best feature and threshold are selected recursively until a stopping condition is met.

## Predicting values from the Decision Trees
- **For Classification**: Use majority voting (mode) to pick the most frequent class label among the $T$ trees:
  $$ \hat{y} = \text{mode}(\hat{y}^1, \hat{y}^2, ..., \hat{y}^T) $$
- **For Regression**: Take the mean of the predictions from $T$ trees:
  $$ \hat{y} = \frac{1}{T} \sum_{i=1}^T \hat{y}^i $$
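
As a small worked example of the two aggregation rules, the sketch below uses made-up predictions from five classification trees and four regression trees. The arrays are illustrative values, not outputs of this notebook.

```python
import numpy as np

# Hypothetical class labels predicted by T = 5 trees for one input
class_votes = np.array([2, 0, 2, 1, 2])
labels, counts = np.unique(class_votes, return_counts=True)
y_hat_class = labels[np.argmax(counts)]   # majority vote -> 2

# Hypothetical real-valued predictions from T = 4 trees for one input
reg_preds = np.array([3.0, 2.5, 3.5, 3.0])
y_hat_reg = reg_preds.mean()              # average -> 3.0

print(y_hat_class, y_hat_reg)
```

On a tie, `np.argmax` returns the first maximum, so this sketch resolves ties in favour of the smallest label.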


## Algorithm

1. Input: Dataset $D$, number of trees $T$, number of features to select $k$ (commonly chosen as $\sqrt{n}$).
2. For $t = 1$ to $T$:
   - Sample a bootstrapped dataset $D_t$ from $D$.
   - Train a decision tree $h_t$ on $D_t$:
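     - At each node, randomly select $k$ features. (The implementation below uses the simpler variant described under Bootstrap Sampling and draws one subset of $k$ features per tree.)
     - Split the node on the best feature $j$ and split threshold among the selected features. (See DecisionTree.ipynb)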
   - Store the formed tree $h_t$.
3. For a new input $x$, aggregate predictions:
   - Classification: Use majority voting across all $T$ trees.
   - Regression: Use the average prediction across all $T$ trees.
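
The per-node step of the algorithm (sample $k$ features, then pick the best feature and threshold) could look roughly like the sketch below. It assumes Gini impurity as the split criterion and midpoints between distinct values as candidate thresholds; the custom `DecisionTree` package may use a different criterion, and the helper names `gini` and `best_split` are hypothetical.

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label vector (0.0 for an empty vector)."""
    if y.size == 0:
        return 0.0
    p = np.bincount(y) / y.size
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, k, rng):
    """Search k randomly chosen features for the split with the lowest weighted Gini."""
    candidates = rng.choice(X.shape[1], size=k, replace=False)
    best = (None, None, np.inf)               # (feature, threshold, impurity)
    for j in candidates:
        values = np.unique(X[:, j])           # sorted distinct values
        # Candidate thresholds: midpoints between consecutive distinct values
        for thr in (values[:-1] + values[1:]) / 2:
            left = y[X[:, j] <= thr]
            right = y[X[:, j] > thr]
            score = (left.size * gini(left) + right.size * gini(right)) / y.size
            if score < best[2]:
                best = (int(j), float(thr), float(score))
    return best

rng = np.random.default_rng(0)
X_demo = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0]])
y_demo = np.array([0, 0, 1, 1])
feature, threshold, impurity = best_split(X_demo, y_demo, k=2, rng=rng)
print(feature, threshold, impurity)           # 0 2.5 0.0
```

In this toy data the second column is constant and offers no thresholds, so the split always falls on feature 0 at 2.5, which separates the two classes perfectly.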


## Out-of-Bag (OOB) Error

OOB error is an estimate of prediction error based on "out-of-bag" samples (samples not included in the bootstrapped dataset for a given tree).

For each sample $x^i$:
- Aggregate predictions from all trees that did not include $x^i$ in their bootstrapped dataset.
- Compare the aggregated prediction with the true label $y^i$; the OOB error is the fraction of samples for which the two disagree.
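
The procedure can be traced by hand on a tiny, made-up example with 3 trees and 4 samples. The arrays `preds` and `is_oob` are hypothetical; in practice they would come from the trained trees and their bootstrap indices. The sketch reports accuracy, so the OOB error is one minus this value.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0])

# preds[t, i] = prediction of tree t for sample i (made-up values)
preds = np.array([[0, 1, 0, 0],
                  [0, 1, 1, 1],
                  [1, 1, 1, 0]])

# is_oob[t, i] = True if sample i was NOT in the bootstrap sample of tree t
is_oob = np.array([[True,  False, True,  False],
                   [True,  True,  False, False],
                   [False, True,  False, False]])

correct, evaluated = 0, 0
for i in range(y_true.size):
    votes = preds[is_oob[:, i], i]        # votes from trees that never saw sample i
    if votes.size == 0:
        continue                          # sample 3 is in every bootstrap sample: skip it
    labels, counts = np.unique(votes, return_counts=True)
    correct += int(labels[np.argmax(counts)] == y_true[i])
    evaluated += 1

oob_accuracy = correct / evaluated        # 2 of the 3 evaluated samples are correct
print(f"OOB accuracy: {oob_accuracy:.3f}")
```

Samples that are never out-of-bag have no OOB prediction, so this sketch leaves them out of the denominator rather than counting them as errors.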

## Implementation

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
pd.set_option('future.no_silent_downcasting', True)
import matplotlib.pyplot as plt
%matplotlib inline

from scipy.stats import mode

from ucimlrepo import fetch_ucirepo 
from DT_Package.decisiontree import Node, DecisionTree        # Custom Decision Tree package (see DecisionTree.ipynb)

### Implementing Random Forest class 

In [2]:
class RF:
    def __init__(self, tree_ct=40, max_depth=5):
        self.T = tree_ct
        self.max_d = max_depth
        self.trees = []
        self.oobs = []
    
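    def bootstrap(self, X, y):
        # Draw m row indices with replacement; rows never drawn are out-of-bag
        indices = np.random.choice(X.shape[0], size=X.shape[0], replace=True)
        oobs = np.setdiff1d(np.arange(X.shape[0]), indices)

        # Pick k = floor(sqrt(n)) distinct feature columns for this tree
        features = np.random.choice(X.shape[1], size=int(X.shape[1]**0.5), replace=False)

        return indices, features, oobs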

    def train(self, X, y):
        for i in range(self.T):
            indices, features, oob = self.bootstrap(X,y)
            X_train = X[indices,:][:, features]
            y_train = y[indices]
            
            tree = DecisionTree(max_depth=self.max_d)
            tree.train(X_train, y_train)

            self.trees.append((tree, features))
            self.oobs.append(set(oob))
    
    def predict_single(self, x):
        predictions = []
        for tree, features in self.trees:
            predictions.append(tree.predict_single(x[features], tree.root))
        return mode(predictions)[0]

    def predict(self, X):
        return np.array([self.predict_single(x) for x in X])

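    def oob_score(self, X, y):
        score = 0
        evaluated = 0
        for i in range(X.shape[0]):
            # Collect votes only from trees that did not train on sample i
            predictions = []
            for j, (tree, features) in enumerate(self.trees):
                if i in self.oobs[j]:
                    predictions.append(tree.predict_single(X[i, features], tree.root))
            if not predictions:
                continue    # sample i was in every bootstrap sample: no OOB vote
            ans = mode(predictions)[0]

            evaluated += 1
            score += 1 if (ans == y[i]) else 0
        return score/evaluated if evaluated else float('nan')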

### Loading and Pre-Processing Data

In [3]:
iris = fetch_ucirepo(id=53) 

df = pd.concat([iris.data.features,iris.data.targets],axis=1).replace({'Iris-setosa':0,'Iris-versicolor':1,'Iris-virginica':2})
df

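| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |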


In [4]:
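dat = df.to_numpy()

# Shuffle the rows in place (unseeded, so the split and the scores below vary between runs)
np.random.shuffle(dat)

# First 100 shuffled rows for training, the remaining rows for testing
X = dat[:100,:-1]
y = dat[:100,-1].astype(int)
X_test = dat[100:,:-1]
y_test = dat[100:,-1].astype(int)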

### Using Random Forest Class to find OOB Score

In [5]:
rf = RF(tree_ct=40, max_depth=5)
rf.train(X, y)

oob = rf.oob_score(X, y)
print("OOB Score: "+str(oob*100)+" %")

OOB Score: 93.0 %


### Testing Final Accuracy of Model

In [6]:
predictions = rf.predict(X_test)
acc = np.sum(predictions==y_test)*100/y_test.shape[0]
print("Test Accuracy: ", acc,"%")

Test Accuracy:  92.0 %
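
As a rough sanity check on these numbers, the same experiment can be repeated with scikit-learn's `RandomForestClassifier`. This is a sketch under assumptions: it uses scikit-learn's bundled copy of Iris and its own 100/50 split, so the scores will not match the cells above exactly. scikit-learn also samples features at every split rather than once per tree, unlike the `RF` class in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# 100 training rows and 50 test rows, mirroring the split sizes used above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, train_size=100, random_state=0, stratify=y_all
)

sk_rf = RandomForestClassifier(
    n_estimators=40,        # same number of trees as tree_ct=40
    max_depth=5,
    max_features="sqrt",    # sqrt(n) candidate features per split
    oob_score=True,         # compute the OOB accuracy during fit
    random_state=0,
)
sk_rf.fit(X_tr, y_tr)

sk_oob = sk_rf.oob_score_
sk_test_acc = sk_rf.score(X_te, y_te)
print(f"sklearn OOB score: {sk_oob * 100:.1f} %")
print(f"sklearn test accuracy: {sk_test_acc * 100:.1f} %")
```

If the two implementations give broadly similar OOB and test accuracies, that suggests the custom class behaves sensibly; a large gap would be worth investigating.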
