# Random Forest Classification  
## From a Single Decision Tree to an Ensemble Model

### Context
In the previous notebook, we trained a **single Decision Tree** to predict whether a student would **pass or fail** based on weekly study hours.

In this notebook, we extend the *same problem* to demonstrate the power of **Random Forests**.

### Key idea
A Random Forest:
- Trains **many decision trees**
- Each tree sees a **slightly different view** of the data
- Final predictions are made by **majority voting**

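Because each tree makes different mistakes, combining them averages out much of the variance of any single tree, which reduces overfitting and usually improves generalization.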
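The three bullets above can be sketched by hand before reaching for `RandomForestClassifier`. This is only an illustration of the idea: the toy data, the number of trees, and the variable names are assumptions, and scikit-learn's own forest differs in details (for example, it averages class probabilities rather than counting hard votes).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: one feature, label is 1 when the feature is >= 10
X = rng.integers(1, 25, size=(60, 1))
y = (X[:, 0] >= 10).astype(int)

n_trees = 25  # odd, so a two-class vote cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement,
    # so each tree sees a slightly different view of the data
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_depth=2, random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Every tree votes on each new sample; the majority class wins
X_new = np.array([[3], [12], [20]])
votes = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print(majority)
```

Only row resampling is shown here. With a single feature there is nothing to subsample at each split, so that part of the algorithm is left out.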

### Learning objectives
- Understand why a single decision tree can be unstable
- See how Random Forests reduce variance
- Apply Random Forests to the *same dataset*
- Interpret results in comparison with a single tree
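One way to see the instability mentioned in the first objective is to refit a one-split tree on resampled versions of a small noisy dataset and watch the learned threshold move. The sketch below uses hypothetical toy data (40 points, 4 flipped labels) and illustrative names; it is not the notebook's dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Hypothetical toy data: pass when hours >= 10, then flip 4 labels as noise
hours = rng.integers(1, 25, size=40)
labels = (hours >= 10).astype(int)
flip = rng.choice(40, size=4, replace=False)
labels[flip] = 1 - labels[flip]
X = hours.reshape(-1, 1)

thresholds = []
for _ in range(20):
    # Resample the rows with replacement and refit a depth-1 tree
    idx = rng.integers(0, len(X), size=len(X))
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X[idx], labels[idx])
    # Threshold used by the root split of the fitted tree
    thresholds.append(stump.tree_.threshold[0])

# If the tree were stable, these values would all be the same
print(np.round(sorted(thresholds), 1))
```

A spread of thresholds across resamples is the variance that a forest is meant to average away.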


In [1]:
# Core numerical and data handling libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# Machine learning models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Evaluation tools
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Reproducibility
np.random.seed(42)


## 1) Recreate the same dataset (40 students)

To make a **fair comparison**, we use the same data-generating process
as in the Decision Tree example.


In [2]:
# Generate study hours
study_hours = np.random.randint(1, 25, size=40)

# Base rule with noise
passed = (study_hours >= 10).astype(int)
noise_indices = np.random.choice(range(40), size=4, replace=False)
passed[noise_indices] = 1 - passed[noise_indices]

students_df = pd.DataFrame({
    "study_hours": study_hours,
    "passed": passed
})

students_df.head(10)


| | study_hours | passed |
|---|---|---|
| 0 | 7 | 1 |
| 1 | 20 | 1 |
| 2 | 15 | 1 |
| 3 | 11 | 1 |
| 4 | 8 | 0 |
| 5 | 21 | 1 |
| 6 | 7 | 0 |
| 7 | 19 | 1 |
| 8 | 23 | 1 |
| 9 | 11 | 1 |


## 2) Train-test split

We keep the split logic identical so that performance differences
come from the **model**, not the data handling.


In [3]:
X = students_df[["study_hours"]]
y = students_df["passed"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)


## 3) Baseline: single Decision Tree (for comparison)

This shallow tree mirrors the previous notebook.


In [4]:
dt_model = DecisionTreeClassifier(
    max_depth=2,
    random_state=42
)

dt_model.fit(X_train, y_train)

dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

dt_accuracy


0.9

## 4) Train a Random Forest classifier

Key differences from a single tree:
- Multiple trees (`n_estimators`)
- Bootstrap sampling
- Random feature selection at each split
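These three differences map onto constructor parameters of `RandomForestClassifier`. The sketch below spells them out on hypothetical toy data; `oob_score=True` is an extra not used in this notebook, included only to show that rows left out of a tree's bootstrap sample can serve as a built-in validation set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: one feature, label is 1 when the feature is >= 10
X = rng.integers(1, 25, size=(200, 1))
y = (X[:, 0] >= 10).astype(int)

forest = RandomForestClassifier(
    n_estimators=50,       # multiple trees
    bootstrap=True,        # each tree is fit on a resample of the rows
    max_features="sqrt",   # features considered per split (only 1 exists here)
    oob_score=True,        # score each row using trees that did not train on it
    random_state=0,
)
forest.fit(X, y)

print(len(forest.estimators_))  # the individual fitted trees
print(forest.oob_score_)        # out-of-bag accuracy estimate
```

With a single feature, as in this notebook, `max_features` has nothing to choose between, so the trees differ only through bootstrap sampling.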


In [5]:
rf_model = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=3,           # keep trees shallow for interpretability
    random_state=42
)

rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)

rf_accuracy


0.8

## 5) Compare performance


In [6]:
comparison_df = pd.DataFrame({
    "Model": ["Decision Tree", "Random Forest"],
    "Accuracy": [dt_accuracy, rf_accuracy]
})

comparison_df


| | Model | Accuracy |
|---|---|---|
| 0 | Decision Tree | 0.9 |
| 1 | Random Forest | 0.8 |


### Interpretation
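- In this run the single tree scored 0.9 and the Random Forest 0.8. With only 10 test
  students, that gap is a single prediction, so it says little about which model is better.
- The comparison is also not strictly like for like: the forest's trees use `max_depth=3`,
  the baseline tree `max_depth=2`.
- Random Forests typically achieve **higher or more stable accuracy** than a single tree
  when there are many features. Here there is only one, so random feature selection has
  nothing to vary and the trees differ only through bootstrap sampling.
- Even when accuracy is similar, Random Forest predictions tend to be
  **less sensitive to noise** in the training sample.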
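Because a 10-student test set is so small, one way to check the "more stable" claim is to score both models over many different splits. The sketch below rebuilds the dataset with the recipe from cell [2] and uses repeated stratified cross-validation; the fold and repeat counts are arbitrary choices, not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Rebuild the 40-student dataset with the same recipe as cell [2]
np.random.seed(42)
study_hours = np.random.randint(1, 25, size=40)
passed = (study_hours >= 10).astype(int)
noise_indices = np.random.choice(range(40), size=4, replace=False)
passed[noise_indices] = 1 - passed[noise_indices]
X = study_hours.reshape(-1, 1)
y = passed

# 4 folds x 5 repeats = 20 train/test splits per model (assumed settings)
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=5, random_state=0)
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=2, random_state=42),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=3, random_state=42
    ),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The standard deviation across splits is the number to look at for stability; the mean alone hides how much a single split can swing.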


## 6) Detailed evaluation of Random Forest


In [7]:
print("Confusion Matrix:")
print(confusion_matrix(y_test, rf_pred))

print("\nClassification Report:")
print(classification_report(y_test, rf_pred))


Confusion Matrix:
[[4 0]
 [2 4]]

Classification Report:
              precision    recall  f1-score   support

           0       0.67      1.00      0.80         4
           1       1.00      0.67      0.80         6

    accuracy                           0.80        10
   macro avg       0.83      0.83      0.80        10
weighted avg       0.87      0.80      0.80        10



### Interpretation
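- No false positives: every student predicted to pass did pass (precision 1.00 for class 1)
- Two false negatives: 2 of the 6 students who passed were predicted to fail (recall 0.67 for class 1)
- All 4 failing students were identified (recall 1.00 for class 0)
- With 10 test students, each error moves accuracy by 0.1, so these figures are indicative only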
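The figures in the classification report can be recomputed directly from the confusion matrix printed above. The sketch below hard-codes that matrix; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above:
# rows are the true class (0 = fail, 1 = pass), columns the predicted class
cm = np.array([[4, 0],
               [2, 4]])
tn, fp, fn, tp = cm.ravel()

precision_pass = tp / (tp + fp)   # 4 / 4
recall_pass = tp / (tp + fn)      # 4 / 6
precision_fail = tn / (tn + fn)   # 4 / 6
recall_fail = tn / (tn + fp)      # 4 / 4
accuracy = (tp + tn) / cm.sum()   # 8 / 10

print(f"pass: precision={precision_pass:.2f}, recall={recall_pass:.2f}")
print(f"fail: precision={precision_fail:.2f}, recall={recall_fail:.2f}")
print(f"accuracy={accuracy:.2f}")
```

These match the report: 1.00 / 0.67 for class 1, 0.67 / 1.00 for class 0, and 0.80 overall.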


## 7) Apply Random Forest to a new class


In [8]:
new_class = pd.DataFrame({
    "study_hours": [4, 7, 9, 11, 14, 18]
})

# Select the feature column explicitly. Once predicted_pass is added,
# passing the whole frame would give the model a column it never saw during fit.
features = new_class[["study_hours"]]

new_class["predicted_pass"] = rf_model.predict(features)
new_class["pass_probability"] = rf_model.predict_proba(features)[:, 1]

new_class


### Interpretation
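- `pass_probability` is the average of the per-tree class probabilities, so it reflects
  consensus across many trees: values near 0 or 1 mean the trees largely agree
- Expect the least certain values near the 10-hour boundary, where trees fit on
  different bootstrap samples place their thresholds differently
- Averaging over many trees is what makes the forest less sensitive to any single
  training sample than one tree would be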
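The "consensus across many trees" reading can be checked directly: in scikit-learn, a forest's `predict_proba` is the mean of its trees' `predict_proba` outputs. The sketch below fits a fresh forest on hypothetical toy data (not the notebook's training split) and compares the two.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: pass when hours >= 10, with 6 flipped labels as noise
hours = rng.integers(1, 25, size=60)
labels = (hours >= 10).astype(int)
flip = rng.choice(60, size=6, replace=False)
labels[flip] = 1 - labels[flip]
X = hours.reshape(-1, 1)

forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
forest.fit(X, labels)

X_new = np.array([[4], [7], [9], [11], [14], [18]])
forest_proba = forest.predict_proba(X_new)[:, 1]

# Ask every tree separately, then average the pass probabilities
per_tree = np.stack(
    [tree.predict_proba(X_new)[:, 1] for tree in forest.estimators_]
)
manual_proba = per_tree.mean(axis=0)

print(np.round(forest_proba, 3))
print(np.allclose(forest_proba, manual_proba))
```

Note that this is probability averaging rather than a count of hard votes, which is how scikit-learn's implementation differs from the textbook majority-vote description.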


## Final teaching summary

- A **Decision Tree** is easy to interpret but unstable
- A **Random Forest** trades a little interpretability for:
  - better generalization
  - reduced overfitting
  - higher robustness to noise
- In real systems, Random Forests are often a strong baseline
  before moving to more complex models

This progression mirrors how practitioners scale from
simple models to ensemble methods.
