# Random Forest Classifier

## 🌲 What is Random Forest?

**Random Forest** is a powerful ensemble method built on **Bagging (Bootstrap Aggregating)** that:

1. Trains multiple decision trees (forest of trees)
2. On different random subsets of the training data (sampled with replacement)
3. With random feature selection at each split (adds extra randomness)
4. Combines their predictions (by voting for classification or averaging for regression)

### 💬 In short:

> "Train many decision trees on slightly different data with random features → vote or average → robust, accurate model."

---

## 🔄 How Random Forest Works (Example)

Suppose we have a dataset with 100 samples and 4 features.

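1. **Random Forest** draws a bootstrap sample for each tree: rows are picked at random with replacement. By default scikit-learn draws as many rows as the training set has (100 here), so roughly 63 distinct rows appear in each sample; `max_samples` can lower the number of draws, e.g. to 80.
2. It trains one decision tree on that sample, but at each split considers only a random subset of features (e.g., 2 out of 4).
3. It repeats this for 10, 20, 100, or more trees.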

**Final prediction:**
- **Classification**: Majority vote across all trees
- **Regression**: Average prediction across all trees
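
The steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements its forest: the names `n_trees`, `trees`, and `votes` and the choice of 25 trees are assumptions made for the example, and the Iris data is used only because it is small.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)
n_trees = 25  # hypothetical choice for this sketch
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample, i.e. row indices drawn with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: max_features=2 makes each split look at 2 random features out of 4
    tree = DecisionTreeClassifier(
        max_features=2, random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Final prediction: majority vote over the trees for every test row
all_preds = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
votes = np.apply_along_axis(
    lambda col: np.bincount(col, minlength=3).argmax(), 0, all_preds
)
manual_accuracy = float((votes == y_test).mean())
print(f"Hand-rolled forest accuracy: {manual_accuracy:.2f}")
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from this sketch.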

---

## 📘 How It Improves Accuracy

- **Reduces variance** (overfitting) by averaging many trees instead of trusting one
- **Leaves bias roughly unchanged**: the averaged prediction has about the same bias as a single tree
- **Works well with unstable models** like decision trees, whose predictions change a lot when the training data changes a little
- Each tree sees slightly different data and different features → different errors → averaging/voting cancels many of them out

### Mathematically:

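If the $M$ trees made uncorrelated errors, each with the same variance $\text{Var}(f)$, then:

$$\text{Var}(\bar{f}) = \frac{1}{M^2} \sum_{i=1}^{M} \text{Var}(f_i) \approx \frac{1}{M} \text{Var}(f)$$

→ The more estimators ($M$), the lower the variance. In practice the trees are trained on overlapping data, so their errors are correlated and the reduction is smaller than $1/M$. Random feature selection is there to lower that correlation.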
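
A quick numerical check of the averaging argument. It relies on a simplifying assumption that is not in the text above: every tree's prediction has the same variance $\sigma^2$ and every pair of trees has the same correlation $\rho$. Under that assumption the variance of the average is $\rho\sigma^2 + \frac{1-\rho}{M}\sigma^2$, which reduces to $\sigma^2/M$ only when $\rho = 0$. The values of `M`, `rho`, and `sigma2` below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M, rho, sigma2 = 50, 0.3, 1.0  # hypothetical: 50 trees, pairwise correlation 0.3
n_trials = 50_000

# Simulate correlated "tree predictions": a shared component plus an
# independent one gives variance sigma2 and pairwise correlation rho.
shared = rng.standard_normal((n_trials, 1))
own = rng.standard_normal((n_trials, M))
preds = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * own)

empirical = preds.mean(axis=1).var()                # variance of the ensemble average
theory = rho * sigma2 + (1 - rho) * sigma2 / M      # correlated-trees formula
independent = sigma2 / M                            # what 1/M alone would predict

print(f"empirical variance of the average: {empirical:.4f}")
print(f"correlated-trees formula:          {theory:.4f}")
print(f"independent-trees formula (1/M):   {independent:.4f}")
```

With correlated trees, adding more estimators cannot push the variance below $\rho\sigma^2$, which is why decorrelating the trees matters as much as adding more of them.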

---

## 🔧 Key Parameters

| Parameter            | Description                                          |
| -------------------- | ---------------------------------------------------- |
| `n_estimators`       | Number of trees in the forest (default: 100)         |
| `max_depth`          | Maximum depth of each tree (None = grow fully)       |
| `max_features`       | Features considered at each split ('sqrt', 'log2')   |
| `min_samples_split`  | Minimum samples required to split a node             |
| `min_samples_leaf`   | Minimum samples required in a leaf node              |
| `bootstrap`          | Sampling with replacement (True by default)          |
| `oob_score`          | "Out-of-bag" score (built-in validation)             |
| `n_jobs`             | Use multiple CPU cores (-1 = use all cores)          |
| `random_state`       | Seed for reproducibility                             |
| `criterion`          | Split quality measure ('gini', 'entropy')            |
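
To see what these parameters are set to on a given model, `get_params()` can be used. The sketch below overrides two of them (the values 200 and 2 are arbitrary choices for illustration) and prints the rest at whatever defaults the installed scikit-learn version uses.

```python
from sklearn.ensemble import RandomForestClassifier

# Override two parameters; everything else keeps its default value
rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=2, random_state=0)
params = rf.get_params()

table_params = (
    "n_estimators", "max_depth", "max_features", "min_samples_split",
    "min_samples_leaf", "bootstrap", "oob_score", "n_jobs",
    "random_state", "criterion",
)
for name in table_params:
    print(f"{name:>18} = {params[name]!r}")
```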

---

## 🆚 Random Forest vs Bagging

| Aspect              | Bagging                  | Random Forest               |
| ------------------- | ------------------------ | --------------------------- |
| Base Model          | Any model                | Always Decision Trees       |
| Feature Selection   | Uses all features        | Random subset of features   |
| Diversity           | From bootstrap samples   | Bootstrap + random features |
| Typical Performance | Good                     | Better (more decorrelation) |
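
The comparison in the table can be tried directly. This is a rough sketch, assuming 50 trees and 5-fold cross-validation on Iris (both arbitrary choices); the variable names are illustrative. On a dataset this small and easy the two methods often score about the same, so the output should not be read as evidence for the "Typical Performance" row.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: bootstrap samples, but every split may use all features
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Random Forest: bootstrap samples plus a random feature subset at each split
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)

bagging_scores = cross_val_score(bagging, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print(f"Bagging        mean accuracy: {bagging_scores.mean():.3f}")
print(f"Random Forest  mean accuracy: {forest_scores.mean():.3f}")
```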

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

## Step 1: Load Dataset and Split

We'll use the Iris dataset to demonstrate Random Forest classification.

In [None]:
data = load_iris()
x, y = data.data, data.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
print(x)
print(y)

## Step 2: Create, Train and Evaluate Random Forest

We create a `RandomForestClassifier` with:
- **100 trees**: More trees → better performance (up to a point)
- **max_depth=None**: Trees grow fully (each may overfit individually, but averaging across the ensemble compensates)
- **oob_score=True**: Scores each sample using only the trees that did not see it during training, giving a validation estimate at no extra cost
- **n_jobs=-1**: Uses all CPU cores for parallel training

In [None]:
# Create model
rf_model = RandomForestClassifier(
    n_estimators=100,        # number of trees
    max_depth=None,          # let it grow fully
    random_state=42,
    oob_score=True,          # use out-of-bag validation
    n_jobs=-1                # use all CPU cores
)

# Train model
rf_model.fit(x_train, y_train)

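# Predict on the held-out test set
y_pred = rf_model.predict(x_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Classifier Accuracy: {accuracy:.2f}")
# Out-of-bag accuracy, available because oob_score=True
print(f"Out-of-bag accuracy estimate: {rf_model.oob_score_:.2f}")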
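
To illustrate the "more trees, up to a point" claim and the out-of-bag estimate together, the sketch below refits the forest with several tree counts and reads `oob_score_` next to the test accuracy. The tree counts are arbitrary choices, and `results` is a name made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for n_trees in (25, 50, 100, 200):
    model = RandomForestClassifier(
        n_estimators=n_trees, oob_score=True, random_state=42
    )
    model.fit(X_train, y_train)
    # oob_score_: accuracy on training rows, each scored only by trees
    # whose bootstrap sample did not contain that row
    results[n_trees] = (model.oob_score_, model.score(X_test, y_test))
    print(f"{n_trees:>3} trees  OOB={results[n_trees][0]:.3f}  "
          f"test={results[n_trees][1]:.3f}")
```

On Iris the scores tend to level off quickly, so extra trees mostly cost training time here.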

## Step 3: Feature Importance Analysis

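Random Forest provides **feature importance scores** showing which features contribute most to predictions. The scores are impurity-based: for each feature, the impurity decrease it produces at its splits is totalled within each tree, averaged across all trees, and normalized so the scores sum to 1.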

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

importances = rf_model.feature_importances_
features = data.feature_names  # reuse the dataset loaded in Step 1

# Display importance
df = pd.DataFrame({'Feature': features, 'Importance': importances})
df = df.sort_values('Importance', ascending=False)
print(df)

# Plot
plt.barh(df['Feature'], df['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance in Random Forest')
plt.gca().invert_yaxis()
plt.show()
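
To make the "across all trees" part concrete, the forest-level scores can be rebuilt from the individual trees. This is a sketch assuming a freshly fitted forest on the full Iris data; `per_tree`, `manual`, and `spread` are names introduced for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)

# One row of impurity-based importances per tree: shape (n_trees, n_features)
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])

# Average over trees, then normalize so the scores sum to 1
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()
matches = np.allclose(manual, rf.feature_importances_)
print("Matches rf.feature_importances_:", matches)

# The spread across trees hints at how stable each score is
spread = per_tree.std(axis=0)
for name, mean, std in zip(data.feature_names, manual, spread):
    print(f"{name:<20} {mean:.3f} (std across trees: {std:.3f})")
```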
