# **Random Forest Algorithm**

## **1. Introduction**

A **Random Forest** is an ensemble learning algorithm that combines many decision trees to improve prediction accuracy and reduce overfitting. It is widely used for both **classification** and **regression** tasks.
<br><br>

![Random Forest.png](../images/random_forest.png)

- **Key Idea**: Aggregate predictions from multiple trees to make a final prediction.
- **Why Random?**
  1. **Random Subset of Features**: At each split, only a random subset of features is considered.
  2. **Random Sampling**: Data is sampled with replacement (bootstrapping) to train individual trees.

##
---

## **2. Key Mathematical Concepts**

### **2.1 Ensemble Learning**
Random Forest uses **Bagging** (Bootstrap Aggregating): each tree is trained on its own bootstrap sample of the data, and the trees' predictions are then combined.

#### Bagging Formula:
$$
\hat{y} = \frac{1}{T} \sum_{t=1}^{T} f_t(x)
$$
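- $T$: Number of trees.
- $f_t(x)$: Prediction from tree $t$.
- The average applies to regression; for classification it is replaced by a majority vote (see Section 6.2).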
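
The bagging formula can be sketched by hand with plain decision trees. This is a minimal illustration under stated assumptions: the synthetic dataset, the choice of `T = 25`, and the variable names are all hypothetical, and the sketch shows bagging only (no per-split feature randomness).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rng = np.random.default_rng(0)
T = 25  # number of trees
trees = []
for _ in range(T):
    # Bootstrap: draw N row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Bagging formula: y_hat = (1/T) * sum_t f_t(x)
per_tree = np.stack([tree.predict(X[:5]) for tree in trees])  # shape (T, 5)
y_hat = per_tree.mean(axis=0)
print(y_hat)
```

Each row of `per_tree` holds one tree's predictions $f_t(x)$, and averaging over the rows gives $\hat{y}$.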

### **2.2 Feature Randomness**
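For a dataset with $F$ features, each split considers only a random subset of the features. Common heuristics are $\sqrt{F}$ features for classification and $F/3$ for regression. Restricting the candidates at each node decorrelates the trees, which is what makes aggregating them effective.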
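
As a rough illustration, the subset sizes can be computed directly and passed to scikit-learn through the `max_features` parameter. The variable names below are hypothetical, and the library's default for `max_features` has changed between versions, so it is safer to set it explicitly than to rely on the default.

```python
import math
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
F = X.shape[1]  # Iris has 4 features

# Heuristic subset sizes from the text (at least one feature)
k_classification = max(1, int(math.sqrt(F)))  # sqrt(F)
k_regression = max(1, F // 3)                 # F/3, rounded down
print(F, k_classification, k_regression)

# In scikit-learn the per-split subset size is controlled by max_features
clf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
clf.fit(X, y)

# Each fitted tree records the resolved number of features tried per split
print(clf.estimators_[0].max_features_)
```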

##
---

## **3. Comparison: Decision Tree vs Random Forest**


| **Aspect**              | **Decision Tree**                                      | **Random Forest**                                   |
|--------------------------|-------------------------------------------------------|---------------------------------------------------|
| **Overfitting**          | Prone to overfitting.                                 | Reduces overfitting by averaging multiple trees.  |
| **Accuracy**             | May have lower accuracy due to high variance.         | Generally higher accuracy.                        |
| **Interpretability**     | Easy to interpret and visualize.                      | Harder to interpret as it aggregates multiple trees. |
| **Training Speed**       | Faster to train.                                      | Slower due to training multiple trees.            |
| **Feature Importance**   | Provides direct feature importance.                  | Provides averaged feature importance.             |
| **Robustness**           | Sensitive to noisy data.                              | More robust to noise and outliers.                |


##
---

## **4. Implementation**

### **Classification Example**

Dataset: Iris Classification

This example predicts the type of iris flower using a Random Forest Classifier.

```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load Iris Dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into Training and Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the Model
clf.fit(X_train, y_train)

# Predict and Evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Feature Importance
feature_importance = pd.Series(clf.feature_importances_, index=iris.feature_names)
print("Feature Importance:")
print(feature_importance.sort_values(ascending=False))
```

[Implementation of the Random Forest Algorithm](10%20-%20Implement%20Random%20Forest%20Algorithm.ipynb)
###
---

### **Regression Example**

Dataset: California Housing Prices

This example predicts house prices using a Random Forest Regressor.

```python
# Import libraries
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


# Load California Housing Dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split into Training and Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest Regressor
regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the Model
regressor.fit(X_train, y_train)

# Predict and Evaluate
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
```

[Implementation of the Random Forest Algorithm](10%20-%20Implement%20Random%20Forest%20Algorithm.ipynb)
###
---

## **5. Visualization**

The feature importances computed in the classification example (`feature_importance`) can be plotted as a horizontal bar chart:

```python
import matplotlib.pyplot as plt

# Visualize Feature Importance
feature_importance.sort_values().plot(kind='barh', title="Feature Importance", figsize=(10, 6))
plt.show()
```

##
---

## **6. Mathematical Formulations**

### **6.1 Random Sampling**
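For a dataset with $N$ samples, Random Forest builds a separate training set for each tree by drawing $N$ samples with replacement (bootstrapping). Each bootstrap sample therefore contains duplicates and, on average, omits roughly a third of the original samples.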
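
The size of the omitted portion follows from the sampling scheme: a given sample is missed by all $N$ draws with probability $(1 - 1/N)^N$, which approaches $e^{-1} \approx 0.368$ as $N$ grows. The following sketch checks this numerically with NumPy; the value of `N` and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000

# One bootstrap sample: N indices drawn with replacement from 0..N-1
bootstrap_idx = rng.integers(0, N, size=N)

# Fraction of distinct original samples that made it into the bootstrap sample
unique_fraction = np.unique(bootstrap_idx).size / N
expected = 1 - (1 - 1 / N) ** N  # approaches 1 - 1/e ≈ 0.632 for large N

# Indices never drawn form the out-of-bag set for this tree
oob_idx = np.setdiff1d(np.arange(N), bootstrap_idx)

print(f"unique: {unique_fraction:.3f}, expected: {expected:.3f}, "
      f"out-of-bag: {oob_idx.size / N:.3f}")
```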

### **6.2 Aggregation of Predictions**

#### Classification:

$$\hat{y} = \operatorname{mode}(y_1, y_2, \ldots, y_T)$$

  - Takes the majority vote of *T* trees.

#### Regression:

$$\hat{y} = \frac{1}{T} \sum_{t=1}^{T} y_t$$

  - Takes the average of predictions from *T* trees.
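
Both aggregation rules can be reproduced from the individual trees of a fitted forest. This is a sketch under assumptions: the bundled Diabetes and Iris datasets and all variable names are illustrative choices. Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so the hand-computed majority vote can occasionally differ from `clf.predict`.

```python
import numpy as np
from sklearn.datasets import load_diabetes, load_iris
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# --- Regression: average the per-tree predictions ---
X_r, y_r = load_diabetes(return_X_y=True)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_r, y_r)
tree_preds = np.stack([tree.predict(X_r[:5]) for tree in reg.estimators_])  # (T, 5)
manual_mean = tree_preds.mean(axis=0)
print(np.allclose(manual_mean, reg.predict(X_r[:5])))

# --- Classification: majority vote over the per-tree class predictions ---
X_c, y_c = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_c, y_c)
sample = X_c[::30]  # five samples spread across the three classes
# Sub-trees return class indices as floats; cast to int so they can be counted
votes = np.stack([tree.predict(sample) for tree in clf.estimators_]).astype(int)
majority = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
print(majority)
```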

### **6.3 Out-of-Bag (OOB) Error**

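Each tree can be evaluated on the samples that were left out of its bootstrap sample (the out-of-bag samples). This gives a performance estimate without a separate validation set:

$$\text{OOB Error} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{y}_i^{OOB})$$

  - $L$: Loss function (e.g., 0-1 misclassification loss or squared error).
  - $\hat{y}_i^{OOB}$: Prediction for sample $i$, aggregated over only the trees that did not see sample $i$ during training.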
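
In scikit-learn, an out-of-bag estimate is available by passing `oob_score=True`. The sketch below uses the Iris dataset and 200 trees as arbitrary illustrative choices; for classifiers the reported `oob_score_` is an accuracy, so the error is one minus that value.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True scores each sample using only the trees whose
# bootstrap sample did not contain it (requires bootstrap=True)
clf = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
clf.fit(X, y)

oob_accuracy = clf.oob_score_   # accuracy on out-of-bag samples
oob_error = 1.0 - oob_accuracy  # corresponding misclassification rate
print(f"OOB accuracy: {oob_accuracy:.3f}, OOB error: {oob_error:.3f}")
```

With enough trees, almost every sample is out-of-bag for at least some of them, so no data needs to be held out for this estimate.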

##
---

## **7. Advantages and Disadvantages**

**Advantages**

- Reduces overfitting compared to Decision Trees.

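- Handles large, high-dimensional datasets well and typically achieves higher accuracy than a single tree.

- Relatively robust to noise and outliers; some implementations also handle missing values.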

- Provides feature importance rankings.

**Disadvantages**

- Requires more computational resources than a single Decision Tree.

- Harder to interpret due to the ensemble nature.

- May perform poorly on very sparse, high-dimensional data.

![Advantages & Disadvantages.png](../images/random_forest_advantages.png)

##
---

## **8. Comparison Example**

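```python
# Compare a single Decision Tree with a Random Forest on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reload Iris so the split does not depend on variables from earlier examples
# (after the regression example, X and y hold the housing data)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Decision Tree Classifier
dt_clf = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_clf.fit(X_train, y_train)
dt_pred = dt_clf.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)

print(f"Decision Tree Accuracy: {dt_accuracy:.2f}")
print(f"Random Forest Accuracy: {rf_accuracy:.2f}")
```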

##
---

## **9. Conclusion**

**Key Takeaways**

- Random Forest is generally more robust and accurate than a single Decision Tree because it aggregates many decorrelated trees.

- It reduces overfitting and improves generalization by combining predictions from multiple trees.

- A single Decision Tree is faster to train and easier to interpret, but typically less accurate.

##
---

## **10. Extensions**

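### Hyperparameter Tuning

Use `GridSearchCV` or `RandomizedSearchCV` to optimize hyperparameters such as:

- Number of estimators (`n_estimators`).

- Maximum depth of trees (`max_depth`).

- Minimum number of samples required to split a node (`min_samples_split`).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Reload Iris so the example runs on its own
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Try every combination of the listed values with 3-fold cross-validation
params = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=3)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
```

### Ensemble Variants

- Gradient Boosting: Builds trees sequentially, with each new tree fitted to the residual errors of the ensemble so far.

- XGBoost: An optimized, regularized implementation of gradient boosting.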
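
For a quick feel of how Gradient Boosting, listed under Ensemble Variants, compares with Random Forest, both can be cross-validated on the same data. This is only a sketch: the Iris dataset, 5-fold cross-validation, and the hyperparameter values are illustrative assumptions, and results on such a small dataset say little about relative performance in general.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    # Bagging: trees trained independently, predictions aggregated
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Boosting: trees added sequentially to correct earlier errors
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, random_state=42
    ),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```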