## What is Random Forest Regression?

**Random Forest Regression** makes predictions by combining the results of several decision trees.  
Each tree is trained on a random sample of the rows drawn with replacement (a *bootstrap sample*), and the final answer is the **average** of all tree predictions.

## Example Dataset

Suppose you have this data:

| X | y |
|---|---|
| 1 | 3 |
| 2 | 5 |
| 3 | 7 |
| 4 | 9 |

## Building Two Simple Trees (Bootstrapping)

Suppose you build **two simple trees**, each from a bootstrap sample of the four rows (drawn with replacement, so a row can appear twice):

- **Tree 1** is built from: (1,3), (2,5), (4,9), (2,5)  
- **Tree 2** is built from: (2,5), (3,7), (4,9), (3,7)  
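The sampling step can be sketched in a few lines of NumPy. This is only an illustration: the helper name `bootstrap_sample` and the seed are assumptions, and the rows it draws will generally differ from the two hand-picked samples listed here.

```python
import numpy as np

# The four (X, y) rows from the example dataset.
data = np.array([[1, 3], [2, 5], [3, 7], [4, 9]])

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    idx = rng.integers(0, len(rows), size=len(rows))
    return rows[idx], idx

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the illustration
sample, idx = bootstrap_sample(data, rng)
oob_rows = np.setdiff1d(np.arange(len(data)), idx)  # row indices never drawn

print(sample)    # same size as the data; duplicates are allowed
print(oob_rows)  # indices left out of this sample (may be empty)
```

Because rows are drawn with replacement, some rows repeat and others are left out entirely, which is exactly what happens in the two samples used for Tree 1 and Tree 2.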

## Predict for X = 3

### Tree 1's Prediction
Assume Tree 1 makes a single split at X = 2.5.  
If X < 2.5: average(3, 5, 5) = 4.33  
If X ≥ 2.5: predict 9 (the only sample on this side is (4, 9))

For X = 3, **Tree 1 predicts 9**

### Tree 2's Prediction
Assume Tree 2 makes a single split at X = 3.5.  
If X < 3.5: average(5, 7, 7) = 6.33  
If X ≥ 3.5: predict 9 (the only sample on this side is (4, 9))

For X = 3, **Tree 2 predicts 6.33**
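The leaf values for both trees can be derived from the samples rather than typed in by hand. The sketch below assumes a one-split tree with a fixed threshold; `make_stump` is a hypothetical helper, and it assumes at least one training pair falls on each side of the threshold.

```python
def make_stump(sample, threshold):
    """Return a depth-1 regression tree for a fixed split threshold.

    Each leaf predicts the mean y of the training pairs that fall into it.
    Assumes both sides of the split are non-empty.
    """
    left = [y for x, y in sample if x < threshold]
    right = [y for x, y in sample if x >= threshold]
    left_mean = sum(left) / len(left)
    right_mean = sum(right) / len(right)
    return lambda x: left_mean if x < threshold else right_mean

tree1 = make_stump([(1, 3), (2, 5), (4, 9), (2, 5)], threshold=2.5)
tree2 = make_stump([(2, 5), (3, 7), (4, 9), (3, 7)], threshold=3.5)

print(tree1(3))            # 9.0
print(round(tree2(3), 2))  # 6.33
```

A real tree learner would also search for the threshold instead of taking it as given; that search is left out here to keep the example focused on the leaf averages.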

## Final Random Forest Prediction

Final prediction = (9 + 6.33)/2 = 7.67
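Written as a formula, the averaging step looks like this. The symbols are notation introduced here for illustration: $T$ is the number of trees and $\hat{y}_t(x)$ is the prediction of tree $t$.

```latex
\hat{y}(x) = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t(x),
\qquad
\hat{y}(3) = \frac{1}{2}\left(9 + \frac{19}{3}\right) = \frac{23}{3} \approx 7.67
```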

## Summary

Random Forest combines multiple decision trees, each trained on a different bootstrap sample of the data. The final prediction is the average of the individual tree predictions, which typically reduces variance and gives more stable results than a single decision tree.


# Random Forest Regression

Random Forest Regression is an ensemble learning method that combines multiple decision trees to make accurate predictions.

Key characteristics:
- Uses bagging (bootstrap aggregating): rows are sampled at random with replacement
- Each tree is trained on its own bootstrap sample of the data
- Trees are grown independently; at each split, a tree may consider only a random subset of the features
- Final prediction is the average of predictions from all trees
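The characteristics above can be sketched from scratch with scikit-learn's `DecisionTreeRegressor` as the base learner. This is a simplified illustration, not how `RandomForestRegressor` is implemented internally; the helper names `fit_bagged_trees` and `predict_bagged`, the tree count, and the seed are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees=10, seed=0):
    """Fit n_trees regression trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        # max_features=1.0 considers every feature at each split; a smaller
        # fraction would add the per-split feature randomness.
        tree = DecisionTreeRegressor(
            max_features=1.0, random_state=int(rng.integers(1_000_000))
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    """Average the predictions of all trees."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)

X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])
trees = fit_bagged_trees(X, y)
print(predict_bagged(trees, np.array([[3]])))
```

With a single feature, feature subsampling has no effect, so this toy run exercises only the bootstrap and averaging steps.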

Advantages:
- Usually more robust and accurate than a single decision tree
- Reduces overfitting through averaging
- Can handle high-dimensional data
- Provides feature importance rankings
- Handles numerical features directly; in scikit-learn, categorical features must be encoded as numbers first

In [15]:
data = [
    {'X': 1, 'y': 3},
    {'X': 2, 'y': 5},
    {'X': 3, 'y': 7},
    {'X': 4, 'y': 9}
]

In [16]:
# Tree 1 is built from: (1,3), (2,5), (4,9), (2,5)
def tree1_predict(x):
    # If X < 2.5, use mean of y for X = 1, 2, 2: 3, 5, 5
    if x < 2.5:
        return (3 + 5 + 5) / 3
    # Else, use y from X = 4: 9
    else:
        return 9

In [17]:
# Tree 2 is built from: (2,5), (3,7), (4,9), (3,7)
def tree2_predict(x):
    # If X < 3.5, use mean of y for X = 2, 3, 3: 5, 7, 7
    if x < 3.5:
        return (5 + 7 + 7) / 3
    # Else, use y from X = 4: 9
    else:
        return 9

In [18]:
# Predict for X = 3
x_query = 3
y1 = tree1_predict(x_query)
y2 = tree2_predict(x_query)

In [19]:
# Random Forest prediction = average of tree predictions
rf_prediction = (y1 + y2) / 2

In [20]:
rf_prediction

7.666666666666666

# Random Forest Regression with scikit-learn

In [21]:
from sklearn.ensemble import RandomForestRegressor
import numpy as np

In [22]:
# Dataset
X = np.array([[1], [2], [3], [4]])  # 2D array for sklearn
y = np.array([3, 5, 7, 9])

In [23]:
# Create Random Forest Regressor with 2 trees
rf = RandomForestRegressor(n_estimators=2, random_state=42, bootstrap=True)
rf.fit(X, y)


| Parameter | Value |
|---|---|
| n_estimators | 2 |
| criterion | 'squared_error' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 1.0 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


In [24]:
# Predict for X = 3
prediction = rf.predict([[3]])

In [25]:
prediction

array([7.])

In [26]:
# Optionally, collect each individual tree's prediction
tree_preds = [tree.predict([[3]])[0] for tree in rf.estimators_]

In [27]:
tree_preds

[np.float64(7.0), np.float64(7.0)]
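The library result need not match the hand-built 7.67: scikit-learn draws its own bootstrap samples and, with `max_depth=None`, grows each tree fully instead of stopping at one split. One way to see what each fitted tree learned is `sklearn.tree.export_text`, sketched below with the same data and settings as the cells above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])
rf = RandomForestRegressor(n_estimators=2, random_state=42, bootstrap=True).fit(X, y)

# One text report per fitted tree: split thresholds and leaf values.
reports = [export_text(tree, feature_names=["X"]) for tree in rf.estimators_]
for i, report in enumerate(reports, start=1):
    print(f"Tree {i}")
    print(report)
```

Reading the printed rules shows which thresholds each tree chose and which leaf a query such as X = 3 lands in.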

## 1. **Ensemble of Decision Trees (Many Trees Working Together)**

* Random forest doesn’t use just one decision tree.
* It grows many trees, each making a prediction.
* The final answer is the **average** of all their answers.
* This makes the prediction more stable and less sensitive to noise or outliers in the data.

---

## 2. **Bagging and OOB Error (Mixing Data & Testing Ourselves)**

* Each tree gets its own bootstrap sample of the data (some rows may be picked more than once, some left out).
* The rows that aren’t picked for a tree are called that tree's **Out-Of-Bag (OOB)** data.
* After building the trees, random forest predicts each row using only the trees that did not see that row during training.
* Comparing those predictions with the true values gives an OOB score, an estimate of how good the model is without needing extra test data. In scikit-learn, `oob_score_` is the R² for regressors, so higher is better.
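The OOB procedure described above can be sketched by hand. This is a simplified illustration under stated assumptions: the synthetic data set, the number of trees, and the seed are made up for the demo, and scikit-learn's own implementation differs in detail.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.arange(50, dtype=float).reshape(-1, 1)  # synthetic data, assumed for the demo
y = 2 * X.ravel() + 1

n, n_trees = len(X), 25
pred_sum = np.zeros(n)    # running sum of OOB predictions per row
pred_count = np.zeros(n)  # number of trees for which each row was OOB

for _ in range(n_trees):
    idx = rng.integers(0, n, size=n)       # bootstrap sample
    oob = np.setdiff1d(np.arange(n), idx)  # rows this tree never saw
    if oob.size == 0:                      # extremely unlikely, but be safe
        continue
    tree = DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx])
    pred_sum[oob] += tree.predict(X[oob])
    pred_count[oob] += 1

seen = pred_count > 0                      # rows that were OOB at least once
oob_pred = pred_sum[seen] / pred_count[seen]
ss_res = np.sum((y[seen] - oob_pred) ** 2)
ss_tot = np.sum((y[seen] - y[seen].mean()) ** 2)
oob_r2 = 1 - ss_res / ss_tot
print("OOB R^2:", oob_r2)
```

Each row is scored only by trees that never trained on it, so the result behaves like a built-in validation estimate.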

---

## 3. **Feature Importance (What Matters Most?)**

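* Random forest tells us which features (columns in your data) are most helpful for predictions.
* The higher the number, the more important that feature is.
* In scikit-learn the importances are normalized to sum to 1, so with a single feature the value is always 1.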
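Feature importances are easier to read with more than one column. The sketch below uses made-up data, assumed purely for illustration: `signal` fully determines `y`, while `noise` is unrelated to it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
signal = rng.uniform(0, 10, size=200)  # feature that determines y
noise = rng.uniform(0, 10, size=200)   # feature unrelated to y
X = np.column_stack([signal, noise])
y = 2 * signal + 1

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = rf.feature_importances_
print(dict(zip(["signal", "noise"], importances.round(3))))
```

The informative column should receive nearly all of the importance, and the two values add up to 1.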

---

In [29]:
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Small dataset
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10]) 

# Make the random forest
rf = RandomForestRegressor(n_estimators=5, oob_score=True, random_state=0, bootstrap=True)
rf.fit(X, y)

# Predict for x=3
print("Prediction for x=3:", rf.predict([[3]])[0])

# OOB score: R^2 on rows predicted only by trees that did not train on them
print("OOB Score:", rf.oob_score_)

# Which features matter most (here, just one feature)
print("Feature Importances:", rf.feature_importances_)

Prediction for x=3: 6.4
OOB Score: -0.911111111111111
Feature Importances: [1.]
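A negative OOB score means the out-of-bag predictions did worse than always predicting the mean of `y`, which is plausible with only five rows and five trees. As a rough check, the sketch below repeats the experiment on a larger, made-up linear data set with more trees; the sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Larger synthetic data set, assumed for illustration.
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2 * X.ravel()

rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB R^2:", rf.oob_score_)
```

With enough rows and trees, every row is left out of many trees, so the OOB estimate becomes far more stable than in the five-row run.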


