# Random Forest Regressor

**Random Forest** is an ensemble learning method that combines multiple decision trees to improve predictive accuracy and control overfitting. It operates on the principle of "bagging" (Bootstrap Aggregating) and introduces randomness in the selection of features to enhance the diversity of the individual trees.

## How Random Forest Works

### 1. Bootstrapping

The first step in constructing a random forest involves generating multiple subsets of the training dataset using the bootstrapping technique:
- Each tree in the forest is trained on a different random sample of the dataset, created by sampling with replacement.
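- If the original dataset contains N samples, each tree is usually trained on a bootstrap sample of the same size (n = N). Because the sampling is done with replacement, some observations appear more than once in a given sample and others not at all.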
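As a rough illustration of the sampling step, the sketch below draws one bootstrap sample of row indices with NumPy. The dataset size `N` and the seed are arbitrary assumptions made for the example. With n = N, the expected share of distinct rows in a sample is about 1 - 1/e ≈ 63.2%; the rows a tree never sees are commonly called its "out-of-bag" rows.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
N = 10_000                      # hypothetical number of training rows

# One bootstrap sample: N row indices drawn with replacement.
bootstrap_idx = rng.integers(0, N, size=N)

# Share of distinct rows that made it into the sample.
unique_fraction = np.unique(bootstrap_idx).size / N

# Rows that were never drawn are "out-of-bag" for this tree.
oob_mask = np.ones(N, dtype=bool)
oob_mask[bootstrap_idx] = False

print(f"distinct rows in the sample: {unique_fraction:.3f}")
print(f"out-of-bag rows:             {oob_mask.mean():.3f}")
```

Repeating the draw with a different seed gives each tree a different sample, which is one source of the diversity between trees.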

### 2. Building Decision Trees

For each bootstrapped sample, a decision tree is constructed:
- **Node Splitting**: At each node of the tree, only a random subset of m features (out of M total features) is considered when searching for the best split, instead of all features. Common rules of thumb for m are:
  
  - For classification:
    - m = sqrt(M)
  
  - For regression:
    - m = M / 3

  These are heuristics rather than fixed rules. scikit-learn's `RandomForestRegressor`, for example, considers all M features by default (`max_features=1.0`).

- **Impurity Calculation**: For regression, the impurity is often measured using the Mean Squared Error (MSE):
  
  - MSE(t) = (1 / N_t) * Σ (y_i - ȳ_t)²

  where:
  - N_t is the number of observations in the node t.
  - y_i is the actual value of the i-th observation.
  - ȳ_t is the mean of the target values in the node t.
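The two ideas above, a random feature subset and MSE impurity, can be sketched together in a few lines. This is a minimal illustration under assumptions, not how any particular library implements it: the helper names `node_mse` and `split_mse` are hypothetical, the toy data are made up, and a candidate split is scored by the sample-weighted MSE of its two children.

```python
import numpy as np

def node_mse(y):
    """MSE impurity of a node: mean squared deviation from the node mean."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - y.mean()) ** 2))

def split_mse(x, y, threshold):
    """Sample-weighted MSE of the children produced by the split x <= threshold."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    left = x <= threshold
    n_left = int(left.sum())
    n_right = y.size - n_left
    if n_left == 0 or n_right == 0:
        return node_mse(y)  # degenerate split: the node is unchanged
    return (n_left * node_mse(y[left]) + n_right * node_mse(y[~left])) / y.size

# Toy node with made-up values: 6 observations, M = 2 features.
X = np.array([[1, 9], [2, 7], [3, 8], [10, 9], [11, 7], [12, 8]], dtype=float)
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 20.5])

rng = np.random.default_rng(0)
M = X.shape[1]
m = max(1, M // 3)                                   # regression heuristic m = M / 3
candidates = rng.choice(M, size=m, replace=False)    # features considered at this node

# Try every threshold on the candidate features and keep the lowest weighted MSE.
best_mse, best_feature, best_threshold = min(
    (split_mse(X[:, j], y, t), j, t)
    for j in candidates
    for t in np.unique(X[:, j])[:-1]  # skip the largest value: right child would be empty
)

print(f"parent MSE: {node_mse(y):.4f}")
print(f"best split: feature {best_feature} <= {best_threshold} (weighted MSE {best_mse:.4f})")
```

Which feature ends up in `candidates` depends on the random draw, so a different seed can lead to a different (and possibly worse) split at this node. That is the intended effect of the feature subsampling.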

### 3. Tree Growth

By default, each decision tree is grown fully, without pruning:
- Nodes are split until they can no longer be split usefully or a stopping criterion is met, for example:
  - The maximum depth is reached (e.g., `max_depth`).
  - A split would leave fewer than the minimum number of samples in a leaf node (e.g., `min_samples_leaf`).
  - The node holds fewer samples than the minimum required to split an internal node (e.g., `min_samples_split`).
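To see the effect of these stopping criteria, the sketch below fits two scikit-learn `DecisionTreeRegressor` models on synthetic data: one unconstrained and one with limits. The data and the specific limit values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))                # synthetic single feature
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=300)   # noisy synthetic target

# No limits: the tree keeps splitting until it cannot split any further.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Same data, but growth stops early once any criterion is hit.
limited_tree = DecisionTreeRegressor(
    max_depth=4, min_samples_split=10, min_samples_leaf=5, random_state=0
).fit(X, y)

for name, tree in [("unconstrained", full_tree), ("constrained", limited_tree)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

The same parameters can be passed to `RandomForestRegressor`, which forwards them to every tree in the forest.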

### 4. Prediction

Once all trees in the forest are constructed, predictions are made:
- **Regression**: The predicted value is obtained by averaging the predictions from all trees.

  - ŷ = (1 / T) * Σ y_t

  where:
  - ŷ is the predicted value.
  - y_t is the prediction from tree t.
  - T is the total number of trees.
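The averaging formula can be checked directly against scikit-learn by collecting the predictions of the individual trees, which a fitted forest exposes as `estimators_`. The synthetic dataset and the forest size below are assumptions made for the sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

X_new = X[:5]  # a few rows to predict on

# One row of predictions per tree: shape (T, number of rows).
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])

# ŷ = (1 / T) * Σ y_t
manual_average = per_tree.mean(axis=0)

print("manual average:  ", np.round(manual_average, 3))
print("forest.predict():", np.round(forest.predict(X_new), 3))
```

The spread of `per_tree` along the first axis also gives a rough sense of how much the trees disagree on each row.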

### 5. Feature Importance

Random Forest can also provide insights into feature importance:
- **Permutation importance**: The importance of a feature is measured by how much the model's performance (for regression, e.g., R² or MSE) degrades when the values of that feature are randomly shuffled.
- Another common method is to evaluate the average decrease in impurity (e.g., MSE) brought by each feature across all trees.
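Both measures are available in scikit-learn, and the sketch below computes them side by side. The synthetic data (6 features, 2 of them informative) are an assumption for the example. `permutation_importance` is evaluated on held-out data here and uses the estimator's default score, which is R² for a regressor.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(
    n_samples=400, n_features=6, n_informative=2, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Mean decrease in impurity, computed from the training process; sums to 1.
mdi = forest.feature_importances_

# Drop in R² on held-out data when one feature column is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

for j in np.argsort(mdi)[::-1]:
    print(f"feature {j}: MDI={mdi[j]:.3f}  permutation={perm.importances_mean[j]:.3f}")
```

The two rankings usually agree on the strongest features but can differ for weak or correlated ones, so it is worth looking at both.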

## Hyperparameters

Key hyperparameters in Random Forest include:

- **n_estimators**: The number of trees in the forest.
- **max_depth**: The maximum depth of each tree.
- **min_samples_split**: The minimum number of samples required to split an internal node.
- **min_samples_leaf**: The minimum number of samples required to be at a leaf node.
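- **max_features**: The number of features to consider when looking for the best split. This can be:
  - A function of the total number of features (e.g., `"sqrt"` or `"log2"`).
  - A fraction of the total number of features, given as a float (e.g., `0.5`).
  - A specific number of features, given as an integer.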
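As a small illustration of how these hyperparameters are passed in scikit-learn, the sketch below fits a forest with each of the three forms of `max_features` and reads back the number of features each tree actually considers per split (the tree attribute `max_features_`). The synthetic data with 16 features and all parameter values are assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=16, noise=5.0, random_state=0)

resolved = {}
for max_features in ("sqrt", 0.5, 3):  # function name, fraction, absolute count
    forest = RandomForestRegressor(
        n_estimators=20,        # number of trees
        max_depth=6,            # depth limit per tree
        min_samples_split=4,    # samples needed to split an internal node
        min_samples_leaf=2,     # samples required in every leaf
        max_features=max_features,
        random_state=0,
    ).fit(X, y)
    # Each fitted tree stores the resolved number of features per split.
    resolved[max_features] = forest.estimators_[0].max_features_
    print(f"max_features={max_features!r} -> {resolved[max_features]} features per split")
```

In practice these values are usually chosen by cross-validation rather than set by hand.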

## Advantages

- **Reduced Overfitting**: Random Forest helps mitigate overfitting by averaging the predictions of multiple trees.
- **Robustness**: It is less sensitive to noise in the data compared to a single decision tree.
- **Feature Importance**: Provides insights into which features are most important for predictions.

## Disadvantages

- **Complexity**: Random Forest is less interpretable than a single decision tree.
- **Computationally Intensive**: Training multiple trees requires more computational resources and time.



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [3]:
data = pd.read_csv("/kaggle/input/housing3/BostonHousing.csv")

In [4]:
data

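| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |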


In [9]:
# Target: medv (median home value); features: every remaining column.
target = data["medv"]
x = data.drop(columns=["medv"])

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

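# Hold out 20% of the rows for testing. No random_state is set here,
# so the split (and the MSE reported below) changes from run to run.
x_train, x_test, y_train, y_test = train_test_split(x, target, test_size=0.2, shuffle=True)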

In [19]:
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(x_train,y_train)

In [20]:
y_pred = rf_regressor.predict(x_test)

In [22]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 20.364713921568622
