# Random Forest

Random Forest uses **bagging** (Bootstrap Aggregating) as its core ensemble method.

### Bagging (Bootstrap Aggregating)

- **Definition**: Bagging involves training multiple models independently on different random subsets of the training data and then combining their predictions. The subsets are created using bootstrapping, which means sampling with replacement.
  
- **How It Works in Random Forest**:
  - **Bootstrapped Datasets**: For each decision tree in the Random Forest, a bootstrapped dataset is created by randomly sampling the original dataset with replacement. This means each tree is trained on a different random subset of the data, allowing for diversity among the trees.
  - **Independent Training**: Each decision tree is trained independently on its respective bootstrapped dataset.
  - **Aggregation**: After training, the predictions from all the trees are combined. For classification, this is typically done by majority voting, and for regression, by averaging the predictions.

- **Purpose**: The main goal of bagging is to reduce variance and improve the model's generalization by averaging out the predictions from multiple independent models. This leads to a more stable and robust model compared to a single decision tree, which might overfit to the training data.
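The three steps above (bootstrapped datasets, independent training, aggregation) can be sketched by hand. This is a minimal illustration rather than how scikit-learn implements Random Forest internally; the synthetic data from `make_classification`, the tree count, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; any tabular dataset with 0/1 labels would do)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
trees = []

for _ in range(n_trees):
    # Bootstrapped dataset: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Independent training: each tree only sees its own bootstrap sample
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Aggregation: stack per-tree predictions and take the majority class per row
all_preds = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)  # majority vote for 0/1 labels

bagged_acc = (bagged_pred == y_test).mean()
print(f"Bagged accuracy: {bagged_acc:.2f}")
```

Note that this sketch is plain bagging: every tree may use every feature at every split. Random Forest adds random feature selection on top of it.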

### Boosting

Boosting is another ensemble technique, but it is different from bagging. Boosting builds models sequentially, with each new model attempting to correct the errors of the previous ones.

- **Sequential Process**: In boosting, models are trained one after the other, with each model trying to improve the performance of the ensemble by focusing on the errors made by previous models. This often involves reweighting the data or adjusting the importance of misclassified examples.
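- **Adaptive Weights**: In AdaBoost, data points misclassified by previous models receive more weight, so new models focus on these hard-to-classify examples. Gradient boosting methods achieve a similar effect differently: each new model is fitted to the residual errors (more precisely, the negative gradient of the loss) of the current ensemble.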
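To make the reweighting idea concrete, here is a small sketch of a single AdaBoost-style round in plain Python. The labels and weak-learner predictions are hypothetical toy values, and the formulas are the classic binary AdaBoost update with labels in {-1, +1}; other boosting algorithms use different update rules.

```python
import math

# Hypothetical toy labels and the predictions of one weak learner (+1 / -1)
y_true = [1, 1, -1, -1, 1, -1, 1, -1]
y_pred = [1, 1, -1, -1, -1, -1, 1, 1]  # indices 4 and 7 are misclassified

n = len(y_true)
weights = [1.0 / n] * n  # every sample starts with equal weight

# Weighted error of the weak learner: total weight of its mistakes
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# Learner weight: how much say this model gets in the final weighted vote
alpha = 0.5 * math.log((1 - error) / error)

# t * p is +1 for a correct prediction and -1 for a mistake, so mistakes are
# multiplied by exp(+alpha) and correct samples by exp(-alpha)
new_weights = [
    w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)
]

# Renormalize so the weights sum to 1 again
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print(f"error={error:.2f}, alpha={alpha:.3f}")
print([round(w, 4) for w in new_weights])
```

After the update, the two misclassified samples together carry half of the total weight, so the next weak learner is pushed to get them right.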

**Popular Boosting Algorithms**:
- **AdaBoost** (Adaptive Boosting)
- **Gradient Boosting Machines** (GBM)
- **XGBoost** (Extreme Gradient Boosting)
- **LightGBM** (Light Gradient Boosting Machine)
- **CatBoost** (Categorical Boosting)

### Key Differences Between Bagging (Random Forest) and Boosting

1. **Training Approach**:
   - **Bagging**: Trains multiple models independently on different random subsets of data.
   - **Boosting**: Trains models sequentially, with each model focusing on the errors of the previous ones.

2. **Aggregation**:
   - **Bagging**: Combines predictions by majority vote (classification) or averaging (regression).
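   - **Boosting**: Combines predictions as a weighted sum, either giving more weight to models that perform better (AdaBoost) or scaling each model's contribution by a learning rate (gradient boosting).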

3. **Error Reduction**:
   - **Bagging**: Aims to reduce variance by averaging multiple independent models.
   - **Boosting**: Aims to reduce bias by iteratively focusing on errors and refining the model.

In summary, **Random Forest** uses **bagging** to create an ensemble of decision trees, whereas **boosting** is a different technique that sequentially builds models to correct the errors of previous ones.

## How Random Forest Works
Random Forest is an ensemble learning method used for both classification and regression tasks. It builds multiple decision trees during training and outputs either the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.

Here's a step-by-step explanation of how Random Forest works:

### 1. **Data Preparation**
- **Original Dataset**: Let's assume you have a dataset with features \(X\) and corresponding labels \(y\).
  
### 2. **Bootstrapping (Sampling with Replacement)**
- **Creating Multiple Datasets**:
  - From the original dataset, Random Forest creates multiple subsets by randomly sampling with replacement. This process is called **bootstrapping**.
  - Each of these subsets will be used to train a separate decision tree.
  - Because of the sampling with replacement, some data points might appear multiple times in a subset, while others might not appear at all.

### 3. **Growing the Decision Trees**
- **Training on Bootstrapped Data**:
  - Each decision tree in the Random Forest is trained on a different bootstrapped dataset.
  - While growing each tree, Random Forest adds an extra layer of randomness. At each split in the tree, instead of considering all features for the best split, it selects a random subset of features. This ensures that the trees are diverse and not highly correlated.
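In scikit-learn, the size of the random feature subset considered at each split is controlled by the `max_features` parameter. The sketch below uses synthetic data with 16 features (an assumption for illustration) and inspects how many candidate features each fitted tree considers per split when `max_features="sqrt"`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 16 features (an assumption for this example)
X, y = make_classification(
    n_samples=300, n_features=16, n_informative=5, random_state=0
)

# "sqrt" means each split considers sqrt(n_features) randomly chosen features
rf = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0)
rf.fit(X, y)

# Each fitted tree records the resolved number of candidate features per split
n_candidates = rf.estimators_[0].max_features_
print(f"Features in the data: {X.shape[1]}")
print(f"Candidate features per split: {n_candidates}")
```

With 16 features, each split chooses its best threshold among 4 randomly drawn features, which is what keeps the trees from all picking the same dominant feature first.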

### 4. **Voting/Averaging Predictions**
- **Classification**:
  - Once all trees are trained, the Random Forest makes predictions by aggregating the predictions of all individual trees.
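  - For a classification task, each tree casts a "vote" for a class. The final prediction is the class with the most votes (i.e., the mode). scikit-learn's `RandomForestClassifier` uses a soft variant of this: it averages the trees' predicted class probabilities and picks the class with the highest mean probability.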
  
![Random Forest Classification](https://scikit-learn.org/stable/_images/ensemble.png)

- **Regression**:
  - For a regression task, each tree provides a numerical prediction. The final prediction is the average of all the predictions from the trees.
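The two aggregation rules can be illustrated with a few lines of standard-library Python. The per-tree predictions below are hypothetical values invented for the example, not output from the Titanic models later in this notebook.

```python
from collections import Counter
from statistics import mean

# Hypothetical predictions from five trees for a single passenger
class_votes = ["survived", "died", "survived", "survived", "died"]
fare_preds = [30.0, 42.5, 35.0, 28.5, 39.0]

# Classification: the most common class among the trees wins
final_class = Counter(class_votes).most_common(1)[0][0]

# Regression: the forest's prediction is the mean of the tree predictions
final_fare = mean(fare_preds)

print(final_class)  # survived
print(final_fare)   # 35.0
```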

### 5. **Out-of-Bag (OOB) Error (Optional)**
- **Internal Validation**:
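  - Since each tree is trained on a bootstrapped sample, on average about 36.8% (roughly one-third) of the data is left out of each sample: the probability that a given point is never drawn in \(n\) draws with replacement is \((1 - 1/n)^n \approx 1/e \approx 0.368\). This left-out data is called "Out-of-Bag" data.
  - The model can evaluate its performance on this OOB data without needing a separate validation set.
  - Each data point is predicted using only the trees that did not see it during training, and the average error of these predictions provides an estimate of the model's performance on unseen data.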
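The share of data left out of a bootstrap sample can be checked with a quick simulation. This is a standard-library sketch; the dataset size `n`, the number of trials, and the seed are arbitrary choices for the example.

```python
import random

random.seed(0)
n = 1000        # number of rows in the (hypothetical) training set
n_trials = 200  # number of bootstrap samples to simulate

oob_fractions = []
for _ in range(n_trials):
    # Draw n row indices with replacement and keep the distinct ones
    in_bag = {random.randrange(n) for _ in range(n)}
    # Rows that were never drawn are out-of-bag for this sample
    oob_fractions.append(1 - len(in_bag) / n)

simulated = sum(oob_fractions) / n_trials
theoretical = (1 - 1 / n) ** n  # chance a given row is missed in all n draws

print(f"Simulated OOB fraction:  {simulated:.3f}")
print(f"Theoretical (1 - 1/n)^n: {theoretical:.3f}")
```

In scikit-learn, passing `oob_score=True` to `RandomForestClassifier` or `RandomForestRegressor` (with `bootstrap=True`) computes this internal validation score and stores it in the `oob_score_` attribute after fitting.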

### 6. **Final Model**
- The Random Forest model combines the predictions of all the individual trees to make its final prediction, either by majority vote (classification) or averaging (regression).

### Key Features of Random Forest:

1. **Ensemble of Decision Trees**:
   - Random Forest is a collection of decision trees. Each tree is trained on a different subset of data and features, which helps in reducing overfitting and improving generalization.

2. **Randomness**:
   - Two sources of randomness:
     1. Bootstrapped sampling of the data.
     2. Random feature selection at each split in the tree.
   - This randomness makes the trees less correlated and the overall model more robust.

3. **Bias-Variance Tradeoff**:
   - The individual trees might have high variance (overfit) but low bias. By averaging or voting across many trees, Random Forest reduces the variance with little or no increase in bias, leading to a model that generalizes well to unseen data.

4. **Robust to Overfitting**:
   - While individual trees might overfit to their respective datasets, the ensemble of trees tends to generalize well, making Random Forests robust to overfitting compared to a single decision tree.

5. **Feature Importance**:
   - Random Forest provides insights into which features are most important for predictions. This can be useful for feature selection and understanding the model's decisions.
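As a sketch of how feature importance can be read off a fitted forest in scikit-learn, the example below uses the `feature_importances_` attribute on synthetic data. The data, the feature names `f0` to `f5`, and the choice of two informative features are assumptions made so that the expected ranking is known in advance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the informative features come first,
# so f0 and f1 carry the signal and f2..f5 are pure noise (an assumption
# built into this example, not a property of real datasets)
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
feature_names = [f"f{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = rf.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda kv: kv[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

These scores are impurity-based, which can favor features with many distinct values, so they are best treated as a rough guide rather than a definitive ranking.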

### Summary

- **Training**: Random Forest trains multiple decision trees using bootstrapped datasets and random feature subsets.
- **Prediction**: For classification, it uses majority voting among trees; for regression, it averages their predictions.
- **Advantages**: High accuracy, robust to overfitting, provides feature importance, works well with large datasets and high dimensionality.
- **Disadvantages**: Can be computationally intensive and less interpretable than a single decision tree.

Overall, Random Forest is a powerful and versatile machine learning algorithm that is widely used in practice due to its effectiveness and ease of use.

## Random Forest Regressor

In [1]:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error

# Load dataset
data = sns.load_dataset('titanic')

# Drop rows with missing values
data = data.dropna()

# For classification: let's predict 'survived'
# For regression: let's predict 'fare'

In [2]:
data

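| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
| 10 | 1 | 3 | female | 4.0 | 1 | 1 | 16.7000 | S | Third | child | False | G | Southampton | yes | False |
| 11 | 1 | 1 | female | 58.0 | 0 | 0 | 26.5500 | S | First | woman | False | C | Southampton | yes | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 871 | 1 | 1 | female | 47.0 | 1 | 1 | 52.5542 | S | First | woman | False | D | Southampton | yes | False |
| 872 | 0 | 1 | male | 33.0 | 0 | 0 | 5.0000 | S | First | man | True | B | Southampton | no | True |
| 879 | 1 | 1 | female | 56.0 | 0 | 1 | 83.1583 | C | First | woman | False | C | Cherbourg | yes | False |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |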


In [3]:
# Preprocess the Data
# Encode categorical variables
data = pd.get_dummies(data, drop_first=True)

In [4]:
data

| | survived | pclass | age | sibsp | parch | fare | adult_male | alone | sex_male | embarked_Q | ... | who_woman | deck_B | deck_C | deck_D | deck_E | deck_F | deck_G | embark_town_Queenstown | embark_town_Southampton | alive_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | False | False | False | False | ... | True | False | True | False | False | False | False | False | False | True |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | False | False | False | False | ... | True | False | True | False | False | False | False | False | True | True |
| 6 | 0 | 1 | 54.0 | 0 | 0 | 51.8625 | True | True | True | False | ... | False | False | False | False | True | False | False | False | True | False |
| 10 | 1 | 3 | 4.0 | 1 | 1 | 16.7000 | False | False | False | False | ... | False | False | False | False | False | False | True | False | True | True |
| 11 | 1 | 1 | 58.0 | 0 | 0 | 26.5500 | False | True | False | False | ... | True | False | True | False | False | False | False | False | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 871 | 1 | 1 | 47.0 | 1 | 1 | 52.5542 | False | False | False | False | ... | True | False | False | True | False | False | False | False | True | True |
| 872 | 0 | 1 | 33.0 | 0 | 0 | 5.0000 | True | True | True | False | ... | False | True | False | False | False | False | False | False | True | False |
| 879 | 1 | 1 | 56.0 | 0 | 1 | 83.1583 | False | False | False | False | ... | True | False | True | False | False | False | False | False | False | True |
| 887 | 1 | 1 | 19.0 | 0 | 0 | 30.0000 | False | True | False | False | ... | True | True | False | False | False | False | False | False | True | True |


In [5]:
# Features and target for regression (predicting fare)
X_reg = data.drop('fare', axis=1)
y_reg = data['fare']

# Split the data into train and test sets
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

In [6]:
# Parameter grid for Random Forest Regressor
param_grid_reg = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

In [7]:
# Grid search for Random Forest Regressor
from sklearn.model_selection import GridSearchCV

grid_search_reg = GridSearchCV(estimator=RandomForestRegressor(),
                               param_grid=param_grid_reg,
                               cv=5,  # 5-fold cross-validation
                               n_jobs=-1,  # Use all available cores
                               verbose=2)

In [8]:
# Fit the grid search for the regressor
grid_search_reg.fit(X_train_reg, y_train_reg)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits


In [9]:
best_rf_regressor = grid_search_reg.best_estimator_

In [10]:
# Best Random Forest Regressor
y_pred_reg = best_rf_regressor.predict(X_test_reg)
mse = mean_squared_error(y_test_reg, y_pred_reg)
print(f'Best Random Forest Regression Mean Squared Error: {mse:.2f}')
print(f'Best Parameters for Regressor: {grid_search_reg.best_params_}')

Best Random Forest Regression Mean Squared Error: 1943.05
Best Parameters for Regressor: {'bootstrap': True, 'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 50}


## Random Forest Classifier

In [11]:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error

# Load dataset
data = sns.load_dataset('titanic')

# Drop rows with missing values
data = data.dropna()

# For classification: let's predict 'survived'
# For regression: let's predict 'fare'

In [12]:
# Encode categorical variables
data = pd.get_dummies(data, drop_first=True)

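# Features and target for classification (predicting survival)
# Caution: 'alive_yes' is a yes/no copy of 'survived', so keeping it in X_class
# leaks the target and inflates the accuracy; drop it for an honest estimate.
X_class = data.drop('survived', axis=1)
y_class = data['survived']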

# Split the data into train and test sets
X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(X_class, y_class, test_size=0.3, random_state=42)

In [13]:
# Initialize the classifier with some parameters
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, min_samples_split=2, min_samples_leaf=4, bootstrap=True)

# Train the classifier
rf_classifier.fit(X_train_class, y_train_class)

# Make predictions on the test set
y_pred_class = rf_classifier.predict(X_test_class)

# Calculate accuracy
accuracy = accuracy_score(y_test_class, y_pred_class)
print(f'Classification Accuracy: {accuracy:.2f}')

Classification Accuracy: 0.93


Hyperparameter tuning can be applied to this classifier as well, in the same way as for the regressor above.
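A minimal sketch of what such tuning might look like with `GridSearchCV` is shown below. To keep it self-contained it runs on synthetic data from `make_classification` rather than the Titanic split, and the grid values and the names `param_grid_class`, `grid_search_class`, and `best_rf_classifier` are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for X_class / y_class (an assumption for this sketch)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# A deliberately small grid so the search finishes quickly
param_grid_class = {
    "n_estimators": [25, 50],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 4],
}

grid_search_class = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid_class,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",  # metric used to rank parameter combinations
)
grid_search_class.fit(X_train, y_train)

# The best estimator is refit on the full training split by default
best_rf_classifier = grid_search_class.best_estimator_
test_accuracy = best_rf_classifier.score(X_test, y_test)

print(f"Best Parameters for Classifier: {grid_search_class.best_params_}")
print(f"Test Accuracy: {test_accuracy:.2f}")
```

To apply this to the notebook's data, `X_train_class` and `y_train_class` would take the place of `X_train` and `y_train`.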

### Verbose

- **`verbose`** is a parameter used in many machine learning libraries, including `GridSearchCV`, to control the level of output printed to the console during the execution of an operation.
  
- **Functionality**:
  - If `verbose=0`: No output is printed. The process runs silently.
  - If `verbose=1`: Some basic information is printed. For example, when using `GridSearchCV`, it might print progress on the number of iterations completed.
  - If `verbose=2` or higher: More detailed information is printed. This might include more granular updates, such as details about each fold of cross-validation in `GridSearchCV`.

- **Purpose**: The main purpose of `verbose` is to help monitor the progress of long-running operations. For example, if you're performing a grid search over many parameter combinations, setting `verbose` to a higher value will give you insight into how far along the process is, which can be especially useful when working with large datasets or complex models.

### Bootstrap

- **`bootstrap`** is a parameter used in ensemble methods like Random Forests, which determines whether or not bootstrapping is used to create the subsets of data on which each tree in the forest is trained.

- **Functionality**:
  - If `bootstrap=True`: Bootstrapping is enabled, meaning each tree in the Random Forest is trained on a randomly sampled subset of the data with replacement. This means that some samples might appear multiple times in a subset, while others might not appear at all.
  - If `bootstrap=False`: Bootstrapping is disabled, meaning each tree is trained on the entire dataset. In scikit-learn, the `max_samples` parameter for drawing smaller subsets only applies when `bootstrap=True`.

- **Purpose**:
  - Bootstrapping helps in creating diverse trees in the forest. Since each tree is trained on a different subset of data, the trees become more varied, reducing the correlation between them. This increases the overall robustness and generalization ability of the Random Forest model.
  - Using `bootstrap=True` is the default setting in Random Forests because it tends to improve the model's ability to generalize to new, unseen data. The randomness it builds into training also helps prevent overfitting.

In summary:
- **`verbose`** controls the amount of information output during an operation.
- **`bootstrap`** determines whether each tree in a Random Forest is trained on a random subset of the data (with replacement) or on the entire dataset.

## Deep Dive into Bootstrapping

**Bootstrapping** is a statistical method that involves sampling with replacement. In the context of Random Forests, it refers to how the data is used to train each individual tree in the forest.

### Example to Understand Bootstrapping

Imagine you have a dataset with 5 samples:

```
Dataset: [A, B, C, D, E]
```

If we use **bootstrapping** (`bootstrap=True`), here's what happens when training a single tree in the Random Forest:

1. **Sampling with Replacement**: 
   - You randomly select samples from the original dataset, but with replacement. 
   - This means after selecting a sample, you put it back into the dataset, so it can be chosen again.

2. **Example of Bootstrapped Sample**:
   - You might end up with a sample like: `[A, C, E, A, B]`
   - Notice that `A` appears twice, `D` is missing entirely.

3. **Training the Tree**:
   - This tree is trained on the bootstrapped sample `[A, C, E, A, B]`.

When you build the next tree in the forest, a new bootstrapped sample is created, such as `[B, D, D, E, A]`.
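The same procedure can be reproduced with Python's standard library. This is a small sketch; the seed and the number of trees are arbitrary, and the exact samples drawn will differ from the ones written out above.

```python
import random

dataset = ["A", "B", "C", "D", "E"]
rng = random.Random(0)  # seeded only so that reruns give the same samples

samples = []
for tree_id in range(3):
    # Sampling with replacement: draw as many items as the dataset holds
    sample = rng.choices(dataset, k=len(dataset))
    # Items never drawn for this tree are its out-of-bag items
    out_of_bag = sorted(set(dataset) - set(sample))
    samples.append(sample)
    print(f"Tree {tree_id}: sample={sample}, out-of-bag={out_of_bag}")
```

Each tree gets a sample of the same size as the original dataset, but usually with some items repeated and others missing.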

### Why Use Bootstrapping?

1. **Diversity Among Trees**: 
   - Each tree in the forest is trained on a different random sample of the data. This introduces diversity among the trees, as they each see slightly different data.

2. **Reducing Overfitting**:
   - If every tree was trained on the exact same data, they might all make similar mistakes and overfit to the training data. Bootstrapping ensures that trees are more independent of each other, which improves the overall performance of the Random Forest.

### What Happens if `bootstrap=False`?

- If you set `bootstrap=False`, then each tree is trained on the entire dataset without sampling. All trees see the same data, so the only remaining source of diversity is the random feature selection at each split. This might make the Random Forest less robust because the trees are more likely to make the same errors.

## NOTE:
Random Forest is built on bagging: for classification problems the output is decided by majority voting across the trees, and for regression problems it is the mean of the trees' predictions.

#### Prepared By,
Ahamed Basith