# Section 4: Introduction to Random Forest

## Exercise 4.1: Bagging vs Random Forest

**Objective:**  
Compare the performance of a bagging model and a random forest model on a classification problem.

**Instructions:**
1. Load a classification dataset (e.g., the Wine dataset).
2. Train a bagging model and a random forest model on the dataset.
3. Compare their performance in terms of accuracy, precision, recall, and computational efficiency.
4. Discuss the impact of feature randomization in Random Forest on model performance.

**Deliverables:**
- Jupyter notebook or Python script with code and comments.
- Performance comparison table for both models (accuracy, precision, recall).
- Discussion of the differences in performance and impact of feature randomization.

In [1]:
# needed libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.datasets import load_wine

The next cell loads the Wine dataset with the _load_wine_ function from _sklearn.datasets_, passing _as_frame=True_ so that the data is returned as pandas objects. The features are stored in _wine.data_ and the corresponding target labels in _wine.target_. The data is then split into training and testing sets with _train_test_split_ from _sklearn.model_selection_: the features X and the labels y are divided so that 80% of the samples are used for training and 20% for testing. Setting _random_state=42_ makes the split reproducible, so the same samples land in each subset every time the code is run.

In [2]:
wine = load_wine(as_frame=True)

X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
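
As a quick sanity check on the split described above, the sketch below prints the size of each subset and the class counts in each. It is only an illustration: it reloads the data with _return_X_y=True_ (NumPy arrays instead of the DataFrame used in the notebook), and the helper name _class_counts_ is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Reload the data as NumPy arrays (an assumption made to keep the sketch small).
X, y = load_wine(return_X_y=True)

# Same split settings as in the notebook cell above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def class_counts(labels):
    """Return the number of samples per class as a plain list (hypothetical helper)."""
    return np.bincount(labels, minlength=3).tolist()

print("Full dataset:", X.shape)
print("Training samples:", len(X_train), "| test samples:", len(X_test))
print("Class counts (train):", class_counts(y_train))
print("Class counts (test): ", class_counts(y_test))
```

Because the split is not stratified, the class proportions in the test set may differ somewhat from those in the full dataset; passing _stratify=y_ to _train_test_split_ would be one way to keep them aligned.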

Two ensemble classifiers are initialized and trained on the Wine dataset. The first, _bagging_clf_, is a _BaggingClassifier_: it trains 100 base estimators (decision trees by default), each on a bootstrap sample of the training data drawn with replacement, and aggregates their predictions. The second, _rf_clf_, is a _RandomForestClassifier_, an extension of bagging that also trains decision trees on bootstrap samples but additionally restricts each split to a random subset of the features. Both classifiers are trained on the training data (_X_train_ and _y_train_) with the _fit()_ method and then used to predict the test data (_X_test_). The predictions are stored in _y_pred_bagging_ for the Bagging classifier and _y_pred_rf_ for the Random Forest classifier.

In [3]:
bagging_clf = BaggingClassifier(n_estimators=100, random_state=42)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

bagging_clf.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)

y_pred_bagging = bagging_clf.predict(X_test)
y_pred_rf = rf_clf.predict(X_test)
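
To make the relationship between the two ensembles more concrete, the following sketch spells out the base estimator of the bagging model and adds a third variant: bagged trees whose splits each consider only a random subset of features. This is an illustrative approximation of what a random forest does, not a claim that the two implementations are identical. The variable names are hypothetical, and the base estimator is passed positionally because the keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    # Plain bagging: every tree may use all features at every split.
    "Bagged trees": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=100, random_state=42
    ),
    # Bagging plus per-split feature subsampling inside each tree.
    "Bagged trees (sqrt features per split)": BaggingClassifier(
        DecisionTreeClassifier(max_features="sqrt"),
        n_estimators=100,
        random_state=42,
    ),
    # Random forest: bootstrap samples and per-split feature subsampling.
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {test_accuracy[name]:.4f}")
```

If feature randomization is what separates the two models, the second variant would be expected to behave more like the random forest than like plain bagging, although on a test set this small the scores may well coincide.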

We build a pandas DataFrame to compare the two trained classifiers, Bagging and Random Forest, on the test set. The DataFrame has the columns 'Model', 'Accuracy', 'Precision', and 'Recall', and each score is computed with the corresponding function from _sklearn.metrics_. Accuracy is the overall proportion of correct predictions. Because the problem has three classes, precision and recall are computed with _average='weighted'_: the score is calculated for each class separately and the per-class scores are averaged with weights equal to each class's support (its number of true instances in the test set). With this weighting, recall equals accuracy, which is why those two columns match in the output.

In [4]:
pd.DataFrame({
    'Model': ['Bagging', 'Random Forest'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_bagging), 
        accuracy_score(y_test, y_pred_rf)
    ],
    'Precision': [
        precision_score(y_test, y_pred_bagging, average='weighted'), 
        precision_score(y_test, y_pred_rf, average='weighted')
    ],
    'Recall': [
        recall_score(y_test, y_pred_bagging, average='weighted'), 
        recall_score(y_test, y_pred_rf, average='weighted')
    ]
})


| | Model | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 0 | Bagging | 0.972222 | 0.974074 | 0.972222 |
| 1 | Random Forest | 1.0 | 1.0 | 1.0 |
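
The scores above come from a single train/test split. As a complementary check, the sketch below estimates accuracy with 5-fold stratified cross-validation on the full dataset. The number of folds, the shuffling seed, and the variable names are assumptions chosen for illustration, not part of the original exercise.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_wine(return_X_y=True)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Bagging": BaggingClassifier(n_estimators=100, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # One accuracy value per fold; the model is refit from scratch each time.
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")
```

Comparing the fold-to-fold spread with the gap between the two means gives a rough sense of whether the difference seen on one split is likely to persist.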


On this split, Random Forest outperforms Bagging on every metric. Bagging achieved an accuracy of 97.22%, a precision of 97.41%, and a recall of 97.22%, whereas Random Forest scored 100% on all three. These results did not vary between several executions of the code, which is expected because _random_state=42_ fixes both the split and the models. The test set contains 36 samples (20% of the 178 in the dataset), so Bagging's 97.22% corresponds to a single misclassified sample; the gap between the two models is real on this split but rests on one prediction.

The key difference between the two models is the **feature randomization technique** used by Random Forest. Bagging draws bootstrap samples of the training data and trains one model on each, and every tree may consider all features at every split. Random Forest adds a second source of randomness: at each split, a tree considers only a random subset of the features (by default in scikit-learn's classifier, the square root of the number of features). This reduces the correlation between trees and makes the ensemble more diverse, which lowers the variance of the aggregated prediction and can improve generalization. As a result, Random Forest often matches or outperforms plain bagging of decision trees, especially on datasets where a few strong features would otherwise dominate the splits of every tree.
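
The exercise also asks about computational efficiency and the impact of feature randomization, which the table does not cover. The sketch below is one possible way to probe both: it times the fit of each model with _time.perf_counter_ and includes a Random Forest with _max_features=None_, which disables feature subsampling so that every split sees all features. The helper _timed_fit_score_ and the model labels are hypothetical, and a single timing on a dataset this small is noisy, so the numbers should be read as indicative only.

```python
import time

from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def timed_fit_score(model):
    """Fit the model, returning (fit time in seconds, test accuracy)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    return fit_seconds, model.score(X_test, y_test)

candidates = {
    "Bagging (all features per split)": BaggingClassifier(
        n_estimators=100, random_state=42
    ),
    "Random Forest (sqrt features per split)": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", random_state=42
    ),
    # max_features=None turns feature randomization off.
    "Random Forest (all features per split)": RandomForestClassifier(
        n_estimators=100, max_features=None, random_state=42
    ),
}

results = {name: timed_fit_score(model) for name, model in candidates.items()}
for name, (fit_seconds, accuracy) in results.items():
    print(f"{name}: fit time = {fit_seconds:.3f} s, test accuracy = {accuracy:.4f}")
```

Evaluating fewer candidate features per split means less work per node, so the forest with _max_features="sqrt"_ would typically be expected to train faster than the variants that examine all features, though repeated runs would be needed to confirm this.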