# Day 3: Ensemble Learning & Random Forests

1.  **Ensemble Learning (Bagging):** Reducing variance using the Wine Dataset.
2.  **Random Forests:** Demonstrating robustness against noise and high dimensionality using the Digits Dataset.

---
## Topic 1: Ensemble Learning (Bagging)
**Goal:** Understand how Bootstrap Aggregating (Bagging) reduces the variance of a single model.

We will use the **Wine Dataset**. It is small and simple, so a complex Decision Tree can easily overfit it.

### Step 1: Install and Import Basics
First, we ensure we have the necessary libraries.

In [None]:
!pip install scikit-learn pandas matplotlib

In [None]:


import pandas as pd
from sklearn.datasets import load_wine

### Step 2: Load the Dataset (Wine)
We use the Wine dataset. It classifies wines into 3 categories based on 13 chemical features.

In [None]:
# Load data
data = load_wine()

# Separate the features (X) and the target (y)
X = data.data
y = data.target

# Let's look at the shape of our data
print(f"Features: {X.shape}")
print(f"Target Labels: {y.shape}")

### Step 3: Split the Data
We must split the data to see how the model performs on *unseen* data.

In [None]:
from sklearn.model_selection import train_test_split

# Split: 80% for training, 20% for testing (test_size=0.2)
# random_state ensures we get the same split every time we run this
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data split complete.")

### Step 4: The High Variance Model (Single Decision Tree)
We start with a single Decision Tree. Decision trees are high-variance models: a small change in the training data can produce a very different tree.
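
To make "high variance" concrete, here is a minimal sketch that trains the same kind of tree on 20 different bootstrap resamples of the training set and records how much the test accuracy moves. The number of resamples, the seeds, and the names `rng` and `scores` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
scores = []
for _ in range(20):
    # Resample the training rows with replacement (a bootstrap sample)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    scores.append(tree.score(X_test, y_test))

scores = np.array(scores)
print(f"Min accuracy:  {scores.min():.4f}")
print(f"Max accuracy:  {scores.max():.4f}")
print(f"Std deviation: {scores.std():.4f}")
```

A wide gap between the minimum and maximum accuracy is the variance that Bagging aims to reduce.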

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initialize a shallow Decision Tree (depth capped at 3)
tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)

# Train the tree
tree.fit(X_train, y_train)

# Predict
y_pred_tree = tree.predict(X_test)

# Calculate Accuracy
acc_tree = accuracy_score(y_test, y_pred_tree)
print(f"Single Decision Tree Accuracy: {acc_tree:.4f}")

### Step 5: Bagging (Bootstrap Aggregating)
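Now we use Bagging:
1. We draw multiple bootstrap samples from the training data (random samples taken with replacement).
2. We train a tree on each sample.
3. We combine the trees' predictions (Aggregating). For classification this means a majority vote or averaging the predicted class probabilities.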
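
The three steps above can be sketched by hand, without `BaggingClassifier`. This is a simplified illustration that uses a plain majority vote; the tree count of 25, the seeds, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)  # assumed seed
n_trees = 25                     # assumed ensemble size
all_preds = []

for i in range(n_trees):
    # Step 1: bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one tree on that sample
    tree = DecisionTreeClassifier(random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)  # shape: (n_trees, n_test_samples)

# Step 3: majority vote across trees for each test sample
n_classes = len(np.unique(y))
ensemble_pred = np.array(
    [np.bincount(votes, minlength=n_classes).argmax() for votes in all_preds.T]
)

manual_acc = (ensemble_pred == y_test).mean()
print(f"Manual bagging accuracy: {manual_acc:.4f}")
```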

In [None]:
from sklearn.ensemble import BaggingClassifier

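# Initialize Bagging Classifier
# n_estimators=50 means we train 50 trees, each on its own bootstrap sample
# Note: these base trees have no max_depth limit, unlike the depth-3 tree above
# (the `estimator` argument needs scikit-learn >= 1.2; older versions use `base_estimator`)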
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)

# Train the ensemble
bagging.fit(X_train, y_train)

# Predict
y_pred_bagging = bagging.predict(X_test)

# Calculate Accuracy
acc_bagging = accuracy_score(y_test, y_pred_bagging)

print(f"Bagging Ensemble Accuracy:    {acc_bagging:.4f}")
print("---------------------------------------")
print(f"Improvement: {(acc_bagging - acc_tree)*100:.2f} percentage points")

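**Conclusion for Topic 1:** By combining 50 trees trained on different bootstrap samples, Bagging averages out the errors of the individual trees, which usually gives better and more stable test accuracy than a single tree. Check the printed improvement above: on a test set this small, the gap depends on the particular split.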
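
Because a single train/test split of such a small dataset is noisy, a steadier comparison is possible with cross-validation. The sketch below is one way to do that, assuming 5 folds; it relies on the default base estimator of `BaggingClassifier`, which is a decision tree.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

single_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
# With no estimator given, BaggingClassifier bags decision trees by default
bagged_trees = BaggingClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation: each model is scored on 5 different held-out folds
tree_scores = cross_val_score(single_tree, X, y, cv=5)
bag_scores = cross_val_score(bagged_trees, X, y, cv=5)

print(f"Single tree: {tree_scores.mean():.4f} +/- {tree_scores.std():.4f}")
print(f"Bagging:     {bag_scores.mean():.4f} +/- {bag_scores.std():.4f}")
```

Comparing the means shows the typical accuracy, and comparing the standard deviations shows how stable each model is across folds.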

---
## Topic 2: Random Forests
**Goal:** Demonstrate robustness against high dimensionality and noise.

We will use the **Digits Dataset** (images of handwritten numbers). This has 64 features (8x8 pixels), which is higher dimensionality than the previous example.

### Step 1: Import Visualization Tools

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits

### Step 2: Load Data (High Dimensionality)
We load the Digits dataset. Each row is an image flattened into 64 numbers.
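
As a quick check of what "flattened" means, this small sketch reshapes one row of `digits.data` back into an 8x8 grid and compares it with the matching entry in `digits.images`. The names `row` and `image` are chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

row = digits.data[0]        # one sample: 64 pixel intensities in a flat vector
image = row.reshape(8, 8)   # back to an 8x8 grid

print(f"Flat row shape: {row.shape}")
print(f"Image shape:    {image.shape}")
print(f"Matches digits.images[0]: {np.array_equal(image, digits.images[0])}")
```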

In [None]:
digits = load_digits()

X_digits = digits.data
y_digits = digits.target

print(f"Feature shape: {X_digits.shape} (64 dimensions)")

# Let's visualize one sample to understand the data
plt.gray()
plt.matshow(digits.images[0])
plt.title(f"Target: {y_digits[0]}")
plt.show()

### Step 3: Add Noise (Robustness Test)
To test the claim that Random Forests are "robust against noise," we will intentionally make the problem harder by adding Gaussian noise to every pixel of the image data.

In [None]:
# Create Gaussian noise (mean 0, std 4) with a fixed seed so results are reproducible
rng = np.random.default_rng(42)
noise = rng.normal(0, 4, X_digits.shape)

# Add noise to our original data
X_noisy = X_digits + noise

print("Noise added to data.")

### Step 4: Split the Noisy Data

In [None]:
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X_noisy, y_digits, test_size=0.2, random_state=42)

### Step 5: Train Random Forest
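A Random Forest extends Bagging: besides training each tree on a bootstrap sample, it considers only a **random subset of features** at each split. This makes the trees less correlated with one another, which helps when there are many features.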
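
In scikit-learn, the size of that random feature subset is controlled by the `max_features` parameter. The sketch below compares `max_features="sqrt"` (8 of the 64 pixels considered at each split) with `max_features=None` (all features, which makes the forest behave like plain bagged trees). It uses the clean Digits data and 50 trees, both of which are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

n_features = X.shape[1]
features_per_split = int(np.sqrt(n_features))  # candidates per split with "sqrt"
print(f"Features considered per split with 'sqrt': {features_per_split}")

results = {}
for label, max_features in [("sqrt", "sqrt"), ("all", None)]:
    forest = RandomForestClassifier(
        n_estimators=50, max_features=max_features, random_state=42
    )
    forest.fit(X_train, y_train)
    results[label] = forest.score(X_test, y_test)

for label, acc in results.items():
    print(f"max_features={label}: accuracy {acc:.4f}")
```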

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest
# n_estimators=100 (100 trees)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train
rf_model.fit(X_train_n, y_train_n)

# Predict
y_pred_rf = rf_model.predict(X_test_n)

# Evaluate
acc_rf = accuracy_score(y_test_n, y_pred_rf)
print(f"Random Forest Accuracy (on noisy data): {acc_rf:.4f}")
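
A single accuracy number says little about robustness without something to compare it to. The following sketch adds two baselines under the same noise settings: a single Decision Tree on the noisy data, and a Random Forest on the clean data. The seed and the variable names are assumptions; the noise is regenerated here so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

rng = np.random.default_rng(42)  # assumed seed
X_noisy = X + rng.normal(0, 4, X.shape)

# Split clean and noisy versions together so both use the same rows
Xc_train, Xc_test, Xn_train, Xn_test, y_train, y_test = train_test_split(
    X, X_noisy, y, test_size=0.2, random_state=42
)

def fit_and_score(model, X_tr, X_te):
    """Train the model and return its test accuracy."""
    return model.fit(X_tr, y_train).score(X_te, y_test)

scores = {
    "Single tree, noisy data": fit_and_score(
        DecisionTreeClassifier(random_state=42), Xn_train, Xn_test
    ),
    "Random Forest, noisy data": fit_and_score(
        RandomForestClassifier(n_estimators=100, random_state=42), Xn_train, Xn_test
    ),
    "Random Forest, clean data": fit_and_score(
        RandomForestClassifier(n_estimators=100, random_state=42), Xc_train, Xc_test
    ),
}

for name, acc in scores.items():
    print(f"{name}: {acc:.4f}")
```

The smaller the drop from clean to noisy data, and the larger the gap over the single tree, the stronger the evidence for robustness.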

### Step 6: Why is it robust? (Feature Importance)
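A Random Forest's feature importances show which pixels (features) the trees relied on most. Because we added noise to every pixel, the low-importance pixels are the ones that carry little information about the digit, and the forest mostly avoids splitting on them.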

In [None]:
# Plotting feature importances
importances = rf_model.feature_importances_

# Reshape importance array back to 8x8 image to visualize
importance_image = importances.reshape(8, 8)

plt.figure(figsize=(5, 5))
plt.imshow(importance_image, cmap='hot')
plt.colorbar(label='Importance')
plt.title("Which pixels matter most?")
plt.show()