# 👩‍💻 Comparing Ensemble Models with Cross-Validation

## 📋 Overview
In this lab, you will use cross-validation to evaluate and compare two popular ensemble models, Random Forest and Gradient Boosting, on the Medical Insurance Dataset. Because cross-validation scores each model on several different train/validation splits of the same data, it gives a fairer picture of a model's reliability and generalizability than a single split can.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- ✅ Implement cross-validation to evaluate model performance
- ✅ Compare multiple ensemble models using performance metrics
- ✅ Gain insights into model robustness and generalizability

## Task 1: Prepare the Dataset

**Context:** Properly loading and preprocessing the dataset is essential before applying any machine learning models.

**Steps:**

1. Perform necessary preprocessing steps such as handling missing values or encoding categorical features.

```python
# Required Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.datasets import fetch_openml

# Load the dataset
dataset = fetch_openml(data_id=45064)  # Medical Insurance Dataset
X, y = dataset.data, dataset.target

# Task 1: Prepare the Dataset
# your code here...
```

💡 **Tip:** Use `pd.get_dummies()` for encoding categorical variables.

⚙️ **Test Your Work:**
- Display the first 5 rows of the dataset.

**Expected output:** A preview of the data with encoded categorical features as necessary.
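As a minimal sketch of what the `pd.get_dummies()` tip does, the example below encodes a tiny hand-made frame. The column names (`age`, `smoker`, `region`) and values are illustrative assumptions and may not match the lab dataset; `dtype=int` is an optional choice that produces 0/1 columns instead of booleans.

```python
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative assumptions,
# not taken from the lab dataset.
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northwest", "southeast"],
})

# One-hot encode the non-numeric columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.head())                # preview of the encoded frame
print(encoded.isna().sum().sum())    # total number of missing values
```

Each categorical column is replaced by one indicator column per category (for example `smoker_no` and `smoker_yes`), which is the form tree-based regressors in scikit-learn expect.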

## Task 2: Define Ensemble Models

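**Context:** Random Forest averages many trees trained independently on bootstrap samples, while Gradient Boosting builds trees sequentially so that each one corrects the errors of those before it. Defining both gives you two different ensemble strategies to compare.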

**Steps:**

1. Initialize a `RandomForestRegressor`.
2. Initialize a `GradientBoostingRegressor`.

```python
# Task 2: Define Ensemble Methods
# your code here...
```

💡 **Tip:** Use appropriate parameters like `n_estimators` and `random_state`.

⚙️ **Test Your Work:**
- Print the model configurations for `RandomForestRegressor` and `GradientBoostingRegressor`.

**Expected output:** Configuration details of both models.
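One way to inspect a model's configuration is the `get_params()` method that every scikit-learn estimator provides. The sketch below uses hypothetical variable names (`rf_demo`, `gb_demo`) and arbitrary hyperparameter values chosen only for illustration, not as recommended settings.

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Hyperparameter values here are illustrative assumptions, not tuned choices.
rf_demo = RandomForestRegressor(n_estimators=50, random_state=0)
gb_demo = GradientBoostingRegressor(n_estimators=50, random_state=0)

# get_params() returns a dict of every constructor argument, including defaults.
for name, model in [("Random Forest", rf_demo), ("Gradient Boosting", gb_demo)]:
    params = model.get_params()
    print(name, "n_estimators =", params["n_estimators"],
          "random_state =", params["random_state"])
```

Printing the estimator object itself (`print(rf_demo)`) is a shorter alternative, though recent scikit-learn versions show only the parameters that differ from their defaults.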

## Task 3: Set up Cross-Validation

**Context:** Setting up cross-validation is crucial for assessing model performance across multiple data segments.

**Steps:**

1. Use k-fold cross-validation with `k=5`.

```python
# Task 3: Set up Cross-Validation
# your code here...
```

💡 **Tip:** Use `KFold` with `n_splits=5`, `shuffle=True`, and `random_state` for reproducibility.

⚙️ **Test Your Work:**
- Print the k-fold cross-validation object configuration.


**Expected output:** Configuration of the k-fold cross-validation.
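To see what a k-fold splitter actually does, it can help to look at the indices it produces. The sketch below runs `KFold` on ten placeholder samples (an assumption made purely for readability); the names `X_demo` and `kf_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder samples, just to make the fold indices easy to read.
X_demo = np.arange(10).reshape(-1, 1)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

val_indices = []
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(X_demo), start=1):
    # With 10 samples and 5 folds, each fold trains on 8 and validates on 2.
    print(f"fold {fold}: train={train_idx}, val={val_idx}")
    val_indices.extend(val_idx.tolist())
```

Every sample appears in exactly one validation fold, so each observation is used for validation once and for training in the other four folds.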

## Task 4: Evaluate Models with Cross-Validation

**Context:** Evaluating models with cross-validation provides a comprehensive assessment of their performance.

**Steps:**

1. Perform cross-validation for the `RandomForestRegressor`.
2. Perform cross-validation for the `GradientBoostingRegressor`.

💡 **Tip:** Use `cross_val_score` with `scoring='neg_mean_squared_error'`.

```python
# Task 4: Evaluate Models with Cross-Validation
# your code here...
```

⚙️ **Test Your Work:**
- Print the cross-validation scores for both models.

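**Expected output:** Cross-validation scores for both models, one value per fold. With `scoring='neg_mean_squared_error'` these are negated mean squared errors, so every value is zero or negative.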
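The `neg_mean_squared_error` scorer returns the MSE with its sign flipped, because scikit-learn treats higher scores as better. The sketch below illustrates the convention on synthetic data from `make_regression` (a stand-in assumed for illustration, not the lab dataset), with a small forest to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration, not the lab dataset).
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=20, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

neg_mse = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="neg_mean_squared_error")

mse = -neg_mse        # flip the sign to recover the per-fold MSE
rmse = np.sqrt(mse)   # RMSE is expressed in the same units as the target

print("neg MSE per fold:", neg_mse)
print("RMSE per fold:", rmse)
```

Converting to RMSE is optional, but it often makes the numbers easier to interpret because they share the target's units.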

## Task 5: Interpret the Results

**Context:** Interpreting the results helps in understanding model performance and reliability.

**Steps:**

1. Compare the cross-validation results of the two models.

```python
# Task 5: Interpret the Results
# your code here...
```

💡 **Tip:** Use `numpy` to calculate mean and standard deviation of the cross-validation scores.

⚙️ **Test Your Work:**
- Print the mean and standard deviation of the cross-validation scores for both models.

**Expected output:** Mean and standard deviation of the cross-validation scores for both models.
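A small helper can make the summary step explicit. In the sketch below, `summarize_cv` is a hypothetical function name, and the score lists are placeholders chosen only to exercise it; they are not results from the lab dataset.

```python
import numpy as np

def summarize_cv(neg_mse_scores):
    """Return (mean MSE, std of MSE) from an array of negated MSE scores."""
    mse = -np.asarray(neg_mse_scores, dtype=float)
    return mse.mean(), mse.std()

# Placeholder values chosen only to exercise the helper; they are NOT results
# from the lab dataset.
placeholder_a = [-4.0, -6.0, -5.0]
placeholder_b = [-3.0, -9.0, -6.0]

mean_a, std_a = summarize_cv(placeholder_a)
mean_b, std_b = summarize_cv(placeholder_b)

print(f"A: {mean_a:.2f} ± {std_a:.2f}")
print(f"B: {mean_b:.2f} ± {std_b:.2f}")
```

A lower mean MSE indicates lower average error, while a larger standard deviation suggests performance varies more from fold to fold. Note that `np.std` computes the population standard deviation by default (`ddof=0`).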

### ✅ Success Checklist

- Successfully loaded and prepared the dataset
- Defined and configured the RandomForestRegressor and GradientBoostingRegressor
- Set up k-fold cross-validation
- Evaluated models using cross-validation
- Compared and interpreted the cross-validation results
- Documented reflections and insights

### 🔍 Common Issues & Solutions

**Problem:** Dataset not loading.   
**Solution:** Ensure the dataset ID is correct and that you have an internet connection (`fetch_openml` downloads the data), then retry.

**Problem:** Preprocessing errors.   
**Solution:** Verify the preprocessing steps and ensure they are correctly implemented.

**Problem:** Cross-validation errors.   
**Solution:** Double-check the k-fold cross-validation setup and scoring parameter.

### 🔑 Key Points

- Cross-validation provides a comprehensive evaluation of model performance.
- Comparing multiple models helps identify the best-performing one.
- Reflection on results provides deeper insights into model robustness and generalizability.

## 💻 Exemplar Solution

<details>    
<summary><strong>Click HERE to see an exemplar solution</strong></summary>    

```python
# Required Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.datasets import fetch_openml

# Load the dataset
dataset = fetch_openml(data_id=45064)  # Medical Insurance Dataset
X, y = dataset.data, dataset.target

# Preprocessing (e.g., one-hot encode categorical variables if any exist)
X = pd.get_dummies(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Setup k-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
rf_scores = cross_val_score(rf_model, X_train, y_train, cv=kf, scoring='neg_mean_squared_error')
gb_scores = cross_val_score(gb_model, X_train, y_train, cv=kf, scoring='neg_mean_squared_error')

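# Scores are negated MSE, so flip the sign of the mean; the standard deviation is unaffected by the sign
print(f"Random Forest CV MSE: {-np.mean(rf_scores):.2f} ± {np.std(rf_scores):.2f}")
print(f"Gradient Boosting CV MSE: {-np.mean(gb_scores):.2f} ± {np.std(gb_scores):.2f}")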
```