# 👩‍💻 Building and Evaluating a StackingClassifier on Loan Default Data

## 📋 Overview
In this activity, you will build a stacked ensemble model to predict loan defaults using a dataset from Lending Club. Stacking combines the predictions of several algorithms through a meta-model, which can improve predictive accuracy; this matters to financial institutions that rely on robust models to manage risk. By the end of this lab, you will be able to construct and assess a stacking model using scikit-learn's `StackingClassifier`.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- ✅ Construct a stacked ensemble model using multiple base models and a meta-model
- ✅ Evaluate the performance of the stacking model using various classification metrics
- ✅ Reflect on model selection and explore techniques for improving predictive performance

## Task 1: Data Exploration and Preparation

**Context:** Understanding the dataset and ensuring it is clean is the first critical step.

**Steps:**

1. Load the Lending Club Loan Dataset from the provided CSV file.
2. Conduct exploratory data analysis (EDA) including:
    - Displaying summary statistics
    - Checking for missing values
    - Identifying categorical features

```python
# Task 1: Data Exploration and Preparation
# Required Imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load Data
df = pd.read_csv('lending_club_loan_data.csv')

# EDA
# your code here...
```

💡 **Tip:** Use `pd.read_csv()` to load the data, and `df.describe()` to get summary statistics.
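The three EDA steps can be tried out on a tiny made-up table before running them on the real file. The sketch below is only an illustration: the `toy` DataFrame and its values are hypothetical, and only the column names are borrowed from the expected output of this task.

```python
import pandas as pd

# Hypothetical stand-in for the loan data; the values are invented for illustration.
toy = pd.DataFrame({
    "loan_amnt": [5000, 12000, None, 8000],
    "term": ["36 months", "60 months", "36 months", "36 months"],
    "int_rate": [10.5, 13.2, 7.9, None],
})

summary = toy.describe()        # summary statistics for the numeric columns
missing = toy.isnull().sum()    # number of missing values per column
# Treat every non-numeric column as categorical
categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(summary)
print(missing)
print(categorical)
```

On the real dataset the same three calls apply to `df`; which columns turn out to be categorical depends on the CSV you were given.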

⚙️ **Test Your Work:**

- Display the first 5 rows of the dataset.

**Expected output:** A preview of the data with columns such as 'loan_amnt', 'term', 'int_rate', etc.

## Task 2: Split the Data

**Context:** Splitting the data allows for an unbiased evaluation of the model.

**Steps:**

1. Divide the dataset into training and testing sets with an 80-20 split.

```python
# Task 2: Split the Data
```

💡 **Tip:** Use `train_test_split` with a `test_size` of 0.2 and a `random_state` for reproducibility.
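A minimal sketch of an 80-20 split on synthetic arrays is shown here. The arrays are placeholders rather than the loan data, and the `stratify=y` argument is an optional addition not mentioned in the tip; it keeps the class proportions the same in both sets, which may help when defaults are rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 rows, 2 features, 20% positive class
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # class balance in each set
```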

⚙️ **Test Your Work:**

- Print the shapes of the training and testing sets.

**Expected output:** Shapes that reflect the 80-20 split.

## Task 3: Define Base Models

**Context:** Diverse base models capture different patterns within the dataset and tend to make different errors, which gives the meta-model more to work with.

**Steps:**

1. Choose a set of diverse base models such as Decision Tree, Support Vector Machine, and Logistic Regression.

```python
# Task 3: Define Base Models
```

💡 **Tip:** Ensure each model is instantiated with appropriate parameters.
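One possible way to define the base models is as a list of `(name, estimator)` tuples, which is the shape `StackingClassifier` expects. The names and hyperparameter values below are illustrative assumptions, not required settings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Names and hyperparameters are illustrative choices, not prescribed values
base_models = [
    ("decision_tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
    ("svm", SVC(random_state=42)),
    ("log_reg", LogisticRegression(max_iter=1000)),
]

# Print each model's configuration
for name, model in base_models:
    print(name, model.get_params())
```

Each name must be unique, because the stack uses it to identify the fitted estimator later.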

⚙️ **Test Your Work:**

- Print the base models' configurations.

**Expected output:** Configurations of the base models being used in the stack.

## Task 4: Build the Stacking Model

**Context:** Combining base models through a meta-model can produce a more robust predictive model than any single base model.

**Steps:**
1. Construct a `StackingClassifier` using the defined base models.
2. Select a meta-model, typically a simpler model such as Logistic Regression.

```python
# Task 4: Build the Stacking Model
```

💡 **Tip:** Pass the base models to the `estimators` parameter as a list of `(name, estimator)` tuples, and pass the meta-model to `final_estimator`.
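The sketch below shows how the two parameters fit together, using synthetic data from `make_classification` instead of the loan dataset. The two base models and `cv=5` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, standing in for the loan features
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("decision_tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("log_reg", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),  # meta-model
    cv=5,  # out-of-fold predictions from the base models train the meta-model
)
stack.fit(X, y)

print(stack)                                   # configuration of the stack
print(list(stack.named_estimators_.keys()))    # fitted base models by name
```

The meta-model is trained on cross-validated predictions of the base models rather than on their in-sample predictions, which limits overfitting at the second level.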

⚙️ **Test Your Work:**

- Print the `StackingClassifier` configuration.

**Expected output:** The configuration detailing the base models and the meta-model.

## Task 5: Train and Evaluate the Model

**Context:** Fitting the model to the training data and evaluating its performance is essential.

**Steps:**

1. Fit the `StackingClassifier` to the training data using `fit`.
2. Evaluate the model using metrics such as accuracy, precision, recall, and F1 score.

```python
# Task 5: Train and Evaluate the Model
```

💡 **Tip:** Use `classification_report` to get detailed performance metrics.
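To see what the report contains, here is a small sketch with hand-written labels in place of real model predictions. The label names `repaid` and `default` are hypothetical; use whatever classes appear in your target column.

```python
from sklearn.metrics import classification_report

# Hand-written labels for illustration: 1 = default, 0 = repaid
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Human-readable report with per-class precision, recall, and F1
print(classification_report(y_true, y_pred, target_names=["repaid", "default"]))

# The same metrics as a dictionary, convenient for programmatic checks
report = classification_report(y_true, y_pred, output_dict=True)
print(report["accuracy"])         # 8 of 10 predictions are correct
print(report["1"]["precision"])   # 3 of 4 predicted defaults are real defaults
```

For loan defaults, recall on the default class is often worth watching alongside accuracy, since a model can score high accuracy while missing most defaults.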

⚙️ **Test Your Work:**

- Print the classification report for the stacking model’s predictions.

**Expected output:** Metrics including accuracy, precision, recall, and F1 score.

### ✅ Success Checklist

- Successfully explored and prepared the dataset
- Split the data into training and testing sets
- Defined and configured diverse base models
- Constructed and configured the stacking model
- Trained and evaluated the stacking model
- Reflected on the model selection and potential improvements

### 🔍 Common Issues & Solutions

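**Problem:** Dataset file not found.   
**Solution:** Ensure the dataset file is in the correct folder.

**Problem:** Data leakage during preprocessing.   
**Solution:** Split the data first, then fit scalers and encoders on the training set only.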

**Problem:** Categorical encoding errors.   
**Solution:** Double-check the columns being encoded.

**Problem:** Model training errors.   
**Solution:** Verify that data preprocessing steps were correctly applied.
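Several of these issues can be reduced by wrapping preprocessing and the model in a single scikit-learn `Pipeline`, so that scaling and encoding are fitted on the training data only. This is an alternative technique, not the approach this lab prescribes; the toy table and the choice of `OneHotEncoder` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "loan_amnt": [5000, 12000, 7000, 8000, 15000, 3000, 9000, 11000],
    "term": ["36 months", "60 months"] * 4,
    "loan_status": [0, 1, 0, 0, 1, 0, 1, 1],
})
X = toy.drop("loan_status", axis=1)
y = toy["loan_status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["loan_amnt"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["term"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# fit() learns the scaling and encoding from the training rows only
model.fit(X_train, y_train)
print(model.predict(X_test))
```

A `StackingClassifier` could take the place of `LogisticRegression` as the final step; the preprocessing would be handled the same way.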

### 🔑 Key Points

- Stacking allows leveraging multiple algorithms for better predictive performance.
- Proper data preprocessing is essential for model accuracy.
- Evaluating model metrics helps understand and improve model performance.

## 💻 Exemplar Solution

<details>    
<summary><strong>Click HERE to see an exemplar solution</strong></summary>    

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load Data
df = pd.read_csv('lending_club_loan_data.csv')

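# Exploratory data analysis
print(df.head())
print(df.describe())
print(df.dtypes)
print(df.isnull().sum())

# Separate features and target
X = df.drop('loan_status', axis=1)
y = df['loan_status']

# Encode categorical features (a fresh encoder for each column)
for column in X.select_dtypes(include=['object']).columns:
    X[column] = LabelEncoder().fit_transform(X[column])

# Split the data before scaling to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features: fit the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)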

# Define base models
base_models = [
    ('decision_tree', DecisionTreeClassifier(max_depth=3)),
    ('random_forest', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('support_vector_machine', SVC(probability=True))
]

# Stacking model
meta_model = LogisticRegression()
stacking_clf = StackingClassifier(estimators=base_models, final_estimator=meta_model)

# Train the stacking classifier
stacking_clf.fit(X_train, y_train)

# Evaluate
y_pred = stacking_clf.predict(X_test)
print(f"Stacking Model Evaluation:\n {classification_report(y_test, y_pred)}")
```  