# 👩‍💻 Feature Scaling for Loan Approval Prediction

## 📋 Overview
In this lab, you'll work with a loan approval dataset to implement feature scaling techniques—a critical preprocessing step for machine learning models. You'll transform raw data features into consistent scales to improve model performance and interpretability. By the end, you'll have a scaled dataset ready for building a loan approval prediction model and understand how different scaling methods impact your results.

## 🎯 Learning Outcomes
By the end of this lab, you will be able to:

- Apply different feature scaling techniques (StandardScaler and MinMaxScaler) to prepare data for machine learning

- Visualize and compare feature distributions before and after scaling

- Evaluate how feature scaling affects model performance for loan approval prediction

- Select appropriate scaling methods based on data characteristics and model requirements

## 🚀 Starting Point
Access the starter code provided below. You'll need to load the loan approval prediction dataset during the lab.

Required tools/setup:

- Python 3.x
- Pandas, NumPy, Scikit-learn, Matplotlib/Seaborn libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder  # Both scalers plus the label encoder
import warnings

# Load the dataset
df = pd.read_csv('loan_approval_dataset.csv')
df.columns = df.columns.str.strip() # Remove whitespace

## Task 1: Explore the Dataset
**Context:** Data scientists need to understand their data before applying preprocessing techniques. In a real-world loan approval system, understanding feature distributions helps identify which scaling methods would be most appropriate.

**Steps:**

1. Examine the first few rows of the dataset using the `head()` method
2. Display summary statistics with `describe()` to identify features with varying scales
3. Check for missing values using `isnull().sum()`
4. Examine the data types and convert categorical variables as needed

In [None]:
# Your code for dataset exploration

💡 **Tip:** Pay special attention to numerical features like income and loan amount. Their scales likely differ significantly from other features.

⚙️ **Test Your Work:**

- You should see features like income and loan amount having much larger values than others
- Confirm whether any missing values need to be addressed
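
The exploration steps above can be tried out on a small made-up table before running them on the real data. This is a minimal sketch: the `toy` DataFrame and its column names are assumptions chosen for illustration, not the lab dataset's actual schema.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
toy = pd.DataFrame({
    "income": [4_200_000, 9_600_000, 1_500_000, 7_300_000],
    "loan_term": [12, 8, 20, 4],
    "education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
})

print(toy.head())          # first rows
print(toy.describe())      # summary statistics for numeric columns
print(toy.isnull().sum())  # missing values per column
print(toy.dtypes)          # data types (object columns need encoding)

# Quick scale-gap check: ratio of the largest to the smallest column maximum.
numeric = toy.select_dtypes(include="number")
scale_gap = numeric.max().max() / numeric.max().min()
print(f"Largest/smallest column maximum: {scale_gap:,.0f}x")
```

A large ratio like the one printed here is the kind of signal that suggests scaling is worth applying.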

## Task 2: Visualize Feature Distributions
**Context:** Visualization helps identify which features require scaling and provides a baseline for comparing scaled results.

**Steps:**

1. Create histograms of numerical features using `matplotlib.pyplot.hist()` or `seaborn.histplot()`
2. Generate a boxplot to visualize the range and outliers using `seaborn.boxplot()`
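3. Create a correlation matrix heatmap using `seaborn.heatmap(df.corr(numeric_only=True))` to understand relationships between features (recent pandas versions raise an error if `corr()` encounters non-numeric columns)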

In [None]:
# Your code for feature visualization

💡 **Tip:** Compare feature ranges side-by-side using subplots to clearly see scale differences.

⚙️ **Test Your Work:**

- Your histograms should clearly show different scales across features
- The boxplots should reveal potential outliers that might affect scaling choices
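
One possible way to follow the side-by-side tip is a small helper that draws one histogram per feature in a single row of subplots. This is only a sketch: `plot_side_by_side` is a hypothetical helper name, and the synthetic `toy` data stands in for the real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_side_by_side(df, columns, bins=20):
    """Draw one histogram per column in a single row of subplots."""
    fig, axes = plt.subplots(1, len(columns), figsize=(5 * len(columns), 4), squeeze=False)
    for ax, col in zip(axes[0], columns):
        ax.hist(df[col], bins=bins)
        ax.set_title(col)
        ax.set_xlabel("value")
    fig.tight_layout()
    return fig

# Synthetic stand-in data with deliberately mismatched scales (an assumption).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "income": rng.normal(5_000_000, 1_500_000, 200),
    "loan_term": rng.integers(2, 21, 200),
})
fig = plot_side_by_side(toy, ["income", "loan_term"])
```

Each subplot keeps its own x-axis, so the tick labels make the difference in magnitude visible. Outside a notebook, call `plt.show()` to display the figure.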

## Task 3: Apply Standard Scaling
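**Context:** StandardScaler transforms each feature to have a mean of 0 and a standard deviation of 1. This matters for algorithms that are sensitive to feature magnitudes, such as gradient-based models (for example, logistic regression) and distance-based models. Note that standardization shifts and rescales a feature but does not change the shape of its distribution.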

**Steps:**

1. Separate features and target variable
2. Split data into training and testing sets using `train_test_split()`
3. Initialize a `StandardScaler` object
4. Fit the scaler on the training data and transform both training and testing data
5. Visualize the scaled features to confirm transformation

In [None]:
# Your code for standard scaling

💡 **Tip:** Always fit your scaler on training data only to avoid data leakage, then apply the same transformation to test data.

⚙️ **Test Your Work:**

- Check that the scaled training features have a mean close to 0 and a standard deviation close to 1 (test-set values will be near, but not exactly at, these targets)
- Verify that the shape of your data remains unchanged after scaling
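
The checks above can be sketched on synthetic data as follows. The arrays here are generated stand-ins (an assumption), not the loan dataset; the point is the fit-on-train, transform-both pattern and the verification that follows it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative assumption).
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.normal(5_000_000, 1_500_000, 500),  # large-scale feature
    rng.uniform(2, 20, 500),                # small-scale feature
])
y = rng.integers(0, 2, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print("Train mean:", X_train_scaled.mean(axis=0).round(3))
print("Train std: ", X_train_scaled.std(axis=0).round(3))
print("Test mean: ", X_test_scaled.mean(axis=0).round(3))
print("Shapes unchanged:", X_train.shape == X_train_scaled.shape)
```

The training statistics land on 0 and 1 (up to floating-point error), while the test statistics only approximate them, which is expected.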

## Task 4: Apply Min-Max Scaling
**Context:** MinMaxScaler transforms features to a specific range (typically [0,1]), which is useful for algorithms that expect bounded input values.

**Steps:**

1. Initialize a `MinMaxScaler` object
2. Fit the scaler on the training data and transform both training and testing data
3. Visualize the min-max scaled features
4. Compare the distributions before and after scaling

In [None]:
# Your code for min-max scaling

💡 **Tip:** MinMaxScaler is sensitive to outliers, so consider how they might affect your scaled data.

⚙️ **Test Your Work:**

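- Verify that all scaled training values fall between 0 and 1 (test values can fall slightly outside this range because the scaler is fit on the training data only)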
- Compare histograms of features before and after scaling to see how distributions change
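
To make the range check concrete, here is a tiny hand-checkable sketch. The numbers are made up for illustration; the test rows were chosen to fall outside the training range on purpose.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny illustrative arrays (assumed values, not from the lab dataset).
X_train = np.array([[10.0], [20.0], [30.0], [40.0]])
X_test = np.array([[5.0], [25.0], [50.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_train_scaled = scaler.fit_transform(X_train)  # x' = (x - min) / (max - min)
X_test_scaled = scaler.transform(X_test)        # uses the training min and max

print(X_train_scaled.ravel())  # training values span exactly 0 to 1
print(X_test_scaled.ravel())   # 5 maps below 0 and 50 maps above 1
```

Values below 0 or above 1 in the test set are not a bug; they simply mean the test data contains values outside the range seen during training.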

## Task 5: Build and Evaluate Models with Different Scaling Methods
**Context:** Different scaling methods can impact model performance, so it's important to compare results.

**Steps:**

1. Initialize a LogisticRegression model
2. Train separate models using unscaled, standard-scaled, and min-max-scaled data
3. Make predictions with each model
4. Compare performance using metrics like accuracy, precision, recall, and F1-score
5. Generate confusion matrices to visualize prediction results

In [None]:
# Your code for model building and evaluation

💡 **Tip:** For fair comparison, use the same random state when initializing models and splitting data.

⚙️ **Test Your Work:**

- Note performance differences between models with different scaling methods
- Check if there's a clear winner among the scaling approaches
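
A comparison along these lines might look like the sketch below. It uses a synthetic classification problem from `make_classification` with artificially stretched feature scales (both assumptions), so the scores it prints say nothing about the loan dataset itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data, then stretch columns to mismatched scales (illustrative only).
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, random_state=42)
X = X * np.array([1e6, 1e3, 1.0, 10.0, 0.1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Same model settings everywhere; only the preprocessing step differs.
candidates = {
    "unscaled": make_pipeline(LogisticRegression(max_iter=1000, random_state=42)),
    "standard": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42)),
    "minmax": make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000, random_state=42)),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)  # any scaler in the pipeline is fit on training data only
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
    print(confusion_matrix(y_test, y_pred))
```

Wrapping the scaler and model in one pipeline keeps the fit-on-train rule automatic. The unscaled variant may emit a convergence warning, which is itself a useful observation to record.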

## Task 6: Document and Reflect on Results
**Context:** Documenting preprocessing steps and their impact is essential for model reproducibility and future reference.

**Steps:**

1. Summarize the impact of different scaling methods on model performance
2. Document which features benefited most from scaling
3. Explain why certain scaling methods might be more appropriate for this specific dataset
4. Reflect on how feature scaling contributes to the overall loan approval prediction task

In [None]:
# Add your documentation and reflection as comments

## ✅ Success Checklist

- Dataset successfully loaded and explored
- Feature distributions visualized before scaling
- StandardScaler correctly applied to training and testing data
- MinMaxScaler correctly applied to training and testing data
- Model performance compared across different scaling methods
- Documentation completed with insights on scaling impact
- All code runs without errors

## 🔍 Common Issues & Solutions

**Problem:** Fitting scalers on the full dataset instead of only the training data

**Solution:** Always use `fit_transform()` on training data and only `transform()` on test data
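
A small sketch of why this matters, using made-up numbers: when the scaler is fit on training and test rows together, the statistics it learns already contain information from the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values: the single test row is far from the training rows.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[100.0]])

leaky = StandardScaler().fit(np.vstack([X_train, X_test]))  # wrong: sees test data
clean = StandardScaler().fit(X_train)                        # right: training data only

print("Mean learned with leakage:", leaky.mean_)     # pulled toward the test value
print("Mean learned without leakage:", clean.mean_)  # reflects training data only
```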

**Problem:** Poor model performance despite scaling

**Solution:** Check for outliers that might be distorting your scaling results, or consider a robust scaling method such as scikit-learn's `RobustScaler`
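
As a rough illustration of the outlier effect, the sketch below compares `MinMaxScaler` with `RobustScaler` (which centers on the median and divides by the interquartile range) on a made-up column containing one extreme value.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Illustrative column with one extreme outlier (assumed values).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax_scaled = MinMaxScaler().fit_transform(X)
robust_scaled = RobustScaler().fit_transform(X)  # (x - median) / IQR

print(minmax_scaled.ravel())  # typical values are squeezed near 0
print(robust_scaled.ravel())  # typical values stay spread around 0
```

With min-max scaling, the four typical values are compressed into roughly the bottom 3% of the range, while robust scaling keeps them evenly spaced and leaves the outlier as a large value.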

**Problem:** Confusion about which features to scale

**Solution:** Only scale numerical features; categorical features should be encoded separately
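
One way to keep the two kinds of columns separate is scikit-learn's `ColumnTransformer`. The sketch below is an illustration under assumptions: the toy columns are made up, and one-hot encoding is just one possible choice for the categorical side.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative assumptions.
toy = pd.DataFrame({
    "income": [4_200_000, 9_600_000, 1_500_000, 7_300_000],
    "loan_term": [12, 8, 20, 4],
    "education": ["Graduate", "Not Graduate", "Graduate", "Not Graduate"],
})
numeric_cols = ["income", "loan_term"]
categorical_cols = ["education"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),                             # scale numeric columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # encode categoricals
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# In a real workflow, fit on the training split only, then transform the test split.
transformed = preprocessor.fit_transform(toy)
print(transformed.shape)  # 2 scaled columns + 2 one-hot columns
```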

## 🔑 Key Points

- Feature scaling is essential for algorithms sensitive to feature magnitudes
- Different scaling methods (StandardScaler vs. MinMaxScaler) serve different purposes
- The choice of scaling method should be informed by your data distribution and model requirements
- Always fit scalers after splitting the data, using only the training set, to prevent data leakage

## ➡️ Next Steps

What you'll build next: In upcoming labs, you'll extend this work by implementing more advanced preprocessing techniques and building ensemble models for loan approval prediction.

How this connects to the next lab: The scaled features you've prepared will be used for feature selection and dimensionality reduction techniques in the next lab.

## 💻 Exemplar Solution

After completing this activity (or if you get stuck!), take a moment to review the exemplar solution. It can offer insights into different techniques and approaches. Remember, multiple solutions can exist for some problems; the goal is to learn and grow as a programmer by exploring various approaches and reflecting on how they could refine your own.

<details>

<summary><strong>Click HERE to see an exemplar solution</strong></summary>    
    
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder  # Both scalers plus the label encoder
import warnings

# Load the dataset
df = pd.read_csv('loan_approval_dataset.csv')
df.columns = df.columns.str.strip()  # Remove whitespace

# View first few rows
print("First 5 rows:")
print(df.head())

# Summary statistics
print("\nSummary statistics:")
print(df.describe())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

# Data types
print("\nData types:")
print(df.dtypes)

# Label encode categorical columns
categorical_cols = ['education', 'self_employed']
le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])

# Task 2: Visualizing Feature Distributions

# Select numerical columns for visualization from the original DataFrame
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Histograms for numerical features
print("Generating histograms for numerical features...")
plt.figure(figsize=(16, 12))
for idx, col in enumerate(numerical_cols):
    plt.subplot((len(numerical_cols) + 2) // 3, 3, idx + 1)
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Histogram of {col}')
plt.tight_layout()
plt.show()

# Boxplots for numerical features
print("Generating boxplots for numerical features...")
plt.figure(figsize=(16, 12))
for idx, col in enumerate(numerical_cols):
    plt.subplot((len(numerical_cols) + 2) // 3, 3, idx + 1)
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

# Correlation heatmap
print("Generating correlation heatmap...")
plt.figure(figsize=(10, 8))
sns.heatmap(df[numerical_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# --- Prepare features and target variable ---
# Features (X) are all columns except loan_id and loan_status
# Target (y) is loan_status
X = df.drop(['loan_id', 'loan_status'], axis=1)
y = df['loan_status']
print(f"\nFeatures shape: {X.shape}")

# --- Split the dataset ---
print("\n--- Splitting Dataset ---")
# Split data into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Training features shape: {X_train.shape}")

# --- Apply Feature Scaling (StandardScaler and MinMaxScaler) ---
print("\n--- Applying Feature Scaling ---")

# Identify numerical columns after Label Encoding for scaling
# All columns in X are now numerical (original numeric + label encoded)
numerical_features = X_train.columns.tolist()
print(f"\nApplying scaling to features: {numerical_features}")

# Apply StandardScaler
print("\nApplying StandardScaler...")
scaler_standard = StandardScaler()
X_train_scaled_standard = scaler_standard.fit_transform(X_train)
X_test_scaled_standard = scaler_standard.transform(X_test)

# Convert to DataFrame for easy plotting
X_train_scaled_standard_df = pd.DataFrame(X_train_scaled_standard, columns=X_train.columns)

# Visualize scaled features
print("\nVisualizing Scaled Features...")
for feature in X_train_scaled_standard_df.columns:
    plt.figure(figsize=(6, 4))
    sns.histplot(data=X_train_scaled_standard_df, x=feature, kde=True, bins=30)
    plt.title(f'Distribution of Scaled Feature: {feature}')
    plt.tight_layout()
    plt.show()
    
# Apply MinMaxScaler
print("Applying MinMaxScaler...")
scaler_minmax = MinMaxScaler()
X_train_scaled_minmax = scaler_minmax.fit_transform(X_train)
X_test_scaled_minmax = scaler_minmax.transform(X_test)  # Reuse the training min/max on test data

# Convert back to DataFrame
X_train_scaled_minmax_df = pd.DataFrame(X_train_scaled_minmax, columns=X_train.columns)
print("Min-Max scaling complete on training and testing data.")
```