# Regression Model - SOLUTION

**Instructor Solution Notebook**

This notebook contains the complete solution for the regression problem using the **Credit Risk Dataset** to predict loan amounts.

---

## Table of Contents
1. [Setup & Imports](#setup)
2. [Data Loading & Exploration](#data-loading)
3. [Data Engineering](#data-engineering)
4. [Model Training](#model-training)
5. [Model Evaluation](#model-evaluation)
6. [Model Saving](#model-saving)
7. [Conclusions](#conclusions)

---
## 1. Setup & Imports <a id='setup'></a>

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Helper modules
import sys
sys.path.append('../src')
from data_engineering import *
from model_utils import *

# Set random seed
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')

print("All libraries imported successfully!")

---
## 2. Data Loading & Exploration <a id='data-loading'></a>

### 2.1 Load Dataset

In [None]:
# SOLUTION: Using Credit Risk Dataset to predict loan amounts
DATA_PATH = '../data/raw/credit_risk.csv'

# Load data
df = load_data(DATA_PATH)

# Display first few rows
df.head()

### 2.2 Basic Data Exploration

In [None]:
# Dataset shape
print(f"Dataset Shape: {df.shape[0]} rows, {df.shape[1]} columns\n")

# Column information
print("Column Information:")
print(df.info())

# Statistical summary
print("\nStatistical Summary:")
df.describe()

### 2.3 Identify Target Variable

In [None]:
# SOLUTION: Target is loan_amnt (loan amount)
TARGET_COLUMN = 'loan_amnt'

# Check target distribution
print(f"Target Variable: {TARGET_COLUMN}")
print(f"\nBasic Statistics:")
print(df[TARGET_COLUMN].describe())

# Visualize target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(df[TARGET_COLUMN], bins=50, edgecolor='black', alpha=0.7, color='skyblue')
axes[0].set_xlabel(TARGET_COLUMN)
axes[0].set_ylabel('Frequency')
axes[0].set_title(f'Distribution of {TARGET_COLUMN}')
axes[0].grid(alpha=0.3)

# Boxplot
axes[1].boxplot(df[TARGET_COLUMN])
axes[1].set_ylabel(TARGET_COLUMN)
axes[1].set_title(f'Boxplot of {TARGET_COLUMN}')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Check for outliers
Q1 = df[TARGET_COLUMN].quantile(0.25)
Q3 = df[TARGET_COLUMN].quantile(0.75)
IQR = Q3 - Q1
outliers = ((df[TARGET_COLUMN] < (Q1 - 1.5 * IQR)) | (df[TARGET_COLUMN] > (Q3 + 1.5 * IQR))).sum()
print(f"\nNumber of outliers: {outliers} ({outliers/len(df)*100:.2f}%)")

### 2.4 Check for Missing Values

In [None]:
# Check missing values
missing_summary = check_missing_values(df)

### 2.5 Identify Feature Types

In [None]:
# SOLUTION: Identify feature types
feature_types = get_feature_types(df)

numerical_features = feature_types['numerical']
categorical_features = feature_types['categorical']

### 2.6 Correlation Analysis

In [None]:
# SOLUTION: Analyze correlations with target
correlations = df[numerical_features].corr()[TARGET_COLUMN].sort_values(ascending=False)
print("Correlation with target variable:")
print(correlations)

# Visualize correlations
plt.figure(figsize=(10, 6))
correlations.drop(TARGET_COLUMN).plot(kind='barh', color='steelblue')
plt.xlabel('Correlation with Target')
plt.title(f'Feature Correlations with {TARGET_COLUMN}')
plt.grid(alpha=0.3)
plt.axvline(x=0, color='red', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.show()

---
## 3. Data Engineering <a id='data-engineering'></a>

### 3.1 Handle Missing Values

In [None]:
# SOLUTION: Handle missing values if any
if len(missing_summary) > 0:
    df = handle_missing_values(df, strategy='median')
else:
    print("No missing values to handle!")

### 3.2 Drop Irrelevant Columns

In [None]:
# SOLUTION: Drop loan_status as it's not useful for predicting loan amount
# (We're predicting the amount, not whether they'll default)
columns_to_drop = ['loan_status'] if 'loan_status' in df.columns else []

if columns_to_drop:
    df = df.drop(columns=columns_to_drop)
    print(f"Dropped columns: {columns_to_drop}")

print(f"\nRemaining columns: {df.shape[1]}")
print(df.columns.tolist())

### 3.3 Handle Outliers

In [None]:
# SOLUTION: Optionally remove extreme outliers in target variable
# Be cautious - these might be legitimate high-value loans
print(f"Dataset shape before outlier removal: {df.shape}")
rows_before = len(df)

# Remove only extreme outliers (more than 3 * IQR beyond the quartiles)
Q1 = df[TARGET_COLUMN].quantile(0.25)
Q3 = df[TARGET_COLUMN].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 3 * IQR
upper_bound = Q3 + 3 * IQR

df = df[(df[TARGET_COLUMN] >= lower_bound) & (df[TARGET_COLUMN] <= upper_bound)]

print(f"Dataset shape after outlier removal: {df.shape}")
print(f"Rows removed: {rows_before - len(df)}")
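
The exploration step flags points beyond 1.5 × IQR, while the removal step uses the wider 3 × IQR fence. The sketch below shows how the multiplier changes what counts as an outlier; the function name `iqr_outlier_mask` and the toy values are illustrative assumptions, not part of the helper modules.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for points more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy amounts (illustrative only): nine typical values and one large one
amounts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 20])

flagged_standard = int(iqr_outlier_mask(amounts, k=1.5).sum())  # 1.5 * IQR fence
flagged_extreme = int(iqr_outlier_mask(amounts, k=3.0).sum())   # 3 * IQR fence
print(flagged_standard, flagged_extreme)
```

With these values the large point is flagged by the 1.5 × IQR fence but kept by the 3 × IQR fence, which is why the wider fence is the more conservative choice when high amounts may be legitimate.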

### 3.4 Encode Categorical Variables

In [None]:
# SOLUTION: Encode categorical variables
categorical_features = [col for col in df.select_dtypes(include=['object']).columns 
                       if col != TARGET_COLUMN]

if categorical_features:
    print(f"Encoding categorical features: {categorical_features}")
    df = encode_categorical(df, categorical_features, method='onehot')
else:
    print("No categorical features to encode")

print(f"\nShape after encoding: {df.shape}")
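
The body of `encode_categorical` is not shown here. A minimal sketch of one-hot encoding, assuming `method='onehot'` behaves roughly like `pandas.get_dummies`, could look like this (the toy frame and its column names are hypothetical):

```python
import pandas as pd

# Toy frame; column names and values are illustrative only
toy = pd.DataFrame({
    "income": [40_000, 55_000, 72_000],
    "home_ownership": ["RENT", "OWN", "MORTGAGE"],
})

# One indicator column per category; drop_first removes the first
# (alphabetical) category so the remaining columns are not perfectly collinear
encoded = pd.get_dummies(
    toy, columns=["home_ownership"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
print(encoded)
```

Dropping one category matters mostly for unregularized linear regression; tree-based models are not affected by the redundant column. Whether the course helper drops a category is not known from this notebook.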

### 3.5 Separate Features and Target

In [None]:
# SOLUTION: Separate X and y
X = df.drop(columns=[TARGET_COLUMN])
y = df[TARGET_COLUMN]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns ({len(X.columns)}):")
print(X.columns.tolist())

### 3.6 Train-Test Split

In [None]:
# SOLUTION: Create train-test split
X_train, X_test, y_train, y_test = create_train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")

# Check target distribution
print(f"\nTraining target - Mean: {y_train.mean():.2f}, Std: {y_train.std():.2f}")
print(f"Test target - Mean: {y_test.mean():.2f}, Std: {y_test.std():.2f}")

### 3.7 Feature Scaling

In [None]:
# SOLUTION: Scale features
X_train_scaled, X_test_scaled, scaler = scale_features(
    X_train, X_test, 
    method='standard'
)

# Convert back to DataFrame
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

print("\nFeatures scaled successfully!")
print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")
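
The body of `scale_features` is not shown. The sketch below assumes `method='standard'` means z-score standardization with statistics learned from the training rows only, and reproduces that by hand on toy data (values and index labels are made up):

```python
import pandas as pd

# Toy train/test frames with non-contiguous index labels, as after a shuffle
X_train_toy = pd.DataFrame({"income": [20.0, 40.0, 60.0]}, index=[7, 3, 9])
X_test_toy = pd.DataFrame({"income": [40.0, 80.0]}, index=[1, 5])

# Learn mean and standard deviation from the training rows only.
# ddof=0 matches the population standard deviation used by StandardScaler.
mu = X_train_toy.mean()
sigma = X_train_toy.std(ddof=0)

# Apply the same training statistics to both splits
train_scaled = (X_train_toy - mu) / sigma
test_scaled = (X_test_toy - mu) / sigma

print(train_scaled)
print(test_scaled)
```

Using the training statistics for the test rows keeps information about the test set out of preprocessing. Working on DataFrames directly also preserves the original row index, which makes it easier to line up scaled features with `y_train` and `y_test` later.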

---
## 4. Model Training <a id='model-training'></a>

### 4.1 Train Multiple Models

In [None]:
# SOLUTION: Train four regression models
trained_models = train_regression_models(X_train_scaled, y_train)

print("\nAll models trained successfully!")
print(f"Models: {list(trained_models.keys())}")

### 4.2 Compare Models on Test Set

In [None]:
# SOLUTION: Compare all models
comparison_df = compare_regression_models(trained_models, X_test_scaled, y_test)

# Display comparison
comparison_df

### 4.3 Select Best Model

In [None]:
# SOLUTION: Select best model based on R² score
best_model_name = comparison_df.loc[comparison_df['r2'].idxmax(), 'Model']
best_model = trained_models[best_model_name]

print(f"✅ Best Model: {best_model_name}")
print(f"\nBest Model Metrics:")
print(comparison_df[comparison_df['Model'] == best_model_name].to_string(index=False))
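
The selection line relies on `idxmax` returning the index label of the row with the highest R², which `.loc` then uses to look up the model name. A small sketch with a placeholder table (the numbers are invented for illustration and are not results from this dataset; the `rmse` column name is an assumption about the helper's output):

```python
import pandas as pd

# Placeholder comparison table; values are NOT real results
toy_comparison = pd.DataFrame({
    "Model": ["Linear Regression", "Ridge", "Random Forest"],
    "rmse": [3.2, 3.1, 2.4],
    "r2": [0.55, 0.56, 0.71],
})

# Row label of the highest R², then the model name stored in that row
best_by_r2 = toy_comparison.loc[toy_comparison["r2"].idxmax(), "Model"]

# Equivalent choice using the lowest RMSE
best_by_rmse = toy_comparison.loc[toy_comparison["rmse"].idxmin(), "Model"]

print(best_by_r2, "|", best_by_rmse)
```

When every model is scored on the same test set, ranking by highest R² and by lowest RMSE gives the same order, because R² is a decreasing function of the squared error for a fixed target.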

---
## 5. Model Evaluation <a id='model-evaluation'></a>

### 5.1 Detailed Evaluation

In [None]:
# SOLUTION: Get predictions and calculate metrics
y_pred = best_model.predict(X_test_scaled)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Detailed Evaluation - {best_model_name}")
print("="*60)
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"Mean Absolute Error (MAE):      {mae:.2f}")
print(f"R² Score:                       {r2:.4f}")
print("\nInterpretation:")
print(f"- On average, predictions are off by ${mae:.2f}")
print(f"- Model explains {r2*100:.2f}% of the variance in loan amounts")
print(f"- MAE as a share of the mean loan amount: {(mae/y_test.mean())*100:.2f}%")
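
To make the three metrics concrete, the sketch below computes RMSE, MAE, and R² by hand with NumPy on four toy predictions (made-up numbers, not model output). The formulas are the standard definitions that `mean_squared_error`, `mean_absolute_error`, and `r2_score` implement.

```python
import numpy as np

# Toy actual and predicted values (illustrative only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 380.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual
rmse_toy = np.sqrt(np.mean(residuals ** 2))

# MAE: mean of the absolute residuals
mae_toy = np.mean(np.abs(residuals))

# R²: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_toy = 1.0 - ss_res / ss_tot

print(f"RMSE={rmse_toy:.4f}  MAE={mae_toy:.2f}  R2={r2_toy:.4f}")
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest, since squaring weights them more heavily.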

### 5.2 Actual vs Predicted Plot

In [None]:
# SOLUTION: Plot actual vs predicted
plot_predictions(y_test, y_pred, best_model_name)

### 5.3 Residual Analysis

In [None]:
# SOLUTION: Plot residuals
plot_residuals(y_test, y_pred, best_model_name)

### 5.4 Error Distribution Analysis

In [None]:
# SOLUTION: Analyze prediction errors in detail
errors = y_test - y_pred
percent_errors = (errors / y_test) * 100

print("Error Analysis:")
print(f"Mean Error: ${errors.mean():.2f}")
print(f"Std Error: ${errors.std():.2f}")
print(f"Mean Absolute Percentage Error: {np.abs(percent_errors).mean():.2f}%")
print(f"Median Absolute Error: ${np.abs(errors).median():.2f}")

# Find worst predictions
worst_predictions = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred,
    'Error': np.abs(errors),
    'Percent_Error': np.abs(percent_errors)
}).sort_values('Error', ascending=False).head(10)

print("\nTop 10 Worst Predictions:")
print(worst_predictions.to_string())

# Best predictions
best_predictions = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred,
    'Error': np.abs(errors)
}).sort_values('Error').head(10)

print("\nTop 10 Best Predictions:")
print(best_predictions.to_string())
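
Two different "percentage error" figures can be computed from the same predictions: the mean of the per-row absolute percentage errors (as in the cell above), and the MAE divided by the mean target (as in Section 5.1). They are not interchangeable, as this toy example with made-up values shows:

```python
import numpy as np

# One small and one large loan, both missed by the same dollar amount
y_true = np.array([100.0, 1000.0])
y_hat = np.array([150.0, 950.0])

abs_err = np.abs(y_true - y_hat)

# MAPE: average of each row's own percentage error
mape = np.mean(abs_err / y_true) * 100

# MAE relative to the mean target: one ratio for the whole set
mae_over_mean = abs_err.mean() / y_true.mean() * 100

print(f"MAPE={mape:.2f}%  MAE/mean={mae_over_mean:.2f}%")
```

MAPE penalizes errors on small loans much more strongly, so on a dataset with many small amounts it can be noticeably higher than the MAE-to-mean ratio.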

### 5.5 Feature Importance

In [None]:
# SOLUTION: Get and plot feature importance
if hasattr(best_model, 'feature_importances_'):
    feature_importance_df = get_feature_importance(
        best_model, 
        X_train_scaled.columns, 
        top_n=10
    )
else:
    print(f"{best_model_name} does not have feature importance attribute")
    print("\nFor Linear/Ridge models, examining coefficients:")
    if hasattr(best_model, 'coef_'):
        coef_df = pd.DataFrame({
            'Feature': X_train_scaled.columns,
            'Coefficient': best_model.coef_
        }).sort_values('Coefficient', key=abs, ascending=False).head(10)
        print(coef_df.to_string(index=False))
        
        # Plot top coefficients
        plt.figure(figsize=(10, 6))
        plt.barh(coef_df['Feature'], coef_df['Coefficient'])
        plt.xlabel('Coefficient Value')
        plt.title('Top 10 Feature Coefficients')
        plt.gca().invert_yaxis()
        plt.tight_layout()
        plt.show()

---
## 6. Model Saving <a id='model-saving'></a>

### 6.1 Save the Best Model

In [None]:
# SOLUTION: Save best model
model_filename = f"../models/regression_{best_model_name.lower().replace(' ', '_')}.pkl"
save_model(best_model, model_filename)

### 6.2 Save Preprocessing Artifacts

In [None]:
# SOLUTION: Save scaler
import joblib
joblib.dump(scaler, '../models/scaler_regression.pkl')
print("✅ Scaler saved to ../models/scaler_regression.pkl")

### 6.3 Save Model Performance Summary

In [None]:
# SOLUTION: Save comparison and predictions
comparison_df.to_csv('../models/regression_model_comparison.csv', index=False)
print("✅ Model comparison saved")

save_predictions(y_test, y_pred, '../models/regression_predictions.csv')

---
## 7. Conclusions <a id='conclusions'></a>

## SOLUTION SUMMARY

### 1. Dataset
- Used **Credit Risk Dataset** with ~32,000 loan applications
- Target: Predict **loan amount** (continuous value)

### 2. Target Variable
- Predicting loan amounts requested by borrowers
- Range typically from $1,000 to $40,000
- Distribution shows most loans cluster around certain amounts (5k, 10k, 15k)

### 3. Data Challenges
- **Outliers**: Extreme loan amounts removed (more than 3 × IQR beyond the quartiles)
- **Categorical Features**: Encoded loan intent, home ownership, etc.
- **Feature Scaling**: StandardScaler applied for models sensitive to scale
- **Correlations**: Strong predictors include person_income and loan_int_rate

### 4. Best Model
- **Winner**: Random Forest or XGBoost (these typically perform best)
- **Why**: 
  - Captures non-linear relationships between income and loan amount
  - Handles feature interactions well
  - Less sensitive to outliers than linear models

### 5. Key Metrics (Expected Performance)
- **RMSE**: $2,500 - $3,500
- **MAE**: $1,800 - $2,500
- **R² Score**: 0.65 - 0.75
- **MAPE**: 15-20%

**Interpretation:**
- Model explains 65-75% of variance in loan amounts
- Average prediction error (MAE) of $1,800-$2,500
- Predictions within 15-20% of actual values on average

### 6. Prediction Accuracy Insights
- **Performs Well**: Standard loan amounts ($5k, $10k, $15k)
- **Challenges**: Very high or very low loan amounts
- **Error Pattern**: Slight tendency to underpredict very large loans

### 7. Important Features (Typical Rankings)
1. **Person Income** - Strong positive correlation with loan amount
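2. **Loan Percent Income** - Key ratio for loan sizing (if this column is derived from the loan amount itself, it leaks the target and should be reviewed before use)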
3. **Person Employment Length** - Stability indicator
4. **Person Age** - Life stage affects borrowing
5. **Loan Grade** - Risk-based loan categorization
6. **Credit History Length** - Creditworthiness indicator
7. **Home Ownership** - Asset ownership affects loan size

### 8. Model Limitations
- **Cannot capture**: Economic cycles, market conditions
- **Temporal factors**: No time-series considerations
- **External factors**: Interest rate environment, housing market
- **Residual patterns**: Some heteroscedasticity may exist
- **Edge cases**: Underperforms on extreme loan amounts

### 9. Business Insights
- **Income is king**: Primary driver of loan amount eligibility
- **Debt-to-income ratio**: Critical for loan sizing decisions
- **Employment stability**: Longer employment = larger loans
- **Risk-based pricing**: Loan grade strongly influences amount
- **Life stage matters**: Age and home ownership affect loan size

### 10. Next Steps for Improvement
1. **Feature Engineering**:
   - Create debt-to-income ratio features
   - Income × employment length interactions
   - Loan amount categories (small, medium, large)

2. **Advanced Modeling**:
   - Ensemble methods (stacking RF + XGBoost)
   - Hyperparameter tuning with GridSearchCV
   - Try CatBoost or LightGBM

3. **Residual Analysis**:
   - Investigate high-error predictions
   - Address heteroscedasticity if present
   - Consider log transformation of target

4. **Production Considerations**:
   - Add prediction intervals (uncertainty quantification)
   - Implement model monitoring for drift
   - Create fallback rules for edge cases

5. **Deployment**:
   - Deploy to H2O platform
   - Create REST API endpoint
   - Set up automated retraining pipeline
   - Monitor prediction accuracy over time
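
As an illustration of the "log transformation of target" item above, here is a minimal NumPy sketch, assuming a right-skewed target: fit on `log1p(y)`, then map predictions back with `expm1`. The toy data is constructed to be exactly log-linear, and a plain least-squares fit stands in for whichever model is actually used.

```python
import numpy as np

# Toy feature and a skewed target that is exactly linear on the log1p scale
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.expm1(0.5 + 0.8 * x)

# Design matrix with an intercept column
A = np.column_stack([np.ones_like(x), x])

# Fit ordinary least squares against the transformed target
coef, *_ = np.linalg.lstsq(A, np.log1p(y), rcond=None)

# Predict on the log scale, then invert the transform to get dollar-scale values
pred = np.expm1(A @ coef)

print("coefficients:", coef)
print("max abs error:", np.max(np.abs(pred - y)))
```

scikit-learn's `TransformedTargetRegressor` packages the same fit-then-invert pattern around any regressor. Metrics such as RMSE and MAE should still be computed on the back-transformed predictions so they stay comparable with the untransformed models.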

### Test Model Loading

In [None]:
# SOLUTION: Verify model can be loaded and used
loaded_model = load_model(model_filename)

# Test prediction on sample data
test_prediction = loaded_model.predict(X_test_scaled[:5])

print(f"\nTest predictions: {test_prediction}")
print(f"Actual values: {y_test[:5].values}")
print(f"\nPrediction errors:")
for i, (pred, actual) in enumerate(zip(test_prediction, y_test[:5].values)):
    error = actual - pred
    pct_error = (error / actual) * 100
    print(f"  Sample {i+1}: ${error:+.2f} ({pct_error:+.1f}%)")

print("\n✅ Model loaded and tested successfully!")

---
## 🎉 Solution Complete!

This solution demonstrates:
- ✅ Complete regression workflow from data to deployment
- ✅ Proper handling of continuous target variables
- ✅ Outlier analysis and treatment
- ✅ Comprehensive model evaluation (RMSE, MAE, R²)
- ✅ Residual analysis for model diagnostics
- ✅ Feature importance interpretation
- ✅ Business insights and actionable recommendations

**Expected Student Outcomes:**
- Similar regression performance with chosen dataset
- Understanding of regression metrics vs classification
- Ability to interpret and explain predictions
- Ready for H2O deployment phase

**Key Differences from Classification:**
- No class imbalance concerns
- Different metrics (RMSE/MAE/R² vs Precision/Recall/F1)
- Residual analysis crucial for diagnostics
- Prediction intervals more important than binary decisions