# ML Interpretability and Stability Project

This notebook tests the functions in `ml_models.py` and implements the project requirements:

**Project Step 1**: Take the default probabilities (DP) estimated by an unknown model and fit surrogate models to interpret that model

**Project Step 2**: Train our own black-box ML model to forecast default

## Setup and Imports

In [None]:
import sys
import os

# Add src directory to path
sys.path.append('src')

# Import our custom module
from ml_models import DefaultProbabilityAnalysis

# Import additional libraries for testing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("All imports successful!")

## Initialize Analysis

First, check that the data file exists, then initialize the analysis class.

In [None]:
# Check if data file exists
data_path = 'data/dataproject2025 (1).csv'

if os.path.exists(data_path):
    print(f"✓ Data file found: {data_path}")
    
    # Initialize analysis
    analysis = DefaultProbabilityAnalysis(data_path)
    print("✓ Analysis class initialized successfully")
else:
    print(f"✗ Data file not found: {data_path}")
    print("Available files in data directory:")
    if os.path.exists('data'):
        for file in os.listdir('data'):
            print(f"  - {file}")
    else:
        print("  Data directory does not exist")

## Step 1: Data Loading and Preprocessing

Load and preprocess the dataset to prepare for analysis.

In [None]:
# Load and preprocess data
try:
    df = analysis.load_and_preprocess_data()
    print("\n✓ Data loaded and preprocessed successfully")
    
    # Display basic information about the dataset
    print(f"\nDataset shape: {df.shape}")
    print(f"Number of features: {len(analysis.feature_cols)}")
    print(f"Default rate: {df['target'].mean():.2%}")
    
    # Show first few rows
    print("\nFirst 5 rows of key columns:")
    key_cols = ['target', 'Predicted probabilities'] + analysis.feature_cols[:5]
    display(df[key_cols].head())
    
except Exception as e:
    print(f"✗ Error loading data: {e}")

## Step 2: Exploratory Data Analysis

Perform exploratory analysis on the default probabilities (DP) from the unknown model.

In [None]:
# Perform exploratory analysis
try:
    analysis.exploratory_analysis()
    print("\n✓ Exploratory analysis completed")
except Exception as e:
    print(f"✗ Error in exploratory analysis: {e}")

## Step 3: Implement Surrogate Models (Project Step 1)

Implement surrogate models to interpret the unknown model used to generate the default probabilities.

In [None]:
# Implement surrogate models
try:
    lr_surrogate, dt_surrogate, linear_importance, tree_importance = analysis.implement_surrogate_models()
    print("\n✓ Surrogate models implemented successfully")
    
    print("\n=== SURROGATE MODELS SUMMARY ===")
    print("\n1. Linear Regression Surrogate:")
    print("   - Provides global interpretability through feature coefficients")
    print("   - Shows linear relationships between features and DP")
    
    print("\n2. Decision Tree Surrogate:")
    print("   - Provides rule-based interpretability")
    print("   - Captures non-linear relationships and interactions")
    
    print("\nTop 5 most important features (Linear Model):")
    display(linear_importance.head())
    
    print("\nTop 5 most important features (Decision Tree):")
    display(tree_importance.head())
    
except Exception as e:
    print(f"✗ Error implementing surrogate models: {e}")
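
The surrogate idea can be illustrated on its own. The sketch below is not the code in `ml_models.py`: it uses synthetic features and a made-up logistic function as a stand-in for the unknown model's DP, and the names `X`, `dp`, `fidelity_linear`, and `fidelity_tree` are assumptions for illustration only. Both surrogates are fit to the black-box output rather than to the true default labels, and fidelity is measured as the R² between surrogate predictions and the DP.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix (not the project dataset)
X = rng.normal(size=(1000, 4))

# Synthetic stand-in for the unknown model's default probabilities:
# a logistic function of two features plus one interaction term
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.8 * X[:, 0] * X[:, 2]
dp = 1.0 / (1.0 + np.exp(-logit))

# Fit both surrogates on the black-box output (dp), not on observed defaults
linear_surrogate = LinearRegression().fit(X, dp)
tree_surrogate = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, dp)

# Fidelity: how much of the variance in dp each surrogate reproduces (in-sample)
fidelity_linear = r2_score(dp, linear_surrogate.predict(X))
fidelity_tree = r2_score(dp, tree_surrogate.predict(X))

print(f"Linear surrogate fidelity (R^2): {fidelity_linear:.3f}")
print(f"Tree surrogate fidelity (R^2):   {fidelity_tree:.3f}")
print(f"Linear coefficients: {np.round(linear_surrogate.coef_, 3)}")
```

A surrogate's coefficients or splits are only as trustworthy as its fidelity: a low R² means the explanation describes the surrogate, not the model it is meant to approximate.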

## Step 4: Build Black-box Models (Project Step 2)

Train our own models to forecast default: a random forest as the black-box model and a logistic regression as a linear baseline.

In [None]:
# Build black-box models
try:
    rf_model, lr_model, rf_importance = analysis.build_blackbox_model()
    print("\n✓ Black-box models built successfully")
    
    print("\n=== BLACK-BOX MODELS SUMMARY ===")
    print("\n1. Random Forest Classifier:")
    print("   - Ensemble of decision trees; usually strong on tabular data")
    print("   - Provides feature importance rankings")
    
    print("\n2. Logistic Regression:")
    print("   - Baseline linear classifier")
    print("   - Provides probability estimates and interpretable coefficients")
    
    print("\nTop 5 most important features (Random Forest):")
    display(rf_importance.head())
    
except Exception as e:
    print(f"✗ Error building black-box models: {e}")
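
For reference, a minimal sketch of this kind of comparison is shown below. It does not reproduce `build_blackbox_model()`; the data is synthetic, and the hyperparameters, the 75/25 split, and the use of ROC AUC are assumptions chosen for illustration. The logistic regression is wrapped in a pipeline so that the scaler is fit on the training split only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features and binary default labels (not the project dataset)
X = rng.normal(size=(2000, 5))
logit = -2.0 + 1.2 * X[:, 0] - 0.8 * X[:, 1] + 0.6 * X[:, 0] * X[:, 2]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Stratify so both splits keep the same default rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Black-box model: random forest on raw features
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Baseline: logistic regression on standardized features
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)

# Compare ranking quality on held-out data using predicted probabilities
auc_rf = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
auc_logreg = roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1])

print(f"Random forest ROC AUC:       {auc_rf:.3f}")
print(f"Logistic regression ROC AUC: {auc_logreg:.3f}")
print(f"Random forest importances:   {np.round(rf.feature_importances_, 3)}")
```

ROC AUC is used here because default data is usually imbalanced, which makes plain accuracy hard to interpret.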

## Step 5: Generate Summary Report

In [None]:
# Generate comprehensive summary report
try:
    analysis.generate_summary_report()
    print("\n✓ Summary report generated successfully")
except Exception as e:
    print(f"✗ Error generating summary report: {e}")

## Step 6: Additional Analysis and Insights

In [None]:
# Additional analysis - Compare feature importance across models
try:
    print("\n=== FEATURE IMPORTANCE COMPARISON ===")
    
    # Get top 10 features from each model
    top_linear = set(linear_importance.head(10)['feature'])
    top_tree = set(tree_importance.head(10)['feature'])
    top_rf = set(rf_importance.head(10)['feature'])
    
    # Find features that appear in the top 10 of all three models
    common_features = top_linear & top_tree & top_rf
    
    print(f"\nFeatures important across ALL models ({len(common_features)}):")
    # Sort so the output order is stable across runs
    for feature in sorted(common_features):
        print(f"  - {feature}")
    
    # Features that appear in the top 10 of at least 2 models
    two_models = ((top_linear & top_tree)
                  | (top_linear & top_rf)
                  | (top_tree & top_rf))
    
    print(f"\nFeatures important in at least 2 models ({len(two_models)}):")
    for feature in sorted(two_models):
        print(f"  - {feature}")
        
except Exception as e:
    print(f"✗ Error in additional analysis: {e}")
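
The pairwise set logic above can be generalized to any number of models. The sketch below is an illustration only: the helpers `features_in_at_least` and `jaccard` and the example feature names are hypothetical and not part of `ml_models.py`.

```python
from collections import Counter


def features_in_at_least(top_sets, k):
    """Return the features that appear in at least k of the given sets."""
    counts = Counter(feature for s in top_sets for feature in set(s))
    return {feature for feature, n in counts.items() if n >= k}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical top-feature sets from three models
example_linear = {"income", "debt_ratio", "age"}
example_tree = {"income", "debt_ratio", "loan_amount"}
example_rf = {"income", "loan_amount", "tenure"}
example_sets = [example_linear, example_tree, example_rf]

print(sorted(features_in_at_least(example_sets, 3)))  # ['income']
print(sorted(features_in_at_least(example_sets, 2)))  # ['debt_ratio', 'income', 'loan_amount']
print(jaccard(example_linear, example_tree))          # 0.5
```

The Jaccard score gives a single number per model pair, which is easier to track than a list of names when the top-k cutoff or the set of models changes.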

## Step 7: Model Validation and Testing

In [None]:
# Test individual functions to ensure they work correctly
print("=== FUNCTION TESTING ===")

# Test 1: Check if all models are properly trained
try:
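    # Test surrogate models
    # Linear models receive scaled features and tree-based models receive raw
    # features; this must match how the models were trained in ml_models.py
    test_features = analysis.df[analysis.feature_cols].iloc[:5]
    test_features_scaled = analysis.scaler.transform(test_features)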
    
    lr_pred = lr_surrogate.predict(test_features_scaled)
    dt_pred = dt_surrogate.predict(test_features)
    
    print("✓ Surrogate models can make predictions")
    
    # Test black-box models
    rf_pred = rf_model.predict(test_features)
    lr_pred_bb = lr_model.predict(test_features_scaled)
    
    print("✓ Black-box models can make predictions")
    
    print("\nSample predictions:")
    print(f"Linear Surrogate (DP): {lr_pred[:3]}")
    print(f"Tree Surrogate (DP): {dt_pred[:3]}")
    print(f"Random Forest (Default): {rf_pred[:3]}")
    print(f"Logistic Regression (Default): {lr_pred_bb[:3]}")
    
except Exception as e:
    print(f"✗ Error in model testing: {e}")

print("\n=== ALL TESTS COMPLETED ===")
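
Printing predictions confirms that the models run, but not that the outputs are sensible. One possible way to add explicit checks is sketched below. The helper names are hypothetical and the example vectors are made up. Note that a linear regression surrogate is unconstrained, so it can return DP values below 0 or above 1; the first helper reports how often that happens instead of failing.

```python
import numpy as np


def fraction_outside_unit_interval(pred, n_expected):
    """Validate predicted probabilities; return the share outside [0, 1]."""
    pred = np.asarray(pred, dtype=float)
    if pred.shape != (n_expected,):
        raise ValueError(f"expected shape ({n_expected},), got {pred.shape}")
    if not np.all(np.isfinite(pred)):
        raise ValueError("predictions contain NaN or infinite values")
    return float(np.mean((pred < 0.0) | (pred > 1.0)))


def check_binary_labels(pred, n_expected):
    """Validate that class predictions have the right length and are 0 or 1."""
    pred = np.asarray(pred)
    if pred.shape != (n_expected,):
        raise ValueError(f"expected shape ({n_expected},), got {pred.shape}")
    unexpected = set(np.unique(pred).tolist()) - {0, 1}
    if unexpected:
        raise ValueError(f"unexpected class labels: {sorted(unexpected)}")
    return True


# Made-up outputs standing in for surrogate and classifier predictions
example_dp = [0.10, 1.20, -0.05, 0.50]
example_labels = [0, 1, 0, 0]

print(fraction_outside_unit_interval(example_dp, 4))  # 0.5
print(check_binary_labels(example_labels, 4))         # True
```

In the notebook, such checks could be applied to `lr_pred`, `dt_pred`, `rf_pred`, and `lr_pred_bb` with `n_expected=5`, matching the five rows selected for testing.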

## Conclusion

This notebook implements and tests:

### Project Step 1: Surrogate Model Implementation ✓
- **Linear Regression Surrogate**: Interprets the unknown DP model through linear relationships
- **Decision Tree Surrogate**: Provides rule-based interpretation with non-linear capabilities
- Both models help understand how the original model generates default probabilities

### Project Step 2: Black-box Model Development ✓
- **Random Forest Classifier**: Ensemble model for default prediction
- **Logistic Regression**: Baseline linear classifier with interpretable coefficients
- Both models forecast default using the same features as the original unknown model

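### Key Points to Examine:
1. **Model Performance**: Compare surrogate fidelity (how closely each surrogate reproduces the DP) with the black-box models' accuracy on the default target
2. **Feature Importance**: Identify the risk factors that rank highly in each model type
3. **Interpretability**: Weigh model complexity against explainability when choosing between the linear and tree-based models
4. **Stability**: Check whether the same features stay important across the linear, decision tree, and random forest models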