# Loan Prediction Competition

In this workshop, we will apply ensemble methods such as Random Forest and Gradient Boosting to a loan prediction dataset. The dataset used is a modified subset of the **Loan Prediction Problem Dataset** from Kaggle ([link](https://www.kaggle.com/datasets/altruistdelhite04/loan-prediction-problem-dataset)).

## Objective
The task is to predict whether a loan application will be approved based on applicant information.

![Loan Prediction Competition](https://drive.google.com/uc?id=1eipuAdG46mfAgm-KSFth_YEazhJAZHVx)



## Loading the Data

The training dataset is loaded from the **[train.csv](https://drive.google.com/file/d/1Ejs0yaRm3NxFOVIhwQphoDz8voJl6NQx/view?usp=sharing)** file using Pandas. After loading, we inspect the first few rows to understand its structure and check for missing values. Basic preprocessing steps, such as handling null values and encoding categorical variables, will be performed before modeling.
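
The inspection step described above can be sketched on a tiny stand-in frame. The rows below are made up for illustration and are not taken from `train.csv`; only the column names mirror the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for train.csv; the values are made up.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "LoanAmount": [120.0, np.nan, 66.0, 141.0],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Count missing values per column, as data.isnull().sum() does on the real data.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Share of each target class; a skewed split motivates stratified sampling later.
class_share = toy["Loan_Status"].value_counts(normalize=True)
print(class_share)
```

Columns with missing entries are the ones the imputers in the preprocessing pipeline need to cover.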


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# !wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Ejs0yaRm3NxFOVIhwQphoDz8voJl6NQx' -O loanpred_train.csv

In [None]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1prsz-DTYi9cfnzHiPU--Nr2pT-832pTg' -O loanpred_train.csv

In [None]:
# Let's define the "random_state" to ensure reproducible results:
random_state=42

In [None]:
# Let's change the font of Matplotlib plots:
plt.rc('font', family='serif', size=12)

In [None]:
# Let's load the data:
data = pd.read_csv('loanpred_train.csv')
data

In [None]:
# Let's check the dataset description:
print("Dataset Shape:", data.shape)
print("\nDataset Info:")
data.info()
print("\nDataset Description:")
print(data.describe())  # a bare expression in the middle of a cell is not displayed
print("\nMissing Values:")
print(data.isnull().sum())
print("\nTarget Variable Distribution:")
print(data['Loan_Status'].value_counts())
print(data['Loan_Status'].value_counts(normalize=True))


In [None]:
# Let's check the distribution of the columns
import warnings
warnings.filterwarnings('ignore')

# Create subplots for visualizations
fig, axes = plt.subplots(3, 4, figsize=(20, 15))
fig.suptitle('Data Distribution Analysis', fontsize=16, fontweight='bold')

# Categorical variables
categorical_cols = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
for i, col in enumerate(categorical_cols):
    row = i // 4
    col_idx = i % 4
    data[col].value_counts().plot(kind='bar', ax=axes[row, col_idx], color='skyblue')
    axes[row, col_idx].set_title(f'Distribution of {col}')
    axes[row, col_idx].tick_params(axis='x', rotation=45)

# Numerical variables
numerical_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']
for i, col in enumerate(numerical_cols):
    row = (i + len(categorical_cols)) // 4
    col_idx = (i + len(categorical_cols)) % 4
    if row < 3 and col_idx < 4:  # Make sure we don't exceed subplot limits
        data[col].hist(bins=20, ax=axes[row, col_idx], color='lightcoral', alpha=0.7)
        axes[row, col_idx].set_title(f'Distribution of {col}')

plt.tight_layout()
plt.show()

# Correlation matrix
plt.figure(figsize=(12, 8))
# Select only numerical columns for correlation
numerical_data = data.select_dtypes(include=[np.number])
correlation_matrix = numerical_data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.show()


## Data Preprocessing Pipeline

Before modeling, we preprocess the dataset by handling missing values, encoding categorical variables, and scaling numerical features. Tree-based ensembles do not require scaling, but it is harmless here and keeps the pipeline reusable with other model types. This ensures that the data is clean and properly formatted for training machine learning models.
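
As a minimal sketch of how imputation, scaling, and one-hot encoding combine, here is a two-column toy example. The frame and the names `toy`, `numeric`, `categorical`, and `toy_preprocessor` are illustrative assumptions; unlike the notebook's pipeline, this sketch keeps all one-hot columns (no `drop='first'`) so the output is easier to read.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 200.0, 300.0],
    "Property_Area": ["Urban", "Rural", np.nan, "Urban"],
})

# Numerical branch: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
# Categorical branch: fill gaps with the most frequent label, then one-hot encode.
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(
    [("num", numeric, ["LoanAmount"]), ("cat", categorical, ["Property_Area"])],
    sparse_threshold=0.0,  # always return a dense array
)
toy_matrix = toy_preprocessor.fit_transform(toy)
print(toy_matrix.shape)
print(toy_matrix)
```

The missing loan amount is replaced by the median (200), which standardizes to 0, and the missing property area is filled with the most frequent label before encoding.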


In [None]:
# Let's complete the data analysis stage and define the preprocessing pipeline
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
import joblib

# Separate features and target
X = data.drop(['id', 'Loan_Status'], axis=1)
y = data['Loan_Status']

# Encode target variable (Y=1, N=0)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print("Target encoding:", dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))

# Identify categorical and numerical columns
categorical_features = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area']
numerical_features = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']

print(f"Categorical features: {categorical_features}")
print(f"Numerical features: {numerical_features}")

# Create preprocessing pipelines
# For numerical features: impute missing values with median and scale
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# For categorical features: impute missing values with most frequent and one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

print("Preprocessing pipeline created successfully!")

# FEATURE ENGINEERING - Add new features to improve predictions
def add_engineered_features(df):
    """Add engineered features based on domain knowledge"""
    df_new = df.copy()
    
    # 1. Total Income (most important feature)
    df_new['TotalIncome'] = df_new['ApplicantIncome'] + df_new['CoapplicantIncome']
    
    # 2. Loan Amount to Income Ratio (critical for loan approval)
    df_new['LoanAmountToIncome'] = df_new['LoanAmount'] / (df_new['TotalIncome'] + 1)  # +1 to avoid division by zero
    
    # 3. Income per dependent
    df_new['IncomePerDependent'] = df_new['TotalIncome'] / (df_new['Dependents'].astype(str).replace('3+', '3').astype(float) + 1)
    
    # 4. Loan term in years
    df_new['LoanTermYears'] = df_new['Loan_Amount_Term'] / 12
    
    # 5. Monthly EMI approximation (Loan Amount / Term in months)
    df_new['MonthlyEMI'] = df_new['LoanAmount'] / (df_new['Loan_Amount_Term'] + 1)
    
    # 6. EMI to Income ratio
    df_new['EMIToIncomeRatio'] = (df_new['MonthlyEMI'] * 1000) / (df_new['TotalIncome'] + 1)  # *1000 because LoanAmount is in thousands
    
    # 7. Binary features
    df_new['HasCoapplicant'] = (df_new['CoapplicantIncome'] > 0).astype(int)
    df_new['HasDependents'] = (df_new['Dependents'].astype(str) != '0').astype(int)
    
    return df_new

# Apply feature engineering
print("Adding engineered features...")
X_engineered = add_engineered_features(X)

# Update feature lists
additional_numerical_features = ['TotalIncome', 'LoanAmountToIncome', 'IncomePerDependent', 
                               'LoanTermYears', 'MonthlyEMI', 'EMIToIncomeRatio', 
                               'HasCoapplicant', 'HasDependents']

numerical_features_extended = numerical_features + additional_numerical_features

print(f"Original features: {len(numerical_features + categorical_features)}")
print(f"With engineered features: {len(numerical_features_extended + categorical_features)}")

# Update preprocessing pipeline with new features
numerical_transformer_extended = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

preprocessor_extended = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer_extended, numerical_features_extended),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_engineered, y_encoded, test_size=0.2, random_state=random_state, stratify=y_encoded
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"Training target distribution: {np.bincount(y_train)}")
print(f"Test target distribution: {np.bincount(y_test)}")

# Update the global preprocessor variable
preprocessor = preprocessor_extended


# Model Training and Evaluation  

We train machine learning models, such as Random Forest, Gradient Boosting or XGBoost, to predict loan approval. The models are evaluated using appropriate metrics, and hyperparameter tuning is performed to optimize their performance.

In this section, we define the steps for training and evaluating the models.  

## Steps:  
1. **Define the hyperparameters**: Set initial values for model parameters.  
2. **Choose the cross-validation strategy**: Split the dataset into training and validation sets using an appropriate method.  
3. **Train the model**: Fit the model on training data using the defined hyperparameters.  
4. **Evaluate performance**: Use cross-validation to assess the model’s predictive ability.  
5. **Tune hyperparameters (if necessary)**: Optimize parameters for better performance.  

## Hyperparameters  
We define key hyperparameters for Random Forest and Gradient Boosting models, such as:  
- **n_estimators**: Number of trees in the ensemble.  
- **max_depth**: Maximum depth of each tree.  
- **learning_rate** (for boosting models): Shrinks the contribution of each new tree; smaller values usually need more trees.  
- **min_samples_split**: Minimum samples required to split a node.  
- **min_samples_leaf**: Minimum samples required in a leaf node.  
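
To get a feel for how quickly a hyperparameter grid grows, the sketch below enumerates every combination of a small, made-up grid. The values and the fold count are assumptions for illustration, not the grids used later in this notebook.

```python
from itertools import product

# Hypothetical small grid; the values are illustrative only.
toy_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2],
}

# Every combination of values is one candidate configuration.
names = list(toy_grid)
candidates = [dict(zip(names, values)) for values in product(*toy_grid.values())]

n_folds = 5  # assumed number of cross-validation folds
total_fits = len(candidates) * n_folds
print(len(candidates), "candidates ->", total_fits, "model fits")
```

Each extra hyperparameter multiplies the number of candidates, so an exhaustive grid search becomes expensive quickly.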

## Cross-Validation Strategy  
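To ensure reliable model evaluation, we use **stratified K-fold cross-validation**, which splits the dataset into **K** subsets (folds) that each preserve the class proportions of the target. The model is trained on **K-1** folds and validated on the remaining fold, and the process is repeated **K** times so that every fold serves once as the validation set. Averaging the **K** scores gives a less noisy estimate of generalization performance than a single train/validation split.  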
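
A minimal sketch of the fold construction, assuming made-up labels (15 approved, 5 rejected) rather than the real target column:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 15 approved (1) and 5 rejected (0), made up for illustration.
toy_y = np.array([1] * 15 + [0] * 5)
toy_X = np.zeros((len(toy_y), 1))  # feature values do not affect how folds are drawn

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
fold_positive_rates = []
for train_idx, val_idx in splitter.split(toy_X, toy_y):
    fold_sizes.append(len(val_idx))
    fold_positive_rates.append(toy_y[val_idx].mean())

print(fold_sizes)
print(fold_positive_rates)
```

Each validation fold keeps the overall 75% approval share, which is the property that matters when the target classes are imbalanced.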




In [None]:
# Let's define the cross_validation
from sklearn.model_selection import GridSearchCV

# Try to import XGBoost for potentially better performance
try:
    from xgboost import XGBClassifier
    xgb_available = True
    print("XGBoost is available - adding to model comparison")
except ImportError:
    xgb_available = False
    print("XGBoost not available - using only RandomForest and GradientBoosting")

# Define cross-validation strategy
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

# Define base models with initial parameter grids
# (the tuning cell further down defines its own, wider grids)
models = {
    'RandomForest': {
        'model': RandomForestClassifier(random_state=random_state),
        'params': {
            'classifier__n_estimators': [100, 200, 300],
            'classifier__max_depth': [3, 5, 7, None],
            'classifier__min_samples_split': [2, 5, 10],
            'classifier__min_samples_leaf': [1, 2, 4]
        }
    },
    'GradientBoosting': {
        'model': GradientBoostingClassifier(random_state=random_state),
        'params': {
            'classifier__n_estimators': [100, 200, 300],
            'classifier__learning_rate': [0.05, 0.1, 0.2],
            'classifier__max_depth': [3, 5, 7],
            'classifier__min_samples_split': [2, 5, 10],
            'classifier__min_samples_leaf': [1, 2, 4]
        }
    }
}

# Add XGBoost if available
if xgb_available:
    models['XGBoost'] = {
        'model': XGBClassifier(random_state=random_state, eval_metric='logloss'),
        'params': {
            'classifier__n_estimators': [100, 200, 300],
            'classifier__learning_rate': [0.05, 0.1, 0.2],
            'classifier__max_depth': [3, 4, 5, 6],
            'classifier__min_child_weight': [1, 3, 5],
            'classifier__subsample': [0.8, 0.9, 1.0],
            'classifier__colsample_bytree': [0.8, 0.9, 1.0]
        }
    }

# Function to evaluate models with cross-validation
def evaluate_model_cv(model, X, y, cv_strategy, preprocessor):
    """Evaluate model using cross-validation"""
    # Create pipeline with preprocessing and model
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    # Perform cross-validation
    cv_scores = cross_val_score(pipeline, X, y, cv=cv_strategy, scoring='accuracy')
    
    return cv_scores

# Quick evaluation of base models
print("=== Base Model Cross-Validation Results ===")
for model_name, model_info in models.items():
    model = model_info['model']
    cv_scores = evaluate_model_cv(model, X_train, y_train, cv_strategy, preprocessor)
    
    print(f"\n{model_name}:")
    print(f"CV Scores: {cv_scores}")
    print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

print("\nCross-validation strategy defined successfully!")

# Finalizing the Model: Training a Full Pipeline  

Once the best model parameters have been selected, it is essential to **train a complete pipeline** that explicitly separates **data preprocessing** and **model training**. This ensures that preprocessing steps are consistently applied to both training and unseen data.  

## Steps:  
1. **Train the data preprocessing pipeline**:  
   - Handle missing values.  
   - Encode categorical features.  
   - Scale numerical features (if necessary).  

2. **Train the classification pipeline**:  
   - Fit the selected model, with its best hyperparameters, on the entire preprocessed training set.  

3. **Save the trained pipelines**:  
   - The preprocessing and classification models should be saved for deployment and inference.  

By structuring the pipeline this way, we maintain consistency between training and real-world predictions while ensuring that preprocessing does not introduce **data leakage**.  
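
The idea can be sketched with a small synthetic example. The data, the names `toy_pipeline` and `restored`, and the file name are assumptions for illustration; the point is that the scaler and the classifier are fitted together on training rows only and are saved and restored as one object.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for illustration.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(40, 3))
toy_y = (toy_X[:, 0] > 0).astype(int)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])
# Fitting on the first 30 rows only: the scaler never sees the held-out rows.
toy_pipeline.fit(toy_X[:30], toy_y[:30])

# Save and reload the whole pipeline as a single artifact.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(
    toy_pipeline.predict(toy_X[30:]), restored.predict(toy_X[30:])
)
print(same_predictions)
```

Because preprocessing lives inside the pipeline, the restored object applies exactly the training-time transformations to new rows.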


In [None]:
# Let's train the full pipelines
print("=== Training Models with Hyperparameter Tuning ===")
print("This may take a long time: each grid below has hundreds of candidates, and each is fitted once per CV fold...")

best_models = {}
grid_search_results = {}

for model_name, model_info in models.items():
    print(f"\nTraining {model_name}...")
    
    # Create pipeline
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model_info['model'])
    ])
    
    # Perform grid search with cross-validation
    # Improved parameter grids for better performance
    if model_name == 'RandomForest':
        param_grid = {
            'classifier__n_estimators': [200, 300, 500],
            'classifier__max_depth': [7, 10, 15, None],
            'classifier__min_samples_split': [2, 3, 5],
            'classifier__min_samples_leaf': [1, 2, 3],
            'classifier__max_features': ['sqrt', 'log2'],
            'classifier__bootstrap': [True]
        }
    elif model_name == 'GradientBoosting':
        param_grid = {
            'classifier__n_estimators': [200, 300, 400],
            'classifier__learning_rate': [0.05, 0.1, 0.15],
            'classifier__max_depth': [3, 4, 5, 6],
            'classifier__min_samples_split': [2, 5, 10],
            'classifier__min_samples_leaf': [1, 2, 4],
            'classifier__subsample': [0.8, 0.9, 1.0]
        }
    else:  # XGBoost
        param_grid = {
            'classifier__n_estimators': [200, 300, 400],
            'classifier__learning_rate': [0.05, 0.1, 0.15],
            'classifier__max_depth': [3, 4, 5, 6],
            'classifier__min_child_weight': [1, 3, 5],
            'classifier__subsample': [0.8, 0.9],
            'classifier__colsample_bytree': [0.8, 0.9]
        }
    
    # Grid search
    grid_search = GridSearchCV(
        pipeline, 
        param_grid, 
        cv=cv_strategy, 
        scoring='accuracy', 
        n_jobs=-1,
        verbose=1
    )
    
    # Fit the model
    grid_search.fit(X_train, y_train)
    
    # Store results
    best_models[model_name] = grid_search.best_estimator_
    grid_search_results[model_name] = {
        'best_score': grid_search.best_score_,
        'best_params': grid_search.best_params_,
        'cv_results': grid_search.cv_results_
    }
    
    print(f"Best CV Score: {grid_search.best_score_:.4f}")
    print(f"Best Parameters: {grid_search.best_params_}")

# Compare models
print("\n=== Model Comparison ===")
for model_name, results in grid_search_results.items():
    print(f"{model_name}: {results['best_score']:.4f}")

# Select the best model
best_model_name = max(grid_search_results.keys(), key=lambda k: grid_search_results[k]['best_score'])
best_pipeline = best_models[best_model_name]

print(f"\nBest overall model: {best_model_name}")
print(f"Best CV score: {grid_search_results[best_model_name]['best_score']:.4f}")

# Evaluate on test set
y_pred = best_pipeline.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_roc_auc = roc_auc_score(y_test, best_pipeline.predict_proba(X_test)[:, 1])

print(f"\nTest Set Performance:")
print(f"Accuracy: {test_accuracy:.4f}")
print(f"ROC-AUC: {test_roc_auc:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Rejected', 'Approved']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Rejected', 'Approved'], yticklabels=['Rejected', 'Approved'])
plt.title(f'Confusion Matrix - {best_model_name}')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Applying the Pipelines for Prediction  

With the trained **data preprocessing** and **classification pipelines**, we can now apply them to the competition test dataset (a separate file from the hold-out split used for evaluation above) to generate predictions.  

## Steps:  
1. **Load the test dataset**: Ensure it has the same structure as the training data.  
2. **Apply the preprocessing pipeline**: Transform the test data using the trained preprocessing steps (e.g., encoding, scaling).  
3. **Make predictions**: Use the trained classification pipeline to predict loan approval outcomes.  
4. **Save or submit predictions**: Store the results for further analysis or competition submission.  

This structured approach ensures consistency and avoids data leakage, making the model reliable for real-world applications.  
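
The steps above can be sketched on a toy frame. The column list, ids, and predictions below are hypothetical; in the notebook the predictions come from the fitted pipeline rather than being assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical column list and rows; none of these values come from the real files.
train_feature_columns = ["Gender", "LoanAmount", "Credit_History"]
toy_test = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "Gender": ["Male", "Female", "Male", "Male"],
    "LoanAmount": [120.0, 80.0, 150.0, 95.0],
    "Credit_History": [1.0, 0.0, 1.0, 1.0],
})

# Step 1: confirm the test file carries every feature the pipeline was trained on.
missing_columns = [c for c in train_feature_columns if c not in toy_test.columns]

# Steps 2-3 would call the fitted pipeline; here the encoded predictions are assumed.
toy_predictions = np.array([1, 0, 1, 1])  # 1 = approved, 0 = rejected

# Step 4: build the submission table with one row per test id.
toy_submission = pd.DataFrame({"id": toy_test["id"], "pred": toy_predictions})
approval_rate = toy_submission["pred"].mean()
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```

Checking for missing columns before predicting gives a clearer error than letting the pipeline fail partway through a transform.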

In [None]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1MMxV5EzWu4xI8i4LoBjK-q-94RsQq-mE' -O loanpred_test.csv

In [None]:
#Let's load the test data
test_df = pd.read_csv('loanpred_test.csv')
test_df.head(5)


In [None]:
# Let's make predictions
print("=== Making Predictions on Test Data ===")

# Check test data structure
print("Test data shape:", test_df.shape)
print("Test data columns:", test_df.columns.tolist())

# Prepare test features (remove id column - note it's 'id' not 'Loan_ID' in test data)
X_competition_test_raw = test_df.drop(['id'], axis=1)

# Apply the same feature engineering to test data
print("Applying feature engineering to test data...")
X_competition_test = add_engineered_features(X_competition_test_raw)

# Make predictions using the best pipeline
test_predictions = best_pipeline.predict(X_competition_test)
test_probabilities = best_pipeline.predict_proba(X_competition_test)

# Convert predictions back to original labels (Y/N)
test_predictions_labels = label_encoder.inverse_transform(test_predictions)

print(f"Predictions made for {len(test_predictions)} samples")
print(f"Prediction distribution:")
unique, counts = np.unique(test_predictions_labels, return_counts=True)
for label, count in zip(unique, counts):
    print(f"  {label}: {count} ({count/len(test_predictions_labels)*100:.1f}%)")

# Show first few predictions with probabilities
print("\nFirst 10 predictions:")
for i in range(min(10, len(test_predictions))):
    loan_id = test_df.iloc[i]['id']  # Use 'id' column name
    prediction = test_predictions_labels[i]
    prob_approved = test_probabilities[i][1]
    print(f"  {loan_id}: {prediction} (Prob. Approved: {prob_approved:.3f})")

print("Predictions completed successfully!")

In [None]:
# Save the results to a CSV file following the competition template
# Convert Y/N predictions to 1/0 format as required by competition
test_predictions_binary = np.where(test_predictions_labels == 'Y', 1, 0)

# Create submission DataFrame with correct format
submission = pd.DataFrame({
    'id': test_df['id'],  # Use 'id' column name from test data
    'pred': test_predictions_binary  # Use 'pred' column name and binary format (1/0)
})

# Save to CSV
submission_filename = 'loan_prediction_submission.csv'
submission.to_csv(submission_filename, index=False)

print(f"=== Submission File Created ===")
print(f"File saved as: {submission_filename}")
print(f"Submission shape: {submission.shape}")
print("\nFirst few rows of submission:")
print(submission.head(10))

print(f"\nSubmission summary:")
print(submission['pred'].value_counts())
print(f"Approval rate: {(submission['pred'] == 1).mean()*100:.1f}%")
print(f"\nFormat: 1 = Approved, 0 = Not Approved")

# Also save the model for future use
model_filename = f'best_loan_prediction_model_{best_model_name.lower()}.joblib'
joblib.dump(best_pipeline, model_filename)
print(f"\nBest model saved as: {model_filename}")

print("Submission ready for competition!")

# Explaining the Model with SHAP  

Understanding how a machine learning model makes predictions is crucial, especially in applications like loan approval, where fairness and transparency are key. **SHAP (SHapley Additive Explanations)** provides a way to interpret the contribution of each feature to a model’s predictions.  

## Why is SHAP Important?  
1. **Improves Trust and Transparency**: Helps explain why a loan was approved or rejected, making the decision process clearer.  
2. **Identifies Key Features**: Highlights which factors influence predictions the most, allowing for better feature selection and model refinement.  
3. **Detects Bias and Unfairness**: Reveals if certain features (e.g., gender, income) have unintended strong effects on decisions.  
4. **Enhances Model Debugging**: Helps diagnose issues like overfitting or unexpected feature dependencies.  

SHAP makes the model's behavior easier to inspect, which helps us check it against ethical and regulatory expectations; it does not by itself guarantee fairness.  
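
To make the idea of a feature "contribution" concrete, the sketch below computes exact Shapley values for a hypothetical two-feature scoring function by averaging each feature's marginal effect over all feature orderings. The function `toy_score`, the baseline, and the instance are assumptions for illustration. The SHAP library works with expectations over background data and uses faster algorithms for trees, so this is a simplification of the concept rather than what `shap` runs internally.

```python
from itertools import permutations

# Hypothetical scoring function, not the trained model: linear terms plus an interaction.
def toy_score(x):
    return (0.5 * x["credit_history"]
            + 0.2 * x["income"]
            + 0.1 * x["credit_history"] * x["income"])

baseline = {"credit_history": 0.0, "income": 0.0}   # assumed reference applicant
instance = {"credit_history": 1.0, "income": 2.0}   # applicant being explained

def coalition_value(present):
    """Score when only the features in `present` take the instance's values."""
    mixed = {name: (instance[name] if name in present else baseline[name])
             for name in baseline}
    return toy_score(mixed)

features = list(baseline)
orderings = list(permutations(features))
shapley = {name: 0.0 for name in features}

# Average each feature's marginal contribution over every ordering.
for order in orderings:
    present = set()
    for name in order:
        before = coalition_value(present)
        present.add(name)
        shapley[name] += (coalition_value(present) - before) / len(orderings)

total_effect = toy_score(instance) - toy_score(baseline)
print(shapley, total_effect)
```

The contributions add up to the difference between the instance's score and the baseline score, which is the additivity property that lets SHAP plots decompose a single prediction.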

In [None]:
# Apply shap to explain
print("=== SHAP Model Explanation ===")

# Install and import SHAP
try:
    import shap
    print("SHAP library is available")
except ImportError:
    print("Installing SHAP library...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "shap"])
    import shap
    print("SHAP library installed and imported")

# Initialize SHAP explainer
print("Creating SHAP explainer...")

# Get the preprocessed training data for the explainer
# (the preprocessor is already fitted, so only transform; do not refit it)
X_train_processed = best_pipeline.named_steps['preprocessor'].transform(X_train)

# Get feature names after preprocessing
feature_names = []

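# Numerical features keep their names; use the extended list because the
# fitted preprocessor also includes the engineered features
feature_names.extend(numerical_features_extended)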

# Categorical features (one-hot encoded)
cat_feature_names = best_pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features)
feature_names.extend(cat_feature_names)

print(f"Total features after preprocessing: {len(feature_names)}")

# Create explainer for the classifier only (after preprocessing)
classifier = best_pipeline.named_steps['classifier']

# All candidate models (Random Forest, Gradient Boosting, XGBoost) are tree
# ensembles, so TreeExplainer applies whichever one was selected
explainer = shap.TreeExplainer(classifier)
print(f"Using TreeExplainer for {best_model_name}")

# Calculate SHAP values for a sample of the test set
sample_size = min(50, len(X_test))
X_test_sample = X_test.iloc[:sample_size]
X_test_processed_sample = best_pipeline.named_steps['preprocessor'].transform(X_test_sample)

print(f"Calculating SHAP values for {sample_size} test samples...")
shap_values = explainer.shap_values(X_test_processed_sample)

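# For binary classification, keep the values for the positive class (approved loans).
# Depending on the SHAP version and model, the result may be a list with one array
# per class or a single 3-D array of shape (samples, features, classes).
if isinstance(shap_values, list):
    shap_values = shap_values[1]  # Positive class
elif getattr(shap_values, 'ndim', 2) == 3:
    shap_values = shap_values[:, :, 1]  # Positive class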

# Create SHAP plots
print("Creating SHAP visualizations...")

# Summary plot
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_test_processed_sample, feature_names=feature_names, show=False)
plt.title('SHAP Summary Plot - Feature Importance for Loan Approval')
plt.tight_layout()
plt.show()

# Feature importance plot
plt.figure(figsize=(10, 6))
shap.summary_plot(shap_values, X_test_processed_sample, feature_names=feature_names, plot_type="bar", show=False)
plt.title('SHAP Feature Importance - Mean |SHAP value|')
plt.tight_layout()
plt.show()

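# Show SHAP values for individual predictions
# The SHAP values were computed on the hold-out split (X_test), so the ids and
# predictions must come from those same rows, not from the competition file
print("\n=== Individual Prediction Explanations ===")
sample_ids = data.loc[X_test_sample.index, 'id'].to_numpy()
sample_labels = label_encoder.inverse_transform(best_pipeline.predict(X_test_sample))
sample_probs = best_pipeline.predict_proba(X_test_sample)[:, 1]
for i in range(min(5, sample_size)):
    loan_id = sample_ids[i]
    prediction = sample_labels[i]
    prob_approved = sample_probs[i]
    
    print(f"\nLoan {loan_id} - Predicted: {prediction} (Prob: {prob_approved:.3f})")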
    
    # Get top 5 most important features for this prediction
    feature_importance = list(zip(feature_names, shap_values[i]))
    feature_importance.sort(key=lambda x: abs(x[1]), reverse=True)
    
    print("Top 5 contributing features:")
    for j, (feature, shap_val) in enumerate(feature_importance[:5]):
        direction = "→ APPROVE" if shap_val > 0 else "→ REJECT"
        print(f"  {j+1}. {feature}: {shap_val:.4f} {direction}")

# Waterfall plot for first prediction
if sample_size > 0:
    plt.figure(figsize=(12, 8))
    
    # Handle base values correctly for binary classification
    base_value = explainer.expected_value
    if isinstance(base_value, (list, np.ndarray)):
        base_value = base_value[1] if len(base_value) > 1 else base_value[0]
    
    shap.waterfall_plot(shap.Explanation(values=shap_values[0], 
                                       base_values=base_value, 
                                       data=X_test_processed_sample[0],
                                       feature_names=feature_names), 
                       show=False)
    # The explained row comes from the hold-out split, so look up its id in the training data
    plt.title(f'SHAP Waterfall Plot - Loan {data.loc[X_test_sample.index[0], "id"]}')
    plt.tight_layout()
    plt.show()

print("\n=== SHAP Analysis Complete ===")
print("Key insights:")
print("- SHAP values show how each feature contributes to the final prediction")
print("- Positive SHAP values push towards loan approval")
print("- Negative SHAP values push towards loan rejection")
print("- The magnitude indicates the strength of the contribution")