# Submission by Priyadarshan for the Franklin Templeton Assignment, Copyright 2024
# Loan Data Classification with Multiple Algorithms

The code is designed to analyze a dataset related to loans (real estate financing) and predict whether a borrower will default on their loan. It uses various machine learning models to make these predictions based on historical data.

The notebook demonstrates a complete machine learning workflow using loan data. We will:
- Load and preprocess the data
- Train several machine learning models
- Evaluate model performance using key metrics
- Use SHAP (SHapley Additive exPlanations) to explain the model predictions

The workflow is organized into the following stages:

- Data Preparation: Start with thorough data cleaning and preprocessing to ensure quality input data.
- Feature Engineering: Create meaningful features that can help improve model predictions.
- Model Training and Evaluation: Use a variety of models to see which performs best on validation data.
- Hyperparameter Tuning and Cross-Validation: Optimize models through tuning while validating their performance across different subsets of data.
- Final Model Selection: Choose the best-performing model based on comprehensive evaluation metrics.


## Step 1: Import Necessary Libraries

We will start by importing the necessary Python libraries, including:
- `pandas` and `numpy` for data manipulation.
- `scikit-learn` for model training and evaluation.
- `shap` for model explainability.


In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    AdaBoostClassifier,
    BaggingClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier  # Requires xgboost package
from lightgbm import LGBMClassifier  # Requires lightgbm package
from catboost import CatBoostClassifier  # Requires catboost package
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
    f1_score
)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import shap
import warnings
import time

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

ModuleNotFoundError: No module named 'sklearn'
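
The `ModuleNotFoundError` above means scikit-learn is not installed in the active kernel (its import name is `sklearn`, but its pip name is `scikit-learn`). A minimal sketch for checking which third-party packages are missing before running the imports; the helper name `find_missing_packages` and the `REQUIRED` table are illustrative and not part of the original notebook.

```python
import importlib.util

# Import name -> pip package name (they differ for scikit-learn).
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "sklearn": "scikit-learn",
    "xgboost": "xgboost",
    "lightgbm": "lightgbm",
    "catboost": "catboost",
    "shap": "shap",
    "openpyxl": "openpyxl",  # default pandas engine for writing .xlsx files
}

def find_missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

missing = find_missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

Running this in the same kernel as the notebook shows exactly which packages to install before re-running Step 1.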

## Step 2: Load Data

Here, we load the loan data from two CSV files:
- **MPLCaseStudy.csv**: Contains the loan data.
- **mapping.csv**: Contains a mapping of column names to their descriptions.

We will also rename the columns based on the descriptions provided in `mapping.csv`.
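
As a toy illustration of that renaming, the sketch below builds a `{short name: description}` dictionary and applies it with `rename`. The frames and the column names `loan_amnt` and `int_rate` are made-up stand-ins, not the real contents of the CSV files. It also checks for duplicate column names, since descriptions are not guaranteed to be unique.

```python
import pandas as pd

# Toy stand-ins for MPLCaseStudy.csv and mapping.csv (values are made up).
loan_data_demo = pd.DataFrame({"loan_amnt": [1000, 2500], "int_rate": ["10.5%", "7.2%"]})
mapping_demo = pd.DataFrame({
    "LoanStatNew": ["loan_amnt", "int_rate", "not_in_data"],
    "Description": ["Loan amount", "Interest rate", "Unused description"],
})

# Build {short_name: description}; keys absent from the data are ignored by rename().
mapping_dict_demo = dict(zip(mapping_demo["LoanStatNew"], mapping_demo["Description"]))
renamed = loan_data_demo.rename(columns=mapping_dict_demo)

# Two short names could share a description, so check for duplicate column names.
duplicates = renamed.columns[renamed.columns.duplicated()].tolist()
print(list(renamed.columns), duplicates)
```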


In [None]:
# Load the datasets
loan_data = pd.read_csv('MPLCaseStudy.csv', low_memory=False)
mapping = pd.read_csv('mapping.csv')

# Apply column renaming from mapping.csv
mapping_dict = dict(zip(mapping['LoanStatNew'], mapping['Description']))
loan_data.rename(columns=mapping_dict, inplace=True)




## Step 3: Data Preprocessing

We will now preprocess the data by:
- Converting percentage strings to floats
- Handling date columns and creating a new feature `years_since_earliest_cr_line`.
- Dropping irrelevant date columns.
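
The steps listed above can be sketched on a toy frame with vectorized pandas calls. The column names `util_rate` and `earliest_cr_line`, the `%b-%Y` date format, and the fixed reference date are assumptions made for illustration; the notebook itself uses the long descriptive column names and `pd.Timestamp.now()`.

```python
import pandas as pd

df = pd.DataFrame({
    "util_rate": ["45.3%", "7%", None],                   # hypothetical percentage column
    "earliest_cr_line": ["Jan-2005", "Mar-2010", "bad"],  # hypothetical date column
})

# Percent strings -> fractions; missing or non-numeric entries become NaN.
df["util_rate"] = pd.to_numeric(df["util_rate"].str.rstrip("%"), errors="coerce") / 100

# Parse month-year strings; unparseable entries become NaT.
dates = pd.to_datetime(df["earliest_cr_line"], format="%b-%Y", errors="coerce")

# A fixed reference date keeps the feature reproducible across runs.
reference_date = pd.Timestamp("2024-01-01")
df["years_since_earliest_cr_line"] = (reference_date - dates).dt.days / 365

# Drop the original date column once the derived feature exists.
df = df.drop(columns=["earliest_cr_line"])
```

Using a fixed reference date instead of the current time means the feature values do not drift each time the notebook is re-run.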


In [None]:
# Function to convert percentage strings to floats
def convert_percent_to_float(x):
    if isinstance(x, str) and '%' in x:
        return float(x.strip('%')) / 100
    return x

# Convert date columns to datetime format and create new features
date_columns = ['The month the borrower\'s earliest reported credit line was opened',
                'The month which the loan was funded']
for col in date_columns:
    loan_data[col] = pd.to_datetime(loan_data[col], errors='coerce')

loan_data['years_since_earliest_cr_line'] = loan_data['The month the borrower\'s earliest reported credit line was opened'].apply(
    lambda x: (pd.Timestamp.now() - x).days / 365 if pd.notnull(x) else None)

# Drop original date columns
loan_data.drop(columns=date_columns, inplace=True)

# Convert percentage columns to float values
percentage_columns = [
    'Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.',
    'Percentage of all bankcard accounts > 75% of limit.',
    'Balance to credit limit on all trades'
]
for col in percentage_columns:
    loan_data[col] = loan_data[col].apply(convert_percent_to_float)


## Step 4: Handle Missing Values and Encode Categorical Columns

Next, we will:
- Handle missing values in numeric columns by using median imputation.
- Remove any columns that have all missing values.
- Encode categorical columns using `LabelEncoder`.
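
Two pitfalls are worth keeping in mind with these steps. `DataFrame.update` aligns on the index, so the imputed frame should carry the original index. And calling `astype(str)` before label encoding turns a missing target into the string `'nan'`, which then looks like an ordinary class. A minimal sketch on made-up data, where the column names and the `loan_status_encoded` column are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Toy frame with a non-default index and a partly missing target (values are made up).
df = pd.DataFrame(
    {"income": [50.0, np.nan, 70.0, 90.0],
     "loan_status": ["Paid", None, "Default", "Paid"]},
    index=[10, 11, 12, 13],
)

# Record which targets are missing BEFORE any string conversion or encoding.
missing_target = df["loan_status"].isnull()

# Impute numeric columns, keeping the original index so values line up on assignment.
numeric_cols = df.select_dtypes(include=[np.number]).columns
imputer = SimpleImputer(strategy="median")
df[numeric_cols] = pd.DataFrame(
    imputer.fit_transform(df[numeric_cols]), columns=numeric_cols, index=df.index
)

# Encode only the labelled rows; unlabelled rows stay NaN in the encoded column.
encoder = LabelEncoder()
labelled_rows = df.loc[~missing_target, "loan_status"]
df["loan_status_encoded"] = pd.Series(
    encoder.fit_transform(labelled_rows), index=labelled_rows.index
)
```

Keeping `missing_target` around makes it possible to separate rows to train on from rows to predict later.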


In [None]:
# Handle missing 'loan_status' column
if 'loan_status' not in loan_data.columns:
    loan_data.rename(columns={'Current status of the loan': 'loan_status'}, inplace=True)

# Identify numeric columns that need imputation
numeric_columns = loan_data.select_dtypes(include=[np.number]).columns

# Remove columns with all missing values
loan_data_numeric = loan_data[numeric_columns].dropna(axis=1, how='all')

# Preprocessing - Handling missing values for numeric columns using SimpleImputer
imputer = SimpleImputer(strategy='median')
loan_data_imputed = pd.DataFrame(imputer.fit_transform(loan_data_numeric), columns=loan_data_numeric.columns)

# Reassign imputed columns back to original DataFrame
loan_data.update(loan_data_imputed)

# Encode categorical columns
label_encoders = {}
for column in loan_data.select_dtypes(include=[object]):
    label_encoders[column] = LabelEncoder()
    loan_data[column] = label_encoders[column].fit_transform(loan_data[column].astype(str))



## Step 5: Split Data for Training and Testing

We will now split the data into training and testing sets with an 80/20 split, keeping the `loan_status` column as the target.


In [None]:
# Split data into features and target variable, then into train and test sets
X = loan_data.drop(columns=['loan_status'])
y = loan_data['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
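
The split above is not stratified, so class proportions in the test set can drift from those in the full data when classes are imbalanced. A hedged sketch of a stratified variant on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 800 samples of class 1, 100 each of classes 0 and 2.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 800 + [0] * 100 + [2] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=y_demo, each class keeps its overall share in both splits.
print(np.bincount(y_te) / len(y_te))
```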


## Step 6: Initialize Machine Learning Models

We will initialize several machine learning models to compare their performance:
- Random Forest
- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine (SVM)
- Gradient Boosting
- Decision Tree
- AdaBoost
- Bagging
- XGBoost
- LightGBM
- CatBoost


In [None]:
# Initialize a list of models to compare (including new models)
models = {
    'RandomForest': RandomForestClassifier(),
    'LogisticRegression': LogisticRegression(max_iter=500),
    'KNN': KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree'),  # Use BallTree for efficiency
    'SVM': SVC(probability=True),
    'GradientBoosting': GradientBoostingClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'Bagging': BaggingClassifier(),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'),  # Suppress warning for XGBoost 
    'LightGBM': LGBMClassifier(),
    'CatBoost': CatBoostClassifier(silent=True)  # Suppress output for CatBoost training 
}


## Step 7: Define Function to Evaluate Models

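This function will:
- Train a model on the training data and time the fit.
- Skip evaluation if fitting took longer than `timeout` seconds (the check runs after `fit` returns, so it does not interrupt a slow fit).
- Predict results on the test data.
- Print performance metrics: accuracy, confusion matrix, classification report, ROC AUC score, and weighted F1 score.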


In [None]:
def evaluate_model_with_timeout(model, X_train, X_test, y_train, y_test, timeout=60):
    """Fit the model and print evaluation metrics; skip the metrics if fitting took longer than `timeout` seconds."""
    start_time = time.time()
    
    try:
        model.fit(X_train, y_train)
        elapsed_time = time.time() - start_time
        
        if elapsed_time > timeout:
            print(f"Model {model.__class__.__name__} fitting exceeded timeout of {timeout} seconds.")
            return
        
        y_pred = model.predict(X_test)
        
        # Check for probability predictions
        y_pred_proba = model.predict_proba(X_test) if hasattr(model, "predict_proba") else None

        print(f"Model: {model.__class__.__name__}")
        print("Accuracy: ", accuracy_score(y_test, y_pred))
        print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
        print("Classification Report:\n", classification_report(y_test, y_pred))
        
        if y_pred_proba is not None:
            if len(np.unique(y_test)) > 2:  # Handle multi-class case
                roc_auc = roc_auc_score(y_test, y_pred_proba, multi_class="ovr", average="macro")
                print("ROC AUC Score (multi-class): ", roc_auc)
            else:
                roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])
                print("ROC AUC Score: ", roc_auc)
        
        print("F1 Score (weighted): ", f1_score(y_test, y_pred, average='weighted'))
        print("-" * 60)

    except Exception as e:
        print(f"An error occurred while evaluating model {model.__class__.__name__}: {str(e)}")
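
The `timeout` check in `evaluate_model_with_timeout` runs only after `fit` has returned, so it can report a slow fit but cannot cut one short. A minimal sketch of a helper that stops waiting after a deadline, using a worker thread from the standard library. The name `run_with_timeout` is hypothetical, and the worker thread is not killed: it keeps running in the background, so this bounds waiting time rather than compute.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_with_timeout(fn, timeout, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread.

    Returns (True, result) if it finishes within `timeout` seconds,
    otherwise (False, None). The worker is NOT interrupted on timeout.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return True, future.result(timeout=timeout)
    except FutureTimeout:
        return False, None
    finally:
        # Do not block here waiting for a slow worker to finish.
        executor.shutdown(wait=False)

# Hypothetical usage with a model from this notebook:
# finished, _ = run_with_timeout(model.fit, 60, X_train, y_train)
```

To actually terminate a long-running fit, a separate process would be needed; that is outside the scope of this sketch.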


## Step 8: Data Transformation

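We will use pipelines to handle both numerical and categorical features. Numeric features are imputed with the median and standardized, while categorical features are imputed with the most frequent value and one-hot encoded. Note that Step 4 has already label-encoded every object column, so `categorical_features` is empty at this point and only the numeric branch has an effect; the categorical branch applies only if the label-encoding step is skipped.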


In [None]:
# Preprocessing pipelines for numeric and categorical data
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(include=[object]).columns.tolist()

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply the preprocessor to X_train and X_test
X_train_imputed = preprocessor.fit_transform(X_train)
X_test_imputed = preprocessor.transform(X_test)

# Convert the result back to DataFrame for interpretability (optional)
X_train_imputed_df = pd.DataFrame(X_train_imputed.toarray() if hasattr(X_train_imputed, 'toarray') else X_train_imputed)
X_test_imputed_df = pd.DataFrame(X_test_imputed.toarray() if hasattr(X_test_imputed, 'toarray') else X_test_imputed)
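
An alternative design is to chain the preprocessor and the estimator in one `Pipeline`, so that the same transformations are applied automatically at prediction time. The sketch below uses made-up data with hypothetical columns `income` and `grade`; it is an illustration of the pattern, not the notebook's own setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up): one numeric and one categorical feature, both with gaps.
X_demo = pd.DataFrame({
    "income": [40.0, np.nan, 85.0, 60.0, 52.0, 95.0],
    "grade": pd.Series(["A", "B", np.nan, "A", "C", "B"], dtype=object),
})
y_demo = np.array([0, 1, 1, 0, 0, 1])

preprocess = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["income"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["grade"]),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
])
clf.fit(X_demo, y_demo)

# Raw rows can be passed directly; the unseen category "D" is encoded as all zeros.
X_new = pd.DataFrame({"income": [70.0], "grade": pd.Series(["D"], dtype=object)})
pred = clf.predict(X_new)
```

With this arrangement the preprocessor is fitted only on the data passed to `fit`, and raw rows never reach the model untransformed.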


## Step 9: Evaluate the Models

We now fit and evaluate every model in `models` on the preprocessed training and test sets.

In [None]:
# Evaluate each model on the preprocessed data; the fit-time check applies to every model.
for name, model in models.items():
    evaluate_model_with_timeout(model, X_train_imputed_df, X_test_imputed_df, y_train, y_test)


## Step 10: SHAP Analysis
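We use SHAP's `TreeExplainer` on the fitted Random Forest to see which features drive its predictions.

In [None]:
# Explainable AI with SHAP for the RandomForest model (or any other tree-based model)
explainer = shap.TreeExplainer(models['RandomForest'])
shap_values = explainer.shap_values(X_test_imputed_df)

# Depending on the SHAP version, multi-class output is a list (one array per class) or a 3-D array
class_index = 1
class_shap_values = shap_values[class_index] if isinstance(shap_values, list) else shap_values[:, :, class_index]

# Plot the SHAP summary for the selected class
shap.summary_plot(class_shap_values, X_test_imputed_df)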


## Step 11: Predictions on Missing Loan Status
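We predict `loan_status` for the rows where it is missing and save the results to an `.xlsx` file.

In [None]:
# Predict on rows with missing loan_status and save predictions to Excel.
# Note: Step 4 label-encodes loan_status via astype(str), which turns NaN into the string 'nan',
# so the missing-row mask must be captured before that step for this filter to match any rows.
missing_loan_status = loan_data[loan_data['loan_status'].isnull()].copy()

if not missing_loan_status.empty:
    # The model was fitted on preprocessed features, so apply the same transformation here
    X_missing = preprocessor.transform(missing_loan_status.drop(columns=['loan_status']))
    X_missing = pd.DataFrame(X_missing.toarray() if hasattr(X_missing, 'toarray') else X_missing)
    missing_loan_status['loan_status'] = models['RandomForest'].predict(X_missing)
    missing_loan_status.to_excel('LoanStatusPredictions.xlsx', index=False)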


# Loan Default Prediction Analysis Insights

This section provides insights into the performance of three different machine learning models—**Random Forest**, **Logistic Regression**, and **K-Nearest Neighbors (KNN)**—based on their evaluation metrics.

## 1. Random Forest Classifier

### Results:
- **Accuracy**: 0.9744 (97.44%)
- **Confusion Matrix**:



| True \ Predicted | 0    | 1     | 2    |
|------------------|------|-------|------|
| **0**            | 3609 | 204   | 49   |
| **1**            | 0    | 26347 | 361  |
| **2**            | 23   | 249   | 3707 |



- **Classification Report**:
  - Precision for class 0: **0.99**
  - Recall for class 0: **0.93**
  - F1-score for class 0: **0.96**
  - Precision for class 1: **0.98**
  - Recall for class 1: **0.99**
  - F1-score for class 1: **0.98**
  - Precision for class 2: **0.90**
  - Recall for class 2: **0.93**
  - F1-score for class 2: **0.92**

- **ROC AUC Score**: **0.9853**
- **Weighted F1 Score**: **0.9744**
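
The reported accuracy, per-class precision and recall, and weighted F1 score can be re-derived from the confusion matrix above. The following sketch does the arithmetic with NumPy; the variable names are illustrative.

```python
import numpy as np

# Random Forest confusion matrix from the table above (rows = true, columns = predicted).
cm = np.array([
    [3609,   204,   49],
    [   0, 26347,  361],
    [  23,   249, 3707],
])

true_positives = np.diag(cm)
support = cm.sum(axis=1)                      # actual number of samples per class
accuracy = true_positives.sum() / cm.sum()    # correct predictions / all predictions
precision = true_positives / cm.sum(axis=0)   # per predicted class (column sums)
recall = true_positives / support             # per true class (row sums)
f1 = 2 * precision * recall / (precision + recall)
weighted_f1 = (f1 * support).sum() / support.sum()

print(round(accuracy, 4), np.round(precision, 2), np.round(recall, 2), round(weighted_f1, 4))
```

The same calculation applies to the Logistic Regression and KNN matrices by swapping in their values.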

### Insights:
- The Random Forest model performs exceptionally well with an accuracy of about **97%**.
- The confusion matrix shows that it correctly predicts most of the loan statuses, with very few false positives and false negatives.
- The model has high precision and recall across all classes, particularly for class **1**, indicating it is very effective at identifying non-defaulted loans.
- The ROC AUC score of **0.9853** suggests that the model has excellent discriminatory ability between classes.

---

## 2. Logistic Regression

### Results:
- **Accuracy**: 0.9434 (94.34%)
- **Confusion Matrix**:



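| True \ Predicted | 0    | 1     | 2    |
|------------------|------|-------|------|
| **0**            | 2563 | 1195  | 104  |
| **1**            | 0    | 26335 | 373  |
| **2**            | 19   | 263   | 3697 |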

- **Classification Report**:
    - Precision for class 0: **0.99**
    - Recall for class 0: **0.66**
    - F1-score for class 0: **0.80**
    - Precision for class 1: **0.95**
    - Recall for class 1: **0.99**
    - F1-score for class 1: **0.97**
    - Precision for class 2: **0.89**
    - Recall for class 2: **0.93**
    - F1-score for class 2: **0.91**

- **ROC AUC Score**: **0.9604**
- **Weighted F1 Score**: **0.9404**

### Insights:
- Logistic Regression shows a good accuracy of about **94%**, but it is less effective than Random Forest.
- The confusion matrix indicates that while it performs well in identifying non-defaulted loans (class **1**), it struggles with identifying defaults (class **0**), as evidenced by the lower recall (66%): 1,195 of the 3,862 class **0** loans are predicted as class **1**.
- Because precision for class **0** is high (0.99) but recall is low, the model rarely raises a false default alarm yet misses about a third of actual defaults, leading to potential financial risk.
- The ROC AUC score of **0.9604** still shows a strong ability to distinguish between classes but is not as robust as Random Forest.

---

## 3. K-Nearest Neighbors (KNN)

### Results:
- **Accuracy**: 0.8593 (85.93%)
- **Confusion Matrix**:



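| True \ Predicted | 0    | 1     | 2    |
|------------------|------|-------|------|
| **0**            | 1458 | 2333  | 71   |
| **1**            | 136  | 26124 | 448  |
| **2**            | 151  | 1722  | 2106 |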

- **Classification Report**:
    - Precision for class 0: **0.84**
    - Recall for class 0: **0.38**
    - F1-score for class 0: **0.52**
    - Precision for class 1: **0.87**
    - Recall for class 1: **0.98**
    - F1-score for class 1: **0.92**
    - Precision for class 2: **0.80**
    - Recall for class 2: **0.53**
    - F1-score for class 2: **0.64**

- **ROC AUC Score**: **0.8449**
- **Weighted F1 Score**: **0.8416**

### Insights:
- KNN has the lowest accuracy of the three models at about **86%**, indicating that it struggles more than the others in predicting loan status accurately.
- The confusion matrix reveals a significant number of misclassifications, particularly in identifying defaults (class **0**), where recall is only **38%**: 2,333 of the 3,862 class **0** loans are predicted as class **1**. This means most actual defaults are being missed.
- Although it performs well on the majority class (class **1**), its overall performance is not satisfactory due to low recall on classes **0** and **2** (38% and 53%).
- The ROC AUC score of **0.8449** indicates a fair ability to distinguish between classes but suggests that KNN may not be the best choice given its performance.

---

## Overall Comparison and Conclusion

In summary:

- The Random Forest model (97.44% accuracy, ROC AUC 0.9853) outperforms both Logistic Regression (94.34%, 0.9604) and KNN (85.93%, 0.8449), and it has the highest recall on class **0** (0.93).
- Logistic Regression performs well overall but identifies only 66% of defaults (class **0**).
- KNN shows significant weaknesses, with recall of 38% on class **0** and 53% on class **2**, making it less suitable than the other two models.

### Recommendations

Given these insights:

1. Consider using Random Forest as the primary model due to its high accuracy and robustness.
2. If interpretability is crucial, Logistic Regression can be used alongside Random Forest to provide insights into feature importance.
3. Further tuning of hyperparameters and possibly exploring ensemble methods could enhance model performance even more.
4. Implement cross-validation techniques to ensure stability in model performance across different subsets of data.
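
The cross-validation recommendation can be sketched as follows. The data here is synthetic (generated with `make_classification`), and the fold count, estimator settings, and scoring choice are assumptions for illustration rather than tuned values from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data standing in for the preprocessed loan features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Stratified folds keep class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# One weighted-F1 score per fold; the spread indicates how stable the model is.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_weighted")
print(f"weighted F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A small standard deviation across folds suggests the single-split results reported above are not an artifact of one particular train/test partition.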

This analysis provides a clear pathway to selecting the best model while highlighting areas where improvements can be made in future iterations or through additional data collection.