# Thesis Analysis on Missing Data Handling in Credit Scoring

## Objectives

In this notebook, we will explore different ways to handle missing data in credit scoring datasets. The primary objectives are:

1. Simulate a synthetic credit dataset.
2. Introduce missingness into the dataset using various missing data mechanisms (MCAR, MAR, MNAR).
3. Visualize the missingness patterns in the data.
4. Handle missing data using imputation, Heckman correction, and BASL (Bias-Aware Self-Learning).
5. Split the data for training and testing.
6. Train a machine learning model.
7. Evaluate the model using various performance metrics.
8. Conduct experiments to compare the performance of different missing data handling techniques.
9. Visualize the results and draw conclusions.

## Libraries Used

We will use the following libraries in this notebook:

1. **pandas**: For data manipulation.
2. **numpy**: For numerical operations and data generation.
3. **matplotlib** and **seaborn**: For visualization.
4. **scikit-learn** (`sklearn`): For machine learning and evaluation metrics.
5. **statsmodels**: For the Heckman correction model.


In [4]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, brier_score_loss
from src.data_simulator import DataSimulator
from src.missing_data_handler import MissingDataHandler
from src.heckman import HeckmanCorrection
from src.basl import BASLCorrection

ModuleNotFoundError: No module named 'src'

### 1. Simulating Data

We will create a synthetic dataset of 1000 samples with features such as age, income, and debt, plus a binary target, `Repayment_Label` (0 = default, 1 = no default).
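Because `src.data_simulator` is not shown in this notebook, the following is only a minimal sketch of what such a generator might look like. The function name `simulate_credit_data`, the feature column names `Age`, `Income`, and `Debt`, the distributions, and the logistic coefficients are all assumptions made for illustration; only the `Repayment_Label` column name comes from the notebook.

```python
import numpy as np
import pandas as pd


def simulate_credit_data(num_samples=1000, seed=42):
    """Hypothetical stand-in for DataSimulator.generate_data()."""
    rng = np.random.default_rng(seed)

    # Assumed feature distributions (illustrative only).
    age = rng.integers(21, 70, size=num_samples)
    income = rng.lognormal(mean=10.5, sigma=0.5, size=num_samples)
    debt = rng.lognormal(mean=9.0, sigma=0.8, size=num_samples)

    # Assumed relationship: a higher income-to-debt ratio raises
    # the probability of repayment (label 1 = no default).
    logit = 0.02 * (age - 45) + 1.0 * np.log(income / debt)
    p_repay = 1.0 / (1.0 + np.exp(-logit))
    label = rng.binomial(1, p_repay)

    return pd.DataFrame(
        {"Age": age, "Income": income, "Debt": debt, "Repayment_Label": label}
    )


toy_data = simulate_credit_data(num_samples=1000)
```

The result is stored in `toy_data` so that it does not overwrite the `data` frame produced by the actual simulator.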

In [None]:
# Initialize the DataSimulator class
simulator = DataSimulator(num_samples=1000)

# Generate synthetic data
data = simulator.generate_data()

# Display the first few rows of the data
data.head()


### 2. Introducing Missingness

We will introduce missingness in the `Repayment_Label` column based on three different mechanisms:
- **MCAR**: Missing Completely at Random
- **MAR**: Missing at Random
- **MNAR**: Missing Not at Random

We'll apply these missingness mechanisms one by one to simulate different real-world scenarios.
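To make the difference between the three mechanisms concrete, here is a small sketch of how label masking could be implemented. It is not the implementation behind `introduce_missingness`, which is not shown here: the helper `mask_labels`, the use of an `Income` column as the observed driver for MAR, and the 0.8 / 0.2 weights for MNAR are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def mask_labels(df, mechanism, missing_rate=0.2, seed=0):
    """Hypothetical illustration of MCAR / MAR / MNAR masking of Repayment_Label."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n = len(out)

    if mechanism == "MCAR":
        # Every label has the same chance of being missing.
        p_missing = np.full(n, missing_rate)
    elif mechanism == "MAR":
        # Missingness depends only on an observed feature (assumed: Income).
        z = (out["Income"] - out["Income"].mean()) / out["Income"].std()
        p_missing = 1.0 / (1.0 + np.exp(z))  # lower income -> more likely missing
    elif mechanism == "MNAR":
        # Missingness depends on the value that goes missing (the label itself).
        p_missing = np.where(out["Repayment_Label"] == 0, 0.8, 0.2)
    else:
        raise ValueError(f"Unknown mechanism: {mechanism}")

    # Rescale so the expected share of missing labels is approximately missing_rate.
    p_missing = np.asarray(p_missing, dtype=float)
    p_missing = np.clip(p_missing * missing_rate / p_missing.mean(), 0.0, 1.0)

    mask = rng.random(n) < p_missing
    out["Repayment_Label"] = out["Repayment_Label"].astype(float).mask(mask)
    return out


# Toy frame to try the three mechanisms on.
demo_rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Income": demo_rng.lognormal(10.5, 0.5, size=5000),
    "Repayment_Label": demo_rng.binomial(1, 0.8, size=5000),
})
masked = {m: mask_labels(demo, m) for m in ("MCAR", "MAR", "MNAR")}
```

In this sketch all three variants hide roughly the same share of labels; what differs is which rows are hidden. Under MNAR the hidden rows contain disproportionately many defaults, which is why a model trained only on the observed labels can be biased.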


In [None]:
# Introduce MCAR missingness
data_mcar = simulator.introduce_missingness(data, missingness_type="MCAR", missing_rate=0.2)

# Introduce MAR missingness
data_mar = simulator.introduce_missingness(data, missingness_type="MAR", missing_rate=0.2)

# Introduce MNAR missingness
data_mnar = simulator.introduce_missingness(data, missingness_type="MNAR", missing_rate=0.2)

# Display the data with MCAR missingness
data_mcar.head()


### 3. Visualizing Missing Data

Let's visualize the missing data patterns using a heatmap, which will help us understand where and how the missing values are distributed.
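A heatmap shows where values are missing, but a per-column count is often easier to read at a glance. The helper below is a small sketch of such a numeric summary; the name `missing_summary` and the toy frame are illustrative and not part of the notebook's `src` package.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and share of missing values per column."""
    counts = df.isnull().sum()
    return pd.DataFrame({"n_missing": counts, "share_missing": counts / len(df)})


# Toy frame: two of the four labels are missing.
toy = pd.DataFrame({
    "Income": [52000.0, 31000.0, 78000.0, 45000.0],
    "Repayment_Label": [1.0, np.nan, 0.0, np.nan],
})
summary = missing_summary(toy)
```

Calling `missing_summary(data_mcar)` on the notebook's data would give the same kind of table for the MCAR dataset.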


In [None]:
# Function to plot the missingness heatmap
def plot_missing_data_heatmap(data):
    plt.figure(figsize=(10, 6))
    sns.heatmap(data.isnull(), cbar=False, cmap='viridis')
    plt.title("Missing Data Heatmap")
    plt.show()

# Visualize missing data for the MCAR dataset
plot_missing_data_heatmap(data_mcar)


### 4. Handling Missing Data

We will apply three different methods to handle the missing data:
1. **Imputation**: Using mean imputation.
2. **Heckman Correction**: To handle MNAR missingness.
3. **BASL**: A more advanced approach that combines bias correction with self-learning.

The cell below applies mean imputation to the MCAR dataset and both Heckman correction and BASL to the MNAR dataset.


In [None]:
# Initialize the handlers
missing_data_handler = MissingDataHandler()
heckman_correction = HeckmanCorrection()
basl_correction = BASLCorrection()

# Impute missing data (MCAR example)
data_imputed_mcar = missing_data_handler.impute_missing_data(data_mcar)

# Apply Heckman correction (MNAR example)
data_heckman_mnar = heckman_correction.apply_heckman(data_mnar)

# Apply BASL correction (MNAR example)
data_basl_mnar = basl_correction.apply_basl(data_mnar)

# Display the processed data
data_imputed_mcar.head(), data_heckman_mnar.head(), data_basl_mnar.head()


### 5. Splitting the Data

We'll split the data into training and testing sets. We'll use 80% of the data for training and 20% for testing.
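If any labels are still missing after the handling step, those rows cannot be used for supervised training and need to be dropped before the split. The sketch below shows this on a toy frame and also passes `stratify=y`, which keeps the default rate similar in both partitions. The toy data and the variable names `X_tr`, `X_te`, `y_tr`, `y_te` are illustrative; the notebook's own split does not stratify.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Income": rng.normal(50_000, 10_000, size=500),
    "Repayment_Label": rng.binomial(1, 0.8, size=500).astype(float),
})
# Hide 10% of the labels to mimic unresolved missingness.
toy.loc[toy.sample(frac=0.1, random_state=0).index, "Repayment_Label"] = np.nan

# Keep only rows with an observed label.
labelled = toy.dropna(subset=["Repayment_Label"])
X = labelled.drop(columns="Repayment_Label")
y = labelled["Repayment_Label"].astype(int)

# 80/20 split that preserves the class balance in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```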


In [None]:
# Define features and target variable for the MCAR dataset
X_mcar = data_imputed_mcar.drop(columns='Repayment_Label')
y_mcar = data_imputed_mcar['Repayment_Label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_mcar, y_mcar, test_size=0.2, random_state=42)

# Check the shape of the resulting datasets
X_train.shape, X_test.shape


### 6. Training a Model

We'll train a Random Forest classifier on the training data to predict the repayment label (default or no default).


In [None]:
# Initialize the RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Check the training accuracy
train_accuracy = rf_model.score(X_train, y_train)
train_accuracy


### 7. Feature Importances

We will extract and visualize the feature importances from the trained Random Forest model.


In [None]:
# Get feature importances from the trained model
feature_importances = rf_model.feature_importances_

# Visualize the feature importances
plt.figure(figsize=(8, 6))
sns.barplot(x=X_train.columns, y=feature_importances)
plt.title("Feature Importances")
plt.xticks(rotation=45)
plt.show()


### 8. Model Evaluation

We will evaluate the performance of the trained model using several metrics:
1. **AUC (Area Under the Curve)**
2. **Brier Score**
3. **Accuracy**
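As a quick reference for how these metrics behave, here is a worked toy example with four observations. AUC measures how well the predicted probabilities rank positives above negatives (higher is better), the Brier score is the mean squared difference between predicted probability and outcome (lower is better), and accuracy uses hard predictions, here obtained with an assumed 0.5 threshold. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score

y_true = np.array([0, 0, 1, 1])
p_hat = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1
y_hat = (p_hat >= 0.5).astype(int)       # hard predictions at a 0.5 threshold

toy_auc = roc_auc_score(y_true, p_hat)       # 3 of 4 positive/negative pairs ranked correctly
toy_brier = brier_score_loss(y_true, p_hat)  # mean squared error of the probabilities
toy_acc = accuracy_score(y_true, y_hat)      # share of correct hard predictions

# The Brier score computed by hand, for comparison.
manual_brier = np.mean((p_hat - y_true) ** 2)

print(toy_auc, toy_brier, toy_acc)  # 0.75, 0.158125, 0.75 (up to floating-point rounding)
```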


In [None]:
# Make predictions on the test set
y_pred = rf_model.predict(X_test)
y_pred_prob = rf_model.predict_proba(X_test)[:, 1]

# Calculate the evaluation metrics
auc_score = roc_auc_score(y_test, y_pred_prob)
brier_score = brier_score_loss(y_test, y_pred_prob)
accuracy = accuracy_score(y_test, y_pred)

# Display the metrics
auc_score, brier_score, accuracy


### 9. Experiments and Visualizing Results

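Now, let's compare the performance of the different missing data handling techniques (Imputation, Heckman, and BASL) using the AUC, Brier Score, and Accuracy metrics. Note that imputation is applied to the MCAR dataset while Heckman and BASL are applied to the MNAR dataset, so differences in the results reflect both the handling method and the missingness mechanism.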


In [None]:
# Define a function to evaluate a model for a given missing data handling method
def evaluate_model(method_name="Imputation"):
    if method_name == "Imputation":
        # Handle missing data using imputation (MCAR)
        processed = missing_data_handler.impute_missing_data(data_mcar)
    elif method_name == "Heckman":
        # Handle missing data using Heckman correction (MNAR)
        processed = heckman_correction.apply_heckman(data_mnar)
    elif method_name == "BASL":
        # Handle missing data using BASL correction (MNAR)
        processed = basl_correction.apply_basl(data_mnar)
    else:
        # Fail early instead of hitting an undefined variable below
        raise ValueError(f"Unknown method: {method_name}")

    # Split the processed data for model training
    X = processed.drop(columns='Repayment_Label')
    y = processed['Repayment_Label']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train a fresh model so that runs do not share fitted state
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    y_pred_prob = model.predict_proba(X_test)[:, 1]

    auc_score = roc_auc_score(y_test, y_pred_prob)
    brier_score = brier_score_loss(y_test, y_pred_prob)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    return auc_score, brier_score, accuracy

# Evaluate using different methods
results = {name: evaluate_model(name) for name in ["Imputation", "Heckman", "BASL"]}

# Display the results
results_df = pd.DataFrame(results, index=["AUC", "Brier Score", "Accuracy"])
results_df


### 10. Conclusion

In this thesis, we explored methods for handling missing data in credit scoring datasets. We simulated data with three missingness mechanisms (MCAR, MAR, MNAR), applied three handling techniques (mean imputation, Heckman correction, and BASL), and evaluated the resulting models using AUC, Brier score, and accuracy.

The next steps could involve further tuning the models, exploring other missing data techniques, and applying the model to real-world datasets for more robust conclusions.
