# Credit Scoring with Missing Data Analysis

This notebook analyzes credit scoring data with a focus on data that is Missing Not At Random (MNAR). We compare three methods for handling missing data (mean imputation, Heckman correction, and BASL) and evaluate their impact on model performance.

## Table of Contents
1. [Setup and Data Loading](#1.-Setup-and-Data-Loading)
2. [Data Preprocessing](#2.-Data-Preprocessing)
3. [Missing Data Analysis](#3.-Missing-Data-Analysis)
4. [Handling Missing Data](#4.-Handling-Missing-Data)
5. [Model Training](#5.-Model-Training)
6. [Model Evaluation](#6.-Model-Evaluation)
7. [Results Comparison](#7.-Results-Comparison)
8. [Conclusions](#8.-Conclusions)

## 1. Setup and Data Loading

First, let's import the necessary libraries and load our data.

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

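# Add the project root to the Python path so that `src` is importable as a package
# (appending '../src' itself would not make `from src.<module> import ...` resolve)
import sys
import os
sys.path.append(os.path.abspath('..'))

# Import our custom modules.
# These imports are commented out in this run, which is why the later cells that
# call preprocess_data, plot_missingness, etc. raise NameError. Uncomment them
# before running sections 2 to 7.
# from src.missing_data_handler import MissingDataHandler
# from src.model import CreditScoringModel
# from src.evaluation import ModelEvaluator
# from src.utils import preprocess_data, plot_missingness, plot_feature_distributions, create_correlation_matrix, split_data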

# Set random seed for reproducibility
np.random.seed(42)

# Load the accepted and rejected loan applications from gzip-compressed CSV files
file_path_accepted_gz = '/Users/boraeguz/MSc_Thesis_Missing_Data/data/raw/accepted_2007_to_2018Q4.csv.gz'
df_accepted_gz = pd.read_csv(file_path_accepted_gz, compression='gzip')
print(f"Dataset shape: {df_accepted_gz.shape}")

file_path_rejected_gz = '/Users/boraeguz/MSc_Thesis_Missing_Data/data/raw/rejected_2007_to_2018Q4.csv.gz'
df_rejected_gz = pd.read_csv(file_path_rejected_gz, compression='gzip')

# file_path_accepted_csv = '/Users/boraeguz/MSc_Thesis_Missing_Data/data/raw/accepted_2007_to_2018Q4.csv'
# df_accepted_csv = pd.read_csv(file_path_accepted_csv)   

# file_path_rejected_csv = '/Users/boraeguz/MSc_Thesis_Missing_Data/data/raw/rejected_2007_to_2018Q4.csv'
# df_rejected_csv = pd.read_csv(file_path_rejected_csv)   



Dataset shape: (2260701, 151)
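
With tens of millions of rows in the rejected file, it can help to read only the columns that are needed and to state their dtypes up front. The following is a minimal sketch, not the loading code used in this notebook: `load_rejected` and `REJECTED_DTYPES` are hypothetical names, the column names are those reported by `info()` in the next section, the dtype choices are assumptions, and the two sample rows are made up for the demo.

```python
import gzip
import os
import tempfile

import pandas as pd

# Column names come from the rejected-applications table; dtypes are assumptions.
REJECTED_DTYPES = {
    "Amount Requested": "float64",
    "Risk_Score": "float64",
    "Debt-To-Income Ratio": str,
    "Employment Length": str,
}


def load_rejected(path, nrows=None):
    """Read only the columns in REJECTED_DTYPES from a gzip-compressed CSV."""
    return pd.read_csv(
        path,
        compression="gzip",
        usecols=list(REJECTED_DTYPES),
        dtype=REJECTED_DTYPES,
        nrows=nrows,  # e.g. nrows=100_000 for a quick look at a sample
    )


# Demo on a tiny made-up file so the sketch runs without the real data.
sample_csv = (
    "Amount Requested,Application Date,Risk_Score,Debt-To-Income Ratio,Employment Length\n"
    "1000.0,2007-05-26,693.0,10%,4 years\n"
    "5000.0,2007-05-27,,38.64%,< 1 year\n"
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as fh:
        fh.write(sample_csv)
    sample_df = load_rejected(sample_path)

print(sample_df.shape)
print(sample_df.dtypes)
```

Passing `nrows` gives a quick way to prototype the later cells on a small slice before running them on the full file.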


## 2. Data Preprocessing

Let's examine the rejected-applications data and perform initial preprocessing steps.

In [8]:
# Display basic information about the dataset
print("Dataset Info:")
df_rejected_gz.info()

print("\nSummary Statistics:")
df_rejected_gz.describe()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27648741 entries, 0 to 27648740
Data columns (total 9 columns):
 #   Column                Dtype  
---  ------                -----  
 0   Amount Requested      float64
 1   Application Date      object 
 2   Loan Title            object 
 3   Risk_Score            float64
 4   Debt-To-Income Ratio  object 
 5   Zip Code              object 
 6   State                 object 
 7   Employment Length     object 
 8   Policy Code           float64
dtypes: float64(3), object(6)
memory usage: 1.9+ GB

Summary Statistics:


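| Statistic | Amount Requested | Risk_Score | Policy Code |
|-----------|------------------|------------|-------------|
| count | 27648740.0 | 9151111.0 | 27647820.0 |
| mean | 13133.24 | 628.1721 | 0.006375113 |
| std | 15009.64 | 89.93679 | 0.1127368 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 4800.0 | 591.0 | 0.0 |
| 50% | 10000.0 | 637.0 | 0.0 |
| 75% | 20000.0 | 675.0 | 0.0 |
| max | 1400000.0 | 990.0 | 2.0 |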


In [9]:
# Preprocess the data
processed_df = preprocess_data(df_rejected_gz)
print(f"Processed dataset shape: {processed_df.shape}")

NameError: name 'preprocess_data' is not defined
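
The `info()` output shows that `Debt-To-Income Ratio`, `Application Date`, and `Employment Length` are stored as `object`, so they need converting before any numeric analysis. The sketch below shows one way this could look. It is a hypothetical stand-in, not the project's `preprocess_data`, and the string formats it assumes (`"10%"`, `"4 years"`, ISO-style dates) are assumptions about the raw file.

```python
import numpy as np
import pandas as pd


def preprocess_rejected(df):
    """Hypothetical stand-in for preprocess_data; returns a converted copy."""
    out = df.copy()
    # Assumption: DTI is text such as "10%"; strip the sign and convert to float.
    out["Debt-To-Income Ratio"] = pd.to_numeric(
        out["Debt-To-Income Ratio"].str.rstrip("%"), errors="coerce"
    )
    # Assumption: dates are ISO-like strings; unparseable values become NaT.
    out["Application Date"] = pd.to_datetime(out["Application Date"], errors="coerce")
    # Assumption: values look like "< 1 year", "4 years", "10+ years".
    # Simplification: "< 1 year" and "1 year" both map to 1.
    out["Employment Length"] = pd.to_numeric(
        out["Employment Length"].str.extract(r"(\d+)", expand=False), errors="coerce"
    )
    return out


# Made-up rows that only mimic the column layout of the rejected table.
demo = pd.DataFrame({
    "Amount Requested": [1000.0, 5000.0, 2500.0],
    "Application Date": ["2007-05-26", "2007-05-27", "not a date"],
    "Risk_Score": [693.0, np.nan, 640.0],
    "Debt-To-Income Ratio": ["10%", "38.64%", None],
    "Employment Length": ["4 years", "< 1 year", "10+ years"],
})
clean = preprocess_rejected(demo)
print(clean.dtypes)
```

Using `errors="coerce"` turns anything unparseable into a missing value instead of raising, which keeps the conversion from hiding or dropping rows before the missingness analysis.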

## 3. Missing Data Analysis

Let's analyze the patterns of missing data in our dataset.

In [10]:
# Visualize missing data patterns
plot_missingness(processed_df)

# Plot feature distributions
plot_feature_distributions(processed_df)

# Create correlation matrix
create_correlation_matrix(processed_df)

NameError: name 'plot_missingness' is not defined
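
Even without the plotting helpers, the summary statistics above already hint at the scale of the problem: `Risk_Score` has 9,151,111 non-null values out of 27,648,741 rows, so roughly 67% of its values are missing. A per-column summary like this can be computed with plain pandas. This is a minimal sketch: `missingness_summary` is a hypothetical helper and the demo frame is made up.

```python
import numpy as np
import pandas as pd


def missingness_summary(df):
    """Return per-column missing counts and fractions, most-missing first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
    })
    return summary.sort_values("frac_missing", ascending=False)


# Made-up frame using three column names from the rejected table.
demo = pd.DataFrame({
    "Amount Requested": [1000.0, 5000.0, 2500.0, 800.0],
    "Risk_Score": [693.0, np.nan, np.nan, np.nan],
    "Policy Code": [0.0, 0.0, np.nan, 0.0],
})
summary = missingness_summary(demo)
print(summary)
```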

## 4. Handling Missing Data

We'll apply different methods to handle missing data and create multiple versions of our dataset.

In [None]:
# Initialize missing data handler
handler = MissingDataHandler()

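# Apply different missing data handling methods.
# Note: 'target' and 'income' must be columns of processed_df. The raw rejected
# table inspected in section 2 contains neither, so they have to be added upstream.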
df_mean = handler.mean_imputation(processed_df.copy(), 'target')
df_heckman = handler.heckman_correction(processed_df.copy(), 'target', 'income')
df_basl = handler.basl_method(processed_df.copy(), 'target')

# Store datasets in a dictionary
datasets = {
    'mean_imputation': df_mean,
    'heckman_correction': df_heckman,
    'basl_method': df_basl
}
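
To make the simplest of these methods concrete, here is a minimal sketch of mean imputation together with a small simulation of why it can mislead under MNAR. It is not the `MissingDataHandler` implementation: `mean_impute` is a hypothetical helper, and the simulated scores and missingness probabilities are invented for illustration.

```python
import numpy as np
import pandas as pd


def mean_impute(df, target_col):
    """Fill NaNs in numeric feature columns with the column mean; leave the target untouched."""
    out = df.copy()
    feature_cols = out.select_dtypes(include="number").columns.drop(
        target_col, errors="ignore"
    )
    out[feature_cols] = out[feature_cols].fillna(out[feature_cols].mean())
    return out


rng = np.random.default_rng(42)
n = 10_000
true_score = rng.normal(600.0, 50.0, size=n)

# MNAR: whether a score is missing depends on the score itself.
# Here low scores go unrecorded far more often than high ones (made-up rates).
p_missing = np.where(true_score < 600.0, 0.8, 0.1)
observed = np.where(rng.random(n) < p_missing, np.nan, true_score)

demo = pd.DataFrame({"score": observed, "target": rng.integers(0, 2, size=n)})
imputed = mean_impute(demo, "target")

print(f"true mean:    {true_score.mean():.1f}")
print(f"imputed mean: {imputed['score'].mean():.1f}")
```

Because the imputed values all equal the mean of the observed scores, and the observed scores are skewed toward high values, the imputed column overstates the true average. This is the kind of bias that selection-aware methods such as the Heckman correction aim to address.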

## 5. Model Training

Now we'll train models using each version of our dataset.

In [None]:
# Initialize model and evaluator
model = CreditScoringModel()
evaluator = ModelEvaluator()

# Dictionary to store results
results = {}

# Train and evaluate models for each dataset
for method_name, dataset in datasets.items():
    print(f"\nProcessing {method_name}...")
    
    # Prepare data
    X = dataset.drop('target', axis=1)
    y = dataset['target']
    X_train, X_test, y_train, y_test = split_data(X, y)
    
    # Train a fresh model so no fitted state carries over between datasets
    model = CreditScoringModel()
    model.train(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Evaluate
    results[method_name] = evaluator.evaluate_model(y_test, y_pred, y_pred_proba)
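
The loop relies on `split_data` and `CreditScoringModel`, whose implementations are not shown here. As a rough sketch of what one iteration does, the example below uses scikit-learn's `train_test_split` and `LogisticRegression` as stand-ins. Both substitutions are assumptions, as are the 80/20 split, the stratification, and the synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def split_data(X, y, test_size=0.2, seed=42):
    """Hypothetical stand-in for the project's split_data: stratified train/test split."""
    return train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )


# Synthetic data: the label depends on x1 plus noise; x2 is irrelevant.
rng = np.random.default_rng(42)
X = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
y = (X["x1"] + rng.normal(scale=0.5, size=500) > 0).astype(int).rename("target")

X_train, X_test, y_train, y_test = split_data(X, y)

clf = LogisticRegression()  # assumption: stands in for CreditScoringModel
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)[:, 1]  # probability of class 1
print(X_train.shape, X_test.shape)
```

Stratifying on `y` keeps the class balance similar in both splits, which matters for credit data where defaults are typically the minority class.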

## 6. Model Evaluation

Let's evaluate the performance of each model.

In [None]:
# Compare model performances
evaluator.compare_models(results)

# Print detailed results
for method_name, metrics in results.items():
    print(f"\nResults for {method_name}:")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")
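
The inner loop expects `evaluate_model` to return a flat dictionary of metric names and float values. A minimal sketch of such a function, built on `sklearn.metrics`, could look like the following. It is a hypothetical stand-in for `ModelEvaluator.evaluate_model`; the metric names mirror those used in the comparison cell of this notebook, and the labels and probabilities are made up.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate_model(y_true, y_pred, y_pred_proba):
    """Return a dict of metric name -> float for binary classification."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # ROC AUC is computed from probabilities, not from hard predictions.
        "roc_auc": roc_auc_score(y_true, y_pred_proba),
    }


# Made-up labels and probabilities, thresholded at 0.5 for the hard predictions.
y_true = [0, 0, 1, 1, 1, 0]
y_proba = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]
y_pred = [int(p >= 0.5) for p in y_proba]

metrics = evaluate_model(y_true, y_pred, y_proba)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```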

## 7. Results Comparison

Let's analyze the differences between the methods.

In [None]:
# Create comparison visualizations
metrics_to_compare = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
evaluator.compare_models(results, metrics=metrics_to_compare)
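
If `compare_models` is unavailable, the nested `results` dictionary can also be turned into a table directly. The sketch below is a hypothetical helper, not part of `ModelEvaluator`. The method names and numbers are placeholders chosen only to show the mechanics and are not results from this notebook.

```python
import pandas as pd


def results_table(results, metrics=None):
    """Turn {method: {metric: value}} into a DataFrame with one row per method."""
    table = pd.DataFrame.from_dict(results, orient="index")
    if metrics is not None:
        table = table[metrics]  # keep only the requested metrics, in this order
    return table


# Placeholder values, NOT outputs of any model in this notebook.
placeholder = {
    "method_a": {"accuracy": 0.70, "roc_auc": 0.75, "f1": 0.60},
    "method_b": {"accuracy": 0.72, "roc_auc": 0.74, "f1": 0.65},
}

table = results_table(placeholder, metrics=["accuracy", "roc_auc"])
best = table.idxmax()  # for each metric, the method with the highest value
print(table)
print(best)
```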

## 8. Conclusions

Based on our analysis:

1. **Method Comparison**:
   - [Fill in observations about which method performed best]
   - [Note any interesting patterns in the results]

2. **Practical Implications**:
   - [Discuss what these results mean for credit scoring]
   - [Note any limitations or areas for future research]

3. **Recommendations**:
   - [Provide specific recommendations based on the results]
   - [Suggest best practices for handling missing data in credit scoring]