# Credit Portfolio Risk Modelling – Data Preparation

This notebook performs:

- Data loading
- Loan status filtering
- Target variable definition (Default)
- Removal of leakage variables
- Basic cleaning and preprocessing

Dataset: Lending Club Loan Data


In [1]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # the plotting API lives in matplotlib.pyplot, not the top-level package
import seaborn as sns
%matplotlib inline

In [2]:
%pwd


'C:\\Users\\amit_\\Github\\credit-portfolio-vasicek-simulation\\notebooks'

In [3]:
# In a raw string each backslash is literal, so the path separators must not be doubled
df = pd.read_csv(r'C:\Users\amit_\Github\credit-portfolio-vasicek-simulation\data\raw\loan_data.csv', index_col=0, low_memory=False)
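
The absolute path above only works on one machine. As a minimal sketch, assuming the notebook is run from the `notebooks` folder shown by `%pwd` and that the CSV sits in `data/raw` next to it, the path can be built relative to the working directory. The helper name `raw_data_path` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def raw_data_path(notebook_dir: Path) -> Path:
    """Return <repo>/data/raw/loan_data.csv given the <repo>/notebooks directory.

    Assumes the repository layout implied by the paths in this notebook.
    """
    return notebook_dir.parent / "data" / "raw" / "loan_data.csv"


# Path.cwd() is the notebook folder when the kernel is started from it (assumption)
csv_path = raw_data_path(Path.cwd())
print(csv_path)

# The load call would then become:
# df = pd.read_csv(csv_path, index_col=0, low_memory=False)
```

Because `pathlib` joins the parts with the separator of the host system, the same cell should run unchanged on Windows, macOS, and Linux.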

## Dataset Overview

The Lending Club dataset contains retail loan-level information including:

- Borrower characteristics
- Loan structure details
- Repayment and recovery information
- Loan status

This step focuses on:

1. Inspecting dataset structure
2. Understanding loan status distribution
3. Preparing data for PD modelling


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 466285 entries, 0 to 466284
Data columns (total 74 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           466285 non-null  int64  
 1   member_id                    466285 non-null  int64  
 2   loan_amnt                    466285 non-null  int64  
 3   funded_amnt                  466285 non-null  int64  
 4   funded_amnt_inv              466285 non-null  float64
 5   term                         466285 non-null  object 
 6   int_rate                     466285 non-null  float64
 7   installment                  466285 non-null  float64
 8   grade                        466285 non-null  object 
 9   sub_grade                    466285 non-null  object 
 10  emp_title                    438697 non-null  object 
 11  emp_length                   445277 non-null  object 
 12  home_ownership               466285 non-null  object 
 13  annu

## Initial Data Type Observations

After inspecting the dataset with `df.info()`, the following columns require type correction or cleaning:

- `term` → currently object; should be numeric (loan term in months)
- `emp_length` → currently object; requires cleaning and numeric mapping
- `pymnt_plan` → binary categorical; should be encoded as a 0/1 indicator
- `earliest_cr_line` → currently object; should be converted to datetime
- `issue_d` → currently object; should be converted to datetime

These transformations are necessary for proper modelling and feature engineering.
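
The conversions listed above could look roughly like the sketch below. It runs on a small toy frame, not on the real data, and the raw string formats (`' 36 months'`, `'10+ years'`, `'< 1 year'`, `'n'`/`'y'`, `'Dec-14'`) are assumptions about how Lending Club encodes these fields; they are not confirmed by the output shown in this notebook. The new column names (`term_months`, `emp_length_years`, `pymnt_plan_flag`) are hypothetical.

```python
import pandas as pd

# Toy rows in the formats this sketch ASSUMES; check them against the real columns first
sample = pd.DataFrame({
    "term": [" 36 months", " 60 months"],
    "emp_length": ["10+ years", "< 1 year"],
    "pymnt_plan": ["n", "y"],
    "earliest_cr_line": ["Jan-85", "Apr-99"],
    "issue_d": ["Dec-11", "Dec-14"],
})

cleaned = sample.copy()

# term: keep only the digits and store the term in months as an integer
cleaned["term_months"] = cleaned["term"].str.extract(r"(\d+)", expand=False).astype(int)

# emp_length: map "< 1 year" to 0 before extracting digits, otherwise it would read as 1
emp = cleaned["emp_length"].str.replace("< 1 year", "0", regex=False)
cleaned["emp_length_years"] = pd.to_numeric(
    emp.str.extract(r"(\d+)", expand=False), errors="coerce"
)

# pymnt_plan: encode the binary flag as 0/1
cleaned["pymnt_plan_flag"] = (cleaned["pymnt_plan"] == "y").astype(int)

# Dates: month abbreviation plus two-digit year; note that %y maps 00-68 to 20xx,
# so very old credit lines would need a manual century correction
for col in ["earliest_cr_line", "issue_d"]:
    cleaned[col] = pd.to_datetime(cleaned[col], format="%b-%y")

print(cleaned.dtypes)
```

Missing `emp_length` values stay missing under `errors="coerce"`, so how to impute them remains a separate modelling decision.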


## Filtering Resolved Loans

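For Probability of Default (PD) modelling, only loans with a final outcome are considered.

Included:
- Fully Paid
- Charged Off
- Default
- The "Does not meet the credit policy" variants of Fully Paid and Charged Off

Excluded:
- Current (active loans)
- In Grace Period
- Late (16-30 days) and Late (31-120 days)

This ensures that the target variable reflects observed repayment or default behaviour, not the status of a loan that is still running.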


In [5]:
df['loan_status'].value_counts()

loan_status
Current                                                224226
Fully Paid                                             184739
Charged Off                                             42475
Late (31-120 days)                                       6900
In Grace Period                                          3146
Does not meet the credit policy. Status:Fully Paid       1988
Late (16-30 days)                                        1218
Default                                                   832
Does not meet the credit policy. Status:Charged Off       761
Name: count, dtype: int64

## Defining Modelling Population

For PD modelling, only loans with resolved outcomes are included.

Default =
- Charged Off
- Default
- Does not meet the credit policy. Status:Charged Off

Non-default =
- Fully Paid
- Does not meet the credit policy. Status:Fully Paid

Unresolved loan states (Current, Late, In Grace Period) are excluded.
The filtered rows are stored in `loan_data`, a copy of the selection from `df`, so that the raw data stays untouched.

In [6]:
default_status = [
    'Charged Off',
    'Default',
    'Does not meet the credit policy. Status:Charged Off'
]

non_default_status = [
    'Fully Paid',
    'Does not meet the credit policy. Status:Fully Paid'
]
loan_data = df[df['loan_status'].isin(default_status + non_default_status)].copy()


In [7]:
loan_data['loan_status'].value_counts()/loan_data['loan_status'].count()

loan_status
Fully Paid                                             0.800446
Charged Off                                            0.184038
Does not meet the credit policy. Status:Fully Paid     0.008614
Default                                                0.003605
Does not meet the credit policy. Status:Charged Off    0.003297
Name: count, dtype: float64

In [8]:
loan_data['default'] = np.where(
    loan_data['loan_status'].isin(default_status),
    1,
    0
)


In [9]:
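# Only the last expression in a cell is displayed; wrap the first two in print() to see them as well
loan_data.shape
loan_data['default'].mean()
loan_data['default'].value_counts()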


default
0    186727
1     44068
Name: count, dtype: int64
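
The class counts above can be cross-checked against the raw `loan_status` distribution. The sketch below rebuilds the counts printed earlier in this notebook as a small Series (so it does not need the CSV) and recomputes the size of the modelling population and the default rate. The variable names `status_counts`, `n_default`, `n_non_default`, `n_resolved`, and `default_rate` are introduced here for illustration only.

```python
import pandas as pd

# Counts copied from the loan_status value counts printed earlier in this notebook
status_counts = pd.Series({
    "Current": 224226,
    "Fully Paid": 184739,
    "Charged Off": 42475,
    "Late (31-120 days)": 6900,
    "In Grace Period": 3146,
    "Does not meet the credit policy. Status:Fully Paid": 1988,
    "Late (16-30 days)": 1218,
    "Default": 832,
    "Does not meet the credit policy. Status:Charged Off": 761,
})

default_status = [
    "Charged Off",
    "Default",
    "Does not meet the credit policy. Status:Charged Off",
]
non_default_status = [
    "Fully Paid",
    "Does not meet the credit policy. Status:Fully Paid",
]

# Sum the counts per class and derive the default rate of the resolved population
n_default = int(status_counts.loc[default_status].sum())
n_non_default = int(status_counts.loc[non_default_status].sum())
n_resolved = n_default + n_non_default
default_rate = n_default / n_resolved

print(f"resolved loans: {n_resolved}")
print(f"defaults: {n_default}, non-defaults: {n_non_default}")
print(f"default rate: {default_rate:.4f}")
```

The totals match the `value_counts()` output above (186727 non-defaults, 44068 defaults), which gives a default rate of roughly 19.1% in the resolved population.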