# Developing a PD Model for Application Scoring

# About the Dataset

The KGB ("known good/bad") dataset is a dummy dataset: a collection of loan information of the kind banks use to understand their risk. Here’s the breakdown:

Accepted Applicants Only: This dataset only includes people whose loans were approved, so every account has a known outcome. There are no rejected applicants or applicants who are still waiting for a decision.

Defining “Bad”: In this dataset, a bad account is one that is 90 or more days past due on its loan payment (delinquent).

Good vs. Bad: There are no in-between categories here. Any account that isn’t classified as bad (delinquent) is automatically considered good.

The GB Variable: The data uses a single binary variable called “GB” to indicate whether an account is good or bad.
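
As a minimal sketch of the good/bad definition, the snippet below derives a binary flag from a delinquency measure. The `days_past_due` column is hypothetical (KGB.csv ships with `GB` already computed), and coding bad as 1 is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical input: worst days past due per account (not a column in KGB.csv)
accounts = pd.DataFrame({"days_past_due": [0, 30, 89, 90, 120]})

# Assumed coding: 1 = bad (90+ days past due), 0 = good (everything else)
accounts["GB"] = (accounts["days_past_due"] >= 90).astype(int)

print(accounts)
```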

## Import Libraries and Load Data

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

# Load the data
data = pd.read_csv('KGB.csv')

# Display the first few rows of the dataset
data.head()

Unnamed: 0,TITLE,CHILDREN,PERS_H,AGE,TMADD,TMJOB1,TEL,NMBLOAN,FINLOAN,INCOME,...,DIV,CASH,PRODUCT,RESID,NAT,PROF,CAR,CARDS,GB,_freq_
0,R,0,2,46,15,33,2,0,0,0,...,0,2000,"Radio, TV, Hifi",Lease,German,Others,Car,Cheque card,0,30
1,H,4,6,34,144,54,2,1,1,3200,...,1,6000,,Owner,Turkish,,Car,no credit cards,1,1
2,H,3,5,31,108,120,2,1,1,3300,...,1,0,,Lease,Turkish,Others,Car,no credit cards,1,1
3,R,0,1,39,192,6,1,0,0,1500,...,0,2500,"Furniture,Carpet",Lease,German,Others,Without Vehicle,no credit cards,1,1
4,H,3,5,32,48,108,2,2,1,0,...,0,2500,"Furniture,Carpet",Lease,German,"Civil Service, M",Car,Cheque card,0,30


In [24]:
# Check for missing values, column data types, and the shape of the data
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 28 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   TITLE     3000 non-null   object
 1   CHILDREN  3000 non-null   int64 
 2   PERS_H    3000 non-null   int64 
 3   AGE       3000 non-null   int64 
 4   TMADD     3000 non-null   int64 
 5   TMJOB1    3000 non-null   int64 
 6   TEL       3000 non-null   int64 
 7   NMBLOAN   3000 non-null   int64 
 8   FINLOAN   3000 non-null   int64 
 9   INCOME    3000 non-null   int64 
 10  EC_CARD   3000 non-null   int64 
 11  INC       3000 non-null   int64 
 12  INC1      3000 non-null   int64 
 13  STATUS    3000 non-null   object
 14  BUREAU    3000 non-null   int64 
 15  LOCATION  3000 non-null   int64 
 16  LOANS     3000 non-null   int64 
 17  REGN      3000 non-null   int64 
 18  DIV       3000 non-null   int64 
 19  CASH      3000 non-null   int64 
 20  PRODUCT   2988 non-null   object
 21  RESID     2465

In [25]:
data.shape

(3000, 28)

In [26]:
data.columns

Index(['TITLE', 'CHILDREN', 'PERS_H', 'AGE', 'TMADD', 'TMJOB1', 'TEL',
       'NMBLOAN', 'FINLOAN', 'INCOME', 'EC_CARD', 'INC', 'INC1', 'STATUS',
       'BUREAU', 'LOCATION', 'LOANS', 'REGN', 'DIV', 'CASH', 'PRODUCT',
       'RESID', 'NAT', 'PROF', 'CAR', 'CARDS', 'GB', '_freq_'],
      dtype='object')

In [27]:
# Count the unique values in each column
print(data.nunique())

TITLE        2
CHILDREN     9
PERS_H      10
AGE         54
TMADD       32
TMJOB1      33
TEL          3
NMBLOAN      3
FINLOAN      2
INCOME      27
EC_CARD      2
INC          3
INC1         6
STATUS       6
BUREAU       3
LOCATION     2
LOANS        9
REGN         9
DIV          2
CASH        29
PRODUCT      6
RESID        2
NAT          8
PROF         9
CAR          3
CARDS        7
GB           2
_freq_       2
dtype: int64


## Age Distribution (Histogram with Violin Marginal)

In [28]:
import plotly.express as px

# Plotting Age distribution with a histogram
fig = px.histogram(data, x='AGE', nbins=20, marginal='violin', title='Age Distribution')
fig.update_layout(bargap=0.1)  # Adding a small gap between bins
fig.show()


Younger Age Predominance: Most individuals are in their 20s and early 30s.

Right-Skewed Distribution: There are fewer individuals in older age brackets.

Low Older Age Representation: Individuals above 50 are sparsely represented.

## Age vs. Income (Scatter Plot)

In [29]:
# Scatter plot of Age vs Income
fig = px.scatter(data, x='AGE', y='INCOME', title='Age vs Income')
fig.update_traces(marker=dict(size=5, opacity=0.7))  # Adjusting marker size and opacity
fig.show()


Low Income Range: Most individuals across all age groups have a low income, clustered close to the bottom of the plot.

Outlier Detection: There is a notable outlier with a significantly high income (around 100k) at an age near 50, suggesting that this individual is an exception compared to the general trend.

Minimal Age-Income Correlation: There doesn’t appear to be a clear correlation between age and income, as income levels remain consistently low regardless of age.

Income Consistency Across Ages: The spread of income is relatively uniform across ages, indicating that higher incomes aren’t necessarily concentrated in older or younger groups.

## Product vs. Income (Box Plot)

In [30]:
# Box plot of Income across Product categories
fig = px.box(data, x='PRODUCT', y='INCOME', title='Income by Product Category')
fig.update_layout(xaxis_title='Product Category', yaxis_title='Income')
fig.show()


Consistently Low Incomes Across Categories: Most income values are low across all product categories, with the majority clustered near the bottom of the chart.

Presence of Outliers: Each product category shows some outliers with higher income values, indicating a few individuals with incomes significantly above the general range within each category.

Extreme Outlier: The "Radio, TV, Hifi" category has an extreme outlier with an income around 100k, far above the other values, suggesting an anomaly or a high-income individual with a preference for this category.

Similar Income Distribution Across Categories: The overall spread and median income levels are fairly similar among the categories, indicating that income does not vary substantially by product preference in this dataset.

# Handle Missing Values and Categorical Data

Identifying categorical and numerical columns helps us preprocess them accordingly. We impute missing values for numerical data using the median and for categorical data using the most frequent value. Then, we one-hot encode the categorical variables to convert them into a numeric format.

In [31]:
# Identify categorical and numerical columns
categorical_cols = data.select_dtypes(include=['object']).columns
numerical_cols = data.select_dtypes(include=['number']).columns

# Exclude the target 'GB' from the numerical feature columns
numerical_cols = numerical_cols.drop('GB')

# Impute missing values
imputer = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), numerical_cols),
    ('cat', SimpleImputer(strategy='most_frequent'), categorical_cols)
])

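# Apply imputation and create a DataFrame
data_imputed = imputer.fit_transform(data)
data_imputed = pd.DataFrame(data_imputed, columns=numerical_cols.tolist() + categorical_cols.tolist())

# The combined array has object dtype, so restore numeric dtypes for the numerical columns
data_imputed[numerical_cols] = data_imputed[numerical_cols].apply(pd.to_numeric)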

# One-hot encode categorical variables
# The 'sparse' argument has been replaced with 'sparse_output'
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_data = encoder.fit_transform(data_imputed[categorical_cols])
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_cols))

# Combine numerical and encoded categorical data
data_preprocessed = pd.concat([data_imputed[numerical_cols], encoded_df], axis=1)

# Add the target variable 'GB'
data_preprocessed['GB'] = data['GB']

# Outlier Detection and Filtering

Outliers can skew the model’s performance. We use the Interquartile Range (IQR) method to detect and remove them: for each numerical column in turn, rows with values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are dropped.

In [32]:
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

for col in numerical_cols:
    data_preprocessed = remove_outliers(data_preprocessed, col)

# Check the shape of the dataset after removing outliers
data_preprocessed.shape

(1048, 55)
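
One property of the IQR rule is worth keeping in mind when it is applied to every numerical column: on a binary or heavily concentrated column the IQR can be zero, in which case every row that differs from the dominant value is dropped. The sketch below illustrates this on a hypothetical 0/1 flag; whether it affects any column in this dataset depends on distributions not shown here.

```python
import pandas as pd

def iqr_bounds(series):
    """Return the (lower, upper) bounds of the 1.5 * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical binary flag: 80% zeros, 20% ones
flag = pd.Series([0] * 8 + [1] * 2)

lower, upper = iqr_bounds(flag)
kept = flag[(flag >= lower) & (flag <= upper)]

print(f"bounds: [{lower}, {upper}]")
print(f"rows kept: {len(kept)} of {len(flag)}")
```

Restricting the filter to genuinely continuous columns (for example `AGE`, `INCOME`, or `CASH`) is one way to avoid this effect.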

# Data Partitioning

We split the data into training and testing sets. This is crucial for evaluating the model’s performance on unseen data.

In [33]:
X = data_preprocessed.drop('GB', axis=1)
y = data_preprocessed['GB']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Display the shape of the training and testing sets
X_train.shape, X_test.shape

((733, 54), (315, 54))

# Transforming Input Variables

Standardizing the features gives each one a mean of zero and a standard deviation of one on the training set. This matters for regularized models like logistic regression, where the penalty is applied to all coefficients on the same scale.

In [34]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
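
A small sketch of what the scaler does, using made-up one-column arrays: the mean and standard deviation are learned from the training data only, and the test data is transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data for illustration
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler_demo = StandardScaler().fit(train)   # learns mean and std from train only
train_z = scaler_demo.transform(train)
test_z = scaler_demo.transform(test)        # reuses the training statistics

print("train mean:", train_z.mean(), "train std:", train_z.std())
print("test value in training units:", test_z[0, 0])
```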

# Variable Classing and Selection

We use Logistic Regression with L1 regularization (Lasso) to select the most relevant features. L1 regularization shrinks the coefficients of less important features to exactly zero, so the features with non-zero coefficients are the ones we keep.

In [35]:
# Logistic Regression with L1 regularization
model = LogisticRegression(penalty='l1', solver='liblinear', random_state=42)
model.fit(X_train_scaled, y_train)

# Get the model coefficients
coefficients = pd.Series(model.coef_[0], index=X.columns)

# Select features with non-zero coefficients
selected_features = coefficients[coefficients != 0].index
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

# Re-fit the scaler on the selected features only (fit on train, apply to test)
X_train_selected_scaled = scaler.fit_transform(X_train_selected)
X_test_selected_scaled = scaler.transform(X_test_selected)
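
How many features survive L1 selection depends on the regularization strength `C`, which the cell above leaves at its default of 1.0. The following sketch, on synthetic data from `make_classification` (not the KGB data), counts non-zero coefficients for a few values of `C`; smaller values penalize more strongly and typically keep fewer features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)

nonzero_counts = {}
for C in (0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=42)
    clf.fit(X_toy, y_toy)
    # Count the coefficients that L1 regularization left non-zero
    nonzero_counts[C] = int(np.count_nonzero(clf.coef_[0]))

print(nonzero_counts)
```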

# Modelling and Scaling

We refit the logistic regression model using the selected features and predict probabilities.

In [36]:
# Re-fit the model with selected features
model = LogisticRegression(random_state=42)
model.fit(X_train_selected_scaled, y_train)

# Predict probabilities
y_train_pred = model.predict_proba(X_train_selected_scaled)[:, 1]
y_test_pred = model.predict_proba(X_test_selected_scaled)[:, 1]

# Standardize the predicted probabilities (z-scores; not used in the later steps)
scaler_prob = StandardScaler()
y_train_pred_scaled = scaler_prob.fit_transform(y_train_pred.reshape(-1, 1))
y_test_pred_scaled = scaler_prob.transform(y_test_pred.reshape(-1, 1))

# Model Validation

We evaluate the model’s performance with the ROC-AUC score and a classification report.

In [37]:
# ROC-AUC score
train_auc = roc_auc_score(y_train, y_train_pred)
test_auc = roc_auc_score(y_test, y_test_pred)
print(f"Train AUC: {train_auc:.2f}")
print(f"Test AUC: {test_auc:.2f}")

# Classification report
print("Classification Report (Test Data):")
print(classification_report(y_test, model.predict(X_test_selected_scaled)))

Train AUC: 1.00
Test AUC: 1.00
Classification Report (Test Data):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       136
           1       1.00      1.00      1.00       179

    accuracy                           1.00       315
   macro avg       1.00      1.00      1.00       315
weighted avg       1.00      1.00      1.00       315
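
A test AUC of exactly 1.00 is unusual for application data and is worth checking for target leakage. One candidate, judging from the `data.head()` output, is `_freq_`, which is 30 on the displayed rows with GB = 0 and 1 on the displayed rows with GB = 1. The sketch below is one possible check, run on a small hypothetical frame rather than KGB.csv; the function name and the `max_levels` threshold are illustrative choices.

```python
import pandas as pd

def perfectly_separating_columns(df, target, max_levels=10):
    """List low-cardinality columns in which every distinct value
    occurs with only one class of the target."""
    flagged = []
    for col in df.columns.drop(target):
        if df[col].nunique(dropna=False) > max_levels:
            continue  # skip high-cardinality columns such as IDs
        classes_per_value = df.groupby(col, dropna=False)[target].nunique()
        if (classes_per_value == 1).all():
            flagged.append(col)
    return flagged

# Hypothetical rows mimicking the pattern visible in data.head()
toy = pd.DataFrame({
    "AGE":    [46, 34, 31, 39, 32, 46],
    "_freq_": [30, 1, 1, 1, 30, 1],
    "GB":     [0, 1, 1, 1, 0, 1],
})

print(perfectly_separating_columns(toy, "GB"))
```

If the same check flags `_freq_` on the full data, dropping that column from the features before fitting would be the natural next step.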



# Reject Inference (concept only)

While our code builds a credit risk model using the KGB dataset (containing only approved loans), it doesn’t account for selection bias. This bias arises because the model is trained solely on accepted applicants, whose outcomes are known, and excludes rejected applicants, whose outcomes were never observed.

Reject inference offers a solution. It involves using the current model to score the rejected applications (not included in KGB.csv). These scores can then be used to classify each rejected application as an inferred good or an inferred bad.

The next step would be to create an augmented dataset (AGB). This would involve combining the existing KGB data (known goods and bads) with the inferred classifications (good or bad) from the rejected applications. This augmented dataset would represent a more complete picture of the “through-the-door” population (all applicants).

Finally, you could use this augmented dataset (AGB) to train a new credit risk model. This new model would consider both known and inferred outcomes, potentially leading to a more accurate prediction of loan defaults.
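
The steps above can be sketched with a simple hard-cutoff approach. Everything in this snippet is illustrative: the accepted and rejected applicants are randomly generated, the feature names `x1` and `x2` are placeholders, the 0.5 cutoff is arbitrary, and GB = 1 is assumed to mean bad. Other reject inference methods (for example fuzzy augmentation or parceling) assign the inferred outcomes differently.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = ["x1", "x2"]

# Hypothetical accepted applicants with known outcomes (the KGB role)
accepted = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
accepted["GB"] = (accepted["x1"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hypothetical rejected applicants: features only, no observed outcome
rejected = pd.DataFrame({"x1": rng.normal(loc=0.5, size=80), "x2": rng.normal(size=80)})

# Step 1: fit the model on known outcomes
kgb_model = LogisticRegression().fit(accepted[features], accepted["GB"])

# Step 2: score the rejects and assign an inferred outcome with a hard cutoff
cutoff = 0.5
rejected_inferred = rejected.copy()
reject_pd = kgb_model.predict_proba(rejected[features])[:, 1]
rejected_inferred["GB"] = (reject_pd >= cutoff).astype(int)

# Step 3: build the augmented (AGB) dataset
agb = pd.concat(
    [accepted.assign(source="known"), rejected_inferred.assign(source="inferred")],
    ignore_index=True,
)

# Step 4: refit on the augmented data
agb_model = LogisticRegression().fit(agb[features], agb["GB"])
print(agb.groupby("source")["GB"].agg(["count", "mean"]))
```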

# Generating a Scorecard Report

In [38]:
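# Function to transform probabilities into scores
def prob_to_score(prob, base_score=600, pdo=50):
    """Map a probability to a score on a log-odds scale.

    A probability of 0.5 (odds 1:1) maps to base_score, and each doubling of
    the odds prob / (1 - prob) adds pdo points, so the score rises with prob.
    """
    odds = prob / (1 - prob)
    score = base_score + pdo / np.log(2) * np.log(odds)
    return score

# Collect the predicted probabilities and scores in their own DataFrame,
# leaving the feature matrix X_test unchanged
scorecard = pd.DataFrame({'PD_Score': y_test_pred}, index=X_test.index)
scorecard['Score'] = scorecard['PD_Score'].apply(prob_to_score)

# Display the test data with predicted scores
print(scorecard.head(15))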

      PD_Score       Score
2111  0.994129  970.179383
700   0.007763  250.103739
1203  0.994129  970.179383
1230  0.007763  250.103739
2446  0.007763  250.103739
940   0.994129  970.179383
1160  0.994129  970.179383
2122  0.994129  970.179383
1352  0.994129  970.179383
2725  0.007763  250.103739
1668  0.007763  250.103739
2611  0.007763  250.103739
1593  0.007763  250.103739
2115  0.994129  970.179383
1552  0.994129  970.179383
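
In the output above, the score increases with the predicted probability. If that probability is the probability of default (that is, if GB = 1 denotes bad, which is an assumption here), a common scorecard convention runs the other way: scores are scaled on the odds of being good, so a higher score means lower risk. A minimal sketch of that convention follows; the function name `pd_to_score` and the `base_odds` parameter are illustrative and not part of the notebook.

```python
import numpy as np

def pd_to_score(pd_value, base_score=600, pdo=50, base_odds=1.0):
    """Scale a probability of default to a score where higher means safer.

    base_score is the score at good:bad odds of base_odds, and every
    doubling of the good:bad odds adds pdo points.
    """
    factor = pdo / np.log(2)
    offset = base_score - factor * np.log(base_odds)
    odds_good = (1 - pd_value) / pd_value
    return offset + factor * np.log(odds_good)

# PD of 50% gives odds 1:1, PD of 1/3 gives odds 2:1 (one doubling)
print(pd_to_score(0.5))
print(pd_to_score(1 / 3))
```

With `base_odds=1.0`, this is the notebook's formula with the odds inverted, so each score is mirrored around `base_score`.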
