# Credit Card Approval Prediction

To decide whether to issue a credit card, financial institutions use credit scores, which draw on personal data submitted by applicants to predict the probability of future defaults and credit card borrowing.

Credit scores can objectively quantify the magnitude of the risk.

In this project, we build a Machine Learning model to predict whether an applicant is a __good__ or __bad__ client. This is a Binary Classification problem, and we use __Logistic Regression__ as our model.

The __application_record__ and __credit_record__ tables are merged into a single dataset. The target class is __STATUS__, and we use __Vintage Analysis__ to construct the label.

In [2]:
import glob
import pandas as pd
import numpy as np

In [3]:
glob.glob("*.csv")

['application_record.csv', 'credit_record.csv']

In [155]:
credit = pd.read_csv('credit_record.csv')
application = pd.read_csv('application_record.csv') 

In [156]:
credit['STATUS'] = credit['STATUS'].apply(lambda x: int(x) if x.isnumeric() else x)

In [157]:
grouped = credit.groupby('ID')

In [158]:
pivot_table = credit.pivot(index='ID', columns='MONTHS_BALANCE', values='STATUS')

We used a pivot table to get an expanded view of each customer's status by month in our credit data.

In [202]:
pivot_table

MONTHS_BALANCE,ID,OPEN_MONTH,END_MONTH,WINDOW
0,5001711,-3,0,3
1,5001712,-18,0,18
2,5001713,-21,0,21
3,5001714,-14,0,14
4,5001715,-59,0,59
...,...,...,...,...
45980,5150482,-28,-11,17
45981,5150483,-17,0,17
45982,5150484,-12,0,12
45983,5150485,-1,0,1


In [160]:
pivot_table['OPEN_MONTH'] = grouped['MONTHS_BALANCE'].min()
pivot_table['END_MONTH'] = grouped['MONTHS_BALANCE'].max()
pivot_table['ID'] = pivot_table.index

In [161]:
# take an explicit copy so that adding WINDOW below does not raise SettingWithCopyWarning
pivot_table = pivot_table[['ID', 'OPEN_MONTH', 'END_MONTH']].copy()

In [162]:
pivot_table['WINDOW'] = pivot_table['END_MONTH'] - pivot_table['OPEN_MONTH']


In [163]:
pivot_table.head()

MONTHS_BALANCE,ID,OPEN_MONTH,END_MONTH,WINDOW
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5001711,5001711,-3,0,3
5001712,5001712,-18,0,18
5001713,5001713,-21,0,21
5001714,5001714,-14,0,14
5001715,5001715,-59,0,59
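
The same per-customer summary can also be built without the pivot step. The following is a minimal sketch on a small made-up frame (the `toy` data and the helper name `account_windows` are assumptions for illustration, not part of the project) that uses `groupby(...).agg` to derive OPEN_MONTH, END_MONTH and WINDOW in one pass.

```python
import pandas as pd

# Hypothetical toy data with the same columns as credit_record.csv
toy = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, -3, -5, -6],
    "STATUS": ["X", "0", "0", "0", "C", "1"],
})

def account_windows(credit: pd.DataFrame) -> pd.DataFrame:
    """Return one row per ID with the first month, last month and window length."""
    summary = (
        credit.groupby("ID")["MONTHS_BALANCE"]
        .agg(OPEN_MONTH="min", END_MONTH="max")  # named aggregation on a single column
        .reset_index()
    )
    summary["WINDOW"] = summary["END_MONTH"] - summary["OPEN_MONTH"]
    return summary

windows = account_windows(toy)
print(windows)
```

Because the result is a fresh DataFrame with ID as an ordinary column, it can be merged back onto the monthly records with `pd.merge(..., on="ID")` directly.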


In [164]:
pivot_table.reset_index(drop=True, inplace=True)

In [165]:
credit = pd.merge(credit, pivot_table, on="ID", how="left")

In [166]:
credit.head()

Unnamed: 0,ID,MONTHS_BALANCE,STATUS,OPEN_MONTH,END_MONTH,WINDOW
0,5001711,0,X,-3,0,3
1,5001711,-1,0,-3,0,3
2,5001711,-2,0,-3,0,3
3,5001711,-3,0,-3,0,3
4,5001712,0,C,-18,0,18


In [168]:
credit_copy = credit.copy()

## Vintage Analysis

In this project, Vintage Analysis is used to construct the label. It is a popular method for managing credit risk. Vintage analysis measures the performance of a portfolio at different periods of time after the loan (or credit card) was granted. Performance can be measured as the cumulative charge-off rate, the proportion of customers 30/60/90 days past due, the utilization ratio or the average balance.

We label customers who are __60 days__ or more past due as _'bad'_ (1), and all others as _'good'_ (0). Records labelled 'bad' are those where STATUS is equal to 2, 3, 4 or 5.

We keep only accounts whose Performance Window is longer than __20 months__. Customers who were 60 or more days past due during this window are considered __bad__ and labelled __1__ in the STATUS variable.
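
The notebook code assigns the label to each monthly record. As an illustration only, the sketch below shows how a label per customer could be derived instead, marking a customer as bad (1) if any month has STATUS 2, 3, 4 or 5. The toy frame and the helper name `label_customers` are assumptions, not part of the original project.

```python
import pandas as pd

BAD_STATUSES = {2, 3, 4, 5}  # 60+ days past due, as defined in the text

# Hypothetical monthly records; STATUS mixes ints and the letters 'C'/'X'
toy = pd.DataFrame({
    "ID": [10, 10, 10, 11, 11],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS": [0, 2, "C", "X", 1],
})

def label_customers(credit: pd.DataFrame) -> pd.Series:
    """Return 1 for customers with any 60+ day delinquency, else 0."""
    is_bad_month = credit["STATUS"].isin(BAD_STATUSES)
    # max() over booleans is True if at least one month was bad
    return is_bad_month.groupby(credit["ID"]).max().astype("int8")

labels = label_customers(toy)
print(labels)
```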

In [169]:
# keep only accounts with an observation window longer than 20 months
credit = credit[credit['WINDOW'] > 20].copy()

In [170]:
# 60 days or more past due - BAD customer
credit['STATUS'] = np.where(credit['STATUS'].isin([2,3,4,5]), 1, 0)

In [171]:
credit['STATUS'] = credit['STATUS'].astype(np.int8)

In [172]:
credit

Unnamed: 0,ID,MONTHS_BALANCE,STATUS,OPEN_MONTH,END_MONTH,WINDOW
23,5001713,0,0,-21,0,21
24,5001713,-1,0,-21,0,21
25,5001713,-2,0,-21,0,21
26,5001713,-3,0,-21,0,21
27,5001713,-4,0,-21,0,21
...,...,...,...,...,...,...
1048570,5150487,-25,0,-29,0,29
1048571,5150487,-26,0,-29,0,29
1048572,5150487,-27,0,-29,0,29
1048573,5150487,-28,0,-29,0,29


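The credit and application data are then merged into a single dataframe with a left join on ID, so credit records with no matching application get NaN in the application columns.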

In [173]:
df = pd.merge(credit, application, on="ID", how='left')

In [174]:
len(credit), len(application)

(775282, 438557)

In [175]:
df.head()

Unnamed: 0,ID,MONTHS_BALANCE,STATUS,OPEN_MONTH,END_MONTH,WINDOW,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,...,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
0,5001713,0,0,-21,0,21,,,,,...,,,,,,,,,,
1,5001713,-1,0,-21,0,21,,,,,...,,,,,,,,,,
2,5001713,-2,0,-21,0,21,,,,,...,,,,,,,,,,
3,5001713,-3,0,-21,0,21,,,,,...,,,,,,,,,,
4,5001713,-4,0,-21,0,21,,,,,...,,,,,,,,,,
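
Since the application columns above are all NaN for these rows, it can help to check how many credit records actually have a matching application. A minimal sketch, using the `indicator` option of `pd.merge` on made-up frames (the toy data is an assumption):

```python
import pandas as pd

# Hypothetical stand-ins for the credit and application tables
credit_toy = pd.DataFrame({"ID": [1, 1, 2, 3], "STATUS": [0, 1, 0, 0]})
application_toy = pd.DataFrame({"ID": [1, 3], "CODE_GENDER": ["F", "M"]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join
merged = pd.merge(credit_toy, application_toy, on="ID", how="left", indicator=True)
match_counts = merged["_merge"].value_counts()
unmatched = merged[merged["_merge"] == "left_only"]

# An inner join keeps only the records that have an application
inner = pd.merge(credit_toy, application_toy, on="ID", how="inner")

print(match_counts)
print(len(unmatched), len(inner))
```

If a large share of rows turns out to be `left_only`, filling them with column modes gives those customers identical application features, so an inner join may be worth considering.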


We filled the NaN values in each column with that column's modal value.

In [177]:
modes = df.mode().iloc[0]

In [178]:
df.columns

Index(['ID', 'MONTHS_BALANCE', 'STATUS', 'OPEN_MONTH', 'END_MONTH', 'WINDOW',
       'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS'],
      dtype='object')

In [179]:
def proc_data(df):
    df.fillna(modes, inplace=True)
    df['CODE_GENDER'] = pd.Categorical(df.CODE_GENDER)
    df['FLAG_OWN_CAR'] = pd.Categorical(df.FLAG_OWN_CAR)
    df['FLAG_OWN_REALTY'] = pd.Categorical(df.FLAG_OWN_REALTY)
    df['NAME_INCOME_TYPE'] = pd.Categorical(df.NAME_INCOME_TYPE)
    df['NAME_EDUCATION_TYPE'] = pd.Categorical(df.NAME_EDUCATION_TYPE)
    df['NAME_FAMILY_STATUS'] = pd.Categorical(df.NAME_FAMILY_STATUS)
    df['NAME_HOUSING_TYPE'] = pd.Categorical(df.NAME_HOUSING_TYPE)
    df['FLAG_MOBIL'] = pd.Categorical(df.FLAG_MOBIL)
    df['FLAG_WORK_PHONE'] = pd.Categorical(df.FLAG_WORK_PHONE)
    df['FLAG_PHONE'] = pd.Categorical(df.FLAG_PHONE)
    df['FLAG_EMAIL'] = pd.Categorical(df.FLAG_EMAIL)
    df['OCCUPATION_TYPE'] = pd.Categorical(df.OCCUPATION_TYPE)

In [180]:
proc_data(df)
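
The mode fill and the categorical conversion can be tried out on a tiny frame. The sketch below is an illustration under assumptions: the toy data, the `cat_cols` list and the helper name `fill_and_encode` are made up, and unlike `proc_data` it returns a new frame instead of modifying its input.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one categorical and one numeric column, each with a gap
toy = pd.DataFrame({
    "CODE_GENDER": ["F", "M", np.nan, "F"],
    "CNT_CHILDREN": [0.0, 2.0, 0.0, np.nan],
})
cat_cols = ["CODE_GENDER"]  # hypothetical subset of the categorical columns

def fill_and_encode(df: pd.DataFrame, cat_cols: list) -> pd.DataFrame:
    """Fill NaNs with each column's mode, then convert cat_cols to categoricals."""
    out = df.copy()
    modes = out.mode().iloc[0]  # first modal value of every column
    out = out.fillna(modes)
    for col in cat_cols:
        out[col] = pd.Categorical(out[col])
    return out

clean = fill_and_encode(toy, cat_cols)
print(clean)
```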

There are __12__ categorical features, __8__ continuous features and __1__ target. Note that the __8__ continuous features include __STATUS__, which is also the target.

In [181]:
cats = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', \
       'NAME_HOUSING_TYPE', 'FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE']

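# NOTE: 'STATUS' is also the target (dep); listing it here feeds the label to the model as an input
conts = ['MONTHS_BALANCE','STATUS','WINDOW','CNT_CHILDREN','AMT_INCOME_TOTAL','DAYS_BIRTH', \
        'DAYS_EMPLOYED','CNT_FAM_MEMBERS']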

dep = 'STATUS'
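
Because `dep` also appears in `conts` above, the label ends up among the model inputs. One possible way to guard against this, sketched here with shortened hypothetical lists and a hypothetical helper `feature_columns`, is to filter the target out when the feature list is assembled:

```python
# Shortened, hypothetical versions of the lists above
cats_demo = ['CODE_GENDER', 'FLAG_OWN_CAR']
conts_demo = ['MONTHS_BALANCE', 'STATUS', 'WINDOW', 'AMT_INCOME_TOTAL']
dep_demo = 'STATUS'

def feature_columns(cats, conts, dep):
    """Return the input columns, never including the dependent variable."""
    return [col for col in cats + conts if col != dep]

features = feature_columns(cats_demo, conts_demo, dep_demo)
print(features)
```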

In [182]:
df_copy = df.copy()

In [183]:
df_id = df['ID']

In [184]:
df = df.drop(['OPEN_MONTH', 'END_MONTH', 'ID'], axis=1)

Because the dataset is imbalanced, we used __Undersampling__, which balances a skewed class distribution by discarding examples from the majority class. We randomly selected as many __good__ examples as there are __bad__ ones, giving a __1:1__ ratio in the __target__ class.

In [185]:
no_record_bad = len(df[df.STATUS == 1])
bad_indices = df[df.STATUS == 1].index
good_indices = df[df.STATUS == 0].index

random_good_indices = np.random.choice(good_indices, no_record_bad, replace=False)

under_sample_indices = np.concatenate([random_good_indices, bad_indices])

In [186]:
# the sampled values are index labels, so select them with .loc rather than positional .iloc
under_sample_data = df.loc[under_sample_indices,:]
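
The sampling step above draws from NumPy's global random state, so the selected rows can differ between runs. A minimal sketch of a reproducible variant, assuming a seeded `np.random.default_rng` generator and a made-up toy frame (the helper name `undersample` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced data: 2 bad (1) and 8 good (0) records
toy = pd.DataFrame({
    "STATUS": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    "WINDOW": range(21, 31),
})

def undersample(df: pd.DataFrame, target: str = "STATUS", seed: int = 42) -> pd.DataFrame:
    """Keep all minority rows and an equally sized random sample of majority rows."""
    rng = np.random.default_rng(seed)
    bad_idx = df.index[df[target] == 1]
    good_idx = df.index[df[target] == 0]
    sampled_good = rng.choice(good_idx, size=len(bad_idx), replace=False)
    keep = np.concatenate([sampled_good, bad_idx])
    return df.loc[keep]  # index labels, hence .loc

balanced = undersample(toy)
print(balanced["STATUS"].value_counts())
```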

We then split the undersampled dataset into train and validation datasets, using the train data to fit our model and the validation data to measure the error metrics of the trained model.

In [187]:
from numpy import random
from sklearn.model_selection import train_test_split

random.seed(42)
trn_df, val_df = train_test_split(under_sample_data, test_size=0.25)
trn_df[cats] = trn_df[cats].apply(lambda x: x.cat.codes)
val_df[cats] = val_df[cats].apply(lambda x: x.cat.codes)

In [188]:
def xs_y(df):
    xs = df[cats+conts].copy()
    return xs, df[dep] if dep in df else None

trn_xs, trn_y = xs_y(trn_df)
val_xs, val_y = xs_y(val_df)

A Logistic Regression model is used for our data, which is a common choice for Binary Classification. The __C__ parameter (0.01) is the inverse of the regularization strength: the smaller the value, the stronger the regularization. The algorithm used for optimization is __liblinear__, which is a good choice for small datasets.

To limit overfitting, we used an __L1__ penalty. L1 penalizes the absolute value of the weights, which shrinks them towards 0 and can set some of them exactly to 0, so it also acts as a kind of feature selection.
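
The effect of the penalty on the weights can be checked by counting the coefficients that are exactly zero. The sketch below is an illustration on synthetic data from `make_classification` (the data and the helper name `count_zero_coefs` are assumptions); it fits the same solver and `C` with an L1 and an L2 penalty and compares the counts.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 3 of which carry signal
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

def count_zero_coefs(penalty: str) -> int:
    """Fit a liblinear logistic regression and count coefficients equal to 0."""
    model = LogisticRegression(solver="liblinear", C=0.01, penalty=penalty, random_state=0)
    model.fit(X, y)
    return int(np.sum(model.coef_ == 0))

n_zero_l1 = count_zero_coefs("l1")
n_zero_l2 = count_zero_coefs("l2")
print("zero coefficients with L1:", n_zero_l1)
print("zero coefficients with L2:", n_zero_l2)
```

With an L1 penalty the uninformative features are typically the ones dropped, whereas L2 tends to keep small non-zero weights on all of them.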

In [208]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, accuracy_score, roc_auc_score, mean_squared_error

In [190]:
lr = LogisticRegression(solver='liblinear', C = 0.01, random_state=0, penalty='l1')
lr.fit(trn_xs, trn_y)
roc_auc_score(val_y, lr.predict_proba(val_xs)[:,1])

1.0

## Metrics

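We have an __ROC AUC Score__ of __1.0__ and a __Mean Absolute Error__ of __0.0__. We also report the Mean Accuracy from `lr.score`, which for a binary classifier is the same quantity as `accuracy_score`: the fraction of samples predicted correctly. We have a __Mean Accuracy__ of __1.0__.

These scores are perfect on the validation data, but they should not be read as real predictive performance: __STATUS__, the target, is also listed in __conts__, so the model receives the label as an input feature. The metrics need to be recomputed with __STATUS__ removed from the features before the model can be judged.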

In [210]:
print('ROC AUC Score: ', roc_auc_score(val_y, lr.predict_proba(val_xs)[:,1]))
print('Mean Absolute Error: ', mean_absolute_error(val_y, lr.predict(val_xs)))
print('Mean Squared Error: ', mean_squared_error(val_y, lr.predict(val_xs)))
print('Accuracy: ', accuracy_score(val_y, lr.predict(val_xs)))
print('Mean Accuracy:', lr.score(val_xs, val_y))

ROC AUC Score:  1.0
Mean Absolute Error:  0.0
Mean Squared Error:  0.0
Accuracy:  1.0
Mean Accuracy: 1.0
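
A perfect score on every metric is usually worth a second look. The sketch below is an illustration on random synthetic data (all names and data are assumptions, not the project's): it fits one model on noise features only and another on the same features plus the label itself, which mimics what happens when the target is included among the inputs. It also checks that `score` and `accuracy_score` report the same number.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
noise = rng.normal(size=(n, 3))   # features unrelated to the label
y = rng.integers(0, 2, size=n)    # random 0/1 label

def validation_scores(X, y):
    """Return (ROC AUC, accuracy_score, model.score) on a held-out split."""
    X_trn, X_val, y_trn, y_val = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )
    model = LogisticRegression(solver="liblinear").fit(X_trn, y_trn)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    acc = accuracy_score(y_val, model.predict(X_val))
    return auc, acc, model.score(X_val, y_val)

auc_clean, acc_clean, score_clean = validation_scores(noise, y)

# Append the label as an extra input column to simulate target leakage
leaky = np.column_stack([noise, y])
auc_leaky, acc_leaky, score_leaky = validation_scores(leaky, y)

print("AUC without the label as a feature:", auc_clean)
print("AUC with the label as a feature:   ", auc_leaky)
```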
