<a href="https://colab.research.google.com/github/dileep-rawat/Capstone_Project_3-Credit_Card_default_Prediction/blob/main/Credit_Card_Default_Prediction_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u> Project Title : Predicting whether a customer will default on his/her credit card </u></b>

**Project Type**    - Classification

**Contribution**    - Individual

**Index:**

1. Problem statement
2. Importing Essential Libraries
3. Mounting drive
4. Data Exploration
5. Preprocessing & Data Cleaning
6. Exploratory Data Analysis
7. Feature engineering
8. ML model implementation
9. XG Boost model explainability using Shapley values
10. Results
11. Summary and conclusions
12. References

## <b> Problem Description </b>

### This project aims to predict credit card payment defaults among customers in Taiwan. From a risk management perspective, an accurate estimate of the probability of default is more valuable than a binary classification of clients as credible or not credible. We can use the [K-S chart](https://www.listendata.com/2019/07/KS-Statistics-Python.html) to evaluate how well the predicted probabilities separate customers who default on their credit card payments from those who do not.
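To make the K-S idea concrete, here is a minimal sketch of a K-S statistic computed from predicted default probabilities. The function name `ks_statistic` and the toy labels and scores are illustrative assumptions, not part of the original notebook; the statistic is taken as the largest gap between the cumulative share of defaulters and the cumulative share of non-defaulters when customers are ranked by predicted probability.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Largest gap between cumulative defaulter and non-defaulter shares."""
    y_true = np.asarray(y_true, dtype=int)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Rank customers from highest to lowest predicted probability
    order = np.argsort(-y_score, kind="stable")
    labels = y_true[order]
    scores = y_score[order]

    # Cumulative share of each class captured down the ranking
    cum_pos = np.cumsum(labels) / n_pos
    cum_neg = np.cumsum(1 - labels) / n_neg

    # Only compare at the end of each group of tied scores
    is_last_of_tie = np.append(scores[1:] != scores[:-1], True)
    return float(np.max(np.abs(cum_pos - cum_neg)[is_last_of_tie]))

# Hypothetical toy data: 1 = default, 0 = no default
y_true = [1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
ks = ks_statistic(y_true, y_prob)
```

A value close to 1 means the scores separate the two groups almost perfectly, while a value close to 0 means the ranking carries little information.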


## <b> Data Description </b>

### <b>Attribute Information: </b>

### This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:
* ### X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
* ### X2: Gender (1 = male; 2 = female).
* ### X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
* ### X4: Marital status (1 = married; 2 = single; 3 = others).
* ### X5: Age (year).
* ### X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
* ### X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
* ### X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.
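The repayment status scale for X6 - X11 can be written as a small lookup helper. This is only a sketch: the function name `describe_repayment_status` is hypothetical, and any value outside the documented scale is reported as undocumented rather than guessed. The data preview later in this notebook does contain values such as 0 and -2 that the description above does not define.

```python
def describe_repayment_status(code: int) -> str:
    """Translate a repayment status code using only the documented scale."""
    if code == -1:
        return "pay duly"
    if 1 <= code <= 8:
        unit = "month" if code == 1 else "months"
        return f"payment delay for {code} {unit}"
    if code == 9:
        return "payment delay for nine months and above"
    # Codes such as 0 or -2 are not covered by the attribute description
    return "undocumented code"

labels = [describe_repayment_status(c) for c in (-1, 1, 2, 9, 0, -2)]
```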

# **2. Importing Essential Libraries:-**

In [1]:
import pandas as pd
import numpy as np

# Importing Visualization Libraries
import seaborn as sns
import matplotlib.pyplot as plt


# Importing warnings to suppress warning messages
import warnings
warnings.filterwarnings("ignore")


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
credit_df = pd.read_excel('/content/drive/MyDrive/Almabetter/Capstone Project/Credit Card Default Prediction/default of credit card clients.xls')

In [4]:
# Checking the shape (number of rows and columns)
credit_df.shape

(30001, 25)

In [5]:
# Checking first 5 rows
credit_df.head(5)

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
2,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
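The preview shows that row 0 holds the original field names (ID, LIMIT_BAL, SEX, ...) rather than data, which is why the shape reports 30001 rows. As a minimal sketch on a toy frame, that row can be promoted to the header in one step; the helper name `promote_first_row_to_header` and the toy values are illustrative assumptions.

```python
import pandas as pd

def promote_first_row_to_header(df: pd.DataFrame) -> pd.DataFrame:
    """Use the first data row as column names and drop it from the data."""
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    return out

# Toy frame mimicking the layout of the preview above
raw = pd.DataFrame(
    [["ID", "LIMIT_BAL", "SEX"], [1, 20000, 2], [2, 120000, 2]],
    columns=["Unnamed: 0", "X1", "X2"],
)
clean = promote_first_row_to_header(raw)
```

If the file layout matches the preview, passing `header=1` to `pd.read_excel` should achieve the same result at load time; that alternative is not what this notebook does, and the column values would still need a dtype check afterwards.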


# <b> Preprocessing </b>

In [7]:
credit_df.drop(columns='Unnamed: 0',inplace=True)

In [8]:
# creating list of columns name for renaming data frame column names
columns = ['Limit_bal','Gender','Education','Marital_status','Age','Repayment_September','Repayment_August','Repayment_July','Repayment_June','Repayment_May',
           'Repayment_April','Sep_Bill','Aug_Bill','July_Bill','June_Bill','May_Bill','Apr_Bill','Pay_Sep','Pay_Aug','Pay_July','Pay_June','Pay_May','Pay_April','Defaulter'] 

In [9]:
# Replace the column names with the columns list
# (direct assignment also works on pandas 2.x, where set_axis no longer accepts inplace)
credit_df.columns = columns

In [10]:
# Dropping row 0, which holds the original header names, and resetting the index
credit_df=credit_df.drop(0,axis=0).reset_index(drop=True)

In [13]:
credit_df.head()

Unnamed: 0,Limit_bal,Gender,Education,Marital_status,Age,Repayment_September,Repayment_August,Repayment_July,Repayment_June,Repayment_May,...,June_Bill,May_Bill,Apr_Bill,Pay_Sep,Pay_Aug,Pay_July,Pay_June,Pay_May,Pay_April,Defaulter
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [14]:
# Checking the shape of data
credit_df.shape

(30000, 24)

In [15]:
# Checking information about each column; note that every column has dtype object
credit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 24 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Limit_bal            30000 non-null  object
 1   Gender               30000 non-null  object
 2   Education            30000 non-null  object
 3   Marital_status       30000 non-null  object
 4   Age                  30000 non-null  object
 5   Repayment_September  30000 non-null  object
 6   Repayment_August     30000 non-null  object
 7   Repayment_July       30000 non-null  object
 8   Repayment_June       30000 non-null  object
 9   Repayment_May        30000 non-null  object
 10  Repayment_April      30000 non-null  object
 11  Sep_Bill             30000 non-null  object
 12  Aug_Bill             30000 non-null  object
 13  July_Bill            30000 non-null  object
 14  June_Bill            30000 non-null  object
 15  May_Bill             30000 non-null  object
 16  Apr_

# **5. Pre-processing & Data Cleaning:**

Data cleaning is done in the following steps:-  
1) Remove duplicate rows  
2) Handling missing values.  
3) Convert columns to appropriate datatypes.  
4) Adding new features and renaming the features

In [19]:
credit_df.duplicated().value_counts()  # True means the row is a duplicate

False    29965
True        35
dtype: int64

So we have 35 duplicate rows in our data, and we will drop them.

In [20]:
# Dropping duplicate values
credit_df.drop_duplicates(inplace = True)

In [21]:
# Exploring shape
credit_df.shape

(29965, 24)
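The duplicate check and removal above can be condensed as follows. This is a sketch on a hypothetical toy frame: summing the boolean mask gives the duplicate count directly, and resetting the index afterwards is an optional extra step (not done in this notebook) that avoids gaps in the row labels.

```python
import pandas as pd

# Toy frame with one exact duplicate row
toy = pd.DataFrame(
    {"Limit_bal": [20000, 20000, 50000], "Age": [24, 24, 37], "Defaulter": [1, 1, 0]}
)

# True values count as 1, so the sum is the number of duplicate rows
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
```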

In [22]:
# Checking null values per column
credit_df.isna().sum().sort_values(ascending= False).reset_index().rename(columns={'index':'Columns',0:'Null values'})

Unnamed: 0,Columns,Null values
0,Limit_bal,0
1,Gender,0
2,Pay_April,0
3,Pay_May,0
4,Pay_June,0
5,Pay_July,0
6,Pay_Aug,0
7,Pay_Sep,0
8,Apr_Bill,0
9,May_Bill,0


So the dataset has no missing values.

## Step-3:  Convert columns to appropriate datatypes:

In [23]:
# Converting all columns from object to int in a single call
credit_df = credit_df.astype('int')
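As a hedged alternative to the conversion above, `pd.to_numeric` can be applied column by column so that any non-numeric entry raises an error instead of passing through unnoticed. The toy frame below is an assumption for illustration and does not use the real dataset.

```python
import pandas as pd

# Toy frame whose columns are stored as object, like the loaded data
toy = pd.DataFrame({"Limit_bal": [20000, 120000], "Age": [24, 26]}, dtype="object")

# errors="raise" makes the conversion fail loudly on unparseable values
converted = toy.apply(pd.to_numeric, errors="raise")
```

This infers the narrowest suitable numeric type per column, so a column containing decimals would become float rather than being truncated to int.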

In [24]:
credit_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29965 entries, 0 to 29999
Data columns (total 24 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Limit_bal            29965 non-null  int64
 1   Gender               29965 non-null  int64
 2   Education            29965 non-null  int64
 3   Marital_status       29965 non-null  int64
 4   Age                  29965 non-null  int64
 5   Repayment_September  29965 non-null  int64
 6   Repayment_August     29965 non-null  int64
 7   Repayment_July       29965 non-null  int64
 8   Repayment_June       29965 non-null  int64
 9   Repayment_May        29965 non-null  int64
 10  Repayment_April      29965 non-null  int64
 11  Sep_Bill             29965 non-null  int64
 12  Aug_Bill             29965 non-null  int64
 13  July_Bill            29965 non-null  int64
 14  June_Bill            29965 non-null  int64
 15  May_Bill             29965 non-null  int64
 16  Apr_Bill             2