
# <b><u> Project Title : Predicting whether a customer will default on his/her credit card </u></b>

## <b> Problem Description </b>

### This project aims to predict credit card payment defaults among customers in Taiwan. From a risk-management perspective, an accurate estimate of the probability of default is more valuable than a binary classification of clients as credible or not credible. We can use the [K-S chart](https://www.listendata.com/2019/07/KS-Statistics-Python.html) to evaluate how well the predicted probabilities separate customers who default on their credit card payments from those who do not.
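
The K-S statistic behind that chart is the largest gap between the cumulative score distributions of defaulters and non-defaulters. The block below is a minimal pure-Python sketch under that definition. The function name `ks_statistic` and the toy labels and scores are illustrative assumptions, not part of this project's code or data.

```python
def ks_statistic(y_true, y_score):
    """Largest gap between the cumulative score distributions of the two classes.

    y_true holds 0/1 labels (1 = default); y_score holds predicted default probabilities.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Walk through the customers from the highest score to the lowest.
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        # Process tied scores together so ties cannot create an artificial gap.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best


# Toy example (made-up values): scores that separate the classes perfectly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.8, 0.9]
print(ks_statistic(toy_labels, toy_scores))
```

A value near 1 means the scores rank almost all defaulters above non-defaulters. A value near 0 means the scores carry little ranking information.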


## <b> Data Description </b>

### <b>Attribute Information: </b>

### This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:
* ### X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
* ### X2: Gender (1 = male; 2 = female).
* ### X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
* ### X4: Marital status (1 = married; 2 = single; 3 = others).
* ### X5: Age (year).
* ### X6-X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
* ### X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005.
* ### X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005.
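
The X1 to X23 labels above do not appear in the spreadsheet itself, which uses names such as `LIMIT_BAL` and `PAY_0`. The sketch below pairs the two and decodes the repayment-status scale. The pairing order is an assumption based on the order of the descriptions, and the helper names `X_TO_COLUMN` and `describe_repayment_status` are hypothetical.

```python
# Column names as they appear in the loaded spreadsheet, in the assumed X1..X23 order.
FEATURE_COLUMNS = (
    ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]   # X1-X5
    + ["PAY_0"] + [f"PAY_{i}" for i in range(2, 7)]        # X6-X11 (there is no PAY_1)
    + [f"BILL_AMT{i}" for i in range(1, 7)]                # X12-X17
    + [f"PAY_AMT{i}" for i in range(1, 7)]                 # X18-X23
)

# Assumed mapping from the documented labels to the column names.
X_TO_COLUMN = {f"X{i}": col for i, col in enumerate(FEATURE_COLUMNS, start=1)}


def describe_repayment_status(code):
    """Translate a repayment-status code using only the documented scale."""
    if code == -1:
        return "paid duly"
    if 1 <= code <= 8:
        return f"payment delayed {code} month(s)"
    if code == 9:
        return "payment delayed nine months or more"
    # Any other code is not covered by the attribute description above.
    return "undocumented code"


print(X_TO_COLUMN["X6"], X_TO_COLUMN["X23"])
print(describe_repayment_status(2))
```

Codes outside the documented scale are reported as undocumented rather than guessed at.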

***LEARNING CATBOOST ALGORITHM WITHOUT PREPROCESSING***

In [1]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.0.3-cp37-none-manylinux1_x86_64.whl (76.3 MB)
     |████████████████████████████████| 76.3 MB 1.4 MB/s
Installing collected packages: catboost
Successfully installed catboost-1.0.3


In [4]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math  # standard-library module; the numpy.math alias was removed in NumPy 2.0
import statsmodels.api as sm
from catboost import CatBoostClassifier
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score,accuracy_score,recall_score

In [5]:
# Define a helper that summarizes a DataFrame column by column
def resumetable(df):
  print(f"Dataset shape: {df.shape}")
  summary=pd.DataFrame(df.dtypes,columns=['dtypes'])
  summary=summary.reset_index()
  summary['name']=summary['index']
  summary=summary[['name','dtypes']]
  summary['missing']=df.isnull().sum().values   # null count per column
  summary['uniques']=df.nunique().values        # distinct values per column
  summary['first_value']=df.iloc[0].values      # first row, selected by position
  summary['second_value']=df.iloc[1].values     # second row, selected by position
  return summary

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
df=pd.read_excel('/content/drive/MyDrive/Credit card default prediction/default of credit card clients.xlsx')


In [8]:
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [9]:
# Summarize the DataFrame with resumetable
result=resumetable(df)
result

Dataset shape: (30000, 25)


Unnamed: 0,name,dtypes,missing,uniques,first_value,second_value
0,ID,int64,0,30000,1,2
1,LIMIT_BAL,int64,0,81,20000,120000
2,SEX,int64,0,2,2,2
3,EDUCATION,int64,0,7,2,2
4,MARRIAGE,int64,0,4,1,2
5,AGE,int64,0,56,24,26
6,PAY_0,int64,0,11,2,-1
7,PAY_2,int64,0,11,2,2
8,PAY_3,int64,0,11,-1,0
9,PAY_4,int64,0,11,-1,0


In [10]:
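# Define the target (dependent variable) and the features (independent variables).
# No preprocessing is applied in this run, so the ID column stays in X.
target_col="default payment next month"
X=df.loc[:, df.columns != target_col]
y=df.loc[:, target_col]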


In [11]:
#split the data into train and test
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)


In [12]:
X_train.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
16831,16832,120000,1,3,1,49,0,-1,-1,-1,-1,-1,119440,3844,2290,780,8190,4600,3844,2299,780,8190,4600,1081
4222,4223,30000,1,1,2,38,2,0,0,0,0,0,69707,71904,62630,57406,46231,73262,4000,5000,8000,1460,40000,10000
8736,8737,90000,2,2,2,39,0,0,0,0,0,0,45709,45045,42151,37842,30849,28061,2000,2000,1200,1018,1200,710
27880,27881,130000,2,3,1,26,0,0,2,2,2,0,121329,128791,127881,133130,127159,131069,11000,2600,9000,0,6000,5000
29290,29291,50000,1,3,2,26,2,0,0,0,0,0,49644,94883,42097,32394,16658,17006,2047,5728,1300,1194,617,650


In [13]:
features=list(X_train.columns)

In [23]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt

In [52]:
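# Define the CatBoost model and the hyperparameter distributions for the random search
model=CatBoostClassifier(task_type='GPU')
parameters={'depth'         :sp_randInt(4,10),     # integers 4..9 (upper bound is exclusive)
            'learning_rate' :sp_randFloat(),       # uniform on [0, 1]
            'iterations'    :sp_randInt(10,200)}   # integers 10..199
# 10 sampled configurations, each scored with 10-fold cross-validation
randm=RandomizedSearchCV(estimator=model,param_distributions=parameters,cv=10,n_iter=10,n_jobs=1)
#task_type='GPU',iterations=100,random_state=2021,eval_metric="F1")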

In [None]:
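# Fit the random search; plot and eval_set are passed through to each CatBoostClassifier.fit call.
# The eval_set used here is the held-out test split.
randm.fit(X_train,y_train,plot=True,eval_set=(X_test,y_test))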

In [54]:
y_pred=randm.predict(X_test)


In [55]:
f1_score(y_test,y_pred)

0.46599045346062046

In [56]:
recall_score(y_test,y_pred)

0.36190917516218724

In [57]:
accuracy_score(y_test,y_pred)

0.8191919191919191
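
The notebook reports F1 and recall but not precision. Because F1 is the harmonic mean of precision and recall, the precision implied by the two printed scores can be recovered by rearranging F1 = 2PR / (P + R). The sketch below does that arithmetic. The helper name `precision_from_f1_and_recall` is hypothetical, and the result is derived from the printed scores rather than measured directly.

```python
def precision_from_f1_and_recall(f1, recall):
    """Solve F1 = 2*P*R / (P + R) for P, giving P = F1*R / (2*R - F1)."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision exists for this F1/recall pair")
    return f1 * recall / denom


# Scores printed by the cells above.
f1 = 0.46599045346062046
recall = 0.36190917516218724

precision = precision_from_f1_and_recall(f1, recall)
print(f"implied precision: {precision:.3f}")  # about 0.654
```

Read together with the accuracy of about 0.82, this suggests the tuned model is fairly reliable when it flags a default but still misses most actual defaulters (recall of about 0.36).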