# **Project Name**    - **Credit Card Default Prediction**



##### **Project Type**  - Classification
##### **Contribution**    - Individual

# **Project Summary -**

**Data preprocessing:**
1. Getting the dataset
2. Importing libraries
3. Importing Dataset
4. Finding Missing Data
5. Encoding categorical data
6. Data cleaning and feature engineering

**Exploratory Data Analysis (EDA)**
1. Checked the distribution of the target variable and the independent variables.
2. Checked the number of distinct values in each categorical feature.
3. Merged the rarest values of the categorical features into an existing category.
4. Replaced the other values with a particular value in some numerical features such as pay status.
5. Dummy-encoded the categorical features.
6. Checked correlations to see whether any independent features are highly correlated.

**Supervised Machine learning algorithms and implementation:**
1. K nearest neighbours
2. Logistic Regression
3. Decision Tree
4. XGBoost classifier
5. Random Forest Classifier



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


A credit card default is a situation in which a card holder fails to make the required minimum payments on their account for a certain period, typically several consecutive months. The card issuer or lending institution then considers the account in default and takes action to recover the outstanding balance. If the balance cannot be recovered, the lender takes a financial loss, and the customer's credit score and credit profile suffer as well. The aim of this project is to build a model that predicts whether a customer will default on their credit card payment, so that banks can understand the characteristics that lead to this outcome.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the questions in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
pd.set_option('display.max_columns', 100)


import warnings
warnings.filterwarnings("ignore")

from datetime import datetime

from imblearn.over_sampling import SMOTE
from collections import Counter

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from xgboost import XGBRFClassifier

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn import metrics
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score





### Dataset Loading

In [None]:
# Load Dataset
Cr = ("/content/default of credit card clients.xls")
Cr_df = pd.read_excel(Cr)

### Dataset First View

In [None]:
# Dataset First Look
Cr_df.head()

In [None]:
# The sheet's top header row uses generic labels, so map them to descriptive names by position
rename_list = ['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE',
               'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
               'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
               'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'default']
col_rename = dict(zip(Cr_df.columns, rename_list))
Cr_df = Cr_df.rename(columns=col_rename)


In [None]:
# Drop the first row, which holds the descriptive column names rather than client data
Cr_df = Cr_df.drop(Cr_df.index[[0]], axis=0)

In [None]:
Cr_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
Cr_df.shape

### Dataset Information

In [None]:
# Dataset Info
Cr_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
Cr_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
Cr_df.isna().sum().sort_values(ascending=False)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12,5))
sns.heatmap(Cr_df.isnull(),cmap='plasma',annot=False, yticklabels=False)
plt.title("Visualising Missing Values")

### What did you know about your dataset?

There are no missing values in the dataset. The raw file loads as 30001 rows and 25 columns; the first row only repeats the descriptive column names, so 30000 client records remain once it is dropped. There are 24 independent variables (including the ID column) and 1 target variable. The file labels its columns with generic X and Y names, which made the data hard to read, so we renamed them. 'default' is the target variable that we have to predict.
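
The rename-then-drop step above works around a sheet with two header rows. As a minimal sketch of the same idea on a toy frame (the helper `promote_first_row_to_header` and the toy values are hypothetical, not part of the notebook), the first data row can be promoted to the header in one go:

```python
import pandas as pd

def promote_first_row_to_header(df: pd.DataFrame) -> pd.DataFrame:
    """Use the first data row as column names and drop it from the data."""
    out = df.iloc[1:].reset_index(drop=True)        # every row except the label row
    out.columns = df.iloc[0].astype(str).tolist()   # the label row becomes the header
    return out

# Toy stand-in for the raw sheet: generic X/Y headers, real names in row 0
raw = pd.DataFrame(
    [["LIMIT_BAL", "AGE", "default"], [20000, 24, 1], [120000, 26, 0]],
    columns=["X1", "X5", "Y"],
)
clean = promote_first_row_to_header(raw)
print(clean.columns.tolist())
print(clean.shape)
```

With the real file, passing `header=1` to `pd.read_excel` should have a similar effect, assuming the descriptive names sit on the second row of the sheet.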

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
Cr_df.columns

In [None]:
# Dataset Describe
Cr_df.describe(include='all')

### Variables Description

The variables and their descriptions are listed below:


**ID**: ID of each client

**LIMIT_BAL**: Amount of credit given to the customer (NT dollar)

**SEX**: Gender(1=male, 2=female)

**EDUCATION**: Qualification of the customers(1=graduate school, 2=university,3= high school, 4=others, 5=unknown, 6=unknown)

**MARRIAGE**: Marital status (1=married,2=single,3=others)

**AGE**: Age in years

**PAY_0**: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ..., 8=payment delay for eight months, 9=payment delay for nine months and above)

**PAY_2**:Repayment status in August, 2005(scale same as above)

**PAY_3**:Repayment status in July, 2005(scale same as above)

**PAY_4**:Repayment status in June, 2005(scale same as above)

**PAY_5**:Repayment status in May, 2005(scale same as above)

**PAY_6**:Repayment status in April, 2005(scale same as above)

**BILL_AMT1**: Amount of bill statement in September, 2005 (NT dollar)

**BILL_AMT2**: Amount of bill statement in August, 2005 (NT dollar)

**BILL_AMT3**: Amount of bill statement in July, 2005 (NT dollar)

**BILL_AMT4**: Amount of bill statement in June, 2005 (NT dollar)

**BILL_AMT5**: Amount of bill statement in May, 2005 (NT dollar)

**BILL_AMT6**: Amount of bill statement in April, 2005 (NT dollar)

**PAY_AMT1**:Amount of previous payment in September,2005(NT dollar)

**PAY_AMT2**:Amount of previous payment in August,2005(NT dollar)

**PAY_AMT3**:Amount of previous payment in July,2005(NT dollar)

**PAY_AMT4**:Amount of previous payment in June,2005(NT dollar)

**PAY_AMT5**:Amount of previous payment in May,2005(NT dollar)

**PAY_AMT6**:Amount of previous payment in April,2005(NT dollar)

**default**: Default payment next month (1=yes, 0=no); this is the target column, named `default` in the renaming step
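
The coded categories described above can be turned into readable labels for tables and plots. The following is a small sketch on made-up rows; the dictionary names are hypothetical, and the mappings simply restate the descriptions in this section:

```python
import pandas as pd

# Code-to-label maps taken from the variable descriptions above
SEX_LABELS = {1: "male", 2: "female"}
EDUCATION_LABELS = {1: "graduate school", 2: "university", 3: "high school",
                    4: "others", 5: "unknown", 6: "unknown"}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "others"}

# Made-up rows, not taken from the dataset
toy = pd.DataFrame({"SEX": [1, 2, 2], "EDUCATION": [2, 1, 4], "MARRIAGE": [1, 2, 3]})

labelled = toy.assign(
    SEX=toy["SEX"].map(SEX_LABELS),
    EDUCATION=toy["EDUCATION"].map(EDUCATION_LABELS),
    MARRIAGE=toy["MARRIAGE"].map(MARRIAGE_LABELS),
)
print(labelled)
```

Any code that is missing from a map becomes `NaN` after `.map`, which is a quick way to spot undocumented values.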


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable
Cr_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Convert every column from object to a numeric dtype; values that cannot be parsed become NaN
Cr_df = Cr_df.apply(pd.to_numeric, errors='coerce')


In [None]:
Cr_df.info()

In [None]:
Cr_df.describe()

In [None]:
# Drop the ID column as it is not useful for the analysis; assign the result so the change is kept
Cr_df = Cr_df.drop(columns=["ID"], errors="ignore")

In [None]:
Cr_df.head()

#### Categorical variables
We have a few categorical features in our dataset. Let's check how they are related to the target class.


**SEX**



*   1 - Male
*   2 - Female

In [None]:
Cr_df['SEX'].value_counts()

**Education**

1=Graduate school;  2=university;  3=high school;  4=others

In [None]:
Cr_df['EDUCATION'].value_counts()

In [None]:
# Merge the undocumented or unknown EDUCATION codes (0, 5, 6) into 4 ("others")
fil = Cr_df['EDUCATION'].isin([0, 5, 6])
Cr_df.loc[fil, 'EDUCATION'] = 4
Cr_df['EDUCATION'].value_counts()

**Marriage**

1 = married;  2 = single;  3 = others

In [None]:
Cr_df['MARRIAGE'].value_counts()

Fixing the 'marital status' by replacing the values of 0  with 3.

In [None]:
fil= (Cr_df['MARRIAGE']==0)
Cr_df.loc[fil,'MARRIAGE']=3
Cr_df['MARRIAGE'].value_counts()

### What all manipulations have you done and insights you found?

First, every column was converted to a numeric type. The data has three categorical variables: SEX, EDUCATION and MARRIAGE. SEX only takes the values 1 and 2, which is fine. EDUCATION and MARRIAGE contain extra codes, so I merged the undocumented or unknown ones into the 'others' category (0, 5 and 6 become 4 for EDUCATION; 0 becomes 3 for MARRIAGE).
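
The project summary lists dummy-encoding of the categorical features as a preprocessing step. A minimal sketch of that step on toy data, assuming `pd.get_dummies` with `drop_first=True` is an acceptable encoding (the notebook itself does not fix this choice):

```python
import pandas as pd

# Made-up rows with the cleaned category codes
toy = pd.DataFrame({
    "EDUCATION": [1, 2, 3, 4],
    "MARRIAGE": [1, 2, 3, 1],
    "AGE": [24, 35, 41, 29],
})

# One indicator column per category; drop_first removes one level per feature
# so the indicators are not perfectly collinear
encoded = pd.get_dummies(toy, columns=["EDUCATION", "MARRIAGE"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Tree-based models usually work with the integer codes as they are, while distance-based and linear models tend to benefit more from the indicator columns.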

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

**Target Variable**

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(6,6))
ax=sns.countplot(x =Cr_df['default'])
for label in ax.containers:
    ax.bar_label(label)
plt.xticks([0,1], labels=["Not Defaulted", "Defaulted"])
plt.show()

##### 1. Why did you pick the specific chart?

**To represent the occurrence of the observation present in the categorical variable.**

##### 2. What is/are the insight(s) found from the chart?

**The data is quite imbalanced: only about 22% of clients default next month.**
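
With roughly one positive in five, a plain random split can shift the class ratio between the training and test sets. The sketch below uses synthetic labels with a similar positive rate (not the real data) to show how the class share can be checked and how a stratified split keeps it; the notebook's own split does not stratify.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with a ~22% positive rate, standing in for the 'default' column
y_toy = pd.Series([1] * 22 + [0] * 78, name="default")
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

# Share of each class in the full sample
print(y_toy.value_counts(normalize=True))

# stratify=y_toy keeps the class ratio (approximately) equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```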

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

**SEX Variable:**

In [None]:
# Chart - 2 visualization code
ax=sns.countplot(x=Cr_df['SEX'])
for label in ax.containers:
    ax.bar_label(label)
plt.xticks([0,1], labels=["Male", "Female"])
plt.title("Sex distribution")
plt.show()

ax=sns.countplot(data=Cr_df, x="SEX", hue="default")
for label in ax.containers:
    ax.bar_label(label)
plt.xticks([0,1], labels=["Male", "Female"])
plt.title("Sex distribution according Default")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

**EDUCATION VARIABLE**

In [None]:
# Chart - 3 visualization code
ax=sns.countplot(x=Cr_df['EDUCATION'])
for label in ax.containers:
    ax.bar_label(label)
plt.xticks([0,1,2,3], labels=["graduate school","university","high school","others"])
plt.title("Education distribution")
plt.show()

ax=sns.countplot(data=Cr_df, x="EDUCATION", hue="default")
for label in ax.containers:
    ax.bar_label(label)
plt.xticks([0,1,2,3], labels=["graduate school","university","high school","others"])
plt.title("Education distribution according Default")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

**Marriage Status Variable**

In [None]:
# Chart - 4 visualization code
ax=sns.countplot(x=Cr_df['MARRIAGE'])
for label in ax.containers:
    ax.bar_label(label)
plt.xticks([0,1,2], labels=["Married","Single","Others"])
plt.title("Marriage Status distribution")
plt.show()

ax=sns.countplot(data=Cr_df, x="MARRIAGE", hue="default")
for label in ax.containers:
    ax.bar_label(label)
plt.xticks([0,1,2], labels=["Married","Single","Others"])
plt.title("Marriage distribution according Default")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

**AGE VARIABLE**

In [None]:
# Chart - 5 visualization code
sns.histplot(data=Cr_df, x="AGE", binwidth=3)
plt.title("Age distribution")
plt.show()

sns.histplot(data=Cr_df, x="AGE", hue="default", binwidth=3)
plt.title("Age distribution according Default")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

**Limit_Bal Variable**

In [None]:
# Chart - 6 visualization code
sns.displot(Cr_df.LIMIT_BAL, kde=True)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

**Amount of bill statement and Amount of previous payment**

In [None]:
# Chart - 7 visualization code
# One scatter plot per month: previous payment (x) against bill amount (y)
fig, axes = plt.subplots(2, 3, figsize=(20, 10))
colors = ['r', 'g', 'b', 'y', 'm', 'orange']
for month, (ax, color) in enumerate(zip(axes.flat, colors), start=1):
    ax.scatter(x=Cr_df[f'PAY_AMT{month}'], y=Cr_df[f'BILL_AMT{month}'], c=color, s=1)
    ax.set_xlabel(f'PAY_AMT{month}')
    ax.set_ylabel(f'BILL_AMT{month}')
fig.suptitle('Amount of Bill Statement vs Amount of Previous Payment in the Last 6 Months', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

**Correlation Analysis**

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(20,20))
sns.heatmap(Cr_df.corr(), annot=True, square=True)

##### 1. Why did you pick the specific chart?

A correlation heatmap shows at a glance whether any features are strongly correlated with each other, which is how we check for multicollinearity.

##### 2. What is/are the insight(s) found from the chart?

From the above correlation plot we can see that BILL_AMT1 to BILL_AMT6 are highly correlated with each other, which makes sense because these columns hold the bill amounts of consecutive months.

Apart from that group we see no other highly correlated inputs, so multicollinearity is mainly limited to the bill-amount columns.
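
Reading a 20 by 20 annotated heatmap by eye is error-prone, so the same check can be done programmatically. This is a sketch on synthetic data; the helper `high_corr_pairs`, the 0.8 threshold and the toy column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "BILL_A": base,
    "BILL_B": base + rng.normal(scale=0.1, size=500),  # nearly a copy of BILL_A
    "OTHER": rng.normal(size=500),                      # independent noise
})
print(high_corr_pairs(toy))
```

Applied to the real frame, `high_corr_pairs(Cr_df.drop(columns=["default"]))` would list the pairs behind the bright cells of the heatmap.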

#### Chart - 9

**MODELLING**

In [None]:
# Chart - 9 visualization code
Cr_df.columns

In [None]:
x= Cr_df.drop(['default'], axis=1)
y= Cr_df['default']
x.head()

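In [None]:
# Split before scaling so the scaler only sees training data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

In [None]:
## Feature scaling: fit on the training set, then apply the same transform to the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)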

In [None]:
## Summarize class distribution before and after oversampling
print("Before oversampling:", Counter(y_train))

# Use a lowercase instance name so the SMOTE class is not overwritten;
# random_state is fixed only to make the run repeatable
smote = SMOTE(random_state=42)
x_train, y_train = smote.fit_resample(x_train, y_train)

print("After oversampling:", Counter(y_train))

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

**Building Model:**

Logistic Regression

Random Forest Classifier

Decision Tree

XGBoost Classifier

**a) Logistic Regression**


In [None]:
# Fit logistic regression on the oversampled training data
logit = LogisticRegression()
logit.fit(x_train, y_train)

pred_logit = logit.predict(x_test)

print("Logistic accuracy:", accuracy_score(y_test, pred_logit))
print(classification_report(y_test, pred_logit))

print('confusion matrix of logistic regression')
ConfusionMatrixDisplay.from_estimator(logit, x_test, y_test, cmap="BuPu_r")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

**b) Decision Tree Classifier:**

In [None]:
# Chart - 11 visualization code
Tree = DecisionTreeClassifier(criterion= 'gini', max_depth=7,
                              max_features=9, min_samples_leaf=2, random_state=0)

Tree.fit(x_train, y_train)
pred_tree= Tree.predict(x_test)
print('Decision Tree Accuracy:', accuracy_score(y_test,pred_tree))

print(classification_report(y_test, pred_tree))

print('confusion matrix of decision tree')
ConfusionMatrixDisplay.from_estimator(Tree,x_test, y_test, cmap="BuPu_r")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

**c) Random Forest:**

In [None]:
# Chart - 12 visualization code
RF= RandomForestClassifier()

RF.fit(x_train, y_train)

pred_RF= RF.predict(x_test)
print("Random Forest Accuracy is:", accuracy_score(y_test, pred_RF))

print(classification_report(y_test, pred_RF))

ConfusionMatrixDisplay.from_estimator(RF,x_test, y_test, cmap="BuPu_r")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

**d) XGBoost:**

In [None]:
# Chart - 13 visualization code
xgboost= xgb.XGBClassifier()

xgboost.fit(x_train, y_train)

pred_xgboost= xgboost.predict(x_test)

print("XGBoost Accuracy:", accuracy_score(y_test, pred_xgboost))
print(classification_report(y_test, pred_xgboost))
ConfusionMatrixDisplay.from_estimator(xgboost,x_test, y_test, cmap="BuPu_r")
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14

**Hyperparameter tuning:**

In [None]:
# Hyper Parameter Optimization

params={
    "learning_rate"   :[0.05,0.10,0.15,0.20,0.25,0.30],
    "max_depth"       :[3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight":[1, 3, 5, 7],
    "gamma"           :[0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree":[0.3, 0.4, 0.5, 0.7]
    }

In [None]:
random_search= RandomizedSearchCV(xgboost,param_distributions=params,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)

random_search.fit(x_train,y_train)

In [None]:
#best estimators:
random_search.best_estimator_

In [None]:
#best param
random_search.best_params_

In [None]:
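# Refit with fixed hyperparameters (compare them with random_search.best_params_ above).
# Note: XGBRFClassifier is XGBoost's random-forest variant, not the XGBClassifier used in the search.
classifier = XGBRFClassifier(objective='binary:logistic',
                             min_child_weight=3,
                             max_depth=10,
                             learning_rate=0.25,
                             gamma=0.1,
                             colsample_bynode=1,
                             colsample_bytree=0.4)
# Fit the model
classifier.fit(x_train, y_train)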

In [None]:
#predicting model
hyper_pred=classifier.predict(x_test)

print("The accuracy of the model is:", accuracy_score(y_test, hyper_pred))
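
The tuning step above searches over one estimator and then refits a separately configured one. As a hedged sketch of an alternative, the search object's own `best_estimator_` can be evaluated directly. The example uses a `DecisionTreeClassifier` and synthetic data as stand-ins, so the parameter grid and scores are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the credit card dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={"max_depth": [2, 3, 5, 8], "min_samples_leaf": [1, 2, 5]},
    n_iter=5, scoring="roc_auc", cv=3, random_state=0,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr
tuned = search.best_estimator_
acc = accuracy_score(y_te, tuned.predict(X_te))
print(search.best_params_)
print("accuracy:", acc)
```

Using `best_estimator_` keeps the evaluated model tied to the parameters the search actually selected, instead of values copied by hand.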

**Compare Model Performance:**

In [None]:
# ROC curves need scores rather than hard labels, so use the predicted probability of class 1
models = {"logistic": logit, "decision_tree": Tree, "random_forest": RF, "XGboost": xgboost}

plt.plot([0, 1], [0, 1], 'k--')
for name, model in models.items():
    proba = model.predict_proba(x_test)[:, 1]
    fpr, tpr, _ = metrics.roc_curve(y_test, proba)
    auc = metrics.roc_auc_score(y_test, proba)
    plt.plot(fpr, tpr, label=f"{name}, auc={auc:.2f}")

plt.legend(loc=4, title='Models', facecolor='white')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC', size=16)
plt.box(False)
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Model Accuracy Comparison

In [None]:
# Accuracy (%) of each model, entered by hand
data = {'logistic': 68.65,
        'decision_tree': 75.36,
        'Random_forest': 79.73,
        'xgboost': 81.4,
        'xgboost_hyper': 78.85}
model_names = list(data.keys())
values = list(data.values())

In [None]:
plt.figure(figsize=(14, 8))
plt.title('Comparing Accuracy of ML Models', fontsize=20)
colors = ['red', 'orange', 'blue', 'green', 'magenta']
plt.bar(model_names, values, color=colors, alpha=0.5, width=0.4)
plt.ylabel('Accuracy (%)')
plt.xticks(rotation=45)
plt.show()
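
The accuracy values in the dictionary above are typed in by hand, so they can drift from what the models produce on a new run. A small sketch of building the same kind of dictionary from predictions, using made-up labels in place of `y_test` and the model outputs:

```python
from sklearn.metrics import accuracy_score

# Made-up labels and predictions standing in for y_test and each model's output
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
predictions = {
    "logistic": [0, 1, 1, 0, 0, 1, 0, 1],
    "decision_tree": [0, 0, 1, 0, 0, 1, 0, 0],
}

# Accuracy in percent, rounded to two decimals like the hand-entered values
accuracy_pct = {
    name: round(100 * accuracy_score(y_true, y_pred), 2)
    for name, y_pred in predictions.items()
}
print(accuracy_pct)
```

In the notebook, the `predictions` dictionary could hold `pred_logit`, `pred_tree`, `pred_RF`, `pred_xgboost` and `hyper_pred`, with `y_test` as the reference labels.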


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***