# **Credit Card Default Prediction**



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** Upendra Pratap Singh


# **Problem Statement**


This project aims to predict whether credit card customers in Taiwan will default on their payments. From a risk-management perspective, an accurate estimate of the probability of default is more valuable than a binary classification of clients as credible or not credible. The K-S (Kolmogorov-Smirnov) chart can be used to evaluate how well the predicted probabilities separate customers who default from those who do not.
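
The K-S idea mentioned above can be sketched in a few lines. This is a minimal illustration, assuming binary labels (1 = default) and one predicted score per customer; the function name `ks_statistic` and the toy values are hypothetical and are not taken from the dataset.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Largest gap between the cumulative score distributions of the two classes.

    Sketch only: assumes binary labels (1 = default) and no tied scores.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score)                 # highest predicted risk first
    y_sorted = y_true[order]
    # Share of defaulters / non-defaulters captured up to each rank
    cum_pos = np.cumsum(y_sorted == 1) / max((y_true == 1).sum(), 1)
    cum_neg = np.cumsum(y_sorted == 0) / max((y_true == 0).sum(), 1)
    return float(np.max(np.abs(cum_pos - cum_neg)))

# Toy labels and scores, made up for illustration
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
ks = ks_statistic(toy_labels, toy_scores)
print(f"K-S statistic: {ks:.3f}")
```

A value close to 1 means the scores separate the two classes almost perfectly, while a value close to 0 means the score distributions of the two classes overlap.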

## DATA DESCRIPTION

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:



*   LIMIT_BAL - Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
*   SEX- Gender (1 = male; 2 = female).

*   EDUCATION- (1 = graduate school; 2 = university; 3 = high school; 4 = others).
*   MARRIAGE- Marital status (1 = married; 2 = single; 3 = others).

*   AGE- Age (year).
*   PAY_0 to PAY_6 - History of past payment. The past monthly payment records (April to September 2005) are tracked as follows: PAY_0 = the repayment status in September 2005; PAY_2 = the repayment status in August 2005; . . .; PAY_6 = the repayment status in April 2005 (the raw data has no PAY_1 column). The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

*   BILL_AMT1 to BILL_AMT6 - Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September 2005; BILL_AMT2 = amount of bill statement in August 2005; . . .; BILL_AMT6 = amount of bill statement in April 2005.
*   PAY_AMT1 to PAY_AMT6 - Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September 2005; PAY_AMT2 = amount paid in August 2005; . . .; PAY_AMT6 = amount paid in April 2005.

*   default payment next month - Default payment (Yes = 1, No = 0).

















# **GitHub Link -**

https://github.com/UPENDRA555/Credit_Card_Default_Prediction/tree/main





# ***Let's Begin !***

## ***Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score, confusion_matrix, roc_curve, auc, classification_report
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/default of credit card clients.xls', header= 1)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('The row & column count')
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_value=len(df[df.duplicated()])
print('The number of duplicate values in this data:', duplicate_value)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
def show_missing():
    missing = df.columns[df.isnull().any()].tolist()
    return missing

# Missing data counts and percentage
print('Missing Data Count')
print(df[show_missing()].isnull().sum().sort_values(ascending = False))
print('--'*50)
print('Missing Data Percentage')
print(round(df[show_missing()].isnull().sum().sort_values(ascending = False)/len(df)*100,2))

In [None]:
df.isna().sum()

No Missing value is present in the dataset

In [None]:
# Install missingno to visualize the missing values
!pip install missingno

In [None]:
# Plot the missing-value matrix
import missingno as msno
msno.matrix(df)

In [None]:
# Plot a bar graph of missing value
msno.bar(df)

## ***Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe().transpose()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

##  ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Work on a copy of the data and drop the ID column, which is only a row identifier
data = df.copy()
data.drop(columns= 'ID', inplace= True)
data.head()

In [None]:
data.shape

In [None]:
data.info()

### What all manipulations have you done and insights you found?

After dropping ID, all 24 columns have a non-null count of 30000, which indicates that there are no missing values.
The repayment status is stored in columns PAY_0, PAY_2, ... with no PAY_1 column, so we rename PAY_0 to PAY_1 and 'default payment next month' to 'target_default' for ease of understanding.

In [None]:
data.rename(columns={'PAY_0':'PAY_1'}, inplace=True)
data.rename(columns={'default payment next month':'target_default'}, inplace=True)

In [None]:
data.columns

### Remove Duplicate Value

In [None]:
# Remove Duplicate values in the dataset
data = data.drop_duplicates(subset=[col for col in data.columns if col != 'target_default'])
data.shape

In [None]:
data.info()

Next, we check the datatype of each variable. All the columns are of type int64, whereas we know from the data description that SEX, EDUCATION, MARRIAGE, PAY_1, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6 and target_default are categorical features, so we convert these features to the category dtype.

In [None]:
# Change the datatype of the categorical features from integer to category
categorical_cols = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_1', 'PAY_2', 'PAY_3',
                    'PAY_4', 'PAY_5', 'PAY_6', 'target_default']
for col in categorical_cols:
    data[col] = data[col].astype('category')

In [None]:
data.dtypes

#### Categorical Features

SEX


*   1- Male
*   2- Female



In [None]:
data['SEX'].value_counts()

Education

1 = graduate school; 2 = university; 3 = high school; 4 = others

In [None]:
data['EDUCATION'].value_counts()

The dataset also contains the values 0, 5 and 6, which have no description, so we merge them into 4 (others).

In [None]:
data["EDUCATION"] = data["EDUCATION"].replace({0:4,5:4,6:4})
data["EDUCATION"].value_counts()

Marriage

1 = married; 2 = single; 3 = others

In [None]:
data['MARRIAGE'].value_counts()

A few rows have the value 0, which is not defined, so we add them to the others category (3).

In [None]:
data["MARRIAGE"] = data["MARRIAGE"].replace({0:3})
data["MARRIAGE"].value_counts()

In [None]:
categorical_features= data[['SEX', 'EDUCATION', 'MARRIAGE']]
categorical_features.head()

***Renaming Columns***

In [None]:
#renaming columns
data.rename(columns={'PAY_1':'PAY_SEPT','PAY_2':'PAY_AUG','PAY_3':'PAY_JUL','PAY_4':'PAY_JUN','PAY_5':'PAY_MAY','PAY_6':'PAY_APR'},inplace=True)
data.rename(columns={'BILL_AMT1':'BILL_AMT_SEPT','BILL_AMT2':'BILL_AMT_AUG','BILL_AMT3':'BILL_AMT_JUL','BILL_AMT4':'BILL_AMT_JUN','BILL_AMT5':'BILL_AMT_MAY','BILL_AMT6':'BILL_AMT_APR'}, inplace = True)
data.rename(columns={'PAY_AMT1':'PAY_AMT_SEPT','PAY_AMT2':'PAY_AMT_AUG','PAY_AMT3':'PAY_AMT_JUL','PAY_AMT4':'PAY_AMT_JUN','PAY_AMT5':'PAY_AMT_MAY','PAY_AMT6':'PAY_AMT_APR'},inplace=True)

History payment status

In [None]:
pay_col = ['PAY_SEPT',	'PAY_AUG',	'PAY_JUL',	'PAY_JUN',	'PAY_MAY',	'PAY_APR']

Paid Amount

In [None]:
pay_amnt_df = data[['PAY_AMT_SEPT',	'PAY_AMT_AUG',	'PAY_AMT_JUL',	'PAY_AMT_JUN',	'PAY_AMT_MAY',	'PAY_AMT_APR']]

Total Bill Amount

In [None]:
bill_amnt_df = data[['BILL_AMT_SEPT', 'BILL_AMT_AUG', 'BILL_AMT_JUL', 'BILL_AMT_JUN', 'BILL_AMT_MAY', 'BILL_AMT_APR']]

In [None]:
numerical_features= data.select_dtypes(include=['int64', 'float64'])
numerical_features.head()

## ***EDA(Exploratory Data Analysis)***

#### Dependent Variable Analysis

In [None]:
# Dependent variable analysis of the dataset
# Count the values of the dependent variable target_default
data['target_default'].value_counts()

In [None]:
# Plot a pie chart and bar chart of dependent feature

plt.subplot(2, 1, 1)
data["target_default"].value_counts().plot.pie( figsize= (40, 40), fontsize=10, autopct= "%1.2f%%")

plt.subplot(2, 1, 2)
data["target_default"].value_counts().plot(kind= 'bar', figsize= (10, 10), fontsize=20)
plt.xlabel('default type')
plt.ylabel('default Count')
plt.suptitle("default Percentage by Customer", fontsize=25)
plt.show()



*   The graphs above show that the two classes are not in proportion, so the dataset is imbalanced.
*   6622 card holders (22.11%) default, while 23322 (77.89%) do not.
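
The percentages quoted above follow directly from the two class counts. A minimal sketch of the arithmetic, using only the counts reported in this section (the variable names are illustrative):

```python
# Class counts reported above for target_default
counts = {"not default (0)": 23322, "default (1)": 6622}

total = sum(counts.values())
# Share of each class as a percentage of all card holders
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}
# How many non-defaulters there are for every defaulter
imbalance_ratio = counts["not default (0)"] / counts["default (1)"]

print(total)
print(shares)
print(f"majority : minority = {imbalance_ratio:.2f} : 1")
```

With roughly 3.5 non-defaulters for every defaulter, a model that always predicts "not default" would already reach about 78% accuracy, which is why accuracy alone is a weak yardstick on the raw data.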



#### For Categorical data Univariate analysis

In [None]:
# Show the unique values of each categorical variable
for colm in categorical_features:
  print('----------------------------------------------------------------------------------------------------------------------------------')
  print(colm)
  print(data[colm].unique())

In [None]:
# Count the occurrences of each unique value
for colm in categorical_features:
  print('---------------------------------------------------------------------------------')
  print(data[colm].value_counts())


In [None]:
# Plot a bar graph of a categorical feature
for colm in categorical_features:
  data[colm].value_counts().plot(kind= 'bar', figsize= (10, 10), fontsize=10, width= 0.3)
  plt.xlabel(colm)
  plt.ylabel('Count')
  plt.show()

Below are a few observations for the categorical features:

*   There are more female card holders than male card holders.
*   University graduates hold more cards than any other education group.

*   Single card holders outnumber married card holders and others.









### For Categorical data bivariate analysis with target feature target_default

In [None]:
# Cross-tabulate each categorical feature against the target
# (pass the column itself, not just its name, as the crosstab index)
for colm in categorical_features:
  print('------------------------------------------------------------')
  print(pd.crosstab(data[colm], data['target_default']))

In [None]:
# Plot a bar graph between a categorical feature and categorical feature
for colm in categorical_features:
  fig, ax = plt.subplots(figsize=(15,10))

  ax = sns.countplot(x=data[colm], hue='target_default', data=data, width= 0.5)
  ax.set_ylabel('COUNTS', rotation=0, labelpad=100,size=10)
  ax.set_xlabel(colm)
  ax.yaxis.set_label_coords(0.03, 0.75)
  ax.tick_params(labelsize=10)

Observations for categorical features:


*   There are more female credit card holders, so females also make up a larger share of the defaulters.
*   Card holders with higher education (graduate school and university) account for a higher proportion of the defaulters.

*   Singles account for a higher proportion of the defaulters.





In [None]:
# Plot a bar graph between a categorical feature and target feature
for colm in pay_col:
  fig, ax = plt.subplots(figsize=(15,10))

  ax = sns.countplot(x=data[colm], hue='target_default', data=data, width= 0.5)
  ax.set_ylabel('COUNTS', rotation=0, labelpad=100,size=10)
  ax.set_xlabel(colm)
  ax.yaxis.set_label_coords(0.03, 0.75)
  ax.tick_params(labelsize=10)

### For numerical data  analysis

In [None]:
for colm in numerical_features:
  ax = data.groupby('target_default')[colm].mean()
  print('--------------------------------------------------------------------')
  print(pd.DataFrame(ax))

In [None]:
# Plot a distplot of a 'LIMIT_BAL' features
sns.displot(x=data['LIMIT_BAL'], kde=True)

In [None]:
# Box plot of 'LIMIT_BAL' grouped by the target
data.boxplot(column='LIMIT_BAL', by='target_default')
plt.show()

In [None]:
# Plot the mean 'LIMIT_BAL' for each target class
fig, ax = plt.subplots(figsize=(15,10))

ax = sns.barplot(y='LIMIT_BAL', x='target_default', data=data, width= 0.5)
ax.set_ylabel('Mean LIMIT_BAL', rotation=0, labelpad=100,size=10)
ax.set_xlabel('target_default')
ax.yaxis.set_label_coords(0.03, 0.75)
ax.tick_params(labelsize=10)

In [None]:
# Count a number of people in different age
data['AGE'].value_counts()

In [None]:
# Plot a barplot of different age
plt.figure(figsize=(8,6))
data['AGE'].value_counts().plot(kind= 'bar', figsize= (10, 10), fontsize=10, width= 0.6)
plt.xlabel('AGE')
plt.ylabel('Count')
plt.show()

In [None]:
pd.crosstab(data['AGE'], data['target_default'])

In [None]:
# Plot a countplot of different age and default value
fig, ax = plt.subplots(figsize=(15,10))

ax = sns.countplot(x=data['AGE'], hue='target_default', data=data, width= 0.5)
ax.set_ylabel('COUNTS', rotation=0, labelpad=100,size=10)
ax.set_xlabel('AGE')
ax.yaxis.set_label_coords(0.03, 0.75)
ax.tick_params(labelsize=10)

In [None]:
# Box plot of 'AGE' grouped by the target
data.boxplot(column='AGE', by='target_default')
plt.show()

#### Correlation Heatmap of the dataset

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(25,20))
sns.heatmap(data.corr(),annot=True,cmap="coolwarm")
plt.show()

#### Pair Plot of dataset

In [None]:
# Plot  pair plot of Total Paid Amount
sns.pairplot(data = pay_amnt_df)
plt.show()

In [None]:
# Plot  pair plot of Total Bill Amount
sns.pairplot(data = bill_amnt_df)
plt.show()

### Resampling The Datasets

In [None]:
# Resampling the dataset
data['target_default'].value_counts()


The dataset is highly imbalanced, so resampling is required. We use SMOTE (Synthetic Minority Oversampling Technique) to oversample the minority class.
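
To make the idea behind SMOTE concrete, here is a simplified sketch of its core step: a synthetic point is placed somewhere on the line between a minority sample and one of its nearest minority neighbours. This is an illustration only, not the `imblearn` implementation; the function `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=3, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour.

    Simplified sketch: assumes numeric features and no duplicate points.
    """
    rng = np.random.default_rng() if rng is None else rng
    minority = np.asarray(minority, dtype=float)
    i = rng.integers(len(minority))                    # pick a minority sample
    distances = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]        # skip position 0, the sample itself
    j = rng.choice(neighbours)                         # pick one of its k nearest neighbours
    gap = rng.random()                                 # uniform in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Toy minority-class points, made up for illustration
minority_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
synthetic = smote_like_sample(minority_points, k=2, rng=np.random.default_rng(0))
print(synthetic)
```

Because every synthetic point is an interpolation, it always lies between two existing minority samples. For integer-coded categorical columns this can produce in-between values, which is worth keeping in mind when such columns are resampled.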



In [None]:
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')

smote = SMOTE()

# fit predictor and target variable
x_smote, y_smote = smote.fit_resample(data.iloc[:,0:-1], data['target_default'])

print('shape of Dataset Before Resampling', len(data))
print('shape of Dataset After Resampling', len(y_smote))

In [None]:
print('Shape of X {}'.format(x_smote.shape))
print('Shape of y {}'.format(y_smote.shape))

In [None]:
# Combine the resampled features and target into one balanced DataFrame
balance_data = pd.concat([x_smote, y_smote], axis=1)
print('Balanced dataset shape {}'.format(balance_data.shape))

In [None]:
# Plot a bar chart of the balanced dataset
balance_data["target_default"].value_counts().plot(kind= 'bar', figsize= (10, 10), fontsize=20)
plt.show()

In [None]:
# Show the rows of the card holders who default
balance_data[balance_data['target_default']==1]

### One Hot Encoding

In [None]:
balance_data.replace({'SEX': {1 : 'MALE', 2 : 'FEMALE'}, 'EDUCATION' : {1 : 'graduate school', 2 : 'university', 3 : 'high school', 4 : 'others'}, 'MARRIAGE' : {1 : 'married', 2 : 'single', 3 : 'others'}}, inplace = True)

In [None]:
balance_data.head()


In [None]:
# Encode your categorical columns
balance_data= pd.get_dummies(balance_data, columns= ['EDUCATION', 'MARRIAGE'])
balance_data.head()


In [None]:
balance_data.drop(['EDUCATION_others','MARRIAGE_others'],axis = 1, inplace = True)

In [None]:
balance_data = pd.get_dummies(balance_data, columns = ['PAY_SEPT',	'PAY_AUG',	'PAY_JUL',	'PAY_JUN',	'PAY_MAY',	'PAY_APR'], drop_first = True )

In [None]:
balance_data.shape

In [None]:
# LABEL ENCODING FOR SEX
encoders_nums = {
                 "SEX":{"FEMALE": 0, "MALE": 1}
}
balance_data = balance_data.replace(encoders_nums)

In [None]:
balance_data.head()

### Data Splitting

In [None]:
balance_data.info()

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
data_X = balance_data.drop(['target_default'], axis=1)
data_y = balance_data['target_default']


X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.2, random_state=10)

We use 20% of the dataset as test data and the remaining 80% as training data.
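
The 80/20 split described above, followed by scaling, can be sketched on toy data as follows. The arrays are made up, the `toy_` names are hypothetical, and `stratify` is an assumption added here to keep the class proportions equal in both splits; the notebook's own split does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
toy_X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))   # toy features
toy_y = np.array([0] * 50 + [1] * 50)                      # toy balanced labels

# Hold out 20% of the rows for testing (stratify is an assumption, see above)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=10, stratify=toy_y
)

# Learn the scaling parameters from the training rows only,
# then apply the same transformation to both splits
scaler = StandardScaler().fit(toy_X_train)
toy_X_train_scaled = scaler.transform(toy_X_train)
toy_X_test_scaled = scaler.transform(toy_X_test)

print(toy_X_train_scaled.shape, toy_X_test_scaled.shape)
```

Fitting the scaler on the training split only keeps information about the test rows out of the preprocessing step.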

In [None]:
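# Scale the features: fit on the training split only, then apply to both splits
# (scaling data_X after the split would leave X_train and X_test unscaled)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)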

## ***ML Model Implementation***

### LogisticRegression

In [None]:
# Fit a Logistic Regression model on the training data
log = LogisticRegression()
log.fit(X_train, y_train)

y_train_pred = log.predict(X_train)
y_test_pred = log.predict(X_test)

y_train_proba = log.predict_proba(X_train)
y_test_proba = log.predict_proba(X_test)

# scikit-learn metrics take (y_true, y_pred), in that order
print('Classification Report for train data:\n', classification_report(y_train, y_train_pred, target_names=['No defaulter', 'Defaulter']))
labels = ['Not Defaulter', 'Defaulter']
cm_train = confusion_matrix(y_train, y_train_pred)
print('Confusion matrix for train data:\n', cm_train )

print('--------------------------------------------------------------------')

print('Classification Report for test data:\n', classification_report(y_test, y_test_pred, target_names=['No defaulter', 'Defaulter']))
cm_test= confusion_matrix(y_test, y_test_pred)
print('Confusion matrix for test data:\n', cm_test )

print('-----------------------------------------------------------------------')
# Get the accuracy scores
train_accuracy_log = accuracy_score(y_train, y_train_pred)
test_accuracy_log = accuracy_score(y_test, y_test_pred)

print('\nAccuracy Score for train data: ', train_accuracy_log)
print('\nAccuracy Score for test data: ', test_accuracy_log)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print('Heat map of confusion matrix for train data')
ax= plt.subplot()
sns.heatmap(cm_train, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

print('------------------------------------------------------------------------------------------------------------------')

print('Heat map of confusion matrix for test data')
ax= plt.subplot()
sns.heatmap(cm_test, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()
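
The numbers in a 2x2 confusion matrix are enough to recompute the headline metrics by hand. The sketch below uses a hypothetical matrix (not a result from this notebook) laid out in scikit-learn's convention, with rows as true classes and columns as predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([[80, 20],    # true "Not Defaulter": 80 correct, 20 flagged as defaulter
               [10, 90]])   # true "Defaulter":     10 missed,  90 caught

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # share of all predictions that are correct
precision = tp / (tp + fp)             # of the predicted defaulters, how many really default
recall = tp / (tp + fn)                # of the real defaulters, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For default prediction, recall on the defaulter class is often the figure to watch, since a missed defaulter is usually more costly than a false alarm.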

In [None]:
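# ROC curve on the training data, computed from the predicted probability of default
score = roc_auc_score(y_train, y_train_proba[:, 1])
print(f"ROC AUC: {score:.4f}")
fpr, tpr, _ = roc_curve(y_train, y_train_proba[:, 1])
plt.plot(fpr, tpr, label=f"train, auc={score:.4f}")
plt.legend(loc=4)
plt.show()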

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Use a GridSearch CV for the hyperparameter tuning
param_grid = {'penalty':['l1','l2'], 'C' :np.logspace(-4, 4, 50) }

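# The default lbfgs solver does not support the l1 penalty; liblinear supports both l1 and l2
grid_log_clf = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid= param_grid, scoring = 'accuracy', n_jobs = -1, verbose = 3, cv = 5)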
grid_log_clf.fit(X_train, y_train)
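
To get a feel for the cost of the search above, the number of candidate settings and model fits can be counted before running it. This is a small sketch using scikit-learn's `ParameterGrid` on the same grid; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above: 2 penalties x 50 values of C
param_grid = {'penalty': ['l1', 'l2'], 'C': np.logspace(-4, 4, 50)}

n_candidates = len(ParameterGrid(param_grid))   # every combination of the listed values
n_folds = 5                                     # cv=5 in the grid search
n_fits = n_candidates * n_folds                 # one fit per candidate per fold

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

With the default `refit=True`, the grid search also refits the best candidate once more on the whole training set, so the total is one fit higher than the count printed here.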

In [None]:
grid_log_clf.best_params_

In [None]:
grid_log_clf.best_score_

In [None]:
optaind_clf = grid_log_clf.best_estimator_
print(optaind_clf)

In [None]:
y_train_pred = optaind_clf.predict(X_train)
y_test_pred = optaind_clf.predict(X_test)

y_train_proba = optaind_clf.predict_proba(X_train)
y_test_proba = optaind_clf.predict_proba(X_test)

# scikit-learn metrics take (y_true, y_pred), in that order
print('Classification Report for train data:\n', classification_report(y_train, y_train_pred, target_names=['No defaulter', 'Defaulter']))
labels = ['Not Defaulter', 'Defaulter']
cm_train = confusion_matrix(y_train, y_train_pred)
print('Confusion matrix for train data:\n', cm_train )

print('--------------------------------------------------------------------')

print('Classification Report for test data:\n', classification_report(y_test, y_test_pred, target_names=['No defaulter', 'Defaulter']))
cm_test= confusion_matrix(y_test, y_test_pred)
print('Confusion matrix for test data:\n', cm_test )

print('-----------------------------------------------------------------------')
# Get the accuracy scores
train_accuracy_log_grid = accuracy_score(y_train, y_train_pred)
test_accuracy_log_grid = accuracy_score(y_test, y_test_pred)

print('\nAccuracy Score for train data: ', train_accuracy_log_grid)
print('\nAccuracy Score for test data: ', test_accuracy_log_grid)

In [None]:
# Visualizing evaluation Metric Score chart
print('Heat map of confusion matrix for train data')
ax= plt.subplot()
sns.heatmap(cm_train, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

print('------------------------------------------------------------------------------------------------------------------')

print('Heat map of confusion matrix for test data')
ax= plt.subplot()
sns.heatmap(cm_test, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

In [None]:
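# ROC curve on the training data, computed from the predicted probability of default
score = roc_auc_score(y_train, y_train_proba[:, 1])
print(f"ROC AUC: {score:.4f}")
fpr, tpr, _ = roc_curve(y_train, y_train_proba[:, 1])
plt.plot(fpr, tpr, label=f"train, auc={score:.4f}")
plt.legend(loc=4)
plt.show()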

### Decision Tree

In [None]:
# Fit a Decision Tree model on the training data
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

y_train_proba = tree.predict_proba(X_train)
y_test_proba = tree.predict_proba(X_test)

# scikit-learn metrics take (y_true, y_pred), in that order
print('Classification Report for train data:\n', classification_report(y_train, y_train_pred, target_names=['No defaulter', 'Defaulter']))
labels = ['Not Defaulter', 'Defaulter']
cm_train = confusion_matrix(y_train, y_train_pred)
print('Confusion matrix for train data:\n', cm_train )

print('--------------------------------------------------------------------')

print('Classification Report for test data:\n', classification_report(y_test, y_test_pred, target_names=['No defaulter', 'Defaulter']))
cm_test= confusion_matrix(y_test, y_test_pred)
print('Confusion matrix for test data:\n', cm_test )

print('-----------------------------------------------------------------------')
# Get the accuracy scores
train_accuracy_tree = accuracy_score(y_train, y_train_pred)
test_accuracy_tree = accuracy_score(y_test, y_test_pred)

print('\nAccuracy Score for train data: ', train_accuracy_tree)
print('\nAccuracy Score for test data: ', test_accuracy_tree)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print('Heat map of confusion matrix for train data')
ax= plt.subplot()
sns.heatmap(cm_train, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

print('------------------------------------------------------------------------------------------------------------------')

print('Heat map of confusion matrix for test data')
ax= plt.subplot()
sns.heatmap(cm_test, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

In [None]:
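# ROC curve on the training data, computed from the predicted probability of default
score = roc_auc_score(y_train, y_train_proba[:, 1])
print(f"ROC AUC: {score:.4f}")
fpr, tpr, _ = roc_curve(y_train, y_train_proba[:, 1])
plt.plot(fpr, tpr, label=f"train, auc={score:.4f}")
plt.legend(loc=4)
plt.show()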

#### 2. Cross- Validation & Hyperparameter Tuning for Decision Tree

In [None]:
# Use a GridSearch CV for the hyperparameter tuning
# 'auto' was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn releases
param_grid = {'max_features': ['sqrt', 'log2'],
              'ccp_alpha': [0.1, .01, .001],
              'max_depth' : [5, 6, 7, 8, 9],
              'criterion' :['gini', 'entropy'],
              'min_samples_split':[0.1,0.2,0.4]
             }

grid_tree_clf = GridSearchCV(DecisionTreeClassifier(), param_grid=param_grid, scoring = 'accuracy', n_jobs = -1, verbose = 3, cv = 5)
grid_tree_clf.fit(X_train, y_train)

In [None]:
grid_tree_clf.best_params_

In [None]:
grid_tree_clf.best_score_

In [None]:
optaind_clf = grid_tree_clf.best_estimator_
print(optaind_clf)

In [None]:
y_train_pred = optaind_clf.predict(X_train)
y_test_pred = optaind_clf.predict(X_test)

y_train_proba = optaind_clf.predict_proba(X_train)
y_test_proba = optaind_clf.predict_proba(X_test)

# scikit-learn metrics take (y_true, y_pred), in that order
print('Classification Report for train data:\n', classification_report(y_train, y_train_pred, target_names=['No defaulter', 'Defaulter']))
labels = ['Not Defaulter', 'Defaulter']
cm_train = confusion_matrix(y_train, y_train_pred)
print('Confusion matrix for train data:\n', cm_train )

print('--------------------------------------------------------------------')

print('Classification Report for test data:\n', classification_report(y_test, y_test_pred, target_names=['No defaulter', 'Defaulter']))
cm_test= confusion_matrix(y_test, y_test_pred)
print('Confusion matrix for test data:\n', cm_test )

print('-----------------------------------------------------------------------')
# Get the accuracy scores
train_accuracy_tree_grid = accuracy_score(y_train, y_train_pred)
test_accuracy_tree_grid = accuracy_score(y_test, y_test_pred)

print('\nAccuracy Score for train data: ', train_accuracy_tree_grid)
print('\nAccuracy Score for test data: ', test_accuracy_tree_grid)

In [None]:
# Visualizing evaluation Metric Score chart
print('Heat map of confusion matrix for train data')
ax= plt.subplot()
sns.heatmap(cm_train, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

print('------------------------------------------------------------------------------------------------------------------')

print('Heat map of confusion matrix for test data')
ax= plt.subplot()
sns.heatmap(cm_test, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

In [None]:
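# ROC curve on the training data, computed from the predicted probability of default
score = roc_auc_score(y_train, y_train_proba[:, 1])
print(f"ROC AUC: {score:.4f}")
fpr, tpr, _ = roc_curve(y_train, y_train_proba[:, 1])
plt.plot(fpr, tpr, label=f"train, auc={score:.4f}")
plt.legend(loc=4)
plt.show()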

### RandomForest

In [None]:
# Fit a Random Forest model on the training data
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)

y_train_pred = rf_clf.predict(X_train)
y_test_pred = rf_clf.predict(X_test)

y_train_proba = rf_clf.predict_proba(X_train)
y_test_proba = rf_clf.predict_proba(X_test)

# scikit-learn metrics take (y_true, y_pred), in that order
print('Classification Report for train data:\n', classification_report(y_train, y_train_pred, target_names=['No defaulter', 'Defaulter']))
labels = ['Not Defaulter', 'Defaulter']
cm_train = confusion_matrix(y_train, y_train_pred)
print('Confusion matrix for train data:\n', cm_train )

print('--------------------------------------------------------------------')

print('Classification Report for test data:\n', classification_report(y_test, y_test_pred, target_names=['No defaulter', 'Defaulter']))
cm_test= confusion_matrix(y_test, y_test_pred)
print('Confusion matrix for test data:\n', cm_test )

print('-----------------------------------------------------------------------')
train_accuracy_rf_clf = accuracy_score(y_train, y_train_pred)
test_accuracy_rf_clf = accuracy_score(y_test, y_test_pred)

print('\nAccuracy Score for train data: ', train_accuracy_rf_clf)
print('\nAccuracy Score for test data: ', test_accuracy_rf_clf)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print('Heat map of confusion matrix for train data')
ax= plt.subplot()
sns.heatmap(cm_train, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

print('------------------------------------------------------------------------------------------------------------------')

print('Heat map of confusion matrix for test data')
ax= plt.subplot()
sns.heatmap(cm_test, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

In [None]:
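# ROC curve on the training data, computed from the predicted probability of default
score = roc_auc_score(y_train, y_train_proba[:, 1])
print(f"ROC AUC: {score:.4f}")
fpr, tpr, _ = roc_curve(y_train, y_train_proba[:, 1])
plt.plot(fpr, tpr, label=f"train, auc={score:.4f}")
plt.legend(loc=4)
plt.show()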

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Use a GridSearch CV for the hyperparameter tuning
param_grid = {'n_estimators': [100,150,200], 'max_depth': [10,20,30]}

grid_rf_clf = GridSearchCV(estimator= rf_clf , param_grid=param_grid, scoring = 'accuracy', n_jobs = -1, verbose = 3, cv = 5)
grid_rf_clf.fit(X_train, y_train)

In [None]:
grid_rf_clf.best_params_

In [None]:
grid_rf_clf.best_score_

In [None]:
optaind_clf = grid_rf_clf.best_estimator_
print(optaind_clf)

In [None]:
y_train_pred = optaind_clf.predict(X_train)
y_test_pred = optaind_clf.predict(X_test)

y_train_proba = optaind_clf.predict_proba(X_train)
y_test_proba = optaind_clf.predict_proba(X_test)

# scikit-learn metrics take (y_true, y_pred), in that order
print('Classification Report for train data:\n', classification_report(y_train, y_train_pred, target_names=['No defaulter', 'Defaulter']))
labels = ['Not Defaulter', 'Defaulter']
cm_train = confusion_matrix(y_train, y_train_pred)
print('Confusion matrix for train data:\n', cm_train )

print('--------------------------------------------------------------------')

print('Classification Report for test data:\n', classification_report(y_test, y_test_pred, target_names=['No defaulter', 'Defaulter']))
cm_test= confusion_matrix(y_test, y_test_pred)
print('Confusion matrix for test data:\n', cm_test )

print('-----------------------------------------------------------------------')
train_accuracy_rf_clf_grid = accuracy_score(y_train, y_train_pred)
test_accuracy_rf_clf_grid = accuracy_score(y_test, y_test_pred)

print('\nAccuracy Score for train data: ', train_accuracy_rf_clf_grid)
print('\nAccuracy Score for test data: ', test_accuracy_rf_clf_grid)

In [None]:
# Visualizing evaluation Metric Score chart
print('Heat map of confusion matrix for train data')
ax= plt.subplot()
sns.heatmap(cm_train, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

print('------------------------------------------------------------------------------------------------------------------')

print('Heat map of confusion matrix for test data')
ax= plt.subplot()
sns.heatmap(cm_test, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

In [None]:
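# ROC curve on the training data, computed from the predicted probability of default
score = roc_auc_score(y_train, y_train_proba[:, 1])
print(f"ROC AUC: {score:.4f}")
fpr, tpr, _ = roc_curve(y_train, y_train_proba[:, 1])
plt.plot(fpr, tpr, label=f"train, auc={score:.4f}")
plt.legend(loc=4)
plt.show()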

### KNeighborsClassifier

In [None]:
# Fit a KNeighborsClassifier (5 neighbours, Euclidean distance) on the training data
from sklearn.neighbors import KNeighborsClassifier
knn= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )
knn.fit(X_train, y_train)

y_train_pred = knn.predict(X_train)
y_test_pred = knn.predict(X_test)

y_train_proba = knn.predict_proba(X_train)
y_test_proba = knn.predict_proba(X_test)

# scikit-learn metrics expect (y_true, y_pred), in that order
labels = ['Not Defaulter', 'Defaulter']
print('Classification Report for train data:\n', classification_report(y_train, y_train_pred, target_names=labels))
cm_train = confusion_matrix(y_train, y_train_pred)
print('Confusion matrix for train data:\n', cm_train)

print('--------------------------------------------------------------------')

print('Classification Report for test data:\n', classification_report(y_test, y_test_pred, target_names=labels))
cm_test = confusion_matrix(y_test, y_test_pred)
print('Confusion matrix for test data:\n', cm_test)

print('-----------------------------------------------------------------------')
train_accuracy_knn = accuracy_score(y_train, y_train_pred)
test_accuracy_knn = accuracy_score(y_test, y_test_pred)

print('\nAccuracy Score for train data: ', train_accuracy_knn)
print('\nAccuracy Score for test data: ', test_accuracy_knn)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print('Heat map of confusion matrix for train data')
ax= plt.subplot()
sns.heatmap(cm_train, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

print('------------------------------------------------------------------------------------------------------------------')

print('Heat map of confusion matrix for test data')
ax= plt.subplot()
sns.heatmap(cm_test, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()
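
The heat maps show raw counts. The per-class numbers for the defaulter class can be read off the same matrix. The sketch below assumes the matrix follows scikit-learn's layout for `confusion_matrix(y_true, y_pred)`, that is, rows are true labels and columns are predicted labels, so it looks like `[[TN, FP], [FN, TP]]`. `defaulter_metrics` is a hypothetical helper introduced here for illustration; it is not part of the notebook or of scikit-learn.

```python
def defaulter_metrics(cm):
    """Precision, recall and F1 for the positive (defaulter) class.

    Assumes cm is a 2x2 matrix laid out as [[TN, FP], [FN, TP]],
    i.e. rows are true labels and columns are predicted labels.
    """
    (tn, fp), (fn, tp) = cm
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

In the notebook this could be called as `defaulter_metrics(cm_test)`, provided `cm_test` was built with the true labels as the first argument.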

In [None]:
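# Score the ROC curve on the predicted probability of default, not on hard labels
score = roc_auc_score(y_train, y_train_proba[:, 1])
print(f"ROC AUC: {score:.4f}")
fpr, tpr, _ = roc_curve(y_train, y_train_proba[:, 1])
plt.plot(fpr, tpr, label=f"train data, auc={score:.4f}")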
plt.legend(loc=4)
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Use a GridSearch CV for the hyperparameter tuning
param_grid = dict(n_neighbors=list(range(1, 10)))

grid_knn_clf = GridSearchCV(estimator= knn , param_grid=param_grid, scoring = 'accuracy', n_jobs = -1, verbose = 3, cv = 5)
grid_knn_clf.fit(X_train, y_train)

In [None]:
grid_knn_clf.best_params_

In [None]:
grid_knn_clf.best_score_

In [None]:
optaind_clf = grid_knn_clf.best_estimator_
print(optaind_clf)

In [None]:
y_train_pred = optaind_clf.predict(X_train)
y_test_pred = optaind_clf.predict(X_test)

y_train_proba = optaind_clf.predict_proba(X_train)
y_test_proba = optaind_clf.predict_proba(X_test)

# scikit-learn metrics expect (y_true, y_pred), in that order
labels = ['Not Defaulter', 'Defaulter']
print('Classification Report for train data:\n', classification_report(y_train, y_train_pred, target_names=labels))
cm_train = confusion_matrix(y_train, y_train_pred)
print('Confusion matrix for train data:\n', cm_train)

print('--------------------------------------------------------------------')

print('Classification Report for test data:\n', classification_report(y_test, y_test_pred, target_names=labels))
cm_test = confusion_matrix(y_test, y_test_pred)
print('Confusion matrix for test data:\n', cm_test)

print('-----------------------------------------------------------------------')
train_accuracy_knn_grid = accuracy_score(y_train, y_train_pred)
test_accuracy_knn_grid = accuracy_score(y_test, y_test_pred)

print('\nAccuracy Score for train data: ', train_accuracy_knn_grid)
print('\nAccuracy Score for test data: ', test_accuracy_knn_grid)

In [None]:
# Visualizing evaluation Metric Score chart
print('Heat map of confusion matrix for train data')
ax= plt.subplot()
sns.heatmap(cm_train, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

print('------------------------------------------------------------------------------------------------------------------')

print('Heat map of confusion matrix for test data')
ax= plt.subplot()
sns.heatmap(cm_test, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

In [None]:
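# Score the ROC curve on the predicted probability of default, not on hard labels
score = roc_auc_score(y_train, y_train_proba[:, 1])
print(f"ROC AUC: {score:.4f}")
fpr, tpr, _ = roc_curve(y_train, y_train_proba[:, 1])
plt.plot(fpr, tpr, label=f"train data, auc={score:.4f}")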
plt.legend(loc=4)
plt.show()

### Gaussian Naive Bayes

In [None]:
# Fit a Gaussian Naive Bayes classifier on the training data
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_train_pred = gnb.predict(X_train)
y_test_pred = gnb.predict(X_test)

y_train_proba = gnb.predict_proba(X_train)
y_test_proba = gnb.predict_proba(X_test)

# scikit-learn metrics expect (y_true, y_pred), in that order
labels = ['Not Defaulter', 'Defaulter']
print('Classification Report for train data:\n', classification_report(y_train, y_train_pred, target_names=labels))
cm_train = confusion_matrix(y_train, y_train_pred)
print('Confusion matrix for train data:\n', cm_train)

print('--------------------------------------------------------------------')

print('Classification Report for test data:\n', classification_report(y_test, y_test_pred, target_names=labels))
cm_test = confusion_matrix(y_test, y_test_pred)
print('Confusion matrix for test data:\n', cm_test)

print('-----------------------------------------------------------------------')
train_accuracy_gnb = accuracy_score(y_train, y_train_pred)
test_accuracy_gnb = accuracy_score(y_test, y_test_pred)

print('\nAccuracy Score for train data: ', train_accuracy_gnb)
print('\nAccuracy Score for test data: ', test_accuracy_gnb)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print('Heat map of confusion matrix for train data')
ax= plt.subplot()
sns.heatmap(cm_train, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

print('------------------------------------------------------------------------------------------------------------------')

print('Heat map of confusion matrix for test data')
ax= plt.subplot()
sns.heatmap(cm_test, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

In [None]:
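# Score the ROC curve on the predicted probability of default, not on hard labels
score = roc_auc_score(y_train, y_train_proba[:, 1])
print(f"ROC AUC: {score:.4f}")
fpr, tpr, _ = roc_curve(y_train, y_train_proba[:, 1])
plt.plot(fpr, tpr, label=f"train data, auc={score:.4f}")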
plt.legend(loc=4)
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# Use a GridSearch CV for the hyperparameter tuning
param_grid = {
    'var_smoothing': np.logspace(0,-9, num=100)
}

grid_svc_clf = GridSearchCV(estimator= gnb , param_grid=param_grid, scoring = 'accuracy', n_jobs = -1, verbose = 3, cv = 5)
grid_svc_clf.fit(X_train, y_train)

In [None]:
grid_svc_clf.best_params_

In [None]:
grid_svc_clf.best_score_

In [None]:
optaind_clf = grid_svc_clf.best_estimator_
print(optaind_clf)

In [None]:
y_train_pred = optaind_clf.predict(X_train)
y_test_pred = optaind_clf.predict(X_test)

y_train_proba = optaind_clf.predict_proba(X_train)
y_test_proba = optaind_clf.predict_proba(X_test)

# scikit-learn metrics expect (y_true, y_pred), in that order
labels = ['Not Defaulter', 'Defaulter']
print('Classification Report for train data:\n', classification_report(y_train, y_train_pred, target_names=labels))
cm_train = confusion_matrix(y_train, y_train_pred)
print('Confusion matrix for train data:\n', cm_train)

print('--------------------------------------------------------------------')

print('Classification Report for test data:\n', classification_report(y_test, y_test_pred, target_names=labels))
cm_test = confusion_matrix(y_test, y_test_pred)
print('Confusion matrix for test data:\n', cm_test)

print('-----------------------------------------------------------------------')
train_accuracy_gnb_grid = accuracy_score(y_train, y_train_pred)
test_accuracy_gnb_grid = accuracy_score(y_test, y_test_pred)

print('\nAccuracy Score for train data: ', train_accuracy_gnb_grid)
print('\nAccuracy Score for test data: ', test_accuracy_gnb_grid)

In [None]:
# Visualizing evaluation Metric Score chart
print('Heat map of confusion matrix for train data')
ax= plt.subplot()
sns.heatmap(cm_train, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

print('------------------------------------------------------------------------------------------------------------------')

print('Heat map of confusion matrix for test data')
ax= plt.subplot()
sns.heatmap(cm_test, annot=True, ax = ax)

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

In [None]:
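# Score the ROC curve on the predicted probability of default, not on hard labels
score = roc_auc_score(y_train, y_train_proba[:, 1])
print(f"ROC AUC: {score:.4f}")
fpr, tpr, _ = roc_curve(y_train, y_train_proba[:, 1])
plt.plot(fpr, tpr, label=f"train data, auc={score:.4f}")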
plt.legend(loc=4)
plt.show()

## ***Evaluating the models***

### 1- Without Hyperparameter Tuning

In [None]:
classifiers = ['Logistic Regression', 'Decision Tree', 'RandomForest', 'KNeighborsClassifier', 'Gaussian Naive Bayes']
train_accuracy = [train_accuracy_log, train_accuracy_tree, train_accuracy_rf_clf, train_accuracy_knn, train_accuracy_gnb]
test_accuracy = [test_accuracy_log, test_accuracy_tree, test_accuracy_rf_clf, test_accuracy_knn, test_accuracy_gnb]


In [None]:
pd.DataFrame({'Classifier':classifiers, 'Train Accuracy': train_accuracy, 'Test Accuracy': test_accuracy})
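
A flat table of accuracies is easier to interpret when it is sorted and paired with the train-test gap, which is a rough indicator of overfitting. The following is a minimal sketch of that idea. `summarize_models` is a hypothetical helper introduced here for illustration, not something the notebook defines.

```python
def summarize_models(names, train_scores, test_scores):
    """Return one dict per model, sorted by test score (best first).

    'gap' is train score minus test score; a large positive gap
    suggests the model fits the training data better than it generalizes.
    """
    if not (len(names) == len(train_scores) == len(test_scores)):
        raise ValueError("names, train_scores and test_scores must have the same length")
    rows = [
        {"classifier": name, "train": train, "test": test, "gap": train - test}
        for name, train, test in zip(names, train_scores, test_scores)
    ]
    return sorted(rows, key=lambda row: row["test"], reverse=True)
```

With the lists built in the cell above, `pd.DataFrame(summarize_models(classifiers, train_accuracy, test_accuracy))` would give the same table ordered by test accuracy, with an extra `gap` column.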

### 2- After Hyperparameter Tuning

In [None]:
classifiers_grid = ['Logistic Regression Grid', 'Decision Tree Grid', 'RandomForest Grid', 'KNeighborsClassifier Grid', 'Gaussian Naive Bayes Grid']
train_accuracy_grid = [train_accuracy_log_grid, train_accuracy_tree_grid, train_accuracy_rf_clf_grid, train_accuracy_knn_grid, train_accuracy_gnb_grid]
# Use the tuned KNN test accuracy here, not the untuned test_accuracy_knn
test_accuracy_grid = [test_accuracy_log_grid, test_accuracy_tree_grid, test_accuracy_rf_clf_grid, test_accuracy_knn_grid, test_accuracy_gnb_grid]


In [None]:
pd.DataFrame({'Classifier':classifiers_grid, 'Train Accuracy': train_accuracy_grid, 'Test Accuracy': test_accuracy_grid})
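
Accuracy alone says little about how well a model ranks clients by risk, which is what the Problem Statement asks for when it mentions the K-S chart. The sketch below computes the K-S statistic, the largest gap between the cumulative score distributions of defaulters and non-defaulters, from true labels and predicted default probabilities. `ks_statistic` is a hypothetical helper written for illustration; it assumes labels are coded 1 for default and 0 otherwise, as in the Data Description.

```python
def ks_statistic(y_true, scores):
    """Kolmogorov-Smirnov statistic for a binary classifier's scores.

    Walks the rows from highest to lowest score and tracks the share of
    defaulters (label 1) and non-defaulters (label 0) seen so far.
    The result is the largest absolute difference between the two shares.
    """
    pairs = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("y_true must contain both classes")

    best = 0.0
    cum_pos = cum_neg = 0
    i = 0
    while i < len(pairs):
        # Consume every row that shares this score before measuring the gap,
        # so tied scores are treated as a single threshold.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best
```

In the notebook this could be applied to any fitted model as `ks_statistic(y_test, y_test_proba[:, 1])`. A value near 1 means the scores separate the two groups almost perfectly, while a value near 0 means they barely separate them.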

# **Conclusion**



*   Among the models evaluated, the Random Forest and XGBoost classifiers achieve the best accuracy.
*   Gaussian Naive Bayes is the least accurate of the models evaluated.



### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***