# ML Classification Project - Credit Card Approvals
Commercial banks receive many credit card applications every day, and reviewing them manually is time-consuming. Machine learning can automate part of this process. In this project, I build a credit card approval predictor that decides whether an application should be approved or rejected. This is a binary classification problem: the model predicts whether an applicant is likely to default (BAD = 1) or to repay (BAD = 0), and an application is approved when the prediction is 0.

Task Feature Description:

- BAD: 1 = client defaulted on a previous loan; 0 = loan repaid (i.e., 1 means bad)
- DEBTINC: Debt-to-income ratio
- DELINQ: Number of delinquent credit lines
- DEROG: Number of major derogatory reports
- VALUE: Value of current property
- CLAGE: Age of oldest tradeline in months

## Importing the Libraries

In [148]:
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px

## Importing the Dataset

In [149]:
dataset = pd.read_csv('Credit data.csv') 
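
The `Unnamed: 0` column that appears in the table and `info()` output below usually arises when a CSV was written together with its row index. The following is a minimal sketch, assuming that is what happened here, of reading such a file with `index_col=0` so the counter becomes the index rather than a regular column. The inline `csv_text` is a hypothetical stand-in for `Credit data.csv`, built from a few rows of the displayed table.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'Credit data.csv': rows copied from the table in this notebook.
csv_text = """,BAD,DEBTINC,DELINQ,DEROG,VALUE,CLAGE
0,0,28.749890,0,0.0,89082.0,217.321155
1,0,41.447506,1,0.0,194992.0,337.689972
4,1,43.159875,1,0.0,81922.0,120.885811
"""

# Default read: the unnamed first column becomes a regular column called "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the same column is used as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

If this variant were used, the column positions would shift by one, so the positional slices later in the notebook (`iloc[:, 2:]` and `iloc[:, 1]`) would need adjusting.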


## Showing the Dataset in a Table

In [150]:
# dataset is already a DataFrame, so it can be displayed directly
dataset

Unnamed: 0.1,Unnamed: 0,BAD,DEBTINC,DELINQ,DEROG,VALUE,CLAGE
0,0,0,28.749890,0,0.0,89082.0,217.321155
1,1,0,41.447506,1,0.0,194992.0,337.689972
2,2,0,29.964687,0,0.0,63601.0,153.735399
3,3,0,38.251392,0,0.0,37391.0,217.744275
4,4,1,43.159875,1,0.0,81922.0,120.885811
...,...,...,...,...,...,...,...
619,619,0,1.000000,1,,122021.0,143.000000
620,620,0,149.823437,0,0.0,318005.0,768.676993
621,621,0,149.823437,0,0.0,318005.0,768.676993
622,622,0,149.823437,0,0.0,318005.0,768.676993


## A Quick Review of the Data

In [151]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 624 entries, 0 to 623
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  624 non-null    int64  
 1   BAD         624 non-null    int64  
 2   DEBTINC     623 non-null    float64
 3   DELINQ      624 non-null    int64  
 4   DEROG       622 non-null    float64
 5   VALUE       623 non-null    float64
 6   CLAGE       624 non-null    float64
dtypes: float64(4), int64(3)
memory usage: 34.2 KB


## Preprocessing the Data for Machine Learning

### Removing duplicates

In [152]:
# Let us take a look at the total number of duplicate rows in the dataset
(dataset.duplicated()).sum()

0

In [153]:
# Let us take a look at the total number of non-duplicate rows in the dataset
(~dataset.duplicated()).sum()

624

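There are no fully duplicated rows, so no action is needed. Note that `duplicated()` compares entire rows, including the `Unnamed: 0` column, so rows that differ only in that column (such as rows 620 to 622 in the table above) are not counted as duplicates.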
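
As a rough sketch of how the duplicate check changes when the row counter is ignored, the example below rebuilds four rows from the displayed table and compares a full-row check with one restricted to the remaining columns. The small frame `df` and the name `feature_cols` are illustrative only. Whether such repeated rows should be dropped from the real data is a separate judgment call.

```python
import pandas as pd

# Rows 617, 620, 621, 622 copied from the displayed table.
df = pd.DataFrame({
    "Unnamed: 0": [617, 620, 621, 622],
    "BAD": [0, 0, 0, 0],
    "DEBTINC": [39.483914, 149.823437, 149.823437, 149.823437],
    "DELINQ": [0, 0, 0, 0],
    "DEROG": [0.0, 0.0, 0.0, 0.0],
    "VALUE": [237546.0, 318005.0, 318005.0, 318005.0],
    "CLAGE": [291.633415, 768.676993, 768.676993, 768.676993],
})

# Full-row comparison: the unique counter makes every row look different.
full_row_dupes = df.duplicated().sum()

# Comparison on every column except the counter.
feature_cols = df.columns.drop("Unnamed: 0").tolist()
feature_dupes = df.duplicated(subset=feature_cols).sum()

print("full-row duplicates:", full_row_dupes)
print("duplicates ignoring the counter:", feature_dupes)
```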

### Dropping records (observations) with missing values

In [154]:
# Checking if the dataset contains any NULL values
dataset.isnull().any().any()

True

In [155]:
# A quick review of the data to find where the missing values are :
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 624 entries, 0 to 623
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  624 non-null    int64  
 1   BAD         624 non-null    int64  
 2   DEBTINC     623 non-null    float64
 3   DELINQ      624 non-null    int64  
 4   DEROG       622 non-null    float64
 5   VALUE       623 non-null    float64
 6   CLAGE       624 non-null    float64
dtypes: float64(4), int64(3)
memory usage: 34.2 KB


In [156]:
# Let us drop rows with missing values using dropna() function  
dataset.dropna(inplace=True)

In [157]:
#checking the dataset 
dataset

Unnamed: 0.1,Unnamed: 0,BAD,DEBTINC,DELINQ,DEROG,VALUE,CLAGE
0,0,0,28.749890,0,0.0,89082.0,217.321155
1,1,0,41.447506,1,0.0,194992.0,337.689972
2,2,0,29.964687,0,0.0,63601.0,153.735399
3,3,0,38.251392,0,0.0,37391.0,217.744275
4,4,1,43.159875,1,0.0,81922.0,120.885811
...,...,...,...,...,...,...,...
617,617,0,39.483914,0,0.0,237546.0,291.633415
620,620,0,149.823437,0,0.0,318005.0,768.676993
621,621,0,149.823437,0,0.0,318005.0,768.676993
622,622,0,149.823437,0,0.0,318005.0,768.676993


In [158]:
#A quick review of the data
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 622 entries, 0 to 623
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  622 non-null    int64  
 1   BAD         622 non-null    int64  
 2   DEBTINC     622 non-null    float64
 3   DELINQ      622 non-null    int64  
 4   DEROG       622 non-null    float64
 5   VALUE       622 non-null    float64
 6   CLAGE       622 non-null    float64
dtypes: float64(4), int64(3)
memory usage: 38.9 KB
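
After `dropna`, the index still runs from 0 to 623 with gaps, which is why the table jumps from row 617 to row 620. The sketch below, on a three-row frame rebuilt from the displayed values (row 619 has a missing DEROG), shows one possible way to count missing values per column and to drop rows without `inplace=True` while renumbering the index. The names `df`, `missing_per_column`, and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Rows 617, 619, 620 from the displayed table; row 619 lacks DEROG.
df = pd.DataFrame({
    "DEBTINC": [39.483914, 1.000000, 149.823437],
    "DEROG": [0.0, np.nan, 0.0],
    "VALUE": [237546.0, 122021.0, 318005.0],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# The rows that would be dropped.
rows_with_missing = df[df.isnull().any(axis=1)]

# Drop them and renumber the index so it has no gaps.
cleaned = df.dropna().reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```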


### Separating the input data from the output data

In [159]:
X = dataset.iloc[:, 2:].values
y = dataset.iloc[:, 1].values


In [160]:
dataset

Unnamed: 0.1,Unnamed: 0,BAD,DEBTINC,DELINQ,DEROG,VALUE,CLAGE
0,0,0,28.749890,0,0.0,89082.0,217.321155
1,1,0,41.447506,1,0.0,194992.0,337.689972
2,2,0,29.964687,0,0.0,63601.0,153.735399
3,3,0,38.251392,0,0.0,37391.0,217.744275
4,4,1,43.159875,1,0.0,81922.0,120.885811
...,...,...,...,...,...,...,...
617,617,0,39.483914,0,0.0,237546.0,291.633415
620,620,0,149.823437,0,0.0,318005.0,768.676993
621,621,0,149.823437,0,0.0,318005.0,768.676993
622,622,0,149.823437,0,0.0,318005.0,768.676993


In [161]:
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4
0,28.749890,0.0,0.0,89082.0,217.321155
1,41.447506,1.0,0.0,194992.0,337.689972
2,29.964687,0.0,0.0,63601.0,153.735399
3,38.251392,0.0,0.0,37391.0,217.744275
4,43.159875,1.0,0.0,81922.0,120.885811
...,...,...,...,...,...
617,39.483914,0.0,0.0,237546.0,291.633415
618,149.823437,0.0,0.0,318005.0,768.676993
619,149.823437,0.0,0.0,318005.0,768.676993
620,149.823437,0.0,0.0,318005.0,768.676993


In [162]:
X.shape

(622, 5)

In [163]:
y.shape

(622,)
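
The positional slices `iloc[:, 2:]` and `iloc[:, 1]` depend on the column order of the file. A small sketch, assuming the column names shown in `info()`, of selecting the same data by name, which keeps working if columns are reordered. The frame `df` is rebuilt from three displayed rows, and `feature_names` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Rows 0, 1, 4 from the displayed table.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 4],
    "BAD": [0, 0, 1],
    "DEBTINC": [28.749890, 41.447506, 43.159875],
    "DELINQ": [0, 1, 1],
    "DEROG": [0.0, 0.0, 0.0],
    "VALUE": [89082.0, 194992.0, 81922.0],
    "CLAGE": [217.321155, 337.689972, 120.885811],
})

feature_names = ["DEBTINC", "DELINQ", "DEROG", "VALUE", "CLAGE"]

# Selection by name.
X_named = df[feature_names].values
y_named = df["BAD"].values

# Selection by position, as in the notebook.
X_pos = df.iloc[:, 2:].values
y_pos = df.iloc[:, 1].values

print(np.array_equal(X_named, X_pos), np.array_equal(y_named, y_pos))
```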

### Taking care of outliers in the numerical data

In [164]:
# Source 1: https://towardsdatascience.com/local-outlier-factor-lof-algorithm-for-outlier-identification-8efb887d9843
# Source 2: https://towardsdatascience.com/anomaly-detection-with-local-outlier-factor-lof-d91e41df10f2
#Local Outlier Factor (LOF) — Algorithm for outlier identification

from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(contamination=0.01)

# fit_predict returns +1 for inliers and -1 for outliers

yhat = lof.fit_predict(X)


# filter outlier values
mask = (yhat != -1)

print("mask shape: ",mask.shape)

X, y = X[mask, :], y[mask]
print("X shape",X.shape)
print("y shape",y.shape)

pd.DataFrame(X)

mask shape:  (622,)
X shape (615, 5)
y shape (615,)


Unnamed: 0,0,1,2,3,4
0,28.749890,0.0,0.0,89082.0,217.321155
1,41.447506,1.0,0.0,194992.0,337.689972
2,29.964687,0.0,0.0,63601.0,153.735399
3,38.251392,0.0,0.0,37391.0,217.744275
4,43.159875,1.0,0.0,81922.0,120.885811
...,...,...,...,...,...
610,39.483914,0.0,0.0,237546.0,291.633415
611,149.823437,0.0,0.0,318005.0,768.676993
612,149.823437,0.0,0.0,318005.0,768.676993
613,149.823437,0.0,0.0,318005.0,768.676993
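
LOF is distance-based, and in the raw data VALUE is in the tens to hundreds of thousands while DELINQ and DEROG are single digits, so VALUE is likely to dominate the neighbour distances. The sketch below is one possible variant, not what this notebook does: it scales the features before running LOF and then applies the mask to the unscaled arrays. The data is synthetic, and all `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two features on very different scales.
X_demo = np.column_stack([
    rng.normal(35, 8, size=200),            # ratio-like feature
    rng.normal(100_000, 40_000, size=200),  # value-like feature
])
y_demo = rng.integers(0, 2, size=200)

# Scale only for the purpose of outlier detection.
X_scaled = StandardScaler().fit_transform(X_demo)

lof_demo = LocalOutlierFactor(contamination=0.01)
labels = lof_demo.fit_predict(X_scaled)  # +1 = inlier, -1 = outlier

# Keep the inliers, in the original units.
mask = labels != -1
X_kept, y_kept = X_demo[mask], y_demo[mask]

print("kept", X_kept.shape[0], "of", X_demo.shape[0], "rows")
```

A stricter setup would also detect outliers on the training split only, so that test rows do not influence which training rows are removed.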


### Splitting the dataset into the training set and test set

In [165]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [166]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4
0,91.612600,6.0,2.0,167164.0,111.588603
1,41.572972,0.0,0.0,114224.0,203.468979
2,32.656531,0.0,3.0,80147.0,244.233530
3,2.830297,1.0,2.0,128923.0,114.461955
4,40.845078,0.0,0.0,110253.0,192.422606
...,...,...,...,...,...
487,41.262353,0.0,0.0,72322.0,105.074802
488,38.121518,0.0,0.0,122352.0,58.861027
489,33.581117,0.0,0.0,107590.0,78.299274
490,36.665319,0.0,0.0,87260.0,103.692818
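
`train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. This matters most for imbalanced targets. The sketch below uses synthetic labels with a 70/30 split between classes, and the `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 70 of class 0 and 30 of class 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Share of class 1 in each split.
print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```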


### Feature Scaling

In [167]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


In [168]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4
0,3.407791,3.720047,1.555545,1.210328,-0.545263
1,0.253532,-0.438067,-0.355387,0.221112,0.370191
2,-0.308518,-0.438067,2.511011,-0.415638,0.776350
3,-2.188621,0.254952,1.555545,0.495771,-0.516634
4,0.207649,-0.438067,-0.355387,0.146911,0.260130
...,...,...,...,...,...
487,0.233952,-0.438067,-0.355387,-0.561852,-0.610163
488,0.035969,-0.438067,-0.355387,0.372988,-1.070616
489,-0.250236,-0.438067,-0.355387,0.097152,-0.876942
490,-0.055823,-0.438067,-0.355387,-0.282727,-0.623933
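
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation. As a small check of what `StandardScaler` computes, the sketch below reproduces the transformation by hand as z = (x - mean) / std on a few values taken from the displayed table. The `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# DEBTINC and VALUE for rows 0 to 3 (train) and row 4 (test) of the displayed table.
X_train_demo = np.array([
    [28.749890, 89082.0],
    [41.447506, 194992.0],
    [29.964687, 63601.0],
    [38.251392, 37391.0],
])
X_test_demo = np.array([[43.159875, 81922.0]])

sc_demo = StandardScaler()
X_train_scaled = sc_demo.fit_transform(X_train_demo)  # learns mean and std
X_test_scaled = sc_demo.transform(X_test_demo)        # reuses them

# Same computation by hand, with training statistics only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)
manual_test = (X_test_demo - mean) / std

print(np.allclose(X_test_scaled, manual_test))
```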


## ML Model

### Decision Tree

In [169]:
# Searching for the best max_depth (1 to 19) with a for-loop.
# Note: accuracy is measured on the test set, so the selected depth and its
# score are optimistically biased.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

lst = []
for i in range(1, 20):
    classifier = DecisionTreeClassifier(criterion="gini", random_state=0, max_depth=i)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    lst.append(accuracy)
    print(accuracy)

max_value = max(lst)
# lst is 0-indexed while max_depth starts at 1, so the best depth is index + 1
max_value_index = lst.index(max_value)
print("max value :", max_value)
print("index of max value :", max_value_index)
    

0.6585365853658537
0.7073170731707317
0.7073170731707317
0.7398373983739838
0.7723577235772358
0.7235772357723578
0.7154471544715447
0.6991869918699187
0.7235772357723578
0.7154471544715447
0.6991869918699187
0.7235772357723578
0.7154471544715447
0.7235772357723578
0.7235772357723578
0.7235772357723578
0.7235772357723578
0.7235772357723578
0.7235772357723578
max value : 0.7723577235772358
index of max value : 4
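
The loop above picks `max_depth` by test-set accuracy, so the test set has already influenced the model by the time it is used for evaluation. One common alternative, sketched here on synthetic data from `make_classification`, is to choose the depth by cross-validation on the training set and touch the test set only once at the end. The names `cv_scores`, `best_depth`, and the `*_demo` variables are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 5 features, like the notebook's X.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Mean 5-fold cross-validated accuracy for each depth, training data only.
cv_scores = {}
for depth in range(1, 20):
    tree = DecisionTreeClassifier(criterion="gini", random_state=0, max_depth=depth)
    cv_scores[depth] = cross_val_score(
        tree, X_tr, y_tr, cv=5, scoring="accuracy"
    ).mean()

best_depth = max(cv_scores, key=cv_scores.get)

# Refit on the full training set and evaluate once on the held-out test set.
final_tree = DecisionTreeClassifier(
    criterion="gini", random_state=0, max_depth=best_depth
).fit(X_tr, y_tr)
test_accuracy = final_tree.score(X_te, y_te)

print("best depth:", best_depth)
print("test accuracy:", test_accuracy)
```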


#### Training the decision tree model on the training set

In [171]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion="gini", random_state = 0, max_depth=max_value_index+1) 
classifier.fit(X_train, y_train)


DecisionTreeClassifier(max_depth=5, random_state=0)

#### Predicting the Test set Results

In [172]:
y_pred = classifier.predict(X_test)

# Row 0 holds the predicted labels, row 1 the true labels
compare = [y_pred, y_test]
pd.DataFrame(compare)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,113,114,115,116,117,118,119,120,121,122
0,1,1,1,1,1,0,1,0,1,0,...,1,1,0,0,0,0,1,0,0,1
1,1,1,1,0,1,0,1,0,1,1,...,1,1,0,1,0,0,1,0,1,0


In [173]:
accuracy_DT = np.sum(y_pred==y_test)/len(y_test)
print(accuracy_DT)

0.7723577235772358



#### Making the Confusion Matrix


In [174]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Your Model Accuracy is=", accuracy_score(y_test, y_pred)*100, "%")


[[49 13]
 [15 46]]
Your Model Accuracy is= 77.23577235772358 %
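
scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so with BAD = 1 as the positive class the matrix above reads as TN = 49, FP = 13, FN = 15, TP = 46. The sketch below works out accuracy, precision, and recall from those four counts.

```python
import numpy as np

# Confusion matrix printed above: rows = true label, columns = predicted label.
cm = np.array([[49, 13],
               [15, 46]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)       # of predicted defaulters, share that defaulted
recall = tp / (tp + fn)          # of actual defaulters, share that were caught

print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
```

For an approval decision, the 15 false negatives are the costly case: clients who defaulted but were predicted as 0, and whose applications would therefore be approved.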



#### Hyperparameter tuning (Optional)


In [175]:
from sklearn.model_selection import GridSearchCV

dt = DecisionTreeClassifier(random_state=0)


# Define the parameter grid to search
params = {
    'max_depth': [2, 3, 5, 10,15, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'criterion': ["gini", "entropy"]
}


# Instantiate the grid search model
grid_search = GridSearchCV(estimator=dt, 
                           param_grid=params, 
                           cv=4, n_jobs=-1, verbose=1, scoring = "accuracy")


#training
grid_search.fit(X_train, y_train)




Fitting 4 folds for each of 60 candidates, totalling 240 fits


GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [2, 3, 5, 10, 15, 20],
                         'min_samples_leaf': [5, 10, 20, 50, 100]},
             scoring='accuracy', verbose=1)

In [176]:
grid_search.best_estimator_

DecisionTreeClassifier(max_depth=10, min_samples_leaf=10, random_state=0)

In [177]:
dt_best = grid_search.best_estimator_

In [178]:
def evaluate_model(dt_classifier):
    # Predict once per split and reuse the predictions for both metrics
    train_pred = dt_classifier.predict(X_train)
    test_pred = dt_classifier.predict(X_test)
    print("Train Accuracy :", accuracy_score(y_train, train_pred))
    print("Train Confusion Matrix:")
    print(confusion_matrix(y_train, train_pred))
    print("-"*50)
    print("Test Accuracy :", accuracy_score(y_test, test_pred))
    print("Test Confusion Matrix:")
    print(confusion_matrix(y_test, test_pred))

evaluate_model(dt_best)

Train Accuracy : 0.8252032520325203
Train Confusion Matrix:
[[216  34]
 [ 52 190]]
--------------------------------------------------
Test Accuracy : 0.7235772357723578
Test Confusion Matrix:
[[54  8]
 [26 35]]
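
The tuned tree reaches 0.825 accuracy on the training set but 0.724 on the test set, a gap that suggests some overfitting. Its test score is also not directly comparable with the 0.772 of the depth-5 tree, because that depth was chosen using the test set. A fitted `GridSearchCV` also exposes `best_params_` and `best_score_` (the mean cross-validated score of the best candidate), which give a selection-time estimate that does not involve the test set. The sketch below shows this on synthetic data with a reduced, illustrative grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 5 features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Reduced grid, for illustration only.
params_demo = {"max_depth": [2, 3, 5, 10], "min_samples_leaf": [5, 10, 20]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=params_demo,
    cv=4,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_params = search.best_params_  # winning hyperparameter combination
cv_accuracy = search.best_score_   # mean CV accuracy of that combination
train_accuracy = search.best_estimator_.score(X_tr, y_tr)
test_accuracy = search.best_estimator_.score(X_te, y_te)

print(best_params)
print("cv:", cv_accuracy, "train:", train_accuracy, "test:", test_accuracy)
```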



#### Credit Card Approval Estimator


In [180]:
# Feature order: DEBTINC, DELINQ, DEROG, VALUE, CLAGE (same as the columns of X)
result = classifier.predict(sc.transform([[28.749890366, 0, 0, 89082, 217.32115468]]))
if result[0] == 0:
    print("Credit card application will be approved")
else:
    print("Credit card application will be declined")


Credit card application will be approved


In [181]:
result = classifier.predict(sc.transform([[91.612600, 6.0, 2.0, 167164.0, 111.588603]]))
if result[0] == 0:
    print("Credit card application will be approved")
else:
    print("Credit card application will be declined")


Credit card application will be declined
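
The two cells above repeat the same scale, predict, and print steps. A possible refactoring, sketched below, wraps them in a helper function. `predict_approval` is a hypothetical name, and the toy scaler and tree are fitted on the first five displayed rows only so that the example runs on its own. They stand in for the notebook's `sc` and `classifier`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def predict_approval(model, scaler, applicant):
    """Return 'approved' if the model predicts BAD = 0, otherwise 'declined'.

    applicant: values in the order DEBTINC, DELINQ, DEROG, VALUE, CLAGE.
    """
    features = np.asarray(applicant, dtype=float).reshape(1, -1)
    label = model.predict(scaler.transform(features))[0]
    return "approved" if label == 0 else "declined"

# Toy stand-ins, fitted on rows 0 to 4 of the displayed table (BAD = 0, 0, 0, 0, 1).
X_toy = np.array([
    [28.749890, 0, 0, 89082.0, 217.321155],
    [41.447506, 1, 0, 194992.0, 337.689972],
    [29.964687, 0, 0, 63601.0, 153.735399],
    [38.251392, 0, 0, 37391.0, 217.744275],
    [43.159875, 1, 0, 81922.0, 120.885811],
])
y_toy = np.array([0, 0, 0, 0, 1])

toy_scaler = StandardScaler().fit(X_toy)
toy_model = DecisionTreeClassifier(random_state=0).fit(toy_scaler.transform(X_toy), y_toy)

decision = predict_approval(toy_model, toy_scaler, [28.749890, 0, 0, 89082.0, 217.321155])
print("Credit card application will be", decision)
```

In the notebook itself the call would be `predict_approval(classifier, sc, [...])`.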
