# Credit Analysis and Classifier

Brief Overview

Banks play a crucial role in market economies. They decide who can get finance and on what terms they can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. Nowadays there are many risks related to lending loans, especially for the banks so as to reduce their capital loss. The analysis of risks and assessment of default becomes crucial thereafter. Banks hold huge volumes of customer behavior related data from which they are unable to arrive at a judgment if an applicant can be defaulter or not. Credit Risk assessment is a crucial issue faced by Banks nowadays which helps them to evaluate if a loan applicant can be a defaulter at a later stage so that they can go ahead and grant the loan or not. This helps the banks to minimize the possible losses and can increase the volume of credits. This is basically a classification problem.



# Tables of Content:

**1. [Introduction](#Introduction)** <br>
    - Info's about datasets 
    - Data Analysis
**2. [Librarys](#Librarys)** <br>
    - Importing Librarys
    - Importing Dataset
**3. [Knowning the data](#Known)** <br>
    - 3.1 Looking the Type of Data
    - 3.2 Shape of data
    - 3.3 Null Numbers
    - 3.4 Unique values
    - 3.5 The first rows of our dataset
**4. [Exploring some Variables](#Explorations)** <br>
    - 4.1 Ploting some graphical and descriptive informations
**5. [Correlation of data](#Correlation)** <br>
	- 5.1 Correlation Data
**6. [Preprocess](#Preprocessing)** <br>
	- 6.1 Importing Librarys
	- 6.2 Setting X and Y
    - 6.3 Spliting the X and Y in train and test 
**7. 1 [Model 1](#Modelling 1)** <br>
    - 7.1.1 Random Forest 
    - 7.1.2 Score values
    - 7.1.3 Cross Validation 
**7. 2 [Model 2](#Modelling 2)** <br>
    - 7.2.1 Logistic Regression 
    - 7.2.2 Score values
    - 7.2.3 Cross Validation 
    - 7.2.4 ROC Curve  

# **1. Introduction:** 
<h2>Dataset Information</h2>
The original dataset contains 1000 entries with 20 categorial/symbolic attributes. In this dataset, each entry represents a person who takes a credit by a bank. Each person is classified as good or bad credit risks according to the set of attributes.

The selected attributes are:

<b>Age </b>(numeric)<br>
<b>Sex </b>(text: male, female)<br>
<b>Job </b>(numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)<br>
<b>Housing</b> (text: own, rent, or free)<br>
<b>Saving accounts</b> (text - little, moderate, quite rich, rich)<br>
<b>Checking account </b>(numeric, in DM - Deutsch Mark currency)<br>
<b>Credit amount</b> (numeric, in DM)<br>
<b>Duration</b> (numeric, in month)<br>
<b>Purpose</b>(text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others<br>
<b>Risk </b> (Value target - Good or Bad Risk)<br>

<h2>Dataset Analysis</h2>
Before getting into any sophisticated analysis, the first step is to do an EDA and data cleaning. Since both categorical and continuous variables are included in the data set, appropriate tables and summary statistics are provided.

The dataset contains 1000 entries with 9 attributes, and classifies people as good or bad credit risks. In this quick analysis, I would like to explore the following points:

The age distribution of the people getting credit loans
How much credit do men/women usually take?
Is there any difference between two genders in the purpose of taking credit?
For what reason do people with low-skilled job take credit? Do people with high-skilled job behave differently?
Before going further, I would like to know the percentage of male and female entries in this dataset. It appears that the number of male entries is double that of female. One is tempted to infer that men are willing to take credit more than women. However, based on the sample size I would say that this statement has almost no ground to hold.

<a id="Librarys"></a> <br>
# **2. Librarys:** 
- Importing Librarys
- Importing Dataset

In [276]:
#Load the librarys
import pandas as pd #To work with dataset
import numpy as np #Math library
import seaborn as sns #Graph library that use matplot in background
import matplotlib.pyplot as plt #to plot some parameters in seaborn

#Importing the data
df_credit = pd.read_csv("credit_data_Result.csv",index_col=0)

In [277]:
df_credit.head(10)

Unnamed: 0_level_0,Age,Sex,Job,Housing,Saving accounts,Current account,Credit amount,Duration,Purpose,EXISTCR,Risk
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
7798481,67,male,2,own,,little,1169,6,radio/TV,2,good
6144999,22,female,2,own,little,moderate,5951,48,radio/TV,1,bad
7127219,49,male,1,own,little,,2096,12,education,0,good
8531771,45,male,2,free,little,little,7882,42,furniture/equipment,3,good
7071197,53,male,2,free,little,little,4870,24,car,0,bad
7820389,35,male,1,free,,,9055,36,education,1,good
5037082,53,male,2,own,quite rich,,2835,24,furniture/equipment,2,good
5814416,35,male,3,rent,little,moderate,6948,36,car,1,good
7935262,61,male,1,own,rich,,3059,12,radio/TV,0,good
8591528,28,male,3,own,little,moderate,5234,30,car,0,bad


In [278]:
print("Purpose : ",df_credit.Purpose.unique())
print("Sex : ",df_credit.Sex.unique())
print("Housing : ",df_credit.Housing.unique())
print("Saving accounts : ",df_credit['Saving accounts'].unique())
print("Current account : ",df_credit['Current account'].unique())

Purpose :  ['radio/TV' 'education' 'furniture/equipment' 'car' 'business'
 'domestic appliances' 'repairs' 'vacation/others']
Sex :  ['male' 'female']
Housing :  ['own' 'free' 'rent']
Saving accounts :  [nan 'little' 'quite rich' 'rich' 'moderate']
Current account :  ['little' 'moderate' nan 'rich']


 <a id="Explorations"></a> <br>
# **4. Some explorations:**
 
- Starting by distribuition of column Age.
- Some graphical Representation
- Columns crossing

In [279]:
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import warnings
from collections import Counter

good_credit = go.Bar(
            x = df_credit[df_credit["Risk"]== 'good']["Risk"].value_counts().index.values,
            y = df_credit[df_credit["Risk"]== 'good']["Risk"].value_counts().values,
            name='Good credit'
    )

bad_credit = go.Bar(
            x = df_credit[df_credit["Risk"]== 'bad']["Risk"].value_counts().index.values,
            y = df_credit[df_credit["Risk"]== 'bad']["Risk"].value_counts().values,
            name='Bad credit'
    )

data = [good_credit, bad_credit]

layout = go.Layout(
    
)

layout = go.Layout(
    yaxis=dict(
        title='Count'
    ),
    xaxis=dict(
        title='Risk Variable'
    ),
    title='Target variable distribution'
)

fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='grouped-bar')
plt.savefig("Total_Risk.svg", format="svg")

## Display AGE distribution against RISK of the customers. 
Gives the visualization of age groups and risks to lend them credit

In [280]:
# Displaying the age w.r.t risks of good and bad customers 
df_good = df_credit.loc[df_credit["Risk"] == 'good']['Age'].values.tolist()
df_bad = df_credit.loc[df_credit["Risk"] == 'bad']['Age'].values.tolist()
df_age = df_credit['Age'].values.tolist()

#First plot
trace0 = go.Histogram(
    x=df_good,
    histnorm='probability',
    name="Good Credit"
)

trace1 = go.Histogram(
    x=df_bad,
    histnorm='probability',
    name="Bad Credit"
)

#Third plot
trace2 = go.Histogram(
    x=df_age,
    histnorm='probability',
    name="Overall Age"
)

#Creating the grid
fig = tls.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                          subplot_titles=('Good Credits v/s AGE','Bad Credits v/s AGE', 'General Distribuition'))

#setting the figs
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)

fig['layout'].update(showlegend=True, title='Age Distribuition', bargap=0.05)
py.iplot(fig, filename='custom-sized-subplot-with-subplot-titles')


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3           -      ]



## Display HOUSING distribution Rent/Own against RISK of the customers. 
Gives the visualization of housing methods and risks to lend them credit

In [281]:
#First plot
trace0 = go.Bar(
    x = df_credit[df_credit["Risk"]== 'good']["Housing"].value_counts().index.values,
    y = df_credit[df_credit["Risk"]== 'good']["Housing"].value_counts().values,
    name='Good credit'
)

#Second plot
trace1 = go.Bar(
    x = df_credit[df_credit["Risk"]== 'bad']["Housing"].value_counts().index.values,
    y = df_credit[df_credit["Risk"]== 'bad']["Housing"].value_counts().values,
    name="Bad Credit"
)

data = [trace0, trace1]

layout = go.Layout(
    title='Housing Distribuition'
)


fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='Housing-Grouped')


Above chart shows a correlation between "own house" and "good credits". People who owns their house shows a good credit record

## Display JOB distribution against RISK of the customers. 
Gives the visualization of housing methods and risks to lend them credit

In [282]:
#First plot
trace0 = go.Bar(
    x = df_credit[df_credit["Risk"]== 'good']["Job"].value_counts().index.values,
    y = df_credit[df_credit["Risk"]== 'good']["Job"].value_counts().values,
    name='Good credit'
)

#Second plot
trace1 = go.Bar(
    x = df_credit[df_credit["Risk"]== 'bad']["Job"].value_counts().index.values,
    y = df_credit[df_credit["Risk"]== 'bad']["Job"].value_counts().values,
    name="Bad Credit"
)

data = [trace0, trace1]

layout = go.Layout(
    title='JOB Distribuition'
)


fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename='JOB-Grouped')


In [283]:
#First plot
trace0 = go.Bar(
    x = df_credit[df_credit["Risk"]== 'good']["Current account"].value_counts().index.values,
    y = df_credit[df_credit["Risk"]== 'good']["Current account"].value_counts().values,
    name='Good credit Distribuition' 
    
)

#Second plot
trace1 = go.Bar(
    x = df_credit[df_credit["Risk"]== 'bad']["Current account"].value_counts().index.values,
    y = df_credit[df_credit["Risk"]== 'bad']["Current account"].value_counts().values,
    name="Bad Credit Distribuition"
)

data = [trace0, trace1]

layout = go.Layout(
    title='Checking accounts Distribuition',
    xaxis=dict(title='Checking accounts name'),
    yaxis=dict(title='Count'),
    barmode='group'
)


fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename = 'Age-ba', validate = False)

In [284]:
date_int = ["Purpose", 'Sex']
cm = sns.light_palette("green", as_cmap=True)
pd.crosstab(df_credit[date_int[0]], df_credit[date_int[1]]).style.background_gradient(cmap = cm)

Sex,female,male
Purpose,Unnamed: 1_level_1,Unnamed: 2_level_1
business,19,78
car,94,243
domestic appliances,6,6
education,24,35
furniture/equipment,74,107
radio/TV,85,195
repairs,5,17
vacation/others,3,9


# Feature Engineering

# Transforming the data into Dummy variables
For the above mentioned transformation we can use One Hot Encoder. One hot encoding is a process by which categorical variables are converted into a form that could be provided to ML algorithms to do a better job in prediction.

In [548]:
#Load the librarys
import pandas as pd #To work with dataset
import numpy as np #Math library
import seaborn as sns #Graph library that use matplot in background
import matplotlib.pyplot as plt #to plot some parameters in seaborn

#Importing the data
df_credit = pd.read_csv("credit_data_Result.csv",index_col=0)

In [549]:
df_credit['Saving accounts'] = df_credit['Saving accounts'].fillna('no_inf')
df_credit['Current account'] = df_credit['Current account'].fillna('no_inf')

#Purpose to Dummies Variable
df_credit = df_credit.merge(pd.get_dummies(df_credit.Purpose, drop_first=True, prefix='Purpose'), left_index=True, right_index=True)

#Applying One hot encoding in Gender/Sex feature using dummies
df_credit = df_credit.merge(pd.get_dummies(df_credit.Sex, drop_first=True, prefix='Sex'), left_index=True, right_index=True)

# Housing get dummies
df_credit = df_credit.merge(pd.get_dummies(df_credit.Housing, drop_first=True, prefix='Housing'), left_index=True, right_index=True)

# Saving Accounts
df_credit = df_credit.merge(pd.get_dummies(df_credit["Saving accounts"], drop_first=True, prefix='Savings'), left_index=True, right_index=True)
#  Risk
# df_credit = df_credit.merge(pd.get_dummies(df_credit["Risk"], prefix='Risk'), left_index=True, right_index=True)
#  Checking Account
df_credit = df_credit.merge(pd.get_dummies(df_credit["Current account"], drop_first=True, prefix='Check'), left_index=True, right_index=True)
#  Age categorical
df_credit = df_credit.merge(pd.get_dummies(df_credit["Age"], drop_first=True, prefix='Age_cat'), left_index=True, right_index=True)

In [550]:
df_credit.shape

(1000, 80)

# Deleting the old features

In [551]:
#Excluding the missing columns
del df_credit["Saving accounts"]
del df_credit["Current account"]
del df_credit["Purpose"]
del df_credit["Sex"]
del df_credit["Housing"]
del df_credit["Age"]

# 6. Preprocessing: 
Importing ML librarys


Setting X and y variables to the prediction


Splitting Data

In [552]:
from sklearn.model_selection import train_test_split, KFold, cross_val_score # to split the data
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, fbeta_score #To evaluate our model

from sklearn.grid_search import GridSearchCV

# Algorithmns models to be compared
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# from xgboost import XGBClassifier

In [553]:
df_credit['Credit amount'] = np.log(df_credit['Credit amount'])

In [554]:
# Creating the X and y variables
X = df_credit.drop(['Risk'], axis=1)

y = df_credit["Risk"].values

# del df_credit["Risk"]

# Spliting X and y into train and test version
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=42)

> # **7.1 Model 1 :** <a id="Modelling 1"></a> <br>
- Using Random Forest to predictict the credit score 
- Some of Validation Parameters

In [555]:
# Random Forest in train data # Random  
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, y_train)
acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)
print("Mean Accuracy for Random Forest in train data: ",round(acc_random_forest,2), "%")

Mean Accuracy for Random Forest in train data:  100.0 %


In [556]:
#Testing the model 
#Predicting using our  model
y_pred = random_forest.predict(X_test)

# Display the accuracy and validation scores
print(accuracy_score(y_test,y_pred))
print("\n")
print(confusion_matrix(y_test, y_pred))
print("\n")
print(classification_report(y_test, y_pred))

0.752


[[ 26  46]
 [ 16 162]]


             precision    recall  f1-score   support

        bad       0.62      0.36      0.46        72
       good       0.78      0.91      0.84       178

avg / total       0.73      0.75      0.73       250



> # **7.2 Model 2:** <a id="Modelling 2"></a> <br>
GaussianNB

In [557]:
GNB = GaussianNB()

# Fitting with train data
model = GNB.fit(X_train, y_train)

In [558]:
# Printing the Training Score
print("Training score data: ")
print(model.score(X_train, y_train))

Training score data: 
0.38666666666666666


In [559]:
y_pred = model.predict(X_test)

print(accuracy_score(y_test,y_pred))
print("\n")
print(confusion_matrix(y_test, y_pred))
print("\n")
print(classification_report(y_test, y_pred))

0.348


[[ 66   6]
 [157  21]]


             precision    recall  f1-score   support

        bad       0.30      0.92      0.45        72
       good       0.78      0.12      0.20       178

avg / total       0.64      0.35      0.27       250



> > # **7.3 Model 3:** <a id="Modelling 3"></a> <br>
#Logistic Regression

In [560]:
# Logistic Regression
from sklearn.metrics import f1_score
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
acc_log = round(logreg.score(X_train, y_train) * 100, 2)
print("Mean Accuracy for Logistic Regression in train data: ",round(acc_log,2), "%")

Mean Accuracy for Logistic Regression in train data:  77.47 %


In [561]:
y_pred = logreg.predict(X_test)

In [562]:
print(accuracy_score(y_test,y_pred))
print("\n")
print(confusion_matrix(y_test, y_pred))
print("\n")
print(classification_report(y_test, y_pred))

0.748


[[ 26  46]
 [ 17 161]]


             precision    recall  f1-score   support

        bad       0.60      0.36      0.45        72
       good       0.78      0.90      0.84       178

avg / total       0.73      0.75      0.73       250



# Model 4

# Decision Tree Classifier

In [563]:
#Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
acc_decision_tree = round(decision_tree.score(X_train, y_train) * 100, 2)
print("Mean Accuracy for Decision Tree in train data: ",round(acc_decision_tree,2), "%")

Mean Accuracy for Decision Tree in train data:  100.0 %


In [564]:
y_pred = decision_tree.predict(X_test)

In [565]:
print(accuracy_score(y_test,y_pred))
print("\n")
print(confusion_matrix(y_test, y_pred))
print("\n")
print(classification_report(y_test, y_pred))

0.624


[[ 24  48]
 [ 46 132]]


             precision    recall  f1-score   support

        bad       0.34      0.33      0.34        72
       good       0.73      0.74      0.74       178

avg / total       0.62      0.62      0.62       250



In [566]:
# import pickle

# #serializing our model to a file called model.pkl
# pickle.dump(logreg, open("model.pkl","wb"))

In [567]:
#loading a model from a file called model.pkl
import dill as pickle
filename = 'model_v3.pk'
with open(filename, 'wb') as file:
    pickle.dump(decision_tree, file)

In [570]:
import os
import pandas as pd
from sklearn.externals import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/predict', methods=['GET','POST'])
def predict():
    """API Call
    
    Pandas dataframe (sent as a payload) from API Call
    """
    data = "{'status':'OK'}"
    #Importing the data
    df_predict = pd.read_csv("credit_data.csv",index_col=0)
    X_to_train = df_predict.values
    print(X_to_train)
    #Load the saved model
    print("Loading the model...")
    loaded_model = None
    with open('model_v3.pk','rb') as f:
        loaded_model = pickle.load(f)

    print("The model has been loaded...doing predictions now...")
    predictions = loaded_model.predict(X_test)
    print('predictions : ',predictions)
    """Add the predictions as Series to a new pandas dataframe
                            OR
       Depending on the use-case, the entire test data appended with the new files
    """
#     prediction_series = list(pd.Series(predictions))
#     print('prediction_series : ',prediction_series)
#         final_predictions = pd.DataFrame(list(zip(loan_ids, prediction_series)))

    """We can be as creative in sending the responses.
       But we need to send the response codes as well.
    """
#         responses = jsonify(predictions=prediction_series.to_json(orient="records"))
    responses = data
#     responses.status = 200

    return (predictions)

In [571]:
if __name__ == '__main__':
    app.run()

 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
[2018-10-03 15:30:37,830] ERROR in app: Exception on /predict [GET]
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\flask\app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "C:\ProgramData\Anaconda3\lib\site-packages\flask\app.py", line 1615, in full_dispatch_request
    return self.finalize_request(rv)
  File "C:\ProgramData\Anaconda3\lib\site-packages\flask\app.py", line 1630, in finalize_request
    response = self.make_response(rv)
  File "C:\ProgramData\Anaconda3\lib\site-packages\flask\app.py", line 1740, in make_response
    rv = self.response_class.force_type(rv, request.environ)
  File "C:\ProgramData\Anaconda3\lib\site-packages\werkzeug\wrappers.py", line 921, in force_type
    response = BaseResponse(*_run_wsgi_app(response, environ))
  File "C:\ProgramData\Anaconda3\lib\site-packages\werkzeug\test.py", line 923, in run_wsgi_app
    app_rv = app(e

[[67 'female' 2 ... 6 'radio/TV' 2]
 [22 'female' 2 ... 48 'radio/TV' 1]
 [49 'male' 1 ... 12 'education' 0]
 ...
 [38 'male' 2 ... 12 'radio/TV' 0]
 [23 'male' 2 ... 45 'radio/TV' 3]
 [27 'male' 2 ... 45 'car' 3]]
Loading the model...
The model has been loaded...doing predictions now...
predictions :  ['good' 'good' 'good' 'bad' 'good' 'bad' 'good' 'bad' 'good' 'good' 'good'
 'good' 'good' 'bad' 'bad' 'good' 'good' 'good' 'bad' 'good' 'good' 'good'
 'good' 'good' 'good' 'bad' 'bad' 'good' 'good' 'good' 'good' 'good'
 'good' 'good' 'good' 'good' 'bad' 'good' 'good' 'good' 'good' 'good'
 'good' 'bad' 'good' 'bad' 'good' 'good' 'bad' 'bad' 'bad' 'good' 'bad'
 'good' 'bad' 'bad' 'bad' 'good' 'bad' 'good' 'good' 'good' 'good' 'bad'
 'bad' 'good' 'good' 'bad' 'bad' 'good' 'good' 'good' 'good' 'good' 'good'
 'bad' 'good' 'good' 'good' 'good' 'good' 'good' 'good' 'bad' 'good'
 'good' 'good' 'bad' 'bad' 'good' 'bad' 'good' 'bad' 'bad' 'good' 'good'
 'good' 'good' 'bad' 'good' 'good' 'good' 'go