# Analysis of German Credit Data

In this project, we want to minimize the risk of loss and maximize profit on behalf of the bank.

When a bank receives a loan application, it has to decide, based on the applicant’s profile, whether to approve the loan or not.
Two types of risk are associated with the bank’s decision:

* If the applicant is a good credit risk, i.e. is worthy of credit and likely to repay the loan, then not approving the loan results in a loss of business for the bank.

* If the applicant is a bad credit risk, i.e. is not worthy of credit and not likely to repay the loan, then approving the loan results in a financial loss for the bank.

To minimize loss from the bank’s perspective, the bank needs a decision rule for which applications to approve and which to reject. Loan managers consider an applicant’s demographic and socio-economic profile before deciding on a loan application.

This [dataset](http://freakonometrics.free.fr/german_credit.csv) contains 21 variables for 1000 loan applicants, including the classification of whether an applicant is considered creditworthy or not creditworthy. You can see the appendix [here](https://github.com/ekosaputro09/Data-Science-Project/blob/master/German%20Credit/Appendix.docx?raw=true).



## Data Preprocessing

In [1]:
# Import libraries that are necessary

import numpy as np
import pandas as pd

# Load the German Credit data

data = pd.read_csv("german_credit.csv")
print("German Credit dataset has {} observations and {} variables.".format(*data.shape))


German Credit dataset has 1000 observations and 21 variables.


Let's take a look at the first few rows

In [2]:
data.head()

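| | Creditability | Account Balance | Duration of Credit (month) | Payment Status of Previous Credit | Purpose | Credit Amount | Value Savings/Stocks | Length of current employment | Instalment per cent | Sex & Marital Status | ... | Duration in Current address | Most valuable available asset | Age (years) | Concurrent Credits | Type of apartment | No of Credits at this Bank | Occupation | No of dependents | Telephone | Foreign Worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 18 | 4 | 2 | 1049 | 1 | 2 | 4 | 2 | ... | 4 | 2 | 21 | 3 | 1 | 1 | 3 | 1 | 1 | 1 |
| 1 | 1 | 1 | 9 | 4 | 0 | 2799 | 1 | 3 | 2 | 3 | ... | 2 | 1 | 36 | 3 | 1 | 2 | 3 | 2 | 1 | 1 |
| 2 | 1 | 2 | 12 | 2 | 9 | 841 | 2 | 4 | 2 | 2 | ... | 4 | 1 | 23 | 3 | 1 | 1 | 2 | 1 | 1 | 1 |
| 3 | 1 | 1 | 12 | 4 | 0 | 2122 | 1 | 3 | 3 | 3 | ... | 2 | 1 | 39 | 3 | 1 | 2 | 2 | 2 | 1 | 2 |
| 4 | 1 | 1 | 12 | 4 | 0 | 2171 | 1 | 3 | 4 | 3 | ... | 4 | 2 | 38 | 1 | 2 | 2 | 2 | 1 | 1 | 2 |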


## Input - Output Split

We will use the 'Creditability' column as our output variable (also called the target or dependent variable). For the input (independent) variables, we will use only the categorical columns and remove the continuous numerical ones ('Duration of Credit (month)', 'Credit Amount' and 'Age (years)'). We do this because almost all of the variables are categorical, and we want to focus on categorical data to classify whether an applicant is creditworthy or not.

In [3]:
output_data = data['Creditability']

input_data = data.drop(['Creditability', 'Duration of Credit (month)', 'Credit Amount', 'Age (years)'], axis=1)

Let's take a look at the first few rows of our input data

In [4]:
input_data.head()

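| | Account Balance | Payment Status of Previous Credit | Purpose | Value Savings/Stocks | Length of current employment | Instalment per cent | Sex & Marital Status | Guarantors | Duration in Current address | Most valuable available asset | Concurrent Credits | Type of apartment | No of Credits at this Bank | Occupation | No of dependents | Telephone | Foreign Worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4 | 2 | 1 | 2 | 4 | 2 | 1 | 4 | 2 | 3 | 1 | 1 | 3 | 1 | 1 | 1 |
| 1 | 1 | 4 | 0 | 1 | 3 | 2 | 3 | 1 | 2 | 1 | 3 | 1 | 2 | 3 | 2 | 1 | 1 |
| 2 | 2 | 2 | 9 | 2 | 4 | 2 | 2 | 1 | 4 | 1 | 3 | 1 | 1 | 2 | 1 | 1 | 1 |
| 3 | 1 | 4 | 0 | 1 | 3 | 3 | 3 | 1 | 2 | 1 | 3 | 1 | 2 | 2 | 2 | 1 | 2 |
| 4 | 1 | 4 | 0 | 1 | 3 | 4 | 3 | 1 | 4 | 2 | 1 | 2 | 2 | 2 | 1 | 1 | 2 |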


## Train - Test Split

Next, we will split the data into training and test sets in a 75:25 proportion, where x is the input data and y is the output data.

In [5]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(input_data, output_data, test_size = 0.25, random_state = 123)

## Model Building

We will apply these models to our data:

* KNN
* Logistic Regression
* Random Forest

First, we have to import them

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

Then we fit our models to the training data.

### KNN

In [7]:
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [8]:
knn_train_predicted = pd.DataFrame(knn.predict(x_train))

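We use a confusion matrix to compare the actual labels with the labels predicted by the model. In scikit-learn's output, rows are the actual classes and columns are the predicted classes.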

In [9]:
from sklearn.metrics import confusion_matrix

In [10]:
knn_train_matrix = pd.DataFrame(confusion_matrix(y_train, knn_train_predicted))
knn_train_matrix

| | 0 | 1 |
|---|---|---|
| 0 | 98 | 117 |
| 1 | 33 | 502 |


In [11]:
knn_train_score = knn.score(x_train, y_train)
knn_train_score

0.80000000000000004
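The accuracy returned by `score` can be cross-checked against the confusion matrix above, since correct predictions lie on its diagonal. The following is a minimal sketch in plain Python, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes); the helper name `accuracy_from_matrix` is hypothetical and not part of the original notebook.

```python
# KNN confusion matrix on the training data, copied from the output above.
# Assumed layout: rows = actual class, columns = predicted class.
knn_train_counts = [[98, 117],
                    [33, 502]]


def accuracy_from_matrix(matrix):
    """Return the share of predictions that lie on the diagonal (correct ones)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# 98 + 502 = 600 correct predictions out of 750 training rows.
knn_train_accuracy = accuracy_from_matrix(knn_train_counts)
print(knn_train_accuracy)
```

The same helper can be applied to any of the other confusion matrices in this document to confirm the reported scores.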

### Logistic Regression

In [12]:
logreg = LogisticRegression(random_state=123)
logreg.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=123, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [13]:
logreg_train_predicted = pd.DataFrame(logreg.predict(x_train))

In [14]:
logreg_train_matrix = pd.DataFrame(confusion_matrix(y_train, logreg_train_predicted))
logreg_train_matrix

| | 0 | 1 |
|---|---|---|
| 0 | 64 | 151 |
| 1 | 45 | 490 |


In [15]:
logreg_train_score = logreg.score(x_train, y_train)
logreg_train_score

0.73866666666666669

### Random Forest

In [16]:
rf = RandomForestClassifier(random_state=123)
rf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=123,
            verbose=0, warm_start=False)

In [17]:
rf_train_predicted = pd.DataFrame(rf.predict(x_train))

In [18]:
rf_train_matrix = pd.DataFrame(confusion_matrix(y_train, rf_train_predicted))
rf_train_matrix

| | 0 | 1 |
|---|---|---|
| 0 | 210 | 5 |
| 1 | 1 | 534 |


In [19]:
rf_train_score = rf.score(x_train, y_train)
rf_train_score

0.99199999999999999

We save our models so that they can be used on our test data.

In [20]:
# joblib is a standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
import joblib
 
joblib.dump(knn, 'knn.pkl')
joblib.dump(logreg, 'logreg.pkl')
joblib.dump(rf, 'rf.pkl')


['rf.pkl']

## Model Evaluation

We will evaluate the models with our test data. First, we have to load the models that we saved earlier.

In [21]:
knn = joblib.load('knn.pkl')
logreg = joblib.load('logreg.pkl')
rf = joblib.load('rf.pkl')

### KNN

In [22]:
knn_test_predicted = pd.DataFrame(knn.predict(x_test))

In [23]:
knn_test_matrix = pd.DataFrame(confusion_matrix(y_test, knn_test_predicted))
knn_test_matrix

| | 0 | 1 |
|---|---|---|
| 0 | 40 | 45 |
| 1 | 21 | 144 |


In [24]:
knn_test_score = knn.score(x_test, y_test)
knn_test_score

0.73599999999999999

This gives the following result:

<img src='knn_summary.jpg'>

### Logistic Regression

In [25]:
logreg_test_predicted = pd.DataFrame(logreg.predict(x_test))

In [26]:
logreg_test_matrix = pd.DataFrame(confusion_matrix(y_test, logreg_test_predicted))
logreg_test_matrix

| | 0 | 1 |
|---|---|---|
| 0 | 47 | 38 |
| 1 | 16 | 149 |


In [27]:
logreg_test_score = logreg.score(x_test, y_test)
logreg_test_score

0.78400000000000003

This gives the following result:

<img src='logreg_summary.jpg'>

### Random Forest

In [28]:
rf_test_predicted = pd.DataFrame(rf.predict(x_test))

In [29]:
rf_test_matrix = pd.DataFrame(confusion_matrix(y_test, rf_test_predicted))
rf_test_matrix

| | 0 | 1 |
|---|---|---|
| 0 | 49 | 36 |
| 1 | 22 | 143 |


In [30]:
rf_test_score = rf.score(x_test, y_test)
rf_test_score

0.76800000000000002

This gives the following result:

<img src='rf_summary.jpg'>

## Summary

This is a summary of the accuracy scores we have so far:

In [31]:
summary = pd.DataFrame(data=[[knn_train_score, knn_test_score], 
                             [logreg_train_score, logreg_test_score],
                             [rf_train_score, rf_test_score]],
                      index=["knn", "logreg", "rf"],
                      columns=["train", "test"])

summary

| | train | test |
|---|---|---|
| knn | 0.8 | 0.736 |
| logreg | 0.738667 | 0.784 |
| rf | 0.992 | 0.768 |


The random forest model performs very well on the training data (0.992) but noticeably worse on the test data (0.768), which suggests overfitting. On the other hand, the logistic regression model, which has the lowest training accuracy (0.739), has the best performance on the test data (0.784).
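The comparison above can be made explicit by computing the gap between training and test accuracy for each model. This is only a sketch in plain Python, using the rounded scores from the summary table; the variable names are illustrative and not part of the original notebook.

```python
# Accuracy scores copied from the summary table (rounded as displayed there).
scores = {
    "knn": {"train": 0.8, "test": 0.736},
    "logreg": {"train": 0.738667, "test": 0.784},
    "rf": {"train": 0.992, "test": 0.768},
}

# A large positive gap means the model fits the training data
# much better than it fits unseen data.
gaps = {name: s["train"] - s["test"] for name, s in scores.items()}
best_on_test = max(scores, key=lambda name: scores[name]["test"])

for name, gap in gaps.items():
    print(f"{name}: train - test = {gap:+.3f}")
print("best test accuracy:", best_on_test)
```

A negative gap, as for logistic regression here, simply means that the model happened to score higher on the test split than on the training split.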

## Cost - Profit Analysis

Finally, all those numbers have to be translated into profit for the bank. Let us assume that a correct decision by the bank results in a 35% profit at the end of 5 years. A correct decision here means that the bank predicts an application to be good (creditworthy) and it actually turns out to be creditworthy. When the opposite is true, i.e. the bank predicts the application to be good but it turns out to be bad credit, the loss is 100%. If the bank predicts an application to be non-creditworthy, the loan is not extended to that applicant and the bank does not incur any loss (opportunity loss is not considered here). The cost matrix, therefore, is as follows:

<img src='cost_profit_table.jpg'>
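As a rough illustration, the cost matrix described above can be applied to the test-set confusion matrices from the Model Evaluation section. The sketch below is plain Python and rests on two assumptions: rows are actual classes and columns are predicted classes, and class 0 means not creditworthy while class 1 means creditworthy. The helper name `profit_per_applicant` is hypothetical. The values it prints are derived only from the confusion matrices in this document under those assumptions.

```python
# Test-set confusion matrices copied from the Model Evaluation section.
# Assumed layout: rows = actual class, columns = predicted class,
# with 0 = not creditworthy and 1 = creditworthy.
test_matrices = {
    "knn": [[40, 45], [21, 144]],
    "logreg": [[47, 38], [16, 149]],
    "rf": [[49, 36], [22, 143]],
}

PROFIT_GOOD_LOAN = 0.35   # loan approved and repaid: 35% profit
LOSS_BAD_LOAN = -1.00     # loan approved but not repaid: 100% loss
# Rejected applications contribute 0 (opportunity loss is ignored).


def profit_per_applicant(matrix):
    """Average profit per applicant, in units of the loan amount."""
    approved_bad = matrix[0][1]    # actual bad, predicted good
    approved_good = matrix[1][1]   # actual good, predicted good
    total = sum(sum(row) for row in matrix)
    return (approved_good * PROFIT_GOOD_LOAN + approved_bad * LOSS_BAD_LOAN) / total


unit_profit = {name: profit_per_applicant(m) for name, m in test_matrices.items()}
for name, value in unit_profit.items():
    print(f"{name}: {value:+.4f} per unit of loan")
```

Only the second column of each matrix matters here, because applications predicted as non-creditworthy are rejected and carry neither profit nor loss.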

Out of 1000 applicants, 70% are creditworthy. A loan manager without any model, who approves every application, would incur [0.7 × 0.35 + 0.3 × (-1)] = -0.055, i.e. a loss of 0.055 per unit lent. If the average loan amount is approximately 3200 DM, then the loss per applicant is 176 DM and the total loss over 1000 applicants is 176,000 DM.
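The baseline arithmetic can be reproduced step by step. This is a minimal sketch in plain Python using only the figures stated in the text; the variable names are illustrative.

```python
share_good = 0.70        # share of creditworthy applicants
share_bad = 0.30         # share of non-creditworthy applicants
profit_good = 0.35       # profit on a repaid loan, per unit lent
loss_bad = -1.00         # loss on an unpaid loan, per unit lent
average_loan_dm = 3200   # approximate average loan amount in DM
n_applicants = 1000

# Expected result per unit lent when every application is approved.
unit_result = share_good * profit_good + share_bad * loss_bad

# Scale to the average loan, then to all applicants.
per_applicant_dm = unit_result * average_loan_dm
total_dm = per_applicant_dm * n_applicants

print(f"per unit: {unit_result:.3f}")
print(f"per applicant: {per_applicant_dm:.0f} DM")
print(f"total: {total_dm:.0f} DM")
```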

<img src='result1.jpg'>

The logistic regression model shows a per-unit profit; the other methods do not perform as well.