# Analysis of German Credit Data

### by Eko Saputro

In this project, we aim to minimize the risk of loss and maximize profit on behalf of the bank.

When a bank receives a loan application, based on the applicant’s profile, the bank has to make a decision regarding whether to go ahead with the loan approval or not. 
Two types of risk are associated with the bank’s decision:

* If the applicant is a good credit risk, i.e. is likely to repay the loan, then rejecting the application results in a loss of business for the bank.

* If the applicant is a bad credit risk, i.e. is not likely to repay the loan, then approving the application results in a financial loss for the bank.

To minimize its losses, the bank needs a decision rule for which applications to approve and which to reject. Loan managers consider an applicant’s demographic and socio-economic profile before deciding on the loan application.

This [dataset](http://freakonometrics.free.fr/german_credit.csv) contains 21 variables for 1000 loan applicants, including the classification of each applicant as creditworthy or non-creditworthy. You can find the appendix [here](https://github.com/ekosaputro09/Data-Science-Project/blob/master/German%20Credit/data/Appendix.docx?raw=true).



## Data Preprocessing

In [1]:
# Import libraries that are necessary

import numpy as np
import pandas as pd

# Load the German Credit data

data = pd.read_csv("data/german_credit.csv")
print("German Credit dataset has {} observations and {} variables.".format(*data.shape))


German Credit dataset has 1000 observations and 21 variables.


Let's take a look at the first few rows:

In [2]:
data.head()

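|  | Creditability | Account Balance | Duration of Credit (month) | Payment Status of Previous Credit | Purpose | Credit Amount | Value Savings/Stocks | Length of current employment | Instalment per cent | Sex & Marital Status | ... | Duration in Current address | Most valuable available asset | Age (years) | Concurrent Credits | Type of apartment | No of Credits at this Bank | Occupation | No of dependents | Telephone | Foreign Worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 18 | 4 | 2 | 1049 | 1 | 2 | 4 | 2 | ... | 4 | 2 | 21 | 3 | 1 | 1 | 3 | 1 | 1 | 1 |
| 1 | 1 | 1 | 9 | 4 | 0 | 2799 | 1 | 3 | 2 | 3 | ... | 2 | 1 | 36 | 3 | 1 | 2 | 3 | 2 | 1 | 1 |
| 2 | 1 | 2 | 12 | 2 | 9 | 841 | 2 | 4 | 2 | 2 | ... | 4 | 1 | 23 | 3 | 1 | 1 | 2 | 1 | 1 | 1 |
| 3 | 1 | 1 | 12 | 4 | 0 | 2122 | 1 | 3 | 3 | 3 | ... | 2 | 1 | 39 | 3 | 1 | 2 | 2 | 2 | 1 | 2 |
| 4 | 1 | 1 | 12 | 4 | 0 | 2171 | 1 | 3 | 4 | 3 | ... | 4 | 2 | 38 | 1 | 2 | 2 | 2 | 1 | 1 | 2 |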


## Input - Output Split

We will use the 'Creditability' column as our output variable (also called the target or dependent variable). For the input (independent) variables, we will keep only the categorical columns and drop the continuous numerical ones ('Duration of Credit (month)', 'Credit Amount' and 'Age (years)'), because almost all of the variables are categorical and we want to focus on categorical data when classifying whether an applicant is creditworthy.

In [3]:
output_data = data['Creditability']

input_data = data.drop(['Creditability', 'Duration of Credit (month)', 
                        'Credit Amount', 'Age (years)'], axis=1)
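One way to sanity-check which columns behave like continuous variables is to count the distinct values in each column. The sketch below does this on a small toy frame built from the five rows shown earlier; the `threshold` value is a hypothetical cut-off chosen for this tiny example and would need to be picked differently on the full dataset.

```python
import pandas as pd

# Toy frame: four columns from the first five rows printed by data.head().
toy = pd.DataFrame({
    "Account Balance": [1, 1, 2, 1, 1],
    "Purpose": [2, 0, 9, 0, 0],
    "Credit Amount": [1049, 2799, 841, 2122, 2171],
    "Age (years)": [21, 36, 23, 39, 38],
})

# Number of distinct values per column.
n_unique = toy.nunique()

# Hypothetical cut-off: columns with more distinct values than this are
# treated as continuous. The value 3 only makes sense for this 5-row toy frame.
threshold = 3
continuous = n_unique[n_unique > threshold].index.tolist()

print(n_unique)
print("Treated as continuous:", continuous)
```

Columns flagged this way are candidates for the `drop` call; the final choice still depends on the variable descriptions in the appendix.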

Let's take a look at the first few rows of our input data:

In [4]:
input_data.head()

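|  | Account Balance | Payment Status of Previous Credit | Purpose | Value Savings/Stocks | Length of current employment | Instalment per cent | Sex & Marital Status | Guarantors | Duration in Current address | Most valuable available asset | Concurrent Credits | Type of apartment | No of Credits at this Bank | Occupation | No of dependents | Telephone | Foreign Worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4 | 2 | 1 | 2 | 4 | 2 | 1 | 4 | 2 | 3 | 1 | 1 | 3 | 1 | 1 | 1 |
| 1 | 1 | 4 | 0 | 1 | 3 | 2 | 3 | 1 | 2 | 1 | 3 | 1 | 2 | 3 | 2 | 1 | 1 |
| 2 | 2 | 2 | 9 | 2 | 4 | 2 | 2 | 1 | 4 | 1 | 3 | 1 | 1 | 2 | 1 | 1 | 1 |
| 3 | 1 | 4 | 0 | 1 | 3 | 3 | 3 | 1 | 2 | 1 | 3 | 1 | 2 | 2 | 2 | 1 | 2 |
| 4 | 1 | 4 | 0 | 1 | 3 | 4 | 3 | 1 | 4 | 2 | 1 | 2 | 2 | 2 | 1 | 1 | 2 |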


## Train - Test Split

Next, we split the data into training and test sets in a 75:25 proportion, where x denotes the input data and y the output data.

In [5]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(input_data, output_data, 
                                                    test_size = 0.25, random_state = 123)
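To illustrate what a 75:25 split produces, the sketch below applies `train_test_split` to synthetic stand-in data with 1000 rows, 70% of them labelled 1 (an assumed proportion for illustration). It also passes `stratify=y`, an optional variation not used in this project, which keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 70% labelled 1.
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 700 + [0] * 300)

# Same proportions and seed as in the project; stratify=y is an
# optional extra that preserves the class balance in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y)

print("train rows:", X_tr.shape[0], "test rows:", X_te.shape[0])
print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test:", y_te.mean())
```

Without `stratify`, the share of each class in the test set can drift away from the overall share simply through random sampling.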

## Model Building

We will apply the following models to our data:

* KNN
* Logistic Regression
* Decision Tree
* Random Forest
* SVM

First, we have to import them:

In [40]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

We use RandomizedSearchCV to search for the best hyperparameters of each model, and compare the results using confusion matrices.

In [41]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import confusion_matrix
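As a minimal sketch of what a fitted search object exposes, the example below runs `RandomizedSearchCV` on synthetic data from `make_classification` (an assumption for illustration, not the German Credit data). It distinguishes the mean cross-validated accuracy of the best candidate (`best_score_`) from the accuracy obtained by scoring the refit model on the same data it was trained on.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only to make the sketch self-contained.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={"n_neighbors": [1, 2, 3, 4, 5], "p": [1, 2, 3]},
    n_iter=5, cv=3, random_state=123)
search.fit(X, y)

# Mean accuracy over the 3 validation folds for the best candidate.
cv_accuracy = search.best_score_
# Accuracy of the refit best model on the data it was trained on.
train_accuracy = search.score(X, y)
# The best model, already refit on all of X, y (refit=True is the default).
best_model = search.best_estimator_
# Number of hyperparameter combinations that were actually evaluated.
n_tried = len(search.cv_results_["params"])

print("cross-validated accuracy:", cv_accuracy)
print("training accuracy:", train_accuracy)
print("best parameters:", search.best_params_)
print("combinations tried:", n_tried)
```

Because `best_estimator_` is already refit on the full training data, it can be used directly for prediction without constructing a new model from `best_params_`.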

Then we fit our models.

### KNN

In [7]:
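def knn_fit(x_train, y_train):
    """Randomized hyperparameter search for KNN with 3-fold cross-validation."""
    knn = KNeighborsClassifier()

    # Candidate values: number of neighbours and Minkowski power p
    # (p=1 is Manhattan distance, p=2 is Euclidean distance).
    hyperparam = {'n_neighbors':[1,2,3,4,5], 
                  'p':[1,2,3]}

    # Sample 5 of the 15 possible combinations.
    random_knn = RandomizedSearchCV(knn, param_distributions=hyperparam,
                                   cv=3, n_iter=5, n_jobs=2, random_state=123)

    random_knn.fit(x_train, y_train)

    # score() evaluates the refit best model on the training data; the mean
    # cross-validated accuracy is available as random_knn.best_score_.
    print("Best Accuracy", random_knn.score(x_train, y_train))
    print("Best Param", random_knn.best_params_)

    return random_knn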

In [8]:
best_knn = knn_fit(x_train, y_train)

Best Accuracy 0.832
Best Param {'p': 2, 'n_neighbors': 3}


In [9]:
knn = KNeighborsClassifier(n_neighbors=best_knn.best_params_.get('n_neighbors'),
                          p=best_knn.best_params_.get('p'))
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [10]:
knn_train_score = knn.score(x_train, y_train)
knn_train_score

0.83199999999999996

In [34]:
knn_matrix = pd.DataFrame(confusion_matrix(y_test, knn.predict(x_test)))
knn_matrix

|  | 0 | 1 |
|---|---|---|
| 0 | 41 | 44 |
| 1 | 30 | 135 |


In [11]:
knn_test_score = knn.score(x_test, y_test)
knn_test_score

0.70399999999999996
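The test accuracy can be recovered from the KNN confusion matrix, which also yields precision and recall. The sketch below hard-codes the KNN matrix and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes; treating class 1 as "creditworthy" is an assumption about the coding of 'Creditability'.

```python
import numpy as np

# KNN confusion matrix on the test set (rows: actual, columns: predicted).
cm = np.array([[41, 44],
               [30, 135]])

# For binary labels 0/1, ravel() returns the cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of correct predictions
precision = tp / (tp + fp)        # predicted class 1 that really is class 1
recall = tp / (tp + fn)           # actual class 1 that was predicted as class 1

print("accuracy :", round(accuracy, 3))
print("precision:", round(precision, 3))
print("recall   :", round(recall, 3))
```

The same three lines apply unchanged to the confusion matrices of the other models.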

### Logistic Regression

In [12]:
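def logreg_fit(x_train, y_train):
    """Randomized hyperparameter search for logistic regression with 5-fold cross-validation."""
    # The liblinear solver supports both the 'l1' and 'l2' penalties searched below.
    logreg = LogisticRegression(solver='liblinear')

    # Candidate values: inverse regularization strength C and penalty type.
    hyperparam = {'C': [1000, 333.33, 100, 33.33, 10, 3.33, 1, 0.33, 0.1, 0.033, 0.01, 0.0033, 
                        0.001, 0.00033, 0.0001],
                 'penalty': ['l1', 'l2']}

    # Sample 5 of the 30 possible combinations.
    random_logreg = RandomizedSearchCV(logreg, param_distributions=hyperparam, cv=5,
                                      n_iter=5, n_jobs=2, random_state=123)

    random_logreg.fit(x_train, y_train)

    print("Best Accuracy", random_logreg.score(x_train, y_train))
    print("Best Param", random_logreg.best_params_)

    return random_logreg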

In [13]:
best_logreg = logreg_fit(x_train, y_train)

Best Accuracy 0.741333333333
Best Param {'penalty': 'l2', 'C': 100}


In [14]:
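# solver='liblinear' matches the solver shown in the output below and supports both penalties
logreg = LogisticRegression(C=best_logreg.best_params_.get('C'),
                           penalty=best_logreg.best_params_.get('penalty'),
                           solver='liblinear')
logreg.fit(x_train, y_train)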

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [15]:
logreg_train_score = logreg.score(x_train, y_train)
logreg_train_score

0.74133333333333329

In [36]:
logreg_matrix = pd.DataFrame(confusion_matrix(y_test, logreg.predict(x_test)))
logreg_matrix

|  | 0 | 1 |
|---|---|---|
| 0 | 49 | 36 |
| 1 | 14 | 151 |


In [16]:
logreg_test_score = logreg.score(x_test, y_test)
logreg_test_score

0.80000000000000004

### Decision Tree

In [17]:
def dtree_fit(x_train, y_train):
    """Randomized hyperparameter search for a decision tree with 5-fold cross-validation."""
    dtree = DecisionTreeClassifier()

    # Candidate values: minimum samples per leaf and per split.
    hyperparam = {'min_samples_leaf': [ 2, 5, 8, 10, 15 ],
                  'min_samples_split': [ 2, 5, 8, 10]}

    # Sample 5 of the 20 possible combinations.
    random_dtree = RandomizedSearchCV(dtree, param_distributions=hyperparam,
                                          cv=5, n_iter=5, n_jobs=2, random_state=123)

    random_dtree.fit(x_train, y_train)

    print("Best Accuracy", random_dtree.score(x_train, y_train))
    print("Best Param", random_dtree.best_params_)

    return random_dtree

In [18]:
best_dtree = dtree_fit(x_train, y_train)

Best Accuracy 0.794666666667
Best Param {'min_samples_split': 8, 'min_samples_leaf': 10}


In [19]:
dtree = DecisionTreeClassifier(min_samples_leaf=best_dtree.best_params_.get('min_samples_leaf'),
                              min_samples_split=best_dtree.best_params_.get('min_samples_split'))
dtree.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=10,
            min_samples_split=8, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [20]:
dtree_train_score = dtree.score(x_train, y_train)
dtree_train_score

0.79466666666666663

In [37]:
dtree_matrix = pd.DataFrame(confusion_matrix(y_test, dtree.predict(x_test)))
dtree_matrix

|  | 0 | 1 |
|---|---|---|
| 0 | 52 | 33 |
| 1 | 40 | 125 |


In [21]:
dtree_test_score = dtree.score(x_test, y_test)
dtree_test_score

0.70799999999999996

### Random Forest

In [22]:
def randomForest_fit(x_train, y_train):
    """Randomized hyperparameter search for a random forest with 5-fold cross-validation."""
    randomForest = RandomForestClassifier()

    # Candidate values: number of trees and minimum samples per leaf / per split.
    hyperparam = {'n_estimators': [100, 300, 500, 1000],
                  'min_samples_leaf': [ 2, 5, 8, 10, 15 ],
                  'min_samples_split': [ 2, 5, 8, 10]}

    # Sample 5 of the 80 possible combinations.
    random_randomForest = RandomizedSearchCV(randomForest, param_distributions=hyperparam,
                                          cv=5, n_iter=5, n_jobs=2, random_state=123)

    random_randomForest.fit(x_train, y_train)

    print("Best Accuracy", random_randomForest.score(x_train, y_train))
    print("Best Param", random_randomForest.best_params_)

    return random_randomForest

In [23]:
best_randomForest = randomForest_fit(x_train, y_train)

Best Accuracy 0.934666666667
Best Param {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 2}


In [24]:
randomForest = RandomForestClassifier(n_estimators=best_randomForest.best_params_.get('n_estimators'),
                                     min_samples_leaf=best_randomForest.best_params_.get('min_samples_leaf'),
                                     min_samples_split=best_randomForest.best_params_.get('min_samples_split'))
randomForest.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=5, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [25]:
randomForest_train_score = randomForest.score(x_train, y_train)
randomForest_train_score

0.92666666666666664
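This training accuracy (0.9267) differs slightly from the value printed by the search (0.9347). One plausible explanation is that the forest is refit without a fixed `random_state`, so each fit draws different bootstrap samples. The sketch below, on synthetic data from `make_classification` (an assumption for illustration), shows that fixing the seed makes two fits give identical predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to make the sketch self-contained.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def fit_forest(seed):
    """Fit a small forest with the given seed (hypothetical helper for this sketch)."""
    model = RandomForestClassifier(n_estimators=50, min_samples_leaf=2,
                                   random_state=seed)
    return model.fit(X, y)


# Two independent fits with the same seed give the same probabilities.
proba_a = fit_forest(123).predict_proba(X)
proba_b = fit_forest(123).predict_proba(X)
same = bool((proba_a == proba_b).all())

print("identical predictions with the same seed:", same)
```

Passing `random_state` to `RandomForestClassifier` in the project code would make the reported scores reproducible from run to run.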

In [38]:
randomForest_matrix = pd.DataFrame(confusion_matrix(y_test, randomForest.predict(x_test)))
randomForest_matrix

|  | 0 | 1 |
|---|---|---|
| 0 | 39 | 46 |
| 1 | 10 | 155 |


In [26]:
randomForest_test_score = randomForest.score(x_test, y_test)
randomForest_test_score

0.77600000000000002

### SVM

In [27]:
def SVM_fit(x_train, y_train):
    """Randomized hyperparameter search for an SVM with 5-fold cross-validation."""
    SVM = SVC()

    # Candidate values: kernel type and regularization parameter C.
    hyperparam = {'kernel': ['rbf', 'linear'],
                  'C': [1000, 333.33, 100, 33.33, 10, 3.33, 1, 0.33, 0.1, 0.033, 0.01, 0.0033, 
                        0.001, 0.00033, 0.0001]}

    # Sample 5 of the 30 possible combinations.
    randomSVM = RandomizedSearchCV(SVM, param_distributions=hyperparam, cv=5,
                                  n_iter=5, n_jobs=2, random_state=123)

    randomSVM.fit(x_train, y_train)

    print("Best Accuracy", randomSVM.score(x_train, y_train))
    print("Best Param", randomSVM.best_params_)

    return randomSVM

In [28]:
best_SVM = SVM_fit(x_train, y_train)

Best Accuracy 0.713333333333
Best Param {'kernel': 'linear', 'C': 0.0001}


In [29]:
SVM = SVC(kernel=best_SVM.best_params_.get('kernel'),
         C=best_SVM.best_params_.get('C'))
SVM.fit(x_train, y_train)

SVC(C=0.0001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [30]:
SVM_train_score = SVM.score(x_train, y_train)
SVM_train_score

0.71333333333333337

In [39]:
SVM_matrix = pd.DataFrame(confusion_matrix(y_test, SVM.predict(x_test)))
SVM_matrix

|  | 0 | 1 |
|---|---|---|
| 0 | 0 | 85 |
| 1 | 0 | 165 |


In [31]:
SVM_test_score = SVM.score(x_test, y_test)
SVM_test_score

0.66000000000000003
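The SVM confusion matrix has an empty first column, meaning every test applicant is predicted as class 1. A model like this behaves as a majority-class baseline. The sketch below rebuilds labels with the same class counts as the test set (85 zeros and 165 ones) and scores scikit-learn's `DummyClassifier` on them; fitting and scoring on the same labels is done purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class counts as the test set in the SVM confusion matrix.
y_like_test = np.array([0] * 85 + [1] * 165)
# Features are irrelevant for this baseline, so a column of zeros is enough.
X_dummy = np.zeros((len(y_like_test), 1))

# Always predicts the most frequent class seen during fit (here: class 1).
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_like_test)
baseline_accuracy = baseline.score(X_dummy, y_like_test)

print("majority-class accuracy:", baseline_accuracy)
```

The baseline accuracy equals 165/250, the same value as the SVM test score, so the selected SVM adds nothing over always approving the loan.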

## Summary

The table below summarizes the training and test accuracy of each model:

In [32]:
summary = pd.DataFrame(data=[[knn_train_score, knn_test_score],
                             [logreg_train_score, logreg_test_score],
                             [dtree_train_score, dtree_test_score],
                             [randomForest_train_score, randomForest_test_score],
                             [SVM_train_score, SVM_test_score]],
                      index=["knn", "logreg", "dtree","randomForest", "SVM"],
                      columns=["train", "test"])

summary

|  | train | test |
|---|---|---|
| knn | 0.832 | 0.704 |
| logreg | 0.741333 | 0.8 |
| dtree | 0.794667 | 0.708 |
| randomForest | 0.926667 | 0.776 |
| SVM | 0.713333 | 0.66 |


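The random forest model performs very well on the training data (0.927) but noticeably worse on the test data (0.776), which suggests overfitting. Logistic regression, on the other hand, has a modest training accuracy (0.741) but the best test accuracy (0.800). The SVM confusion matrix shows that it predicts every test applicant as class 1, so its test accuracy of 0.66 simply equals the share of class 1 in the test set.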
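The difference between training and test accuracy is a quick indicator of overfitting. The sketch below recomputes it from the rounded scores in the summary table; the `gap` column is a name introduced here for illustration.

```python
import pandas as pd

# Accuracy scores copied from the summary table above.
scores = pd.DataFrame(
    {"train": [0.832, 0.741333, 0.794667, 0.926667, 0.713333],
     "test": [0.704, 0.8, 0.708, 0.776, 0.66]},
    index=["knn", "logreg", "dtree", "randomForest", "SVM"])

# Positive gap: the model scores higher on data it has already seen.
scores["gap"] = scores["train"] - scores["test"]
ranked = scores.sort_values("gap", ascending=False)

print(ranked)
```

The random forest has the largest gap, while logistic regression is the only model with a negative gap, i.e. a higher score on the test set than on the training set.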

## Cost - Profit Analysis

Finally, these numbers have to be translated into profit for the bank. Let us assume that a correct approval yields a 35% profit at the end of 5 years; a correct approval means that the bank predicts that an applicant is creditworthy and the applicant indeed turns out to be creditworthy. In the opposite case, where the bank predicts a good application that turns out to be a bad credit, the loss is 100%. If the bank predicts that an applicant is non-creditworthy, the loan is not extended and the bank incurs no loss (opportunity loss is not considered here). The cost matrix is therefore as follows:

<img src='images/cost_profit_table.jpg'>

Out of 1000 applicants, 70% are creditworthy. A loan manager who approves every application without any model would incur [0.7 * 0.35 + 0.3 * (-1)] = -0.055, i.e. a loss of 0.055 units per applicant. If the average loan amount is approximately 3200 DM, the loss per applicant is 176 DM and the total loss over the 1000 applicants is 176000 DM.
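The no-model baseline can be worked through step by step. The sketch below uses only the figures stated in the text (70% creditworthy, 35% profit, 100% loss, an average loan of 3200 DM, 1000 applicants); the variable names are introduced here for illustration.

```python
# Assumptions taken from the text above.
p_good = 0.70          # share of creditworthy applicants
gain_good = 0.35       # profit per unit lent to a creditworthy applicant
loss_bad = -1.00       # loss per unit lent to a non-creditworthy applicant
avg_loan_dm = 3200     # approximate average loan amount in DM
n_applicants = 1000

# Expected result per unit lent when every application is approved.
expected_unit = p_good * gain_good + (1 - p_good) * loss_bad

# Scale to DM per applicant and to the whole pool of applicants.
per_applicant_dm = expected_unit * avg_loan_dm
total_dm = per_applicant_dm * n_applicants

print("per unit     :", round(expected_unit, 3))
print("per applicant:", round(per_applicant_dm, 2), "DM")
print("total        :", round(total_dm, 2), "DM")
```

A negative result means a loss: 0.055 per unit, 176 DM per applicant, and 176000 DM over 1000 applicants.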

<img src='images/result2.jpg'>

The Logistic Regression and Random Forest models show a per-unit profit, while the other methods do not do well. However, we still need a better model if we want to maximize the bank's profit.
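The per-applicant profit of a model can be computed by weighting each cell of its confusion matrix with the corresponding gain or loss. The sketch below applies the cost assumptions from the text to the logistic regression confusion matrix; the `gain` array is a reconstruction of those assumptions, not a copy of the cost table image, and the result is not guaranteed to match the figures in the result image.

```python
import numpy as np

# Gain per unit lent (rows: actual class, columns: predicted class).
# Predicted 0 means the loan is rejected, so nothing is gained or lost.
gain = np.array([[0.0, -1.00],   # actual bad:  rejected, approved (100% loss)
                 [0.0,  0.35]])  # actual good: rejected, approved (35% profit)

# Logistic regression confusion matrix on the 250 test applicants.
cm = np.array([[49, 36],
               [14, 151]])

# Weight each cell by its gain and average over all applicants.
total_units = (gain * cm).sum()
profit_per_applicant = total_units / cm.sum()

print("total units      :", round(total_units, 2))
print("per-applicant    :", round(profit_per_applicant, 4))
```

A positive value means the model is expected to earn money per unit lent under these assumptions, in contrast to the approve-everyone baseline of -0.055.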

## Source

This project is based on the following:

* https://onlinecourses.science.psu.edu/stat857/node/215
* http://rstudio-pubs-static.s3.amazonaws.com/171872_ab8dd184af0e4b2cbe3469b2c75b0093.html
* http://freakonometrics.hypotheses.org/48285