## Abstract

Finance companies need to know the risk of extending credit to their clients, and credit scores are essential for deciding whether to grant credit to an applicant. In this project, we experiment with client features to determine which ones matter most. SVM, logistic regression and decision trees are applied and compared. An interface that calculates a credit score for given features is provided.

# Introduction
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which estimate the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.

# Data Dictionary
| Variable Name                        | Description                                                                                                                                              | Type       |
|--------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|------------|
| SeriousDlqin2yrs                     | Person experienced 90 days past due delinquency or worse                                                                                                 | Y/N        |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage |
| age                                  | Age of borrower in years                                                                                                                                 | integer    |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years.                                                                  | integer    |
| DebtRatio                            | Monthly debt payments, alimony, living costs divided by monthly gross income                                                                             | percentage |
| MonthlyIncome                        | Monthly income                                                                                                                                           | real       |
| NumberOfOpenCreditLinesAndLoans      | Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards)                                                     | integer    |
| NumberOfTimes90DaysLate              | Number of times borrower has been 90 days or more past due.                                                                                              | integer    |
| NumberRealEstateLoansOrLines         | Number of mortgage and real estate loans including home equity lines of credit                                                                           | integer    |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years.                                                                  | integer    |
| NumberOfDependents                   | Number of dependents in family excluding themselves (spouse, children etc.)                                                                              | integer    |


## Related Work 

A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines
http://www.sciencedirect.com/science/article/pii/S0957417404001782

Benchmarking state-of-the-art classification algorithms for credit scoring
https://link.springer.com/article/10.1057/palgrave.jors.2601545

Using neural network ensembles for bankruptcy prediction and credit scoring
http://www.sciencedirect.com/science/article/pii/S0957417407001558

A comparative assessment of ensemble learning for credit scoring
http://www.sciencedirect.com/science/article/pii/S095741741000552X

Comprehensible credit scoring models using rule extraction from support vector machines
http://www.sciencedirect.com/science/article/pii/S0377221706011878

Neural network credit scoring models
http://www.sciencedirect.com/science/article/pii/S0305054899001495

Statistical Classification Methods in Consumer Credit Scoring: a Review
http://onlinelibrary.wiley.com/doi/10.1111/j.1467-985X.1997.00078.x/full

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from pandas import DataFrame
%matplotlib inline

### Train Data Loaded

In [None]:
###
### Load Data Set
###
data = pd.read_csv('cs-training.csv',sep=';').drop('Unnamed: 0', axis = 1)

In [None]:
# A '-' in a column name prevents attribute-style access (e.g. data.col), so strip it
data.columns = [col.replace('-', '') for col in data.columns]

In [None]:
# Fill missing values in each column with that column's median
data = data.apply(lambda x: x.fillna(np.nanmedian(x)))
# Alternative: drop rows with missing column data
#data = data.dropna()

###
### Convert Data Into List Of Dict Records
###
data = data.to_dict(orient='records')

###
### Vectorize Records Into a Numeric DataFrame
###

vec = DictVectorizer()

df_data = vec.fit_transform(data).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vec.get_feature_names_out()
df_data = DataFrame(
    df_data,
    columns=feature_names)


## Missing Values
Some values in the dataset are missing, and others have implausibly large magnitudes. We therefore handle both before modelling.
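As a rough illustration of this step, the sketch below counts the missing entries per column and fills them with the column median. The tiny frame `toy` and its values are made up for the example; only the column names come from the data dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only
toy = pd.DataFrame({
    "MonthlyIncome": [5000.0, np.nan, 3000.0, 4000.0],
    "NumberOfDependents": [0.0, 2.0, np.nan, 1.0],
})

# Number of missing entries in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Replace each missing entry with the median of its column
filled = toy.fillna(toy.median())
print(filled)
```

The median is less sensitive to extreme values than the mean, which matters for skewed columns such as income.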

In [None]:

#col_mean = np.nanmean(X,axis=0)

#Find indicies that you need to replace
#inds = np.where(np.isnan(X))

#Place column means in the indices. Align the arrays using take
#X[inds]=np.take(col_mean,inds[1])


## Visualization

In this part, we visualize some aspects of the data, beginning with how the samples are distributed across the two classes.


In [None]:
def plot_freq(l):
    """Annotate the bars of the current countplot (global `ax`) with their share of `l` samples."""
    ncount = l

    ax2=ax.twinx()

    ax2.yaxis.tick_left()
    ax.yaxis.tick_right()

    ax.yaxis.set_label_position('right')
    ax2.yaxis.set_label_position('left')

    ax2.set_ylabel('Frequency [%]')

    for p in ax.patches:
        x=p.get_bbox().get_points()[:,0]
        y=p.get_bbox().get_points()[1,1]
        ax.annotate('{:.1f}%'.format(100.*y/ncount), (x.mean(), y), 
                ha='center', va='bottom')

    ax2.set_ylim(0,100)
    ax2.grid(None)    

In [None]:
ax = sns.countplot(x = df_data.SeriousDlqin2yrs , palette="Set2")
sns.set(font_scale=1.5)
ax.set_ylim(top = len(data))
ax.set_xlabel('Class')
ax.set_ylabel('Sample Size')
plt.title('Class Frequencies')

plot_freq(l = len(df_data.SeriousDlqin2yrs))

plt.show()

## Linear Discriminant Analysis

In [None]:
from sklearn import  linear_model
#clf = linear_model.LogisticRegression(C=1e5)
#clf.fit(X_Train,Y_Train)
#clf.predict_proba(X[0])


In [None]:
#clf.coef_
#ind = np.where(Y_test == 1)
#ss = clf.predict_log_proba(X_test[ind]) 
#t = ss.argmax(axis = 1)
#a = np.intersect1d(ind,np.where(t == 1))
#100*len(a)/len(t)
#print(clf.predict_proba(X_test[37441].reshape(1,-1)))
#y_test[37441]

# Outlier detection

We examine individual features to eliminate outliers. For instance, *RevolvingUtilizationOfUnsecuredLines* contains values close to 50000, even though the feature is a ratio of balance to credit limit. We cap such values so that they do not distort the fitted models.
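One way to choose a cap, sketched here on invented numbers, is to look at the upper quantiles of a feature and then clip at a chosen limit. The series `util` and the limit of 2 are assumptions for the example; `Series.quantile` and `Series.clip` are standard pandas calls.

```python
import pandas as pd

# Hypothetical utilization ratios with one extreme value
util = pd.Series([0.1, 0.3, 0.5, 0.9, 1.2, 50000.0])

# Upper quantiles show how far the tail stretches
print(util.quantile([0.5, 0.9, 1.0]))

# Share of values at or below the candidate cap
cap = 2.0
share_below = (util <= cap).mean()
print(f"{share_below:.1%} of values are <= {cap}")

# Cap everything above the limit
capped = util.clip(upper=cap)
print(capped.tolist())
```

Capping keeps the affected rows in the data set instead of discarding them, which preserves the other features of those clients.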

### RevolvingUtilizationOfUnsecuredLines Variance

In [None]:
#plt.plot(data2.RevolvingUtilizationOfUnsecuredLines)
plt.figure(figsize=(20,15))
ax = plt.subplot(211)
#ax.set_ylim(0,20)
plt.plot(df_data.RevolvingUtilizationOfUnsecuredLines, 'bo',df_data.RevolvingUtilizationOfUnsecuredLines, 'k')
print('Median: %.7f \nMean: %.7f' %(np.median(df_data.RevolvingUtilizationOfUnsecuredLines),np.mean(df_data.RevolvingUtilizationOfUnsecuredLines)))
ruoelLt2=len(df_data[df_data.RevolvingUtilizationOfUnsecuredLines < 2])
ruoelACt=len(df_data.RevolvingUtilizationOfUnsecuredLines)
print('Values less than 2 : %d in %d. Ratio: %.5f%%' %(ruoelLt2,ruoelACt,100*ruoelLt2/ruoelACt))
#sns.kdeplot(data2.RevolvingUtilizationOfUnsecuredLines, shade=True, color="r")
#data2.age.plot.box()
#data2.RevolvingUtilizationOfUnsecuredLines.plot.box()


Based on the ratio obtained above, we cap values greater than 2 at 2.

In [None]:
# Cap values above 2 at 2 (.loc avoids chained-assignment pitfalls)
df_data.loc[df_data.RevolvingUtilizationOfUnsecuredLines > 2, 'RevolvingUtilizationOfUnsecuredLines'] = 2.

### Age Variance

In [None]:
from collections import Counter
plt.figure(1)
df_data.age.plot.box()
Counter(df_data.age)



In [None]:
plt.figure(2)
sns.set_color_codes()
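# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(df_data.age, kde=True, stat="density", color="y")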
plt.show()

In [None]:
# Clamp ages to the range [21, 94]
df_data.loc[df_data.age < 21, 'age'] = 21.
df_data.loc[df_data.age > 94, 'age'] = 94.


### NumberOfTime30-59DaysPastDueNotWorse Variance


In [None]:
Counter(df_data.NumberOfTime3059DaysPastDueNotWorse)

In [None]:
### Set outlier values to the median, that is 0.
df_data.loc[df_data.NumberOfTime3059DaysPastDueNotWorse > 95, 'NumberOfTime3059DaysPastDueNotWorse'] = 0.

In [None]:
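def mad_based_outlier(points, thresh=3.5):
    """Flag outliers via the modified z-score based on the median absolute deviation (MAD)."""
    # Work on a NumPy array so that a pandas Series can be passed in directly
    points = np.asarray(points, dtype=float)
    if len(points.shape) == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    # Euclidean distance of each observation from the median
    diff = np.sqrt(np.sum((points - median)**2, axis=-1))
    med_abs_deviation = np.median(diff)

    # 0.6745 rescales the MAD so it is comparable to a standard deviation for normal data
    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh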
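To make the function above concrete: it computes the modified z-score $M_i = 0.6745\,|x_i - \tilde{x}| / \mathrm{MAD}$, where $\tilde{x}$ is the median and MAD is the median of the absolute deviations, and flags points whose score exceeds the threshold. The sketch below repeats that calculation on a small invented array; the helper name `modified_z_scores` is hypothetical.

```python
import numpy as np

def modified_z_scores(values):
    """Return the MAD-based modified z-score of each value (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    abs_dev = np.abs(values - median)
    mad = np.median(abs_dev)
    return 0.6745 * abs_dev / mad

# Invented sample: five ordinary values and one extreme value
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
scores = modified_z_scores(sample)
flags = scores > 3.5
print(flags)
```

Because both the center and the spread are medians, a single extreme value barely shifts the scores of the remaining points.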

### DebtRatio Variance

In [None]:
plt.figure(figsize=(20,15))
ax = plt.subplot(211)
#ax.set_ylim(0,20)
plt.plot(df_data.DebtRatio, 'bo',df_data.DebtRatio, 'k')
print('Median: %.7f \nMean: %.7f' %(np.median(df_data.DebtRatio),np.mean(df_data.DebtRatio)))
#ruoelLt2=len(df_data[df_data.RevolvingUtilizationOfUnsecuredLines < 2])
#ruoelACt=len(df_data.RevolvingUtilizationOfUnsecuredLines)
#print('Values less than 2 : %d in %d. Ratio: %.5f%%' %(ruoelLt2,ruoelACt,100*ruoelLt2/ruoelACt))

In [None]:
ax = sns.countplot(x=mad_based_outlier(df_data.DebtRatio))
plot_freq(l = len(df_data.DebtRatio))

In [None]:
# Smallest DebtRatio value flagged as an outlier; used as the upper bound
outlier_mask = mad_based_outlier(df_data.DebtRatio)
minUpperBound = df_data.DebtRatio[outlier_mask].min()
### Set outlier values to upperbound, that is minUpperBound.
df_data.loc[df_data.DebtRatio > minUpperBound, 'DebtRatio'] = minUpperBound

In [None]:
plt.figure(figsize=(20,15))
ax = plt.subplot(211)
plt.plot(df_data.DebtRatio, 'o')

df_data.DebtRatio.describe()

### MonthlyIncome Variance

In [None]:
plt.figure(figsize=(20,15))
ax = plt.subplot(211)
#ax.set_ylim(0,20)
plt.plot(df_data.MonthlyIncome, 'bo',df_data.MonthlyIncome, 'k')
print('Median: %.7f \nMean: %.7f' %(np.median(df_data.MonthlyIncome),np.mean(df_data.MonthlyIncome)))

In [None]:
# Smallest MonthlyIncome value flagged as an outlier; used as the upper bound
outlier_mask = mad_based_outlier(df_data.MonthlyIncome)
maxUpperBound = df_data.MonthlyIncome[outlier_mask].min()
df_data.loc[df_data.MonthlyIncome > maxUpperBound, 'MonthlyIncome'] = maxUpperBound
# Raise incomes below 1500 to 1500
df_data.loc[df_data.MonthlyIncome < 1500, 'MonthlyIncome'] = 1500
df_data.MonthlyIncome.describe()

### NumberOfTimes90DaysLate Variance

In [None]:
Counter(df_data.NumberOfTimes90DaysLate)
### Set outlier values to the median, that is 0.
df_data.loc[df_data.NumberOfTimes90DaysLate > 95, 'NumberOfTimes90DaysLate'] = 0

### NumberRealEstateLoansOrLines Variance

In [None]:
Counter(df_data.NumberRealEstateLoansOrLines)
### Set outlier values to 16.
df_data.loc[df_data.NumberRealEstateLoansOrLines > 16, 'NumberRealEstateLoansOrLines'] = 16


### NumberOfTime60-89DaysPastDueNotWorse Variance


In [None]:
Counter(df_data.NumberOfTime6089DaysPastDueNotWorse)
### Set outlier values to 0.
df_data.loc[df_data.NumberOfTime6089DaysPastDueNotWorse > 11, 'NumberOfTime6089DaysPastDueNotWorse'] = 0

### NumberOfDependents Variance

In [None]:
Counter(df_data.NumberOfDependents)
### Set outlier values to 10.
df_data.loc[df_data.NumberOfDependents > 10, 'NumberOfDependents'] = 10

### Train-Test Split

In [None]:
###
### Generate Training and Testing Set 
###
    
outcome_feature = df_data['SeriousDlqin2yrs']
target_features = df_data.drop('SeriousDlqin2yrs', axis=1)

from sklearn.model_selection import train_test_split
"""
    X_train: independent (target) variables for the training set
    Y_train: dependent (outcome) variable for the training set
    X_test: independent (target) variables for the test set
    Y_test: dependent (outcome) variable for the test set
"""
X_train, X_test, Y_train, Y_test = train_test_split(target_features, outcome_feature, test_size=0.5, random_state=0)
from sklearn.metrics import confusion_matrix
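The class weights used by the models in this notebook suggest that the two classes are imbalanced. In that situation a plain random split can leave the two halves with slightly different class ratios. A possible variation, shown here on synthetic labels rather than the real data, passes `stratify=` to `train_test_split` so that both halves keep the same class proportions. The arrays `X_demo` and `y_demo` are invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 10% positives (invented for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify keeps the class ratio identical in both halves
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0, stratify=y_demo
)
print(y_a.mean(), y_b.mean())
```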

# Decision Tree



In [None]:
from sklearn import tree
# min_impurity_split=1e-05 was passed originally; that parameter was removed in scikit-learn 1.0
clf2 = tree.DecisionTreeClassifier(class_weight='balanced', max_depth=6)
clf2.fit(X_train,Y_train)
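For reference, scikit-learn documents `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * np.bincount(y))`, so the rarer class receives the larger weight. The sketch below evaluates that formula on invented labels; `y_demo` is not the real target.

```python
import numpy as np

# Invented labels: 93 negatives and 7 positives
y_demo = np.array([0] * 93 + [1] * 7)

n_samples = len(y_demo)
n_classes = len(np.unique(y_demo))

# Weight per class according to the 'balanced' heuristic
weights = n_samples / (n_classes * np.bincount(y_demo))
print(dict(zip(np.unique(y_demo).tolist(), weights.round(3).tolist())))
```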

In [None]:
Y_pred = clf2.predict(X_test)
pd.crosstab(Y_test, Y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
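Since one aim of the project is to find which features matter most, a fitted tree's `feature_importances_` attribute is one place to look. The sketch below shows the idea on synthetic data; the feature names `f0` to `f4` and the data itself are invented, so the ranking says nothing about the credit features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the credit features
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]

tree_demo = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X, y)

# Impurity-based importances, one value per feature
importances = pd.Series(tree_demo.feature_importances_, index=names)
print(importances.sort_values(ascending=False))
```

With the real data, passing the column names of the training frame as the index would give a ranking that can be read directly.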

# GaussianNB

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB(priors=[0.07,0.93])
gnb.fit(X_train,Y_train)

In [None]:
Y_pred = gnb.predict(X_test)
pd.crosstab(Y_test, Y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

# SVM


In [None]:
from sklearn import svm
clf3 = svm.SVC(C=2,cache_size=7000,tol=1,class_weight={0:.1,1:.9} )
clf3.fit(X_train,Y_train)

In [None]:
Y_pred = clf3.predict(X_test)
pd.crosstab(Y_test, Y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

# MLP

In [None]:
from sklearn.neural_network import MLPClassifier
clf4 = MLPClassifier(activation="identity")
clf4.fit(X_train, Y_train) 

In [None]:
from sklearn.metrics import confusion_matrix
Y_pred = clf4.predict(X_test)
pd.crosstab(Y_test, Y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

# Logistic Regression

In [None]:
from sklearn import  linear_model
clf = linear_model.LogisticRegression(C=1e5,class_weight= {0:.1, 1:.9} )
clf.fit(X_train,Y_train)


In [None]:
Y_pred = clf.predict(X_test)
netMat = (Y_pred == Y_test)
clf.coef_
ind = np.where(Y_test == 1)
Counter(Y_pred[ind])
pd.crosstab(Y_test, Y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
#ss = clf.predict_log_proba(X_test[ind]) 
#t = ss.argmax(axis = 1)
#a = np.intersect1d(ind,np.where(t == 1))
#100*len(a)/len(t)
#print(clf.predict_proba(X_test[37441].reshape(1,-1)))
#y_test[37441]
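The cross-tabulations above are built from hard class predictions. A threshold-free way to compare classifiers, sketched here on synthetic data, is the area under the ROC curve computed from predicted probabilities. The data set, the two models, and their settings are assumptions for the example and are not the results of this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced problem standing in for the credit data
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          random_state=0, stratify=y)

models = {
    "logistic regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "decision tree": DecisionTreeClassifier(class_weight="balanced", max_depth=6,
                                            random_state=0),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Probability of the positive class (column 1)
    proba = model.predict_proba(X_te)[:, 1]
    auc_scores[name] = roc_auc_score(y_te, proba)
    print(f"{name}: AUC = {auc_scores[name]:.3f}")
```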

# A Small Application

In [51]:
import ipywidgets as widgets
from IPython.display import display

# Text box for the applicant's name
text = widgets.Text(
    placeholder='Enes',
    description='Please, enter the name:',
    disabled=False
)
display(text)

# Button that triggers the credibility check
button = widgets.Button(description="Check credibility")
display(button)

# Label that shows the outcome of the check
resultLabel = widgets.Label(value="")
display(resultLabel)

# Handler for the button; the message is a placeholder and is not yet
# connected to any of the models trained above
def on_button_clicked(b):
    resultLabel.value = text.value + ' can be provided with the loan.'

button.on_click(on_button_clicked)

widgets.FloatSlider(
    value=7.5,
    min=0,
    max=10.0,
    step=0.1,
    description='Test:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='.1f',
    slider_color='black'
)
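The button handler above shows a fixed message. One possible way to tie it to a trained classifier, assuming the model exposes `predict_proba` as the scikit-learn estimators in this notebook do, is a small helper like the one below. The function name `credit_decision`, the 0.5 threshold, and the stand-in model class are hypothetical.

```python
import numpy as np

def credit_decision(model, features, name, threshold=0.5):
    """Return a message based on the predicted probability of serious delinquency.

    Hypothetical helper: `model` is any fitted classifier with predict_proba,
    `features` is one applicant's feature vector.
    """
    row = np.asarray(features, dtype=float).reshape(1, -1)
    p_default = model.predict_proba(row)[0, 1]
    if p_default < threshold:
        return f"{name} can be provided with the loan."
    return f"{name} cannot be provided with the loan."

class ConstantModel:
    """Stand-in for a fitted classifier; always predicts the same probability."""
    def __init__(self, p):
        self.p = p

    def predict_proba(self, X):
        return np.tile([1.0 - self.p, self.p], (len(X), 1))

low_risk = credit_decision(ConstantModel(0.1), [0.2, 45, 0], "Enes")
high_risk = credit_decision(ConstantModel(0.9), [0.2, 45, 0], "Enes")
print(low_risk)
print(high_risk)
```

Inside the notebook, the handler could call such a helper with one of the fitted classifiers and write the returned string to the result label.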