# IBM MACHINE LEARNING ASSIGNMENT

## INSTRUCTIONS
In this project, you will complete a notebook where you will build a classifier to predict whether a loan case will be paid off or not.

1. You load a historical dataset from previous loan applications, clean the data, and apply different classification algorithms to the data. You are expected to use the following algorithms to build your models:

k-Nearest Neighbour

Decision Tree

SVM

Logistic Regression


2. The results are reported as the accuracy of each classifier, using the following metrics where applicable:

Jaccard index

F1-score

LogLoss
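
As a quick orientation, the three metrics can be computed with scikit-learn as sketched below. The labels, predictions, and probabilities are invented toy values, not data from the loan dataset.

```python
from sklearn.metrics import f1_score, jaccard_score, log_loss

# Toy labels and predictions, invented for illustration only.
y_true = [1, 1, 0, 1]
y_pred = [1, 0, 0, 1]
# Hypothetical predicted probabilities of class 1 for the same four cases.
p_class1 = [0.9, 0.4, 0.2, 0.8]

jaccard = jaccard_score(y_true, y_pred)  # TP / (TP + FP + FN) for class 1
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall for class 1
ll = log_loss(y_true, p_class1)          # needs probabilities, not hard labels

print(f"Jaccard: {jaccard:.2f}, F1: {f1:.2f}, LogLoss: {ll:.2f}")
```

LogLoss only applies to classifiers that output probabilities, which is why it is reported for Logistic Regression alone later in the notebook.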

## REVIEW CRITERIA
This final project will be graded by your peers who are completing this course during the same session. This project is worth 25 marks of your total grade, broken down as follows:

Building a model using KNN, finding the best k, and evaluating its accuracy (7 marks)
Building a model using Decision Tree and evaluating its accuracy (6 marks)
Building a model using SVM and evaluating its accuracy (6 marks)
Building a model using Logistic Regression and evaluating its accuracy (6 marks)

In [1]:
#import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
%matplotlib inline

print('libraries imported!')

libraries imported!


In [2]:
#download dataset

#uncomment this if using IBM Watson Studio
#!wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv

#load straight from the web
df = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv')

print('dataset loaded!')
df.head()

dataset loaded!


,Unnamed: 0,Unnamed: 0.1,loan_status,Principal,terms,effective_date,due_date,age,education,Gender
0,0,0,PAIDOFF,1000,30,9/8/2016,10/7/2016,45,High School or Below,male
1,2,2,PAIDOFF,1000,30,9/8/2016,10/7/2016,33,Bechalor,female
2,3,3,PAIDOFF,1000,15,9/8/2016,9/22/2016,27,college,male
3,4,4,PAIDOFF,1000,30,9/9/2016,10/8/2016,28,college,female
4,6,6,PAIDOFF,1000,30,9/9/2016,10/8/2016,29,college,male


## 2. Data Preprocessing

In [3]:
#drop columns
df= df.drop(['Unnamed: 0','Unnamed: 0.1'], axis= 1)
df= df.drop(['effective_date','due_date'], axis= 1)

#Rename dataset values
df['loan_status'] = df['loan_status'].replace(['COLLECTION'],'NOT PAIDOFF')
df['education'] = df['education'].replace(['Bechalor'],'Bachelor')

#Change string values to all lowercase
df['loan_status']= df['loan_status'].str.lower()
df['education']= df['education'].str.lower()

#rename columns to lowercase (rename returns a new DataFrame by default)
df = df.rename(columns={'Principal': 'principal', 'Gender': 'gender'})

df.head()

,loan_status,principal,terms,age,education,gender
0,paidoff,1000,30,45,high school or below,male
1,paidoff,1000,30,33,bachelor,female
2,paidoff,1000,15,27,college,male
3,paidoff,1000,30,28,college,female
4,paidoff,1000,30,29,college,male


In [4]:
#read X dataset in numpy arrays
X= df[['principal','terms','age','education','gender']].values
print(X[0:5])

#read y dataset in numpy arrays
y= df['loan_status'].values
print(y[0:5])

[[1000 30 45 'high school or below' 'male']
 [1000 30 33 'bachelor' 'female']
 [1000 15 27 'college' 'male']
 [1000 30 28 'college' 'female']
 [1000 30 29 'college' 'male']]
['paidoff' 'paidoff' 'paidoff' 'paidoff' 'paidoff']


In [5]:
#convert the features of X to numerical values
from sklearn import preprocessing

le_education = preprocessing.LabelEncoder()
le_education.fit(['high school or below', 'college', 'bachelor','master or above'])
X[:,3] = le_education.transform(X[:,3])

le_sex = preprocessing.LabelEncoder()
le_sex.fit(['male','female'])
X[:,4] = le_sex.transform(X[:,4]) 
X[0:5]

array([[1000, 30, 45, 2, 1],
       [1000, 30, 33, 0, 0],
       [1000, 15, 27, 1, 1],
       [1000, 30, 28, 1, 0],
       [1000, 30, 29, 1, 1]], dtype=object)
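
`LabelEncoder` assigns codes in alphabetical order, which is why 'bachelor' maps to 0 and 'high school or below' maps to 2 in the output above. If an ordering by level of education is wanted instead, one option is `OrdinalEncoder` with an explicit category list. The sketch below uses a three-row sample, and the ordering itself is an assumption rather than something the assignment specifies.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering from least to most schooling; not stated in the assignment.
education_order = ['high school or below', 'college', 'bachelor', 'master or above']

encoder = OrdinalEncoder(categories=[education_order])

# Small illustrative sample; OrdinalEncoder expects a 2-D array (rows x features).
sample = np.array([['high school or below'], ['bachelor'], ['college']])
codes = encoder.fit_transform(sample).ravel()

print(codes)  # [0. 2. 1.]
```

Whether the ordering matters depends on the model: distance-based methods such as KNN are sensitive to it, while tree-based models are usually less so.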

In [6]:
#convert the labels in y to numerical values (not paidoff -> 0, paidoff -> 1)
le_loan_status = preprocessing.LabelEncoder()
le_loan_status.fit(['paidoff','not paidoff'])
y = le_loan_status.transform(y).astype(int)

y[0:5]

array([1, 1, 1, 1, 1])

In [7]:
print(df.shape)
print(df.columns)

(346, 6)
Index(['loan_status', 'principal', 'terms', 'age', 'education', 'gender'], dtype='object')


In [8]:
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[ 0.51578458,  0.92071769,  2.33152555,  0.97648333,  0.42056004],
       [ 0.51578458,  0.92071769,  0.34170148, -1.89894843, -2.37778177],
       [ 0.51578458, -0.95911111, -0.65321055, -0.46123255,  0.42056004],
       [ 0.51578458,  0.92071769, -0.48739188, -0.46123255, -2.37778177],
       [ 0.51578458,  0.92071769, -0.3215732 , -0.46123255,  0.42056004]])

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (276, 5) (276,)
Test set: (70, 5) (70,)
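
Note that the scaler in cell 8 is fitted on all rows before the split, so statistics from the test rows influence the training features. A possible alternative, sketched here on synthetic data with hypothetical names (`X_demo`, `y_demo`, `model`), is to split first and let a `Pipeline` fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features; not the real dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Split first, so the test rows stay unseen during scaling.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# The pipeline fits StandardScaler on X_tr only, then reuses those statistics on X_te.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)

print(X_tr.shape, X_te.shape)
```

With a dataset this small the effect on the reported scores is likely minor, but the pattern avoids the leak entirely.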


## 3. Build Model and Evaluation

### 3.1 K Nearest Neighbor

In [10]:
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

Ks = 10
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))

for n in range(1,Ks):
    
    #Train Model and Predict  
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    knnyhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, knnyhat)

    std_acc[n-1]=np.std(knnyhat==y_test)/np.sqrt(knnyhat.shape[0])

print(mean_acc)
print( "The best accuracy was with", mean_acc.max(), "with k=", mean_acc.argmax()+1)

[0.64285714 0.54285714 0.68571429 0.65714286 0.77142857 0.74285714
 0.74285714 0.7        0.74285714]
The best accuracy was with 0.7714285714285715 with k= 5
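
The loop above picks k by scoring each candidate on the test set, which tends to make the reported test accuracy optimistic. One common alternative is to choose k by cross-validation on the training data only. The sketch below does this with `GridSearchCV` on synthetic data; `X_demo` and `y_demo` are placeholders, not the loan data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training split; not the real dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Try k = 1..9 with 5-fold cross-validation, scored by accuracy.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(1, 10))},
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

best_k = search.best_params_['n_neighbors']
print("best k:", best_k, "mean CV accuracy:", round(search.best_score_, 3))
```

The test set is then used once, at the end, to evaluate the model trained with the selected k.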


In [11]:
k = 5
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh

KNeighborsClassifier()

In [12]:
knnyhat = neigh.predict(X_test)
print(knnyhat[0:20])
print (y_test [0:20])

[1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1]


### 3.1.1 K Nearest Neighbor Evaluation

In [13]:
print("KNN Jaccard index: %.2f" % jaccard_score(y_test, knnyhat))
print("KNN F1-score: %.2f" % f1_score(y_test, knnyhat, average='weighted') )

KNN Jaccard index: 0.77
KNN F1-score: 0.71


### 3.2 Decision Tree

In [14]:
#import decision tree
from sklearn.tree import DecisionTreeClassifier
loanTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
loanTree # displays the estimator with the parameters set above

DecisionTreeClassifier(criterion='entropy', max_depth=4)

In [15]:
loanTree.fit(X_train,y_train)
X_test[0:5]

array([[ 0.51578458,  0.92071769, -0.15575453,  0.97648333,  0.42056004],
       [-1.31458942, -0.95911111, -0.15575453, -0.46123255,  0.42056004],
       [ 0.51578458, -0.95911111,  0.01006414,  0.97648333,  0.42056004],
       [ 0.51578458,  0.92071769, -1.15066656, -0.46123255,  0.42056004],
       [ 0.51578458,  0.92071769,  0.50752015, -0.46123255,  0.42056004]])

In [16]:
dtyhat = loanTree.predict(X_test)
print(dtyhat[0:20])
print (y_test [0:20])

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1]


### 3.2.1 Decision Tree Evaluation

In [17]:
print("Decision Tree Jaccard index: %.2f" % jaccard_score(y_test, dtyhat))
print("Decision Tree F1-score: %.2f" % f1_score(y_test, dtyhat, average='weighted') )

Decision Tree Jaccard index: 0.79
Decision Tree F1-score: 0.69


### 3.4 Logistic Regression Model

In [18]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

LogisticRegression(C=0.01, solver='liblinear')

In [20]:
LRyhat = LR.predict(X_test)
LRyhat

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1])

In [21]:
LRyhatprob = LR.predict_proba(X_test)
LRyhatprob[0:5]

array([[0.43450679, 0.56549321],
       [0.3820972 , 0.6179028 ],
       [0.39928113, 0.60071887],
       [0.44796703, 0.55203297],
       [0.42277497, 0.57722503]])
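
To make the LogLoss metric concrete, the sketch below recomputes it by hand for the five probability rows shown above, paired with the first five test labels printed earlier (all 1). It covers only these five rows, so the value differs from the full-test-set LogLoss reported in the evaluation.

```python
import numpy as np
from sklearn.metrics import log_loss

# The five probability rows printed above: [P(class 0), P(class 1)].
probs = np.array([
    [0.43450679, 0.56549321],
    [0.3820972, 0.6179028],
    [0.39928113, 0.60071887],
    [0.44796703, 0.55203297],
    [0.42277497, 0.57722503],
])
# First five test labels, as printed in the earlier y_test output.
y_first5 = np.array([1, 1, 1, 1, 1])

# Binary log loss: -mean(y * log(p) + (1 - y) * log(1 - p)), with p = P(class 1).
p1 = probs[:, 1]
manual = -np.mean(y_first5 * np.log(p1) + (1 - y_first5) * np.log(1 - p1))

# labels=[0, 1] is required here because this slice contains only class 1.
from_sklearn = log_loss(y_first5, probs, labels=[0, 1])

print(round(manual, 4), round(from_sklearn, 4))
```

Because every probability sits close to 0.5 to 0.6, the model is only weakly confident, which keeps the loss well above zero even when the predicted label is correct.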

### 3.4.1 Logistic Regression Evaluation

In [22]:
print("LR Jaccard index: %.2f" % jaccard_score(y_test, LRyhat))
print("LR F1-score: %.2f" % f1_score(y_test, LRyhat, average='weighted') )
print("LR LogLoss: %.2f" % log_loss(y_test, LRyhatprob))

LR Jaccard index: 0.79
LR F1-score: 0.69
LR LogLoss: 0.60


### 3.3 SVM

In [23]:
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 

SVC()

In [24]:
svmyhat = clf.predict(X_test)
print(svmyhat [0:20])
print (y_test [0:20])

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1]


### 3.3.1 SVM Evaluation

In [25]:
print("SVM Jaccard index: %.2f" % jaccard_score(y_test, svmyhat))
print("SVM F1-score: %.2f" % f1_score(y_test, svmyhat, average='weighted') )

SVM Jaccard index: 0.79
SVM F1-score: 0.69


## 4. Summary

|Algorithm          |Jaccard|F1-score|LogLoss|
|-------------------|-------|--------|-------|
|KNN                |0.77   |0.71    |N/A    |
|Decision Tree      |0.79   |0.69    |N/A    |
|Logistic Regression|0.79   |0.69    |0.60   |
|SVM                |0.79   |0.69    |N/A    |
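
When reading these scores, it helps to compare them with a majority-class baseline. The Logistic Regression output earlier shows class 1 predicted for every test case, and the Decision Tree and SVM report identical scores. The sketch below scores an "always predict 1" classifier on a hypothetical 70-case test set; the 55/15 class split is an assumption chosen for illustration, not a count taken from the notebook.

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score

# Hypothetical test labels: 55 paid off (1) and 15 not paid off (0).
y_true = np.array([1] * 55 + [0] * 15)
# Baseline that predicts "paid off" for every case.
y_all_ones = np.ones_like(y_true)

baseline_jaccard = jaccard_score(y_true, y_all_ones)
# zero_division=0 silences the warning for class 0, which is never predicted.
baseline_f1 = f1_score(y_true, y_all_ones, average='weighted', zero_division=0)

print(f"Jaccard: {baseline_jaccard:.2f}, weighted F1: {baseline_f1:.2f}")
# Jaccard: 0.79, weighted F1: 0.69
```

Under that assumed split, the baseline reproduces the 0.79 and 0.69 in the table, which suggests those three models may not be separating the classes at all. A confusion matrix or class-weighted training would be a reasonable next check.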