KNN (k-nearest neighbours) is a very popular ML technique because it is simple and has no explicit training step. It has been found to be quite accurate when the predictor variables are chosen carefully. Because it makes no assumption about the form of the relationship, it can capture non-linearity in data for both classification and regression. In this notebook, we will see both classification and regression using KNN.
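
As a rough illustration of the idea (not the scikit-learn implementation used below), KNN predicts from the k training points closest to a query: a majority vote for classification, an average for regression. The helper names and the toy data in this sketch are made up for illustration.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Indices of the k training rows closest to query (Euclidean distance)."""
    distances = sorted((math.dist(row, query), i) for i, row in enumerate(train_X))
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, query, k=3):
    """Majority vote among the k nearest neighbours."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Mean target value of the k nearest neighbours."""
    neighbours = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbours) / len(neighbours)

# Toy data: two well-separated groups of points (hypothetical values).
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels = [0, 0, 0, 1, 1, 1]
targets = [10.0, 12.0, 11.0, 50.0, 55.0, 54.0]

predicted_class = knn_classify(train_X, labels, (1.2, 1.4))
predicted_value = knn_regress(train_X, targets, (8.4, 8.8))
print(predicted_class, predicted_value)
```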

In [1]:
from sklearn import neighbors

In [2]:
import pandas as pd
import numpy as np

Read bankloan dataset from a CSV file.

In [3]:
bankloan = pd.read_csv(filepath_or_buffer= '/home/subhasis/Dropbox/Datasets/bankloan.csv',na_values= '#NULL!')

In [4]:
bankloan.tail()

     age  ed  employ  address  income  debtinc  creddebt  othdebt  default
845   34   1      12       15    32.0      2.7      0.24     0.62      NaN
846   32   2      12       11   116.0      5.7      4.03     2.59      NaN
847   48   1      13       11    38.0     10.8      0.72     3.38      NaN
848   35   2       1       11    24.0      7.8      0.42     1.45      NaN
849   37   1      20       13    41.0     12.9      0.90     4.39      NaN


In [5]:
bankloan_unknown = bankloan[bankloan.default.isnull()]

In [6]:
bankloan_known = bankloan[~bankloan.default.isnull()]

In [7]:
bankloan_known.shape, bankloan_unknown.shape

((700, 9), (150, 9))

In [8]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning that
# pandas raises when a column is assigned directly on a filtered slice.
bankloan_known = bankloan_known.assign(default=bankloan_known.default.astype('category'))


In [9]:
bankloan_known.dtypes

age            int64
ed             int64
employ         int64
address        int64
income       float64
debtinc      float64
creddebt     float64
othdebt      float64
default     category
dtype: object

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
from sklearn import preprocessing

In [11]:
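# train_test_split returns (X_train, X_test, y_train, y_test) in that order. With the
# names used here, bl_Y_train therefore holds the test-set predictors and bl_X_test
# holds the training-set labels. The rest of the notebook relies on this naming.
bl_X_train, bl_Y_train, bl_X_test, bl_Y_test = train_test_split(bankloan_known.iloc[:,0:8],
                                                                bankloan_known['default'],
                                                                test_size = 0.3, random_state = 12345)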

In [12]:
bankloan_known.default.value_counts()

0.0    517
1.0    183
Name: default, dtype: int64
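
The class counts above give a useful yardstick: a model that always predicts the majority class (no default) is already right for 517 of the 700 known customers. A small sketch of that baseline, using only the counts printed above:

```python
# Class counts copied from the value_counts() output (0 = no default, 1 = default).
counts = {0: 517, 1: 183}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"always predicting {majority_class} gives accuracy {baseline_accuracy:.3f}")
```

Accuracy figures reported later are easier to judge against this number, which is one reason the notebook also reports Cohen's kappa.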

In [13]:
bl_X_train.shape, bl_Y_train.shape, bl_Y_test.shape, bl_X_test.shape

((490, 8), (210, 8), (210,), (490,))
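
For reference, scikit-learn's train_test_split returns the pieces in the order X_train, X_test, y_train, y_test. The sketch below uses toy arrays (not the bank loan data) and the conventional names to show which shape belongs to which piece.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows with 2 predictor columns and a binary label.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Predictors keep their column count; labels are one-dimensional.
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```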

In [38]:
KNN_class = neighbors.KNeighborsClassifier(n_neighbors=3)  # k = 3 nearest neighbours

In [15]:
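# With no scoring argument, cross_val_score reports accuracy for a classifier.
# y is bl_X_test, which holds the training-set labels (see the split in In [11]).
cv_scores = cross_val_score(estimator=KNN_class, X = np.array(bl_X_train),y = np.array(bl_X_test),cv = 5)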

In [16]:
cv_scores

array([ 0.71717172,  0.6969697 ,  0.71428571,  0.71134021,  0.75257732])
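
A common way to summarise the fold scores is their mean and standard deviation. The sketch below simply re-enters the five accuracies printed above by hand.

```python
import statistics

# Fold accuracies copied from the cv_scores output above.
cv_scores = [0.71717172, 0.6969697, 0.71428571, 0.71134021, 0.75257732]

mean_score = statistics.mean(cv_scores)
std_score = statistics.stdev(cv_scores)  # sample standard deviation across folds

print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```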

In [17]:
# Same cross-validation, scored by area under the ROC curve instead of accuracy.
cv_scores_auc = cross_val_score(estimator=KNN_class,
                                X = np.array(bl_X_train),
                                y = np.array(bl_X_test), cv = 5, scoring = 'roc_auc')

In [18]:
cv_scores_auc

array([ 0.63708848,  0.60648148,  0.61006781,  0.68716143,  0.71858072])

In [19]:
pred_KNN_cv = cross_val_predict(estimator=KNN_class,X = bl_X_train, y = bl_X_test, cv = 5)

In [20]:
metrics.cohen_kappa_score(bl_X_test, pred_KNN_cv)

0.20546142457641059

In [21]:
KNN_model = KNN_class.fit(bl_X_train,y=bl_X_test)

In [22]:
KNN_pred_test = KNN_model.predict(bl_Y_train)

In [23]:
metrics.cohen_kappa_score(bl_Y_test,KNN_pred_test)

0.33333333333333337

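KNN is distance-based, so it generally performs better when the predictor variables are on comparable scales. Later in this notebook the predictors are standardised to zero mean and unit variance with preprocessing.scale, which is z-score standardisation rather than min-max scaling.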
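
To see why scale matters, consider the Euclidean distance between the rows indexed 845 and 846 in the bankloan.tail() output above. The sketch below (values copied by hand from that output) computes how much of the squared distance comes from income alone.

```python
columns = ['age', 'ed', 'employ', 'address', 'income', 'debtinc', 'creddebt', 'othdebt']
row_845 = [34, 1, 12, 15, 32.0, 2.7, 0.24, 0.62]
row_846 = [32, 2, 12, 11, 116.0, 5.7, 4.03, 2.59]

# Contribution of each predictor to the squared Euclidean distance.
squared_diffs = {name: (a - b) ** 2 for name, a, b in zip(columns, row_845, row_846)}
total = sum(squared_diffs.values())
income_share = squared_diffs['income'] / total

print(f"income accounts for {income_share:.1%} of the squared distance")
```

On unscaled data, the predictor with the largest numeric range decides who the nearest neighbours are. For this pair of rows, income contributes more than 99% of the squared distance.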

In [24]:
bankloan.drop(['ed','default'],axis=1).corr()

               age    employ   address    income   debtinc  creddebt   othdebt
age       1.000000  0.554241  0.599949  0.476218  0.008240  0.278835  0.337855
employ    0.554241  1.000000  0.344664  0.625093 -0.033625  0.381738  0.414427
address   0.599949  0.344664  1.000000  0.308340 -0.032939  0.161614  0.185488
income    0.476218  0.625093  0.308340  1.000000 -0.035585  0.551519  0.603368
debtinc   0.008240 -0.033625 -0.032939 -0.035585  1.000000  0.514940  0.572576
creddebt  0.278835  0.381738  0.161614  0.551519  0.514940  1.000000  0.644943
othdebt   0.337855  0.414427  0.185488  0.603368  0.572576  0.644943  1.000000


'income' is correlated with several other predictors (0.63 with employ, 0.60 with othdebt, 0.55 with creddebt), so it is dropped along with 'ed'.
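
One way to make this screening step explicit is to list the predictor pairs whose absolute correlation exceeds a cut-off. The sketch below re-enters the correlation matrix printed above by hand. The 0.6 threshold is an arbitrary choice for illustration, not something the original analysis specifies.

```python
from itertools import combinations

names = ['age', 'employ', 'address', 'income', 'debtinc', 'creddebt', 'othdebt']
corr = [
    [1.0, 0.554241, 0.599949, 0.476218, 0.00824, 0.278835, 0.337855],
    [0.554241, 1.0, 0.344664, 0.625093, -0.033625, 0.381738, 0.414427],
    [0.599949, 0.344664, 1.0, 0.30834, -0.032939, 0.161614, 0.185488],
    [0.476218, 0.625093, 0.30834, 1.0, -0.035585, 0.551519, 0.603368],
    [0.00824, -0.033625, -0.032939, -0.035585, 1.0, 0.51494, 0.572576],
    [0.278835, 0.381738, 0.161614, 0.551519, 0.51494, 1.0, 0.644943],
    [0.337855, 0.414427, 0.185488, 0.603368, 0.572576, 0.644943, 1.0],
]
threshold = 0.6  # assumed cut-off, chosen only for illustration

# Visit each unordered pair once and keep those above the threshold.
strong_pairs = [
    (names[i], names[j], corr[i][j])
    for i, j in combinations(range(len(names)), 2)
    if abs(corr[i][j]) >= threshold
]
for first, second, r in strong_pairs:
    print(f"{first:>8} ~ {second:<8} r = {r:.3f}")
```

With this cut-off, income appears in two of the three pairs that are listed.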

In [25]:
X = np.array(bl_X_train.drop(['ed','income'],axis = 1))

In [26]:
KNN_new_model = KNN_class.fit(X = bl_X_train.drop(['ed','income'],axis = 1), y = bl_X_test)

In [27]:
p = KNN_new_model.predict(bl_Y_train.drop(['ed','income'],axis = 1))

In [28]:
metrics.confusion_matrix(bl_Y_test,p), metrics.accuracy_score(bl_Y_test,p)

(array([[136,  24],
        [ 25,  25]]), 0.76666666666666672)

In [29]:
metrics.cohen_kappa_score(p,bl_Y_test)

0.35242290748898686
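
Cohen's kappa compares the observed agreement with the agreement expected by chance from the row and column totals. As a check on the figure above, the sketch below recomputes it from the confusion matrix printed in In [28]. The function name is made up for illustration.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected if predictions were independent of the true labels.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

confusion_k3 = [[136, 24], [25, 25]]  # copied from the In [28] output
kappa_k3 = cohen_kappa(confusion_k3)
print(round(kappa_k3, 4))
```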

Let us see how the performance changes when the variables are scaled and the number of neighbours is tuned.

In [31]:
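# preprocessing.scale standardises each column to zero mean and unit variance.
# Note that the training and test predictors are scaled separately here,
# each with its own mean and standard deviation.
bl_X_train_scaled = preprocessing.scale(X=bl_X_train.drop(['ed','income'],axis=1))
bl_Y_train_scaled = preprocessing.scale(X=bl_Y_train.drop(['ed','income'],axis=1))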
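
A common alternative is to estimate the mean and standard deviation on the training set only and reuse them for the test set, so that both sets go through the same transformation. The following is a minimal pure-Python sketch with made-up numbers. The helper names are hypothetical; in scikit-learn, StandardScaler plays this role.

```python
import statistics

def fit_zscore(rows):
    """Mean and population standard deviation of each column, from the training rows only."""
    return [(statistics.fmean(col), statistics.pstdev(col)) for col in zip(*rows)]

def apply_zscore(rows, params):
    """Standardise rows with previously fitted (mean, std) pairs."""
    return [[(value - mean) / std for value, (mean, std) in zip(row, params)]
            for row in rows]

# Hypothetical (age, debtinc) values, for illustration only.
train_rows = [[25, 2.0], [35, 4.0], [45, 6.0]]
test_rows = [[35, 8.0]]

params = fit_zscore(train_rows)
train_scaled = apply_zscore(train_rows, params)
test_scaled = apply_zscore(test_rows, params)  # reuses the training statistics
print(test_scaled)
```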

In [62]:
from sklearn.model_selection import GridSearchCV # GridSearchCV for estimating best parameters
KNN_final = GridSearchCV(KNN_class,param_grid={'n_neighbors':range(1,21),'weights':['uniform','distance']},cv=5)
KNN_final.fit(X=bl_X_train_scaled,y = bl_X_test)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'weights': ['uniform', 'distance'], 'n_neighbors': range(1, 21)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [63]:
KNN_final.best_params_

{'n_neighbors': 9, 'weights': 'uniform'}
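
To get a sense of what the grid search evaluates, the sketch below enumerates the same parameter grid with itertools. It only counts the candidates and does not fit any model.

```python
from itertools import product

param_grid = {'n_neighbors': range(1, 21), 'weights': ['uniform', 'distance']}
cv_folds = 5

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
n_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With refit=True, as in the output above, the best candidate is then fitted once more on the full training data.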

In [64]:
p_scaled = KNN_final.predict(bl_Y_train_scaled)

In [66]:
metrics.confusion_matrix(bl_Y_test,p_scaled), metrics.accuracy_score(bl_Y_test,p_scaled)

(array([[148,  12],
        [ 26,  24]]), 0.81904761904761902)

In [67]:
metrics.cohen_kappa_score(p_scaled,bl_Y_test)

0.44813278008298763
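
Accuracy and kappa both improved, but it is also worth checking how the defaulters (class 1) are handled. The sketch below derives precision and recall for class 1 from the two confusion matrices printed above (In [28] and In [66]). The function name is made up for illustration.

```python
def precision_recall(confusion, positive=1):
    """Precision and recall for one class (rows: true labels, columns: predictions)."""
    true_pos = confusion[positive][positive]
    predicted_pos = sum(row[positive] for row in confusion)
    actual_pos = sum(confusion[positive])
    return true_pos / predicted_pos, true_pos / actual_pos

k3_unscaled = [[136, 24], [25, 25]]   # copied from the In [28] output
tuned_scaled = [[148, 12], [26, 24]]  # copied from the In [66] output

for name, matrix in [('k=3, unscaled', k3_unscaled), ('tuned, scaled', tuned_scaled)]:
    precision, recall = precision_recall(matrix)
    print(f"{name}: precision = {precision:.3f}, recall = {recall:.3f}")
```

On these matrices, recall for defaulters stays near one half (25 of 50 versus 24 of 50), while precision rises from 25/49 to 24/36. Since the second model changes both the scaling and the number of neighbours, the improvement cannot be attributed to scaling alone.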

In [None]:
# The equivalent R code is given below in a string literal for reference. R's and Python's random
# number generators differ, so the train/test split, and hence the results, will not match exactly.
'''
bankloan.old=bankloan[1:700,]
bankloan.old$default=as.character(bankloan.old$default)
bankloan.old$default=as.factor(bankloan.old$default)
bankloan.old$ed=as.factor(bankloan.old$ed)
set.seed(12345);index=sample(700,490)
trainset=bankloan.old[index,]
testset=bankloan.old[-index,]

library(class)
pred.knn=knn(trainset[,-c(2,5,9)],testset[,-c(2,5,9)],cl=trainset$default,k=3)
caret::confusionMatrix(pred.knn,testset$default)

pred.knn.std=knn(scale(trainset[,-c(2,5,9)]),scale(testset[,-c(2,5,9)]),cl=trainset$default,k=5)
caret::confusionMatrix(pred.knn.std,testset$default)
'''