K Neighbors Classifier Example 

In this classification example we try to predict whether a customer will fail to make their next credit card payment (default). The dataset, from a Taiwanese bank, contains 30,000 records. There are 23 features, such as age and education. All categorical variables (e.g. gender, marital status) have been encoded as integers.


Default payment (Yes = 1, No = 0) is the response variable (y). The 23 explanatory variables are:

X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
X2: Gender (1 = male; 2 = female). 
X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
X4: Marital status (1 = married; 2 = single; 3 = others). 
X5: Age (year). 
X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. 
X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005. 
X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005. 

Data source:
http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

In [119]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split 
from sklearn.metrics import classification_report, confusion_matrix  
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler  

For this initial example I am only going to use 5 features. 

In [120]:
names = ['LIMIT_BAL','SEX','EDUCATION','MARRIAGE','AGE','DEFAULT']

In [121]:
default = pd.read_csv("card_default_short2.csv", names=names) 


In [122]:
default.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,DEFAULT
0,20000,2,2,1,24,1
1,120000,2,2,2,26,1
2,90000,2,2,2,34,0
3,50000,2,2,1,37,0
4,50000,1,2,1,57,0


In [123]:
default.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 6 columns):
LIMIT_BAL    30000 non-null int64
SEX          30000 non-null int64
EDUCATION    30000 non-null int64
MARRIAGE     30000 non-null int64
AGE          30000 non-null int64
DEFAULT      30000 non-null int64
dtypes: int64(6)
memory usage: 1.4 MB


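I split the dataset into features (X) and labels (y): X is the first 5 columns of the dataset and y is the target label (1 = default, 0 = no default). The generic pattern is shown below (see the linked tutorial); the column index used for y depends on the dataset, and for this 6-column file it is 5.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 7].values
http://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/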
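As a quick illustration of the iloc slicing, here is a minimal sketch on a toy frame. The frame name `toy` and the variables `X_toy` and `y_toy` are made up for this example; the three rows are copied from the head() output above.

```python
import pandas as pd

# Toy frame with the same column layout as the 5-feature file
toy = pd.DataFrame(
    {
        "LIMIT_BAL": [20000, 120000, 90000],
        "SEX": [2, 2, 2],
        "EDUCATION": [2, 2, 2],
        "MARRIAGE": [1, 2, 2],
        "AGE": [24, 26, 34],
        "DEFAULT": [1, 1, 0],
    }
)

X_toy = toy.iloc[:, :-1].values  # every column except the last one
y_toy = toy.iloc[:, -1].values   # the last column (same as index 5 here)

print(X_toy.shape, y_toy.shape)  # (3, 5) (3,)
```

Using -1 for the label column avoids hard-coding an index that changes when features are added.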

In [124]:
X = default.iloc[:, :-1].values
y = default.iloc[:, 5].values

In [125]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)  

The cell above creates a training set and a test set. The test set holds 20% of the data (6,000 records) and is used to evaluate the model; the training set (24,000 records) is used to fit it.
https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data
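Because train_test_split shuffles randomly, scores change from run to run. A minimal sketch of a repeatable split follows, assuming synthetic stand-in data (`X_demo`, `y_demo`) and an arbitrary seed of 42; neither is part of the original notebook. The `stratify` argument keeps the share of defaults about the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 5 features, an imbalanced 0/1 label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.22).astype(int)

# random_state fixes the shuffle; stratify preserves the class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # (800, 5) (200, 5)
print(y_tr.mean(), y_te.mean())  # share of positives in each set
```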

In [126]:
classifier = KNeighborsClassifier(n_neighbors=5)  
classifier.fit(X_train, y_train)  

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [127]:
y_pred = classifier.predict(X_test)  

In [128]:
classifier.score(X_test,y_test)

0.7331666666666666

I get a classifier score of about .73, which really isn't great (the exact value changes between runs because the split is random). By changing the n_neighbors parameter to 8, I can improve this slightly to about .77. Still pretty unimpressive.
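Rather than changing n_neighbors by hand, several values can be compared in a loop. This is only a sketch: it runs on synthetic data from make_classification (`X_demo`, `y_demo`), and the candidate values and the 5-fold cross-validation are my own assumptions, not something the notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the credit data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Mean 5-fold cross-validated accuracy for each candidate k
scores = {}
for k in (3, 5, 8, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()

for k, s in scores.items():
    print(k, round(s, 3))

best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

Cross-validation averages over several splits, so the comparison between k values is less sensitive to one lucky or unlucky test set.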

In [129]:
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))  

[[4242  381]
 [1220  157]]
             precision    recall  f1-score   support

          0       0.78      0.92      0.84      4623
          1       0.29      0.11      0.16      1377

avg / total       0.67      0.73      0.69      6000



Looking at the confusion matrix and classification report, I see that when the model predicts default (1) it is correct only 29% of the time, and it catches only 11% of the customers who actually default.

Precision: when the model predicts yes, how often it is correct (positive predictive value)
Recall: of the actual yes cases, how many the model finds (true positive rate, sensitivity)
f1-score: harmonic mean of precision and recall
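To make these definitions concrete, the class 1 numbers in the report can be recomputed by hand from the confusion matrix printed above. The variable names are mine; the counts are copied from the output, where rows are actual labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix from the 5-feature run: rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[4242, 381],
               [1220, 157]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of predicted defaults, how many were real
recall = tp / (tp + fn)      # of real defaults, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
# precision=0.29 recall=0.11 f1=0.16 accuracy=0.7332
```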


Maybe adding more features will improve my model. I add in the payment history, bill amounts, and payment amounts.


In [130]:
names2 = ['LIMIT_BAL','SEX','EDUCATION','MARRIAGE','AGE','Pay_0','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6','DEFAULT',]

In [131]:
default_all = pd.read_csv("credit_card_default_full.csv", names=names2) 


In [132]:
default_all.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,Pay_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,DEFAULT
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [133]:
default_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 24 columns):
LIMIT_BAL    30000 non-null int64
SEX          30000 non-null int64
EDUCATION    30000 non-null int64
MARRIAGE     30000 non-null int64
AGE          30000 non-null int64
Pay_0        30000 non-null int64
PAY_2        30000 non-null int64
PAY_3        30000 non-null int64
PAY_4        30000 non-null int64
PAY_5        30000 non-null int64
PAY_6        30000 non-null int64
BILL_AMT1    30000 non-null int64
BILL_AMT2    30000 non-null int64
BILL_AMT3    30000 non-null int64
BILL_AMT4    30000 non-null int64
BILL_AMT5    30000 non-null int64
BILL_AMT6    30000 non-null int64
PAY_AMT1     30000 non-null int64
PAY_AMT2     30000 non-null int64
PAY_AMT3     30000 non-null int64
PAY_AMT4     30000 non-null int64
PAY_AMT5     30000 non-null int64
PAY_AMT6     30000 non-null int64
DEFAULT      30000 non-null int64
dtypes: int64(24)
memory usage: 5.5 MB


In [134]:
X = default_all.iloc[:, :-1].values
y = default_all.iloc[:, 23].values

In [135]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)  

In [136]:
X_train

array([[260000,      2,      1, ...,  81569, 331788,  10002],
       [110000,      1,      2, ...,      0,   4000,   3700],
       [230000,      2,      2, ...,   4000,   4000,   3000],
       ...,
       [100000,      2,      2, ...,   2606,   2696,   2695],
       [230000,      2,      1, ...,   3459,   4000,   5000],
       [370000,      2,      2, ...,    747,   4060,   3059]])

Because of the large variation in the scale of my X values, I will use a scaler to standardize the data.

In [137]:
scaler = StandardScaler()  
scaler.fit(X_train)



StandardScaler(copy=True, with_mean=True, with_std=True)

In [138]:
X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test)  
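What the scaler does can be checked by hand: each column has the training mean subtracted and is divided by the training standard deviation, and the same training statistics are reused for the test rows. A small sketch, using the LIMIT_BAL and AGE values from the head() output as a made-up train/test pair (`train`, `test`, and `demo_scaler` are names invented for this example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales: credit limit and age
train = np.array([[20000.0, 24.0], [120000.0, 26.0], [90000.0, 34.0], [50000.0, 37.0]])
test = np.array([[50000.0, 57.0]])

demo_scaler = StandardScaler().fit(train)  # statistics come from the training rows only

mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
manual_test = (test - mean) / std

print(np.allclose(demo_scaler.transform(test), manual_test))  # True
```

Without this step, distances in KNN are dominated by the columns with the largest raw values, such as the NT dollar amounts.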



In [139]:
classifier = KNeighborsClassifier(n_neighbors=6)  
classifier.fit(X_train, y_train)  

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=6, p=2,
           weights='uniform')

In [140]:
y_pred = classifier.predict(X_test)  

In [141]:
classifier.score(X_test,y_test)

0.8038333333333333

Adding the extra features, using the scaler, and changing n_neighbors to 6 improves my model score to .80.
However, I still only get a score of .75 with n_neighbors = 5 and .77 with n_neighbors = 8. The precision for default = yes has improved from .29 to .60 (below), although recall is still only .26.

In [144]:
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))  

[[4495  221]
 [ 956  328]]
             precision    recall  f1-score   support

          0       0.82      0.95      0.88      4716
          1       0.60      0.26      0.36      1284

avg / total       0.78      0.80      0.77      6000

