Assignment 4: K-Nearest Neighbors

The problems in this assignment are based on Exercise 7.2 in Chapter 7 of Data Mining for Business Analytics.

Scenario: Universal Bank is a relatively young bank growing rapidly in terms of overall customer acquisition. The majority of these customers are liability customers (depositors) with varying sizes of relationship with the bank.

The customer base of asset customers (borrowers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business. In particular, it wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise smarter campaigns with better target marketing. The goal is to use k-nearest neighbors (k-NN) to predict whether a new customer will accept a loan offer. This will serve as the basis for the design of a new campaign.

Data: The file UniversalBank.csv contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.


In [1]:
%matplotlib inline
import warnings

import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier

Data preparation: Load the data and remove unnecessary columns (ID, ZIP Code). Split the data into training (60%) and validation (40%) sets (use random_state=1).

In [2]:
# Load the data
bank_df = pd.read_csv("dmba/UniversalBank.csv")

# Remove ID and Zip Code columns
bank_df = bank_df.drop(columns=['ID', 'ZIP Code'])

# Verify data is loaded correctly
print("Shape", bank_df.shape)  # determine data frame dimensions
bank_df.head(15)  # view the first 15 observations

Shape (5000, 12)


,Age,Experience,Income,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,25,1,49,4,1.6,1,0,0,1,0,0,0
1,45,19,34,3,1.5,1,0,0,1,0,0,0
2,39,15,11,1,1.0,1,0,0,0,0,0,0
3,35,9,100,1,2.7,2,0,0,0,0,0,0
4,35,8,45,4,1.0,2,0,0,0,0,0,1
5,37,13,29,4,0.4,2,155,0,0,0,1,0
6,53,27,72,2,1.5,2,0,0,0,0,1,0
7,50,24,22,1,0.3,3,0,0,0,0,0,1
8,35,10,81,3,0.6,2,104,0,0,0,1,0
9,34,9,180,1,8.9,3,0,1,0,0,0,0


In [3]:
y = bank_df["Personal Loan"]
X = bank_df.drop(columns=["Personal Loan"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
print('Training set:', X_train.shape, 'Validation set:', X_test.shape)

Training set: (3000, 11) Validation set: (2000, 11)


For this assignment, you will want to go back through the reading on the k-NN classifier and think about how to re-purpose the code provided.

Question 1 (10 points) Perform a k-NN classification with all predictors except ID and ZIP code. Compute and report the accuracies (also called correct rates) in the validation set for odd k’s up to 19 (i.e., k = 1, 3, …, 19). What is the best choice of k?

In [4]:
predictors = list(X_train.columns)
scaler = preprocessing.StandardScaler()
scaler.fit(X_train[predictors])

# Transform the predictors
train_X = scaler.transform(X_train[predictors])
train_y = y_train
valid_X = scaler.transform(X_test[predictors])
valid_y = y_test




In [5]:
# Train a classifier for different values of k
results = []
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_X, train_y)
    results.append({
        'k': k,
        'accuracy': accuracy_score(valid_y, knn.predict(valid_X))
    })

# Convert results to a pandas data frame
results_df = pd.DataFrame(results)
results_df

,accuracy,k
0,0.9555,1
1,0.9545,3
2,0.9575,5
3,0.9565,7
4,0.952,9
5,0.947,11
6,0.945,13
7,0.9445,15
8,0.942,17
9,0.9425,19


From the data frame, the highest validation accuracy is 0.9575, achieved at k = 5, so k = 5 is the best choice.
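The best k can also be read off programmatically rather than by eye. The following is a minimal sketch, assuming the accuracy table above is re-entered by hand; the names best_row and best_k are illustrative and are not part of the original notebook.

```python
import pandas as pd

# Validation accuracies copied from the results table above
results_df = pd.DataFrame({
    'k': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
    'accuracy': [0.9555, 0.9545, 0.9575, 0.9565, 0.952,
                 0.947, 0.945, 0.9445, 0.942, 0.9425],
})

# idxmax returns the index label of the first row with the highest accuracy
best_row = results_df.loc[results_df['accuracy'].idxmax()]
best_k = int(best_row['k'])

print('Best k:', best_k, 'with accuracy', best_row['accuracy'])
```

If two values of k tie, idxmax keeps the first one, which here would favor the smaller k.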

Question 2 (7 points) Using the best k, make predictions in the validation set. Based on the numbers in the confusion matrix, explain how the sensitivity and specificity are calculated.


In [6]:
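from sklearn.metrics import confusion_matrix

# Refit the classifier with the best k from Question 1
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_X, train_y)

# Rows of the confusion matrix are actual classes, columns are predicted classes
knnPred = knn.predict(valid_X)
print(confusion_matrix(valid_y, knnPred), "\n")
print('Accuracy :', accuracy_score(valid_y, knnPred))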

[[1800    7]
 [  78  115]] 

Accuracy : 0.9575


In [7]:
sensitivity = 115 / (115+78)   # TP / (TP + FN)
specificity = 1800 / (1800+7)  # TN / (TN + FP)
print("The sensitivity of %s is the number of true positives divided by the number of actual positives." % sensitivity)
print("The specificity of %s is the number of true negatives divided by the number of actual negatives." % specificity)

The sensitivity of 0.5958549222797928 is the number of true positives divided by the number of actual positives.
The specificity of 0.9961261759822911 is the number of true negatives divided by the number of actual negatives.
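The same two measures can be derived from the confusion matrix itself instead of retyping its cells. This is a minimal sketch, assuming the matrix reported above is entered by hand; the variable names tn, fp, fn, and tp are illustrative. It relies on the scikit-learn layout in which rows are actual classes and columns are predicted classes, so flattening a 2x2 matrix yields TN, FP, FN, TP in that order.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class
cm = np.array([[1800, 7],
               [78, 115]])

# Flatten row by row: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)     # share of actual acceptors that were predicted as acceptors
specificity = tn / (tn + fp)     # share of actual non-acceptors that were predicted as non-acceptors
accuracy = (tp + tn) / cm.sum()  # share of all validation records classified correctly

print('Sensitivity:', sensitivity)
print('Specificity:', specificity)
print('Accuracy   :', accuracy)
```

Unpacking the cells this way keeps the calculation correct if the matrix changes, for example after refitting with a different k.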



Question 3 (6 points) Classify a new customer with the following profile: Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education = 2, Mortgage = 0, Securities.Account = 0, CD.Account = 0, Online = 1, CreditCard = 1. Make sure that the column order of the data frame describing the new customer matches the column order in the training set.

In [8]:
new_customer_df = pd.DataFrame(
    [[40, 10, 84, 2, 2, 2, 0, 0, 0, 1, 1]],
    columns=['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities Account', 
             'CD Account', 'Online', 'CreditCard'])
new_customer_df

,Age,Experience,Income,Family,CCAvg,Education,Mortgage,Securities Account,CD Account,Online,CreditCard
0,40,10,84,2,2,2,0,0,0,1,1


In [9]:
new_train_X = scaler.transform(new_customer_df[predictors])


In [10]:
print("The predicted class for the new customer is %s." % knn.predict(new_customer_df), "\n")
print("The class probabilities are %s." % knn.predict_proba(new_customer_df), "\n")

The predicted class for the new customer is [1]. 

The class probabilities are [[0.2 0.8]]. 



### Points: 22/23

### Comment: Your answers for Q1 and Q2 are absolutely correct! 

### For Q3, your predicted class differs from the model answers. You correctly computed new_train_X = scaler.transform(new_customer_df[predictors]); however, when obtaining the predicted class and the class probabilities, you passed new_customer_df instead of new_train_X.

### Replacing new_customer_df with new_train_X makes the answers match the model answers.
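
The effect described in the comment can be illustrated with a small sketch. It uses a synthetic one-feature data set, not the UniversalBank data, and every name in it (X_train, new_record, pred_unscaled, pred_scaled) is hypothetical. The point is only that a classifier fitted on standardized predictors must also receive standardized input at prediction time.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one predictor on a large scale, class 1 for the larger values
X_train = np.array([[20.0], [40.0], [60.0], [80.0],
                    [120.0], [140.0], [160.0], [180.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Fit the scaler on the training data, then fit k-NN on the standardized predictors
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X_train), y_train)

new_record = np.array([[30.0]])

# Raw input: 30 lies far above every standardized training value,
# so its nearest neighbors are the largest training points
pred_unscaled = knn.predict(new_record)

# Standardized input: the record lands among the small training values
pred_scaled = knn.predict(scaler.transform(new_record))

print('Prediction from raw input         :', pred_unscaled)
print('Prediction from standardized input:', pred_scaled)
```

In the notebook, the analogous change is to pass new_train_X, rather than new_customer_df, to knn.predict and knn.predict_proba.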