
# CIS 4321
# Dr. Mohammad Salehan
## K-Nearest Neighbors Assignment
In this assignment you will conduct KNN classification on a dataset. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Let's start by loading the dataset.

Enter your name below.

In [None]:
%matplotlib inline

from pathlib import Path

import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
import matplotlib.pyplot as plt

In [None]:
df = pd.read_excel('UniversalBank.xlsx', sheet_name='Data')
df.shape

(5000, 14)

In [None]:
df.head(1)

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0


Let's check the distribution of the two classes in the label column (i.e., Personal Loan). There is no need to conduct oversampling or undersampling in this assignment.

In [None]:
df['Personal Loan'].value_counts()

0    4520
1     480
Name: Personal Loan, dtype: int64

1. What is the naive rule in this example? What is the accuracy of the naive model?
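
As a rough sketch of the naive rule: it ignores all predictors and assigns every record to the majority class. The snippet below rebuilds the label column from the class counts reported above (4520 zeros, 480 ones) instead of reading the file, so `labels` is an illustrative stand-in for `df['Personal Loan']`.

```python
import pandas as pd

# Stand-in for df['Personal Loan'], rebuilt from the class counts shown above.
labels = pd.Series([0] * 4520 + [1] * 480, name='Personal Loan')

# The naive rule predicts the most frequent class for every record.
majority_class = labels.value_counts().idxmax()

# Its accuracy equals the share of records that belong to that class.
naive_accuracy = (labels == majority_class).mean()

print(majority_class, naive_accuracy)
```

On the full dataset this works out to 4520 / 5000, or about 90.4%. The figure on the validation partition may differ slightly, depending on how the split distributes the two classes.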

## Select columns
2. Exclude ID and ZIP Code columns.

In [None]:
# Drop by name rather than by position so the cell still works if the column order changes.
new_df = df.drop(columns=['ID', 'ZIP Code'])

In [None]:
new_df.head()

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,25,1,49,4,1.6,1,0,0,1,0,0,0
1,45,19,34,3,1.5,1,0,0,1,0,0,0
2,39,15,11,1,1.0,1,0,0,0,0,0,0
3,35,9,100,1,2.7,2,0,0,0,0,0,0
4,35,8,45,4,1.0,2,0,0,0,0,0,1


## Missing values
3. Check missing values. Drop them if needed.

In [None]:
# Count missing values across all columns; a total of 0 means nothing needs to be dropped.
new_df.isna().sum().sum()

0

## Dummies
4. Create dummies if any is needed.

In [None]:
# Not needed

## Partitioning
5. Partition the dataset into train and validation partitions. Use 40% for validation. There is no need to make up artificial records. Set random_state to 26.

In [None]:
train_data, test_data = train_test_split(new_df, test_size=0.4, random_state=26)
print(train_data.shape, test_data.shape)

(3000, 12) (2000, 12)


## Preprocessing
6. Conduct all required preprocessing, which includes (1) selecting features and (2) normalization. Use 'Personal Loan' as the label and the rest of the columns as predictors.<br>
At the end of this cell you should have 2 variables named trainNorm and validNorm representing train and validation partitions.
<br>
Tip: create a list of features and name it features. Use it when needed instead of copy-pasting column names each time.

In [None]:
# Predictors for the training partition: every column except the label.
train_features = train_data.drop(columns='Personal Loan')

# Label for the training partition: 'Personal Loan' is the class to predict.
train_label = train_data['Personal Loan']

# Initialize the scaler (z-score standardization).
scaler = preprocessing.StandardScaler()

# Fit the scaler on the training predictors only.
scaler.fit(train_features)

# Standardize the training predictors.
train_norm_features = pd.DataFrame(scaler.transform(train_features), columns=train_features.columns)

# Combine the standardized predictors with the label; reset the label's index so the rows align.
trainNorm = pd.concat([train_norm_features, train_label.reset_index(drop=True)], axis=1)

# Preview the normalized training partition.
trainNorm.head()

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Education,Mortgage,Securities Account,CD Account,Online,CreditCard,Personal Loan
0,1.536362,1.470326,1.122297,0.529738,1.772104,0.156296,1.125364,-0.346151,-0.248891,-1.24019,-0.645314,1
1,-0.980131,-0.960922,-0.174284,0.529738,-0.696696,0.156296,0.475851,-0.346151,-0.248891,0.806328,-0.645314,0
2,1.449587,1.557156,1.869479,-0.340587,0.107099,-1.03985,-0.547625,-0.346151,-0.248891,-1.24019,1.549632,0
3,-0.459477,-0.52677,1.100321,-1.210912,1.886932,-1.03985,-0.547625,-0.346151,-0.248891,0.806328,-0.645314,0
4,-0.19915,-0.266279,0.397091,-1.210912,1.886932,-1.03985,-0.547625,2.888909,-0.248891,-1.24019,-0.645314,0


In [None]:
# Predictors for the validation partition: every column except the label.
test_features = test_data.drop(columns='Personal Loan')

# Label for the validation partition.
test_label = test_data['Personal Loan']

# Standardize the validation predictors with the scaler fitted on the training partition.
test_norm_features = pd.DataFrame(scaler.transform(test_features), columns=test_features.columns)

# Combine the standardized predictors with the label; reset the label's index so the rows align.
validNorm = pd.concat([test_norm_features, test_label.reset_index(drop=True)], axis=1)

# Preview the normalized validation partition.
validNorm.head()

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Education,Mortgage,Securities Account,CD Account,Online,CreditCard,Personal Loan
0,0.928933,0.862514,1.495888,-0.340587,-0.811524,-1.03985,-0.547625,-0.346151,-0.248891,0.806328,-0.645314,0
1,1.015709,1.036174,-0.745659,1.400063,-0.581868,-1.03985,0.899019,-0.346151,-0.248891,-1.24019,-0.645314,0
2,-0.893355,-1.308243,0.177331,1.400063,1.197964,1.352443,-0.547625,-0.346151,-0.248891,0.806328,-0.645314,0
3,-0.719804,-0.700431,-0.987394,1.400063,-0.122556,-1.03985,-0.547625,2.888909,4.017817,0.806328,-0.645314,0
4,-1.761111,-1.742394,-0.635779,1.400063,-0.75411,0.156296,-0.547625,-0.346151,-0.248891,-1.24019,1.549632,0
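
To illustrate why the scaler is fit on the training partition only, here is a small sketch with made-up numbers (the `toy_train` and `toy_valid` frames are hypothetical, not the bank data). `StandardScaler` stores the training mean and standard deviation, and `transform` reuses them for any later partition.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy partitions; the values are invented for illustration.
toy_train = pd.DataFrame({'Income': [40.0, 60.0, 80.0], 'Age': [30.0, 40.0, 50.0]})
toy_valid = pd.DataFrame({'Income': [100.0], 'Age': [40.0]})

# Learn the mean and standard deviation from the training partition only.
toy_scaler = StandardScaler().fit(toy_train)

# Apply the training statistics to both partitions.
toy_train_z = toy_scaler.transform(toy_train)
toy_valid_z = toy_scaler.transform(toy_valid)

# Same result by hand: z = (x - train mean) / train std (population std, ddof=0).
train_mean = toy_train.mean().to_numpy()
train_std = toy_train.std(ddof=0).to_numpy()
manual_valid_z = (toy_valid.to_numpy() - train_mean) / train_std

print(np.allclose(toy_valid_z, manual_valid_z))
```

Because the validation rows are scaled with training statistics, their columns will generally not have a mean of exactly 0, which is expected.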


## More partitioning
7. Create 4 variables train_X, train_y, valid_X, valid_y representing training features, training label, validation features, and validation label respectively.

In [None]:
train_X = trainNorm.drop(columns = 'Personal Loan')
train_y = trainNorm['Personal Loan']
valid_X = validNorm.drop(columns = 'Personal Loan')
valid_y = validNorm['Personal Loan']

## Run KNN
8. Run KNN. Examine k values in the range 1 to 15. Remember that the end index of the range() function is excluded.

In [None]:
# Train classifier with different values of k (1-15)
results = []
for k in range(1,16):
    knn = KNeighborsClassifier(n_neighbors = k).fit(train_X, train_y)
    results.append({
        'k':k,
        'accuracy': accuracy_score(valid_y, knn.predict(valid_X))
        })
results = pd.DataFrame(results)
results

Unnamed: 0,k,accuracy
0,1,0.9535
1,2,0.9425
2,3,0.952
3,4,0.9445
4,5,0.9515
5,6,0.943
6,7,0.947
7,8,0.942
8,9,0.945
9,10,0.9385
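
For intuition about what `KNeighborsClassifier` does at prediction time, the sketch below classifies one hypothetical point by hand: compute the Euclidean distance to every training row, take the k closest, and return the majority label. The toy arrays are invented, and the comparison with scikit-learn assumes its default settings (Euclidean distance, uniform weights).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, already-normalized toy data (not the bank data).
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]])
toy_y = np.array([0, 0, 0, 1, 1])
new_point = np.array([[1.8, 1.7]])
k = 3

# Euclidean distance from the new point to every training row.
distances = np.sqrt(((toy_X - new_point) ** 2).sum(axis=1))

# Indices of the k closest training rows.
nearest = np.argsort(distances)[:k]

# Majority vote among the neighbors' labels.
manual_prediction = np.bincount(toy_y[nearest]).argmax()

# scikit-learn's classifier with default settings should agree.
sk_model = KNeighborsClassifier(n_neighbors=k).fit(toy_X, toy_y)
sk_prediction = sk_model.predict(new_point)[0]

print(manual_prediction, sk_prediction)
```

Since the vote depends on raw distances, predictors on larger scales would dominate without the normalization step earlier in the notebook.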


## Select the best value for K
9. Select the best value for K and write it below. Justify your selection.

In [None]:
results.loc[results['accuracy'].idxmax()]

k           1.0000
accuracy    0.9535
Name: 0, dtype: float64
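
One way to look beyond a single `idxmax` call is to rank the candidates and compare the top few. The sketch below re-enters the validation accuracies displayed above for k = 1 to 10 (the rows for k = 11 to 15 are not shown in the output and are left out); `shown` is a name introduced here for illustration.

```python
import pandas as pd

# Accuracies copied from the results table displayed above (k = 1..10 only).
shown = pd.DataFrame({
    'k': list(range(1, 11)),
    'accuracy': [0.9535, 0.9425, 0.952, 0.9445, 0.9515,
                 0.943, 0.947, 0.942, 0.945, 0.9385],
})

# Rank candidates by accuracy; ties, if any, go to the smaller k.
ranked = shown.sort_values(['accuracy', 'k'], ascending=[False, True])
top3 = ranked.head(3)

print(top3.to_string(index=False))
```

Among the rows shown, k = 1 scores highest, but k = 3 and k = 5 trail by only 0.0015 and 0.002, that is, 3 and 4 of the 2000 validation records. A slightly larger odd k may therefore be a reasonable choice if a model that is less sensitive to individual noisy neighbors is preferred.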