# k-Nearest Neighbors Model
kNN predicts the class of a data point from the classes of the training points most similar (closest) to it. We're using this model because:
* kNN is easier to implement than the other models. We would like to see how a simpler model fares against more complex ones, so it can serve as a baseline.
* Our target variable has multiple categories. kNN may be naive, but it handles multiclass classification naturally because each prediction is a majority vote among the neighbors.
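
To make the majority vote concrete, here is a minimal from-scratch sketch in plain Python. The toy points, labels, and the `knn_predict` helper are illustrative assumptions and not part of this notebook's pipeline, which relies on scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # The most common label among those neighbors wins
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: three classes in 2-D
toy_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 0), (10, 1)]
toy_labels = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C']

toy_prediction = knn_predict(toy_points, toy_labels, query=(5.5, 5.2), k=3)
print(toy_prediction)
```

Because the vote only counts labels, the same procedure works for any number of classes without modification.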

It's important to note the following about kNN:
* It may struggle with high-dimensional and large datasets.
* After cleaning the dataset, there are notable class imbalances according to the YData-Profiling report, so kNN may be less accurate on the less frequent classes.

In [165]:
# necessary imports
import random
import numpy as np
import pickle
import os
import matplotlib.pyplot as plt

# loading
import pandas as pd 

# for model
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.preprocessing import LabelEncoder

In [166]:
# setup code taken from STINTSY class notebook
# Makes matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline

plt.rcParams['figure.figsize'] = (6.0, 6.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# autoreload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


First, we decide which variables to use as features (feature selection).

In [167]:
numerical_vars = ['PUFC25_PBASIC', 'PUFC05_AGE', 'PUFC18_PNWHRS', 'PUFC19_PHOURS', 'PUFC28_THOURS'] # based on anova results
bincateg_vars = ['PUFC08_CURSCH', 'PUFC09_GRADTECH', 'PUFC20_PWMORE', 'PUFC22_PFWRK'] # by relevance
categ_vars = ['PUFC43_QKB', 'PUFC16_PKB', 'PUFC04_SEX', 'PUFC23_PCLASS', 'PUFURB2K10'] # by highest cramers v

# merge these into one
features = numerical_vars + bincateg_vars + categ_vars
print(features)

['PUFC25_PBASIC', 'PUFC05_AGE', 'PUFC18_PNWHRS', 'PUFC19_PHOURS', 'PUFC28_THOURS', 'PUFC08_CURSCH', 'PUFC09_GRADTECH', 'PUFC20_PWMORE', 'PUFC22_PFWRK', 'PUFC43_QKB', 'PUFC16_PKB', 'PUFC04_SEX', 'PUFC23_PCLASS', 'PUFURB2K10']


Then, we'll load the exported dataset into the environment.

In [168]:
# load dataset into the environment
file_dir = './cleaned_df.csv'

def load_dataset(filename):
    # read the cleaned dataset from the given path
    knn_df = pd.read_csv(filename)
    return knn_df

knn_df = load_dataset(file_dir)
# knn_df.head(10)

In [169]:
# kNN expects every feature to be in numerical format

label_encoder = LabelEncoder()
for col in bincateg_vars:
    knn_df[col] = label_encoder.fit_transform(knn_df[col])

# Encoding multiclass categorical variables
for col in categ_vars:
    knn_df[col] = label_encoder.fit_transform(knn_df[col])

features = numerical_vars + bincateg_vars + categ_vars

print(knn_df[features].values)
print(knn_df['PUFC14_PROCC'].values)

[[ 0.         49.          8.         ...  1.          4.
   0.        ]
 [ 0.         61.          4.         ...  0.          6.
   0.        ]
 [ 5.52545294 19.          8.         ...  1.          2.
   0.        ]
 ...
 [ 0.         32.          4.         ...  0.          4.
   0.        ]
 [ 0.         29.          8.         ...  1.          0.
   0.        ]
 [ 0.         18.          4.         ...  1.          4.
   0.        ]]
['Skilled Agricultural, Forestry and Fishery Workers'
 'Elementary Occupations' 'Elementary Occupations' ...
 'Skilled Agricultural, Forestry and Fishery Workers' 'Managers'
 'Managers']
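
The encoding cell above reuses a single `LabelEncoder`, so only the mapping of the last column stays available after the loops. If the integer codes ever need to be inspected or inverted, one option is to keep one encoder per column. The sketch below assumes a small made-up frame (`toy_df` and its column names are hypothetical, not columns from `knn_df`).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame standing in for knn_df
toy_df = pd.DataFrame({
    'sex': ['Male', 'Female', 'Female', 'Male'],
    'urban': ['Urban', 'Rural', 'Urban', 'Urban'],
})

# One encoder per column, so each mapping can be inspected or inverted later
encoders = {}
for col in ['sex', 'urban']:
    encoders[col] = LabelEncoder()
    toy_df[col] = encoders[col].fit_transform(toy_df[col])

# classes_ lists the original categories; the position of each is its integer code
mapping = {col: list(enc.classes_) for col, enc in encoders.items()}
print(toy_df)
print(mapping)
```

Note that the integer codes follow the sorted order of the category names, and kNN will treat them as ordinary numbers when computing distances.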


In [170]:
n_neighbors = 5

knn_model = KNeighborsClassifier(n_neighbors, algorithm='kd_tree')

In [171]:
X = knn_df[features]
y = knn_df['PUFC14_PROCC']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

With the features and target selected and split into training and test sets, let's view the shapes to check that the data was prepared correctly.

In [172]:
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)

print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

Training data shape:  (56968, 14)
Training labels shape:  (56968,)
Test data shape:  (14242, 14)
Test labels shape:  (14242,)


We can now begin training the model.

In [173]:
knn_model.fit(X_train, y_train) # training the model

In [174]:
y_predicted = knn_model.predict(X_test)

Let's test the accuracy of our model.

In [175]:
accuracy = accuracy_score(y_test, y_predicted) # from sklearn
print(f"{round(accuracy * 100, 3)}% accuracy")

65.265% accuracy
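
Because the classes are imbalanced, overall accuracy can hide poor performance on rare classes. A per-class breakdown with `classification_report` (already imported above) is one way to check this. The sketch below uses made-up labels (`toy_true`, `toy_pred`), not the notebook's data, purely to show the call.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels: class 'a' dominates, class 'c' is rare
toy_true = ['a'] * 8 + ['b'] * 3 + ['c'] * 1
toy_pred = ['a'] * 8 + ['a', 'b', 'b'] + ['a']

toy_accuracy = accuracy_score(toy_true, toy_pred)
print(f"{round(toy_accuracy * 100, 3)}% accuracy")

# zero_division=0 avoids warnings for classes that are never predicted
report = classification_report(toy_true, toy_pred, output_dict=True, zero_division=0)
print(classification_report(toy_true, toy_pred, zero_division=0))
```

On the notebook's own data, the equivalent call would be `print(classification_report(y_test, y_predicted))`.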


Observations:

### Cross Validation

We use cross validation to determine the optimal value for our hyperparameter k.

In [None]:
k_folds = 15 # chosen because we are working with a large dataset
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 25, 50, 55, 100] 
# from most sensitive to noise (more overfitting) to least sensitive (more underfitting) 

In [177]:
best_k = None
best_score = 0

for k in k_choices:
    knn = KNeighborsClassifier(n_neighbors=k, algorithm='kd_tree')
    scores = cross_val_score(knn, X_train, y_train, cv=k_folds, scoring='accuracy')
    mean_score = np.mean(scores)
    print(f"k={k}, Cross-Validation Accuracy: {round(mean_score * 100, 3)}%")
    
    if mean_score > best_score:
        best_k = k
        best_score = mean_score

print(f"Best k: {best_k} with accuracy: {round(best_score * 100, 3)}%")

k=1, Cross-Validation Accuracy: 64.44%
k=3, Cross-Validation Accuracy: 65.584%
k=5, Cross-Validation Accuracy: 66.59%
k=8, Cross-Validation Accuracy: 66.493%
k=10, Cross-Validation Accuracy: 66.228%
k=12, Cross-Validation Accuracy: 66.014%
k=15, Cross-Validation Accuracy: 65.475%
k=20, Cross-Validation Accuracy: 64.757%
k=25, Cross-Validation Accuracy: 63.948%
k=50, Cross-Validation Accuracy: 61.643%
k=55, Cross-Validation Accuracy: 61.212%
k=100, Cross-Validation Accuracy: 59.11%
Best k: 5 with accuracy: 66.59%
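
The means for k=5 and k=8 above differ by less than 0.1 percentage points, so it may help to look at the spread across folds as well as the mean before settling on a value. The following is a minimal sketch on synthetic data from `make_classification`, which stands in for `X_train` and `y_train`; the names `X_toy`, `y_toy`, and `cv_summary` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=42)

cv_summary = {}
for k in [1, 5, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_toy, y_toy, cv=5, scoring='accuracy')
    # Keep the fold-to-fold spread as well as the mean
    cv_summary[k] = (np.mean(scores), np.std(scores))

for k, (mean, std) in cv_summary.items():
    print(f"k={k}: {round(mean * 100, 3)}% +/- {round(std * 100, 3)}%")
```

If two values of k have means that differ by less than the standard deviation, the difference may simply be noise between folds.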


FOR LATER: https://peerdh.com/blogs/programming-insights/optimizing-k-nearest-neighbors-for-large-datasets-in-scikit-learn
I used the kd_tree algorithm here, but I'll come back to this choice later.

In [179]:
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': [1, 3, 5, 7, 9], 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

{'algorithm': 'brute', 'n_neighbors': 7}
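
After a grid search, the fitted object also exposes the mean cross-validation score of the best combination (`best_score_`) and a model refit on the whole training set (`best_estimator_`), which can then be scored on the held-out test set. Below is a minimal sketch on synthetic data; `X_toy`, `toy_grid`, and `toy_search` are hypothetical names, and the grid is reduced to `n_neighbors` only. As far as I know, the `algorithm` options in scikit-learn are all exact neighbor searches, so they are expected to affect speed far more than accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's features and target
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

toy_grid = {'n_neighbors': [1, 3, 5, 7, 9]}
toy_search = GridSearchCV(KNeighborsClassifier(), toy_grid, cv=5, scoring='accuracy')
toy_search.fit(X_tr, y_tr)

print(toy_search.best_params_)
print(f"Mean CV accuracy: {round(toy_search.best_score_ * 100, 3)}%")

# refit=True is the default, so best_estimator_ is already trained on all of X_tr
test_accuracy = toy_search.best_estimator_.score(X_te, y_te)
print(f"Test accuracy: {round(test_accuracy * 100, 3)}%")
```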
