***Thad Hoskins***

Mini-Project 2 - K-Nearest Neighbor Classifier

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

***Question 3: Your own mini project***

Find your own dataset suitable for classification or regression with at least three input variables and 200 or more cases:<br>
Depending on the target variable of interest, build a k-nearest neighbor classifier or regressor using the appropriate sklearn estimator. Find an interesting, unique dataset that is not widely used on the internet.
Address the following and include code/output snippets from b) to f). Include the response under each sub-question.

State your research question:

Is a K-Nearest Neighbor classifier a good model for predicting heavy online data usage from age, gender, webpages, video hours, and income?

An ISP could use such a model to target power users, limit or facilitate certain online activities (no net neutrality here), or market packages according to whether a customer is a power user.

Since this is a classification problem, the target variable is binary: usage.

usage is 1 if data usage exceeds 15 GB per week; otherwise, usage is 0.

b) Data pre-processing (to the extent deemed necessary: remember the knn algorithm depends on distances, so you need to rescale, normalize or standardize your input values to make sure no variable influences the predictions due to its scale).
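
To see why scaling matters for a distance-based method, the sketch below takes the first two rows of this dataset (raw values from the `data.head(5)` output, scaled values from the min-max scaled output) and measures how much of the squared Euclidean distance comes from income alone. The helper name `income_share` is made up for this illustration, and gender is left out because it is already coded 0/1.

```python
import math

# First two rows of the dataset: (age, webpages, videohours, income)
raw_a = (36, 32, 0.061389, 6021)
raw_b = (33, 49, 8.516667, 10239)

# The same two rows after min-max scaling, copied from the scaled output
scaled_a = (0.414634, 0.023404, 0.001364, 0.501750)
scaled_b = (0.341463, 0.041489, 0.189259, 0.853250)


def income_share(a, b):
    """Fraction of the squared Euclidean distance contributed by income (last feature)."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared[-1] / sum(squared)


raw_share = income_share(raw_a, raw_b)
scaled_share = income_share(scaled_a, scaled_b)

print(f"raw distance:    {math.dist(raw_a, raw_b):.1f}, income share {raw_share:.5f}")
print(f"scaled distance: {math.dist(scaled_a, scaled_b):.3f}, income share {scaled_share:.5f}")
```

On the raw values, income accounts for almost the entire distance simply because it is measured in the thousands. After scaling, the other features contribute a visible share.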

In [2]:
data = pd.read_csv("https://raw.githubusercontent.com/MGCodesandStats/datasets/master/internetlogit.csv")
data.head(5)

Unnamed: 0,age,gender,webpages,videohours,income,usage
0,36,0,32,0.061389,6021,0
1,33,0,49,8.516667,10239,1
2,46,1,22,0.0,1374,0
3,53,0,16,2.762222,5376,0
4,27,1,30,0.0,1393,0


In [3]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_scaled = data.copy()

# Min-max scale the numeric columns to [0, 1]; gender is already coded 0/1
cols = ["age", "webpages", "videohours", "income"]
data_scaled[cols] = scaler.fit_transform(data[cols])

data_scaled.isnull().sum().sort_values(ascending=False)

age           0
gender        0
webpages      0
videohours    0
income        0
usage         0
dtype: int64

In [4]:
# No missing values were found above, so this is a no-op kept as a safeguard
data_scaled.dropna(axis=0, how='any', inplace=True)

data_scaled

Unnamed: 0,age,gender,webpages,videohours,income,usage
0,0.414634,0,0.023404,0.001364,0.501750,0
1,0.341463,0,0.041489,0.189259,0.853250,1
2,0.658537,1,0.012766,0.000000,0.114500,0
3,0.829268,0,0.006383,0.061383,0.448000,0
4,0.195122,1,0.021277,0.000000,0.116083,0
...,...,...,...,...,...,...
961,0.097561,0,0.017021,0.020204,0.311583,0
962,0.780488,0,0.024468,0.045802,0.655500,0
963,0.073171,0,0.014894,0.061265,0.106500,0
964,0.926829,1,0.018085,0.033951,0.926167,1


In [5]:
X_scaled = data_scaled.drop(["usage"], axis=1)
y = data_scaled["usage"]

c) Data splitting

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=.2, random_state=42)
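
One caveat about the order of operations: the scaler in the preprocessing step was fit on the full dataset before this split, so the test rows influenced the minimum and maximum used for scaling. The effect is probably small for min-max scaling on a dataset of this size, but a `Pipeline` avoids the issue by fitting the scaler on the training rows only. The sketch below is a possible arrangement, not what this notebook ran. It uses synthetic data as a stand-in (an assumption, so it runs without the CSV), and the names `X_demo`, `y_demo`, and `pipe` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for four numeric features on very different scales
# (assumption for illustration; not the internetlogit data)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[40, 30, 3, 5000], scale=[10, 15, 2, 3000], size=(300, 4))
y_demo = (X_demo[:, 3] + 500 * X_demo[:, 2] > 6500).astype(int)

# Split first, on the unscaled features
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training rows only, then applies
# the same transformation to whatever is passed to predict/score
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_tr, y_tr)

demo_test_score = pipe.score(X_te, y_te)
print(f"Demo test score: {demo_test_score:.3f}")
```

Passing the same pipeline to `GridSearchCV` (with parameter names such as `knn__n_neighbors`) would also keep the scaling inside each cross-validation fold.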

d) Model construction 

In [7]:
# Baseline model with the default k=5 (named knn_reg, but it is a classifier)
knn_reg = KNeighborsClassifier(n_neighbors=5)
knn_reg.fit(X_train, y_train)

print(f"Train score: {knn_reg.score(X_train, y_train)}")
print(f"Test score: {knn_reg.score(X_test, y_test)}")

Train score: 0.9572538860103627
Test score: 0.9123711340206185
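
How impressive an accuracy figure is depends on class balance: a model that always predicts the more common class already scores that class's share of the data. The sketch below shows one way to compute that reference point. The function name and the `demo_labels` list are hypothetical, and the labels are made up for illustration. In the notebook one would pass `y_test` instead.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels for illustration only (not taken from the dataset)
demo_labels = [0] * 8 + [1] * 2
baseline = majority_baseline_accuracy(demo_labels)
print(f"Majority-class baseline accuracy: {baseline}")
```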


The baseline scores about 91% on the test set, roughly 4.5 points below its training score. Next is a little hyperparameter tuning.

In [8]:
knn_reg.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

e) Hyperparameter tuning (choose whatever approach you like)

In [10]:
params = {"metric":["euclidean", "manhattan", "minkowski"], 
          "n_neighbors": list(range(1,51, 2)), 
          "weights": ["uniform", "distance"]}
gs = GridSearchCV(knn_reg, params, cv=5)
gs.fit(X_train, y_train.values.flatten())

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'metric': ['euclidean', 'manhattan', 'minkowski'],
                         'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21,
                                         23, 25, 27, 29, 31, 33, 35, 37, 39, 41,
                                         43, 45, 47, 49],
                         'weights': ['uniform', 'distance']})
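
A note on the metric grid: `p` is left at its default of 2 (see the `get_params()` output), and Minkowski distance with `p=2` is Euclidean distance. The `"minkowski"` entry therefore repeats `"euclidean"`, and the search effectively compares two metrics, not three. The sketch below checks this on the first two scaled rows, using a hand-written `minkowski` helper (an illustrative function, not sklearn's implementation).

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# First two rows of the scaled data: (age, webpages, videohours, income)
a = (0.414634, 0.023404, 0.001364, 0.501750)
b = (0.341463, 0.041489, 0.189259, 0.853250)

d_euclidean = math.dist(a, b)
d_minkowski_p2 = minkowski(a, b, 2)
d_manhattan = minkowski(a, b, 1)

print(f"euclidean:       {d_euclidean:.6f}")
print(f"minkowski (p=2): {d_minkowski_p2:.6f}")
print(f"manhattan (p=1): {d_manhattan:.6f}")
```

Adding `"p": [1, 2]` to the grid and dropping the named metrics would be one way to express the same search without the duplicate.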

In [11]:
gs.best_params_

{'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'distance'}

In [12]:
gs_best_estimator = gs.best_estimator_
gs_best_estimator

KNeighborsClassifier(metric='euclidean', n_neighbors=7, weights='distance')

f) Use the best or optimal parameter values to build a model, then compute the accuracy score for your estimator.
Discuss overfitting for the model.

In [13]:
y_pred = gs_best_estimator.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {test_acc}")

Accuracy: 0.9175257731958762


In [14]:
y_train_pred = gs_best_estimator.predict(X_train)
train_acc = accuracy_score(y_train, y_train_pred)
print(f"Train accuracy: {train_acc}")

Train accuracy: 1.0


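The training accuracy of 100% should not be read as a sign of a perfect model. It is an expected consequence of weights='distance': each training point is its own nearest neighbor at distance zero, so it effectively votes for its own label. Training accuracy therefore says little about overfitting here, and the gap of about 8 points to the test accuracy overstates the problem. The more useful comparison is between test accuracy before and after tuning: 0.912 for the default model and 0.918 for the tuned one, a gain of about half a percentage point. The tuned model generalizes at least as well as the baseline, and the mean cross-validated accuracy in gs.best_score_ is the right quantity to set against the test score when judging overfitting.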
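
The behavior of training accuracy under `weights='distance'` can be checked directly. The sketch below uses synthetic features with purely random labels (an assumption for illustration, with made-up names `X_demo` and `y_demo`), so there is nothing real to learn. A distance-weighted model should still score perfectly on its own training data, because every training point sits at distance zero from itself, while cross-validation gives a more honest estimate.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Random features and random binary labels: no real signal to learn
rng = np.random.default_rng(0)
X_demo = rng.random((200, 4))
y_demo = rng.integers(0, 2, size=200)

knn_distance = KNeighborsClassifier(n_neighbors=7, weights="distance").fit(X_demo, y_demo)
knn_uniform = KNeighborsClassifier(n_neighbors=7, weights="uniform").fit(X_demo, y_demo)

# Accuracy on the same data the models were fit on
train_acc_distance = knn_distance.score(X_demo, y_demo)
train_acc_uniform = knn_uniform.score(X_demo, y_demo)

# 5-fold cross-validated accuracy: each point is scored by a model that never saw it
cv_acc_distance = cross_val_score(
    KNeighborsClassifier(n_neighbors=7, weights="distance"), X_demo, y_demo, cv=5
).mean()

print(f"Train accuracy (distance weights): {train_acc_distance}")
print(f"Train accuracy (uniform weights):  {train_acc_uniform:.3f}")
print(f"CV accuracy (distance weights):    {cv_acc_distance:.3f}")
```

For the tuned model in this notebook, the analogous honest figures are `gs.best_score_` and the test accuracy computed above.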