# Mini Project: K-Nearest Neighbors

Written by Adam Ten Hoeve  
COMP 4448 - Data Science Tools 2  
Summer 2021

In [2]:
# Load Required Packages
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV, train_test_split

Find your own dataset suitable for classification or regression, with at least three input variables and 200 or more cases. Depending on the target variable of interest, build a k-nearest neighbor classifier or regressor using the appropriate sklearn estimator. Choose an interesting dataset that is not widely used on the internet. Address the following and include code/output snippets for parts b) through f), placing each response under its sub-question.

Data can be found here: <https://archive.ics.uci.edu/ml/datasets/abalone>

In [1]:
# Define a function to calculate the MSE between true and predicted values.
def mse(y, y_pred):
    # Compute the residuals between the true and predicted values
    resid = np.asarray(y).ravel() - np.asarray(y_pred).ravel()
    # Compute the sum of the squared residuals
    sum_of_resid = np.sum(resid**2)
    # Divide by the number of observations
    final = sum_of_resid / len(resid)
    return final
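
As a quick sanity check, a hand-written MSE like the one above can be compared against `sklearn.metrics.mean_squared_error`. The sketch below restates the helper so it runs on its own; the three-point `y_true` and `y_hat` arrays are made-up values for illustration only, not abalone data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error


def mse(y, y_pred):
    # Same computation as the notebook's helper: mean of squared residuals
    resid = y.to_numpy().flatten() - y_pred
    return np.sum(resid**2) / len(y)


# Hypothetical toy values, chosen so the result is easy to verify by hand
y_true = pd.Series([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

manual = mse(y_true, y_hat)                    # (1 + 0 + 4) / 3
reference = mean_squared_error(y_true, y_hat)  # scikit-learn's implementation
print(manual, reference)
```

If the two numbers agree, the helper can be trusted for the training and test evaluations later in the notebook.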

In [3]:
# Read in the abalone dataset
df_abalone = pd.read_csv("abalone.csv", header=None)
# Name the columns (the raw file has no header row)
df_abalone.columns = ["sex", "length", "diameter", "height", "weight_whole",
                      "weight_shucked", "weight_viscera", "weight_shell", "rings"]

# Extract the response variable
y_aba = df_abalone["rings"]

# Rescale the numerical features to [0, 1] (min-max normalization).
# Note: the scaler is fit on the full dataset, before the train/test split.
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(df_abalone.drop(["sex", "rings"], axis=1))
df_scaled = pd.DataFrame(data_scaled, columns=["length", "diameter", "height", "weight_whole",
                                   "weight_shucked", "weight_viscera", "weight_shell"])
# One-hot encode sex, dropping the first level (F) as the baseline
df_dummy = pd.get_dummies(df_abalone["sex"], drop_first=True)

# Combine the cleaned dataframes into an X dataframe
df_aba = pd.concat([df_scaled, df_dummy], axis=1)
df_aba.head()

| | length | diameter | height | weight_whole | weight_shucked | weight_viscera | weight_shell | I | M |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.513514 | 0.521008 | 0.084071 | 0.181335 | 0.150303 | 0.132324 | 0.147982 | 0 | 1 |
| 1 | 0.371622 | 0.352941 | 0.079646 | 0.079157 | 0.066241 | 0.063199 | 0.068261 | 0 | 1 |
| 2 | 0.614865 | 0.613445 | 0.119469 | 0.239065 | 0.171822 | 0.185648 | 0.207773 | 0 | 0 |
| 3 | 0.493243 | 0.521008 | 0.110619 | 0.182044 | 0.14425 | 0.14944 | 0.152965 | 0 | 1 |
| 4 | 0.344595 | 0.336134 | 0.070796 | 0.071897 | 0.059516 | 0.05135 | 0.053313 | 1 | 0 |
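
The preprocessing above fits `MinMaxScaler` on every row before the data is split, so the test rows influence the scaling range. One possible alternative, sketched here rather than taken from the notebook, is to wrap the scaling and encoding in a scikit-learn `Pipeline` so they are fit on the training rows only. The `toy` DataFrame, its column subset, and `n_neighbors=5` are assumptions made for illustration; the real abalone columns would be substituted in.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical stand-in for the abalone data (synthetic values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "sex": rng.choice(["F", "I", "M"], size=n),
    "length": rng.uniform(0.1, 0.8, size=n),
    "height": rng.uniform(0.01, 0.3, size=n),
})
toy["rings"] = (10 * toy["length"] + rng.normal(0, 1, size=n)).round()

X = toy.drop(columns="rings")
y = toy["rings"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale numeric columns and one-hot encode sex inside the pipeline,
# so both transformers learn their parameters from the training rows only
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["length", "height"]),
    ("cat", OneHotEncoder(drop="first"), ["sex"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])

model.fit(X_train, y_train)
preds = model.predict(X_test)
print(preds.shape)
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the scaler is refit within each cross-validation fold.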


In [4]:
# Count the missing values in each column
df_aba.isna().sum()

length            0
diameter          0
height            0
weight_whole      0
weight_shucked    0
weight_viscera    0
weight_shell      0
I                 0
M                 0
dtype: int64

In [5]:
# Split the data into training and test sets
X_train_aba, X_test_aba, y_train_aba, y_test_aba = train_test_split(df_aba, y_aba,
                                                                    test_size=0.2,
                                                                    random_state=42)
print(X_train_aba.shape)
print(y_train_aba.shape)
print(X_test_aba.shape)
print(y_test_aba.shape)

(3341, 9)
(3341,)
(836, 9)
(836,)


In [6]:
# Create a KNN regressor
knn_aba = KNeighborsRegressor()
# Use GridSearchCV to find the best k between 1 and 20 with 8-fold cross-validation
# (with no scoring argument, the regressor's default score, R^2, is used)
param_grid = {"n_neighbors": np.arange(1, 21)}
grid_aba = GridSearchCV(knn_aba, param_grid, cv=8)
# Fit the grid search to the training data
grid_aba.fit(X_train_aba, y_train_aba)
# Keep the best model, which is refit on the full training set
knn_aba = grid_aba.best_estimator_
print("Best k-value from training data:", grid_aba.best_params_)

Best k-value from training data: {'n_neighbors': 16}
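
Because the notebook later evaluates the model with MSE, it may be useful to see how the same search looks when it is scored on MSE and how the full cross-validation table can be inspected. This is a minimal sketch on synthetic data from `make_regression`, not the abalone set; `X_toy`, `y_toy`, and the `cv_mse` column are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the real features and target
X_toy, y_toy = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)

param_grid = {"n_neighbors": np.arange(1, 21)}
grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=8,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, hence the negation
)
grid.fit(X_toy, y_toy)

# One row per candidate k, with the mean cross-validated MSE
results = pd.DataFrame(grid.cv_results_)[["param_n_neighbors", "mean_test_score"]]
results["cv_mse"] = -results["mean_test_score"]
print(results.sort_values("cv_mse").head())
print("Best k by CV MSE:", grid.best_params_["n_neighbors"])
```

Looking at the whole table, rather than only `best_params_`, shows whether neighboring values of k perform almost as well as the selected one.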


In [7]:
# Using the best model, predict the values of the training and test sets
aba_preds_train = knn_aba.predict(X_train_aba)
aba_preds_test = knn_aba.predict(X_test_aba)
# Compute the MSE of the predictions on both sets
print("MSE on the training set:", mse(y_train_aba, aba_preds_train))
print("MSE on the test set:", mse(y_test_aba, aba_preds_test))

MSE on the training set: 4.251068635887459
MSE on the test set: 4.932355524820574
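
MSE is in squared units (rings squared), so taking the square root gives an error in the same units as the target. The sketch below applies that to the two MSE values printed above; the variable names are introduced here for illustration.

```python
import math

# MSE values copied from the output above
mse_train = 4.251068635887459
mse_test = 4.932355524820574

# RMSE is the square root of MSE, expressed in rings
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"RMSE on the training set: {rmse_train:.2f} rings")
print(f"RMSE on the test set: {rmse_test:.2f} rings")
```

On this scale the typical prediction error is a little over two rings on both sets, with the test error slightly higher than the training error.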
