### K-Nearest Neighbor Model for Car Classification

This notebook uses a KNN model to predict car classification, based on the train/test split data produced by the data prep notebook:
- fit a base model with default parameters and evaluate the results
- tune the model to find the optimal k parameter using F1 score
- explore the impact of different features on model performance

In [1]:
#import relevant packages and libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pickle
import functions as fn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import sklearn.metrics as metric
plt.style.use('ggplot')

#run data_prep notebook to create train/test split data pickles

In [2]:
#retrieve train/test split data
with open('X_train.pickle', 'rb') as file:
    X_train = pickle.load(file)
    
with open('X_test.pickle', 'rb') as file:
    X_test = pickle.load(file)

with open('y_train.pickle', 'rb') as file:
    y_train = pickle.load(file)
    
with open('y_test.pickle', 'rb') as file:
    y_test = pickle.load(file)

In [3]:
#normalize feature data (X_train and X_test) using StandardScaler
scaler = StandardScaler()
scaled_data_train = scaler.fit_transform(X_train)  #fit the scaler on training data only
scaled_data_test = scaler.transform(X_test)  #reuse the training mean and std for test data

scaled_df_train = pd.DataFrame(scaled_data_train, columns = X_train.columns)
scaled_df_test = pd.DataFrame(scaled_data_test, columns = X_test.columns)
#preview scaled data
scaled_df_train.head()

Unnamed: 0,Year,Engine HP,Engine Cylinders,highway MPG,city mpg,Popularity,MSRP,Performance,Factory Tuner,Hybrid,...,x4_Cargo Van,x4_Convertible,x4_Coupe,x4_Crew Cab Pickup,x4_Extended Cab Pickup,x4_Passenger Minivan,x4_Passenger Van,x4_Regular Cab Pickup,x4_Sedan,x4_Wagon
0,1.070747,2.463359,1.174737,-1.163008,-0.833465,-0.665766,0.863701,-0.554457,3.549477,-0.231191,...,-0.079359,-0.292225,-0.3571,-0.218013,-0.14328,-0.073439,-0.131456,-0.134932,-0.551356,-0.195067
1,0.522695,-0.933291,-0.89528,0.364742,0.17821,-0.526818,-0.514851,-0.554457,-0.281732,-0.231191,...,-0.079359,-0.292225,-0.3571,-0.218013,-0.14328,-0.073439,-0.131456,-0.134932,-0.551356,-0.195067
2,-2.765616,-0.206048,0.139728,-0.290008,-0.49624,-0.79027,-0.429133,1.803566,-0.281732,-0.231191,...,-0.079359,-0.292225,-0.3571,-0.218013,-0.14328,-0.073439,-0.131456,-0.134932,-0.551356,-0.195067
3,0.522695,-0.762175,-0.89528,-0.290008,-0.159015,-0.137486,-0.437842,-0.554457,-0.281732,-0.231191,...,-0.079359,-0.292225,-0.3571,-0.218013,6.979335,-0.073439,-0.131456,-0.134932,-0.551356,-0.195067
4,-0.573409,-1.617754,-0.89528,0.801242,0.965069,-0.68709,-0.643172,-0.554457,-0.281732,-0.231191,...,-0.079359,-0.292225,-0.3571,-0.218013,-0.14328,-0.073439,-0.131456,-0.134932,-0.551356,-0.195067


In [4]:
#fit base model using default parameters
knn = KNeighborsClassifier()
knn.fit(scaled_data_train, y_train)

#predict for test and train data sets
test_preds = knn.predict(scaled_data_test)
train_preds = knn.predict(scaled_data_train)

In [5]:
#evaluate model performance, metrics for test set
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.84      0.73      0.78        37
           1       0.80      0.77      0.78        64
           2       0.91      0.92      0.91       478
           3       0.91      0.92      0.91       317
           4       0.89      0.91      0.90       460
           5       0.92      0.78      0.84        72
           6       0.92      0.92      0.92        64

    accuracy                           0.90      1492
   macro avg       0.88      0.85      0.87      1492
weighted avg       0.90      0.90      0.90      1492



In [6]:
#metrics for train set; accuracy is 0.94 vs 0.90 on the test set, so there is slight overfitting
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       0.89      0.84      0.87       109
           1       0.87      0.87      0.87       157
           2       0.94      0.97      0.96      1431
           3       0.95      0.95      0.95      1008
           4       0.94      0.94      0.94      1337
           5       0.90      0.80      0.85       205
           6       0.97      0.95      0.96       227

    accuracy                           0.94      4474
   macro avg       0.92      0.90      0.91      4474
weighted avg       0.94      0.94      0.94      4474



In [7]:
#find best k based on f1 score
fn.optimize_k_knn(scaled_data_train, y_train, scaled_data_test, y_test, min_k=1, max_k=10)

(1, 0.9684986595174263)
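The helper `fn.optimize_k_knn` lives in a separate `functions` module that is not shown here. The sketch below is a hypothetical stand-in, not the author's implementation: it assumes the helper fits one model per k in an inclusive range and returns the `(k, score)` pair with the highest F1 score. The default `score="micro"` is also an assumption; it is consistent with the value returned above, since micro-averaged F1 equals accuracy for single-label multiclass problems and 0.9685 matches the 0.97 test accuracy reported for k=1.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def optimize_k_knn(X_train, y_train, X_test, y_test, min_k=1, max_k=10, score="micro"):
    """Hypothetical stand-in for fn.optimize_k_knn: return (best_k, best_f1)."""
    best_k, best_score = None, -1.0
    for k in range(min_k, max_k + 1):  # assumption: max_k is inclusive
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        f1 = f1_score(y_test, knn.predict(X_test), average=score)
        # strict comparison keeps the smallest k when scores tie
        if f1 > best_score:
            best_k, best_score = k, f1
    return best_k, best_score


# synthetic data, used only to show the call pattern
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
best_k, best_f1 = optimize_k_knn(X_tr, y_tr, X_te, y_te, min_k=1, max_k=10)
print(best_k, round(best_f1, 3))
```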

In [8]:
#fit model with parameter k=1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(scaled_data_train, y_train)
test_preds = knn.predict(scaled_data_test)
train_preds = knn.predict(scaled_data_train)

In [9]:
#checking testing metrics
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.97      1.00      0.99        37
           1       0.97      0.94      0.95        64
           2       0.97      0.97      0.97       478
           3       0.97      0.98      0.97       317
           4       0.97      0.96      0.97       460
           5       0.93      0.96      0.95        72
           6       0.98      0.98      0.98        64

    accuracy                           0.97      1492
   macro avg       0.97      0.97      0.97      1492
weighted avg       0.97      0.97      0.97      1492



In [10]:
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       109
           1       1.00      1.00      1.00       157
           2       1.00      1.00      1.00      1431
           3       1.00      1.00      1.00      1008
           4       1.00      1.00      1.00      1337
           5       1.00      1.00      1.00       205
           6       1.00      1.00      1.00       227

    accuracy                           1.00      4474
   macro avg       1.00      1.00      1.00      4474
weighted avg       1.00      1.00      1.00      4474



The optimal k value is 1: a car's single nearest neighbor is usually of the same class, which suggests that some features are highly deterministic in predicting classes.
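Note that the k search above is scored on the test set, so the test metrics reported for the tuned model may be optimistic. One common alternative is to choose k by cross-validation on the training data only and keep the test set for a final check. The following is a minimal sketch of that idea on synthetic data; the pipeline step names, the 5-fold setting and the macro F1 scoring are illustrative choices, not taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the car data
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# scaling sits inside the pipeline so each CV fold is scaled on its own training part
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 11))},
                      scoring="f1_macro", cv=5)
search.fit(X_tr, y_tr)

cv_best_k = search.best_params_["knn__n_neighbors"]
test_score = search.score(X_te, y_te)  # macro F1 on data never used for tuning
print(cv_best_k, round(test_score, 3))
```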

In [11]:
#remove Popularity from features
scaled_df_train_2 = scaled_df_train.drop('Popularity', axis=1)
scaled_df_test_2 = scaled_df_test.drop('Popularity', axis=1)

In [12]:
knn = KNeighborsClassifier()
knn.fit(scaled_df_train_2, y_train)
test_preds = knn.predict(scaled_df_test_2)
train_preds = knn.predict(scaled_df_train_2)

In [13]:
#metrics for test data
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.72      0.76      0.74        37
           1       0.84      0.75      0.79        64
           2       0.90      0.90      0.90       478
           3       0.87      0.90      0.89       317
           4       0.86      0.87      0.86       460
           5       0.83      0.67      0.74        72
           6       0.92      0.91      0.91        64

    accuracy                           0.87      1492
   macro avg       0.85      0.82      0.83      1492
weighted avg       0.87      0.87      0.87      1492



In [14]:
#metrics for training data
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       0.87      0.86      0.87       109
           1       0.88      0.85      0.86       157
           2       0.94      0.95      0.95      1431
           3       0.93      0.94      0.93      1008
           4       0.92      0.92      0.92      1337
           5       0.89      0.76      0.82       205
           6       0.96      0.94      0.95       227

    accuracy                           0.93      4474
   macro avg       0.91      0.89      0.90      4474
weighted avg       0.93      0.93      0.93      4474



In [15]:
#check for best k parameter by optimizing f1 score
fn.optimize_k_knn(scaled_df_train_2, y_train, scaled_df_test_2, y_test, min_k=1, max_k=10)

(1, 0.9403485254691689)

In [16]:
#build model with k=1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(scaled_df_train_2, y_train)
test_preds = knn.predict(scaled_df_test_2)
train_preds = knn.predict(scaled_df_train_2)

In [17]:
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.86      0.97      0.91        37
           1       1.00      0.94      0.97        64
           2       0.95      0.96      0.95       478
           3       0.95      0.96      0.96       317
           4       0.94      0.93      0.93       460
           5       0.83      0.79      0.81        72
           6       0.97      0.95      0.96        64

    accuracy                           0.94      1492
   macro avg       0.93      0.93      0.93      1492
weighted avg       0.94      0.94      0.94      1492



In [18]:
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       109
           1       1.00      1.00      1.00       157
           2       1.00      1.00      1.00      1431
           3       1.00      1.00      1.00      1008
           4       1.00      1.00      1.00      1337
           5       1.00      1.00      1.00       205
           6       1.00      1.00      1.00       227

    accuracy                           1.00      4474
   macro avg       1.00      1.00      1.00      4474
weighted avg       1.00      1.00      1.00      4474



The optimal k value is again 1, suggesting that at least one feature other than Popularity is still highly deterministic in predicting classes.

Investigate the impact of features on model performance by starting with a small set of features, then adding features to evaluate their effect on model performance.
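The cells that follow repeat the same fit, predict and report steps for each feature set. As an optional convenience, those steps could be wrapped in a small helper that returns one row of metrics per feature set. The sketch below uses synthetic data with placeholder column names; `evaluate_features` is a hypothetical name and is not part of the `functions` module.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate_features(df_train, y_train, df_test, y_test, cols, k=5):
    """Fit KNN on the given columns and return test accuracy and macro F1."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(df_train[cols], y_train)
    preds = knn.predict(df_test[cols])
    return {"features": ", ".join(cols),
            "accuracy": accuracy_score(y_test, preds),
            "macro_f1": f1_score(y_test, preds, average="macro")}


# synthetic data with placeholder column names
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=3, random_state=0)
df = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])
df_tr, df_te, y_tr, y_te = train_test_split(df, y, random_state=0)

feature_sets = [["f0", "f1"], ["f0", "f1", "f2"], ["f2"]]
results = pd.DataFrame(
    [evaluate_features(df_tr, y_tr, df_te, y_te, cols) for cols in feature_sets]
)
print(results)
```

With the notebook's data, the same call would take `scaled_df_train`, `scaled_df_test` and lists such as `['highway MPG', 'city mpg', 'Engine HP']`.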

In [19]:
#check available features
X_train.columns

Index(['Year', 'Engine HP', 'Engine Cylinders', 'highway MPG', 'city mpg',
       'Popularity', 'MSRP', 'Performance', 'Factory Tuner', 'Hybrid',
       'High-Performance', 'Hatchback', 'Exotic', 'Crossover', 'Luxury',
       'Flex Fuel', 'Diesel', 'x0_diesel', 'x0_electric',
       'x0_flex-fuel (premium unleaded recommended/E85)',
       'x0_flex-fuel (premium unleaded required/E85)',
       'x0_flex-fuel (unleaded/E85)', 'x0_premium unleaded (recommended)',
       'x0_premium unleaded (required)', 'x0_regular unleaded',
       'x1_AUTOMATED_MANUAL', 'x1_AUTOMATIC', 'x1_DIRECT_DRIVE', 'x1_MANUAL',
       'x2_all wheel drive', 'x2_four wheel drive', 'x2_front wheel drive',
       'x2_rear wheel drive', 'x3_Compact', 'x3_Large', 'x3_Midsize',
       'x4_2dr Hatchback', 'x4_4dr Hatchback', 'x4_4dr SUV', 'x4_Cargo Van',
       'x4_Convertible', 'x4_Coupe', 'x4_Crew Cab Pickup',
       'x4_Extended Cab Pickup', 'x4_Passenger Minivan', 'x4_Passenger Van',
       'x4_Regular Cab Pickup', 'x

In [20]:
#start with just two features
col_features = ['highway MPG', 'city mpg']

scaled_data_train_num = scaled_df_train[col_features]
scaled_data_test_num = scaled_df_test[col_features]

In [21]:
#fit model
knn = KNeighborsClassifier()
knn.fit(scaled_data_train_num, y_train)
test_preds = knn.predict(scaled_data_test_num)
train_preds = knn.predict(scaled_data_train_num)

In [22]:
#evaluate model; it performs well below the models with all features
print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.67      0.05      0.10        37
           1       0.27      0.14      0.19        64
           2       0.54      0.67      0.60       478
           3       0.48      0.44      0.46       317
           4       0.55      0.61      0.58       460
           5       0.38      0.14      0.20        72
           6       0.63      0.42      0.50        64

    accuracy                           0.53      1492
   macro avg       0.50      0.35      0.38      1492
weighted avg       0.52      0.53      0.51      1492



In [23]:
print(metric.classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       0.62      0.09      0.16       109
           1       0.40      0.24      0.30       157
           2       0.57      0.72      0.64      1431
           3       0.57      0.46      0.51      1008
           4       0.59      0.67      0.63      1337
           5       0.58      0.20      0.29       205
           6       0.67      0.39      0.50       227

    accuracy                           0.57      4474
   macro avg       0.57      0.40      0.43      4474
weighted avg       0.57      0.57      0.56      4474



In [24]:
fn.optimize_k_knn(scaled_data_train_num, y_train, scaled_data_test_num, y_test, min_k=1, max_k=10, score='macro')

(8, 0.40574613368571555)

In [25]:
#best k is 8: with only two features the model requires a higher number of neighbors

In [26]:
#add Engine HP
col_features = ['highway MPG', 'city mpg', 'Engine HP']

scaled_data_train_num = scaled_df_train[col_features]
scaled_data_test_num = scaled_df_test[col_features]

knn = KNeighborsClassifier()
knn.fit(scaled_data_train_num, y_train)
test_preds = knn.predict(scaled_data_test_num)
train_preds = knn.predict(scaled_data_train_num)

print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.70      0.84      0.77        37
           1       0.81      0.59      0.68        64
           2       0.88      0.89      0.88       478
           3       0.84      0.80      0.82       317
           4       0.83      0.90      0.86       460
           5       0.73      0.53      0.61        72
           6       0.83      0.83      0.83        64

    accuracy                           0.84      1492
   macro avg       0.80      0.77      0.78      1492
weighted avg       0.84      0.84      0.84      1492



Model performance improves drastically (test accuracy rises from 0.53 to 0.84), which means Engine HP is a highly deterministic feature.

In [27]:
fn.optimize_k_knn(scaled_data_train_num, y_train, scaled_data_test_num, y_test, min_k=1, max_k=10, score='macro')

(1, 0.8930854344258243)

In [28]:
col_features = ['highway MPG', 'city mpg', 'MSRP']

scaled_data_train_num = scaled_df_train[col_features]
scaled_data_test_num = scaled_df_test[col_features]

knn = KNeighborsClassifier()
knn.fit(scaled_data_train_num, y_train)
test_preds = knn.predict(scaled_data_test_num)
train_preds = knn.predict(scaled_data_train_num)

print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.43      0.41      0.42        37
           1       0.31      0.25      0.28        64
           2       0.65      0.69      0.67       478
           3       0.71      0.73      0.72       317
           4       0.67      0.71      0.69       460
           5       0.41      0.18      0.25        72
           6       0.67      0.59      0.63        64

    accuracy                           0.65      1492
   macro avg       0.55      0.51      0.52      1492
weighted avg       0.64      0.65      0.64      1492



In [29]:
fn.optimize_k_knn(scaled_data_train_num, y_train, scaled_data_test_num, y_test, min_k=1, max_k=10, score='macro')

(1, 0.6494737286044253)

Adding the 'MSRP' feature increases model performance (test accuracy of 0.65 vs 0.53 with the two MPG features alone), but not nearly as much as adding 'Engine HP' (0.84).

In [30]:
#fit model using only Engine HP
col_features = ['Engine HP']

scaled_data_train_num = scaled_df_train[col_features]
scaled_data_test_num = scaled_df_test[col_features]

knn = KNeighborsClassifier()
knn.fit(scaled_data_train_num, y_train)
test_preds = knn.predict(scaled_data_test_num)
train_preds = knn.predict(scaled_data_train_num)

print(metric.classification_report(y_test, test_preds))

              precision    recall  f1-score   support

           0       0.53      0.73      0.61        37
           1       0.40      0.50      0.44        64
           2       0.75      0.82      0.78       478
           3       0.65      0.67      0.66       317
           4       0.79      0.73      0.76       460
           5       0.46      0.24      0.31        72
           6       0.72      0.56      0.63        64

    accuracy                           0.70      1492
   macro avg       0.61      0.61      0.60      1492
weighted avg       0.70      0.70      0.70      1492



Using only the 'Engine HP' feature already yields 70% accuracy and a 0.60 macro F1 score, showing that it has the highest impact on prediction among the features tested.
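A complementary way to rank features, different from the add-one-feature comparison used in this notebook, is permutation importance: shuffle one column of the test data at a time and measure how much the score drops. The sketch below applies scikit-learn's `permutation_importance` to a KNN model on synthetic data; the feature indices, `n_repeats` and the macro F1 scoring are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the scaled car features
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier().fit(X_tr, y_tr)

# shuffle each column 10 times and record the mean drop in macro F1
result = permutation_importance(knn, X_te, y_te, scoring="f1_macro",
                                n_repeats=10, random_state=0)

# feature indices ordered from largest to smallest mean score drop
ranking = result.importances_mean.argsort()[::-1]
for i in ranking:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```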

#### Summary

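- The KNN model with k=1 predicts carmaker origin on test data with 94% accuracy and a 0.93 macro F1-score without the Popularity feature (97% accuracy and 0.97 with all features)
- The model shows slight overfitting: training data reaches 100% accuracy and a 1.0 F1-score, which is expected at k=1 because each training point is its own nearest neighbor
- Because the optimized k is 1, a single nearest neighbor is enough to predict the class, which suggests there is a dominant feature that is highly deterministic in making predictions
- After investigation, the 'Engine HP' feature is confirmed to have a high impact on predictions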
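When reading the training metrics for k=1, it helps to keep in mind that a 1-nearest-neighbor model scores perfectly on its own training data whenever no two identical rows carry different labels, because each point's nearest neighbor is itself. The small sketch below illustrates this with random features and random labels that contain no real signal; all data here is synthetic.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))      # random continuous features, no duplicate rows
y = rng.integers(0, 7, size=200)   # random labels for 7 classes, unrelated to X

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
train_acc = knn1.score(X, y)       # each point is matched to itself at distance 0
print(train_acc)
```

For that reason, the test-set metrics are the more informative indicator of how well the k=1 model generalizes.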