# k-Nearest Neighbors

This notebook explores the performance of the k-Nearest Neighbors Classification Model. 

First, we import the libraries and load both encodings of the training data: one-hot (`ohe`) and TF-IDF (`tfidf`).

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [2]:
train_data = pd.read_csv("../data/ohe_train_recipes_v2.csv",index_col="id")
train_data_tfidf = pd.read_csv("../data/tfidf_train_recipes_v2.csv",index_col="id")

In [3]:
train_data.head(2)

id,1% buttermilk,1% chocolate milk,1% cottage cheese,1% milk,"2 1/2 to 3 lb. chicken, cut into serving pieces",2% cottage cheese,2% low fat cheddar chees,2% lowfat greek yogurt,2% milk mozzarella cheese,2% reduced-fat milk,...,yuzu,yuzu juice,za'atar,zest,zesty italian dressing,zinfandel,ziti,zucchini,zucchini blossoms,cuisine
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,spanish
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,mexican


In [4]:
train_data_tfidf = train_data_tfidf.merge(train_data['cuisine'], left_index=True, right_index=True)

Create 70/30 train/validation splits of both encodings, using the same random seed for each.

In [5]:
X_train, X_val, y_train, y_val = train_test_split(train_data.drop(columns=['cuisine']),
                                                  train_data['cuisine'],
                                                  test_size=0.3,random_state=22)

In [6]:
X_train_tf, X_val_tf, y_train_tf, y_val_tf = train_test_split(train_data_tfidf.drop(columns=['cuisine']),
                                                              train_data_tfidf['cuisine'],
                                                              test_size=0.3,random_state=22)
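Because both calls use the same `test_size` and `random_state`, the two splits should select the same rows as long as both frames have the same number of rows in the same order. The sketch below checks this on toy data; the frames `ohe` and `tfidf` and their single `egg` column are hypothetical stand-ins, not the recipe data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the two encodings: same rows, same index, different feature values.
idx = pd.Index(range(10), name="id")
ohe = pd.DataFrame({"egg": [0, 1] * 5, "cuisine": list("ababababab")}, index=idx)
tfidf = pd.DataFrame({"egg": np.linspace(0.0, 0.9, 10), "cuisine": ohe["cuisine"]}, index=idx)

# Split each encoding with identical test_size and random_state.
X_a_train, X_a_val, y_a_train, y_a_val = train_test_split(
    ohe.drop(columns=["cuisine"]), ohe["cuisine"], test_size=0.3, random_state=22)
X_b_train, X_b_val, y_b_train, y_b_val = train_test_split(
    tfidf.drop(columns=["cuisine"]), tfidf["cuisine"], test_size=0.3, random_state=22)

# The splits are aligned if the same ids land in the same partition.
splits_aligned = X_a_train.index.equals(X_b_train.index) and y_a_val.equals(y_b_val)
print(splits_aligned)
```

A check like this is cheap to run on the real frames before mixing features from one split with labels from the other.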

## Nearest Neighbor Classifiers

Fit and score k-NN with $k=5$ on both encodings of the data (one-hot and TF-IDF), reporting training and validation accuracy.

In [7]:
nn_model = KNeighborsClassifier(n_neighbors=5)
nn_model.fit(X_train, y_train)
nn_model.score(X_train, y_train), nn_model.score(X_val, y_val)

(0.6872957149527675, 0.5320539679879326)

In [32]:
nn_model_tf = KNeighborsClassifier(n_neighbors=5)
nn_model_tf.fit(X_train_tf, y_train_tf)
nn_model_tf.score(X_train_tf, y_train_tf), nn_model_tf.score(X_val_tf, y_val_tf)

(0.811716533170504, 0.7150758401072655)

In [23]:
X_train_tf.shape, X_val_tf.shape, y_val_tf.shape

((27841, 6215), (11933, 6215), (11933,))
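For intuition about what the classifier does, here is a minimal NumPy sketch of k-NN prediction: compute the distance from a query point to every training point, take the $k$ closest, and return the majority label. The function `knn_predict` and the toy data are illustrative only. `KNeighborsClassifier` uses optimized neighbor search and its own tie-breaking, so this is not its implementation.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_query, k=5):
    """Predict a label for each query row by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for row in X_query:
        # Euclidean distance from this query row to every training row.
        distances = np.linalg.norm(X_train - row, axis=1)
        # Indices of the k smallest distances (stable sort keeps ties in input order).
        nearest = np.argsort(distances, kind="stable")[:k]
        # Majority vote over the neighbors' labels.
        label = Counter(y_train[nearest].tolist()).most_common(1)[0][0]
        predictions.append(label)
    return np.array(predictions)

# Two well-separated toy clusters with made-up labels.
toy_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
toy_y = np.array(["italian", "italian", "italian", "thai", "thai", "thai"])
toy_pred = knn_predict(toy_X, toy_y, np.array([[0.2, 0.1], [5.5, 5.5]]), k=3)
print(toy_pred)
```

Since every prediction compares the query against all training rows, cost grows with both the number of rows and the number of features.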

## Grid Search CV

Use 3-fold cross-validation to choose $k$ from the odd values 1 through 13. The search below is run on the TF-IDF validation split (`X_val_tf`, `y_val_tf`).

In [33]:
from sklearn.model_selection import GridSearchCV

In [34]:
parameters = {'n_neighbors':[1,3,5,7,9,11,13]}
neighbors = KNeighborsClassifier()
grid_search = GridSearchCV(neighbors, parameters, cv=3, verbose=2)

In [35]:
grid_search.fit(X_val_tf, y_val_tf)

Fitting 3 folds for each of 7 candidates, totalling 21 fits
[CV] END ......................................n_neighbors=1; total time=   2.1s
[CV] END ......................................n_neighbors=1; total time=   2.0s
[CV] END ......................................n_neighbors=1; total time=   2.0s
[CV] END ......................................n_neighbors=3; total time=   2.0s
[CV] END ......................................n_neighbors=3; total time=   2.1s
[CV] END ......................................n_neighbors=3; total time=   2.0s
[CV] END ......................................n_neighbors=5; total time=   2.0s
[CV] END ......................................n_neighbors=5; total time=   2.0s
[CV] END ......................................n_neighbors=5; total time=   2.0s
[CV] END ......................................n_neighbors=7; total time=   2.0s
[CV] END ......................................n_neighbors=7; total time=   2.0s
[CV] END ......................................n_

In [36]:
grid_search.best_params_, grid_search.best_score_

({'n_neighbors': 11}, 0.6964720976686839)
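A common alternative is to run the search on the training split and keep the validation split untouched for a final check. The sketch below shows that pattern on synthetic data from `make_classification`; the `demo_*` names are hypothetical, and the scores it prints say nothing about the recipe data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem standing in for the real features.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=5,
                                     n_classes=3, random_state=22)
X_demo_train, X_demo_val, y_demo_train, y_demo_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=22)

# Same candidate grid as above; cross-validation happens inside the training split only.
demo_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 13]}
demo_search = GridSearchCV(KNeighborsClassifier(), demo_grid, cv=3)
demo_search.fit(X_demo_train, y_demo_train)

# best_estimator_ is refit on the full training split; score it on held-out data.
demo_val_score = demo_search.score(X_demo_val, y_demo_val)
print(demo_search.best_params_, demo_val_score)
```

With this layout, `best_score_` is the cross-validated estimate used to pick $k$, and the validation score is an independent check on the chosen model.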

## Dimensionality Reduction with PCA

Nearest neighbors may perform poorly in high-dimensional space, where distances between points become less informative. We will use PCA to reduce the dimensionality before applying the classifier.

In [13]:
from sklearn.decomposition import PCA

In [14]:
print("Number of features:",X_train.shape[1])

(27841, 6215)

In [15]:
pca = PCA(n_components=500)
pca.fit(X_train)
X_train_red = pca.transform(X_train)
X_val_red = pca.transform(X_val)

In [16]:
pca_tf = PCA(n_components=500)
pca_tf.fit(X_train_tf)
X_train_tf_red = pca_tf.transform(X_train_tf)
X_val_tf_red = pca_tf.transform(X_val_tf)

In [17]:
nn_model = KNeighborsClassifier(n_neighbors=5)
nn_model.fit(X_train_red, y_train)
nn_model.score(X_train_red, y_train), nn_model.score(X_val_red, y_val)

(0.6986458819726303, 0.5491494175814967)

In [18]:
nn_model = KNeighborsClassifier(n_neighbors=5)
# Use the TF-IDF labels so features and targets come from the same split
nn_model.fit(X_train_tf_red, y_train_tf)
nn_model.score(X_train_tf_red, y_train_tf), nn_model.score(X_val_tf_red, y_val_tf)

(0.7391616680435329, 0.584429732674097)

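Reducing to 500 components gave a small gain on the one-hot features (validation accuracy 0.532 to 0.549) but hurt the TF-IDF features (0.715 to 0.584), so PCA did not improve on the best model.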
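The choice of 500 components is fixed by hand. One way to let the data decide is to pass a float to `PCA(n_components=...)`, which keeps enough components to retain that fraction of the variance. The sketch below wraps this in a pipeline on synthetic data; the 0.95 threshold and the `*_syn` names are assumptions for illustration, not values tuned for the recipes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic data standing in for the high-dimensional recipe features.
X_syn, y_syn = make_classification(n_samples=400, n_features=50, n_informative=10,
                                   random_state=22)
X_syn_train, X_syn_val, y_syn_train, y_syn_val = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=22)

# PCA keeps as many components as needed to retain 95% of the variance,
# and is fit on the training split only before k-NN sees the data.
pca_knn = make_pipeline(PCA(n_components=0.95), KNeighborsClassifier(n_neighbors=5))
pca_knn.fit(X_syn_train, y_syn_train)

fitted_pca = pca_knn.named_steps["pca"]
retained = float(np.sum(fitted_pca.explained_variance_ratio_))
print(fitted_pca.n_components_, retained, pca_knn.score(X_syn_val, y_syn_val))
```

Keeping PCA inside the pipeline also means the same projection is applied automatically at prediction time.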

## Test Predictions

Generate predictions for the test set to evaluate model performance.

In [39]:
test_data = pd.read_csv("../data/ohe_test_recipes_v2.csv",index_col="id")


In [41]:
final_model = grid_search.best_estimator_
test_predictions = final_model.predict(test_data)

In [42]:
pd.Series(test_predictions, index=test_data.index, name='cuisine').to_csv("model_predictions/nearest_neighbors.csv")
## kaggle score: 0.66492
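`predict` expects the test frame to have the same feature columns, in the same order, as the data the model was fit on. A minimal sketch of aligning a test frame to the training columns with `DataFrame.reindex` follows; the ingredient names and frames are made up for illustration.

```python
import pandas as pd

# Hypothetical training and test feature frames with mismatched columns.
train_features = pd.DataFrame({"basil": [1, 0], "garlic": [0, 1], "soy sauce": [0, 1]})
test_features = pd.DataFrame({"garlic": [1], "basil": [0], "wasabi": [1]},
                             index=pd.Index([7], name="id"))

# Keep exactly the training columns, in training order: columns missing from
# the test frame are filled with 0, and columns unseen in training are dropped.
aligned_test = test_features.reindex(columns=train_features.columns, fill_value=0)
print(aligned_test)
```

Filling with 0 suits both encodings here, since an absent ingredient has a one-hot value and a TF-IDF weight of zero.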