### K nearest neighbors

I ran several randomized searches to tune the hyperparameters.

Each search was slow (about 5.8 minutes for 50 fits on 4 workers) and none produced a promising model. The best one overfit badly, so I moved on to other models.

Training accuracy score: 0.999430103922226

Testing accuracy score: 0.5524939662107804

In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, plot_confusion_matrix

In [2]:
df = pd.read_json('../data/cleaned_v1.json')

In [3]:
X = df['ingredients']
y = df['cuisine']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [5]:
pipe1 = Pipeline([
    ('cvec', CountVectorizer()),               # bag-of-words counts (sparse matrix)
    ('ss', StandardScaler(with_mean=False)),   # scale only; centering is not supported on sparse input
    ('knn', KNeighborsClassifier())
])
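
`with_mean=False` matters in this pipeline because `CountVectorizer` returns a SciPy sparse matrix, and `StandardScaler` refuses to center sparse input. Here is a minimal sketch on made-up ingredient strings (the `docs` list is illustrative and not taken from the dataset):

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy documents, standing in for the 'ingredients' column.
docs = ["salt pepper garlic", "soy sauce garlic ginger", "salt butter flour"]

counts = CountVectorizer().fit_transform(docs)

# Scaling without centering keeps the matrix sparse.
scaled = StandardScaler(with_mean=False).fit_transform(counts)
is_sparse_before = sparse.issparse(counts)
is_sparse_after = sparse.issparse(scaled)

# The default (with_mean=True) raises on sparse input.
try:
    StandardScaler().fit_transform(counts)
    centering_error = None
except ValueError as err:
    centering_error = str(err)

print(counts.shape, is_sparse_before, is_sparse_after)
print(centering_error)
```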

In [12]:
pipe1_params = {
    'cvec__min_df': [1,2,3,4],
    'cvec__max_features': [2250, 2500, 2750, 3000],
    'knn__n_neighbors': [3,4,5,6],
    'knn__weights': ['distance']
}

In [13]:
rs1 = RandomizedSearchCV(estimator=pipe1,
                        param_distributions=pipe1_params,
                        cv=5,
                        scoring="accuracy",
                        n_jobs=-1,
                        verbose=1)

In [14]:
rs1.fit(X_train, y_train) 

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  5.8min finished


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                             ('ss',
                                              StandardScaler(with_mean=False)),
                                             ('knn', KNeighborsClassifier())]),
                   n_jobs=-1,
                   param_distributions={'cvec__max_features': [2250, 2500, 2750,
                                                               3000],
                                        'cvec__min_df': [1, 2, 3, 4],
                                        'knn__n_neighbors': [3, 4, 5, 6],
                                        'knn__weights': ['distance']},
                   scoring='accuracy', verbose=1)
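
The log reports 10 candidates because `RandomizedSearchCV` defaults to `n_iter=10`, while the grid in `pipe1_params` holds 64 combinations, so a single search covers only a fraction of it. The quick check below re-declares the same grid so that it runs on its own. The `random_state=42` seed is an assumption added for reproducibility, since the search above did not set one:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same grid as pipe1_params above.
pipe1_params = {
    'cvec__min_df': [1, 2, 3, 4],
    'cvec__max_features': [2250, 2500, 2750, 3000],
    'knn__n_neighbors': [3, 4, 5, 6],
    'knn__weights': ['distance'],
}

# Total number of distinct combinations: 4 * 4 * 4 * 1.
n_total = len(ParameterGrid(pipe1_params))

# What a randomized search with the default n_iter=10 would draw.
# When every entry is a list, sampling is done without replacement.
sampled = list(ParameterSampler(pipe1_params, n_iter=10, random_state=42))

print(f"{len(sampled)} of {n_total} combinations sampled")
```

Raising `n_iter` or switching to `GridSearchCV` would cover more of the grid, at a proportionally higher runtime.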

In [15]:
rs1.best_score_

0.5513241702983573

In [16]:
rs1.best_params_

{'knn__weights': 'distance',
 'knn__n_neighbors': 6,
 'cvec__min_df': 2,
 'cvec__max_features': 2750}

In [17]:
rs1.score(X_train, y_train), rs1.score(X_test, y_test)

(0.999430103922226, 0.5524939662107804)
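
The near-perfect training score is largely a side effect of `weights='distance'`. When the model predicts on the data it was fitted on, each sample is its own nearest neighbor at distance zero, so its own label dominates the vote. Training accuracy is therefore not a useful signal here. The cross-validated score (0.551) and the test score (0.552) are the numbers that matter. The sketch below uses synthetic data, with `make_classification` standing in for the recipe data purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, deliberately noisy data (flip_y randomizes 30% of the labels).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    flip_y=0.3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

knn = KNeighborsClassifier(n_neighbors=6, weights='distance').fit(X_tr, y_tr)

# On the training set every point finds itself at distance 0,
# so the score is (near) perfect regardless of label noise.
train_acc = knn.score(X_tr, y_tr)
test_acc = knn.score(X_te, y_te)

print(train_acc, test_acc)
```

Comparing against `weights='uniform'`, or passing `return_train_score=True` to the search, would give a training score that is easier to interpret.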