# KNN
This notebook walks through preprocessing and hyperparameter selection for a k-nearest neighbors (KNN) classifier.

In [1]:
import pandas as pd
df = pd.read_csv('dataset.csv')
df = df.sample(20000)
df_target = df['humor']
# drop() returns a new DataFrame, so the result has to be assigned
df_data = df.drop(columns='humor')

df_target.head()

35199     False
169917     True
75070     False
10146     False
128242     True
Name: humor, dtype: bool

## Preprocessing
Preprocessing for KNN consists of stemming only, since this approach appeared to give the best results.
The stemmed text is then vectorized with TF-IDF.

In [2]:
from sklearn import preprocessing

# Encode the boolean target as integers (False -> 0, True -> 1)
label_encoder = preprocessing.LabelEncoder()
df_target = label_encoder.fit_transform(df_target)
#df_target

In [3]:
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
import re, string

# On the first run, uncomment the following two lines once to download the NLTK stopword list.
#import nltk
#nltk.download('stopwords')

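# Token pattern for the stemming tokenizer: runs of two or more word characters
# (the same regular expression scikit-learn uses as its default token_pattern)
token_pattern = re.compile(r"(?u)\b\w\w+\b")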

stemmer = PorterStemmer()  # created once and reused for every document

def tokenize(text):
    """Split text into tokens with token_pattern and return the Porter stem of each token."""
    return [stemmer.stem(token) for token in token_pattern.findall(text)]
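
To make the effect of the token pattern concrete, the following stand-alone sketch applies the same regular expression to a made-up sentence. The sentence is an assumption for illustration and is not taken from `dataset.csv`. The sketch uses only the standard library and leaves out the stemming step.

```python
import re

# Same pattern as in the cell above: runs of two or more word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical example sentence, not taken from the dataset.
sample = "Why did I cross the road? To get a coffee, obviously!"

# findall returns every non-overlapping match in order of appearance.
tokens = token_pattern.findall(sample)
print(tokens)
```

Single-character words such as "I" and "a" and all punctuation are dropped, so the pattern does more than split on whitespace. Case is preserved here because this sketch calls the pattern directly; inside `TfidfVectorizer` the text is lowercased by default before the tokenizer is called.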

In [4]:
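# Vectorize the text with TF-IDF, stemming each token via the custom tokenizer.
# min_df=0.01 drops terms that occur in fewer than 1% of the documents.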
stem_vectorizer = TfidfVectorizer(tokenizer=tokenize, min_df=0.01)
matrix = stem_vectorizer.fit_transform(df_data['text'])

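# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 on
df_data_stemmed = pd.DataFrame(matrix.toarray(), columns=stem_vectorizer.get_feature_names_out())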
#display(df_data_stemmed)
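
Because `min_df` is given as a float, scikit-learn interprets it as a proportion of documents rather than an absolute count. The sketch below illustrates this on a toy corpus, assuming a much larger threshold (`min_df=0.5`) so that the effect is visible with only four documents. The corpus and the names `toy_corpus`, `toy_vectorizer`, `toy_matrix` and `kept_terms` are hypothetical and not part of the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the real notebook uses df_data['text'].
toy_corpus = [
    "the cat sat",
    "the cat ran",
    "the dog sat",
    "the bird flew",
]

# min_df=0.5 keeps only terms that appear in at least half of the documents.
toy_vectorizer = TfidfVectorizer(min_df=0.5)
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each kept term to its column index.
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)
```

Only "cat", "sat" and "the" occur in at least two of the four documents, so the resulting matrix has three columns. With `min_df=0.01` and 20000 sampled rows, the same rule keeps terms that occur in at least 200 documents.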




## Grid Search and Cross Validation

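In the following, different values of `n_neighbors`, neighbor-search algorithms, and distance metrics are compared with a 3-fold cross-validated grid search on the training split. The model with the best parameters is then evaluated on the held-out test split, and its accuracy is reported.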

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Create train/test split
df_data_train, df_data_test, df_target_train, df_target_test = train_test_split(
    df_data_stemmed, df_target, test_size=0.2, random_state=42)

grid_params = {
    'n_neighbors': [18, 22, 25],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'metric': ['euclidean', 'cosine'],
}
knn_estimator = KNeighborsClassifier()

estimator = GridSearchCV(knn_estimator, grid_params, cv=3)
estimator.fit(df_data_train, df_target_train)
print(estimator.best_estimator_)

final_estimator_knn = estimator.best_estimator_

df_prediction = final_estimator_knn.predict(df_data_test)

print("params= {} acc: {}".format(estimator.best_params_, accuracy_score(df_target_test, df_prediction)))

18 fits failed out of a total of 72.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
9 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\mcl.NB-MCL\anaconda3\envs\datamining\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\mcl.NB-MCL\anaconda3\envs\datamining\lib\site-packages\sklearn\neighbors\_classification.py", line 198, in fit
    return self._fit(X, y)
  File "C:\Users\mcl.NB-MCL\anaconda3\envs\datamining\lib\site-packages\sklearn\neighbors\_base.py", line 437, in _fit
    self._check_algorithm_metric()
  File "C:\Users\mcl.NB-MCL\anaconda3\envs\datamining\lib\site-packages\sklearn\neighbo

KNeighborsClassifier(metric='cosine', n_neighbors=25)
params= {'algorithm': 'auto', 'metric': 'cosine', 'n_neighbors': 25} acc: 0.81075
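
The 18 failed fits reported in the warning are consistent with the grid pairing `metric='cosine'` with `algorithm='ball_tree'` and `algorithm='kd_tree'`: 2 algorithms × 3 `n_neighbors` values × 3 folds = 18. In scikit-learn the tree-based algorithms do not accept the cosine metric. One possible way to avoid these fits, sketched here under that assumption, is to pass `GridSearchCV` a list of grids so that only valid combinations are tried. The name `grid_params_valid` is hypothetical and does not appear in the notebook.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import VALID_METRICS

# Check which search algorithms list 'cosine' as a supported metric.
for algorithm in ['ball_tree', 'kd_tree', 'brute']:
    print(algorithm, 'cosine' in VALID_METRICS[algorithm])

# Hypothetical replacement for grid_params: one sub-grid per metric,
# pairing cosine only with algorithms that can handle it.
grid_params_valid = [
    {
        'n_neighbors': [18, 22, 25],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
        'metric': ['euclidean'],
    },
    {
        'n_neighbors': [18, 22, 25],
        'algorithm': ['auto', 'brute'],
        'metric': ['cosine'],
    },
]

# Number of parameter combinations the grid search would evaluate.
n_candidates = len(ParameterGrid(grid_params_valid))
print(n_candidates)
```

Passing `grid_params_valid` to `GridSearchCV` in place of `grid_params` would evaluate 18 candidates, or 54 fits with `cv=3`, which matches the 72 − 18 fits that succeeded in the run above.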
