KNN assignment.
Fahimeh Gholami

## Part 1: Familiarization and Basic Testing of the kNN Algorithm

### 1.1 Dataset Selection
- We have chosen the Mushroom Classification dataset from [Kaggle](https://www.kaggle.com/datasets/uciml/mushroom-classification).
- This dataset consists of 23 columns: 1 target column and 22 feature columns, all of which are categorical.
- The target column indicates whether a mushroom is edible or not, while the 22 feature columns provide classification attributes.
- Since every row carries a known edible/poisonous label, the dataset is a good fit for a classification algorithm such as kNN.

In [30]:
# import the libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

#### Data cleaning

In [31]:
# load the data
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1xh2GbpscpCIe4ypUfRyfoIFABrUWk4zy")
df

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,...,s,o,o,p,o,o,p,b,c,l
8120,e,x,s,n,f,n,a,c,b,y,...,s,o,o,p,n,o,p,b,v,l
8121,e,f,s,n,f,n,a,c,b,n,...,s,o,o,p,o,o,p,b,c,l
8122,p,k,y,n,f,y,f,c,n,b,...,k,w,w,p,w,o,e,w,v,l


In [32]:
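# Check for missing values: none found.
# Note: the UCI documentation marks missing stalk-root values with "?", which isnull() does not detect.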
df.isnull().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

In [33]:
# Check for duplicated rows: none found.
df.duplicated().sum()

0

In [34]:
# Summary statistics per column: count, number of unique values, most frequent value and its frequency.
df.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


### 1.2 Algorithm Application

#### Converting the Categorical Variables to Numeric Codes

In [35]:
# Encode every column's letter codes as integers (the encoder is refit for each column).
labelEncoder = LabelEncoder()
df = df.apply(labelEncoder.fit_transform)
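
A note on the encoding choice: `LabelEncoder` assigns integer codes in sorted order of the category labels, so a distance-based method such as kNN treats codes 0 and 8 as further apart than codes 0 and 1, even though the categories have no natural order. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a small hypothetical sample (the rows are made up for illustration, not taken from the dataset) and is not the encoding used in this notebook.

```python
import pandas as pd

# Tiny hypothetical sample in the same letter-coded style as the mushroom data.
sample = pd.DataFrame({
    "odor": ["p", "a", "l", "p"],
    "habitat": ["u", "g", "m", "u"],
})

# One indicator column per (feature, category) pair; no ordering is implied.
encoded = pd.get_dummies(sample, dtype=int)
print(encoded.columns.tolist())

# With indicator columns, the squared Euclidean distance between two rows is
# twice the number of features on which they differ.
d_01 = int(((encoded.iloc[0] - encoded.iloc[1]) ** 2).sum())  # rows differ on both features
d_03 = int(((encoded.iloc[0] - encoded.iloc[3]) ** 2).sum())  # rows are identical
print(d_01, d_03)
```

With this representation every mismatch between two mushrooms contributes equally to the distance, regardless of which letters are involved.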

#### Separating Features and Target Variable
- We set the column "class" as y, the target variable.
- We set the remaining columns as x, the feature variables.

In [36]:
x = df.iloc[:, 1:]
y = df.iloc[:, :1].squeeze()

In [37]:
x

Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,5,2,4,1,6,1,0,1,4,0,...,2,7,7,0,2,1,4,2,3,5
1,5,2,9,1,0,1,0,0,4,0,...,2,7,7,0,2,1,4,3,2,1
2,0,2,8,1,3,1,0,0,5,0,...,2,7,7,0,2,1,4,3,2,3
3,5,3,8,1,6,1,0,1,5,0,...,2,7,7,0,2,1,4,2,3,5
4,5,2,3,0,5,1,1,0,4,1,...,2,7,7,0,2,1,0,3,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,3,2,4,0,5,0,0,0,11,0,...,2,5,5,0,1,1,4,0,1,2
8120,5,2,4,0,5,0,0,0,11,0,...,2,5,5,0,0,1,4,0,4,2
8121,2,2,4,0,5,0,0,0,5,0,...,2,5,5,0,1,1,4,0,1,2
8122,3,3,4,0,8,1,0,1,0,1,...,1,7,7,0,2,1,0,7,4,2


In [38]:
y

0       1
1       0
2       0
3       1
4       0
       ..
8119    0
8120    0
8121    0
8122    1
8123    0
Name: class, Length: 8124, dtype: int64

#### Implementing the k-NN

In [39]:
# Train a kNN classifier on one train/test split and print its test accuracy.
def knn_classifier_and_evaluate(K, test_size_value):
    # Split the dataset into training and testing sets.
    # test_size_value is the fraction of rows held out for testing; the rest is used for training.
    # A fixed random_state makes the split reproducible, so the same rows are allocated to the training and testing sets each time.
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size_value, shuffle=True, random_state=0)

    # Fit the kNN classifier with K neighbors and evaluate its accuracy on the test set.
    knn = KNeighborsClassifier(n_neighbors=K)
    knn.fit(x_train, y_train)
    y_pred_sklearn = knn.predict(x_test)
    score = accuracy_score(y_test, y_pred_sklearn)
    print(f"K: {K}, test_size: {test_size_value}, score: {score}")

We first set K to 3 and test_size_value to 0.2, which gave an accuracy of 99.88%.

In [40]:
K = 3
test_size_value = 0.2
knn_classifier_and_evaluate(K, test_size_value)

K: 3, test_size: 0.2, score: 0.9987692307692307
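
To make the mechanics behind `KNeighborsClassifier` more concrete, here is a minimal from-scratch sketch of the kNN decision rule: find the K closest training rows and take a majority vote. It is an illustration only. The helper names `hamming` and `knn_predict` and the toy rows are hypothetical, and the sketch uses Hamming distance (the number of mismatching attributes), whereas scikit-learn's default metric is Euclidean distance on the encoded values.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions where two equal-length rows differ."""
    return sum(u != v for u, v in zip(a, b))

def knn_predict(train_rows, train_labels, query, k=3):
    """Return the majority label among the k training rows closest to query."""
    # Rank training rows by distance to the query; sorted() is stable,
    # so rows at equal distance keep their original order.
    order = sorted(range(len(train_rows)), key=lambda i: hamming(train_rows[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy rows (odor, bruises, gill-size); not taken from the dataset.
train_rows = [("p", "t", "n"), ("a", "t", "b"), ("l", "t", "b"), ("p", "t", "n"), ("n", "f", "b")]
train_labels = ["p", "e", "e", "p", "e"]

pred_a = knn_predict(train_rows, train_labels, ("p", "t", "n"), k=3)
pred_b = knn_predict(train_rows, train_labels, ("a", "t", "b"), k=3)
print(pred_a, pred_b)
```

kNN has no training phase beyond storing the data, so all the work happens at prediction time, and the cost of each prediction grows with the size of the training set.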


### 1.3 Use Case Identification

## Part 2: In-Depth Experimentation with the kNN Algorithm

### 2.1 Parameter Experimentation
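- We tried several values of K with test_size_value fixed at 0.2.
- K = 3 gave the highest accuracy in these tests, although all five values scored within about 0.25 percentage points of each other.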

In [41]:
# We created a list for candidate values of K.
K_list = [3, 5, 7, 9, 11]
test_size_value = 0.2
# We tested each candidate K.
for K in K_list:
    knn_classifier_and_evaluate(K, test_size_value)

K: 3, test_size: 0.2, score: 0.9987692307692307
K: 5, test_size: 0.2, score: 0.9969230769230769
K: 7, test_size: 0.2, score: 0.9963076923076923
K: 9, test_size: 0.2, score: 0.9969230769230769
K: 11, test_size: 0.2, score: 0.9963076923076923
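
Because the scores above come from a single split and differ only slightly, a more robust way to compare candidate values of K is to average the accuracy over several folds. The sketch below does this with `cross_val_score`. It runs on a synthetic stand-in for the encoded features (`X_demo` and `y_demo` are assumptions made for this example, not the mushroom data), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for integer-coded categorical features (assumption):
# 300 rows, 6 columns with codes 0..3, label determined by the first column.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(300, 6))
y_demo = (X_demo[:, 0] >= 2).astype(int)

mean_scores = {}
for k in [3, 5, 7, 9, 11]:
    # 5-fold cross-validated accuracy for this K.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    mean_scores[k] = scores.mean()

# Pick the K with the highest mean accuracy across folds.
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best K:", best_k)
```

Applied to the notebook's own `x` and `y`, the same loop would rank the candidates on five folds instead of one split.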


### 2.2 Train-Test Split Analysis
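- We fixed K at 3 and varied test_size_value.
- Accuracy was lowest (98.40%) when only 10% of the rows were used for training and rose to about 99.9% once half or more of the rows were used; the 50/50 split scored marginally higher than the 80/20 split.
- Using more data for training leaves fewer rows for testing, so the accuracy estimate itself becomes noisier. This is a trade-off in how reliable the evaluation is rather than a sign of overfitting.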

In [42]:
K = 3
test_size_values = [0.9, 0.7, 0.5, 0.2]
for test_size_value in test_size_values:
    knn_classifier_and_evaluate(K, test_size_value)

K: 3, test_size: 0.9, score: 0.9839989059080962
K: 3, test_size: 0.7, score: 0.995604009143661
K: 3, test_size: 0.5, score: 0.999015263417036
K: 3, test_size: 0.2, score: 0.9987692307692307
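
Each score above comes from one random split, so part of the difference between test sizes may be due to which rows happened to land in the test set. One way to check this is to repeat the split several times per test size and look at the mean and spread. The sketch below does so with `ShuffleSplit` on synthetic stand-in data (`X_demo` and `y_demo` are assumptions for this example, not the mushroom data).

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for integer-coded categorical features (assumption).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(300, 6))
y_demo = (X_demo[:, 0] >= 2).astype(int)

summary = {}
for test_size in [0.9, 0.7, 0.5, 0.2]:
    # Ten independent random train/test splits at this test fraction.
    splitter = ShuffleSplit(n_splits=10, test_size=test_size, random_state=0)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X_demo, y_demo, cv=splitter)
    summary[test_size] = (scores.mean(), scores.std())
    print(f"test_size: {test_size}, mean: {scores.mean():.4f}, std: {scores.std():.4f}")
```

If the standard deviation at a given test size is comparable to the gap between two test sizes, that gap should not be over-interpreted.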


### 2.3 Implementing the K-Fold Cross Validation
- We implemented 5-fold cross-validation: the dataset is divided into 5 groups (folds), and the model is trained 5 times.
- In each iteration, we used 4 folds as the training dataset and the remaining fold as the test dataset.
- We set k_neighbors to 3.
- The results show that the model's accuracy is close to 100% in every fold (between 99.88% and 100%). This indicates that the algorithm performs well on this dataset.

In [43]:

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

k_neighbors = 3
accuracies = []

for train_index, test_index in kf.split(x):
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    knn = KNeighborsClassifier(n_neighbors=k_neighbors)
    knn.fit(x_train, y_train)
    y_pred = knn.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)

average_accuracy = np.mean(accuracies)
print(f"Accuracies for each fold: {accuracies}")
print(f"Average Accuracy across all folds: {average_accuracy:.2f}")

Accuracies for each fold: [0.9987692307692307, 0.9987692307692307, 1.0, 0.9993846153846154, 1.0]
Average Accuracy across all folds: 1.00
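
The fold structure described above can be checked on a small example. The sketch below runs `KFold` with the same settings on ten dummy indices (an assumption made for illustration) and counts how often each index appears in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy sample indices stand in for the rows of the dataset.
indices = np.arange(10)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

test_counts = np.zeros(len(indices), dtype=int)
fold_sizes = []
for train_idx, test_idx in kf_demo.split(indices):
    # Each iteration trains on 4 folds and tests on the remaining one.
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)
print(test_counts)
```

Every index lands in a test fold exactly once, so the five fold accuracies together cover the whole dataset without reusing any row for testing.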
