# Team 10's KNN assignment.
Team members:
- Xin Feng
- Fahimeh Gholami
- Mu Zhao

## Introduction

- We have chosen the Mushroom Classification dataset from [Kaggle](https://www.kaggle.com/datasets/uciml/mushroom-classification).
- This dataset consists of 23 columns: 1 target column and 22 feature columns, all of which are categorical.
- The target column indicates whether a mushroom is edible or not, while the 22 feature columns provide classification attributes.
- Therefore, it is an ideal dataset for implementing the KNN algorithm.

In [89]:
# import the libraries
import pandas as pd
import numpy as np
from sklearn import datasets

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

## Data cleaning

In [47]:
# load the data
df = pd.read_csv("https://drive.google.com/uc?export=download&id=1xh2GbpscpCIe4ypUfRyfoIFABrUWk4zy")
df

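| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8119 | e | k | s | n | f | n | a | c | b | y | ... | s | o | o | p | o | o | p | b | c | l |
| 8120 | e | x | s | n | f | n | a | c | b | y | ... | s | o | o | p | n | o | p | b | v | l |
| 8121 | e | f | s | n | f | n | a | c | b | n | ... | s | o | o | p | o | o | p | b | c | l |
| 8122 | p | k | y | n | f | y | f | c | n | b | ... | k | w | w | p | w | o | e | w | v | l |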


In [48]:
# Check for null values: passed (none found).
df.isnull().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
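
One thing `isnull()` cannot detect is a missing value stored as a placeholder string rather than as NaN. The sketch below uses a made-up toy frame and assumes `"?"` as the placeholder; whether this copy of the mushroom data contains such placeholders is not verified here, so treat it as an extra check that could be run, not as a finding.

```python
import pandas as pd

# Toy frame standing in for the mushroom data; the "?" placeholder is an assumption.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "stalk-root": ["e", "?", "b", "?"],
})

# isnull() only sees real NaN values, so the placeholder goes unnoticed.
nan_counts = toy.isnull().sum()

# Comparing against the placeholder string counts the hidden gaps per column.
placeholder_counts = (toy == "?").sum()

print(nan_counts)
print(placeholder_counts)
```

On the real data, the same comparison would be `(df == "?").sum()`.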

In [49]:
# Check for duplicated rows: passed (none found).
df.duplicated().sum()

0

In [50]:
# Get basic statistics: passed.
df.describe()

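| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | ... | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
| unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | ... | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
| top | e | x | y | n | f | n | f | c | b | b | ... | s | w | w | p | w | o | p | w | v | d |
| freq | 4208 | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | ... | 4936 | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 |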


## Implementing the KNN Algorithm

### Separating Features and Target Variable
- We assign every column except "class" to x, the feature variables.
- We then assign the "class" column to y, the target variable.
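
As a cross-check on the positional split, the same separation can be written by column name. This is a minimal sketch on a made-up three-row frame (`toy`, `x_toy`, and `y_toy` are illustrative names, not part of the notebook); it assumes, as in this dataset, that "class" is the first column.

```python
import pandas as pd

# Made-up miniature of the mushroom frame, with "class" as the first column.
toy = pd.DataFrame({
    "class": ["p", "e", "e"],
    "odor": ["p", "a", "l"],
    "habitat": ["u", "g", "m"],
})

# Target: the single "class" column as a Series.
y_toy = toy["class"]
# Features: every column except "class".
x_toy = toy.drop(columns="class")

# The name-based split matches the positional one used in the notebook.
same_features = x_toy.equals(toy.iloc[:, 1:])
same_target = y_toy.equals(toy.iloc[:, :1].squeeze())
print(same_features, same_target)
```

Selecting by name keeps working if the column order ever changes, which the positional `iloc` version does not.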

In [77]:
x = df.iloc[:, 1:]
y = df.iloc[:, :1].squeeze()

In [78]:
x

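| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | x | s | n | t | p | f | c | n | k | e | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | x | s | y | t | a | f | c | b | k | e | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | b | s | w | t | l | f | c | b | n | e | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | x | y | w | t | p | f | c | n | n | e | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | x | s | g | f | n | f | w | b | k | t | ... | s | w | w | p | w | o | e | n | a | g |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8119 | k | s | n | f | n | a | c | b | y | e | ... | s | o | o | p | o | o | p | b | c | l |
| 8120 | x | s | n | f | n | a | c | b | y | e | ... | s | o | o | p | n | o | p | b | v | l |
| 8121 | f | s | n | f | n | a | c | b | n | e | ... | s | o | o | p | o | o | p | b | c | l |
| 8122 | k | y | n | f | y | f | c | n | b | t | ... | k | w | w | p | w | o | e | w | v | l |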


In [79]:
y

0       p
1       e
2       e
3       p
4       e
       ..
8119    e
8120    e
8121    e
8122    p
8123    e
Name: class, Length: 8124, dtype: object

### Splitting the Dataset into Training and Testing Sets
- We allocated 80% of the rows for training and the remaining 20% for testing.
- To randomize the allocation of rows, we set shuffle=True, ensuring that the rows are shuffled before splitting.
- We also set random_state=0 so that the same rows are allocated to the training and testing sets on every run.
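
The effect of a fixed `random_state` can be seen on a small made-up array. The sketch below splits the same toy data twice and compares the results; the `stratify` argument at the end is an optional extra that the notebook does not use, shown only as one way to keep the class ratio equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, labels split 15 "e" / 5 "p" (made up for illustration).
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array(["e"] * 15 + ["p"] * 5)

# Same random_state: the same rows land in each split on every call.
a_train, a_test, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=True, random_state=0)
b_train, b_test, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=True, random_state=0)
same_split = np.array_equal(a_test, b_test)

# Optional (not used in the notebook): stratify keeps the 15:5 class ratio,
# so the 4-row test set gets 3 "e" and 1 "p".
_, _, ys_train, ys_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

print(same_split, sorted(ys_test))
```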

In [80]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=True, random_state=0)

In [81]:
x_train

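| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7434 | k | s | g | f | n | f | w | b | p | e | ... | s | w | w | p | w | t | p | w | s | g |
| 7725 | x | f | w | f | n | f | w | b | g | e | ... | s | w | w | p | w | t | p | w | n | g |
| 783 | x | s | w | t | l | f | c | b | n | e | ... | s | w | w | p | w | o | p | k | s | m |
| 1928 | f | s | w | f | n | f | w | b | h | t | ... | f | w | w | p | w | o | e | k | s | g |
| 7466 | k | y | e | f | y | f | c | n | b | t | ... | k | w | p | p | w | o | e | w | v | l |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4931 | x | y | e | t | n | f | c | b | e | e | ... | s | e | e | p | w | t | e | w | c | w |
| 3264 | x | f | g | f | f | f | c | b | h | e | ... | k | p | n | p | w | o | l | h | y | p |
| 1653 | x | s | g | f | n | f | w | b | h | t | ... | s | w | w | p | w | o | e | n | s | g |
| 2607 | f | f | n | t | n | f | c | b | n | t | ... | s | g | g | p | w | o | p | n | v | d |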


### Converting Categorical Variables to Numerical Format
- Since all the feature columns are categorical, we have to convert them to numbers before distances can be calculated.
- Label encoding would map categories to integers, for example red to 0, yellow to 1, and blue to 2. That makes little sense here: it implies that red is closer to yellow than to blue, while the categories are actually all equally different from each other.
- So we use one-hot encoding instead, which represents each category as a separate binary column. This approach is better suited to distance calculations.
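
The difference between the two encodings can be checked with a few lines of arithmetic. The sketch below uses a hypothetical three-colour feature (the colour names and integer codes are illustrative, not taken from the dataset) and compares pairwise distances under each encoding.

```python
import numpy as np

# Hypothetical label encoding of one colour feature.
label = {"red": 0, "yellow": 1, "blue": 2}

# One-hot encoding of the same three categories.
one_hot = {
    "red": np.array([1, 0, 0]),
    "yellow": np.array([0, 1, 0]),
    "blue": np.array([0, 0, 1]),
}

# Label encoding: red looks twice as far from blue as from yellow.
d_label_ry = abs(label["red"] - label["yellow"])
d_label_rb = abs(label["red"] - label["blue"])

# One-hot encoding: every pair of distinct colours is sqrt(2) apart.
d_hot_ry = np.linalg.norm(one_hot["red"] - one_hot["yellow"])
d_hot_rb = np.linalg.norm(one_hot["red"] - one_hot["blue"])

print(d_label_ry, d_label_rb)
print(d_hot_ry, d_hot_rb)
```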

In [86]:
x_train = pd.get_dummies(x_train)
x_test = pd.get_dummies(x_test)

x_train, x_test = x_train.align(x_test, join='left', axis=1, fill_value=0)

x_train = x_train.astype(int)
x_test = x_test.astype(int)
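
The `align` call matters because `get_dummies` only creates columns for categories it actually sees, so the training and testing frames can end up with different columns. A minimal sketch on made-up data (`train_toy` and `test_toy` are illustrative names) shows what `join='left'` does in that case.

```python
import pandas as pd

train_toy = pd.DataFrame({"odor": ["a", "l", "n"]})
# The test rows lack "l" and "n" and contain "p", which training never saw.
test_toy = pd.DataFrame({"odor": ["a", "p"]})

train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)

# join='left' keeps exactly the training columns: columns missing from the
# test frame are added and filled with 0, and test-only columns (odor_p)
# are dropped.
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(test_enc.columns))
print(test_enc.astype(int).values.tolist())
```

Note that the second test row ends up all zeros, because its category was unseen during training.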

In [87]:
x_train

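| | cap-shape_b | cap-shape_c | cap-shape_f | cap-shape_k | cap-shape_s | cap-shape_x | cap-surface_f | cap-surface_g | cap-surface_s | cap-surface_y | ... | population_s | population_v | population_y | habitat_d | habitat_g | habitat_l | habitat_m | habitat_p | habitat_u | habitat_w |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7434 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 7725 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 783 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1928 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 7466 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4931 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3264 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1653 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2607 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |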


### Normalizing the Data
- After one-hot encoding, a variable with more unique values expands into more columns, which means such variables may carry more weight in the distance calculation.
- It is better for us to scale the data to limit this side effect.
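
As a small illustration of what standardization does to 0/1 columns, the sketch below scales two made-up one-hot columns, one for a common category and one for a rare category. After scaling, each column has mean 0 and unit variance, so the gap between its "0" and "1" values becomes 1 divided by the column's standard deviation. The column names and counts are assumptions chosen for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy one-hot columns: a common category (5 of 10 rows) and a rare one (1 of 10).
common = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
rare = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
X_toy = np.column_stack([common, rare]).astype(float)

scaled = StandardScaler().fit_transform(X_toy)

# Gap between the "0" and "1" value of each column after scaling.
gap_common = scaled[:, 0].max() - scaled[:, 0].min()
gap_rare = scaled[:, 1].max() - scaled[:, 1].min()

print(gap_common, gap_rare)
```

The rare column ends up with the larger gap, so a mismatch on a rare category contributes more to the Euclidean distance than a mismatch on a common one. That is worth keeping in mind when interpreting distances on the scaled data.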

In [95]:
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [96]:
x_train_scaled

array([[-0.24144708, -0.02149006, -0.80124335, ..., -0.40325154,
        -0.21921853, -0.15836308],
       [-0.24144708, -0.02149006, -0.80124335, ..., -0.40325154,
        -0.21921853, -0.15836308],
       [-0.24144708, -0.02149006, -0.80124335, ..., -0.40325154,
        -0.21921853, -0.15836308],
       ...,
       [-0.24144708, -0.02149006, -0.80124335, ..., -0.40325154,
        -0.21921853, -0.15836308],
       [-0.24144708, -0.02149006,  1.24806028, ..., -0.40325154,
        -0.21921853, -0.15836308],
       [-0.24144708, -0.02149006, -0.80124335, ..., -0.40325154,
        -0.21921853, -0.15836308]])

### Implementing the KNN Classifier and Evaluating the Accuracy
- We fit a KNN classifier for several candidate values of K to find the best one.
- However, every K we tried gave the same test accuracy of 1.0.
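
For intuition about what the classifier does with K, here is a minimal from-scratch sketch of the KNN idea: measure the distance from a query point to every training row, take the K closest, and let them vote. The function `knn_predict` and the toy arrays are hypothetical; scikit-learn's `KNeighborsClassifier` uses more efficient neighbour searches and its own tie-breaking, so this is not how the library is implemented.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training rows."""
    # Euclidean distance from the query to every training row.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters, labelled "e" and "p".
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array(["e", "e", "e", "p", "p", "p"])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
pred_far = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
print(pred_near_origin, pred_far)
```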

In [115]:
# Fit a KNN classifier with K neighbors and record its test accuracy.
def find_best_k(K, score_list):
    knn = KNeighborsClassifier(n_neighbors=K)
    knn.fit(x_train_scaled, y_train)
    # Predict on the held-out test set and compare with the true labels.
    y_pred_sklearn = knn.predict(x_test_scaled)
    score = accuracy_score(y_test, y_pred_sklearn)
    score_list.append({'K': K, 'accuracy': score})

In [118]:
# We created a list for candidate values of K.
K_list = [3, 5, 7, 9, 11]
# The score's list.
score_list = []

# We tested each candidate K.
for K in K_list:
    find_best_k(K, score_list)

# We printed out the result.
for score in score_list:
    print(score)

{'K': 3, 'accuracy': 1.0}
{'K': 5, 'accuracy': 1.0}
{'K': 7, 'accuracy': 1.0}
{'K': 9, 'accuracy': 1.0}
{'K': 11, 'accuracy': 1.0}
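
Because every candidate reaches the same accuracy, the scores alone cannot single out one K. A short sketch of how the tie could be made explicit, using the values printed above (the variable names `best_accuracy` and `tied_ks` are illustrative):

```python
# Results copied from the printed output of the K search.
score_list = [
    {'K': 3, 'accuracy': 1.0},
    {'K': 5, 'accuracy': 1.0},
    {'K': 7, 'accuracy': 1.0},
    {'K': 9, 'accuracy': 1.0},
    {'K': 11, 'accuracy': 1.0},
]

# Highest accuracy reached by any candidate.
best_accuracy = max(entry['accuracy'] for entry in score_list)

# Every K that reaches it; more than one entry means a tie.
tied_ks = [entry['K'] for entry in score_list if entry['accuracy'] == best_accuracy]

print(best_accuracy, tied_ks)
```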


## Implementing the K-Fold Cross Validation