# Gender classification through KNN

## When to use KNN (K-Nearest Neighbours)

- It can be used for both classification and regression, but it is mostly used for classification problems.
- It memorizes the training data instead of learning a model from it, which is why it is known as a lazy algorithm.
- It handles multiclass classification naturally, because a prediction is a vote among the neighbours' labels.
- With a larger k, a single outlier has limited influence on the vote; with a small k, predictions near an outlier can still be affected.
- It is computationally expensive at prediction time because it measures the distance from the query to every training point.
- It can also be used for datasets that are not linearly separable.
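
To make the cost of "measuring distance with each data point" concrete, here is a minimal sketch of a brute-force neighbour search. The arrays `points` and `query` are made-up toy values, not part of the dataset used in this notebook.

```python
import numpy as np

# Toy data (made up for illustration): 5 stored points with 2 features each.
points = np.array([[0, 0], [3, 4], [1, 1], [6, 8], [0, 2]], dtype=float)
query = np.array([0.0, 0.0])

# One Euclidean distance per stored point: this is the per-query cost of KNN.
distances = np.sqrt(((points - query) ** 2).sum(axis=1))

# Indices of the k closest points (stable sort keeps the lower index on ties).
k = 3
nearest = np.argsort(distances, kind="stable")[:k]

print(distances)
print(nearest)
```

Every query repeats this full pass over the stored points, which is why prediction gets slower as the training set grows.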
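
## Assumptions

- It makes no assumption about the underlying data distribution (it is non-parametric). It does rely on the distance between points being meaningful, so features should be on comparable scales.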

In [1]:
import pandas as pd
import numpy as np
import pprint as pp
import math
from sklearn.metrics import classification_report

#### Reading dataset

In [2]:
df = pd.read_csv('./dataset/dataset.csv')
df.head()

| | Favorite Color | Favorite Music Genre | Favorite Beverage | Favorite Soft Drink | Gender |
|---|---|---|---|---|---|
| 0 | Cool | Rock | Vodka | 7UP/Sprite | F |
| 1 | Neutral | Hip hop | Vodka | Coca Cola/Pepsi | F |
| 2 | Warm | Rock | Wine | Coca Cola/Pepsi | F |
| 3 | Warm | Folk/Traditional | Whiskey | Fanta | F |
| 4 | Cool | Rock | Vodka | Coca Cola/Pepsi | F |


#### Dataset encoding

In [3]:
df = df.astype('category')
for each_column in df.columns:
    pp.pprint(dict(enumerate(df[each_column].cat.categories)))
    df[each_column] = df[each_column].cat.codes
df.head()

{0: 'Cool', 1: 'Neutral', 2: 'Warm'}
{0: 'Electronic',
 1: 'Folk/Traditional',
 2: 'Hip hop',
 3: 'Jazz/Blues',
 4: 'Pop',
 5: 'R&B and soul',
 6: 'Rock'}
{0: 'Beer', 1: "Doesn't drink", 2: 'Other', 3: 'Vodka', 4: 'Whiskey', 5: 'Wine'}
{0: '7UP/Sprite', 1: 'Coca Cola/Pepsi', 2: 'Fanta', 3: 'Other'}
{0: 'F', 1: 'M'}


| | Favorite Color | Favorite Music Genre | Favorite Beverage | Favorite Soft Drink | Gender |
|---|---|---|---|---|---|
| 0 | 0 | 6 | 3 | 0 | 0 |
| 1 | 1 | 2 | 3 | 1 | 0 |
| 2 | 2 | 6 | 5 | 1 | 0 |
| 3 | 2 | 1 | 4 | 2 | 0 |
| 4 | 0 | 6 | 3 | 1 | 0 |
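
The integer codes above put the categories on a number line, so under Euclidean distance 'Electronic' (0) ends up farther from 'Rock' (6) than from 'Folk/Traditional' (1), although the genres have no natural order. One common alternative, which this notebook does not use, is one-hot encoding. The sketch below assumes a small hand-built frame `sample` holding the first three rows of two of the columns shown earlier.

```python
import pandas as pd

# Three rows copied from the head of the dataset shown earlier (two columns only).
sample = pd.DataFrame({
    "Favorite Color": ["Cool", "Neutral", "Warm"],
    "Favorite Music Genre": ["Rock", "Hip hop", "Rock"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(sample, dtype=int)
print(one_hot)
```

With this encoding, any two different categories of the same feature are the same distance apart, at the cost of more columns.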


#### Splitting dataset

In [4]:
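# Shuffle all rows; the whole dataset is used as training data (no hold-out set).
df_train = df.sample(frac=1)
# Features are every column except the last; the label (Gender) is the last column.
x_train = df_train.values[:, :-1]
y_train = df_train.values[:, -1]

print('x_train: {}\ny_train: {}'.format(x_train.shape, y_train.shape))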

x_train = x_train.tolist()
y_train = y_train.tolist()

print(x_train)
print(y_train)

x_train: (66, 4)
y_train: (66,)
[[0, 0, 0, 1], [0, 4, 5, 0], [0, 0, 2, 2], [1, 6, 1, 1], [2, 0, 5, 1], [2, 6, 2, 1], [0, 6, 1, 1], [0, 5, 0, 1], [2, 3, 5, 1], [0, 0, 1, 2], [0, 6, 3, 1], [1, 6, 0, 1], [1, 2, 3, 1], [2, 0, 3, 2], [0, 4, 1, 0], [0, 4, 0, 3], [2, 3, 4, 2], [0, 0, 1, 2], [2, 4, 2, 1], [0, 2, 0, 1], [0, 4, 1, 1], [0, 6, 5, 0], [0, 6, 3, 0], [0, 6, 1, 3], [0, 1, 0, 3], [2, 3, 3, 1], [0, 5, 1, 1], [0, 6, 3, 1], [2, 6, 2, 0], [0, 5, 4, 0], [0, 2, 5, 1], [0, 6, 0, 1], [2, 5, 4, 1], [2, 6, 0, 2], [0, 4, 2, 2], [2, 6, 3, 0], [0, 6, 2, 1], [2, 4, 4, 2], [0, 2, 0, 1], [1, 4, 5, 1], [2, 2, 0, 1], [2, 3, 1, 2], [2, 5, 5, 3], [2, 4, 2, 0], [2, 4, 5, 0], [2, 6, 5, 1], [2, 5, 1, 1], [2, 0, 2, 1], [0, 1, 4, 0], [0, 0, 4, 1], [0, 2, 1, 3], [0, 4, 0, 1], [0, 4, 2, 0], [1, 2, 0, 0], [0, 6, 2, 1], [0, 4, 1, 3], [0, 6, 3, 1], [0, 4, 4, 3], [0, 4, 4, 2], [2, 1, 4, 2], [2, 1, 2, 2], [0, 6, 3, 1], [0, 6, 5, 1], [1, 4, 1, 0], [1, 2, 1, 2], [0, 4, 0, 2]]
[1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 
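
The cell above shuffles the rows and separates features from labels, but it keeps all 66 rows for training. If a separate test set is wanted, one possible approach is scikit-learn's `train_test_split`. The sketch below uses random placeholder arrays with the same shape as `x_train` and `y_train`, not the real data; the 20% test size and the names `x_tr`, `x_te`, `y_tr`, `y_te` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the notebook's shape (66 rows, 4 features); values are random.
rng = np.random.default_rng(0)
x = rng.integers(0, 7, size=(66, 4))
y = np.array([0, 1] * 33)  # 33 rows per class, as in the classification report

# Hold out 20% of the rows; stratify keeps the class balance in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)
print(x_tr.shape, x_te.shape)
```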

#### Logic for calculating distance

In [5]:
def get_distances_label(input_param, x, y, k):
    """Return the distances and labels of the k rows of x nearest to input_param."""
    distances, label = [], []
    used_index = []  # indices already selected as neighbours
    for _ in range(k):
        min_distance = None
        min_index = 0
        for j, each_row in enumerate(x):
            if j in used_index:
                continue
            # Euclidean distance between the stored row and the query point
            distance = math.sqrt(sum((x1 - x2) ** 2 for x1, x2 in zip(each_row, input_param)))
            # NumPy equivalent: np.sqrt(np.sum((np.array(each_row) - np.array(input_param)) ** 2))
            if min_distance is None or min_distance > distance:
                min_distance = distance
                min_index = j
        # Record the closest remaining row and its label
        used_index.append(min_index)
        distances.append(min_distance)
        label.append(y[min_index])
    return distances, label
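
The same search can be written without the nested loops. The following is a sketch of a vectorized variant; the name `get_distances_label_np` and the demo rows are illustrative and not part of the original notebook.

```python
import numpy as np

def get_distances_label_np(input_param, x, y, k):
    """Hypothetical NumPy variant: distances and labels of the k nearest rows."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    query = np.asarray(input_param, dtype=float)
    # Euclidean distance from the query to every row at once
    distances = np.sqrt(((x - query) ** 2).sum(axis=1))
    # Stable sort keeps the lowest index first on ties, like the loop version
    order = np.argsort(distances, kind="stable")[:k]
    return distances[order].tolist(), y[order].tolist()

# Four demo rows in the encoded format, with made-up labels
x_demo = [[0, 0, 0, 1], [0, 4, 5, 0], [2, 3, 5, 1], [1, 6, 1, 1]]
y_demo = [1, 0, 0, 1]
d, l = get_distances_label_np([2, 3, 5, 1], x_demo, y_demo, k=2)
print(d, l)
```

Sorting all distances once replaces the k repeated passes over the data made by the loop version.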

#### Prediction

In [6]:
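def predict(input_param, x, y, k=9):
    """Predict 'M' or 'F' by majority vote among the k nearest neighbours."""
    _, labels = get_distances_label(input_param, x, y, k)
    # Labels are encoded 0 = 'F', 1 = 'M', so the sum counts the 'M' votes.
    return 'M' if sum(labels) > k / 2 else 'F'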

In [7]:
predicted_class = predict([2, 3, 5, 1], x_train, y_train)
print('Predicted class is {}'.format(predicted_class))

Predicted class is F
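
The `sum(...) > k / 2` rule in `predict` works because there are only two classes, encoded as 0 and 1. For more than two classes, the vote could be counted explicitly. The helper `majority_vote` below is a hypothetical sketch and is not used elsewhere in this notebook.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label among the k neighbours (hypothetical helper)."""
    # most_common(1) returns a list with one (label, count) pair
    return Counter(labels).most_common(1)[0][0]

print(majority_vote([0, 1, 1, 0, 1]))
print(majority_vote(["rock", "pop", "rock"]))
```

On a tie, `Counter.most_common` returns the label that appeared first in the list, so an odd k is still the simplest way to avoid ties in the two-class case.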


#### Evaluation

In [8]:
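# Note: predictions are made on the training rows themselves, so every row counts
# itself as one of its own k neighbours (distance 0); the scores are likely optimistic.
y_pred = []
for each in x_train:
    y_pred.append(0 if predict(each, x_train, y_train) == 'F' else 1)
print(classification_report(y_train, y_pred))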

              precision    recall  f1-score   support

           0       0.66      0.88      0.75        33
           1       0.82      0.55      0.65        33

    accuracy                           0.71        66
   macro avg       0.74      0.71      0.70        66
weighted avg       0.74      0.71      0.70        66
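
The report above is computed on the same rows that serve as neighbours. One way to get an estimate on unseen rows is leave-one-out evaluation, where each row is predicted from all the others. The sketch below uses scikit-learn's `KNeighborsClassifier` as a stand-in for the hand-written `predict` and random placeholder data with the notebook's shape, so its numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data with the notebook's shape (66 rows, 4 features); values are random.
rng = np.random.default_rng(0)
x = rng.integers(0, 7, size=(66, 4))
y = np.array([0, 1] * 33)

# k = 9 as in predict(); the default metric is Euclidean distance.
model = KNeighborsClassifier(n_neighbors=9)

# Each row is predicted by a model fitted on the other 65 rows.
y_pred = cross_val_predict(model, x, y, cv=LeaveOneOut())
print(y_pred.shape)
```

The resulting `y_pred` can be passed to `classification_report` in the same way as above. scikit-learn may break ties between equally distant neighbours differently from the loop in `get_distances_label`, so small differences from the hand-written version are possible.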

