
# KNN from scratch 


## Downloading the data
Dataset: the Palmer Penguins dataset, downloaded as a CSV file.

In [10]:
!wget https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/master/palmerpenguins/data/penguins.csv

--2022-04-27 06:06:53--  https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/master/palmerpenguins/data/penguins.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15241 (15K) [text/plain]
Saving to: ‘penguins.csv’


2022-04-27 06:06:53 (96.6 MB/s) - ‘penguins.csv’ saved [15241/15241]



## Import Dependencies

In [11]:
import numpy as np

## Processing the data

In [12]:
with open('penguins.csv', 'rt') as f:
  data = f.readlines()

In [13]:
column_headers = data[0].split(',')
print(column_headers)

['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year\n']


In [14]:
data[:10]

['species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year\n',
 'Adelie,Torgersen,39.1,18.7,181,3750,male,2007\n',
 'Adelie,Torgersen,39.5,17.4,186,3800,female,2007\n',
 'Adelie,Torgersen,40.3,18,195,3250,female,2007\n',
 'Adelie,Torgersen,NA,NA,NA,NA,NA,2007\n',
 'Adelie,Torgersen,36.7,19.3,193,3450,female,2007\n',
 'Adelie,Torgersen,39.3,20.6,190,3650,male,2007\n',
 'Adelie,Torgersen,38.9,17.8,181,3625,female,2007\n',
 'Adelie,Torgersen,39.2,19.6,195,4675,male,2007\n',
 'Adelie,Torgersen,34.1,18.1,193,3475,NA,2007\n']

In [15]:
# label: species
# features: 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex'
def process_data(data):
  species = {'Adelie': 0, 'Gentoo': 1, 'Chinstrap': 2}
  gender = {'male': 0, 'female': 1}
  features = np.zeros((len(data), 5))
  labels = np.zeros(len(data))

  for row, line in enumerate(data):
    line = line.split(',')
    try:
      feature = [float(x.strip()) for x in line[2:6]]
      sex = float(gender[line[6]])
      features[row, :4] = feature
      features[row, -1] = sex
      labels[row] = species[line[0]]
    except Exception:
      # NA values are not handled yet: any row that fails to parse (including
      # the header row) keeps all-zero features and label 0.
      # Since this is KNN, that should be fine for now.
      pass

  return features, labels

In [16]:
features, labels = process_data(data)
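
The `process_data` call above keeps every line of the file, so the header row and any row containing `NA` end up as all-zero feature rows labelled 0 (Adelie). A minimal sketch of an alternative that drops those rows instead is shown below. The name `process_data_dropna` and the `sample` lines are hypothetical and used only for illustration; the numbers reported in this notebook come from the zero-filling version.

```python
import numpy as np

SPECIES = {'Adelie': 0, 'Gentoo': 1, 'Chinstrap': 2}
GENDER = {'male': 0, 'female': 1}

def process_data_dropna(lines):
    """Parse CSV lines, keeping only rows where every field parses (hypothetical helper)."""
    features, labels = [], []
    for line in lines[1:]:  # skip the header row
        fields = line.strip().split(',')
        try:
            row = [float(x) for x in fields[2:6]]
            row.append(float(GENDER[fields[6]]))
            label = SPECIES[fields[0]]
        except (ValueError, KeyError, IndexError):
            continue  # incomplete row (for example an 'NA' field): drop it
        features.append(row)
        labels.append(label)
    return np.array(features), np.array(labels)

# A few lines in the same layout as penguins.csv, including two incomplete rows.
sample = [
    'species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year\n',
    'Adelie,Torgersen,39.1,18.7,181,3750,male,2007\n',
    'Adelie,Torgersen,NA,NA,NA,NA,NA,2007\n',
    'Adelie,Torgersen,34.1,18.1,193,3475,NA,2007\n',
    'Adelie,Torgersen,39.5,17.4,186,3800,female,2007\n',
]
clean_features, clean_labels = process_data_dropna(sample)
print(clean_features.shape, clean_labels)
```

Only the two complete rows survive, so the feature matrix has no placeholder rows that could act as spurious neighbors.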

## The train test split

In [40]:
def train_test_split(features, labels, frac=0.8):
  """
  Shuffle the samples, split them into train and test sets, and standardize the features.
  Params:
  features: processed features returned by process_data. 2D numpy array
  labels: processed labels returned by process_data. 1D numpy array
  frac: float between 0 and 1, the fraction of samples used for training

  Return:
  x_train: standardized features for train set
  y_train: labels for train set
  x_test: standardized features for test set
  y_test: labels for test set
  """
  idx = np.arange(features.shape[0])
  np.random.shuffle(idx)
  n = int(frac * len(idx))
  train_id = idx[:n]
  test_id = idx[n:]
  x_train = features[train_id, :]
  y_train = labels[train_id]
  x_test = features[test_id, :]
  y_test = labels[test_id]

  # Feature scaling: really important for KNN and other distance-based algorithms.
  # The mean and std come from the train set only and are reused for the test set.
  mean = np.mean(x_train, axis=0)
  std = np.std(x_train, axis=0)
  x_train = (x_train - mean) / std
  x_test = (x_test - mean) / std

  return x_train, y_train, x_test, y_test

In [41]:
x_train, y_train, x_test, y_test = train_test_split(features, labels)
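
To see why the scaling step inside `train_test_split` matters, here is a small sketch with made-up values (not rows from the dataset). Body mass is measured in thousands of grams while bill length is in tens of millimetres, so without scaling the mass column dominates the Manhattan distance. The `manhattan` helper and the three rows are assumptions for illustration only.

```python
import numpy as np

# Hypothetical rows: [bill_length_mm, body_mass_g]
train = np.array([[39.0, 3700.0],
                  [47.0, 3750.0],
                  [39.5, 3900.0]])
query = np.array([39.2, 3760.0])

def manhattan(points, q):
    """L1 distance from q to every row of points."""
    return np.sum(np.abs(points - q), axis=1)

# Unscaled: the 10 g mass difference outweighs the 7.8 mm bill difference.
raw_nearest = int(np.argmin(manhattan(train, query)))

# Standardized with the training mean and std, as in train_test_split.
mean = train.mean(axis=0)
std = train.std(axis=0)
scaled_nearest = int(np.argmin(manhattan((train - mean) / std, (query - mean) / std)))

print(raw_nearest, scaled_nearest)
```

On these values the nearest neighbor changes from row 1 (raw) to row 0 (standardized), because each feature now contributes on a comparable scale.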

In [42]:
print(f"Size of x_train: {x_train.shape}")
print(f"Size of y_train: {y_train.shape}")
print(f"Size of x_test: {x_test.shape}")
print(f"Size of y_test: {y_test.shape}")


Size of x_train: (276, 5)
Size of y_train: (276,)
Size of x_test: (69, 5)
Size of y_test: (69,)
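
The split sizes are fixed by `frac`, but `np.random.shuffle` draws from NumPy's global random state, so which samples land in each set changes from run to run. A minimal sketch of a reproducible variant is given below; `seeded_split_indices` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import numpy as np

def seeded_split_indices(n_samples, frac=0.8, seed=0):
    """Return (train_idx, test_idx) from a seeded shuffle of range(n_samples)."""
    rng = np.random.default_rng(seed)   # local generator, independent of global state
    idx = rng.permutation(n_samples)
    n_train = int(frac * n_samples)
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = seeded_split_indices(345, frac=0.8, seed=0)
again_train, again_test = seeded_split_indices(345, frac=0.8, seed=0)

print(len(train_idx), len(test_idx))
print(np.array_equal(train_idx, again_train))  # same seed, same split
```

The returned index arrays can be used in place of `train_id` and `test_id` to index the feature and label arrays.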


## KNN

In [43]:
class knn():
  def __init__(self, k: int):
    self.k = k  # number of neighbors that take part in the vote

  def predict(self, test_feature, train_features):
    """
    returns the indices of the k nearest training samples,
    measured by Manhattan (L1) distance
    """
    dist = np.sum(np.abs(train_features - test_feature[np.newaxis, :]), axis=1)
    order = np.argsort(dist)  # training indices sorted from nearest to farthest
    return order[: self.k]

  def score(self, train_labels, test_label):
    """
    majority vote over the neighbors' labels;
    returns 1 if the voted class matches test_label, else 0
    """
    label_counter = np.zeros(3)  # the number of classes is 3; a dict would work as well
    for elem in train_labels:
      label_counter[int(elem)] += 1
    y_pred = np.argmax(label_counter)  # argmax returns the first occurrence in case of a tie
    print("Predicted class: ", label_counter, y_pred)
    if y_pred == test_label:
      return 1
    else:
      return 0

  def evaluate(self, test_features, train_features, test_labels, train_labels):
    """
    returns the accuracy: the fraction of test samples classified correctly
    """
    total_score = 0
    for i in range(len(test_features)):
      top_k = self.predict(test_features[i, :], train_features)
      score = self.score(train_labels[top_k], test_labels[i])
      total_score += score

    return total_score / len(test_features)
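
The class above splits a prediction into two steps: find the k nearest training rows, then take a majority vote over their labels. As a rough sketch, the same two steps can be collapsed into a single function that returns the predicted label directly. `knn_predict` and the toy arrays are hypothetical; `np.bincount` stands in for the hand-written counter so the number of classes is not hard-coded.

```python
import numpy as np

def knn_predict(query, train_features, train_labels, k=5):
    """Predict one label by majority vote among the k nearest training rows."""
    dist = np.sum(np.abs(train_features - query), axis=1)  # Manhattan (L1) distance
    nearest = np.argsort(dist)[:k]                          # indices of the k closest rows
    votes = np.bincount(train_labels[nearest].astype(int))  # votes per class id
    return int(np.argmax(votes))                            # ties go to the smallest class id

# Toy data: two tight groups of points with labels 0 and 1.
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(np.array([0.1, 0.1]), toy_x, toy_y, k=3)
pred_b = knn_predict(np.array([5.0, 5.1]), toy_x, toy_y, k=3)
print(pred_a, pred_b)
```

Each query is assigned the label of the group it sits in, which is the behavior `score` checks against the true label.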

In [47]:
cls = knn(k=5)

In [48]:
acc = cls.evaluate(x_test, x_train, y_test, y_train)

Predicted class:  [0. 5. 0.] 1
Predicted class:  [0. 5. 0.] 1
Predicted class:  [0. 0. 5.] 2
Predicted class:  [0. 5. 0.] 1
Predicted class:  [0. 5. 0.] 1
Predicted class:  [4. 0. 1.] 0
Predicted class:  [3. 0. 2.] 0
Predicted class:  [5. 0. 0.] 0
Predicted class:  [0. 0. 5.] 2
Predicted class:  [0. 5. 0.] 1
Predicted class:  [0. 5. 0.] 1
Predicted class:  [0. 5. 0.] 1
Predicted class:  [0. 5. 0.] 1
Predicted class:  [5. 0. 0.] 0
Predicted class:  [0. 5. 0.] 1
Predicted class:  [0. 5. 0.] 1
Predicted class:  [0. 0. 5.] 2
Predicted class:  [0. 0. 5.] 2
Predicted class:  [0. 5. 0.] 1
Predicted class:  [0. 0. 5.] 2
Predicted class:  [5. 0. 0.] 0
Predicted class:  [5. 0. 0.] 0
Predicted class:  [5. 0. 0.] 0
Predicted class:  [5. 0. 0.] 0
Predicted class:  [0. 5. 0.] 1
Predicted class:  [5. 0. 0.] 0
Predicted class:  [5. 0. 0.] 0
Predicted class:  [5. 0. 0.] 0
Predicted class:  [3. 0. 2.] 0
Predicted class:  [3. 0. 2.] 0
Predicted class:  [0. 5. 0.] 1
Predicted class:  [5. 0. 0.] 0
...

In [50]:
print(f"The accuracy of our KNN with k=5 is {acc*100:.2f}%")

The accuracy of our KNN with k=5 is 98.55%
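
The accuracy above is for a single choice, k=5. One way to check how sensitive the result is to k is to repeat the evaluation for several values. The sketch below does this on synthetic stand-in data (two well-separated clusters, not the penguins), so its numbers say nothing about this dataset; `knn_accuracy` is a hypothetical helper that mirrors `evaluate` without the per-sample printing.

```python
import numpy as np

def knn_accuracy(x_train, y_train, x_test, y_test, k):
    """Fraction of test rows whose k-nearest-neighbor vote matches the true label."""
    correct = 0
    for x, y in zip(x_test, y_test):
        dist = np.sum(np.abs(x_train - x), axis=1)  # Manhattan (L1) distance
        nearest = np.argsort(dist)[:k]
        votes = np.bincount(y_train[nearest])
        correct += int(np.argmax(votes) == y)
    return correct / len(x_test)

# Synthetic stand-in data: two 2D clusters centred at 0 and 10.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
x1 = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
x_all = np.vstack([x0, x1])
y_all = np.array([0] * 50 + [1] * 50)

# 80/20 split with a seeded shuffle.
order = rng.permutation(len(x_all))
train_idx, test_idx = order[:80], order[80:]

accuracies = {
    k: knn_accuracy(x_all[train_idx], y_all[train_idx],
                    x_all[test_idx], y_all[test_idx], k)
    for k in (1, 3, 5, 7, 9)
}
for k, acc in accuracies.items():
    print(f"k={k}: accuracy={acc:.2f}")
```

Odd values of k are used here to reduce the chance of tied votes between two classes.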
