
# Prove 02 - kNN Classifier



## Initial setup
Using the code provided in the assignment description, we read the CSV directly from the course GitHub site into a DataFrame and work with it from there.

In [89]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import collections
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KDTree

# URL is still current despite the official course code change
url = "https://byui-cs.github.io/cs450-course/week01/iris.data"
data = pd.read_csv(url)

# Make sure we got what we needed
print(data)

     sepal_length  sepal_width  petal_length  petal_width         species
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]


### Splitting the data into "features" and "target"

In [90]:
# Select every column except the last one to get the features
features = data.iloc[:, :-1].to_numpy()

# Select only the last column to get the targets
targets = data.iloc[:, -1].to_numpy()

# Sanity check
# print(features, '\n\n\n', targets)
print(np.size(features, 0))

150


## Passing the feature and target arrays to *train_test_split*

In [0]:
# Taken from the assignment description on the course Github

# Randomize and split the samples into two groups. 
# 30% of the samples will be used for testing.
# The other 70% will be used for training.
train_data, test_data, train_targets, test_targets = train_test_split(features, targets, test_size=.3)
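
Because the split above is reshuffled on every run, accuracy figures change from run to run. A minimal sketch of a repeatable split follows, assuming scikit-learn's `random_state` and `stratify` parameters are acceptable for the assignment. The stand-in arrays and the seed value 42 are made up for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shapes as the iris features and targets
# (hypothetical data, used only so the sketch runs without a download).
features = np.arange(600, dtype=float).reshape(150, 4)
targets = np.repeat(["Iris-setosa", "Iris-versicolor", "Iris-virginica"], 50)

# random_state fixes the shuffle; stratify keeps the class proportions
# the same in the training and testing groups. The seed 42 is arbitrary.
split_a = train_test_split(features, targets, test_size=0.3,
                           random_state=42, stratify=targets)
split_b = train_test_split(features, targets, test_size=0.3,
                           random_state=42, stratify=targets)

train_data, test_data, train_targets, test_targets = split_a

print(train_data.shape, test_data.shape)
# The same seed reproduces the same test rows.
print(np.array_equal(split_a[1], split_b[1]))
```

With a fixed seed, any difference in accuracy between two classifiers comes from the classifiers themselves rather than from the luck of the split.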

## Creating a hardcoded classifier

In [0]:
class HardCodedClassifier:
  # Default constructor
  def __init__(self):
    self.features = None
    self.targets = None
    self.predictions = None
  
  # fit - trains the classifier
  def fit(self, features, targets):
    self.features = features
    self.targets = targets
  
  # predict - makes predictions based on the training (hardcoded for now)
  def predict(self, test_features):
    self.predictions = np.array(['Iris-setosa'] * np.size(test_features, 0))
    return self.predictions
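
The notebook does not score the hard-coded classifier, but its accuracy is a useful baseline for everything that follows. A small sketch, assuming a perfectly balanced set of test labels (the `labels` array below is hypothetical; a real random split will rarely be exactly balanced):

```python
import numpy as np

# Hypothetical balanced label array standing in for test_targets.
labels = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica"] * 15)

# Same rule as HardCodedClassifier.predict: always answer "Iris-setosa".
predictions = np.array(["Iris-setosa"] * np.size(labels, 0))

# Accuracy is the fraction of predictions that match the labels.
baseline_accuracy = np.mean(predictions == labels)
print(baseline_accuracy)
```

With three equally common species, always guessing one of them is right about a third of the time, so a real classifier should land well above that.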

## Creating a kNN classifier

In [0]:
# I learned about list comprehension after I implemented the predict and predict_row
# methods. I might update them to be more elegant with list comprehension in the
# future.

class KnnClassifier:
  # Default Constructor
  def __init__(self):
    self.features = None
    self.targets = None
    self.predictions = None
  
  # fit - trains the classifier
  def fit(self, features, targets):
    self.features = features
    self.targets = targets
  
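  # calc_distance - Euclidean distance between two rows.
  # NOTE: only the first two features (sepal_length, sepal_width) are compared;
  # petal_length and petal_width are ignored.
  def calc_distance(self, x1, x2):
    return np.sqrt((x1[0] - x2[0]) ** 2 + (x1[1] - x2[1]) ** 2)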

  # predict - makes predictions based on the training
  def predict(self, test_data, k):
    predictions = []
    for row in test_data:
      predictions += [self.predict_row(row, k)]
    return predictions

  # predict_row - helper function to predict a single entity in test_data
  def predict_row(self, row, k):
    distances = []
    for feature in self.features:
      distances += [self.calc_distance(row, feature)]
    indices = np.argsort(distances)[0:k]
    return collections.Counter(self.targets[indices]).most_common(1)[0][0]
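
The `calc_distance` method above compares only `x1[0]` and `x1[1]`. A possible variant that measures distance over every feature column is sketched below with plain NumPy. The function names and the toy data are hypothetical, chosen so that the two classes differ only in the third column.

```python
import collections
import numpy as np

def euclidean_distances_to(train, row):
    """Distance from one row to every training row, using all columns."""
    return np.sqrt(((train - row) ** 2).sum(axis=1))

def knn_predict(train, train_targets, test, k):
    """Majority vote among the k nearest training rows for each test row."""
    predictions = []
    for row in test:
        nearest = np.argsort(euclidean_distances_to(train, row))[:k]
        votes = collections.Counter(train_targets[nearest])
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Toy data (made up): the two clusters are nearly identical in the first
# two columns and are separated only by the third.
train = np.array([[1.0, 1.0, 0.0], [1.0, 1.1, 0.1], [1.1, 1.0, 0.2],
                  [1.0, 1.0, 5.0], [1.1, 1.1, 5.1], [1.0, 1.1, 5.2]])
train_targets = np.array(["a", "a", "a", "b", "b", "b"])
test = np.array([[1.0, 1.0, 0.1], [1.0, 1.0, 5.1]])

print(knn_predict(train, train_targets, test, k=3))
```

Subtracting `row` from `train` relies on NumPy broadcasting, so the inner loop over training rows disappears and the number of features no longer needs to be written into the formula.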

## Instantiate a KnnClassifier and test its accuracy

In [94]:
# Terrible naming scheme, but it's quick...

knc = KnnClassifier()
knc.fit(train_data, train_targets)

knc_predictions = knc.predict(test_data, 10)

accuracy_knc = accuracy_score(test_targets, knc_predictions)
accuracy_knc

0.8

## Instantiate and experiment with a SKLearn KNeighborsClassifier


In [95]:
# Terrible naming scheme, but it's quick...

KNC = KNeighborsClassifier(n_neighbors=3)
KNC.fit(train_data, train_targets)
KNC_predictions = KNC.predict(test_data)

accuracy_KNC = accuracy_score(test_targets, KNC_predictions)
accuracy_KNC

0.9777777777777777

## Conclusions

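The scikit-learn kNN classifier works very well: across reruns its accuracy ranged from 0.87 to 1.00, while the accuracy of my own classifier ranged from 0.60 to 0.85. The comparison is not like-for-like, though. My `calc_distance` compares only the first two features (sepal length and sepal width), and I ran my classifier with k=10 but the scikit-learn one with k=3. Those differences, rather than implementation quality alone, probably account for much of the gap.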
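
One way to check how much the choice of features matters is to hold everything else fixed. The sketch below is not part of the original notebook: it assumes scikit-learn's bundled copy of the iris data (`load_iris`) and five-fold cross-validation with an arbitrary seed, and it runs the same `KNeighborsClassifier` once on the first two columns and once on all four.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Bundled iris data: columns are sepal length, sepal width,
# petal length, petal width.
X, y = load_iris(return_X_y=True)

# Fixed folds so both runs are scored on exactly the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = KNeighborsClassifier(n_neighbors=3)

# Same model, same k, same folds; only the feature columns differ.
two_feature_scores = cross_val_score(model, X[:, :2], y, cv=cv)
all_feature_scores = cross_val_score(model, X, y, cv=cv)

print("first two features:", two_feature_scores.mean())
print("all four features: ", all_feature_scores.mean())
```

Averaging over several folds also smooths out the run-to-run fluctuation that a single random split produces.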

## BONUS: Experimenting with a KD Tree

### Set up the KDTree and run some tests

In [96]:
# Again, not a good naming scheme. But it's quick.

# Reference used: scikit-learn docs at 
#   https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html

kdt = KDTree(train_data, leaf_size=2)
dist, ind = kdt.query(test_data, k=3)
kdt_predictions = []

for value in train_targets[ind]:
  kdt_predictions += [collections.Counter(value).most_common(1)[0][0]]

# kdt_predictions

accuracy_kdt = accuracy_score(test_targets, kdt_predictions)

accuracy_kdt

0.9777777777777777

### Further conclusions

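On this split the KDTree approach scores exactly the same as the sklearn KNeighborsClassifier (0.9777777777777777 for both). That is expected: a KD tree is a data structure for finding nearest neighbors quickly, not a different classification algorithm, so with the same k=3 and the same Euclidean distance both methods vote over the same neighbors (apart from how ties are broken). Across reruns the KDTree accuracy ranged from 0.91 to 1.00, with 1.00 occurring often; since each rerun draws a new random train/test split, that spread probably reflects the split more than the method. I have little to no idea yet which classification algorithms are best. However, even in this assignment, I've seen that there are probably many well-written classifiers with different benefits in different situations.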
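
A quick way to see what a KD tree contributes is to compare its neighbor search with an exhaustive one. The sketch below uses random stand-in data (not the iris split) and scikit-learn's `NearestNeighbors` with `algorithm="brute"` as the reference; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KDTree, NearestNeighbors

# Random stand-in data with the same shapes as the 70/30 iris split.
rng = np.random.default_rng(0)
train = rng.random((105, 4))
test = rng.random((45, 4))

# Neighbor search through the tree, as in the cell above.
kdt = KDTree(train, leaf_size=2)
tree_dist, tree_ind = kdt.query(test, k=3)

# Exhaustive search that checks every training row.
brute = NearestNeighbors(n_neighbors=3, algorithm="brute").fit(train)
brute_dist, brute_ind = brute.kneighbors(test)

# Compare the neighbor distances and indices found by the two searches.
print(np.allclose(tree_dist, brute_dist))
print(np.array_equal(tree_ind, brute_ind))
```

If both searches return the same neighbors, any majority vote built on top of them gives the same predictions; the tree changes how fast the neighbors are found, not which ones they are.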