# CS 3120 Machine Learning HW1

Devon DeJohn, Spring 2020

In [1]:
import sys
sys.path.append('../')
from source import k_nearest_neighbors
model = k_nearest_neighbors.Model("../data/animals")

## Imports

```python
import cv2
import pathlib
import random
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report as crep
from itertools import product
from tabulate import tabulate
```

## Train, Test, Validate

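My model's default parameters differ slightly from those of `sklearn`'s `KNeighborsClassifier` (which defaults to `n_neighbors=5`, the Euclidean metric, and `uniform` weights), as shown by the `DEFAULT` dictionary passed to `super().__init__()`: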

```python
DEFAULT = {
    "n_neighbors": 3,
    "metric": "manhattan",
    "weights": "distance",
    "n_jobs": -1
}

class Model(KNeighborsClassifier):
    """A container for a KNN classifier"""
    def __init__(self, path: str):
        super().__init__(**DEFAULT)
        self.dims = (16,16)
        self.labels = {}
        self.path = path
        self.parts = ["train", "test", "validate"]
        self.load_data()
        self.fit(self.train.X, self.train.Y)
    # end
```

I also decided to implement my own version of `train_test_split` that supports multiple partitions, not just two:

In [2]:
model.datasets()


Partition 0: 'train'
       size: 2100 / 3000
       pcnt: 70.0 %

Partition 1: 'test'
       size: 600 / 3000
       pcnt: 20.0 %

Partition 2: 'validate'
       size: 300 / 3000
       pcnt: 10.0 %
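
The idea behind a multi-partition split can be sketched as follows. This is a minimal illustration, not the code in `source/k_nearest_neighbors.py`: the `partition` function, its `fractions` argument, and the seed value are all hypothetical names chosen for this example.

```python
import random

def partition(items, fractions, seed=0):
    """Shuffle `items` and cut them into one consecutive slice per partition.

    `fractions` maps a partition name to its share of the data (hypothetical
    interface); the shares must sum to 1.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    shuffled = list(items)
    # A local generator keeps the shuffle reproducible without touching
    # the global `random` state.
    random.Random(seed).shuffle(shuffled)

    parts, start, cumulative = {}, 0, 0.0
    names = list(fractions)
    for i, name in enumerate(names):
        cumulative += fractions[name]
        # The last partition takes whatever remains, so rounding never drops an item.
        is_last = i == len(names) - 1
        stop = len(shuffled) if is_last else round(cumulative * len(shuffled))
        parts[name] = shuffled[start:stop]
        start = stop
    return parts

parts = partition(range(3000), {"train": 0.7, "test": 0.2, "validate": 0.1})
sizes = {name: len(part) for name, part in parts.items()}
```

With 3000 items and a 70/20/10 split, `sizes` matches the partition sizes printed above. Cutting on cumulative shares instead of rounding each share separately keeps the partitions disjoint and exhaustive.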


## K=3, using the $\ell_1$-norm

With `n_neighbors=3` and the `manhattan` distance metric (the $\ell_1$-norm), the report on the 600-image test partition is:

In [3]:
model.report()


              precision    recall  f1-score   support

         cat       0.49      0.44      0.46       202
         dog       0.47      0.57      0.52       223
       panda       0.73      0.63      0.68       175

    accuracy                           0.54       600
   macro avg       0.56      0.54      0.55       600
weighted avg       0.55      0.54      0.54       600
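
For reference, the voting rule these settings select can be sketched in plain Python. This is a toy re-implementation under simplifying assumptions, not `sklearn`'s code: the helper names `manhattan` and `predict` and the four-point data set are made up for illustration, and an exact match (distance zero) simply returns that point's label.

```python
from collections import defaultdict

def manhattan(a, b):
    """l1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def predict(train, query, k=3):
    """Distance-weighted k-NN vote over `train`, a list of (vector, label) pairs."""
    nearest = sorted(train, key=lambda pair: manhattan(pair[0], query))[:k]
    votes = defaultdict(float)
    for vector, label in nearest:
        d = manhattan(vector, query)
        if d == 0:
            return label  # simplification: an exact match decides the vote
        votes[label] += 1.0 / d  # closer neighbors count for more
    return max(votes, key=votes.get)

# Hypothetical 2-D "images"; real inputs would be flattened pixel vectors.
train = [((0, 0), "cat"), ((4, 0), "dog"), ((0, 5), "dog"), ((9, 9), "panda")]
predicted = predict(train, (1, 0), k=3)
```

Here the three nearest neighbors are one `cat` at distance 1 and two `dog`s at distances 3 and 6. A uniform vote would pick `dog` two to one, but inverse-distance weighting gives `cat` a weight of 1 against 1/3 + 1/6 = 0.5 for `dog`, so `predicted` is `"cat"`.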



## Retraining

I added the ability to retrain the model directly through the `Model` type by making `sklearn.neighbors.KNeighborsClassifier` a superclass of `Model`, so the classifier's attributes and methods are accessible on any `Model` instance.

## Performance

Taking the Cartesian product of the following parameter lists, we can retrain the model on every combination and measure its mean accuracy:

```python
dims = [8, 16, 32, 64]
neighbors = [3, 5, 7, 9]
metrics = ["manhattan", "euclidean"]
weights = ["uniform", "distance"]
```
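
As a rough sketch, `itertools.product` (already imported above) can enumerate every combination of these lists. The `grid` variable and the dictionary layout are assumptions made for this example; how `cycle_params()` actually stores and consumes the combinations is not shown here.

```python
from itertools import product

dims = [8, 16, 32, 64]
neighbors = [3, 5, 7, 9]
metrics = ["manhattan", "euclidean"]
weights = ["uniform", "distance"]

# One entry per combination: 4 * 4 * 2 * 2 = 64 configurations in total.
grid = [
    {"dims": (d, d), "n_neighbors": k, "metric": m, "weights": w}
    for d, k, m, w in product(dims, neighbors, metrics, weights)
]
```

Each entry holds the image size and the classifier parameters for one retraining run. The rightmost list (`weights`) varies fastest, so consecutive entries differ only in the weighting scheme.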

Each time `Model.retrain()` is called, the seed for the `random` module is reset, so the testing data are the same across each run.
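
The effect of resetting the seed can be illustrated with a small sketch. The function name `shuffled_indices` and the seed value `42` are placeholders for this example, not values taken from `Model.retrain()`.

```python
import random

def shuffled_indices(n, seed=42):
    """Return a shuffled list of 0..n-1 after resetting the global seed."""
    random.seed(seed)  # same seed, so the same sequence of random draws follows
    indices = list(range(n))
    random.shuffle(indices)
    return indices

first = shuffled_indices(10)
second = shuffled_indices(10)
```

Because the seed is reset before each shuffle, `first` and `second` are identical, which is what keeps the train, test, and validate partitions fixed from one retraining run to the next.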

Each subplot below corresponds to one distance metric (by row) and one weighting scheme (by column). The horizontal axis of each subplot shows the pixel dimensions of the rescaled images, and the vertical axis shows the number of neighbors counted in the vote.

In [None]:
model.cycle_params()

In the results above, the highest accuracy came from `9` neighbors, with votes weighted by the inverse of the `distance`, measured using the $\ell_1$-norm, on `16 x 16`-pixel images.

In general, $k$-nearest neighbors is a poor choice for naive image classification on raw pixel values. One potential improvement is a very rudimentary face recognition step that isolates the animal's head or face, crops the image so that only that region remains, and then resizes it.

Even then, the image data vary with lighting, camera angle, and many other factors that can't be accounted for simply by analyzing the pixel intensities themselves.

In [None]:
model.retrain((16,16), {"n_neighbors": 9, "metric": "manhattan", "weights": "distance", "n_jobs": -1})
model.report("val")