# Homework 1 - KNN - Girish Narayanswamy
## CSCI 5622 - Spring 2019

For today's assignment, we will be implementing our own K-Nearest Neighbors (KNN) algorithm.

*But Professor Quigley, hasn't someone else already written KNN before?*

Yes, you are not the first to implement KNN, or basically any algorithm we'll work with in this class. But 1) I'll know that you know what's really going on, and 2) you'll know you can do it, because 2a) someday you might have to implement some machine learning algorithm from scratch - maybe for a new platform (do you need to run python on your SmartToaster just to get it to learn how users like their toast?), maybe because you want to tweak the algorithm (there's always a better approach...), or maybe because you're working on something important and you need to control exactly what's on there (should you really be running anaconda on your secret spy plane?).

That said - we're not going to implement *everything*. We'll start by importing a few helper functions.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets

# test imports - unused
import sklearn.model_selection

*Wait a minute - didn't we just import Scikit-learn (sklearn)? The package with baked-in machine learning tools?*

Yes - but it also has a ton of helper functions, including a dataset we'll be using later. But, for now, let's set up a KNNClassifier class.
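
One of those helpers is `sklearn.neighbors.BallTree`, which the class below uses to look up nearest neighbors. As a minimal sketch (the toy points are invented for illustration and are not part of the assignment), `query` expects a 2D array of query points and returns distances and indices with one row per query. With the default metric, the result should agree with a brute-force Euclidean search:

```python
import numpy as np
from sklearn.neighbors import BallTree

# Toy training points, invented for this illustration only
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
tree = BallTree(X)  # default metric is Minkowski with p=2, i.e. Euclidean

query = np.array([[0.9, 0.1]])      # must be 2D: (n_queries, n_features)
dist, ind = tree.query(query, k=2)  # both results have shape (n_queries, k)

# Brute-force check: Euclidean distance from the query to every training point
brute = np.sqrt(((X - query[0]) ** 2).sum(axis=1))
brute_ind = np.argsort(brute)[:2]

print(ind[0], brute_ind)
```

Because `query` returns 2D arrays even for a single point, a caller typically reshapes the point to `(1, -1)` and reads row 0 of the result.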

In [None]:
import sklearn.neighbors

class KNNClassifier:
    
    def __init__(self, X, y, k = 5):
        """
        Initialize our custom KNN classifier
        PARAMETERS
        X - our training data features
        y - our training data answers
        k - the number of nearest neighbors to consider for classification
        """
        self._model = sklearn.neighbors.BallTree(X)
        self._y = y
        self._k = k
        self._counts = self.getCounts()
        
    def getCounts(self):
        """
        Creates a dictionary storing the counts of each answer class found in y
        RETURNS
        counts - a dictionary of counts of answer classes
        """
        
        counts = dict({1:0,-1:0})
        #BEGIN Workspace 1.1
        #TODO: Modify and/or add to counts so that it returns a count of each answer class found in y
        
        vals, val_counts = np.unique(self._y, return_counts = True)
        counts = dict(zip(vals.tolist(), val_counts.tolist()))
        
        #END Workspace 1.1
        return(counts)
    
    def majority(self, indices):
        """
        Given indices, report the majority label of those points.
        For a tie, report the most common label in the data set.
        PARAMETERS
        indices - an np.array, where each element is an index of a neighbor
        RETURNS
        label - the majority label of our neighbors
        """
        label = 0
        #BEGIN Workspace 1.2
        #TODO: Determine majority, assign it to label
        
        label_vals, label_counts = np.unique(np.array(((self._y).flatten())[indices]), return_counts=True)  # grabs labels and their occurrences
        max_label_count = np.max(label_counts)  # maximum occurrence value
        tie_indexes = (np.argwhere(label_counts == max_label_count)).flatten()  # grabs indices of all labels with max occurrence

        label = label_vals[tie_indexes[0]]  # init output as first of tied labels
        tie_break_label_count = self._counts[label]  # that label's count in the whole training set

        if tie_indexes.size > 1:  # if tie exists
            for i in label_vals[tie_indexes]:  # iterate through all tied labels
                if self._counts[i] > tie_break_label_count:  # compare counts in the whole training set
                    label = i  # more common overall, so switch label
                    tie_break_label_count = self._counts[i]  # remember the best count seen so far
        
        
        #END Workspace 1.2
        return(label)
    
    def classify(self, point):
        """
        Given a new data point, classify it according to the training data X and our number of neighbors k into the appropriate class in our training answers y
        PARAMETERS
        point - a feature vector of our test point
        RETURNS
        ans - our predicted classification
        """
        ans = 0
        #BEGIN Workspace 1.3
        #TODO: perform classification of point here
        #HINT: use the majority function created above
        #HINT: use the Euclidean distance discussed in lecture to find nearest neighbors
        
        distances, indices = self._model.query(point.reshape((1, -1)), k = self._k)
        ans = self.majority(indices)
        
        #END Workspace 1.3
        return(ans)
    
    def confusionMatrix(self, testX, testY):
        """
        Generate a confusion matrix for the given test set
        PARAMETERS
        testX - an np.array of feature vectors of test points
        testY - the corresponding correct classifications of our test set
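        RETURN
        C - an N*N np.array of counts, where N is the number of classes in our classifier (rows are predicted labels, columns are actual labels)
        confusedIndx - a list of indices into the test set of the misclassified points
        confusedLabel - a list of the (incorrect) labels predicted for those points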
        """
        #C = np.array() # modified below
        N = len(self._counts)
        C = np.zeros((N, N))
        
        #BEGIN Workspace 1.4
        #TODO: Run classification for the test set, compare to test answers, and add counts to matrix
        
        # the following is done to account for negative labels, and sparse label values (non-contiguous)
        labels_list = np.fromiter(self._counts.keys(), dtype=float)  # list of all labels
        
        # 3.1 usage
        confusedIndx = [] # used to iterate through the test images 
        confusedLabel = [] # used to show what the classifier thought the label is
        
        # iterate through test set 
        for i in range(testY.size): # iterate over every test point (the last one included)
            
            test_label = self.classify(testX[i]) # grab the classification of test point x
            actual_label = (testY.flatten())[i] # grab the actual label
            
            test_label_index = (np.argwhere(labels_list == test_label)).flatten() # find the index of that label in the labels list
            actual_label_index = (np.argwhere(labels_list == actual_label)).flatten() # find the index of that label in the labels list
            
            C[test_label_index[0], actual_label_index[0]] += 1 # increment the entry [test_label_index, actual_label_index] of the confusion matrix by 1
            
            # 3.1 usage
            if test_label != actual_label:
                confusedIndx.append(i)
                confusedLabel.append(test_label)
                      
        #END Workspace 1.4
        return(C, confusedIndx, confusedLabel)
    
    def accuracy(self, C):
        """
        Generate an accuracy score for the classifier based on the confusion matrix
        PARAMETERS
        C - an np.array of counts
        RETURN
        score - an accuracy score
        """
        score = np.sum(C.diagonal()) / C.sum()
        return(score)

*But professor, this code isn't complete!*

### Problem 1: Complete our KNN Classifier - 40 Points (10 each)

1.1 - Complete the getCounts function to return the count of each class found in the training set

1.2 - Complete the majority function to determine the majority class of a series of neighbors

1.3 - Complete the classify function to capture the predicted class of a new datapoint

 - HINT: Use the BallTree documentation to determine how to retrieve neighbors from the model (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html#sklearn.neighbors.BallTree)

1.4 - Complete the confusionMatrix function to reveal the results of classification

You can take a look at the unit tests below to see how we create data to input into our classifier, what kinds of things we expect as output, etc. You should also consider expanding the test cases to make sure your classifier is working correctly.


In [None]:
import unittest

class KNNTester(unittest.TestCase):
    def setUp(self):
        self.x = np.array([[3,1],[2,8], [2,7], [5,2],[3,2],[8,2],[2,4]])
        self.y = np.array([[1, -1, -1, 1, -1, 1, -1]])
        self.knnfive = KNNClassifier(self.x, self.y)
        self.knnthree = KNNClassifier(self.x, self.y, 3)
        self.knnone = KNNClassifier(self.x, self.y, 1)
        
        self.testPoints = np.array([[2,1], [2,6], [4, 4]])
        
    def testCounter(self):
        """
        Test getCounts function from knnclassifier
        """
        self.assertEqual(self.knnfive._counts[1], 3)
        self.assertEqual(self.knnfive._counts[-1], 4)
        
    def testKNNOne(self):
        """
        Test if the classifier returns "correct" (expected) classifications for k = 1
        """
        self.assertEqual(self.knnone.classify(self.testPoints[0]), 1)
        #BEGIN Workspace
        #Add more tests as needed
        #END Workspace
    
    #BEGIN Workspace
    #Add more test functions as requested
    def testConfusionMtx(self):
        """
        Test if the confusion matrix is properly built for knn 5
        """
        testX = np.array([[3,1],[2,8], [2,7], [5,2],[3,2],[8,2],[2,4]])
        testY = np.array([[1, -1, -1, 1, -1, 1, -1]])
        
        
        C = self.knnfive.confusionMatrix(testX, testY)[0]
        score = self.knnfive.accuracy(C)
        
        # every test point must be counted exactly once
        self.assertEqual(C.sum(), testY.size)
        
        # rows are predicted labels, columns are actual labels, both in sorted order (-1, 1)
        self.assertEqual(C[0][0], 3)
        self.assertEqual(C[0][1], 0)
        self.assertEqual(C[1][0], 1)
        self.assertEqual(C[1][1], 3)
        self.assertAlmostEqual(score, 6/7)
    
    #HINT - You'll want to make sure your
    #END Workspace
    
tests = KNNTester()
myTests = unittest.TestLoader().loadTestsFromModule(tests)
unittest.TextTestRunner().run(myTests)

OK - now we've demonstrated that our KNN classifier works, let's think about our problem space! 
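
The tests above use odd values of k with two classes, so they never exercise a tie. A minimal standalone sketch of the tie rule described in the `majority` docstring (report the label that is most common in the whole data set) is given here; the function name `majority_with_tiebreak` and the counts are hypothetical, chosen only for illustration:

```python
import numpy as np

def majority_with_tiebreak(neighbor_labels, global_counts):
    """Most common neighbor label; ties go to the label most common in the training set."""
    vals, counts = np.unique(neighbor_labels, return_counts=True)
    tied = vals[counts == counts.max()]  # every label sharing the top vote count
    # Among the tied labels, prefer the one with the largest training-set count
    return max(tied.tolist(), key=lambda lab: global_counts[lab])

# Hypothetical training-set counts: -1 appears 4 times, 1 appears 3 times
global_counts = {-1: 4, 1: 3}

clear_winner = majority_with_tiebreak(np.array([1, 1, -1]), global_counts)
tie_broken = majority_with_tiebreak(np.array([1, -1]), global_counts)
print(clear_winner, tie_broken)
```

A case like the second call (an even k with a 1-1 vote) could be worth adding to the unit tests.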

## Our Dataset - Identifying Digits from Images

It's a pretty common problem - just imagine working at the post office, or at a bank, and you're handed a hand-written envelope, or check, or other piece of information and you have to identify exactly what it says. Did they pay 500 or 600 dollars? Is the letter going to 80309 (campus) or 30309 (Atlanta)?

Let's be a little smart about this - let's set up some classes and helper functions to help us out.

### Problem 2: Implement KNN on Digits dataset - 30 Points

2.1 Randomly divide our Digits dataset into training and testing sets (15 Points)

2.2 Report the number of examples in training and testing, as well as measuring the number of pixels in each image (5 points)

2.3 Create a confusion matrix of our classifier for K = 5 (10 points) *HINT: Doing this may cause you to catch mistakes in your classifier. Go fix those!*
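
As a rough sketch of the random split asked for in 2.1: shuffle the row indices once, then slice the shuffled indices so that features and answers stay paired. The helper name `random_split`, its `seed` argument, and the stand-in data are assumptions made for this illustration, not part of the assignment:

```python
import numpy as np

def random_split(X, y, train_fraction=0.8, seed=None):
    """Shuffle row indices once, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])               # random ordering of 0..m-1
    cut = int(round(train_fraction * X.shape[0]))   # number of training rows
    train_idx, test_idx = idx[:cut], idx[cut:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Stand-in data (not the Digits set): 10 rows, 3 features, label = row number
X = np.arange(30).reshape(10, 3)
y = np.arange(10)
train_x, train_y, test_x, test_y = random_split(X, y, 0.8, seed=0)
print(train_x.shape, test_x.shape)
```

The key point is that the permutation itself is used for indexing; computing it and then slicing the data in its original order would not shuffle anything.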

In [None]:

class Numbers:
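    def __init__(self, k = 5):
        self.k = k # number of nearest neighbors used by classify(); can still be overwritten after construction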
        #load data from sklearn
        digits = sklearn.datasets.load_digits()
        
        #BEGIN Workspace 2.1
        m, n = digits.data.shape
        idx = np.random.permutation(m) # grab a random perm from 0-m
        range_train = idx[:round(0.8*m)] # 80 percent split, taken from the shuffled indices
        range_test = idx[round(0.8*m):] # 20 percent split, taken from the shuffled indices
        
        self.train_x = np.array(digits.data[range_train]) # A 2D np.array of training examples
        self.train_y = np.array(digits.target[range_train]) # A 1D np.array of training answers
        self.test_x = np.array(digits.data[range_test]) # A 2D np.array of testing examples
        self.test_y = np.array(digits.target[range_test]) # A 1D np.array of testing answers
        
        # used for problem 3.1
        self.test_imgs = digits.images[range_test]

        # Alternately... 
        #self.train_x, self.test_x, self.train_y, self.test_y = sklearn.model_selection.train_test_split(digits.data, digits.target, test_size=0.20)
        
        #self.train_x = np.array() # A 2D np.array of training examples, REPLACE
        #self.train_y = np.array() # A 1D np.array of training answers, REPLACE
        #self.test_x = np.array() # A 2D np.array of testing examples, REPLACE
        #self.test_y = np.array() # A 1D np.array of testing answers, REPLACE
        #TODO: Divide our dataset into Train and Test datasets (80/20 split), replacing the variables above
         
        #END Workspace 2.1
        
    def report(self):
        """
        Report information about the dataset using the print() function
        """
        #BEGIN Workspace 2.2
        #TODO: Create printouts for reporting the size of each set and the size of each datapoint
        
        print("---------------------------------")
        print("Training and Test Dataset Report:")
        print("---------------------------------")
        print("train_x size:", self.train_x.shape[0])
        print("train_y size:", self.train_y.shape[0])
        print("test_x size:", self.test_x.shape[0])
        print("test_y size:", self.test_y.shape[0])
        print("pixels per data point:", self.train_x.shape[1])
        print("")
        
        #END Workspace 2.2
        

    def classify(self):
        """
        Create a classifier using the training data and generate a confusion matrix for the test data
        """
        #BEGIN Workspace 2.3
        #TODO: Create classifier from training data, generate confusion matrix for test data
        
        self.knnInst = KNNClassifier(self.train_x, self.train_y, self.k)
        C, confusedIndx, confusedLabel = self.knnInst.confusionMatrix(self.test_x, self.test_y)
        score = self.knnInst.accuracy(C)
        
        print("---------------------------------")
        print("Classifier Report:")
        print("---------------------------------")
        print(C, "Confusion Matrix")
        print("")
        print(score, "Accuracy Score")
        print("")
        
        # 3.1 usage
        for i in range(len(confusedIndx)):
            self.viewDigit(self.test_imgs[confusedIndx[i]])
            print("Confused for:", confusedLabel[i])
            print("Actually:", self.test_y[confusedIndx[i]])
            print("")
        
        #END Workspace 2.3
        
    def viewDigit(self, digitImage):
        """
        Display an image of a digit
        PARAMETERS
        digitImage - a data object from the dataset
        """
        plt.gray()
        plt.matshow(digitImage)
        plt.show()  

In [None]:
# These unit tests are not run because 3.1 effectively checks all these things 

import unittest

class NumbersTester(unittest.TestCase):
    def setUp(self):
        self.numbers_test = Numbers()
        self.numbers_test.k = 5
        
    def testReport(self):
        """
        Test the report function from Numbers
        """
        self.numbers_test.report()
        
    def testClassify(self):
        """
        Test the classify function from Numbers
        """
        self.numbers_test.classify()
        
    def testViewDigit(self):
        """
        Test viewDigit, and get sample images of digits 0-9
        """
        images = sklearn.datasets.load_digits().images # load once rather than on every iteration
        for i in range(10):
            self.numbers_test.viewDigit(images[i])
    
#tests = NumbersTester()
#myTests = unittest.TestLoader().loadTestsFromModule(tests)
#unittest.TextTestRunner().run(myTests)

*Wow, I can't believe we just created a KNN Classifier - but can't we make it better?*

Yes, we saw above that our classifier didn't work perfectly. Let's explore that issue a little further.

### Problem 3: Improving KNN on Digits - 30 Points

3.1 Determine which classes are most often confused (from our confusion matrix above), inspect some examples of these digits (using the viewDigit function in our Numbers class), and write a brief (4 - 5 sentences) description of why you think these particular numbers may be misclassified.

3.2 Explore the influence of the number of nearest neighbors (i.e. try changing our K). Plot the relationship between K and accuracy, and write a brief (4 - 5 sentences) description of how this factor impacts our accuracy.

3.3 (Bonus) Explore the influence of the train / test split of our data (i.e. copy our Numbers class into Numbers2 below and try changing the split for our dataset). Plot the relationship between the split % and accuracy, and write a brief (4 - 5 sentences) description of its impact.

In [None]:
#BEGIN 3.1a
#TODO: Print out problem class images

# For a classification with a 0.8 training set split and k = 5 nearest neighbors
# The following code prints the size report, then generates and prints the confusion matrix and the accuracy score
# Finally it displays every misclassified image, along with its predicted label and its actual label
     
test31 = Numbers()
test31.k = 5
test31.report()
test31.classify()

#END 3.1a

#### 3.1b
TODO: Write description of misclassification

The above output shows the misclassified images, along with their predicted and actual labels. The digits most often involved in confusions seem to be 3, 7, 8, and 9: either these digits are mistaken for others, or other digits are mistaken for them. These digits may be misclassified often because their images lie fairly close to images of other digits in Euclidean space. As a result, they often fall among the nearest neighbors of a differently labeled image, so several classes end up sharing the same neighborhood.

To be fair to the algorithm, I could not correctly classify many of the confused images (displayed above) by eye.
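
One way to read the most-confused classes off the confusion matrix, rather than by eye, is to zero the diagonal and rank the remaining cells. This is only a sketch: the helper name `most_confused` and the 3-class matrix are made up for illustration. Rows are predicted labels and columns are actual labels, matching `confusionMatrix` above:

```python
import numpy as np

def most_confused(C, top=3):
    """Return (predicted_index, actual_index, count) for the largest off-diagonal cells."""
    off = C.copy()
    np.fill_diagonal(off, 0)                         # ignore correct classifications
    order = np.argsort(off, axis=None)[::-1][:top]   # flat indices, largest count first
    rows, cols = np.unravel_index(order, off.shape)
    return [(int(r), int(c), int(off[r, c])) for r, c in zip(rows, cols) if off[r, c] > 0]

# Made-up 3-class matrix: rows = predicted, columns = actual
C = np.array([[10, 0, 1],
              [ 4, 9, 0],
              [ 0, 2, 8]])
pairs = most_confused(C)
print(pairs)
```

The indices are positions in the sorted label list; for the Digits labels 0 through 9 they coincide with the digits themselves.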

In [None]:
class Numbers2:
    def __init__(self, trainPercentage):
        #load data from sklearn
        digits = sklearn.datasets.load_digits()
        
        #BEGIN Workspace 3.3a
        
        m, n = digits.data.shape
        idx = np.random.permutation(m) # grab a random perm from 0-m
        range_train = idx[:round(trainPercentage*m)] # trainPercentage split, taken from the shuffled indices
        range_test = idx[round(trainPercentage*m):] # 1 - trainPercentage split, taken from the shuffled indices
        
        self.train_x = np.array(digits.data[range_train]) # A 2D np.array of training examples
        self.train_y = np.array(digits.target[range_train]) # A 1D np.array of training answers
        self.test_x = np.array(digits.data[range_test]) # A 2D np.array of testing examples
        self.test_y = np.array(digits.target[range_test]) # A 1D np.array of testing answers
        
        #self.train_x = np.array() # A 2D np.array of training examples, REPLACE
        #self.train_y = np.array() # A 1D np.array of training answers, REPLACE
        #self.test_x = np.array() # A 2D np.array of testing examples, REPLACE
        #self.test_y = np.array() # A 1D np.array of testing answers, REPLACE
        #TODO: Divide our dataset into Train and Test datasets (using trainPercentage), replacing the variables above
        #HINT: You should be able to mostly copy your own work from the original Numbers class
        #END Workspace 3.3a

    def classify(self, k):
        """
        Create a classifier using the training data and generate a confusion matrix for the test data
        """
        #BEGIN Workspace 3.2a
        #TODO: Create classifier from training data (using k nearest neighbors), generate confusion matrix for test data
        #HINT: You can copy your own work from the original Numbers class
        
        self.knnInst = KNNClassifier(self.train_x, self.train_y, k)
        C = self.knnInst.confusionMatrix(self.test_x, self.test_y)[0]
        score = self.knnInst.accuracy(C)
        
        return score
        
        #END Workspace 3.2a
        
    def viewDigit(self, digitImage):
        """
        Display an image of a digit
        PARAMETERS
        digitImage - a data object from the dataset
        """
        plt.gray()
        plt.matshow(digitImage)
        plt.show()

# PLEASE NOTE: THE GRAPHS TAKE A VERY LONG TIME TO GENERATE 

# 3.2 relationship of accuracy and k        
test32 = Numbers2(0.8)
scoreArray32 = np.zeros((2,len(test32.train_x) - 1)) # one column per k, so no unfilled (0,0) point is plotted
for k in range(1,len(test32.train_x)):
    scoreArray32[0][k - 1] = k
    scoreArray32[1][k - 1] = test32.classify(k)
    
plt.figure(1)
plt.scatter(scoreArray32[0,:], scoreArray32[1,:])
plt.title("Accuracy Score as a Function of k Nearest Neighbors")
plt.xlabel("Number of Nearest Neighbors (k)")
plt.ylabel("Accuracy Score")
plt.show()

# 3.3 relationship of accuracy and split   
scoreArray33 = np.zeros((2,9)) # check 9 different splits (0.1 through 0.9)
for split in range(1,10):
    
    test33 = Numbers2(split/10)
    scoreArray33[0][split - 1] = (split/10)
    scoreArray33[1][split - 1] = test33.classify(100)
    
plt.figure(2)
plt.scatter(scoreArray33[0,:], scoreArray33[1,:])
plt.title("Accuracy Score as a Function of Percentage Split of Training Set")
plt.xlabel("Percentage Split of Training Set")
plt.ylabel("Accuracy Score")
plt.show()

#### 3.2b
TODO: Write description of influence of neighbor count

The neighbor count k has an inverse relationship with the accuracy score: as k increases, the accuracy drops almost linearly. Because images with the same label are close in Euclidean space, for low values of k the nearest neighbors consist mostly of images with the same label as the image being classified. When k becomes large, however, the neighborhood includes many images with a different label. Even though these images may be far away in Euclidean space, once they fall within the k-nearest neighborhood they count just as much as the images close to the one being classified. Since neighbors are not weighted by their distance from the image being classified, the accuracy decreases as k increases.
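
To make the point about unweighted neighbors concrete, here is a small sketch comparing a plain majority vote with an inverse-distance vote. Inverse-distance weighting is one common choice, not something the assignment's classifier does, and the function name `weighted_vote` and the neighborhood below are invented for illustration:

```python
import numpy as np

def weighted_vote(labels, distances, eps=1e-9):
    """Pick the label with the largest total 1/distance weight among the neighbors."""
    weights = 1.0 / (np.asarray(distances, dtype=float) + eps)  # eps avoids division by zero
    candidates = np.unique(labels)
    totals = [weights[labels == c].sum() for c in candidates]
    return candidates[int(np.argmax(totals))]

# Invented neighborhood: two close 3s and three far 8s
labels = np.array([3, 3, 8, 8, 8])
distances = np.array([1.0, 1.5, 6.0, 7.0, 9.0])

uniform_pick = np.bincount(labels).argmax()       # plain majority vote
weighted_pick = weighted_vote(labels, distances)  # inverse-distance vote
print(uniform_pick, weighted_pick)
```

In this toy case the plain vote picks 8 while the weighted vote picks 3, which is the kind of difference that could matter at large k.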

#### 3.3b
TODO: Write description of influence of training / testing split

As the training set percentage increases, the accuracy of the KNN classification increases, roughly like a log function. If the training set is small relative to the test set, there are few labeled examples with which to classify the test images, which leads to poor accuracy. With a large training set, on the other hand, the nearest neighbors of an image being classified are much more densely packed around it and better represent its class. Thus, as the training percentage increases, the accuracy also increases. The log-like growth of this trend implies that beyond a certain percentage the accuracy begins to plateau (at some relatively high value) as the training split grows further.