# Assignment 3 - Linear Classifiers

The goal of this assignment is to familiarize yourself with the SVM classifiers available in scikit-learn and practice applying them to data.

This assignment does _not_ require you to implement your own classifier from scratch, but you may need to consult the scikit-learn documentation to figure out how to call the library methods.

**CSC 8515 - Machine Learning  
Assignment 3  
Scaffolding by Dr. Ben Mitchell  
Assignment completed by: James Fung**

In [1]:
# import the things we'll need
import numpy as np
from sklearn import neighbors
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split

## Load some data
Let's go ahead and use the iris dataset again.  Load it the same way we did in Assignment 1b, and split it into 60% train and 40% test just like we did there.

**Be sure to use the argument `random_state=0` just like we did in the previous assignment.**  This is a "seed" for the random number generator; any particular seed value should always result in the same set of "random" numbers.  We'll use fixed seed values here because it will allow everyone to get the same results.
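To make the effect of the seed concrete, here is a minimal sketch that splits the iris data several times and compares the results. The helper name `split` is hypothetical and exists only for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

def split(seed):
    # 60% train / 40% test, as described in the text above
    return train_test_split(iris.data, iris.target, test_size=0.4, random_state=seed)

first = split(0)
second = split(0)
other = split(1)

# Same seed: all four returned arrays are identical across calls.
same_seed_matches = all(np.array_equal(a, b) for a, b in zip(first, second))
# Different seed: the test labels are drawn differently.
different_seed_matches = np.array_equal(first[3], other[3])

print(first[0].shape, first[1].shape)
print(same_seed_matches, different_seed_matches)
```

Because the split changes with the seed, any accuracy measured on the test set is tied to that particular split.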

In [2]:
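# then load some data...
iris = datasets.load_iris()
# random_state was changed between runs (0, 1, 2) to answer the questions below;
# the cell outputs shown in this notebook match the random_state=2 split.
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=2)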

## Create a Nearest Neighbor classifier
Create a nearest neighbor classifier just like we did in Assignment 1b; here, we will use it as a baseline for comparing our new classifiers.  Be sure to train it and then evaluate its performance on the testing data (again, you should be able to use code from the previous assignment with little or no modification to do this).  Use an n_neighbors value of 1.  This should give you the same accuracy as you got in the previous assignment.

Once you've got it working, try changing the `random_state` value in the train_test_split function to 1 and re-running your nearest-neighbor classifier.  Try it again with a value of 2.  Write down each of the accuracies below; the first one has been done for you.
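One way to collect the three accuracies without editing the split cell by hand is to loop over the seeds. The sketch below uses scikit-learn's `neighbors.KNeighborsClassifier` as the baseline, which is an assumption: this notebook's own solution uses a hand-written function instead, and the two may break distance ties differently. The helper name `nn_accuracy` is hypothetical.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

def nn_accuracy(seed, k=1):
    # Re-split the data with the given seed, then train and score a k-NN classifier.
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=seed)
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

nn_results = {seed: nn_accuracy(seed) for seed in (0, 1, 2)}
for seed, acc in nn_results.items():
    print(f"seed={seed}, accuracy = {acc}")
```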

In [3]:
import math
from collections import Counter

def distance(x, y):
    # Euclidean distance: pair up the coordinates of x and y, then sum the squared differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def myNearestNeighbor(train, trainLabels, test, testLabels, k):
    # Build a list containing the predicted label for each test query.
    bestlabel = []
    for query in test:
        # Pair each training point's distance to the query with its label, sorted by distance.
        # A list of pairs is used rather than a dictionary keyed by distance, because
        # training points at equal distances would otherwise overwrite one another.
        ranked = sorted((distance(point, query), label) for point, label in zip(train, trainLabels))

        # Vote among the labels of the k nearest neighbors.
        nearestlabels = [label for _, label in ranked[:k]]
        counts = Counter(nearestlabels).most_common(2)
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            # The vote is tied, so one additional neighbor is added.
            nearestlabels = [label for _, label in ranked[:k + 1]]
            counts = Counter(nearestlabels).most_common(1)
        bestlabel.append(counts[0][0])

    # Compute the accuracy: the fraction of predicted labels that match the actual labels.
    correct = sum(1 for predicted, actual in zip(bestlabel, testLabels) if predicted == actual)
    return correct / len(test)

In [4]:
myNearestNeighbor(X_train,y_train,X_test,y_test,1)

0.9666666666666667

**Question 1: accuracy of nearest neighbor for different random seed values**

seed=0, accuracy = 0.9166666666666666

seed=1, accuracy = 0.9666666666666667

seed=2, accuracy = 0.9666666666666667

## Create a linear SVM
Code to create a linear SVM is given; however, it is left to you to add lines to train the SVM on the training data and then evaluate its accuracy on the testing data.  This should work just like the training/testing process did using the nearestNeighbor class.

Test your classifier on different train/test splits using the same three random-seed values as previously, and report the accuracies you get.
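As a rough illustration of the train-then-evaluate pattern, the sketch below fits a linear SVM on one split and checks that `score` reports the same number as computing the fraction of correct predictions by hand. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)

# Accuracy by hand: fraction of test points whose predicted label is correct.
predictions = clf.predict(X_test)
manual_accuracy = np.mean(predictions == y_test)

# Accuracy as reported by the classifier's own score method.
library_accuracy = clf.score(X_test, y_test)
print(manual_accuracy, library_accuracy)
```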

In [5]:
linearSvm = svm.SVC(kernel='linear')
linearSvm.fit(X_train,y_train)
linearSvm.score(X_test,y_test)

0.96666666666666667

**Question 2: accuracy of linear SVM for different random seed values**

seed=0, accuracy = 0.96666666666666667

seed=1, accuracy = 0.98333333333333328

seed=2, accuracy = 0.96666666666666667

## Create a polynomial-kernel SVM
Create another SVM, only this time use the argument `kernel='poly'` to make an SVM using a polynomial kernel.  Train and test it with different seed values as before.
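For intuition about what the polynomial kernel computes, the sketch below evaluates $(\gamma \langle x, x' \rangle + c_0)^d$ by hand on a few made-up points and compares it with scikit-learn's `polynomial_kernel` helper. The data and parameter values are illustrative assumptions, not the settings `SVC` picks by default; the parameters correspond to the `gamma`, `coef0`, and `degree` arguments of `SVC`.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Three made-up 2-D points, purely for illustration.
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
degree, gamma, coef0 = 3, 0.5, 1.0  # illustrative values

# Kernel matrix by hand: (gamma * <x, x'> + coef0) ** degree for every pair of rows.
manual_poly = (gamma * (X @ X.T) + coef0) ** degree

# The same matrix from scikit-learn's pairwise helper.
library_poly = polynomial_kernel(X, X, degree=degree, gamma=gamma, coef0=coef0)
print(np.allclose(manual_poly, library_poly))
```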

In [6]:
polySvm = svm.SVC(kernel='poly')
polySvm.fit(X_train,y_train)
polySvm.score(X_test,y_test)

0.94999999999999996

**Question 3: accuracy of polynomial SVM for different random seed values**

seed=0, accuracy = 0.96666666666666667

seed=1, accuracy = 0.98333333333333328

seed=2, accuracy = 0.94999999999999996

## Create an RBF-kernel SVM

Create another SVM, this time using `'rbf'` for the kernel type.  Train and test as before.
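The RBF (radial basis function) kernel scores two points by $\exp(-\gamma \lVert x - x' \rVert^2)$, so nearby points score close to 1 and distant points close to 0. The sketch below computes it by hand on a few made-up points and compares the result with scikit-learn's `rbf_kernel` helper. The data and the `gamma` value are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Three made-up 2-D points, purely for illustration.
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
gamma = 0.5  # illustrative value

# Squared Euclidean distance between every pair of rows, via broadcasting.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
manual_rbf = np.exp(-gamma * sq_dists)

# The same matrix from scikit-learn's pairwise helper.
library_rbf = rbf_kernel(X, X, gamma=gamma)
print(np.allclose(manual_rbf, library_rbf))
```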

In [7]:
rbfSvm = svm.SVC(kernel='rbf')
rbfSvm.fit(X_train,y_train)
rbfSvm.score(X_test,y_test)

0.96666666666666667

**Question 4: accuracy of RBF SVM for different random seed values**

seed=0, accuracy = 0.94999999999999996

seed=1, accuracy = 0.98333333333333328

seed=2, accuracy = 0.96666666666666667