**Homework 1**

We begin with the usual import, and a new one:

In [1]:
import numpy as np
from sklearn.datasets import load_iris

Now load the iris dataset.

In [2]:
iris=load_iris()
X=iris.data 
y=iris.target

The columns of the numpy array `X` (our "feature matrix") give the Sepal Length, Sepal Width, Petal Length and Petal Width of 150 different observed iris flowers. `y` is our "target", an array of 150 integers indicating the specific species of iris, where 0=Setosa, 1=Versicolor, and 2=Virginica.
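
To make the label encoding concrete, here is a small sketch of how the integer labels can be turned back into species names with NumPy integer-array indexing. The arrays `species_names` and `y_demo` are made up for this illustration and are not part of the assignment.

```python
import numpy as np

# Species names in label order, as described above: 0, 1, 2.
species_names = np.array(["Setosa", "Versicolor", "Virginica"])

# A made-up target array for illustration (not the real iris targets).
y_demo = np.array([0, 0, 2, 1, 2])

# Indexing an array with an integer array looks up one name per label.
named = species_names[y_demo]
print(named)
```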

Here are the first few rows of `X`:

In [3]:
X[:5,:]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

For this assignment, we'll only work with the Petal Length and Petal Width of each flower, so we can redefine `X` to be just the last two columns:

In [4]:
X=X[:,2:]
X.shape

(150, 2)

Define a function `sq_distances` with inputs `X` (a numpy array with two columns), `length` and `width` (the Petal Length and Petal Width of an unknown flower). The function should return an array of squared distances from the unknown point to each point in `X`. Use vectorized Numpy operations, NOT A FOR LOOP. 

In [27]:
def sq_distances(X,length,width):
  p = np.array([length, width])
  return np.sum((X-p)**2, axis=1)
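
To see why no loop is needed, it may help to trace the broadcasting on a tiny example. The three points in `X_demo` below are made up for illustration; the computation mirrors the body of `sq_distances`.

```python
import numpy as np

# Three made-up points (Petal Length, Petal Width), for illustration only.
X_demo = np.array([[1.0, 1.0],
                   [4.0, 5.0],
                   [2.0, 1.0]])
p = np.array([1.0, 1.0])          # the unknown flower

diff = X_demo - p                 # shape (3, 2): p is subtracted from every row
sq = np.sum(diff**2, axis=1)      # shape (3,): one squared distance per row
print(sq)                         # [ 0. 25.  1.]
```

Squared distances are enough here because taking a square root does not change which points are closest.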

Define a function `SpeciesOfNeighbors` that returns the species labels (each a number 0, 1, or 2) of the k points in `X` nearest to the point with the given Petal Length and Petal Width. (The species label for each point in `X` is stored in the array `y`.) *Hint: The numpy function `argsort()` is useful for this problem.*

In [41]:
def SpeciesOfNeighbors(X,y,length,width,k):
  # Call sq_distances to get distances
  dist = sq_distances(X, length, width)

  # argsort() returns the indices that would sort the distances in ascending order
  index_order = np.argsort(dist)
  index_order = index_order[:k] # keep the indices of the k smallest distances

  return y[index_order] # returns the species at the k-smallest distances
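
As a quick illustration of the `argsort()` step, the sketch below uses made-up distances and labels (`dist_demo` and `y_demo` are hypothetical) to show how sorted indices select the neighbors' labels.

```python
import numpy as np

dist_demo = np.array([9.0, 1.0, 4.0, 0.25])   # made-up squared distances
y_demo = np.array([2, 0, 1, 0])                # made-up labels

order = np.argsort(dist_demo)    # indices that would sort dist_demo ascending
nearest = order[:3]              # indices of the 3 smallest distances
print(order)                     # [3 1 2 0]
print(y_demo[nearest])           # [0 0 1]
```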

Create a function `majority` that takes an array of labels, and returns the label that appears the most often. *Hint: The numpy functions `bincount()` and `argmax()` can be useful here.*

In [39]:
def majority(labels):
  # count the occurrences of each label with bincount()
  label_count = np.bincount(labels)
  return np.argmax(label_count) # returns the label with the highest count
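
The small example below, with made-up labels in `labels_demo`, shows what `bincount()` and `argmax()` each return. It also shows how ties behave: `argmax()` returns the first maximum, so a tie goes to the smallest label.

```python
import numpy as np

labels_demo = np.array([1, 2, 2, 1, 0])   # made-up neighbor labels

counts = np.bincount(labels_demo)   # counts[i] = number of times label i appears
winner = np.argmax(counts)          # index of the first largest count
print(counts)                       # [1 2 2]
print(winner)                       # 1 (labels 1 and 2 tie; the smaller one wins)
```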

Combine your previous functions to create a function `KNN` which takes a feature matrix `X` of known Petal Lengths and Petal Widths, a target array `y` containing their species labels, a hyperparameter `k`, and the `length` and `width` of the petal of an unknown flower. Your function should return the most common species index among the k nearest neighbors of the unknown flower. 

In [36]:
def KNN(X,y,length,width,k):
    labels = SpeciesOfNeighbors(X, y, length, width, k)
    return majority(labels)

Test your code by playing with a few values for length, width, and k. For example, try:

In [53]:
KNN(X,y,1,1,7)


0

Moving forward, we'll write our ML models as classes that conform to the standards of the sklearn package. Let's do this now. Modify your functions above to create appropriate methods for the following class:

In [57]:
class KNeighborsClassifier():
    def __init__(self,k):
        self.n_neighbors=k

    def fit(self,X,y):
        self.X=X
        self.y=y

    def sq_distances(self,length,width):
        p = np.array([length, width])
        return np.sum((self.X-p)**2, axis=1)

    def SpeciesOfNeighbors(self,length,width):
        # Call the sq_distances method (not the global function) to get distances
        dist = self.sq_distances(length, width)

        # argsort() returns the indices that would sort the distances in ascending order
        index_order = np.argsort(dist)
        index_order = index_order[:self.n_neighbors] # keep the indices of the k smallest distances

        return self.y[index_order] # returns the species at the k-smallest distances

    def majority(self,labels):
        # count the occurrences of each label with bincount()
        label_count = np.bincount(labels)
        return np.argmax(label_count) # returns the label with the highest count

    def predict(self,length, width):
        labels = self.SpeciesOfNeighbors(length, width)
        return self.majority(labels)

If done correctly, the following code should produce the same answer as before:

In [58]:
knn=KNeighborsClassifier(7)
knn.fit(X,y)
knn.predict(1,1)

0
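
In sklearn itself, `predict` takes a 2D array with one row per query point rather than separate `length` and `width` arguments. The sketch below shows one way the same steps might be extended to handle several query points at once. The function `predict_many` and all the data are hypothetical and go beyond what the assignment asks for.

```python
import numpy as np

def predict_many(X_train, y_train, X_query, k):
    """Hypothetical batch version of predict: one label per row of X_query."""
    # Pairwise squared distances via broadcasting, shape (n_query, n_train).
    diff = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    sq = np.sum(diff**2, axis=2)
    # Indices of the k nearest training points for each query row.
    nearest = np.argsort(sq, axis=1)[:, :k]
    neighbor_labels = y_train[nearest]
    # Majority vote per query row (ties go to the smallest label).
    return np.array([np.argmax(np.bincount(row)) for row in neighbor_labels])

# Made-up training data: two small, well-separated clusters.
X_train = np.array([[1.0, 0.2], [1.2, 0.3], [1.1, 0.1],
                    [5.0, 2.0], [5.2, 2.1], [4.9, 1.9]])
y_train = np.array([0, 0, 0, 2, 2, 2])
X_query = np.array([[1.0, 0.3], [5.1, 2.0]])

preds = predict_many(X_train, y_train, X_query, k=3)
print(preds)   # [0 2]
```

The distance computation stays fully vectorized; only the final vote iterates over query rows, since `bincount()` works on one 1D array at a time.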