<a href="https://colab.research.google.com/github/cpaniaguam/CSC104/blob/main/CSC104_Assignment20_kNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## $k$-nearest neighbors
We would like to assign a label to an object $x_0$ based on the labels of the objects $x_1,x_2,\ldots, x_n.$ One of the simplest ways to do this is to consider the labels of the $k$ objects nearest to $x_0$. Each of these $k$ neighbors votes for $x_0$ to be assigned its own label, and $x_0$ receives the label with the most votes.

Here is a very simplistic example.

In [9]:
import pandas as pd
import numpy as np
np.random.seed(123)
training_df = pd.DataFrame({'A':np.random.randint(1,100,10),
                     'B':np.random.random(10),
                     'label':np.random.choice(['X','Y'],10)})
# we create a validation set for later
validation_df = pd.DataFrame({'A':np.random.randint(1,10,5),
                     'B':np.random.random(5),
                     'label':np.random.choice(['X','Y'],5)})

Let us classify object `x_0` with values `{'A':1,'B':1.25}`. We will assign `x_0` the label of *the nearest* object ($k=1$) in `training_df`. First let us define the Euclidean distance function we are going to use.

In [10]:
x_0 = pd.DataFrame({'A':1,'B':1.25},index=[0]) # target object to label
def distance(x_0, row):
    return ((x_0.A-row["A"])**2+(x_0.B-row["B"])**2)**.5
    # return ((x_0.A-row.A)**2+(x_0.B-row.B)**2)**.5

Now we test this function by computing the distance between `x_0` and the first observation in `training_df`.

In [11]:
distance(x_0,training_df[0:0+1])

0    66.005575
dtype: float64

The distance from `x_0` to the first row of `training_df` is about 66.0056. Is this the closest point in `training_df` to `x_0`? Let us find out!

In [12]:
nrows= training_df.shape[0]
distances = np.zeros(nrows)
for row in range(nrows):
    distances[row] = distance(x_0, training_df.iloc[row]).iloc[0]  # extract the scalar from the one-element Series
minidx = distances.argmin()
dist_df = pd.DataFrame({'distance':distances,'label':training_df['label']})
print(dist_df)    
print(f'\nx_0 is assigned label \'{training_df.iloc[minidx].label}\' from its nearest neighbor\n\n{training_df[minidx:minidx+1]}')

    distance label
0  66.005575     X
1  92.004469     Y
2  98.001385     X
3  17.019354     Y
4  83.008535     X
5  57.006367     Y
6  86.001524     Y
7  97.005874     Y
8  96.006014     X
9  47.005491     X

x_0 is assigned label 'Y' from its nearest neighbor

    A         B label
3  18  0.438572     Y
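
The row-by-row loop above can also be written without explicit iteration. The following is a minimal sketch of that idea on made-up data: `toy_df`, `target`, and every value in them are hypothetical stand-ins, not the notebook's `training_df` or `x_0`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for training_df; the values are made up.
toy_df = pd.DataFrame({'A': [4.0, 1.0, 7.0],
                       'B': [5.25, 1.25, 9.25],
                       'label': ['X', 'Y', 'X']})
target = {'A': 1.0, 'B': 1.25}  # hypothetical object to classify

# Euclidean distance from the target to every row at once
diffs = toy_df[['A', 'B']].to_numpy() - np.array([target['A'], target['B']])
toy_distances = np.sqrt((diffs ** 2).sum(axis=1))

# Position of the smallest distance and the label found there
nearest_pos = int(toy_distances.argmin())
nearest_label = toy_df['label'].iloc[nearest_pos]
print(toy_distances, nearest_label)
```

Subtracting a one-dimensional array from a two-dimensional one relies on NumPy broadcasting, so the target is compared against each row in a single step.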


## Exercise 1

To see how well $k=1$ performs, we assign each member of `validation_df` the label of its nearest object in `training_df`.

1.   Write code to determine *predicted* labels for all objects in `validation_df` using each object's nearest neighbor in `training_df`.

2.   Compare the predicted labels with the original labels. How many objects are misclassified?
3.   Compute the *error rate* for $k=1$ in `validation_df` using data from `training_df`.
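
As a reminder of what the error rate measures, here is a small sketch, assuming it is defined as the fraction of objects whose predicted label differs from the original one. The lists `actual` and `predicted` are hypothetical and made up for illustration; they are not results from `validation_df`.

```python
# Hypothetical labels, made up for illustration only
actual = ['X', 'Y', 'Y', 'X', 'Y']
predicted = ['X', 'X', 'Y', 'X', 'X']

# Count the positions where the prediction disagrees with the original label
misclassified = sum(a != p for a, p in zip(actual, predicted))

# Error rate = misclassified objects / total objects
error_rate = misclassified / len(actual)
print(misclassified, error_rate)
```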




In [13]:
# Your code for exercise 1 goes here

## The $k=3$ and $k=5$ cases

Instead of looking at the nearest neighbor's label, we will consider the labels of the 3 nearest objects in `training_df` to `x_0`.

In [14]:
sorted_dist_df = dist_df.sort_values(by=['distance']).reset_index()
k = 3
sorted_dist_df.iloc[:k]

   index   distance label
0      3  17.019354     Y
1      9  47.005491     X
2      5  57.006367     Y


As you can see, two out of the three nearest neighbors of `x_0` have label `'Y'`. In this case, `x_0` gets assigned label `'Y'` also.

In [15]:
k = 5
sorted_dist_df.iloc[:k]

   index   distance label
0      3  17.019354     Y
1      9  47.005491     X
2      5  57.006367     Y
3      0  66.005575     X
4      4  83.008535     X


By inspection, `x_0` will be assigned label `'X'` for $k=5.$ Let us do this using code also.

In [16]:
from collections import Counter # check out the collections module here https://docs.python.org/3/library/collections.html
counts = Counter(sorted_dist_df.iloc[:k]['label'])
print("x_0 is assigned label", "'X'" if counts['Y'] < counts['X'] else "'Y'") 

x_0 is assigned label 'X'
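
The comparison above names the two labels `'X'` and `'Y'` explicitly. A possible generalization, sketched here under the assumption that the labels are listed from nearest to farthest, lets `Counter.most_common` pick the winner for any set of labels. The helper name `majority_label` is hypothetical.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label in `labels`.

    On a tie, Counter.most_common (Python 3.7+) keeps first-seen order,
    so the tied label that appears earliest in the list wins.
    """
    return Counter(labels).most_common(1)[0][0]

# Labels of the five nearest neighbors, ordered by distance
print(majority_label(['Y', 'X', 'Y', 'X', 'X']))

# An even k can tie; the label seen first (the nearer neighbor) is returned
print(majority_label(['Y', 'X']))
```

Ties cannot occur with two labels and an odd $k$, which is one reason odd values of $k$ are convenient in the two-class case.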


## Exercise 2

Repeat all tasks in Exercise 1 for $k = 3,5,7,9.$

In [17]:
# Your code for exercise 2 goes here

## Exercise 3

Consider, yet again, the [county](https://www.rdocumentation.org/packages/openintro/versions/1.7.1/topics/countyComplete) dataset.

In [18]:
from sklearn.model_selection import train_test_split
df = pd.read_csv('https://raw.githubusercontent.com/cpaniaguam/CSC104/main/county_complete.csv')
df=df[['state','name','pop2017','poverty_2017','homeownership_2010','median_household_income_2017','metro_2013']]
counties_sample = df.sample(n=1000, random_state=34) # choose a random sample of 1000 counties
train, test = train_test_split(counties_sample, test_size = 0.3,random_state = 1)
newport = df[df['name']=='Newport County']
newport


             state            name  pop2017  poverty_2017  homeownership_2010  median_household_income_2017  metro_2013
2313  Rhode Island  Newport County  83460.0           9.0                63.6                       75463.0         1.0



Here are the three steps of the $k$NN classification algorithm:


1.   Find the distance between the observation to be classified and every observation in the training data.

2.   Select the $k$ nearest observations.
3.   Classify the observation according to the majority vote of its $k$ nearest neighbors.
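
The three steps can be sketched as a single function. This is only an illustration on made-up two-dimensional points, not a solution for the county data: the name `knn_predict` and the arrays `toy_X` and `toy_y` are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)

    # Step 1: Euclidean distance from the query to every training point
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))

    # Step 2: positions of the k smallest distances, nearest first
    nearest = np.argsort(dists, kind='stable')[:k]

    # Step 3: majority vote among the labels of those neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: three points near the origin, two far away
toy_X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]]
toy_y = ['a', 'a', 'a', 'b', 'b']

print(knn_predict(toy_X, toy_y, [0.2, 0.1], k=3))
print(knn_predict(toy_X, toy_y, [5.5, 5.0], k=1))
```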

### To do

1. For the `train` and `test` dataframes above, find the value of $k$ (for $k= 1,\ldots,20$) at which the $k$NN classifier has the lowest error rate.
2. Predict the `metro_2013` class for `newport` using the best $k$ for this data. Is the prediction correct?



In [19]:
# Your code for exercise 3 goes here