## $k$-nearest neighbors
We would like to assign a label to an object $x_0$ based on the labels of the objects $x_1,x_2,\ldots, x_n.$ One of the simplest ways to do this is to consider the labels of the $k$ objects nearest to $x_0$. Each of these $k$ neighbors votes for $x_0$ to be assigned its own label, and $x_0$ receives the label with the most votes.

Here is a very simple example.

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
np.random.seed(123)
training_df = pd.DataFrame({'A':np.random.randint(1,100,10),
                     'B':np.random.random(10),
                     'label':np.random.choice(['X','Y'],10)})
# we create a validation set for later
validation_df = pd.DataFrame({'A':np.random.randint(1,10,5),
                     'B':np.random.random(5),
                     'label':np.random.choice(['X','Y'],5)})

In [2]:
training_df.shape

(10, 3)

Let us classify object `x_0` with values `{'A':1,'B':1.25}`. We will assign `x_0` the label of *the nearest* object ($k=1$) in `training_df`. First, let us define the Euclidean distance function we are going to use.

In [2]:
x_0 = pd.DataFrame({'A':1,'B':1.25},index=[0]) # target object to label
def distance(x_0, row):
    """Euclidean distance between x_0 and row, using columns A and B."""
    return ((x_0.A-row["A"])**2+(x_0.B-row["B"])**2)**.5
    # equivalent, with attribute access: ((x_0.A-row.A)**2+(x_0.B-row.B)**2)**.5
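
For reference, the quantity this function computes can be written out explicitly. The symbols $a_0, b_0$ (the `A` and `B` values of `x_0`) and $a_i, b_i$ (those of row $i$) are introduced here only for illustration:

$$
d(x_0, x_i) = \sqrt{(a_0 - a_i)^2 + (b_0 - b_i)^2}
$$
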

Now we test this function by computing the distance between `x_0` and the first observation in `training_df`.

In [3]:
distance(x_0,training_df[0:1])

0    66.005575
dtype: float64

The distance from `x_0` to the first row of `training_df` is about 66.0056. Is this the closest point in `training_df` to `x_0`? Let us find out!

In [4]:
nrows= training_df.shape[0]
distances = np.zeros(nrows)
for row in range(nrows):
    distances[row] = distance(x_0,training_df.iloc[row])
minidx = distances.argmin()
dist_df = pd.DataFrame({'distance':distances,'label':training_df['label']})
print(dist_df)    
print(f'\nx_0 is assigned label \'{training_df.iloc[minidx].label}\' from its nearest neighbor\n\n{training_df[minidx:minidx+1]}')

    distance label
0  66.005575     X
1  92.004469     Y
2  98.001385     X
3  17.019354     Y
4  83.008535     X
5  57.006367     Y
6  86.001524     Y
7  97.005874     Y
8  96.006014     X
9  47.005491     X

x_0 is assigned label 'Y' from its nearest neighbor

    A         B label
3  18  0.438572     Y
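
The row-by-row loop above can also be written without an explicit loop. The following is a minimal sketch using NumPy broadcasting, not the notebook's own method. The first row of `train_points` reuses the values of row 3 shown above; the other two rows and the array names are hypothetical, chosen only to make the example self-contained.

```python
import numpy as np

# Toy data: each row is an object with features (A, B).
# Row 0 copies row 3 of training_df above; rows 1 and 2 are made up.
train_points = np.array([[18.0, 0.438572],
                         [48.0, 0.50],
                         [67.0, 0.40]])
train_labels = np.array(['Y', 'X', 'X'])
query = np.array([1.0, 1.25])  # same values as x_0

# Euclidean distance from the query to every training row at once.
distances = np.sqrt(((train_points - query) ** 2).sum(axis=1))

nearest = distances.argmin()           # position of the closest row
predicted_label = train_labels[nearest]
print(distances.round(6), predicted_label)
```

The first distance agrees with the value reported for row 3 in the table above, which is a quick check that the vectorized expression matches `distance`.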


## Exercise 1

To see how well $k=1$ performs, we assign to each member of `validation_df` the label of its nearest object in `training_df`.

1. Write code to determine *predicted* labels for all objects in `validation_df` using each object's nearest neighbor in `training_df`.
2. Compare the predicted labels with the original labels. How many objects are misclassified?
3. Compute the *error rate* for $k=1$ in `validation_df` using data from `training_df`.




In [5]:
pairs = {} # maps each validation row position to its predicted label
for i in range(validation_df.shape[0]):
    # distances from validation object i to every object in training_df
    dists = [float(distance(validation_df.iloc[i], training_df.iloc[row]))
             for row in range(training_df.shape[0])]
    # label of the nearest training object (a plain list avoids DataFrame.append, removed in pandas 2.0)
    pairs[str(i)] = training_df.iloc[int(np.argmin(dists))].label

In [6]:
total = 0 # number of correctly classified objects
for x, y in zip(pairs.values(), validation_df['label']):
    if x == y:
        total += 1

print("{} of the objects are classified correctly. Thus, {} are misclassified.".format(total, validation_df.shape[0] - total))

1 of the objects are classified correctly. Thus, 4 are misclassified.


In [7]:
errors = validation_df.shape[0] - total # misclassified objects; total counts the correct ones
print("The error rate is {} = {}%".format(errors/validation_df.shape[0], errors/validation_df.shape[0]*100))

The error rate is 0.8 = 80.0%
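
Since 4 of the 5 validation objects are misclassified, the error rate is the share of mismatches, $4/5$. Below is a small sketch of a helper that computes this share directly from two label sequences. The function name `error_rate` and the example labels are hypothetical; the labels are chosen only to reproduce 1 match and 4 mismatches.

```python
def error_rate(predicted, actual):
    """Fraction of positions where predicted and actual labels differ."""
    predicted, actual = list(predicted), list(actual)
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

# Made-up labels with 1 match and 4 mismatches.
predicted = ['X', 'Y', 'Y', 'X', 'Y']
actual    = ['X', 'X', 'X', 'Y', 'X']
rate = error_rate(predicted, actual)
print(f"error rate = {rate}")
```

With the notebook's variables, the same helper could be called as `error_rate(pairs.values(), validation_df['label'])`.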


## The $k=3$ and $k=5$ cases

Instead of looking at the nearest neighbor's label, we will consider the labels of the 3 nearest objects in `training_df` to `x_0`.

In [8]:
sorted_dist_df = dist_df.sort_values(by=['distance']).reset_index()
k = 3
sorted_dist_df.iloc[:k]

   index   distance label
0      3  17.019354     Y
1      9  47.005491     X
2      5  57.006367     Y


As you can see, two out of the three nearest neighbors of `x_0` have label `'Y'`. In this case, `x_0` gets assigned label `'Y'` also.

In [9]:
k = 5
sorted_dist_df.iloc[:k]

   index   distance label
0      3  17.019354     Y
1      9  47.005491     X
2      5  57.006367     Y
3      0  66.005575     X
4      4  83.008535     X


By inspection, `x_0` will be assigned label `'X'` for $k=5.$ Let us confirm this with code.

In [10]:
from collections import Counter # check out the collections module here https://docs.python.org/3/library/collections.html
counts = Counter(sorted_dist_df.iloc[:k]['label'])
print(counts)
print("x_0 is assigned label", "'X'" if counts['Y'] < counts['X'] else "'Y'") 

Counter({'X': 3, 'Y': 2})
x_0 is assigned label 'X'
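
The comparison `counts['Y'] < counts['X']` works because there are only two labels. As a sketch of a vote that handles any number of labels, `Counter.most_common` can pick the winner. The list `neighbor_labels` below is typed in by hand from the five nearest neighbors shown above; the variable names are illustrative.

```python
from collections import Counter

# Labels of the 5 nearest neighbors of x_0, nearest first.
neighbor_labels = ['Y', 'X', 'Y', 'X', 'X']
counts = Counter(neighbor_labels)

# most_common(1) returns a one-element list holding the (label, count) pair with the highest count.
winner, votes = counts.most_common(1)[0]
print(f"x_0 is assigned label '{winner}' with {votes} of {len(neighbor_labels)} votes")
```

When two labels have the same count, `most_common` lists them in the order they were first encountered, so a tie would go to whichever tied label appears first in `neighbor_labels`.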


## Exercise 2

Repeat all tasks in Exercise 1 for $k = 3,5,7,9.$

In [11]:
for k in range(3, 10, 2):
    counts = Counter(sorted_dist_df.iloc[:k]['label'])
    print(f"For k = {k}, x_0 is assigned label", "'X'" if counts['Y'] < counts['X'] else "'Y'") 

For k = 3, x_0 is assigned label 'Y'
For k = 5, x_0 is assigned label 'X'
For k = 7, x_0 is assigned label 'Y'
For k = 9, x_0 is assigned label 'Y'


## Exercise 3

Consider, yet again, the [county](https://www.rdocumentation.org/packages/openintro/versions/1.7.1/topics/countyComplete) dataset.

In [12]:
from sklearn.model_selection import train_test_split
df = pd.read_csv('https://raw.githubusercontent.com/cpaniaguam/CSC104/main/county_complete.csv')
df=df[['state','name','pop2017','poverty_2017','homeownership_2010','median_household_income_2017','metro_2013']]
counties_sample = df.sample(n=1000, random_state=34) # choose a random sample of 1000 counties
train, test = train_test_split(counties_sample, test_size = 0.3,random_state = 1)
newport = df[df['name']=='Newport County']
newport

             state            name  pop2017  poverty_2017  homeownership_2010  median_household_income_2017  metro_2013
2313  Rhode Island  Newport County  83460.0           9.0                63.6                       75463.0         1.0



Here are the three steps of the $k$NN classification algorithm:


1. Find the distance between the observation to be classified and every other observation.
2. Select the $k$ nearest observations.
3. Classify the observation according to the majority vote of its $k$ nearest neighbors.
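
The three steps can be sketched as a single function. This is a minimal NumPy illustration under stated assumptions, not the notebook's implementation: the function name `knn_classify`, its parameters, and the one-feature toy data are all hypothetical.

```python
import numpy as np
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training rows."""
    train_points = np.asarray(train_points, dtype=float)
    query = np.asarray(query, dtype=float)
    # Step 1: Euclidean distance from the query to every training row.
    distances = np.sqrt(((train_points - query) ** 2).sum(axis=1))
    # Step 2: positions of the k smallest distances (stable sort keeps row order on ties).
    nearest = np.argsort(distances, kind="stable")[:k]
    # Step 3: majority vote among the labels of those rows.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up data: three points near 0 labeled 0, three points near 10 labeled 1.
points = [[0.0], [1.0], [2.0], [9.0], [10.0], [11.0]]
labels = [0, 0, 0, 1, 1, 1]
print(knn_classify(points, labels, [1.5], k=3))  # neighbors are 1.0, 2.0 and 0.0
print(knn_classify(points, labels, [8.0], k=3))  # neighbors are 9.0, 10.0 and 11.0
```
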

### To do

1. For the `train` and `test` dataframes above, find the $k$ (among $k= 1,\ldots,20$) for which the $k$NN classifier has the lowest error rate.
2. Predict the `metro_2013` class for `newport` using the best $k$ for this data. Is the prediction correct?
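
One way to approach the first task is to classify every row of the test set for each $k$ and record the share of wrong predictions. The sketch below shows that procedure on synthetic two-class data, since it is meant to run on its own. The data, the 140/60 split, and all variable names are assumptions made for illustration and are unrelated to the county data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a train/test split: two overlapping clusters with labels 0 and 1.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)
order = rng.permutation(len(y))
train_idx, test_idx = order[:140], order[140:]  # 70/30 split
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Distance matrix: entry (i, j) is the distance from test row i to training row j.
diff = X_test[:, None, :] - X_train[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))
neighbor_order = np.argsort(dist, axis=1, kind="stable")  # nearest training rows first

error_rates = {}
for k in range(1, 21):
    neighbor_labels = y_train[neighbor_order[:, :k]]  # shape (n_test, k)
    # Majority vote for 0/1 labels: predict 1 when more than half of the votes are 1.
    predictions = (neighbor_labels.mean(axis=1) > 0.5).astype(int)
    error_rates[k] = float((predictions != y_test).mean())

best_k = min(error_rates, key=error_rates.get)  # smallest k among ties
print(best_k, error_rates[best_k])
```

For even $k$ a vote can split evenly; this sketch then predicts label 0, which is an arbitrary choice. With the county data, the same loop would presumably run on the numeric feature columns of `train` and `test` after handling rows with missing values.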



In [13]:
def n_distance(x_0, row):
    """Euclidean distance between x_0 and row over the four numeric county features."""
    return ((x_0.pop2017-row["pop2017"])**2+(x_0.poverty_2017-row["poverty_2017"])**2
            +(x_0.homeownership_2010-row["homeownership_2010"])**2+
            (x_0.median_household_income_2017-row["median_household_income_2017"])**2)**.5

In [14]:
temp_train = train.copy() # work on a copy so that train itself is not modified
indx_total = temp_train.shape[0] 

# add a new column 'distance' holding the distance from newport to each row, using columns:
# population 2017, poverty 2017, home ownership 2010, median household income 2017
temp_train['distance'] = [float(n_distance(newport.iloc[0], temp_train.iloc[i])) for i in range(indx_total)] 

In [15]:
sorted_temp_train = temp_train.sort_values(by=['distance']).reset_index() # sort by distance, ascending order; rows with a missing distance go last
sorted_temp_train

     index          state              name    pop2017  poverty_2017  homeownership_2010  median_household_income_2017  metro_2013      distance
  0    364        Florida     Nassau County    82721.0          11.4                79.4                       64294.0         1.0  1.119343e+04
  1   2967     Washington     Island County    83159.0           9.5                73.9                       61516.0         0.0  1.395025e+04
  2   1778     New Jersey   Cape May County    93553.0          10.6                74.3                       62332.0         1.0  1.656176e+04
  3   1766  New Hampshire   Cheshire County    75960.0          10.2                71.5                       60148.0         0.0  1.705284e+04
  4    718        Indiana      Floyd County    77071.0          10.8                73.1                       59451.0         1.0  1.723959e+04
...    ...            ...               ...        ...           ...                 ...                           ...         ...           ...
695   1830       New York      Bronx County  1471160.0          29.7                20.7                       36593.0         1.0  1.388244e+06
696   1225  Massachusetts  Middlesex County  1602947.0           8.2                63.9                       92878.0         1.0  1.519587e+06
697   1312       Michigan      Wayne County  1753616.0          23.7                67.2                       43702.0         1.0  1.670458e+06
698   2579          Texas     Dallas County  2618148.0          17.7                54.7                       53626.0         1.0  2.534782e+06


In [17]:
sorted_temp_train.tail()

     index          state                  name    pop2017  poverty_2017  homeownership_2010  median_household_income_2017  metro_2013   distance
695   1830       New York          Bronx County  1471160.0          29.7                20.7                       36593.0         1.0  1388244.0
696   1225  Massachusetts      Middlesex County  1602947.0           8.2                63.9                       92878.0         1.0  1519587.0
697   1312       Michigan          Wayne County  1753616.0          23.7                67.2                       43702.0         1.0  1670458.0
698   2579          Texas         Dallas County  2618148.0          17.7                54.7                       53626.0         1.0  2534782.0
699   2417   South Dakota  Oglala Lakota County    14354.0           NaN                51.3                           NaN         NaN        NaN


In [None]:
k_max = int(train.shape[0]**0.5) # zybook: good practice is to keep k at or below sqrt(training set size), so we will not go above this value
for k in range(1, k_max, 2): # odd k only, so a two-class vote cannot tie
    counts = Counter(sorted_temp_train.iloc[:k]['metro_2013'])
    print(f"For k = {k}, metro_2013 is assigned label", "'1'" if counts[0] < counts[1] else "'0'") 
    # the rate below is the share of the k neighbors labeled 0, i.e. neighbors that disagree with newport's true label (1)
    try:
        print(f"Error rate for {k}: {counts[0]/(counts[0]+counts[1])}\n")
    except ZeroDivisionError:
        print(f"Error rate for {k}: 0\n")

In [None]:
# k = 1 had the lowest error rate. However, that was on the training set. Let us apply the same procedure to the validation set.

temp_test = test.copy() # work on a copy so that test itself is not modified
indx_total = temp_test.shape[0]
temp_test['distance'] = [float(n_distance(newport.iloc[0], temp_test.iloc[i])) for i in range(indx_total)]

sorted_temp_test = temp_test.sort_values(by=['distance']).reset_index()

k_max = int(test.shape[0]**0.5)
for k in range(1, k_max, 2):
    counts = Counter(sorted_temp_test.iloc[:k]['metro_2013'])
    print(f"For k = {k}, metro_2013 is assigned label", "'1'" if counts[0] < counts[1] else "'0'") 
    try:
        print(f"Error rate for {k}: {counts[0]/(counts[0]+counts[1])}\n")
    except ZeroDivisionError:
        print(f"Error rate for {k}: 0\n")

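$k = 1$ seems like the best $k$ for this data. It is odd, which prevents ties in a two-class vote, and it gave the lowest rate in the loops above on both the validation set and the training set. That rate is the share of Newport County's $k$ nearest neighbors labeled 0, so it reflects a single query point rather than an error rate over a whole set. If relying on a single neighbor seems too fragile, $k = 3$ would be a suitable substitute.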