## Model definition:
- Find the k training points nearest to the input, measured by the Euclidean distance:
$${\left\| {u - v} \right\|_2} = {\left( {\sum\limits_{i = 1}^d {{{\left( {{u_i} - {v_i}} \right)}^2}} } \right)^{\frac{1}{2}}}$$
- The output is the label that occurs most frequently among those k nearest neighbors.
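
As a quick check of the distance formula, here is a minimal sketch in NumPy. The helper name `euclidean_distance` is only illustrative and is not part of the notebook's code.

```python
import numpy as np


def euclidean_distance(u, v):
    """Euclidean (L2) distance between two vectors of equal length."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # square the coordinate differences, sum them, then take the square root
    return float(np.sqrt(np.sum((u - v) ** 2)))


# A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```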

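## How to fit it:
There is no training step. Fitting only stores the training features X and labels y; all computation is deferred to prediction time.
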
## How to use it:
1. Separate the training data into 2 parts:
    - X: the matrix of all features
    - y: the labels
2. Use the Euclidean distance formula to calculate the distance between the input and every point of X
3. Find the k points nearest to the input and collect their labels in a list
4. Return the label with the highest frequency in that list.
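
The four steps can be traced on a tiny made-up dataset. This is only a sketch: the array `data`, the query point, and the function name `knn_predict` are illustrative assumptions, with the label placed in the first column.

```python
from collections import Counter

import numpy as np

# Hypothetical toy data: column 0 is the label, columns 1-2 are features.
data = np.array([
    [0, 0.0, 0.0],
    [0, 0.0, 1.0],
    [1, 5.0, 5.0],
    [1, 6.0, 5.0],
])

# Step 1: separate features and labels.
X = data[:, 1:]
y = data[:, 0].astype(int)


def knn_predict(X, y, x, k):
    # Step 2: Euclidean distance from x to every row of X.
    distances = np.sqrt(np.sum((X - x) ** 2, axis=1))
    # Step 3: indices of the k smallest distances, then their labels.
    nearest = np.argsort(distances, kind="stable")[:k]
    neighbor_labels = y[nearest].tolist()
    # Step 4: the most frequent label among the k neighbors.
    return Counter(neighbor_labels).most_common(1)[0][0]


# The 3 nearest neighbors of (1, 1) carry labels 0, 0 and 1.
print(knn_predict(X, y, np.array([1.0, 1.0]), k=3))  # 0
```

With k = 3 the two points near the origin outvote the single farther neighbor, so the query is labeled 0.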

In [1]:
from collections import Counter

import numpy as np


class NearestNeighbor:
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        # nothing is learned here: fitting only stores the training data
        self.X = np.asarray(X)
        self.y = np.asarray(y)

    def predict(self, x):
        # squared Euclidean distance from x to every training point;
        # the square root is skipped because it does not change the ordering
        distances = np.sum((self.X - x) ** 2, axis=1)

        # indices of the k nearest points, nearest first
        nearest = np.argsort(distances, kind="stable")[:self.k]

        # return the majority vote of y1,...,yk
        # (on a tie, the label that appears first among the neighbors wins)
        votes = Counter(self.y[nearest].tolist())
        return votes.most_common(1)[0][0]

In [2]:
import pandas as pd

df_train = pd.read_csv("../data/digit-recognizer/data/train.csv")
df_test = pd.read_csv("../data/digit-recognizer/data/test.csv")

# column 0 of the training array holds the label; the rest are features
train = df_train.to_numpy()
X = train[:, 1:]
y = train[:, 0]
classifier = NearestNeighbor(15)
classifier.fit(X, y)

# write one prediction per test row; ImageId is 1-based
with open("../data/digit-recognizer/outputs/submission_knn.csv", "w") as f:
    f.write("ImageId,Label\n")
    for i, x in enumerate(df_test.to_numpy(), start=1):
        f.write(f"{i},{classifier.predict(x)}\n")