# My own K-Neighbors Regressor
For practice (and for fun), I will develop my own K-Nearest Neighbors regressor.
To make sure that my regressor works, I will compare the results to those obtained with the scikit-learn library.

We will work with a random dataset that contains 4 features and one target column, with values ranging from 0 to 1.

## Creating a random dataset

In [122]:
import numpy as np
import pandas as pd

data = np.random.rand(100,5)
data = pd.DataFrame(data)
data.columns = ['F1','F2','F3','F4','target']

data.head()

| | F1 | F2 | F3 | F4 | target |
|---|---|---|---|---|---|
| 0 | 0.076566 | 0.965948 | 0.115933 | 0.626517 | 0.488666 |
| 1 | 0.873861 | 0.334111 | 0.879368 | 0.529912 | 0.475038 |
| 2 | 0.707311 | 0.449446 | 0.263374 | 0.282958 | 0.469835 |
| 3 | 0.988797 | 0.099247 | 0.949173 | 0.257056 | 0.219627 |
| 4 | 0.841992 | 0.247182 | 0.058081 | 0.649768 | 0.799055 |
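
Because the dataset above is drawn without a seed, the values change on every run. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator is acceptable in place of `np.random.rand` (the helper name `make_dataset` and the seed value are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd

def make_dataset(seed, n_rows=100):
    """Return a random DataFrame with 4 feature columns and one target column."""
    rng = np.random.default_rng(seed)  # seeded generator gives repeatable draws
    values = rng.random((n_rows, 5))   # uniform samples in [0, 1)
    return pd.DataFrame(values, columns=['F1', 'F2', 'F3', 'F4', 'target'])

data_seeded = make_dataset(seed=0)  # the seed value 0 is arbitrary
print(data_seeded.head())
```

Calling `make_dataset` twice with the same seed should return identical frames, which makes the RMSE values further down repeatable between runs. The numbers will differ from the sample shown above, since that sample came from an unseeded draw.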


## Defining our KNN regressor

In [132]:
def euclidean_distance(predict_row,train_row):
    distance = 0
    for i,j in zip(predict_row,train_row):
        distance += (i-j)**2
    return distance**(1/2)

def train_test(data,features,target,k=5):
    train = data.sample(frac=0.5,random_state=1).copy()
    test = data[~data.index.isin(train.index)].copy()
    predictions = list()
    for _, tested_row in test[features].iterrows():  # iterrows() yields (index, row) tuples
        # Euclidean distance from this test row to every training row
        distances = list()
        for _, train_row in train[features].iterrows():
            distances.append(euclidean_distance(tested_row, train_row))
        train['distance'] = distances
        # sort_values is ascending by default, so the k nearest rows come first
        train_sorted = train.sort_values(by='distance')
        prediction = train_sorted[target].iloc[:k].mean()
        predictions.append(prediction)
    test['prediction'] = predictions
    return test

def mean_squared_err(test_data,target,prediction):
    squared_error = (test_data[target] - test_data[prediction])**2
    return squared_error.mean()

def rmse(test_data,target,prediction):
    return mean_squared_err(test_data,target,prediction)**(1/2)
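
The nested `iterrows()` loops above are easy to follow but slow on larger data. As a rough sketch of the same idea (predict each test row as the mean target of its k nearest training rows), the distances can be computed in one step with NumPy broadcasting. The function name `knn_predict` and the toy arrays are hypothetical and only for illustration; this is not a drop-in replacement for `train_test`:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict each test row as the mean target of its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    # Pairwise differences via broadcasting: shape (n_test, n_train, n_features)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    # Euclidean distance between every test row and every training row
    distances = np.sqrt((diffs ** 2).sum(axis=2))  # shape (n_test, n_train)
    # Column indices of the k smallest distances in each row
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]
    return y_train[nearest].mean(axis=1)

# Tiny hand-checkable example: the three closest points have targets 1, 2 and 3
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0])
X_test = np.array([[0.1, 0.1]])
pred = knn_predict(X_train, y_train, X_test, k=3)
print(pred)
```

With the notebook's data, a call along the lines of `knn_predict(train[features], train['target'], test[features], k=5)` should work the same way. Note that the intermediate `diffs` array holds `n_test * n_train * n_features` values, so this brute-force approach trades memory for speed.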

In [141]:
features = ['F1','F2','F3','F4']

rmses = list()
for n_neighbors in range(2,21):
    df_predictions = train_test(data,features,'target',k=n_neighbors)
    rmses.append(rmse(df_predictions,'target','prediction'))

rmses_df = pd.DataFrame({'n_neighbors':np.arange(2,21),'RMSE_own':rmses})
rmses_df = rmses_df.set_index('n_neighbors')

## KNN using the sklearn library

In [143]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Same split as in train_test (random_state=1), so both regressors see the same rows
train = data.sample(frac=0.5,random_state=1)
test = data[~data.index.isin(train.index)]
rmses = list()
for n_neighbors in range(2,21):
    knn = KNeighborsRegressor(n_neighbors=n_neighbors)
    knn.fit(train[features],train['target'])
    predictions = knn.predict(test[features])
    rmses.append(mean_squared_error(test['target'],predictions)**(1/2))
rmses_df['RMSE_sklearn'] = rmses

## Comparison of results
The RMSE values below are identical to six decimal places for every k shown, so the regressor works!

In [144]:
rmses_df

| n_neighbors | RMSE_own | RMSE_sklearn |
|---|---|---|
| 2 | 0.339207 | 0.339207 |
| 3 | 0.328827 | 0.328827 |
| 4 | 0.316849 | 0.316849 |
| 5 | 0.303201 | 0.303201 |
| 6 | 0.309328 | 0.309328 |
| 7 | 0.310319 | 0.310319 |
| 8 | 0.307591 | 0.307591 |
| 9 | 0.308575 | 0.308575 |
| 10 | 0.303799 | 0.303799 |
| 11 | 0.308281 | 0.308281 |
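
Rather than comparing the two columns by eye, the match can be checked programmatically. A small sketch, using the rounded values copied from the table above (the variable names `shown`, `max_abs_diff` and `columns_match` are hypothetical):

```python
import numpy as np
import pandas as pd

# RMSE values copied from the table above (k = 2 to 11)
rmse_values = [0.339207, 0.328827, 0.316849, 0.303201, 0.309328,
               0.310319, 0.307591, 0.308575, 0.303799, 0.308281]
shown = pd.DataFrame(
    {'RMSE_own': rmse_values, 'RMSE_sklearn': rmse_values},
    index=pd.Index(range(2, 12), name='n_neighbors'),
)

# Largest absolute gap between the two implementations
max_abs_diff = (shown['RMSE_own'] - shown['RMSE_sklearn']).abs().max()
# True when every pair of values agrees within floating-point tolerance
columns_match = np.allclose(shown['RMSE_own'], shown['RMSE_sklearn'])
print(max_abs_diff, columns_match)
```

In the notebook itself, the same check can be run directly on the full frame with `np.allclose(rmses_df['RMSE_own'], rmses_df['RMSE_sklearn'])`. Using `np.allclose` rather than `==` tolerates the tiny floating-point differences that can arise when two implementations sum in a different order.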
