# Nearest Neighbors Problem Set

In [13]:
# -- imports --
import numpy as np
import pandas as po
import matplotlib.pyplot as plt

# -- kNN --
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

## Problem 1

Consider the following simple data-set:

<img src="https://github.com/BeaverWorksMedlytics2020/Data_Public/raw/master/Images/Week1/knn_notebook_example_table.png" alt="Example Table" width="600">

Now consider the Sample:
    $$X= 4, Y = 4, Z = 2$$

Using kNN, what is the class for this sample for $k = 1$ and $k = 3?$ Use the Eucledian metric.

$k = 1$: class 1
$k = 3$: class 2

## Problem 2
Earlier in the tutorial we were told that kNN depends on several factors, one of them being $k$. Consider the following datasets below, find the optimal value of $k$ that gives the highest accuracy. Visualize your data! Can you come up with some rule for getting a good idea of what $k$ is? 

HINT: look for a pattern/bound! Answer should be in terms of the size of the dataset $n$. 

In [14]:
# Sovle this problem for each of these datasets
from sklearn.datasets import load_iris 
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_wine 

# Load those datasets into some easily accessible variables
#The datasets are already normalized, so that saves us some steps!
iris = load_iris()                    #iris dataset: size = 150
breast_cancer = load_breast_cancer()  #diabetes dataset: size = 569
wine = load_wine()                    #wine dataset: size 178

# This function will perfom KNN classification for a specified k
def split_train_test_dataset(dataset, k, test_size=0.2):
    """Loads and performs KNN classification on the provided dataset"""
    # Grab and split the dataset
    X_train, X_val, y_train, y_val = train_test_split(
        dataset.data, dataset.target, test_size=test_size, random_state=0)

    # Build a KNN classifier, fit it and test its predictions
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print("Validation Accuracy is {:5.1%}".format(
        accuracy_score(y_val, knn.predict(X_val))))


In [15]:
data = [iris, breast_cancer, wine]

for i in data:
  for j in range(1, 20, 2):
    split_train_test_dataset(i, j)

Validation Accuracy is 100.0%
Validation Accuracy is 96.7%
Validation Accuracy is 96.7%
Validation Accuracy is 100.0%
Validation Accuracy is 100.0%
Validation Accuracy is 100.0%
Validation Accuracy is 100.0%
Validation Accuracy is 100.0%
Validation Accuracy is 100.0%
Validation Accuracy is 100.0%
Validation Accuracy is 91.2%
Validation Accuracy is 91.2%
Validation Accuracy is 93.9%
Validation Accuracy is 94.7%
Validation Accuracy is 96.5%
Validation Accuracy is 96.5%
Validation Accuracy is 96.5%
Validation Accuracy is 96.5%
Validation Accuracy is 96.5%
Validation Accuracy is 96.5%
Validation Accuracy is 77.8%
Validation Accuracy is 77.8%
Validation Accuracy is 80.6%
Validation Accuracy is 77.8%
Validation Accuracy is 75.0%
Validation Accuracy is 72.2%
Validation Accuracy is 75.0%
Validation Accuracy is 72.2%
Validation Accuracy is 77.8%
Validation Accuracy is 77.8%


Write a single mathematical expression describing the relationship you found between $n$ (the size of the dataset) and $k$ (the number of datapoints used to classify each validation datum).

(YOUR ANSWER HERE)

## Problem 3
Now, we will **be writing our k-NNA**. Recall that we said a kNN is comprised of a predictions and using those predictions to classify the data. Here we will try to mimic sklearn's kNN methods. We will be using the Pima diabetes dataset. 

### Loading and splitting data

In [16]:
# -- loading dataset -- #
url = "https://github.com/BeaverWorksMedlytics2020/Data_Public/raw/master/NotebookExampleData/Week1/diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = po.read_csv(url, names=names)

# -- dropping NaN rows -- #
invalid = ['plas', 'pres', 'skin', 'test', 'mass']

for i in invalid:
    data[i].replace(to_replace=0, value=np.nan, inplace=True)
    
data = data.dropna(axis=0).reset_index(drop=True)
data

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
1,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
2,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
3,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
4,1,189.0,60.0,23.0,846.0,30.1,0.398,59,1
...,...,...,...,...,...,...,...,...,...
387,0,181.0,88.0,44.0,510.0,43.3,0.222,26,1
388,1,128.0,88.0,39.0,110.0,36.5,1.057,37,1
389,2,88.0,58.0,26.0,16.0,28.4,0.766,22,0
390,10,101.0,76.0,48.0,180.0,32.9,0.171,63,0


Now, let's clearly define which columns will act as explanatory variables, and which column will be the target value, and split the dataset between your training data and testing data. Let's try an 80-20 split and use sklearn's [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) method (set random_state = 0 so we get the same output each time).

In [33]:
from sklearn.model_selection import train_test_split

# columns we will use to make predictions with (features!) feel free to play around with these
X_cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# column that we want to predict
y_col = 'class'


# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data[X_cols], data[y_col], test_size=0.2, random_state=0)

# further split X and y of training into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

print('There are {} training samples with {} features and {} associated classification labels'.format(*X_train.shape, *y_train.shape))
print('There are {} validation samples with {} features and {} associated classification labels'.format(*X_val.shape, *y_val.shape))
print('There are {} test samples with {} features and {} associated classification labels'.format(*X_test.shape, *y_test.shape))

There are 250 training samples with 8 features and 250 associated classification labels
There are 63 validation samples with 8 features and 63 associated classification labels
There are 79 test samples with 8 features and 79 associated classification labels
0


### Normalizing Data

Let's not forget to normalize the data! We'll use sklearn's StandardScaler normalization like we did before to normalize the training **and** validation/data.

In [25]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

for i in list(X_train):
    feature_data_train = X_train[i].values.reshape(-1, 1)
    scaler.fit(feature_data_train)
    X_train[i] = scaler.transform(feature_data_train)

for j in list(X_test):
    feature_data_test = X_test[j].values.reshape(-1, 1)
    scaler.fit(feature_data_test)
    X_test[j] = scaler.transform(feature_data_test)
    
for k in list(X_val):
    feature_data_val = X_val[k].values.reshape(-1, 1)
    scaler.fit(feature_data_val)
    X_val[k] = scaler.transform(feature_data_val)

         preg      plas      pres  ...      mass      pedi       age
173 -0.763913  0.494808 -2.166661  ... -0.657013  0.357848 -0.888644
289 -0.763913 -1.193513 -0.269712  ...  0.659616 -0.379316 -0.705039
115  1.130086  1.858451  1.109888  ...  0.408125  0.159717  1.773615
335  0.498753 -0.024675  0.247638  ...  0.141841 -0.763924 -0.337831
181  0.498753 -0.803900 -1.476861  ...  0.127047 -0.093774 -0.154227
..        ...       ...       ...  ...       ...       ...       ...
257  1.130086 -0.966239 -0.614612  ...  0.023492  0.602599  0.855595
11   2.077085  0.040260 -0.097262  ... -0.301967 -0.950400  0.855595
249 -0.763913  0.494808 -0.787061  ...  1.118217  0.014032 -0.980446
101 -1.079579  0.527275 -0.528387  ...  1.399295 -0.291906 -0.705039
248 -1.079579  0.364937 -0.269712  ...  1.354915 -0.484209 -0.705039

[250 rows x 8 columns]
         preg      plas      pres  ...      mass      pedi       age
144 -0.371501  1.069242  0.053651  ... -0.694745 -0.511444 -0.118075
280  1.274

### Writing our kNN

Now for the fun part! Fill in the 3 following methods, euclidean_dist(), predict(), and knn().

The predict method that we'll make below needs to: 
1. Compute the euclidean distance between the “new” observation and all the data points in the training set. 
2. Assign the corresponding label to the observation
3. Select the k nearest ones and perform a "majority vote"

In [37]:
# Euclidean distance function from tutorial
def euclidean_dist(datum1, datum2):
    inner_val = 0.0
    
    for g in range(datum1.shape[0]):
        inner_val += (datum1[g]- datum2[g]) ** 2
    
    distance = np.sqrt(inner_val)
    return(distance)

In [47]:
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def predict(x_training, y_training, x_test_sample, k):
    
    # create list for distances and targets
    distances = []
    targets = []

    # Needs work
    for i in x_training:
      distances.append(euclidean_dist(x_test_sample, i))
    
    for i in y_training:
      targets.append(y_training)
    
    print(distances)
    return 0
    


In [48]:
def knn(x_training, y_training, x_testing, k):
    predictions = []
    
    for i in x_testing:
      predictions.append(predict(x_training, y_training, i, k))
    
    return predictions

When done, test your code by running the methods here!

In [49]:
from sklearn.metrics import accuracy_score
import time

start = time.time()
predictions_slow = knn(X_train, y_train, X_val, k=5)

print('Took {} seconds'.format(time.time() - start))
print("Validation Accuracy is ", accuracy_score(y_val,predictions_slow)*100)

AttributeError: ignored

Check sklearn's predictions on validation data from the tutorial notebook and make sure they match yours. Sklearn is faster, but you should get the same answers.