#### PA 1 - kNN

Name: Viswesh Uppalapati 

PID: A15600068

#### Problem 1.1 Errors

| k | 1 | 5 | 9 | 15 |
| :- | :- | :- | :- | :- |
| Training Error | 0.0 | 0.055 | 0.072 | 0.0925 |
| Validation Error | 0.082 | 0.096 | 0.106 | 0.105 |

The lowest validation error occurs at k = 1, so k = 1 is the best predictor on the validation data. Its test error is:

test_error = 0.094

#### Problem 1.2 Errors

| k | 1 | 5 | 9 | 15 |
| :- | :- | :- | :- | :- |
| Training Error | 0.0 | 0.1935 | 0.226 | 0.2585 |
| Validation Error | 0.32 | 0.289 | 0.283 | 0.293 |

The lowest validation error on the projected data occurs at k = 9, so k = 9 is the best predictor there. Its test error on the projected test data is:

test_error = 0.296

Prediction accuracy drops substantially after the projection: at each k, the validation error on the projected data is roughly 0.18 to 0.24 higher than the corresponding validation error in part 1.1. However, the runtime is much lower on the projected data (see the "Runtime Comparison Original Data Vs. Projected Data" section of the code below): about 20 s versus 2 min 47 s for k = 1 on the training data, and 4.78 s versus 37.7 s for k = 9 on the validation data. The projection reduces the dimension of the data from 784 to 20, which makes the distance computations and matrix operations much cheaper and so produces predictions faster. The projection therefore trades accuracy for runtime.
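
To illustrate why the lower dimension makes each distance cheaper, here is a minimal sketch on synthetic data. The random matrices `X` and `P` and the helper `pairwise_sq_dists` are assumptions made for illustration only; they stand in for the assignment data and `projection.txt` and are not the code used for the results above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 50 points with 784 features each,
# and a random 784 x 20 matrix playing the role of the projection matrix.
X = rng.random((50, 784))
P = rng.standard_normal((784, 20))

# Each row now has 20 coordinates instead of 784.
X_proj = X @ P

def pairwise_sq_dists(A, B):
    """Squared Euclidean distance between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return (diff ** 2).sum(axis=2)

d_full = pairwise_sq_dists(X, X)
d_proj = pairwise_sq_dists(X_proj, X_proj)

# The number of distances is unchanged; only the work per distance shrinks.
print("feature counts:", X.shape[1], "->", X_proj.shape[1])
print("distance matrix shapes:", d_full.shape, d_proj.shape)
print("terms summed per distance, ratio:", X.shape[1] / X_proj.shape[1])
```

The number of pairwise distances stays the same in both cases. What changes is the number of coordinates summed per distance, which is one plausible reason for the runtime gap, alongside the smaller tables being copied and sorted.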


#### Imports and Data

In [46]:
import pandas as pd
import numpy as np
import os

from sklearn.metrics.pairwise import euclidean_distances

In [47]:
train_data = pd.read_csv('pa1train.txt', sep=" ", header=None)
val_data = pd.read_csv('pa1validate.txt', sep=" ", header=None)
test_data = pd.read_csv('pa1test.txt', sep = " ", header = None)

#### kNN Algorithm

In [48]:
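def kNN(k, train_data, test_data):
    # Column 784 holds the label; the remaining columns are features
    Y_train = train_data[784]
    X_train = train_data.drop(784, axis = 1)
    Y_test = test_data[784]
    X_test = test_data.drop(784, axis = 1)

    # Predict a label for every test point
    preds = []
    for i in range(X_test.shape[0]):
        preds.append(predict(k, X_train, Y_train, X_test.iloc[i]))
    # Error is the fraction of predictions that differ from the true labels
    error = np.mean(np.array(preds) != Y_test)

    return error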

In [49]:
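def predict(k, X_train, Y_train, test_pt):
    # Euclidean distance from the test point to every training point
    dists = euclidean_distances(X_train, np.array(test_pt).reshape(1, -1))
    temp = X_train.assign(**{'dist' : dists, 'labels' : Y_train})
    # Keep the k training points closest to the test point
    temp = temp.sort_values('dist', ascending = True).iloc[0:k]
    # Count label votes among the k neighbors, most common first
    common_labels = temp['labels'].value_counts().sort_values(ascending = False)
    # Break ties among the most common labels uniformly at random
    return common_labels[common_labels.values >= common_labels.values[0]].sample(1).index[0]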
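
The voting rule at the end of `predict` picks the most common label among the k neighbors and, when several labels tie for the top count, chooses one of them at random. The same rule can be sketched with the standard library alone. The name `majority_vote` and its `rng` parameter are hypothetical and introduced only for this illustration.

```python
import random
from collections import Counter

def majority_vote(labels, rng=random):
    """Return the most common label; break ties uniformly at random."""
    counts = Counter(labels)
    top = max(counts.values())
    # Every label that reaches the top count is a candidate.
    tied = sorted(label for label, count in counts.items() if count == top)
    return rng.choice(tied)

# A clear winner: 3 appears twice, 7 once.
print(majority_vote([3, 3, 7]))

# A tie between 3 and 7: either may be returned, so a seeded generator
# is passed in to make the choice repeatable.
print(majority_vote([3, 7, 7, 3, 1], random.Random(0)))
```

Because ties are broken at random, repeated runs of kNN with an even number of tied votes can give slightly different error rates unless the random state is fixed.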

#### Calculation of Error on Training Data

In [50]:
kNN(1, train_data, train_data)

0.0

In [51]:
# accuracy check: should be around 0.04
kNN(3, train_data, train_data)

0.045

In [52]:
kNN(5, train_data, train_data)

0.055

In [53]:
kNN(9, train_data, train_data)

0.072

In [54]:
kNN(15, train_data, train_data)

0.0925

#### Calculation of Error on Validation Data

In [55]:
kNN(1, train_data, val_data)

0.082

In [56]:
kNN(5, train_data, val_data)

0.096

In [57]:
kNN(9, train_data, val_data)

0.106

In [58]:
kNN(15, train_data, val_data)

0.105

#### Best Predictor on Test Data

In [59]:
# lowest validation error when k = 1
best_val_predictor = 1
kNN(best_val_predictor, train_data, test_data)

0.094

#### Projecting Data

In [60]:
proj_data = pd.read_csv('projection.txt', sep=" ", header=None)

In [61]:
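def project(data):
    # Set the labels (column 784) aside, project the 784 features down to
    # 20 dimensions, then reattach the labels
    Y_data = data[784]
    dots = np.dot(data.drop(784, axis = 1), proj_data)
    return pd.concat([pd.DataFrame(dots), Y_data], axis = 1)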
    

In [62]:
proj_train = project(train_data)
proj_val = project(val_data)
proj_test = project(test_data)

#### Calculation of Error on Projected Training Data

In [63]:
kNN(1, proj_train, proj_train)

0.0

In [64]:
# accuracy check: should be around 0.16
kNN(3, proj_train, proj_train)

0.151

In [65]:
kNN(5, proj_train, proj_train)

0.1935

In [66]:
kNN(9, proj_train, proj_train)

0.226

In [67]:
kNN(15, proj_train, proj_train)

0.2585

#### Calculation of Error on Projected Validation Data

In [68]:
kNN(1, proj_train, proj_val)

0.32

In [69]:
kNN(5, proj_train, proj_val)

0.289

In [70]:
kNN(9, proj_train, proj_val)

0.283

In [71]:
kNN(15, proj_train, proj_val)

0.293

#### Best Predictor on Projected Test Data

In [74]:
# lowest validation error when k = 9
best_val_predictor = 9
kNN(best_val_predictor, proj_train, proj_test)

0.296

#### Runtime Comparison Original Data Vs. Projected Data

In [75]:
%timeit kNN(1, train_data, train_data)

2min 47s ± 3.59 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [76]:
%timeit kNN(1, proj_train, proj_train)

20 s ± 6.69 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [77]:
%timeit kNN(9, train_data, val_data)

37.7 s ± 2.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [78]:
%timeit kNN(9, proj_train, proj_val)

4.78 s ± 357 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
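
The speedup implied by these timings can be worked out from the mean values reported by `%timeit`. The sketch below only does that arithmetic on the numbers copied from the cells above. It ignores the reported standard deviations, which are large for the 20 s measurement (± 6.69 s), so the ratios should be read as rough estimates.

```python
# Mean %timeit results copied from the cells above, in seconds:
# (original data, projected data)
timings = {
    "k = 1, train vs. train": (2 * 60 + 47, 20.0),
    "k = 9, train vs. validation": (37.7, 4.78),
}

# Ratio of original runtime to projected runtime for each experiment.
speedups = {name: orig / proj for name, (orig, proj) in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: roughly {ratio:.1f}x faster on the projected data")
```

Both ratios come out at around 8x, well below the 39.2x reduction in feature count (784 / 20), which suggests that part of each prediction's cost does not scale with the dimension.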
