# K-Nearest Neighbors (KNN) and the Curse of Dimensionality
In this test we will implement and explore a one-shot method for learned classification.

## Section 1 : Dataset (10 points)
We will first set up a couple of datasets to use and create visualizations for them.
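
As a rough illustration of the sampling step, the sketch below draws each coordinate independently from a Gaussian using only the standard library. The helper name `sample_gaussian_points`, its `seed` parameter, and the `(points, labels)` return layout are assumptions made for this example; the assignment does not prescribe them.

```python
import random

def sample_gaussian_points(distributions, seed=None):
    """Hypothetical helper: sample points and class labels from per-class Gaussians."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, dist in enumerate(distributions):
        for _ in range(dist['n_points']):
            # Each coordinate is drawn independently from N(mean, std).
            point = tuple(rng.gauss(m, s) for m, s in zip(dist['mean'], dist['std']))
            points.append(point)
            labels.append(label)  # label = index of the distribution in the list
    return points, labels

example = [
    {'mean': (2, 0.5), 'std': (2, 1), 'n_points': 100},
    {'mean': (0.5, 1), 'std': (0.5, 0.3), 'n_points': 100},
]
points, labels = sample_gaussian_points(example, seed=0)
print(len(points), labels[0], labels[-1])  # prints: 200 0 1
```

A NumPy-based version would likely be faster and easier to plot, but the structure (one label per sampled point) would be the same.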

In [None]:
class_one = {'mean':(2,0.5), 'std':(2,1), 'n_points':100}
class_two = {'mean':(0.5,1), 'std':(0.5,0.3), 'n_points':100}
distributions = [class_one, class_two]

def create_classification_data(distributions):
    '''
    Samples from the given Gaussian distributions (each with a mean, std, and number of points).
    Returns a single dataset of all points, with class labels corresponding to the index of each distribution in the distributions list.
    
    You may assume that each distribution has the same dimensions.
    '''

    return

In [None]:
# TODO: Plot the points given the example distributions; each class should be its own color.
data = create_classification_data(distributions)

In [5]:
# TODO : Plot the points for data with these 3d distributions
class_one = {'mean':(2,0.5,3), 'std':(2,1,1), 'n_points':20}
class_two = {'mean':(0.5,1,0), 'std':(0.5,0.3,0), 'n_points':10}
class_three = {'mean':(1,3,1), 'std':(2,0.5,2), 'n_points':30}
distributions = [class_one, class_two, class_three]

data = create_classification_data(distributions)

## Section 2: K Nearest Neighbors (20 points)
We will now implement a k-nearest neighbors classifier and observe its behavior.
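
The core idea (find the k closest training points, then take a majority vote) can be sketched as follows. This is a minimal, unoptimized illustration; the name `knn_predict` and the separate `train_points` / `train_labels` arguments are assumptions for this example and differ from the `K_nearest_neighbor(k, point, data)` signature used in this notebook.

```python
import math
from collections import Counter

def knn_predict(k, point, train_points, train_labels):
    """Hypothetical helper: majority class among the k nearest training points."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(point, train_points[i]))
    # Count the class labels of the k closest points.
    votes = Counter(train_labels[i] for i in order[:k])
    # On a tie, most_common returns the class that was encountered first,
    # i.e. the one whose member is nearest to the query.
    return votes.most_common(1)[0][0]

train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(3, (0.2, 0.1), train_points, train_labels),
      knn_predict(3, (5.5, 5.5), train_points, train_labels))  # prints: 0 1
```

Sorting every distance costs O(n log n) per query; a vectorized distance computation or a partial sort is one possible direction when many queries are needed.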

In [None]:
# 10 points -- implementation

def K_nearest_neighbor(k, point, data):
    '''
    Given a point (in Cartesian space), determine the k closest datapoints by Euclidean distance.
    Return an integer (0, ..., C) corresponding to the most common class among those k points, where C is the final class (number of classes - 1).

    Hint: Later sections may require you to run this on many datapoints, so it may be beneficial to optimize the runtime of your algorithm.
    '''

    return

In [None]:
# 5 points -- testing

# TODO first measure the accuracy of your K nearest Neighbors implementation (5)
def measure_accuracy(k, train_data, test_data):
    '''
    train_data and test_data will both be tuples of coordinates and a class label (integer 0,...,C).

    Using your implementation of KNN and the training data, predict a label for each test data point.
    Return the accuracy of these labels as a percentage.
    '''

    return

train_data = create_classification_data(distributions)

test_dist = [{a: (dist[a] if a != 'n_points' else 10) for a in dist} for dist in distributions] # dictionary + list comprehension to change n_points to 10 for all dists
test_data = create_classification_data(test_dist)

print("accuracy at k = 4:", measure_accuracy(4,train_data,test_data))

[{'mean': (2, 0.5, 3), 'std': (2, 1, 1), 'n_points': 10}, {'mean': (0.5, 1, 0), 'std': (0.5, 0.3, 0), 'n_points': 10}, {'mean': (1, 3, 1), 'std': (2, 0.5, 2), 'n_points': 10}]


In [None]:
# 5 points -- Plotting

# TODO Plot k values versus accuracy, find the k value which has maximum accuracy.
# Also plot the runtime versus k value using the time library

import time

## Section 2.5 (Bonus) : Advanced Plotting (10 points)
Here is a chance to show off your plotting and DP skills. Don't attempt this unless you are sure of both of those.

In [None]:
# TODO Create a plot of the decision boundaries of your KNN classifier

def plot_decision_boundary(k, train_data):
    '''
    This plot should show the training data along with a shaded region for each area that your KNN assigns to a different class.
    Each region's color should correspond to the class it predicts.
    '''
    pass

# TODO Plot this for k = 1, k = 5 and your optimal k.

## Section 3 : High Dimensionality (20 points)
We will now explore how our method performs on high-dimensional data with noise.
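
One way to build such data is to extend each distribution with extra, uninformative dimensions before sampling. The sketch below shows only that extension step for a single distribution dictionary; the helper name `add_noise_dims` is an assumption for this example, and the default noise mean and std follow the values used in this section.

```python
def add_noise_dims(dist, additional_dims, mean=0, std=0.2):
    """Hypothetical helper: append noise-only dimensions to one distribution dict."""
    return {
        # Every extra dimension shares the same mean and std across all classes,
        # so it carries no class information.
        'mean': tuple(dist['mean']) + (mean,) * additional_dims,
        'std': tuple(dist['std']) + (std,) * additional_dims,
        'n_points': dist['n_points'],
    }

original = {'mean': (2, 0.5, 3), 'std': (2, 1, 1), 'n_points': 20}
extended = add_noise_dims(original, 2)
print(extended)
# prints: {'mean': (2, 0.5, 3, 0, 0), 'std': (2, 1, 1, 0.2, 0.2), 'n_points': 20}
```

The helper returns a new dictionary rather than modifying its input, so the original low-dimensional distributions remain available for comparison.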

In [None]:
# 5 points -- High Dimension Data Test

# TODO modify your training and testing data to have 'additional_dims' additional dimensions all with mean 0 and std 0.2
def create_noisy_high_dim_data(distributions, additional_dims, mean=0, std=0.2):
    '''
    ex. for additional dims = 2:
    {'mean':(2,0.5,3), 'std':(2,1,1), 'n_points':20} -> {'mean':(2,0.5,3,0,0), 'std':(2,1,1,0.2,0.2), 'n_points':20}

    then sample using your previous method
    '''

    return


In [None]:
# 5 points -- plots (and runtime check)

# TODO using your optimal k value from above, create plots showing the accuracy and runtime of your KNN given different values of 'additional_dims' up to 50

### Question 3.1 (5 points)
Explain why the above behavior is happening. Should we expect the same on real-world data?
<br><br>
**Answer**:

### Question 3.2 (5 points)
Provide a method to mitigate this issue. Be detailed in explaining the method and why it will help fix this problem.
<br><br>
**Answer**:

## Section 3.5 (bonus) : 10 points
Here you can implement your suggestion from 3.2. You will score points based on the effect it has and on the generality of the solution.

In [None]:
# TODO implement your fix here.

## Section 4: Adversarial Data (20 points)
Now we will look at an adversarial case and see if we have any way to still use KNN.
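
The dataset used in this section is a checkerboard: integer grid points labeled by the parity of their coordinate sum. The sketch below builds such a grid and inspects the labels around one interior point. The helper name `parity_grid` and the `(points, labels)` return layout are assumptions for this example, not a required interface.

```python
def parity_grid(max_n):
    """Hypothetical helper: integer grid points labeled by coordinate-sum parity."""
    points = [(a, b) for a in range(max_n + 1) for b in range(max_n + 1)]
    labels = [(a + b) % 2 for a, b in points]
    return points, labels

points, labels = parity_grid(10)
label_of = dict(zip(points, labels))

# The four axis-aligned neighbors of an interior point all carry the other label.
a, b = 5, 5
neighbor_labels = [label_of[(a + da, b + db)]
                   for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1))]
print(len(points), label_of[(a, b)], neighbor_labels)  # prints: 121 0 [1, 1, 1, 1]
```

Note the parentheses in `(a + b) % 2`: in Python, `a + b % 2` would apply the modulo to `b` alone.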

In [None]:
# 5 points -- Adversarial dataset

def create_adverserial_data(max_n):
    '''
    This dataset will contain all integer points in [0,max_n]^2 -- i.e. (0,0), (0,1), ... (0, max_n), ..., (max_n, max_n)
    Each point will be class 0 or 1, equal to the sum of its coordinates mod 2 (even or odd).

    i.e.  (0,0) -> 0, (1,2) -> 1, ... (a, b) -> (a+b) % 2
    '''

    return

adverserial_data = create_adverserial_data(10) # 121 points

# TODO plot this data

In [None]:
# 5 points -- Testing on Adversarial Data

# TODO create a plot of k versus accuracy (same as Section 2) for this new data.

In [None]:
# 5 points -- Trigonometric Basis Expansion

# TODO Write a method that will convert a dataset via trigonometric basis expansion.
def trig_basis_expansion(data, scale):
    '''
    Basis expand the input data, such that:
    (x,y) -> (sin(nx), cos(nx), sin(ny), cos(ny))

    where n = scale
    '''

    return

# TODO Separate out 1/5 of the data points to create a train and test set, then (after basis expanding):
# Plot your KNN's accuracy versus a few scale values in the range 1 - 5.

### Question 4.1 (5 points):
What value(s) of scale allow for 100% accuracy on this particular dataset, and why? Justify your answer.
<br><br>
**Answer:**

## Section 5: Open Ended (30 points)
Now we will try our method on a real dataset to see how it performs.

In [None]:
# You may use any libraries you would like for this section, though you MUST use your above implementation of KNN as the basis of your solution.

import torchvision
from torch.utils.data import DataLoader

# We will be using MNIST as the dataset (to make it tractable computationally)
def load_mnist(batch_size=32, train=True):

    to_tensor_transform = torchvision.transforms.ToTensor() # You may remove this if you would like to use non-tensors
    dataset = torchvision.datasets.MNIST('../dataset/', train=train, download=True, transform=to_tensor_transform)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True) # You may also find it easier to deal with just the dataset and not the dataloader

    return dataset, dataloader

train_dataset, train_dataloader = load_mnist()
test_dataset, test_dataloader = load_mnist(train=False)

In [None]:
# 5 points -- Test without modification

# TODO test your model without modification (you may need to just use a subset of the dataset) and
# plot k versus accuracy.

In [None]:
# 25 points -- Any method you'd like

# TODO Using any modifications of your choosing, implement a KNN which can accurately predict MNIST.
# You must show plots which explain your model's behavior, and you should explain your methodology in a markdown section.