# k-Nearest Neighbours

In this task, we will use four distance functions (vector symbols are omitted for simplicity):

- Euclidean distance: $$d(x, y) = \sqrt{\langle x - y, x - y \rangle}$$
- Inner product distance: $$d(x, y) = \langle x, y \rangle$$
- Gaussian kernel distance: $$d(x, y) = -\exp\left(-\frac{1}{2} \langle x - y, x - y \rangle\right)$$
- Cosine distance: $$d(x, y) = \cos(\theta) = \frac{\langle x, y \rangle}{\|x\| \|y\|}$$
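
The four formulas above can be sketched in plain Python as follows. This is a minimal illustration, not the reference solution: the helper `inner_product` is a hypothetical name introduced here, and returning 0 for the cosine of an all-zero vector is an assumption, since the formula is undefined in that case.

```python
import math
from typing import List


def inner_product(x: List[float], y: List[float]) -> float:
    # <x, y> = sum_i x_i * y_i (hypothetical helper, not part of the assignment)
    return sum(a * b for a, b in zip(x, y))


def euclidean_distance(x: List[float], y: List[float]) -> float:
    diff = [a - b for a, b in zip(x, y)]
    return math.sqrt(inner_product(diff, diff))


def inner_product_distance(x: List[float], y: List[float]) -> float:
    return inner_product(x, y)


def gaussian_kernel_distance(x: List[float], y: List[float]) -> float:
    diff = [a - b for a, b in zip(x, y)]
    return -math.exp(-0.5 * inner_product(diff, diff))


def cosine_sim_distance(x: List[float], y: List[float]) -> float:
    norm_x = math.sqrt(inner_product(x, x))
    norm_y = math.sqrt(inner_product(y, y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: the formula is undefined for a zero vector, so return 0.
        return 0.0
    return inner_product(x, y) / (norm_x * norm_y)
```

For instance, `euclidean_distance([0, 0], [3, 4])` evaluates to `5.0`, and `gaussian_kernel_distance` of a point with itself is `-1.0`.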

F1-score is an important metric for binary classification, because a high accuracy can still hide many false positives (a good example is in the MLAPP book, Section 2.2.3.1, “Example: medical diagnosis”, page 29).
We have provided a basic definition below. For more detail, see Section 5.7.2.3 of the MLAPP book.

<img src="F1Score.png">
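
As a rough sketch, the standard F1 definition, $F_1 = \frac{2\,TP}{2\,TP + FP + FN}$, can be computed as below. This assumes labels are 0 or 1 with 1 as the positive class, and it returns 0 when there are no positives at all. Both are assumptions rather than something the assignment states.

```python
from typing import List


def f1_score(real_labels: List[int], predicted_labels: List[int]) -> float:
    assert len(real_labels) == len(predicted_labels)
    pairs = list(zip(real_labels, predicted_labels))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        # Assumption: no positives in either list, so report 0.
        return 0.0
    return 2 * tp / denominator
```

With real labels `[1, 0, 1, 1]` and predictions `[1, 1, 0, 1]` there are 2 true positives, 1 false positive, and 1 false negative, giving an F1-score of 4/6.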

### Part 1.1.1 Distance Functions

Implement the class in file hw1_knn.py and the following functions in utils.py:

- f1_score
- euclidean_distance
- inner_product_distance
- gaussian_kernel_distance
- cosine_sim_distance

Simply follow the formulas above to complete all of these functions. You are not allowed to call any package that we did not import for you.
    
### Part 1.1.2 KNN Class

You need to implement the following functions in the KNN class:

1. def train(self, features: List[List[float]], labels: List[int])

In this function, features is the full training data, given as a 2D list of floats. For example, if Student 1 has features age 25 and grade 3.8 with label 0, and Student 2 has features age 22 and grade 3.0 with label 1, then the feature data is [[25.0, 3.8], [22.0, 3.0]] and the corresponding labels are [0, 1].

For KNN, the training process just loads all the training data. All you need to do in this function is store the data in member variables of the KNN class so that later steps can use it.
    
2. def get_k_neighbors(self, point: List[float]) -> List[int]

This function takes a single data point and finds its k nearest neighbours in the training set. You already have the k value and the distance function, and the training data was stored in the KNN class by the train function.

This function needs to return a list of the labels of all k neighbours.

3. def predict(self, features: List[List[float]]) -> List[int]

The predict function takes another 2D list containing all testing data points, in the same format as the input to train. For every testing data point, reuse get_k_neighbors to find its k nearest neighbours, and use the majority label among those neighbours as the predicted label for that point. Thus, you will get N predicted labels for N testing data points.

This function needs to return a list of predicted labels for all testing data points.
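
The three methods described above could look roughly like the sketch below. The constructor signature `__init__(self, k, distance_function)` is an assumption, since the text only says that k and the distance function are already available. The sketch also assumes that a smaller distance value means a closer neighbour. Ties in the majority vote fall to whichever label `Counter` encountered first, which the assignment does not specify.

```python
from collections import Counter
from typing import Callable, List


class KNN:
    def __init__(self, k: int,
                 distance_function: Callable[[List[float], List[float]], float]):
        # Assumed constructor: the assignment text does not show its signature.
        self.k = k
        self.distance_function = distance_function
        self.features: List[List[float]] = []
        self.labels: List[int] = []

    def train(self, features: List[List[float]], labels: List[int]) -> None:
        # "Training" only stores the data for later lookups.
        self.features = features
        self.labels = labels

    def get_k_neighbors(self, point: List[float]) -> List[int]:
        # Pair every training label with its distance to the query point.
        scored = [(self.distance_function(point, f), label)
                  for f, label in zip(self.features, self.labels)]
        scored.sort(key=lambda pair: pair[0])  # assumption: smaller means closer
        return [label for _, label in scored[:self.k]]

    def predict(self, features: List[List[float]]) -> List[int]:
        predictions = []
        for point in features:
            neighbor_labels = self.get_k_neighbors(point)
            # Majority vote among the k neighbour labels.
            predictions.append(Counter(neighbor_labels).most_common(1)[0][0])
        return predictions
```

Sorting all training points costs O(n log n) per query, which is acceptable for a small data set; a heap-based selection would be an alternative for larger ones.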

In [1]:
# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
import numpy as np
from sklearn.metrics import mean_squared_error
from hw1_knn import KNN
from utils import euclidean_distance, gaussian_kernel_distance, inner_product_distance, cosine_sim_distance
from utils import f1_score, model_selection_without_normalization, model_selection_with_transformation
distance_funcs = {
    'euclidean': euclidean_distance,
    'gaussian': gaussian_kernel_distance,
    'inner_prod': inner_product_distance,
    'cosine_dist': cosine_sim_distance,
}

In [17]:
from data import data_processing
Xtrain, ytrain, Xval, yval, Xtest, ytest = data_processing()

### Part 1.1.3 Model selection 

In this section, you need to implement the following function:

1. def model_selection_without_normalization(distance_funcs, Xtrain, ytrain, Xval, yval)

In this part, you should try the different distance functions you implemented in Part 1.1.1 and find the best k. Use k from 1 to 30 in increments of 2 (that is, k = 1, 3, 5, ..., 29). We will use the F1-score to compare different models.

This function takes the following parameters:

distance_funcs: dictionary of the distance functions you will use to calculate distances. Make sure you loop over all distance functions for each data point and each k value.

Xtrain: List[List[int]] training data set to train your KNN model

ytrain: List[int] train labels to train your KNN model

Xval: List[List[int]] validation data set you will use on your KNN predict function to produce predicted labels and tune k and distance function.

yval: List[int] validation labels

This function needs to return the following:

best_model: an instance of KNN

best_k: the best k chosen for best_model

best_func: the name of the best distance function chosen for best_model

Thus, the function returns only one set of model, k, and distance function.

Choose the model based on the following priorities:
First compare F1-scores on the validation set; then, to break ties, check the distance function [euclidean > gaussian > inner_prod > cosine_dist].
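
One way to honour this priority order is to visit the candidates in priority order and replace the current best only on a strictly higher score. The sketch below illustrates that idea. The parameters `make_model` and `score` are hypothetical stand-ins for the KNN constructor and `f1_score`, added so the example is self-contained, and it assumes the `distance_funcs` dictionary was built in priority order. With the strict comparison, ties between k values also fall to the smaller k, which is a side effect of the sketch rather than a stated rule.

```python
from typing import Any, Callable, Dict, List, Tuple


def model_selection_sketch(distance_funcs: Dict[str, Any],
                           Xtrain: List[List[float]], ytrain: List[int],
                           Xval: List[List[float]], yval: List[int],
                           make_model: Callable[[int, Any], Any],
                           score: Callable[[List[int], List[int]], float]
                           ) -> Tuple[Any, int, str]:
    best_model, best_k, best_func = None, None, None
    best_score = float("-inf")
    # Dicts keep insertion order, so this visits functions in priority order
    # as long as the dictionary was built that way (assumption).
    for name, func in distance_funcs.items():
        for k in range(1, 30, 2):  # k = 1, 3, ..., 29
            model = make_model(k, func)
            model.train(Xtrain, ytrain)
            current = score(yval, model.predict(Xval))
            # Strict ">" keeps the earlier (higher-priority) candidate on ties.
            if current > best_score:
                best_model, best_k, best_func = model, k, name
                best_score = current
    return best_model, best_k, best_func
```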

In [7]:
best_model, best_k, best_function = model_selection_without_normalization(distance_funcs, Xtrain, ytrain, Xval, yval)
print("best_model:",best_model)
print("best_k:",best_k)
print("best_function:",best_function)

best_model: <hw1_knn.KNN object at 0x613559390>
best_k: 13
best_function: gaussian


In [13]:
scaler = MinMaxScaler()
X = scaler(Xtest)
Y_pred = best_model.predict(X)

# mean_squared_error returns the mean squared error (MSE), not its square root
print("test data MSE:",mean_squared_error(ytest, Y_pred))

test data MSE: 0.625


### Part 1.2 Data transformation

Here, we take two different data transformation approaches.

#### Normalizing the feature vector 

This one is simple but sometimes works well. Given a feature vector $x$, the normalized feature vector is given by

$$ x' = \frac{x}{\sqrt{\langle x, x \rangle}} $$

If a vector is an all-zero vector, we let the normalized vector also be an all-zero vector.

1. normalize

Normalize the feature vector of each sample. For example, if the input features = [[3, 4], [1, -1], [0, 0]], the output should be [[0.6, 0.8], [0.707107, -0.707107], [0, 0]].
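
A minimal sketch of this per-sample normalization, written as a plain function for illustration (the notebook imports a `NormalizationScaler` class, whose interface is not shown here):

```python
import math
from typing import List


def normalize(features: List[List[float]]) -> List[List[float]]:
    normalized = []
    for x in features:
        norm = math.sqrt(sum(v * v for v in x))  # sqrt(<x, x>)
        if norm == 0:
            # An all-zero vector stays all-zero, as stated in the text.
            normalized.append([0.0 for _ in x])
        else:
            normalized.append([v / norm for v in x])
    return normalized
```

Calling `normalize([[3, 4], [1, -1], [0, 0]])` reproduces the example above up to rounding.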
        
#### Min-max scaling the feature matrix

The above normalization is data independent; that is, the output of the normalization function doesn’t depend on the rest of the training data. However, sometimes it is helpful to do data-dependent normalization. One thing to note is that, when doing data-dependent normalization, we can only use the training data, as the test data is assumed to be unknown during training (at least for most classification tasks).

The min-max scaling works as follows: after min-max scaling, all values of the training data’s feature vectors are in the given range.
Note that this doesn’t mean the values of the validation/test data’s features are all in that range, because the validation/test data may have a different distribution from the training data.

2. min_max_scale

Scale each feature (each column of the feature matrix) to the range [0, 1] using the minimum and maximum of that feature in the training data. For example, if the input features = [[2, -1], [-1, 5], [0, 0]], the output should be [[1, 0], [0, 1], [0.333333, 0.166667]].
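
The sketch below shows one possible callable scaler in the spirit of the `MinMaxScaler()` usage in this notebook. The class name `MinMaxScalerSketch` is hypothetical. Learning the per-feature minimum and maximum on the first call, and mapping a constant feature to 0, are both assumptions rather than the assignment's specification.

```python
from typing import List, Optional


class MinMaxScalerSketch:
    def __init__(self) -> None:
        self.mins: Optional[List[float]] = None
        self.maxs: Optional[List[float]] = None

    def __call__(self, features: List[List[float]]) -> List[List[float]]:
        if self.mins is None:
            # Assumption: the first call receives the training data and
            # fixes the per-feature minimum and maximum.
            columns = list(zip(*features))
            self.mins = [min(c) for c in columns]
            self.maxs = [max(c) for c in columns]
        scaled = []
        for row in features:
            scaled_row = []
            for v, lo, hi in zip(row, self.mins, self.maxs):
                # Assumption: a constant feature (hi == lo) maps to 0.
                scaled_row.append(0.0 if hi == lo else (v - lo) / (hi - lo))
            scaled.append(scaled_row)
        return scaled
```

Under this design the scaler must see the training data first. A later call such as `scaler([[5, -1]])` after fitting on the example above returns `[[2.0, 0.0]]`, which illustrates how validation or test values can fall outside the training range.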

In [4]:
from utils import NormalizationScaler, MinMaxScaler

scaling_classes = {
    'min_max_scale': MinMaxScaler,
    'normalize': NormalizationScaler,
}

In [15]:
best_model, best_k, best_function, best_scaler = model_selection_with_transformation(distance_funcs, scaling_classes, Xtrain, ytrain, Xval, yval)
print("best_model:",best_model)
print("best_k:",best_k)
print("best_function:",best_function)
print("best_scaler:",best_scaler)

best_model: <hw1_knn.KNN object at 0x1a14ace898>
best_k: 25
best_function: euclidean
best_scaler: min_max_scale


In [20]:
scaler = MinMaxScaler()
X = scaler(Xtest)
Y_pred = best_model.predict(X)

# mean_squared_error returns the mean squared error (MSE), not its square root
print("test data MSE:",mean_squared_error(ytest, Y_pred))

test data MSE: 0.25


The error reported by mean_squared_error on the test data dropped from 0.625 to 0.25, which suggests that appropriate scaling and normalization can improve classification performance.