Having found the optimal hyperparameter K for a KNN classifier, we now try different distance metrics (recall that the default is Euclidean distance) to see whether any of them works better.

We try four distance metrics: Manhattan distance, Euclidean distance (included for completeness, although we already tried it in question 2), Chebyshev distance, and cosine distance.
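For reference, the four distances can be written out directly in NumPy. This is a minimal sketch on two made-up vectors `a` and `b` (hypothetical values, not AwA2 features), and the cosine distance shown is the plain `1 - cos_sim` form rather than the modified one used later.

```python
import numpy as np

# Two illustrative vectors (hypothetical values, not taken from the AwA2 features)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

diff = np.abs(a - b)  # element-wise absolute differences

manhattan = diff.sum()                   # L1: sum of absolute differences
euclidean = np.sqrt((diff ** 2).sum())   # L2: square root of summed squares
chebyshev = diff.max()                   # L-infinity: largest single difference

# Cosine similarity depends only on the angle between a and b, not their lengths
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cosine = 1.0 - cos_sim

print(manhattan, euclidean, chebyshev, cosine)
```

Manhattan and Euclidean are the `p=1` and `p=2` cases of the Minkowski distance, which is why the cells below pass `metric='minkowski'` with different `p`.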

Again, let's begin by loading the necessary packages.

In [5]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

Load the data and split it into a training set and a test set.

In [2]:
features_file = 'AwA2-features/Animals_with_Attributes2/Features/ResNet101/AwA2-features.txt'
labels_file = 'AwA2-features/Animals_with_Attributes2/Features/ResNet101/AwA2-labels.txt'

# There are 37322 images in total across 50 classes. Each image is represented by a 2048-dimensional feature vector
features = np.loadtxt(features_file) # shape (37322, 2048)
labels = np.loadtxt(labels_file) # shape (37322, )

# Stratified split: every class is divided into training set (60%) and test set (40%)
# Set random_state to an int for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(
    features, labels, train_size=0.6, test_size=0.4, random_state=0, stratify=labels)

1. Manhattan Distance, K=7

In [7]:
clf = KNeighborsClassifier(n_neighbors=7, p=1, metric='minkowski')
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)
print('Manhattan Distance Metric, K=7:\nAccuracy = ' + str(accuracy))

Manhattan Distance Metric, K=7:
Accuracy = 0.8859937035300423


2. Euclidean Distance, K=7

In [5]:
clf = KNeighborsClassifier(n_neighbors=7, p=2, metric='minkowski')
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)
print('Euclidean Distance Metric, K=7:\nAccuracy = ' + str(accuracy))

Euclidean Distance Metric, K=7:
Accuracy = 0.894500636345368


3. Chebyshev Distance, K=7

In [6]:
clf = KNeighborsClassifier(n_neighbors=7, metric='chebyshev')
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)
print('Chebyshev Distance Metric, K=7:\nAccuracy = ' + str(accuracy))

Chebyshev Distance Metric, K=7:
Accuracy = 0.7830397213477125


4. Cosine Distance, K=7

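Note that scikit-learn's tree-based neighbor searches (BallTree, KDTree) do not offer a cosine metric in their metrics lists (the built-in 'cosine' option is only available with brute-force search), so we define the distance ourselves and use the user-defined metric option.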

As stated on the scikit-learn website, a user-defined metric has to meet the following requirement: "...Here func is a function which takes two one-dimensional numpy arrays, and returns a distance. Note that in order to be used within the BallTree, the distance must be a true metric...".
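The user-defined metric option can be sketched on a small synthetic dataset. Everything here is an illustration under stated assumptions: `half_cosine` is a hypothetical helper name, the data comes from `make_classification` rather than AwA2, and `algorithm='brute'` is chosen so that the BallTree requirement quoted above does not come into play. The built-in `metric='cosine'` (which computes `1 - cos_sim`) is fitted alongside as a cross-check.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def half_cosine(p1, p2):
    """Hypothetical helper: takes two 1-D arrays, returns (1 - cos_sim) / 2."""
    cos = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
    return (1 - cos) / 2

# Synthetic stand-in data (not the AwA2 features)
X, y = make_classification(n_samples=200, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Callable metric; brute-force search compares against every training point
custom = KNeighborsClassifier(n_neighbors=7, metric=half_cosine, algorithm='brute')
custom.fit(X_tr, y_tr)

# Built-in cosine distance, also with brute-force search
builtin = KNeighborsClassifier(n_neighbors=7, metric='cosine', algorithm='brute')
builtin.fit(X_tr, y_tr)

demo_accuracy = custom.score(X_te, y_te)
same_predictions = np.array_equal(custom.predict(X_te), builtin.predict(X_te))
print(demo_accuracy, same_predictions)
```

Halving the distance does not change which neighbors are nearest, so the two classifiers are expected to agree. A Python callable is evaluated once per pair of points, which is far slower than a built-in metric on a dataset the size of AwA2.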

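Since cosine similarity in its original form, cos_sim(p1, p2) = p1^T p2 / (||p1|| ||p2||), is not a valid distance metric (it is largest, not smallest, for vectors pointing in the same direction), we convert it to d(p1, p2) = 1 - (1 + cos_sim(p1, p2))/2, which maps similarity 1 to distance 0 and similarity -1 to distance 1. We then use this new distance to train a kNN classifier.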

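Is this new formula d(p1, p2) = 1 - (1+cos_sim(p1,p2))/2 a valid metric?
    1. It satisfies the 'non-negativity' property, since cos_sim(p1,p2) <= 1;
    2. It satisfies the 'identity of indiscernibles' property only up to positive scaling: d(p1, p2) = 0 whenever p2 = c*p1 with c > 0, so the property holds strictly only for vectors normalized to unit length;
    3. It satisfies the 'symmetry' property;
    4. It does not satisfy the 'triangle inequality' property. Counterexample: for p1 = (1, 0), p2 = (1, 1), p3 = (0, 1) we get d(p1, p2) = d(p2, p3) ≈ 0.146, so d(p1, p2) + d(p2, p3) ≈ 0.293, while d(p1, p3) = 0.5.

So d is not a true metric and does not meet the BallTree requirement quoted above.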
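The properties above can be checked numerically. The sketch below uses three 2-D vectors at 0, 45 and 90 degrees (chosen purely for illustration) and the same formula as d; `modified_cosine` is a name introduced here, not part of any library.

```python
import numpy as np

def modified_cosine(p1, p2):
    # d(p1, p2) = 1 - (1 + cos_sim(p1, p2)) / 2
    cos = np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))
    return 1 - (1 + cos) / 2

# Illustrative vectors at 0, 45 and 90 degrees from the x-axis
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

d_ab = modified_cosine(a, b)
d_bc = modified_cosine(b, c)
d_ac = modified_cosine(a, c)

# Triangle inequality would require d(a, c) <= d(a, b) + d(b, c)
triangle_holds = d_ac <= d_ab + d_bc

# Two distinct vectors pointing in the same direction are at distance zero
scaled_distance = modified_cosine(a, 2 * a)

print(d_ab, d_bc, d_ac, triangle_holds, scaled_distance)
```

Here `triangle_holds` comes out `False`, and `scaled_distance` is zero even though `a` and `2 * a` are different points.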

In [3]:
# Self-defined distance measure
# Use cosine similarity to measure the distance between two vectors p1 and p2
def cosine(p1, p2):
    '''
    params:
    p1: vector for point 1, shape = (d, )
    p2: vector for point 2, shape = (d, )

    return:
    the modified cosine distance 1 - (1 + cos_sim) / 2 between p1 and p2, in [0, 1]
    '''
    assert p1.shape[0] == p2.shape[0]
    cos = np.dot(p1, p2) / (np.sqrt(np.sum(p1**2)) * np.sqrt(np.sum(p2**2)))
    return 1 - (1 + cos) / 2

clf = KNeighborsClassifier(n_neighbors=7, metric=cosine)
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)
print('(Modified) Cosine Distance Metric, K=7:\nAccuracy = ' + str(accuracy))

(Modified) Cosine Distance Metric, K=7:
Accuracy = 0.90635675530846
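A possible alternative, offered as a suggestion rather than something the experiment above did: for vectors scaled to unit length, squared Euclidean distance equals 2 - 2*cos_sim, which is 4 times the modified cosine distance. Euclidean nearest neighbors on L2-normalized features should therefore coincide with cosine nearest neighbors, while Euclidean distance is a true metric. The sketch checks this on random stand-in data (not AwA2 features).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))   # random stand-in "training" vectors
q = rng.normal(size=8)         # random stand-in query vector

# Modified cosine distance from the query to every row of X
cos = X @ q / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
d_cos = 1 - (1 + cos) / 2

# Euclidean distance after scaling every vector to unit length
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
q_unit = q / np.linalg.norm(q)
d_euc = np.linalg.norm(X_unit - q_unit, axis=1)

# For unit vectors: ||u - v||^2 = 2 - 2*cos_sim = 4 * d_cos
identity_holds = bool(np.allclose(d_euc ** 2, 4 * d_cos))

# The 7 nearest neighbors are the same under both distances
same_neighbors = set(np.argsort(d_cos)[:7]) == set(np.argsort(d_euc)[:7])

print(identity_holds, same_neighbors)
```

Under this assumption, normalizing `X_train` and `X_test` row-wise and reusing the Euclidean classifier from section 2 would be one way to get cosine-style neighbors with a built-in metric.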
