## Q1.8

Some setup code:

In [39]:
import os
import numpy as np
from numpy.linalg import norm
from scipy.io import loadmat

def get_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))
    
def get_angle(x, y):
    cos_angle = np.dot(x, y) / (norm(x) * norm(y))
    # Clip to [-1, 1] so rounding error cannot push arccos out of its domain.
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

# This will help us get answers for both parts of the problem.
# V is the transposed data matrix, with shape (10, 1651).
def print_best_pairs(V):
    min_dist, min_angle = None, None
    best_dist_pair = None
    best_angle_pair = None
    for i in range(len(V)):
        for j in range(i + 1, len(V)):
            dist = get_distance(V[i], V[j])
            angle = np.abs(get_angle(V[i], V[j]))
            # Test for None first: comparing a float with None raises
            # a TypeError in Python 3.
            if min_dist is None or dist < min_dist:
                best_dist_pair = i, j
                min_dist = dist
            if min_angle is None or angle < min_angle:
                best_angle_pair = i, j
                min_angle = angle

    print('Lowest distance pair is: (v%d, v%d)' % (best_dist_pair))
    print('Lowest angle pair is: (v%d, v%d)' % (best_angle_pair))

data_path = os.path.join('PS01_dataSet', 'wordVecV.mat')
data = loadmat(data_path)
V = data['V'].T
num_docs = len(V)

### a) ###

In [33]:
print_best_pairs(V)

Lowest distance pair is: (v6, v7)
Lowest angle pair is: (v8, v9)


They are not the same pair. The likely reason is that the vectors aren't normalized: Euclidean distance depends on the vectors' magnitudes as well as their directions, while the angle depends on direction only, so the two metrics can pick different nearest pairs. This was shown in Q1.7.
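As a rough illustration of how the two metrics can disagree, here is a minimal sketch using made-up 2-D vectors rather than the word-count data. The vectors `a`, `b`, `c` and the helper names `distance` and `angle` are hypothetical and chosen only for this example.

```python
import numpy as np

def distance(x, y):
    return np.linalg.norm(x - y)

def angle(x, y):
    cos_angle = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

# Hypothetical vectors: b points the same way as a but is much longer,
# while c is close to a in space but points in a different direction.
a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])
c = np.array([1.0, 1.0])

dist_ab, dist_ac = distance(a, b), distance(a, c)
angle_ab, angle_ac = angle(a, b), angle(a, c)

# Distance ranks (a, c) as the closer pair; angle ranks (a, b) as closer.
print('distances:', dist_ab, dist_ac)
print('angles:   ', angle_ab, angle_ac)
```

Here the longer vector `b` is far from `a` in distance even though the angle between them is zero, which is the kind of disagreement seen between the two results above.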

### b) ###

In [36]:
normalizer = np.sum(V, axis=1, keepdims=True)
V_l1_normed = V / normalizer
print_best_pairs(V_l1_normed)

Lowest distance pair is: (v8, v9)
Lowest angle pair is: (v8, v9)


The pair with the lowest angle is the same as in part a). This is expected, since all we've done is scale each vector by a positive constant (i.e., they still point in the same directions). What has changed is that the distance metric now agrees with the angle metric on the nearest pair.

One possible reason for using this normalization is to reduce the distance between documents with very similar word proportions but different lengths. A contrived example would be two documents A and B, where B is just A repeated a few times. The count vector of B is then a multiple of that of A, so after normalization both map to the same vector and the distance between them is 0.
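The contrived A/B example can be checked numerically. This is a small sketch with made-up counts; `counts_a`, `counts_b`, and `l1_normalize` are hypothetical names, and the normalization divides by the sum of entries in the same way as the cell above.

```python
import numpy as np

# Hypothetical word counts for document A; B is A repeated three times.
counts_a = np.array([2.0, 0.0, 1.0, 5.0])
counts_b = 3 * counts_a

def l1_normalize(v):
    # Divide by the sum of entries (the L1 norm for nonnegative counts).
    return v / np.sum(v)

raw_dist = np.linalg.norm(counts_a - counts_b)
normed_dist = np.linalg.norm(l1_normalize(counts_a) - l1_normalize(counts_b))

print('distance before normalization:', raw_dist)
print('distance after normalization: ', normed_dist)
```

Before normalization the two documents are far apart purely because of length; afterwards the difference vanishes (up to floating-point rounding).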

### c), d) ###

In [41]:
fdoc = np.sum(V > 0, axis=0, keepdims=True) 
tfidf_log_term = np.sqrt(np.log(num_docs / fdoc))
print(tfidf_log_term)
V_tfidf = V_l1_normed * tfidf_log_term
print_best_pairs(V_tfidf)

[[1.51742713 1.26863624 1.51742713 ... 1.51742713 1.51742713 1.51742713]]
Lowest distance pair is: (v8, v9)
Lowest angle pair is: (v7, v9)


The "inverse document frequency" adjustment lowers the $f_{\text{term}}$ values for words that occur frequently across the _whole_ collection of documents, while giving relatively more weight to words that occur in only a few documents. Geometrically, this creates more separation along the axes of words that are rarer across documents, and makes the resulting vectors "point" more along these axes.

For example, the only two documents that contain the word "optimization" will separate themselves more from the other documents by pointing more along the "optimization" axis.

This scaling might be useful because it lowers the importance of words that occur in nearly every document. Such words do little to identify a document uniquely, so they contribute little useful information to the distance or angle comparison.
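To make the effect of the weighting concrete, here is a minimal sketch on a made-up 3-document, 3-term count matrix. The matrix `counts` and the names `idf_weight`, `tf`, and `weighted` are hypothetical; the weighting uses the same form as the cell above, $\sqrt{\log(D / f_{\text{doc}})}$, applied to the toy data.

```python
import numpy as np

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns).
# Term 0 appears in every document; term 2 appears in only one.
counts = np.array([
    [4.0, 1.0, 0.0],
    [3.0, 0.0, 0.0],
    [5.0, 2.0, 3.0],
])
num_docs = counts.shape[0]

# Number of documents that contain each term.
fdoc = np.sum(counts > 0, axis=0, keepdims=True)

# Weight per term: zero for a term present in every document,
# larger for terms present in fewer documents.
idf_weight = np.sqrt(np.log(num_docs / fdoc))

# L1-normalized term frequencies, then scaled column-wise by the weights.
tf = counts / np.sum(counts, axis=1, keepdims=True)
weighted = tf * idf_weight

print(idf_weight)
print(weighted)
```

In this toy case the term that appears in every document is scaled to zero and so no longer affects any distance or angle, while the rarest term gets the largest weight.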