## Q3.7 ##

### a) (OptM 5.5 part 3) ###

We consider the case where the number of terms $n$ is much larger than the number of documents $m$, and $\mathrm{rank}(\tilde{M}) = m$. This is probably the common case, since the number of unique words is typically larger than the number of documents.

Let the SVD of $\tilde{M}$ be $\tilde{M} = U \Sigma V^T$.

Consider first the product $U\Sigma$. If we let the columns of $U$ represent the normalized, mutually orthogonal term-vectors for each _concept_, then the columns of $U\Sigma$ are those same vectors scaled by their singular values. That is, the concepts that the documents have most in common receive the largest scaling (the largest singular value and its singular vector correspond to the direction of maximum variance). The idea is very similar to PCA, where the singular values in $\Sigma$ rank the directions of maximum variance; one difference is that in Latent Semantic Indexing our data is not centered.

We can then examine the role of $V^T$ in forming the final $\tilde{M}$. As the columns of $\tilde{M}$ are the term-vectors for each document, each one must be a linear combination of the scaled concept term-vectors in $U\Sigma$. Hence, column $j$ of $V^T$ is the concept weighting vector (the linear combination weights) for document $j$. Equivalently, row $i$ of $V^T$ describes how much the $i$-th concept term-vector contributes to each document term-vector.
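The interpretation above can be checked numerically. The following is a minimal sketch on a small random matrix (the names `M_toy`, `U_toy`, `s_toy`, `Vt_toy` and the sizes are assumptions for illustration, not the homework data): it rebuilds one document column from the scaled concept vectors $U\Sigma$ and the matching column of $V^T$.

```python
import numpy as np

# Toy term-by-document matrix (hypothetical data, 8 terms and 3 documents).
rng = np.random.default_rng(0)
n_terms, n_docs = 8, 3
M_toy = rng.random((n_terms, n_docs))

# Thin SVD: U_toy is (n, m), s_toy is (m,), Vt_toy is (m, m).
U_toy, s_toy, Vt_toy = np.linalg.svd(M_toy, full_matrices=False)

# Scale each concept vector (column of U) by its singular value: U @ diag(s).
scaled_concepts = U_toy * s_toy

# Document j is the combination of scaled concepts weighted by column j of V^T.
j = 1
doc_j = scaled_concepts @ Vt_toy[:, j]
reconstruction_ok = np.allclose(doc_j, M_toy[:, j])
print(reconstruction_ok)
```

The same check should hold for any column index `j`, since it is just the SVD written one column at a time.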

### b) (OptM 5.5 part 4) ###

To project any document vector (whether it be $\tilde{q}$ or one of the columns of $\tilde{M}$) onto the subspace spanned by $u_1, \dots, u_k$, we compute its coordinates in that basis, where $U_k = [u_1, \dots, u_k]$:

$q = U_k^T\tilde{q}$

If we want this projection to be normalized with respect to the variance of concept occurrences, which we typically do (similar in spirit to the inverse document frequency from Homework 1, which applied to individual terms), we also divide each coordinate by its singular value:

$q_{\text{normalized}} = \Sigma_k^{-1}U_k^T\tilde{q}$

The cosine similarity between any two projected, normalized vectors $q_1$ and $q_2$ is then:

$\cos(\theta) = \dfrac{\langle q_1, q_2 \rangle}{\sqrt{\langle q_1, q_1 \rangle}\sqrt{\langle q_2, q_2 \rangle}}$
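As a small worked example of the two formulas above, the sketch below projects two columns of a toy matrix and computes their cosine similarity. The data and the helper names `project` and `cosine` are assumptions made for illustration. It also checks a consequence of the SVD: for a column of $\tilde{M}$ itself, $\Sigma_k^{-1}U_k^T\tilde{m}_j$ equals column $j$ of $V_k^T$.

```python
import numpy as np

# Toy term-by-document matrix with unit-norm columns (hypothetical data).
rng = np.random.default_rng(1)
M_toy = rng.random((12, 4))
M_toy /= np.linalg.norm(M_toy, axis=0)

U_toy, s_toy, Vt_toy = np.linalg.svd(M_toy, full_matrices=False)

k = 2
U_k, s_k = U_toy[:, :k], s_toy[:k]

def project(q):
    """Return Sigma_k^{-1} U_k^T q, the normalized concept coordinates of q."""
    return (U_k.T @ q) / s_k

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

p0, p1 = project(M_toy[:, 0]), project(M_toy[:, 1])
sim_01 = cosine(p0, p1)

# For a column of M, the normalized projection is the matching column of V_k^T.
matches_v = np.allclose(p0, Vt_toy[:k, 0])
print(sim_01, matches_v)
```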

### c) ###

In [45]:
import os
import numpy as np
from numpy.linalg import norm
from scipy.io import loadmat
from scipy.linalg import svd

# Load the data.
data_path = os.path.join('..', 'a1', 'PS01_dataSet', 'wordVecV.mat')
data = loadmat(data_path)
V = data['V'].astype(np.float64)
num_docs = V.shape[1]  # Each column of V is one document.

# Create the term-by-document matrix M: clip the entries of V to [0, 1],
# then scale each column (document) to unit L2 norm.
M = np.clip(V, 0, 1.0)
col_norms = np.sqrt(np.sum(M * M, axis=0))
M /= col_norms

# Do SVD.
U, S, V_t = svd(M, full_matrices=True, compute_uv=True)

np.set_printoptions(precision=3, suppress=True)
print('The 10 singular values in descending order: ')
print(S)

The 10 singular values in descending order: 
[1.537 1.019 0.959 0.954 0.941 0.929 0.898 0.892 0.869 0.816]


### d) ###

In [53]:
def norm(x):
    """ L2 norm (shadows the numpy.linalg.norm imported earlier). """
    return np.sqrt(np.sum(x * x))

def normalize(x):
    """ L2 normalized. """
    return x / norm(x)

def get_similarity(x, y):
    """ Cosine similarity between the vectors x and y. """
    cos_angle = np.dot(x, y) / (norm(x) * norm(y))
    return cos_angle

def lsi_sim(U, S, q1, q2):
    """
    Cosine similarity of two queries in the rank-k concept space.
    Note that for LSI the matrices should be rank-k approximations,
    i.e., U and S are truncated to the first k concepts.
    :param U: The term-concept matrix, of shape (n, k).
    :param S: The singular value array, of shape (k,).
    :param q1: The first query, of shape (n,).
    :param q2: The second query, of shape (n,).
    :return: The cosine similarity of the projected queries.
    """
    S_diag = np.diag(1.0 / S)
    proj_q1 = np.matmul(S_diag, np.matmul(U.T, normalize(q1)))
    proj_q2 = np.matmul(S_diag, np.matmul(U.T, normalize(q2)))
    return get_similarity(proj_q1, proj_q2)

max_rank = 9

# Get the rank-cutoff matrices.
U_k = U[:, :max_rank]
S_k = S[:max_rank]

# Compute pairwise similarities, and keep track of the highest one.
max_sim = 0.0
max_sim_pair = None
for i in range(num_docs):
    for j in range(i + 1, num_docs):
        sim = lsi_sim(U_k, S_k, M[:, i], M[:, j])
        if sim > max_sim:
            max_sim_pair = i + 1, j + 1
            max_sim = sim
            
print('The pair with maximum similarity is: ' + str(max_sim_pair))
print('The titles are: "Barack Obama" and "George W. Bush", respectively')

The pair with maximum similarity is: (9, 10)
The titles are: "Barack Obama" and "George W. Bush", respectively


### e) ###

In [58]:
for max_rank in range(8, 0, -1):
    # Get the rank-cutoff matrices.
    U_k = U[:, :max_rank]
    S_k = S[:max_rank]

    # Compute pairwise similarities, and keep track of the highest one.
    max_sim = 0.0
    max_sim_pair = None
    for i in range(num_docs):
        for j in range(i + 1, num_docs):
            sim = lsi_sim(U_k, S_k, M[:, i], M[:, j])
            if sim > max_sim:
                max_sim_pair = i + 1, j + 1
                max_sim = sim
    
    print('For k = ' + str(max_rank) + ': ')
    print('The pair with maximum similarity is: ' + str(max_sim_pair))
    # These titles belong to documents 9 and 10 only, so print them only for that pair.
    if max_sim_pair == (9, 10):
        print('The titles are: "Barack Obama" and "George W. Bush", respectively')
    print(' ')

For k = 8: 
The pair with maximum similarity is: (9, 10)
The titles are: "Barack Obama" and "George W. Bush", respectively
 
For k = 7: 
The pair with maximum similarity is: (9, 10)
The titles are: "Barack Obama" and "George W. Bush", respectively
 
For k = 6: 
The pair with maximum similarity is: (9, 10)
The titles are: "Barack Obama" and "George W. Bush", respectively
 
For k = 5: 
The pair with maximum similarity is: (9, 10)
The titles are: "Barack Obama" and "George W. Bush", respectively
 
For k = 4: 
The pair with maximum similarity is: (9, 10)
The titles are: "Barack Obama" and "George W. Bush", respectively
 
For k = 3: 
The pair with maximum similarity is: (9, 10)
The titles are: "Barack Obama" and "George W. Bush", respectively
 
For k = 2: 
The pair with maximum similarity is: (1, 6)
 
For k = 1: 
The pair with maximum similarity is: (1, 2)
 


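The lowest k that does not change our answer for part d) is k = 3: for every k from 3 to 9 the most similar pair remains documents 9 and 10 ("Barack Obama" and "George W. Bush"). For k = 2 the most similar pair changes to documents (1, 6), and for k = 1 it changes to documents (1, 2).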
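The sweep over k can also be written without the double loop. The sketch below is one possible vectorized variant, not the method used above. It relies on the SVD identity $\Sigma_k^{-1}U_k^T\tilde{M} = V_k^T$, so the projected documents are simply the first k rows of $V^T$. The function name `most_similar_pair` and the random toy matrix are assumptions for illustration, and ties are broken by taking the first maximum, which may differ from the loop in degenerate cases.

```python
import numpy as np

def most_similar_pair(Vt, k):
    """Return (i, j, sim), 0-based with i < j, for the most similar pair at rank k."""
    P = Vt[:k, :]                           # column j is Sigma_k^{-1} U_k^T m_j
    P = P / np.linalg.norm(P, axis=0)       # scale each column to unit length
    G = P.T @ P                             # all pairwise cosine similarities
    iu = np.triu_indices(G.shape[0], k=1)   # index pairs with i < j
    best = np.argmax(G[iu])
    return int(iu[0][best]), int(iu[1][best]), float(G[iu][best])

# Toy term-by-document matrix with unit-norm columns (hypothetical data).
rng = np.random.default_rng(2)
M_toy = rng.random((20, 5))
M_toy /= np.linalg.norm(M_toy, axis=0)
_, _, Vt_toy = np.linalg.svd(M_toy, full_matrices=False)

# Most similar pair for each rank, from the largest k down to 1.
pairs_by_k = {k: most_similar_pair(Vt_toy, k)[:2] for k in range(4, 0, -1)}
print(pairs_by_k)
```

On the notebook's data this would presumably be called as `most_similar_pair(V_t, k)`, adding 1 to each index to match the 1-based document numbers printed above.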