# ECE367 Problem Set 3: Problem 3.10

## Tasks

- [x] Import `wordVecV.mat`. 
- [x] Calculate 'raw' term-by-document matrix $M$ based on $[M]_{i,j} = \mathbb{1}([V]_{i,j} > 0)$. 
- [x] Calculate $\tilde{M}$ (version of $M$ with each column scaled to unit Euclidean norm).
- [x] Calculate `svd` of $\tilde{M}$ and list 10 largest singular values in sorted order.
- [x] Use the cosine similarity from (b), $s_i = \Sigma^{-1}U^T x_i$; $\text{similarity}_{i, j} = (s_i \cdot s_j)/(\|s_i\| \|s_j\|)$, to compare every pair of document vectors (larger values mean more similar).
    - [ ] Use $k = 9$ rank approximation. Write down titles of most similar one.
- [ ] Repeat with $k = 8, 7, 6, ..., 1$. 
    - [ ] Write down lowest $k$ that does not change closest documents.
    - [ ] Repeat for $k-1$ and write most similar pair for that situation.

In [131]:
##############
# IMPORT BOX #
##############

using LinearAlgebra
using MAT

In [132]:
###################
# DATA IMPORT BOX #
###################
vars = matread("wordVecV.mat");
V = vars["V"];

In [133]:
###############################
# Calculating raw term matrix #
###############################

M = V .> 0;

In [134]:
###########################
# Normalizing Each Column #
###########################

M̃ = zeros(size(M))

for i = 1:size(M,2)
    M̃[:,i] = M[:,i]/norm(M[:,i])
end
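
The loop above scales each document column to unit length. A minimal vectorized sketch of the same idea follows, assuming no column is entirely zero (a zero column would produce `NaN` entries). The matrix `M_toy` and the names `col_norms`, `M̃_toy`, and `unit_cols` are hypothetical and exist only for illustration.

```julia
using LinearAlgebra

# Toy 0/1 term-by-document matrix (hypothetical data, 4 terms x 3 documents).
M_toy = [1 0 1;
         1 1 0;
         0 1 1;
         0 0 1]

# Euclidean norm of each column; `dims=1` sums down the rows of every column.
col_norms = sqrt.(sum(abs2, M_toy; dims=1))

# Broadcasting divides every column by its own norm.
M̃_toy = M_toy ./ col_norms

# Every column should now have unit length.
unit_cols = all(isapprox(norm(c), 1.0) for c in eachcol(M̃_toy))
println(unit_cols)
```

Broadcasting against the `1 x n` row of norms avoids the explicit loop, at the cost of being slightly less obvious about which axis is normalized.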

In [135]:
###############################
# Calculating SVD of tilde{M} #
###############################

U, σ, V = svd(M̃);  # Note: V now holds the right singular vectors and no longer the raw data matrix.
Σ = diagm(σ);

println("Largest 10 singular values: ");
println(σ);

Largest 10 singular values: 
[1.5366294177331445, 1.0192424086695382, 0.958684541435874, 0.9539129459951032, 0.9413064001927458, 0.9289078001291811, 0.8977405000640665, 0.8918819220380092, 0.8686645393885041, 0.8160833878423517]


In [136]:
######################################
# Determining Decomposition Validity #
######################################

println("Ensuring approximation validity")

M̃_approx = U*Σ*transpose(V)

println("Error of M̃_approx: ",norm(M̃-M̃_approx)/norm(M̃))

Ensuring approximation validity
Error of M̃_approx: 2.2142216960795045e-15
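
The check above reconstructs $\tilde{M}$ from all of its singular triplets, so the error is at machine precision. For a rank-$k$ truncation the relative Frobenius error is instead determined by the discarded singular values. The sketch below illustrates this on a small stand-in matrix; `A`, `A_k`, `rel_err`, and `predicted` are hypothetical names and the data is made up.

```julia
using LinearAlgebra

# Hypothetical stand-in for M̃: a fixed 5x4 matrix.
A = [1.0 0.0 2.0 1.0;
     0.0 1.0 1.0 0.0;
     1.0 1.0 0.0 2.0;
     2.0 0.0 1.0 1.0;
     0.0 2.0 1.0 0.0]

F = svd(A)   # F.U, F.S (singular values in descending order), F.V
k = 2

# Rank-k approximation built from the k largest singular triplets.
A_k = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.V[:, 1:k]'

# `norm` of a matrix is the Frobenius norm in Julia.
rel_err = norm(A - A_k) / norm(A)

# The same quantity computed from the discarded singular values alone.
predicted = sqrt(sum(abs2, F.S[k+1:end])) / sqrt(sum(abs2, F.S))

println(rel_err, " vs ", predicted)
```

The two printed numbers should agree up to rounding, which gives a quick sanity check on any truncation level before comparing documents.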


In [137]:
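# Project a document vector x onto the top-k latent directions: s = Σ_k⁻¹ U_kᵀ x.
function latent_encode(x, U, Σ, k)
    # Solve the k-by-k system rather than forming the inverse explicitly.
    s = Σ[1:k,1:k] \ (transpose(U[:, 1:k])*x);
    return s
end

# Cosine similarity between the latent encodings of x1 and x2.
# Larger values mean the two documents are more alike.
function get_latent_dist(x1, x2, U, Σ, k)
    s1 = latent_encode(x1, U, Σ, k);
    s2 = latent_encode(x2, U, Σ, k);
    denom =  norm(s1) * norm(s2);
    numerator = dot(s1,s2)
    return numerator/denom
end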

get_latent_dist (generic function with 2 methods)
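
When $x_i$ is itself a column of the decomposed matrix, the encoding $\Sigma_k^{-1} U_k^T x_i$ reduces to the first $k$ entries of row $i$ of the right singular vector matrix, because $U_k^T U \Sigma V^T e_i = [\Sigma_k \; 0] V^T e_i$. The sketch below checks this numerically on a made-up matrix, assuming the first $k$ singular values are nonzero; `A`, `encode`, `s_i`, and `matches` are hypothetical names.

```julia
using LinearAlgebra

# Hypothetical stand-in for M̃: a fixed 5x4 matrix.
A = [1.0 0.0 2.0 1.0;
     0.0 1.0 1.0 0.0;
     1.0 1.0 0.0 2.0;
     2.0 0.0 1.0 1.0;
     0.0 2.0 1.0 0.0]

F = svd(A)
k = 2
Uk = F.U[:, 1:k]
Σk = Diagonal(F.S[1:k])

# Encode a document column; `\` solves the diagonal system directly.
encode(x) = Σk \ (Uk' * x)

i = 3
s_i = encode(A[:, i])

# Compare with the first k entries of row i of V.
matches = isapprox(s_i, F.V[i, 1:k]; atol=1e-10)
println(matches)
```

One consequence is that, for documents already in the matrix, the latent vectors could be read off the truncated rows of the right singular vectors instead of being re-projected for every pair.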

In [138]:
function get_sim_mat(M̃, U, Σ, k)
    n = size(M̃,2)   # number of documents
    similarity_mat = zeros(n,n)
    
    for i = 1:n
        for j = 1:n
            if i == j
                # Ignore self-similarity so it never wins the ranking.
                similarity_mat[i,j] = 0
            else
                similarity_mat[i,j] = get_latent_dist(M̃[:,i], M̃[:,j], U, Σ, k)
            end
        end
    end
    
    return similarity_mat
end

get_sim_mat (generic function with 1 method)

In [139]:
sim_mat = get_sim_mat(M̃, U, Σ, 9);

In [140]:
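# Print the most similar document pairs, best first.
# The number of pairs printed equals the number of documents.
function rank_order_sim(sim_mat_in)
    sim_mat = copy(sim_mat_in)   # leave the caller's matrix untouched
    for i = 1:size(sim_mat,1)
        mx = argmax(sim_mat)     # CartesianIndex of the current best pair
        println(i,": ",mx)
        # Mask the pair and its mirror entry so it is not selected again.
        sim_mat[mx] = -1000
        sim_mat[mx[2],mx[1]] = -1000
    end
end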

rank_order_sim (generic function with 1 method)
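
The ranking above repeatedly takes the `argmax` and masks the chosen entry. An alternative sketch, not the approach used in this notebook, collects each unordered pair once and sorts by score, which also returns the scores for inspection. The matrix `S` and the function `top_pairs` are hypothetical.

```julia
# Hypothetical symmetric similarity matrix for 4 documents (zero diagonal).
S = [0.0 0.9 0.1 0.4;
     0.9 0.0 0.3 0.2;
     0.1 0.3 0.0 0.7;
     0.4 0.2 0.7 0.0]

# Return the `top` most similar pairs as (score, i, j) tuples with i > j.
function top_pairs(S, top)
    n = size(S, 1)
    # Visit only the strictly lower triangle so each pair appears once.
    pairs = [(S[i, j], i, j) for j in 1:n for i in j+1:n]
    sort!(pairs; by = first, rev = true)
    return pairs[1:min(top, length(pairs))]
end

best = top_pairs(S, 3)
for (score, i, j) in best
    println("(", i, ", ", j, "): ", score)
end
```

Keeping the score next to each pair makes it easier to see how close the runner-up pairs are when $k$ changes.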

In [141]:
function find_sim(M̃, U, Σ, k)
    sim_mat = get_sim_mat(M̃, U, Σ, k);
    println("For k = ",k,", the top similar vectors are: ")
    rank_order_sim(sim_mat)
end

find_sim (generic function with 1 method)

In [142]:
find_sim(M̃, U, Σ, 9)

For k = 9, the top similar vectors are: 
1: CartesianIndex(10, 9)
2: CartesianIndex(10, 8)
3: CartesianIndex(10, 7)
4: CartesianIndex(9, 5)
5: CartesianIndex(10, 1)
6: CartesianIndex(8, 5)
7: CartesianIndex(10, 4)
8: CartesianIndex(9, 6)
9: CartesianIndex(7, 5)
10: CartesianIndex(8, 6)


In [143]:
find_sim(M̃, U, Σ, 8)

For k = 8, the top similar vectors are: 
1: CartesianIndex(10, 9)
2: CartesianIndex(8, 2)
3: CartesianIndex(10, 8)
4: CartesianIndex(4, 2)
5: CartesianIndex(9, 5)
6: CartesianIndex(5, 2)
7: CartesianIndex(10, 7)
8: CartesianIndex(9, 4)
9: CartesianIndex(10, 1)
10: CartesianIndex(10, 4)


In [144]:
find_sim(M̃, U, Σ, 7) # Top-ranked pair is still (10, 9).

For k = 7, the top similar vectors are: 
1: CartesianIndex(10, 9)
2: CartesianIndex(10, 8)
3: CartesianIndex(6, 2)
4: CartesianIndex(8, 2)
5: CartesianIndex(4, 2)
6: CartesianIndex(8, 3)
7: CartesianIndex(3, 2)
8: CartesianIndex(8, 6)
9: CartesianIndex(9, 5)
10: CartesianIndex(5, 2)


In [149]:
find_sim(M̃, U, Σ, 6) # Top-ranked pair is still (10, 9).

For k = 6, the top similar vectors are: 
1: CartesianIndex(10, 9)
2: CartesianIndex(4, 2)
3: CartesianIndex(10, 8)
4: CartesianIndex(3, 2)
5: CartesianIndex(8, 3)
6: CartesianIndex(7, 6)
7: CartesianIndex(9, 5)
8: CartesianIndex(8, 2)
9: CartesianIndex(6, 4)
10: CartesianIndex(7, 5)


In [150]:
find_sim(M̃, U, Σ, 5) # Top-ranked pair is still (10, 9).

For k = 5, the top similar vectors are: 
1: CartesianIndex(10, 9)
2: CartesianIndex(4, 2)
3: CartesianIndex(10, 8)
4: CartesianIndex(9, 8)
5: CartesianIndex(6, 4)
6: CartesianIndex(3, 2)
7: CartesianIndex(5, 3)
8: CartesianIndex(7, 5)
9: CartesianIndex(7, 6)
10: CartesianIndex(8, 3)


In [151]:
find_sim(M̃, U, Σ, 4) # Top-ranked pair is still (10, 9).

For k = 4, the top similar vectors are: 
1: CartesianIndex(10, 9)
2: CartesianIndex(4, 2)
3: CartesianIndex(2, 1)
4: CartesianIndex(10, 8)
5: CartesianIndex(9, 8)
6: CartesianIndex(5, 3)
7: CartesianIndex(4, 1)
8: CartesianIndex(6, 4)
9: CartesianIndex(7, 5)
10: CartesianIndex(6, 1)


In [152]:
find_sim(M̃, U, Σ, 3) # Lowest k shown here that keeps (10, 9) as the top-ranked pair.

For k = 3, the top similar vectors are: 
1: CartesianIndex(10, 9)
2: CartesianIndex(6, 1)
3: CartesianIndex(10, 8)
4: CartesianIndex(7, 5)
5: CartesianIndex(4, 2)
6: CartesianIndex(9, 8)
7: CartesianIndex(6, 2)
8: CartesianIndex(5, 3)
9: CartesianIndex(2, 1)
10: CartesianIndex(7, 3)


In [153]:
find_sim(M̃, U, Σ, 2) # The most similar pair changes to (6, 1).

For k = 2, the top similar vectors are: 
1: CartesianIndex(6, 1)
2: CartesianIndex(10, 9)
3: CartesianIndex(6, 2)
4: CartesianIndex(2, 1)
5: CartesianIndex(7, 5)
6: CartesianIndex(10, 8)
7: CartesianIndex(3, 2)
8: CartesianIndex(4, 1)
9: CartesianIndex(9, 8)
10: CartesianIndex(6, 4)
