# ECE367 Problem Set 3: Problem 3.10

## Tasks

- [x] Import `wordVecV.mat`. 
- [x] Calculate the 'raw' term-by-document matrix $M$, where $[M]_{i,j} = \mathbb{1}([V]_{i,j} > 0)$. 
- [x] Calculate $\tilde{M}$ (normalized version of $M$).
- [x] Calculate `svd` of $\tilde{M}$ and list 10 largest singular values in sorted order.
- [x] Use the measure from (b), $s_i= \Sigma^{-1}U^T x_i$ and $\text{distance}_{i, j} = (s_i \cdot s_j)/(|s_i| |s_j|)$ (a cosine similarity, so larger values mean closer documents), to compare every pair of document vectors.
    - [ ] Use the rank $k = 9$ approximation. Write down the titles of the most similar pair.
- [ ] Repeat with $k = 8, 7, 6, \dots, 1$. 
    - [ ] Write down the lowest $k$ that does not change the closest pair of documents.
    - [ ] Repeat for $k-1$ and write down the most similar pair in that case.
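
As a worked note on the measure above (a sketch that assumes $U_k$, $\Sigma_k$, $V_k$ denote the first $k$ singular vectors and values of $\tilde{M}$, which is notation introduced here rather than taken from the problem statement), the rank-$k$ version reads:

$$
\tilde{M} \approx U_k \Sigma_k V_k^T, \qquad s_i = \Sigma_k^{-1} U_k^T x_i, \qquad \text{distance}_{i,j} = \frac{s_i \cdot s_j}{|s_i|\,|s_j|}
$$

When $x_i$ is itself the $i$-th column of $\tilde{M}$, the identity $U_k^T \tilde{M} = \Sigma_k V_k^T$ gives $s_i = V_k^T e_i$, so each latent vector is simply the $i$-th column of $V_k^T$.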

In [2]:
##############
# IMPORT BOX #
##############

using LinearAlgebra
using MAT

In [3]:
###################
# DATA IMPORT BOX #
###################
vars = matread("wordVecV.mat");
V = vars["V"];

In [4]:
###############################
# Calculating raw term matrix #
###############################

M = V .> 0;

In [5]:
###########################
# Normalizing Each Column #
###########################

M̃ = zeros(size(M))

# Divide each column (one document) by its sum so that its entries add up to 1.
for i = 1:size(M,2)
    M̃[:,i] = M[:,i]/sum(M[:,i])
end
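
The same column-wise normalization can probably be written as one broadcast. The sketch below uses a made-up 3×2 matrix (`M_toy` and `Mn_toy` are hypothetical names, not data from `wordVecV.mat`) and assumes every column has at least one nonzero entry, since a zero column would divide by zero.

```julia
# Toy indicator matrix: 3 terms × 2 documents (illustrative data only).
M_toy = [1 0; 1 1; 0 1] .> 0

# Divide every column by its own sum; sum(..., dims = 1) returns a 1×2 row
# that broadcasts across the rows of M_toy.
Mn_toy = M_toy ./ sum(M_toy, dims = 1)

# Each column of the result should now sum to 1.
@assert all(sum(Mn_toy, dims = 1) .≈ 1)
```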

In [15]:
###############################
# Calculating SVD of tilde{M} #
###############################

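# Note: this rebinds V (previously the raw data matrix) to the right singular vectors.
U, σ, V = svd(M̃);
Σ = diagm(σ);

# svd returns the singular values sorted in descending order.
println("Largest 10 singular values: ");
println(σ);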

Largest 10 singular values: 
[0.11017223548493625, 0.07927613517189627, 0.07440785850204659, 0.06946862818360801, 0.06474427999716866, 0.060570119029771344, 0.060069118654154825, 0.054051285198182024, 0.0511543437422194, 0.04925926657142573]


In [33]:
######################################
# Determining Decomposition Validity #
######################################

println("Ensuring approximation validity")

M̃_approx = U*Σ*transpose(V)

println("Error of M̃_approx: ",norm(M̃-M̃_approx)/norm(M̃))

Ensuring approximation validity
Error of M̃_approx: 1.462643495439325e-15


In [105]:
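# Project a document vector x into the k-dimensional latent space: s = Σ_k⁻¹ U_kᵀ x.
function latent_encode(x, U, Σ, k)
    s = inv(Σ[1:k,1:k])*transpose(U[:, 1:k])*x;
    return s
end

# Cosine similarity between the latent encodings of x1 and x2
# (named "dist" here, but larger values mean more similar documents).
function get_latent_dist(x1, x2, U, Σ, k)
    s1 = latent_encode(x1, U, Σ, k);
    s2 = latent_encode(x2, U, Σ, k);
    denom = norm(s1) * norm(s2);
    numerator = dot(s1,s2)
    return numerator/denom
end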

get_latent_dist (generic function with 2 methods)
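
To illustrate what the encoding does, here is a minimal sketch on a made-up 4×3 matrix (`A`, `F`, `S`, and `cos12` are hypothetical names, and the data is not from `wordVecV.mat`). It encodes all columns at once and checks that, for columns of the decomposed matrix itself, the latent vectors coincide with the first $k$ rows of $V^T$.

```julia
using LinearAlgebra

# Toy 4-term × 3-document matrix (illustrative data only).
A = [1.0 0.0 1.0;
     1.0 1.0 0.0;
     0.0 1.0 1.0;
     1.0 0.0 0.0]

F = svd(A)   # thin SVD: A ≈ F.U * Diagonal(F.S) * F.Vt
k = 2

# Latent coordinates of every document at once: Σ_k⁻¹ U_kᵀ A (a k × n matrix).
S = Diagonal(F.S[1:k]) \ (F.U[:, 1:k]' * A)

# The columns of A are the decomposed documents, so S should equal V_kᵀ.
@assert S ≈ F.Vt[1:k, :]

# Cosine similarity between documents 1 and 2 in the latent space.
cos12 = dot(S[:, 1], S[:, 2]) / (norm(S[:, 1]) * norm(S[:, 2]))
```

Solving with `Diagonal(...) \` avoids forming an explicit inverse, which is a design choice of this sketch rather than something the notebook does.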

In [108]:
function get_sim_mat(M̃, U, Σ, k)
    n = size(M̃,2)  # number of documents
    similarity_mat = zeros(n,n)
    
    for i = 1:n
        for j = 1:n
            similarity_mat[i,j] = get_latent_dist(M̃[:,i], M̃[:,j], U, Σ, k)
            if i == j
                # Zero the diagonal so a document is never ranked as most similar to itself.
                similarity_mat[i,j] = 0
            end
        end
    end
    
    return similarity_mat
end

get_sim_mat (generic function with 1 method)

In [109]:
sim_mat = get_sim_mat(M̃, U, Σ, 9)

10×10 Array{Float64,2}:
  0.0          -0.00101087   -0.000520564  …  -0.033852     0.0553758
 -0.00101087    0.0          -0.000273633     -0.0177942    0.0291082
 -0.000520564  -0.000273633   0.0             -0.00916337   0.0149896
  0.0003923     0.000206212   0.000106191      0.00690558  -0.0112963
  0.00196948    0.00103525    0.000533116      0.0346683   -0.0567111
 -0.00089684   -0.000471421  -0.000242765  …  -0.0157869    0.0258245
 -0.00184979   -0.000972337  -0.000500718     -0.0325615    0.0532648
 -0.00315669   -0.0016593    -0.000854481     -0.0555665    0.0908968
 -0.033852     -0.0177942    -0.00916337       0.0          0.974769
  0.0553758     0.0291082     0.0149896        0.974769     0.0

In [110]:
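# Print the most similar document pairs in descending order of similarity
# (as many pairs as there are documents).
function rank_order_sim(sim_mat_in)
    sim_mat = copy(sim_mat_in)  # work on a copy so the caller's matrix is untouched
    for i = 1:size(sim_mat,1)
        mx = argmax(sim_mat)
        println(i,": ",mx)
        # Mask the pair and its mirror entry so it is not reported again.
        sim_mat[mx] = -1000
        sim_mat[mx[2],mx[1]] = -1000
    end
end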

rank_order_sim (generic function with 1 method)
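
An alternative to repeatedly masking the maximum is to collect each unordered pair once and sort. The helper below is a hypothetical sketch (`top_pairs` and `toy` are names introduced here), assuming the similarity matrix is symmetric so that only entries with `i < j` need to be examined.

```julia
# Hypothetical helper: return the n most similar pairs as (similarity, i, j), with i < j.
function top_pairs(sim::AbstractMatrix, n::Integer)
    m = size(sim, 1)
    # Collect each unordered pair once from the strict upper triangle.
    pairs = [(sim[i, j], i, j) for i in 1:m for j in (i + 1):m]
    sort!(pairs; by = first, rev = true)   # highest similarity first
    return pairs[1:min(n, length(pairs))]
end

# Toy symmetric similarity matrix (illustrative data only).
toy = [0.0 0.2 0.9;
       0.2 0.0 0.5;
       0.9 0.5 0.0]

top_pairs(toy, 2)   # → [(0.9, 1, 3), (0.5, 2, 3)]
```

Unlike the masking loop, this version returns the similarity values alongside the indices and does not rely on a sentinel such as `-1000`.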

In [126]:
function find_sim(M̃, U, Σ, k)
    sim_mat = get_sim_mat(M̃, U, Σ, k);
    println("For k = ",k,", the top similar vectors are: ")
    rank_order_sim(sim_mat)
end

find_sim (generic function with 1 method)

In [127]:
find_sim(M̃, U, Σ, 9)

For k = 9, the top similar vectors are: 
1: CartesianIndex(10, 9)
2: CartesianIndex(10, 8)
3: CartesianIndex(10, 1)
4: CartesianIndex(10, 7)
5: CartesianIndex(9, 5)
6: CartesianIndex(10, 2)
7: CartesianIndex(10, 6)
8: CartesianIndex(10, 3)
9: CartesianIndex(9, 4)
10: CartesianIndex(8, 5)


In [128]:
find_sim(M̃, U, Σ, 8)

For k = 8, the top similar vectors are: 
1: CartesianIndex(10, 9)
2: CartesianIndex(6, 4)
3: CartesianIndex(4, 2)
4: CartesianIndex(10, 8)
5: CartesianIndex(10, 1)
6: CartesianIndex(10, 7)
7: CartesianIndex(9, 5)
8: CartesianIndex(4, 3)
9: CartesianIndex(7, 6)
10: CartesianIndex(8, 6)


In [129]:
find_sim(M̃, U, Σ, 7) # Lowest k that keeps (10, 9) as the top-ranked pair.

For k = 7, the top similar vectors are: 
1: CartesianIndex(10, 9)
2: CartesianIndex(6, 4)
3: CartesianIndex(4, 2)
4: CartesianIndex(6, 2)
5: CartesianIndex(10, 6)
6: CartesianIndex(6, 1)
7: CartesianIndex(9, 6)
8: CartesianIndex(10, 4)
9: CartesianIndex(4, 1)
10: CartesianIndex(9, 4)


In [130]:
find_sim(M̃, U, Σ, 6) # k - 1 = 6: the top-ranked pair changes to (6, 4).

For k = 6, the top similar vectors are: 
1: CartesianIndex(6, 4)
2: CartesianIndex(10, 9)
3: CartesianIndex(4, 2)
4: CartesianIndex(10, 6)
5: CartesianIndex(9, 6)
6: CartesianIndex(9, 4)
7: CartesianIndex(6, 2)
8: CartesianIndex(10, 4)
9: CartesianIndex(9, 2)
10: CartesianIndex(10, 2)
