# ECE367 Problem Set 3: Problem 3.10

## Tasks

- [x] Import `wordVecV.mat`. 
- [x] Calculate the 'raw' term-by-document matrix $M$ based on $[M]_{i,j} = \mathbb{1}([V]_{i,j} > 0)$.
- [x] Calculate $\tilde{M}$ (normalized version of $M$).
- [x] Calculate `svd` of $\tilde{M}$ and list 10 largest singular values in sorted order.
- [ ] Use the measure from (b), $s_i = \Sigma^{-1}U^T x_i$; $\text{distance}_{i, j} = (s_i \cdot s_j)/(\|s_i\| \|s_j\|)$, to compare every pair of document vectors. This is a cosine similarity, so a larger value means a closer pair.
    - [ ] Use the rank $k = 9$ approximation. Write down the titles of the most similar pair.
- [ ] Repeat with $k = 8, 7, 6, \ldots, 1$.
    - [ ] Write down the lowest $k$ that does not change the closest pair of documents.
    - [ ] Repeat for $k-1$ and write down the most similar pair in that case.

In [2]:
##############
# IMPORT BOX #
##############

using LinearAlgebra
using MAT

In [3]:
###################
# DATA IMPORT BOX #
###################
vars = matread("wordVecV.mat");
V = vars["V"];

In [4]:
###############################
# Calculating raw term matrix #
###############################

M = V .> 0;

In [5]:
############################
# Normalizing Every Column #
############################

M̃ = zeros(size(M))

# Each column is one document; divide it by the sum of its entries.
for i = 1:size(M,2)
    M̃[:,i] = M[:,i]/sum(M[:,i])
end
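The column-wise loop above can also be written as a single broadcast. The following is a minimal sketch on a small made-up matrix (`M_demo` and `M̃_demo` are hypothetical names, not the assignment data). It assumes, as in the cell above, that each column is divided by the sum of its entries.

```julia
# Toy 3×2 indicator matrix (made-up data standing in for M).
M_demo = [1 0; 1 1; 0 1] .> 0

# sum(M_demo, dims=1) is a 1×2 row of column sums;
# ./ broadcasts that row down every column.
M̃_demo = M_demo ./ sum(M_demo, dims=1)

println(sum(M̃_demo, dims=1))   # every column should now sum to 1
```

Note that a column containing no terms at all would have a zero sum and produce `NaN` entries under either version, so such a column would need separate handling.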

In [15]:
###############################
# Calculating SVD of tilde{M} #
###############################

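# svd returns the singular values σ already sorted in descending order.
# Note: this rebinds V (until now the raw data) to the right singular vectors.
U, σ, V = svd(M̃);
Σ = diagm(σ);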

println("Largest 10 singular values: ");
println(σ);

Largest 10 singular values: 
[0.11017223548493625, 0.07927613517189627, 0.07440785850204659, 0.06946862818360801, 0.06474427999716866, 0.060570119029771344, 0.060069118654154825, 0.054051285198182024, 0.0511543437422194, 0.04925926657142573]


In [33]:
######################################
# Determining Decomposition Validity #
######################################

println("Ensuring approximation validity")

M̃_approx = U*Σ*transpose(V)

println("Error of M̃_approx: ",norm(M̃-M̃_approx)/norm(M̃))

Ensuring approximation validity
Error of M̃_approx: 1.462643495439325e-15
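The check above rebuilds $\tilde{M}$ from all of its singular values. The lower-rank approximations in the remaining tasks keep only the first $k$. A minimal sketch on random data follows (`A`, `k_demo`, and `A_k` are illustrative names, not part of the notebook). By the Eckart–Young theorem, the Frobenius-norm error of the truncation should equal the root-sum-square of the discarded singular values.

```julia
using LinearAlgebra
using Random

Random.seed!(0)              # fixed seed so the sketch is repeatable
A = rand(6, 4)               # made-up matrix standing in for M̃
F = svd(A)                   # thin SVD; F.S is sorted in descending order

k_demo = 2
# Keep only the first k_demo singular triplets.
A_k = F.U[:, 1:k_demo] * Diagonal(F.S[1:k_demo]) * F.Vt[1:k_demo, :]

err      = norm(A - A_k)                        # Frobenius norm of the residual
expected = sqrt(sum(abs2, F.S[k_demo+1:end]))   # root-sum-square of the discarded σ
println(err ≈ expected)
```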


In [60]:
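# Project a document vector x onto the first k latent directions:
# s = Σ_k^{-1} U_k^T x, with U_k the first k columns of U.
function latent_encode(x, U, Σ, k)
    s = inv(Σ[1:k,1:k])*transpose(U[:, 1:k])*x;
    return s
end

# Despite the name, this returns the cosine similarity of the two latent
# encodings, so a larger value means the documents are closer.
function get_latent_dist(x1, x2, U, Σ, k)
    s1 = latent_encode(x1, U, Σ, k);
    s2 = latent_encode(x2, U, Σ, k);
    denom =  norm(s1) * norm(s2);
    numerator = dot(s1,s2)
    return numerator/denom
end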

get_latent_dist (generic function with 2 methods)

In [66]:
get_latent_dist(M̃[:,1], M̃[:,6], U, Σ, 10)

2.6020852139652103e-18

In [67]:
latent_encode(M̃[:,1], U, Σ, 10) # Testing latent encoding system -- it is equal to the first row of V!

10-element Array{Float64,1}:
 -0.200974785103816
 -0.059353539813775266
 -0.1055986458917382
  0.25006662279233244
 -0.3529058582888234
  0.8230737566133176
 -0.2160049415708304
  0.17840878546651212
 -0.001306445955263835
  0.043811108857229414
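The match with the first row of $V$ is not a coincidence. A short derivation follows, using notation introduced here: $e_j$ is the $j$-th standard basis vector, $x_j = \tilde{M} e_j$ is the $j$-th document column, $U_k$ is the first $k$ columns of $U$, and $\Sigma_k$ is the leading $k \times k$ block of $\Sigma$ (assumed invertible, which holds here because all ten singular values are positive).

$$
s_j^{(k)} = \Sigma_k^{-1} U_k^T x_j
= \Sigma_k^{-1} U_k^T U \Sigma V^T e_j
= \Sigma_k^{-1} \begin{bmatrix} I_k & 0 \end{bmatrix} \Sigma V^T e_j
= \begin{bmatrix} I_k & 0 \end{bmatrix} V^T e_j
$$

So $s_j^{(k)}$ is the first $k$ entries of row $j$ of $V$. With $k = 10$ the encoding is the whole row, and since $V$ is a $10 \times 10$ orthogonal matrix its rows are orthonormal. Every pair of distinct documents then has cosine similarity zero up to rounding, which is why the $k = 10$ test above returns a value of order $10^{-18}$. Only a truncation with $k < 10$ can separate similar documents from dissimilar ones.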

In [68]:
function get_sim_mat(M̃, U, Σ, k)
    n = size(M̃, 2)                  # number of documents (columns)
    similarity_mat = zeros(n, n)

    for i = 1:n
        for j = 1:n
            # The diagonal stays 0 so a document is never its own closest match.
            if i != j
                similarity_mat[i,j] = get_latent_dist(M̃[:,i], M̃[:,j], U, Σ, k)
            end
        end
    end

    return similarity_mat
end

get_sim_mat (generic function with 1 method)

In [69]:
sim_mat = get_sim_mat(M̃, U, Σ, 10)

10×10 Array{Float64,2}:
  1.0           4.12864e-16  -9.15067e-17  …   3.33067e-16   3.46945e-16
  4.12864e-16   1.0          -1.34506e-15     -2.48065e-16   8.67362e-17
 -9.15067e-17  -1.34506e-15   1.0             -4.84855e-16  -7.35523e-16
 -1.40946e-18  -9.87166e-17   2.46222e-16      1.29237e-16  -3.40006e-16
  3.33934e-17  -5.00468e-16  -1.10675e-15     -5.10009e-16   1.66533e-16
  2.60209e-18  -2.29851e-16  -2.33293e-16  …   7.32053e-16  -1.38778e-17
 -7.80842e-16   7.75205e-17  -6.94974e-17      1.56125e-16  -2.91434e-16
  2.88831e-16   1.97325e-17   3.91072e-16     -5.13478e-16  -1.41553e-15
  3.33067e-16  -2.48065e-16  -4.84855e-16      1.0          -1.11022e-16
  3.46945e-16   8.67362e-17  -7.35523e-16     -1.11022e-16   1.0

0.0
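The unchecked tasks (sweeping $k$ downward and reporting the most similar pair at each rank) could be approached along the following lines. This is only a sketch under stated assumptions: `latent_similarity`, `most_similar_pair`, and `X_demo` are hypothetical names, the data is made up, and the similarity is the same cosine measure used above.

```julia
using LinearAlgebra

# Hypothetical helper: cosine similarities between the rank-k latent codes of
# the columns of X, with the diagonal set to -Inf to exclude self-matches.
function latent_similarity(X, k)
    F = svd(X)
    S = Diagonal(F.S[1:k]) \ (F.U[:, 1:k]' * X)   # column j is Σ_k⁻¹ U_kᵀ x_j
    S = S ./ sqrt.(sum(abs2, S, dims=1))          # scale every column to unit length
    C = S' * S                                    # C[i, j] = cosine similarity of codes i and j
    C[diagind(C)] .= -Inf
    return C
end

# Hypothetical helper: indices (i, j) with i < j of the most similar pair of columns.
function most_similar_pair(X, k)
    i, j = Tuple(argmax(latent_similarity(X, k)))
    return minmax(i, j)
end

# Made-up data: columns 1 and 2 are nearly parallel, column 3 is different.
X_demo = [1.0 0.9 0.0;
          1.0 1.1 0.0;
          0.0 0.1 1.0;
          0.0 0.0 1.0]

for k in 2:-1:1
    println("k = ", k, ": most similar pair = ", most_similar_pair(X_demo, k))
end
```

With the notebook's variables, the analogous call would be `most_similar_pair(M̃, k)` for `k` in `9:-1:1`, with the returned indices then mapped to document titles. At $k = 1$ every latent code is a scalar, so each cosine similarity is $\pm 1$ and the reported pair is decided by ties and rounding rather than by content.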