# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *X*

**Names:**

* *Linqi LIU*
* *Yifei SONG*
* *Ying Xu Dempster TAY*
* *Yuhang YAN*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds
from numpy.linalg import norm

## Exercise 4.4: Latent semantic indexing

In [2]:
import json

courses = []
with open('data/courses.txt', 'r') as file:
    for line in file:
        courses.append(json.loads(line.strip()))

In [3]:
# Load the TF-IDF matrix
tfidf_matrix = np.load('TFIDF_matrix.npy')

# Load document and term indices
document_indices = np.load('document_indices.npy', allow_pickle=True)
terms_indices = np.load('terms_indices.npy', allow_pickle=True)

# document_indices = document_indices.tolist() if isinstance(document_indices, np.ndarray) else document_indices
# terms_indices = terms_indices.tolist() if isinstance(terms_indices, np.ndarray) else terms_indices

In [4]:
print("Shape of TF-IDF matrix:", tfidf_matrix.shape)

Shape of TF-IDF matrix: (5262, 854)


In [5]:
K = 300

U, S, Vt = svds(tfidf_matrix, k=K)

print("U matrix shape:", U.shape)
print("S matrix shape:", S.shape)
print("Vt matrix shape:", Vt.shape)

U matrix shape: (5262, 300)
S matrix shape: (300,)
Vt matrix shape: (300, 854)


Description:
1. U Matrix  
Shape: (5262, 300)  
Rows: Each row corresponds to a term, represented by its "strength of membership" to each of the 300 latent factors.  
Columns: Each column corresponds to a latent factor, represented by the "strengths of membership" of each of the 5262 terms.  
2. Vt Matrix  
Shape: (300, 854)  
Rows: Each row corresponds to a latent factor, represented by its relevance to each of the 854 courses.  
Columns: Each column corresponds to a course, represented by the relevances of each of the 300 latent factors.  
3. S Matrix  
Shape: (300,) (since S is a diagonal matrix, it can be represented as a one-dimensional array)  
Values: S holds the 300 largest singular values, each measuring the importance of one latent factor. As returned by `svds` here they are in ascending order, so the array is reversed before the largest values are printed.
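
The shapes described above can be checked on a small example. The sketch below is only an illustration under stated assumptions: it uses a random 6 x 4 toy matrix in place of the TF-IDF matrix and NumPy's dense `np.linalg.svd` (which returns singular values in descending order) instead of `svds`. The names `X`, `U_k`, `S_k` and `Vt_k` are hypothetical.

```python
import numpy as np

# Toy stand-in for the TF-IDF matrix: 6 terms (rows) x 4 documents (columns).
rng = np.random.default_rng(0)
X = rng.random((6, 4))

# Full (thin) SVD, then keep only the k strongest latent factors.
k = 2
U_full, s_full, Vt_full = np.linalg.svd(X, full_matrices=False)
U_k = U_full[:, :k]    # (terms, k): one row per term
S_k = s_full[:k]       # (k,): the diagonal of S stored as a 1-D array
Vt_k = Vt_full[:k, :]  # (k, documents): one column per document

# Rank-k approximation of X built from the three factors.
X_k = U_k @ np.diag(S_k) @ Vt_k

# The Frobenius error equals the norm of the discarded singular values.
err = np.linalg.norm(X - X_k)
discarded = np.sqrt(np.sum(s_full[k:] ** 2))
print(U_k.shape, S_k.shape, Vt_k.shape)
print(np.isclose(err, discarded))
```

Storing `S` as a 1-D array is why products such as `S * Vt[:, d]` later in the notebook work element-wise without building the diagonal matrix.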

In [6]:
print("Top 20 singular values:")
sorted_singular_values = S[::-1]

for i, singular_value in enumerate(sorted_singular_values[:20], start=1):
    print(f"{i:>2}: {singular_value}")

Top 20 singular values:
 1: 3.5298219675119173
 2: 3.1064315388392276
 3: 2.072652694410268
 4: 1.8660719395787877
 5: 1.3928356151088297
 6: 1.3457022159036005
 7: 1.2734044343384363
 8: 1.2339605990753981
 9: 1.213415203613278
10: 1.174911340154373
11: 1.1696397934536829
12: 1.1672398683810892
13: 1.1580087026072818
14: 1.1292889056712745
15: 1.1137563771145385
16: 1.0927817325774516
17: 1.0875044048298685
18: 1.0785561640648744
19: 1.0552986431911284
20: 1.0403463282550436


## Exercise 4.5: Topic extraction

In [7]:
# Get the top 10 topics
top_k_topics = 10

# Get the top 10 terms for each topic
terms_per_topic = 10
top_terms_indices = np.argsort(-np.abs(U), axis=0)[:terms_per_topic]

# Get the top 10 documents for each topic
docs_per_topic = 10
top_docs_indices = np.argsort(-np.abs(Vt), axis=1)[:, :docs_per_topic]
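
Column `j` of `U` and row `j` of `Vt` belong to the singular value `S[j]`. Because `S` is in ascending order here, topic index 0 refers to the largest singular value only once all three factors have been reordered together. The following is a minimal sketch of that reordering, demonstrated on toy data; the helper names `order_factors_descending` and `top_indices_per_topic` are hypothetical and not part of the lab code.

```python
import numpy as np

def order_factors_descending(U, S, Vt):
    """Reorder U, S, Vt together so that index 0 is the largest singular value."""
    order = np.argsort(S)[::-1]
    return U[:, order], S[order], Vt[order, :]

def top_indices_per_topic(U, Vt, n_terms, n_docs):
    """Return, per topic, the indices of the terms/documents with largest |weight|."""
    top_terms = np.argsort(-np.abs(U), axis=0)[:n_terms].T  # (topics, n_terms)
    top_docs = np.argsort(-np.abs(Vt), axis=1)[:, :n_docs]  # (topics, n_docs)
    return top_terms, top_docs

# Toy data: 8 terms x 5 documents, factors deliberately put in ascending order.
rng = np.random.default_rng(1)
X = rng.random((8, 5))
U_d, S_d, Vt_d = np.linalg.svd(X, full_matrices=False)  # descending order
U_a, S_a, Vt_a = U_d[:, ::-1], S_d[::-1], Vt_d[::-1, :]  # mimic ascending order

U_s, S_s, Vt_s = order_factors_descending(U_a, S_a, Vt_a)
top_terms, top_docs = top_indices_per_topic(U_s, Vt_s, n_terms=3, n_docs=2)

# After reordering, row 0 of top_terms/top_docs describes the strongest topic.
print(S_s)
print(top_terms[0], top_docs[0])
```

Sorting with `np.argsort` rather than assuming a fixed order keeps the helper valid whichever order the solver happens to return.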

In [8]:
# Print the top 10 topics as a combination of 10 terms and 10 documents along with the singular values
print("Top 10 topics:")
for topic_idx in range(top_k_topics):
    singular_value = S[::-1][topic_idx]  # S is sorted in ascending order, so we reverse it
    print(f"Topic {topic_idx + 1} (Singular Value: {singular_value:.4f}):")
    print("  Top terms:", [terms_indices[i] for i in top_terms_indices[:, topic_idx]])
    print("  Top documents:", [document_indices[i] for i in top_docs_indices[topic_idx, :]])
    print()

Top 10 topics:
Topic 1 (Singular Value: 3.5298):
  Top terms: ['markov', 'biophys', 'turbul', 'hilbert', 'evolut', 'sustain', 'wing', 'pollut', 'genom', 'lipid']
  Top documents: ['CH-415', 'ENV-200', 'ME-467', 'MICRO-515', 'EE-432', 'EE-543', 'MSE-656', 'MSE-464', 'MATH-635', 'FIN-612']

Topic 2 (Singular Value: 3.1064):
  Top terms: ['frequenc', 'inequ', 'stabil', 'HF', 'sustain', 'fpga', 'real', 'engin', 'section', 'deriv']
  Top documents: ['CH-704', 'MATH-407', 'EE-603', 'EE-470', 'CS-473', 'CH-444', 'CIVIL-444', 'BIOENG-404', 'MATH-463', 'MSE-657']

Topic 3 (Singular Value: 2.0727):
  Top terms: ['wood', 'biophys', 'volatil', 'data', 'speci', 'rate', 'privaci', 'hpc', 'composit', 'sustain']
  Top documents: ['MSE-466', 'CH-415', 'MICRO-607', 'FIN-503', 'FIN-505', 'ENG-802', 'BIO-657', 'PHYS-301', 'ME-608', 'CH-311']

Topic 4 (Singular Value: 1.8661):
  Top terms: ['beam', 'IP', 'privaci', 'secur', 'learn', 'tribolog', 'planet', 'planetari', 'set', 'stress']
  Top documents: ['ME-

Label:  
Topic 1: Evolution - Theoretical and evolutionary concepts.   
Topic 2: Engineering - Engineering principles and computational methods.   
Topic 3: Data - Data analysis and biophysics.  
Topic 4: Security - Privacy and security in mechanical systems.  
Topic 5: Philosophy - Philosophical and computational science.  
Topic 6: Industry - Industrial applications and materials.  
Topic 7: Simulation - Negotiation and simulation techniques.  
Topic 8: Market - Market analysis and high-performance computing.  
Topic 9: Chemistry - Advanced chemistry and photochemistry.  
Topic 10: Wireless - Composite materials and wireless communication.  

## Exercise 4.6: Document similarity search in concept-space

In [9]:
# Document similarity search function using LSI concept-space
def sim_score(term_idx, doc_idx):
    U_t = U[term_idx]
    V_d = Vt[:, doc_idx]  # Note that Vt is the transpose of V in SVD
    sv = S * V_d
    return np.dot(U_t, sv) / (norm(U_t) * norm(sv))
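
The same score can be computed for every document in one pass instead of one call per document. This is a sketch under the assumption that no document vector has zero norm; `sim_scores_all_docs` and the toy arrays are hypothetical names used only for illustration.

```python
import numpy as np

def sim_scores_all_docs(U, S, Vt, term_idx):
    """Cosine similarity between one term and every document in concept space."""
    u_t = U[term_idx]             # (k,) term vector
    docs = S[:, np.newaxis] * Vt  # (k, n_docs); column d equals S * Vt[:, d]
    return (u_t @ docs) / (np.linalg.norm(u_t) * np.linalg.norm(docs, axis=0))

# Toy factors: 7 terms, 3 latent factors, 5 documents.
rng = np.random.default_rng(2)
U_toy = rng.standard_normal((7, 3))
S_toy = np.array([1.0, 2.0, 3.0])
Vt_toy = rng.standard_normal((3, 5))

scores = sim_scores_all_docs(U_toy, S_toy, Vt_toy, term_idx=4)
print(scores.shape)
```

The result is an array with one score per document, so it can be passed to `np.argsort` directly to rank the documents.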

In [11]:
# Convert courses to a dictionary for quick lookup by courseId
course_dict = {course['courseId']: course for course in courses}

# Convert term_ids to a dictionary
term_dict = {term: idx for idx, term in enumerate(terms_indices)}

# Get the index of the terms "markov" and "facebook"
markov_idx = term_dict["markov"]
facebook_idx = term_dict["facebook"]

# Compute the similarity of the query "markov chains" (represented here by the single term "markov") with every course
sim_markov = [sim_score(markov_idx, i) for i in range(Vt.shape[1])]

# Compute the similarity of "facebook" with every course
sim_facebook = [sim_score(facebook_idx, i) for i in range(Vt.shape[1])]

In [13]:
# Retrieve the 5 most similar courses for "markov chains"
top_markov = np.argsort(sim_markov)[-5:]
print("Top 5 courses for 'markov chains':")
for i in np.flip(top_markov):
    course_id = document_indices[i]
    course_name = course_dict[course_id]['name']
    print(f'{course_id} - {course_name} : {sim_markov[i]:.4f}')

Top 5 courses for 'markov chains':
MATH-332 - Applied stochastic processes : 0.7640
MGT-484 - Applied probability & stochastic processes : 0.7424
EE-605 - Statistical Sequence Processing : 0.7302
COM-516 - Markov chains and algorithmic applications : 0.6065
EE-516 - Data analysis and model classification : 0.3409


In [15]:
# Retrieve the 5 most similar courses for "facebook"
top_facebook = np.argsort(sim_facebook)[-5:]
print("Top 5 courses for 'facebook':")
for i in np.flip(top_facebook):
    course_id = document_indices[i]
    course_name = course_dict[course_id]['name']
    print(f'{course_id} - {course_name} : {sim_facebook[i]:.4f}')

Top 5 courses for 'facebook':
EE-727 - Computational Social Media : 0.9457
EE-593 - Social media : 0.6786
HUM-432(a) - How people learn I : 0.4248
COM-308 - Internet analytics : 0.3889
HUM-432(b) - How people learn II : 0.3075


For comparison, the Vector Space Model (VSM) from the previous part returned the following results:  
markov chains - top five courses with similarity scores  
('Applied stochastic processes', 0.557910816140559)  
('Applied probability & stochastic processes', 0.541839456832587)  
('Markov chains and algorithmic applications', 0.4210724606519116)  
('Supply chain management', 0.39279034730463636)  
('Mathematical models in supply chain management', 0.3076572352007134)  
   
facebook - top five courses with similarity scores  
('Computational Social Media', 0.18922283654097266)  
('Composites technology', 0.0)  
('Image Processing for Life Science', 0.0)  
('Global business environment', 0.0)  
('Electrochemical nano-bio-sensing and bio/CMOS interfaces', 0.0)   
   
**Markov Chains**  
Both models return relevant courses for "markov chains", and three courses appear in both top-five lists. Latent Semantic Indexing (LSI) assigns these three courses higher similarity scores (0.61 to 0.76, versus 0.42 to 0.56 for VSM) and replaces the two supply chain courses returned by VSM with "Statistical Sequence Processing" and "Data analysis and model classification", which suggests that it captures related topics better.  
   
**Facebook**  
For "facebook", VSM found only one course with a non-zero score ("Computational Social Media", 0.19); the other four have similarity 0. LSI ranks "Computational Social Media" and "Social media" first with much higher scores (0.95 and 0.68) and gives non-zero scores to further related courses, even though VSM scores them 0.  
   
For both "markov chains" and "facebook", the top-ranked course is the same under VSM and LSI.
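
The zero VSM scores and non-zero LSI scores for "facebook" can be reproduced on a toy example. The sketch below assumes a made-up 3-term, 2-document count matrix (not the course data) and a single latent factor; all names in it are hypothetical.

```python
import numpy as np

# Toy term-document matrix. Rows: "facebook", "social", "media".
# Document 0 contains "facebook" and "social"; document 1 contains "social" and "media".
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# VSM: the query "facebook" is a one-hot term vector compared with raw document columns.
query = np.array([1.0, 0.0, 0.0])
vsm_score = cosine(query, X[:, 1])  # document 1 never mentions "facebook"

# LSI: compare the term's row of U with the scaled document column S * Vt[:, d].
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 1
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]
lsi_score = cosine(U_k[0], S_k * Vt_k[:, 1])

print(vsm_score, lsi_score)
```

Document 1 shares the term "social" with document 0, so the latent factor links it to "facebook" although the term itself is absent. With a single factor every cosine is either +1 or -1, so the toy only shows that the score becomes non-zero, not how large it would be with more factors.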

## Exercise 4.7: Document-document similarity

Here we choose cosine similarity to determine the similarity between two documents:

In [16]:
# Define cosine similarity function
def cos_sim(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1, 2) * norm(vec2, 2))

In [18]:
# Get the index of the course "COM-308"
com_idx = np.where(document_indices == "COM-308")[0][0]

# Get the latent-space vector for COM-308
com_vec = S * Vt[:, com_idx]

# Compute the similarity of COM-308 with every other course
sims = []
for i in range(Vt.shape[1]):
    if i != com_idx:  # Exclude the course itself
        vec = S * Vt[:, i]
        sim = cos_sim(com_vec, vec)
        sims.append((i, sim))

# Sort the courses by similarity score in descending order
sims.sort(key=lambda x: x[1], reverse=True)

# Retrieve the 5 most similar courses to COM-308
top_5_courses = sims[:5]
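
The loop above can also be written as a single matrix-vector product over normalised document vectors. This is a sketch that assumes no document vector has zero norm; `most_similar_docs` and the toy arrays are hypothetical names.

```python
import numpy as np

def most_similar_docs(S, Vt, doc_idx, top_n=5):
    """Return (index, cosine similarity) pairs for the documents closest to doc_idx."""
    docs = (S[:, np.newaxis] * Vt).T  # (n_docs, k); row d equals S * Vt[:, d]
    unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = unit @ unit[doc_idx]       # cosine similarity with every document
    sims[doc_idx] = -np.inf           # exclude the document itself
    top = np.argsort(sims)[::-1][:top_n]
    return [(int(i), float(sims[i])) for i in top]

# Toy factors: 3 latent factors, 6 documents.
rng = np.random.default_rng(3)
S_toy = np.array([3.0, 2.0, 1.0])
Vt_toy = rng.standard_normal((3, 6))

result = most_similar_docs(S_toy, Vt_toy, doc_idx=2, top_n=3)
print(result)
```

Normalising once up front means each cosine reduces to a dot product, which avoids recomputing the norm of the query document for every comparison.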

Listed below are the five courses most similar to COM-308:

In [20]:
print("Top 5 courses most similar to COM-308:")
for idx, sim in top_5_courses:
    course_id = document_indices[idx]
    course_name = course_dict[course_id]['name']
    print(f'{course_id} - {course_name} : {sim:.4f}')

Top 5 courses most similar to COM-308:
EE-558 - A Network Tour of Data Science : 0.6277
CS-423 - Distributed information systems : 0.5555
CS-401 - Applied data analysis : 0.5472
EE-727 - Computational Social Media : 0.5153
COM-512 - Networks out of control : 0.5138
