# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *H*

**Names:**

* *Baffou Jérémy*
* *Basseto Antoine*
* *Pinto Andrea*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [16]:
from utils import load_json
import pickle
import numpy as np
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt

We import the Term-Document matrix computed in the first part of the lab. We also keep track of the mappings we used, and their values for further use.

In [124]:
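TFIDF = np.load("TFIDF.npy")

# Mappings saved in part 1; the context managers close the files after loading
with open("terms_ids.pkl", "rb") as a_file:
    terms_ids = pickle.load(a_file)
with open("course_ids.pkl", "rb") as b_file:
    course_ids = pickle.load(b_file)

# Key lists, used to map an index back to its term / course
terms_keys = list(terms_ids.keys())
doc_keys = list(course_ids.keys())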

## Exercise 4.4: Latent semantic indexing

We will use the function `svds` from `scipy.sparse.linalg` to compute the SVD with k=300 as the target rank.

In [6]:
u,s,v_T = svds(TFIDF,300)

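The problem with this function is that its outputs are not ordered the way we need them: the singular values have to be reversed to be in decreasing order, and the signs of some singular vectors differ from those returned by np.linalg.svd(). We need to manipulate the outputs a bit to retrieve the correct matrices. After experimenting and comparing with np.linalg.svd(), we obtained the following transformations: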

In [7]:
U = -u[:,::-1]*np.asarray([-1]+np.ones(u.shape[1]-1).tolist())

In [8]:
S = s[::-1]

In [9]:
V_T = (v_T[::-1].T*np.asarray([1]+(-1 * np.ones(v_T.shape[0]-1)).tolist())).T

The columns of $U$ are the eigenvectors of the term-term matrix $XX^T$, where $X$ is the TF-IDF matrix (so they are based on words), ordered by decreasing amount of variance they capture. The rows of $V^T$, i.e. the columns of $V$, are the eigenvectors of the document-document matrix $X^TX$ (so they are based on documents), ordered in the same way. The values of $S$ are the singular values of the TF-IDF matrix: each one indicates how much the corresponding pair of vectors from $U$ and $V$ contributes to reconstructing the original TF-IDF matrix.
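The ordering issue and the eigenvector interpretation can be checked on a small example. The sketch below is only an illustration on a random toy matrix (`X_toy`, `k` and the other `*_toy` names are made up for this example and are not the lab data). It sorts the output of `svds` explicitly rather than assuming a particular return order, then compares it with the dense SVD.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
X_toy = rng.random((8, 6))  # hypothetical small terms x documents matrix
k = 3

u_toy, s_toy, vt_toy = svds(X_toy, k=k)

# Sort explicitly in decreasing order instead of relying on svds' output order
order = np.argsort(s_toy)[::-1]
U_toy, S_toy, Vt_toy = u_toy[:, order], s_toy[order], vt_toy[order, :]

# The k values should match the k largest singular values of the dense SVD
s_dense = np.linalg.svd(X_toy, compute_uv=False)
print(np.allclose(S_toy, s_dense[:k]))

# Column j of U is an eigenvector of X X^T with eigenvalue S[j]**2
print(np.allclose(X_toy @ X_toy.T @ U_toy, U_toy * S_toy**2))

# Row j of V^T is an eigenvector of X^T X with the same eigenvalue
print(np.allclose(X_toy.T @ X_toy @ Vt_toy.T, Vt_toy.T * S_toy**2))
```

Singular vectors are only defined up to a sign, so a column of $U$ and the matching row of $V^T$ can be flipped together without changing the decomposition. This is why a direct element-wise comparison with `np.linalg.svd` may show sign differences.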

In [10]:
for singular_value in S[:20]:
    print(singular_value)

3.1960467509505595
3.1372127474371307
2.9790206315696195
2.1023634847769848
1.191793531134566
1.168869187010023
1.1194782959086742
1.0715814332815714
1.048234637445236
1.0411024305240602
1.0310886533235462
1.0257202178029305
0.9993423732670951
0.9448882250539762
0.9312196332599237
0.9136444774029263
0.906949117562127
0.8973318146743732
0.8955685993325307
0.8899137209771941


## Exercise 4.5: Topic extraction

# @pinto and basseto 

I am not sure whether we should give 10 terms and 10 docs for each topic here, or 10 terms and 10 docs **in total** (i.e. one of each per topic).

We know that the SVD maps our documents into a lower-rank representation, i.e. a sub-space called the latent space. The columns of $U$, the values of $S$ and the rows of $V^T$ are ordered by importance (i.e. by the amount of variance of the original data they capture). Thus the first **column** of $U$ gives a "doc" whose combination of terms captures a lot of variance. Likewise, the first **row** of $V^T$ gives a weighted group of docs which captures a lot of variance. To describe each topic, we therefore keep the 10 largest weights in each of the first 10 columns of $U$ (terms) and in each of the first 10 rows of $V^T$ (docs).

In [129]:
def topic_extraction(index):
    """Print the 10 highest-weighted terms and courses of the topic at `index`."""
    # Indices of the 10 largest entries of column `index` of U, largest first
    best_terms_indices = U[:,index].argsort()[-10:][::-1]
    print("Terms in topic :")
    print(list(map(lambda l : terms_keys[l],best_terms_indices)))
    print("Courses in topic :")
    # Indices of the 10 largest entries of row `index` of V_T, largest first
    best_courses_indices = V_T[index,:].argsort()[-10:][::-1]
    print(list(map(lambda l : doc_keys[l], best_courses_indices)))
    print("--------------------------------")

In [131]:
for i in range(10):
    print(f"The {i+1}th topic is composed of :")
    topic_extraction(i)

The 1th topic is composed of :
Terms in topic :
['dilution solution', 'predict major', 'content introduction dilute', 'rubber transition rubber', 'concetrated solution glass', 'concetrated solution bulk', 'rubber transition', 'principle result', 'principle result chainlike', 'concentrate solution phase']
Courses in topic :
['CH-332', 'MGT-690(B)', 'MGT-690(A)', 'MSE-431', 'PHYS-708', 'PHYS-709', 'PHYS-610', 'CH-710', 'COM-404', 'ME-705']
--------------------------------
The 2th topic is composed of :
Terms in topic :
['project ic laboratory', 'project ic', 'ic laboratory', 'semester project ic', 'semester project', 'ic', 'semiconductor device wiley', 'scale ballistic', 'outline schematic', 'outline schematic layout']
Courses in topic :
['CS-699(2)', 'CS-699(1)', 'MICRO-432', 'PHYS-709', 'MGT-609', 'MATH-400', 'MGT-690(B)', 'MGT-690(A)', 'PHYS-731', 'PHYS-708']
--------------------------------
The 3th topic is composed of :
Terms in topic :
['administration enrollment', 'edmt administra

We can give the following titles for the 10 topics:
- Chemistry
- Computer Science Project
- Administration ?
- Ph.D project
- Ph.D project
- Field Theory
- Combinatorial Field
- Plasma Physics
- Manufacturing of micro-components
- Manufacturing of micro-components

## Exercise 4.6: Document similarity search in concept-space

We implement a similarity function between a term and a document, as stated in the handout:

In [13]:
def sim(t,d):
    output = U[t,:] @ np.diag(S)  @ V_T[:,d]
    norm_factor = np.linalg.norm(U[t,:]) * np.linalg.norm(np.diag(S) @ V_T[:,d])
    return output/norm_factor

Then we create a search function which computes the top `num` courses that match the given list of words. Note that aggregating the similarities by summing them is not really reliable for comparing documents, but we only wanted to try it with a few terms.

In [133]:
def search_function(words_list,num=5):
    query_result = np.zeros(V_T.shape[1])
    for i in range(V_T.shape[1]):
        for j in range(len(words_list)):
            query_result[i] += sim(terms_ids[words_list[j]],i)
    best_fit = query_result.argsort()[-num:][::-1]
    for k in best_fit:
        print(f"Course ID : {doc_keys[k]}, Similarity Score : {query_result[k]}")

In [135]:
search_function(["facebook"])

Course ID : EE-727, Similarity Score : 0.9965330556538541
Course ID : EE-593, Similarity Score : 0.9327644994447769
Course ID : CS-486, Similarity Score : 0.5517138088494876
Course ID : COM-308, Similarity Score : 0.414630098120584
Course ID : MGT-401, Similarity Score : 0.4130553043559283


In [136]:
search_function(["markov chain"])

Course ID : MATH-332, Similarity Score : 0.8540010046589795
Course ID : COM-516, Similarity Score : 0.8487972977682151
Course ID : MATH-600, Similarity Score : 0.7821969940408016
Course ID : MGT-484, Similarity Score : 0.5951955054703466
Course ID : ME-499, Similarity Score : 0.40679820401054506


Here we run a query similar to the one from the previous part:

In [134]:
search_function(["facebook","markov chain"])

Course ID : EE-727, Similarity Score : 1.021507248010081
Course ID : EE-593, Similarity Score : 0.9078726858823052
Course ID : COM-516, Similarity Score : 0.8757003297538931
Course ID : MATH-332, Similarity Score : 0.8528451337074313
Course ID : MATH-600, Similarity Score : 0.7858960691559022


The results are really good, better than with the TF-IDF approach alone. LSI captures the idea of a social network well, and so gives enough importance to facebook for relevant courses to rank higher than before (e.g. EE-593).
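The double loop in `search_function` calls `sim` once per (term, document) pair. As a possible alternative, the similarity of one term with every document can be computed in a single vectorized step. The sketch below is self-contained and uses random toy matrices (`U_toy`, `S_toy`, `V_T_toy` and the helper `term_doc_similarities` are names made up for this illustration). It applies the same formula as `sim`.

```python
import numpy as np

def term_doc_similarities(U, S, V_T, t):
    """Cosine similarity between term `t` and every document at once."""
    docs = S[:, None] * V_T          # same as np.diag(S) @ V_T, shape (k, n_docs)
    scores = U[t, :] @ docs          # one dot product per document
    norms = np.linalg.norm(U[t, :]) * np.linalg.norm(docs, axis=0)
    return scores / norms

# Toy factors: 5 terms, 3 latent dimensions, 4 documents (illustrative only)
rng = np.random.default_rng(0)
U_toy = rng.normal(size=(5, 3))
S_toy = np.array([3.0, 2.0, 1.0])
V_T_toy = rng.normal(size=(3, 4))

scores = term_doc_similarities(U_toy, S_toy, V_T_toy, t=2)
print(scores.shape)  # one score per document
```

For a multi-word query, the per-term score vectors can then be summed, which corresponds to the aggregation used in `search_function`.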

## Exercise 4.7: Document-document similarity

The function we use to compare two courses in our latent space is $\text{cos-sim}(S \cdot V^T_{d_1}, S \cdot V^T_{d_2})$, where $V^T_{d_i}$ denotes the column of $V^T$ associated with document $d_i$, and cos-sim is the cosine similarity:

$\text{cos-sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$

In [123]:
def sim_documents(d1,d2):
    """Cosine similarity between two documents, given their two ids."""
    # Scale each document's latent coordinates by the singular values
    doc_1 = np.diag(S) @ V_T[:,d1]
    doc_2 = np.diag(S) @ V_T[:,d2]
    return (doc_1 @ doc_2)/(np.linalg.norm(doc_1)*np.linalg.norm(doc_2))

def course_recommender(course,num=5):
    """Print the num courses closest to the given course in the latent space (in the cosine sense)."""
    course_id = course_ids[course]
    corpus_similarity = np.zeros(TFIDF.shape[1])
    for i in range(TFIDF.shape[1]):
        # Skip the course itself, whose similarity stays at 0
        if i != course_id:
            corpus_similarity[i] = sim_documents(i,course_id)
    best_fit = corpus_similarity.argsort()[-num:][::-1]
    for k in best_fit:
        print(doc_keys[k])

In [121]:
course_recommender("COM-308")

FIN-525
CS-423
CS-401
EE-724
CS-322


The recommendations are actually pretty good! (Except maybe the last one, which does not exactly capture the essence of the course, but it is still an important course.)
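As a side note, the document-document comparison can also be written without an explicit loop by normalizing all scaled document vectors once. The sketch below uses hand-made toy factors (`S_toy`, `V_T_toy` and the helper `top_similar_documents` are hypothetical names for this illustration, not part of the lab code). Unlike `course_recommender`, it excludes the query document by setting its score to `-inf` instead of leaving it at 0, so the query can never outrank documents with negative similarity.

```python
import numpy as np

def top_similar_documents(S, V_T, d, num=5):
    """Indices of the `num` documents closest to document `d` (cosine sense)."""
    docs = (S[:, None] * V_T).T                        # shape (n_docs, k)
    unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = unit @ unit[d]                              # cosine with every document
    sims[d] = -np.inf                                  # never recommend the query itself
    return np.argsort(sims)[::-1][:num]

# Toy factors: 2 latent dimensions, 4 documents stored as columns of V_T
S_toy = np.array([2.0, 1.0])
V_T_toy = np.array([[1.0, 2.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0, 1.0]])

# Document 1 points in the same direction as document 0, so it ranks first
print(top_similar_documents(S_toy, V_T_toy, d=0, num=2))
```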