# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *J*

**Names:**

* *Ann-Kristin Bergmann*
* *Nephele Aesopou*
* *Ewa Miazga*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [142]:
import pickle
import numpy as np
import scipy
from scipy.sparse.linalg import svds
import pandas as pd
import utils

from sklearn.metrics.pairwise import cosine_similarity

from itertools import combinations

In [97]:
# load the term-document matrix as well as the lists that map matrix indices to courses/terms

courses_path = './courses_list (1).txt'
terms_path = './unique_word_corpus (1).txt'

courses_overview = utils.load_json(courses_path)

terms_overview = utils.load_json(terms_path)

term_document_matrix = scipy.sparse.load_npz("X_matrix (1).npz")



In [98]:
# mapping functions that given an index return the corresponding term/course
def id_to_course(id_in_matrix):
    return courses_overview[id_in_matrix]

def id_to_term(id_in_matrix):
    return terms_overview[id_in_matrix]

## Exercise 4.4: Latent semantic indexing

In [99]:
# we want to extract the 300 largest singular values and their singular vectors
U, S, VT = svds(term_document_matrix, k=300)


In [100]:
print("Shape of term-document-matrix: {}, U: {} and VT: {}".format(term_document_matrix.shape, U.shape, VT.shape))


Shape of term-document-matrix: (10820, 854), U: (10820, 300) and VT: (300, 854)


In [101]:
print(len(terms_overview))

10820


In [102]:
print(len(courses_overview))

854


In [109]:
topics_count = U.shape[1]

Each row of $U$ corresponds to a term and each column to a topic: the entry $U_{ij}$ indicates how strongly term $i$ is associated with topic $j$. The matrix $V^T$ is built similarly, except that its rows correspond to topics and its columns to the documents in the corpus, in our case courses. `svds` returns the singular values `S` as a one-dimensional array, here sorted in ascending order; they form the diagonal of the matrix $S$. Each singular value measures the importance of its topic for the given data: the larger the singular value, the larger the share of the structure (variance) in the data that the topic captures.
Terms with similar meanings tend to have similar rows in $U$, i.e. similar values across the topic columns. The same applies to documents and the columns of $V^T$.
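
As a minimal sketch of this structure, the snippet below runs `svds` on a small random sparse matrix. The names `toy_matrix`, `U_toy`, `S_toy`, `VT_toy` and the sizes are hypothetical and only stand in for the real term-document matrix. It sorts the factors explicitly so that the most important topic comes first, instead of assuming a particular order in the `svds` output.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical toy term-document matrix: 30 terms x 12 documents
toy_matrix = sparse_random(30, 12, density=0.3, format="csr", random_state=0)

k = 5  # number of topics to keep (must be smaller than min(30, 12))
U_toy, S_toy, VT_toy = svds(toy_matrix, k=k)

# Reorder the factors so that the largest singular value comes first
order = np.argsort(S_toy)[::-1]
U_toy, S_toy, VT_toy = U_toy[:, order], S_toy[order], VT_toy[order, :]

# Rank-k approximation of the original matrix: U * diag(S) * V^T
approx = U_toy @ np.diag(S_toy) @ VT_toy

print(U_toy.shape, S_toy.shape, VT_toy.shape)
```

After the reordering, column `j` of `U_toy` and row `j` of `VT_toy` belong to the `j`-th largest singular value, so topic 1 sits at index 0.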

In [108]:
# print the top-20 singular values (S is in ascending order, so we read it from the end)
diagonal_values = S[-1:-21:-1]
print(diagonal_values)

[58.79735866 51.38912695 37.29238703 34.97853623 34.77311675 33.09564297
 32.64143815 32.57377559 31.5379204  31.24943091 30.66200111 30.25784823
 29.47446833 29.35552253 28.63487448 28.49670816 28.13977974 27.95398917
 27.46202986 27.08862678]
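
One way to make "importance" concrete is the share of the squared Frobenius norm captured by each singular value, since the squared Frobenius norm of a matrix equals the sum of all its squared singular values. The sketch below illustrates this on a tiny dense matrix; `energy_share` and `toy` are hypothetical names, and this is an illustration rather than part of the lab solution.

```python
import numpy as np

def energy_share(singular_values, matrix):
    """Fraction of the squared Frobenius norm of a dense `matrix`
    captured by each of the given singular values."""
    total = np.sum(np.asarray(matrix, dtype=float) ** 2)
    return np.asarray(singular_values, dtype=float) ** 2 / total

# Tiny example whose singular values are 4 and 3
toy = np.array([[3.0, 0.0],
                [0.0, 4.0],
                [0.0, 0.0]])
s = np.linalg.svd(toy, compute_uv=False)  # singular values, largest first
shares = energy_share(s, toy)
print(shares)
```

For a truncated decomposition the shares do not sum to one; the remainder is the part of the matrix that the discarded topics would explain.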


## Exercise 4.5: Topic extraction

In [120]:
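def get_top_k(array, k):
    """Return the indices and values of the k largest entries of a 1-D array."""
    # indices sorted by value, largest first
    sorted_indices = np.argsort(array)[::-1]

    # keep the top k indices and their values
    top_k_indices = sorted_indices[:k]
    top_k_values = array[top_k_indices]

    return top_k_indices, top_k_values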

num_topics = 10
top_10 = {}
# the singular values are in ascending order, so the 10 most important topics
# are the last 10 columns of U / rows of VT
for topic in range(topics_count - 1, topics_count - num_topics - 1, -1):
    # terms: the num_topics terms most strongly associated with the current topic
    term_indices, topic_term_value = get_top_k(U[:, topic], num_topics)
    terms = [id_to_term(i) for i in term_indices]

    # documents: the num_topics courses most strongly associated with the current topic
    doc_indices, topic_doc_value = get_top_k(VT[topic, :], num_topics)
    courses = [id_to_course(c) for c in doc_indices]

    top_10[topic] = {"terms": terms, "courses": courses}

    # topic 1 is the topic with the largest singular value
    print("topic {}:".format(topics_count - topic))
    print("terms: ", terms)
    print("documents: ", courses)


topic 1:
terms:  ['selection design', 'engineering handbook', 'mar', 'analysis requirement', 'assignment team', 'concept selection', 'nasa', 'cycle process', 'weck', 'learning rule']
documents:  ['MGT-690(A)', 'MGT-690(B)', 'BIO-699(n)', 'AR-202(c)', 'CH-404', 'CS-411', 'CH-617', 'MICRO-600', 'EE-712', 'BIO-382']
topic 2:
terms:  ['guidance professor', 'professor assistant', 'guidance', 'research semester', 'theme proposed', 'chosen theme', 'proposed web', 'projectpresent project', 'assistant content', 'semester guidance']
documents:  ['CS-498', 'COM-507', 'CS-596', 'CH-491', 'MSE-490(b)', 'MSE-490(a)', 'MSE-490(c)', 'EE-491(c)', 'EE-491(b)', 'AR-522']
topic 3:
terms:  ['financial', 'valuation', 'pricing', 'risk', 'finance', 'market', 'stochastic', 'corporate', 'firm', 'capital']
documents:  ['MGT-482', 'FIN-401', 'FIN-521', 'FIN-405', 'FIN-407', 'FIN-402', 'FIN-404', 'FIN-505', 'FIN-606', 'FIN-506']
topic 4:
terms:  ['sem', 'microscope', 'electron', 'scanning electron', 'microscopy', 

In [116]:
# Give a label to each topic
# Load the data from the text file
from utils import load_json
courses_list = load_json('data/courses.txt')

In [121]:
# we print the descriptions of the top 3 matching courses for each topic to help us find a label
for t, v in top_10.items():
    print("Topic: {}".format(topics_count - t))
    for course_id in v["courses"][:3]:
        for item in courses_list:
            if item['courseId'] == course_id:
                print(course_id, ": ", item['description'])
                break


Topic: 1
MGT-690(A) :  Contact the EDMT administration for enrollment please
MGT-690(B) :  Contact the EDMT Administration for enrollment please
BIO-699(n) :  Training rotation
Topic: 2
CS-498 :  Individual research during the semester under the guidance of a professor or an assistant. Content Subject to be chosen among the themes proposed on the web site :   http://ic.epfl.ch/semester_projects_by_laboratory Learning Outcomes By the end of the course, the student must be able to: Organize a projectAssess / Evaluate one's progress through the course of the projectPresent a project Transversal skills Write a scientific or technical report.Write a literature review which assesses the state of the art. Assessment methods Written report and oral presentation
COM-507 :  Individual research during the semester under the guidance of a professor or an assistant. Content Subject to be chosen among the themes proposed on the web site : http://ic.epfl.ch/systemes-communication-projet-labo-master L

Given the terms and course descriptions, we label the topics as follows:
- Topic 1: *technology management and engineering*
- Topic 2: *academic support*
- Topic 3: *finance*
- Topic 4: *scientific research in microscopy and finance*
- Topic 5: *SEM Training and Analysis*
- Topic 7: *Atmosphere and Fluid Dynamics*
- Topic 8: *Biophotonics and Biochemical Sensing*
- Topic 9: *Electronic Devices and Architectures*
- Topic 10: *Materials Science and Optimization*


## Exercise 4.6: Document similarity search in concept-space

In [153]:
def sim(term_v, document_v, S):
    numerator = np.dot(term_v, np.dot(S, document_v))
    denominator = np.linalg.norm(term_v) * np.linalg.norm(document_v)
    return numerator / denominator

def get_combinations_of_term(term):
    words = term.split()
    # Generate all combinations of the words
    combinations_list = []
    for r in range(1, len(words) + 1):
        combinations_list.extend(combinations(words, r))

    # Convert the combinations back to strings
    combinations_strings = [' '.join(comb) for comb in combinations_list]
    return combinations_strings

# compute similarity for term-course
def sim_term_doc(term, courseId, S):
    # a term consisting of multiple words matches multiple entries in the term list
    terms = get_combinations_of_term(term)
    term_v = np.zeros(topics_count)
    found = False
    # add up the rows of U for every word combination that is in the term list
    for t in terms:
        try:
            index_term = terms_overview.index(t)
        except ValueError:
            # skip combinations that are not in the term list
            continue
        term_v = term_v + U[index_term, :]
        found = True
    if not found:
        # none of the combinations is in the term list: we return a similarity of 0
        return 0.0
    # rescale term_v (sim() divides by the norm again, so only the sign of the sum can affect the result)
    term_v = term_v / np.sum(term_v)

    # get course vector
    document_v = VT[:, courseId]
    # build the diagonal matrix from the 1-D array of singular values
    S_matrix = np.diag(S)
    # calculate similarity between term and course vector
    return sim(term_v, document_v, S_matrix)
    

def search(term, k=5):
    similarities = {}
    # iterate over all courses
    for i in range(len(courses_overview)):
        course_Id = id_to_course(i)
        # store similarity between course and given term
        similarities[course_Id] = sim_term_doc(term, i, S)
    # extract the k courses with the highest similarity values
    sorted_items = sorted(similarities.items(), key=lambda x: x[1], reverse=True)
    return sorted_items[:k]
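
The loop in `search` evaluates the same formula once per course. As a hedged sketch, the same scores can be computed for all courses in one matrix product; `score_all_documents` and the toy arrays below are hypothetical names, not part of the lab code. Since $S$ is diagonal, multiplying the term vector element-wise by the singular values is equivalent to multiplying by `np.diag(S)`.

```python
import numpy as np

def score_all_documents(term_v, singular_values, VT):
    """Score every document (column of VT) against one term vector.

    For each column d of VT this computes
    term_v^T diag(s) d / (||term_v|| * ||d||).
    """
    numerators = (term_v * singular_values) @ VT  # shape: (n_docs,)
    denominators = np.linalg.norm(term_v) * np.linalg.norm(VT, axis=0)
    return numerators / denominators

# Hypothetical toy factors: 3 topics, 4 documents
rng = np.random.default_rng(0)
term_v_toy = rng.normal(size=3)
s_toy = np.array([1.0, 2.0, 3.0])
VT_toy = rng.normal(size=(3, 4))

scores = score_all_documents(term_v_toy, s_toy, VT_toy)
top_2 = np.argsort(scores)[::-1][:2]  # indices of the two best-scoring documents
print(scores, top_2)
```

This avoids rebuilding the diagonal matrix for every course, which matters more as the number of documents grows.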


In [156]:
markov_chain_res = search("markov chains")
print(markov_chain_res)

[('MATH-332', 14.185123393888412), ('MGT-484', 13.844382361987355), ('EE-605', 12.017348257235897), ('COM-516', 11.450630375322035), ('EE-516', 5.796697449279108)]


In [159]:
# we print the course names of the previous result
for id,_ in markov_chain_res:  
    print("Topic: {}".format(id))
    for item in courses_list:
        if item['courseId'] == id:
            print(id, ": ", item['name'])
            break

Topic: MATH-332
MATH-332 :  Applied stochastic processes
Topic: MGT-484
MGT-484 :  Applied probability & stochastic processes
Topic: EE-605
EE-605 :  Statistical Sequence Processing
Topic: COM-516
COM-516 :  Markov chains and algorithmic applications
Topic: EE-516
EE-516 :  Data analysis and model classification


In [139]:
facebook_res = search("facebook")
print(facebook_res)

[('EE-727', 18.27613129768606), ('EE-593', 14.097471772548198), ('HUM-432(a)', 6.259373092283942), ('COM-308', 5.636709528854571), ('EE-552', 5.042800555239625)]


In [161]:
# we print the course names of the previous result
for id,_ in facebook_res:  
    print("Topic: {}".format(id))
    for item in courses_list:
        if item['courseId'] == id:
            print(id, ": ", item['name'])
            break

Topic: EE-727
EE-727 :  Computational Social Media
Topic: EE-593
EE-593 :  Social media
Topic: HUM-432(a)
HUM-432(a) :  How people learn I
Topic: COM-308
COM-308 :  Internet analytics
Topic: EE-552
EE-552 :  Media security


In [132]:
# compare with previous section

## Exercise 4.7: Document-document similarity

In [133]:
# use cosine similarity to calculate the similarity between the two document vectors
def compute_similarity(doc_1_v, doc_2_v):
    return cosine_similarity(doc_1_v.reshape(1,-1), doc_2_v.reshape(1,-1))

    

In [134]:
# search function that, given a course, computes its similarity to all other courses and returns the top k
def compare_multiple_courses(course, k):
    # get the matrix index of the course
    idx = courses_overview.index(course)
    # column of VT for the specified course, which holds its course-topic associations
    course_v = VT[:, idx]
    similarities = {}
    # iterate over all courses and compute the similarity to the specified course
    for i in range(len(courses_overview)):
        # exclude the specified course itself from the comparison
        if idx != i:
            course_id = id_to_course(i)
            similarities[course_id] = compute_similarity(course_v, VT[:,i])
    # sort the courses by similarity and keep the top k
    sorted_items = sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:k]
    return sorted_items


In [126]:
# find top 5 most similar courses to "COM-308"
res = compare_multiple_courses("COM-308", 5)
print(res)

[('CS-423', array([[0.55880181]])), ('CS-401', array([[0.35891372]])), ('EE-727', array([[0.30841919]])), ('FIN-525', array([[0.28658085]])), ('CS-411', array([[0.2768962]]))]
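
The document-document comparison above can also be sketched without an explicit loop, and with plain floats instead of 1x1 arrays in the result. The function `top_k_similar_columns` and the toy matrix are hypothetical and use only NumPy; this is an illustration under those assumptions, not the implementation used for the results above.

```python
import numpy as np

def top_k_similar_columns(VT, idx, k):
    """Return (column index, cosine similarity) pairs for the k columns
    of VT that are most similar to column `idx`."""
    norms = np.linalg.norm(VT, axis=0)
    # cosine similarity between column idx and every column of VT
    sims = (VT.T @ VT[:, idx]) / (norms * norms[idx])
    sims[idx] = -np.inf  # exclude the query document itself
    best = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in best]

# Toy example: 2 topics, 4 documents
VT_toy = np.array([[1.0, 2.0, 0.0, -1.0],
                   [0.0, 0.0, 1.0,  0.0]])
result = top_k_similar_columns(VT_toy, idx=0, k=2)
print(result)
```

In the toy matrix, document 1 points in the same direction as document 0 and is therefore ranked first, while document 3 points in the opposite direction and is ranked last.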
