# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *N*

**Names:**

* *Anh Nghia Khau (223613)*
* *Sandra Djambazovska(224638)*



---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import math
import numpy as np
from sklearn.cluster import KMeans
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl
import operator
from scipy.sparse.linalg import svds

import nltk
from nltk import pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import wordnet

### Pre-processing query

In [2]:
def split_words(word):
    """Split a CamelCase word at each uppercase letter, e.g. 'HelloWorld' -> ' Hello World'."""
    if (word.isupper() or word.islower()):
        return word
    else:
        pos_to_cut = []
        for i in range(1, len(word)):
            if (word[i].isupper()):
                pos_to_cut.append(i)
        curr = 0
        words = ''
        for pos in pos_to_cut:
            words += ' ' + word[curr: pos]
            curr = pos
        words += ' ' + word[curr:]
        return words
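
As a point of comparison, the same CamelCase splitting can be sketched with a regular expression. This is an alternative, not the function used above: `split_camel_case` is a hypothetical helper, and unlike `split_words` it keeps runs of capitals together and adds no leading space.

```python
import re

def split_camel_case(word):
    """Insert a space wherever a lowercase letter or digit is followed by an uppercase letter."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', word)

print(split_camel_case('HelloWorld'))
print(split_camel_case('markovChains'))
print(split_camel_case('NLP'))
```

Words that are entirely uppercase or entirely lowercase pass through unchanged, as in `split_words`.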

In [3]:
def tokenization(sentence):
    """Combine a regular-expression tokenizer (removes punctuation) with word_tokenize."""
    tokenizer = RegexpTokenizer(r'\w+')
    # Split CamelCase words before tokenizing
    temp = ''
    for w in sentence.split():
        temp += ' ' + split_words(w)
    temp = word_tokenize(temp)
    # Re-join the tokens so the regex tokenizer can strip the punctuation
    new_sentence = ' '.join(temp)

    return tokenizer.tokenize(new_sentence)

In [4]:
def stop_words(sentence):
    stopwords = load_pkl('data/stopwords.pkl')
    return [x.lower() for x in sentence if x.lower() not in stopwords]

In [5]:
def get_wordnet_pos(treebank_tag):
    """Map ['NN', 'NNS', 'NNP', 'NNPS'] to NOUN....."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

In [6]:
def lemmatization(sentence):
    """Lemmatize the tokens using their POS tag; tokens without a WordNet POS are dropped."""
    lemmatiser = WordNetLemmatizer()
    tokens_pos = pos_tag(sentence)
    tokens_pos = [(w, get_wordnet_pos(p)) for (w, p) in tokens_pos]

    return [lemmatiser.lemmatize(w, pos=p) for (w, p) in tokens_pos if p is not None]

In [7]:
def preprocessing(sentence):
    """Tokenize, remove stopwords, then POS-tag and lemmatize a sentence."""
    # Tokenization
    new_sentence = tokenization(sentence)
    # Stopword removal
    new_sentence = stop_words(new_sentence)
    # POS tagging and lemmatization
    new_sentence = lemmatization(new_sentence)

    return new_sentence

### Load data from previous exercise

In [8]:
def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((  loader['data'], loader['indices'], loader['indptr']),
                         shape = loader['shape'])

In [9]:
def load_data():
    tf_matrix    = load_sparse_csr("tf_matrix.npz")
    tfidf_matrix = load_sparse_csr("tfidf_matrix.npz")
    doc_indices  = load_json('doc_indices.txt')[0]
    term_indices = load_json('term_indices.txt')[0]
    indices_term = load_json('indices_term.txt')[0]
    doc_names    = load_json('doc_names.txt')[0]
    
    return tf_matrix, tfidf_matrix, doc_indices, term_indices, indices_term, doc_names 

## Exercise 4.4: Latent semantic indexing

Since X is not a square matrix, it has no eigenvalues of its own, so we take the singular values of X instead.

The singular values of an m×n matrix A are the positive square roots of the nonzero eigenvalues of the matrix `A.T @ A`. The corresponding eigenvectors of `A.T @ A` are the right singular vectors.

U is an M x K matrix: row i of U expresses term i as a combination of K latent concepts instead of N documents (the coordinates of the term in the latent space).

V is an N x K matrix: row i of V expresses document i as a combination of K latent concepts instead of M terms (the coordinates of the document in the latent space).

S is a K x K diagonal matrix: its first value (the largest, at the top left) is the weight of the first 'topic', which corresponds to the first column of U and the first column of V.
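
To make the shapes and the link with eigenvalues concrete, here is a minimal sketch on a toy matrix. The 5 x 4 matrix and its values are made up for illustration and are not the course data. The triplets are sorted explicitly because `svds` does not return them largest-first.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy term-document matrix X (M = 5 terms, N = 4 documents); values are made up.
X = csr_matrix(np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 0.0, 2.0, 0.0],
]))

K = 2
U, S, V_T = svds(X, k=K)

# Sort the K triplets by decreasing singular value.
order = np.argsort(S)[::-1]
U, S, V_T = U[:, order], S[order], V_T[order, :]

# U: terms x concepts, S: concept weights, V_T: concepts x documents
print(U.shape, S.shape, V_T.shape)

# The squared singular values match the K largest eigenvalues of X.T @ X.
eigvals = np.linalg.eigvalsh((X.T @ X).toarray())[::-1][:K]
print(np.allclose(S ** 2, eigvals))
```

Note that `svds` returns `V_T` (K x N), so the rows of V described above are the columns of `V_T`.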


In [10]:
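def LSI(path, K):
    """Return the 20 largest of the K computed singular values (path is unused; data comes from load_data)."""
    tf_matrix, tfidf_matrix, doc_indices, term_indices, indices_term, doc_names = load_data()
    U, S, V_T = svds(tfidf_matrix, k=K)
    # svds returns S in ascending order here; sort explicitly to be safe
    return np.sort(S)[::-1][:20]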

In [11]:
print("Top 20 singular values of X")
LSI('data/courses.txt', 300)

Top 20 singular values of X


array([ 38.36486966,  27.26559776,  24.77872325,  23.16109916,
        22.96221797,  22.11653527,  21.57938508,  20.86047732,
        20.34718015,  20.21306309,  19.90632062,  19.88701028,
        19.49484284,  18.96126007,  18.70856149,  18.28396281,
        18.15410327,  18.04668982,  17.85978557,  17.78966251])

## Exercise 4.5: Topic extraction

In [12]:
def topic_extraction_terms(nbTopics, n_terms):
    tf_matrix, tfidf_matrix, doc_indices, term_indices, indices_term, doc_names = load_data()

    # Low-rank approximation with 60 latent concepts
    U, S, V_T = svds(tfidf_matrix, 60)

    print("Top-{k} topics as a combinations of terms: ".format(k=nbTopics))
    for i in range(nbTopics):
        print("   Topic " + str(i) + ":")
        # Rank the terms by the magnitude of their weight in column i of U
        abs_score = np.abs(U[:, i])
        top_terms = np.argsort(abs_score)[::-1][:n_terms]
        for y in top_terms:
            print("     '{k}'".format(k=indices_term.get(str(y))))

topic_extraction_terms(nbTopics=10, n_terms=15)

Top-10 topics as a combinations of terms: 
   Topic 0:
     'seminar'
     'numerical'
     'brain'
     'convex'
     'measurement'
     'stellar'
     'electrochemical'
     'talk'
     'star'
     'econometric'
     'visualization'
     'magnetic'
     'series'
     'cognitive'
     'plasma'
   Topic 1:
     'atmospheric'
     'film'
     'turbulent'
     'thin'
     'solidification'
     'econometric'
     'acoustic'
     'stain'
     'plasma'
     'fluid'
     'boundary'
     'layer'
     'additive'
     'exchange'
     'transport'
   Topic 2:
     'telomerase'
     'telomere'
     'telomeres'
     'rapid'
     'image'
     'senescence'
     'additive'
     'gene'
     'stain'
     'test'
     'film'
     'assay'
     'dna'
     'thin'
     'seminar'
   Topic 3:
     'ccd'
     'communication'
     'photodiodes'
     'laser'
     'plasma'
     'brain'
     'mems'
     'sensor'
     'label'
     'synthesis'
     'detector'
     'neuroprostheses'
     'cancer'
     'organizational'


Tentative labels: T0: security, T1: life science, T2: biology, T3: biomedical engineering (?), T4: ?, T5: ?, T6: ?, T7: ?, T8: electrical engineering (?), T9: ?

=> Most topics are difficult to recognise from their top terms alone.
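
One possible contributing factor, offered as an assumption rather than a diagnosis: `svds` appears to return the singular values in ascending order (the `S[::-1]` in `LSI` relies on this). If so, `U[:, 0]` and row 0 of `V_T` belong to the weakest of the computed components, not the strongest. The sketch below reorders the components before reading off the leading topic. It uses a random sparse matrix as a stand-in for the TF-IDF matrix, and `sorted_svds` is a hypothetical helper.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

def sorted_svds(matrix, k):
    """Truncated SVD with components sorted by decreasing singular value."""
    U, S, V_T = svds(matrix, k=k)
    order = np.argsort(S)[::-1]
    return U[:, order], S[order], V_T[order, :]

# Random sparse matrix standing in for the TF-IDF matrix (assumption).
X = sparse_random(50, 30, density=0.2, format="csr", random_state=0)

U, S, V_T = sorted_svds(X, k=5)

# Indices of the 3 highest-magnitude terms in the strongest topic.
top_terms = np.argsort(np.abs(U[:, 0]))[::-1][:3]
print(S)
print(top_terms)
```

With this ordering, column `i` of `U` and row `i` of `V_T` correspond to the `i`-th largest singular value.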

In [13]:
def topic_extraction_docs(nbTopics, n_docs):
    tf_matrix, tfidf_matrix, doc_indices, term_indices, indices_term, doc_names = load_data()

    # Mapping between column index and course code
    indices_doc = dict((v,k) for k,v in doc_indices.items())

    # Low-rank approximation with 80 latent concepts
    U, S, V_T = svds(tfidf_matrix, 80)

    print("Top-{k} topics as a combinations of documents: ".format(k=nbTopics))
    for i in range(nbTopics):
        print("Topic " + str(i) + ":")
        # Rank the documents by the magnitude of their weight in column i of V
        abs_score = np.abs(V_T.T[:, i])
        top_docs = np.argsort(abs_score)[::-1][:n_docs]
        for y in top_docs:
            print("   '{k}-{s}'".format(k=indices_doc.get(y),s=doc_names[indices_doc[y]]))

topic_extraction_docs(nbTopics=10, n_docs=15)

Top-10 topics as a combinations of documents: 
Topic 0:
   'CS-487-Industrial automation'
   'ENG-421-Fundamentals in Systems Engineering'
   'CIVIL-428-Engineering geology for geo-energy'
   'CIVIL-402-Geomechanics'
   'CIVIL-444-Energy geostructures'
   'COM-401-Cryptography and security'
   'CIVIL-709-New Concretes for Structures'
   'PENS-201-Making structural logic'
   'COM-501-Advanced cryptography'
   'CH-432-Structure and reactivity'
   'CIVIL-429-Reservoir mechanics for geo-energy and the environment'
   'MSE-465-Thin film fabrication processes'
   'MSE-802-CCMX Summer School - Multiscale Modelling of Materials (2016)'
   'CS-206-Parallelism and concurrency'
   'CS-401-Applied data analysis'
Topic 1:
   'CH-415-Chemistry of small biological molecules'
   'CIVIL-402-Geomechanics'
   'MSE-651-Crystallography of structural phase transformations'
   'ENG-627-Academic Writing for Doctoral Students'
   'ENV-719-Localization and Navigation Methods'
   'CIVIL-444-Energy geostructures'

Tentative labels: T0: civil engineering (?), T1: ?, T2: ?, T3: ?, T4: ?, T5: ?, T6: ?, T7: ?, T8: ?, T9: ?

=> Most topics are difficult to recognise from their top documents alone.

## Exercise 4.6: Document similarity search in concept-space

In [10]:
def similarity(u, S, v_t):
    """Cosine similarity between term vector u and document vector v_t scaled by S."""
    temp = np.diag(S) @ v_t
    return np.dot(u, temp) / np.sqrt(np.sum(u ** 2) * np.sum(temp ** 2))

In [34]:
def search(query, n):
    nbTopics = 35
    # Dict for storing the result, e.g. scores['CS-308'] = 14
    scores = {}

    tf_matrix, tfidf_matrix, doc_indices, term_indices, indices_term, doc_names = load_data()
    # Mapping between column index and course code
    indices_doc = dict((v,k) for k,v in doc_indices.items())

    # Low-rank approximation
    U, S, V_T = svds(tfidf_matrix, nbTopics)

    # Pre-process the query
    processed = preprocessing(query)

    # Compute the score of the query for each document
    for doc in doc_indices.keys():
        # Index of the document's column
        col = doc_indices[doc]
        score = 0.0
        for term in processed:
            row = term_indices.get(term)
            # Skip terms that are not in the vocabulary
            if row is None:
                continue
            u = U[row]        # term vector in concept space
            v_t = V_T.T[col]  # document vector in concept space
            score += similarity(u, S, v_t)
        scores[doc] = score

    # Sort by score (descending) and keep the top n
    sorted_scores = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)[:n]

    print("Top {k} classes (descending) for your query '{q}'".format(k=n,q=query))
    for (k,v) in sorted_scores:
        print("{key}--{c} with score {s}".format(key=k, c=doc_names[k],s=v))

In [35]:
search('facebook', 5)

Top 5 classes (descending) for your query 'facebook'
EE-727--Computational Social Media with score 0.9047365692929483
COM-308--Internet analytics with score 0.6085845559101029
CS-423--Distributed information systems with score 0.5897271271506194
EE-593--Social media with score 0.5548492126267299
COM-208--Computer networks with score 0.5488574990330605


Very good: this time the search returns classes about 'social media' and 'social networks'. The algorithm now captures the 'relationship' between these documents.

In [36]:
search('markov chains', 5)

Top 5 classes (descending) for your query 'markov chains'
MATH-332--Applied stochastic processes with score 1.3867736168516411
MGT-484--Applied probability & stochastic processes with score 1.36451454007995
COM-516--Markov chains and algorithmic applications with score 1.2685938880684826
COM-512--Networks out of control with score 1.158170215991472
FIN-600--Game Theory with score 1.1132292998035345
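
The per-document loop in `search` can also be written with matrix operations. The sketch below uses random toy data rather than the course corpus and assumes the same scoring rule as `similarity` (cosine between a term row `u` of `U` and `diag(S) @ v` for each document). `score_all_docs` is a hypothetical helper.

```python
import numpy as np

def score_all_docs(u, S, V_T):
    """Cosine similarity between term vector u and the S-scaled vectors of all documents."""
    docs = S[:, None] * V_T                      # column j is diag(S) @ v_j, shape (K, N)
    numer = u @ docs                             # dot products, shape (N,)
    denom = np.linalg.norm(u) * np.linalg.norm(docs, axis=0)
    return numer / denom

# Toy concept-space data (random values, for illustration only)
rng = np.random.default_rng(0)
K, N = 4, 6
u = rng.normal(size=K)
S = np.array([4.0, 3.0, 2.0, 1.0])
V_T = rng.normal(size=(K, N))

scores = score_all_docs(u, S, V_T)

# Same value as the per-pair formula used in `similarity`, here for document 2
temp = np.diag(S) @ V_T.T[2]
loop_score = np.dot(u, temp) / np.sqrt(np.sum(u ** 2) * np.sum(temp ** 2))
print(np.isclose(scores[2], loop_score))
```

Because `search` adds one cosine per query term, a two-term query such as 'markov chains' can reach scores above 1, as in the output above.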


## Exercise 4.7: Document-document similarity

In [37]:
# Reference: https://en.wikipedia.org/wiki/Latent_semantic_analysis
def doc_similarity(doc1, doc2, S):
    """Cosine similarity between two documents after scaling their concept coordinates by S."""
    doc1_ = np.diag(S) @ doc1
    doc2_ = np.diag(S) @ doc2
    return np.dot(doc1_, doc2_) / np.sqrt(np.sum(doc1_ ** 2) * np.sum(doc2_ ** 2))

In [38]:
def most_similar(course, n):
    nbTopics = 35
    # Dict for storing the result, e.g. scores[class] = 0.8
    # means that the similarity between (course, class) is 0.8
    scores = {}

    tf_matrix, tfidf_matrix, doc_indices, term_indices, indices_term, doc_names = load_data()

    # Low-rank approximation
    U, S, V_T = svds(tfidf_matrix, nbTopics)

    col1 = doc_indices[course]
    # Compute the similarity between 'course' and each document
    for doc in doc_indices.keys():
        col2 = doc_indices[doc]
        doc1 = V_T.T[col1]
        doc2 = V_T.T[col2]
        scores[doc] = doc_similarity(doc1, doc2, S)

    # Sort by score (descending) and keep n + 1 entries, since 'course' itself is among them
    sorted_scores = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)[:n+1]

    print("Top {k} classes (descending) most similar to '{q}'".format(k=n,q=course))
    for (k,v) in sorted_scores:
        # Skip the query course itself
        if (k == course):
            continue
        print("{key}--{c} with score {s}".format(key=k, c=doc_names[k],s=v))

In [39]:
most_similar('COM-308', 5)

Top 5 classes (descending) most similar to 'COM-308'
CS-423--Distributed information systems with score 0.8504745986824664
EE-558--A Network Tour of Data Science with score 0.8266593814062848
COM-208--Computer networks with score 0.7804548717769448
COM-407--TCP/IP networking with score 0.7765591647293033
COM-512--Networks out of control with score 0.7582265717362375
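
The document-document comparison can likewise be computed for all documents at once. The sketch below runs on random toy data, not the course corpus, and assumes the same measure as `doc_similarity` (cosine between S-scaled document vectors). `doc_similarities` and `most_similar_indices` are hypothetical helpers.

```python
import numpy as np

def doc_similarities(V_T, S, col):
    """Cosine similarity between document `col` and every document in concept space."""
    docs = (S[:, None] * V_T).T                                # row j is diag(S) @ v_j
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)  # unit-length rows
    return docs @ docs[col]

def most_similar_indices(V_T, S, col, n):
    """Indices of the n documents most similar to `col`, excluding `col` itself."""
    sims = doc_similarities(V_T, S, col)
    sims[col] = -np.inf
    return np.argsort(sims)[::-1][:n]

# Toy concept-space data (random values, for illustration only)
rng = np.random.default_rng(1)
S = np.array([3.0, 2.0, 1.0])
V_T = rng.normal(size=(3, 8))

sims = doc_similarities(V_T, S, col=0)
top = most_similar_indices(V_T, S, col=0, n=3)
print(top)
```

Setting the query document's own score to `-inf` plays the same role as the `n+1` slice and the `continue` in `most_similar`.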
