# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *J*

**Names:**

* *Dennis Gankin*
* *Name 2*
* *Name 3*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [3]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds
from utils import load_pkl

## Exercise 4.4: Latent semantic indexing

In [5]:
# Load the TF-IDF matrix, the term list and the course list
TF_m = load_pkl('tfidf.pkl')
terms = load_pkl('terms.pkl')
courses = load_pkl('courses.pkl')

# The matrix has n rows (terms) and m columns (documents)
n, m = TF_m.shape

In [7]:
from scipy.sparse.linalg import svds
# Truncated SVD keeping the 300 largest-magnitude ('LM') singular values
U, S, V_t = svds(TF_m, k=300, which='LM')

In [9]:
#TODO Describe the rows and columns of U and V, and the values of S.

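$U$: Term-concept mapping

The $n$ rows of the $U$ matrix given by the SVD map terms to concepts. Each row belongs to one term, and the $i$-th value in that row shows how strongly the term relates to concept $c_i$. Read column-wise, each column of $U$ describes one concept as a weighted combination of terms.

$V^T$: Course-concept mapping

Similarly, the $m$ columns of the $V^T$ matrix map courses to concepts. Each column belongs to one course, and the $i$-th value in that column shows how strongly the course relates to concept $c_i$. Read row-wise, each row of $V^T$ describes one concept as a weighted combination of courses.

$S$: Concept "strength"

The singular values in $S$ show how "strong" each concept is: the larger the value, the more of the TF-IDF matrix that concept accounts for. In this notebook the values come back in ascending order, so the strongest concept is the last one.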
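
To make the shapes described above concrete, here is a minimal sketch on a small random matrix. The toy matrix `X` and the choice `k = 3` are assumptions for illustration only and are not the lab data. Because the order of the values returned by `svds` should not be relied on, the sketch sorts the three factors explicitly.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical toy term-document matrix: 6 terms (rows) x 5 documents (columns).
rng = np.random.default_rng(0)
X = rng.random((6, 5))

k = 3  # number of concepts to keep; svds needs k < min(X.shape)
U, S, V_t = svds(X, k=k)

print(U.shape)    # one row per term, one column per concept
print(S.shape)    # one singular value per concept
print(V_t.shape)  # one row per concept, one column per document

# Reorder all three factors together so the strongest concept comes first.
order = np.argsort(S)[::-1]
U, S, V_t = U[:, order], S[order], V_t[order, :]
```

After the reordering, column `i` of `U` and row `i` of `V_t` both refer to the concept with the `i`-th largest singular value.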

In [10]:
# S is in ascending order here, so reverse it before taking the 20 largest values
singular20 = S[::-1][:20]
singular20

array([ 344.31354526,  223.54598261,  211.23490744,  204.18159738,
        192.59066714,  190.12085322,  187.87139273,  186.11051083,
        182.05487293,  177.11191671,  171.67441762,  170.92157508,
        168.79107365,  163.88075299,  160.85639893,  160.70081938,
        158.30216791,  156.76891911,  154.34270592,  152.88152916])

In [12]:
# The eigenvalues are the squares of the singular values
eigenv20 = [x * x for x in singular20]
eigenv20

[118551.81745109936,
 49972.806340782823,
 44620.186123127998,
 41690.124708290095,
 37091.165068619892,
 36145.938828648803,
 35295.660205228443,
 34637.122241329547,
 33143.976759353827,
 31368.631040412936,
 29472.105666437194,
 29214.184826477391,
 28490.426545137649,
 26856.901200287168,
 25874.781076314903,
 25824.753349977462,
 25059.576364671633,
 24576.493999927687,
 23821.670870634709,
 23372.761958302566]
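
The squaring step relies on the fact that the nonzero eigenvalues of $X^T X$ (and of $X X^T$) are the squared singular values of $X$. A small check on a random matrix, where `X` is a hypothetical stand-in for the TF-IDF matrix, might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))  # hypothetical toy term-document matrix

# Full set of singular values, returned in descending order
singular_values = np.linalg.svd(X, compute_uv=False)

# eigvalsh returns eigenvalues of a symmetric matrix in ascending order; reverse them
eigenvalues = np.linalg.eigvalsh(X.T @ X)[::-1]

print(np.allclose(singular_values ** 2, eigenvalues))
```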

## Exercise 4.5: Topic extraction

In [17]:
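# Diagonal matrix of singular values, used later for the similarity search
s = np.diag(S)
# Each topic as a combination of terms: the 15 terms with the largest weight in U.
# Negative indices are used because the strongest concept is the last column.
for topic in range(-1, -11, -1):
    words = [terms[t] for t in np.argsort(U[:, topic])[-15:]]
    print('Topic', -topic, ':')
    print(words)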

Topic 1 :
['dwarf', 'observationshydrogen', 'holesstellar', 'meynet', 'simulationsstar', 'contractu', 'apparatus', 'magnetohydrodynam', 'microinst', 'aspectsstructurefinanci', 'aspectsregulatori', 'aspectsmarket', 'aspectslega', 'aspectsglp', 'edmt']
Topic 2 :
['magnet', 'micro', 'cell', 'mem', 'photon', 'sensor', 'print', 'materi', 'imag', 'light', 'devic', 'microscopi', 'laser', 'electron', 'optic']
Topic 3 :
['research', 'develop', 'semest', 'solid', 'form', 'treatment', 'week', 'recycl', 'report', 'excurs', 'urban', 'studio', 'project', 'architectur', 'wast']
Topic 4 :
['system', 'citi', 'comput', 'lab', 'signal', 'imag', 'algorithm', 'speech', 'robot', 'project', 'digit', 'urban', 'data', 'studio', 'architectur']
Topic 5 :
['mass', 'fluid', 'numer', 'structur', 'thermodynam', 'seismic', 'ah', 'heat', 'protein', 'energi', 'chemic', 'reaction', 'flow', 'cell', 'steel']
Topic 6 :
['immers', 'scientif', 'wet', 'cell', 'lab', 'obtain', 'sv', 'laboratori', 'ssv', 'host', 'biolog', 'lase

In [18]:
# Each topic as a combination of documents
for topic in range(-1, -11, -1):
    # Collect words from the two courses with the largest weight for this topic
    words = np.concatenate([courses[t]['cut_wordlist'] for t in np.argsort(V_t[topic])[-2:]])
    print('Topic', -topic, ':')
    print(words[:10])

Topic 1 :
['contact' 'edmt' 'administration' 'enrollment' 'training' 'rotation']
Topic 2 :
['addresses' 'implementation' 'organic' 'printed' 'electronics'
 'technologies' 'large' 'area' 'manufacturing' 'techniques']
Topic 3 :
['studio' 'explores' 'meaningful' 'form' 'generating' 'processes'
 'algorithmic' 'parametric' 'tools' 'introduces']
Topic 4 :
['studio' 'explores' 'meaningful' 'form' 'generating' 'processes'
 'algorithmic' 'parametric' 'tools' 'introduces']
Topic 5 :
['covers' 'basic' 'aspects' 'numerical' 'discretization' 'solution' 'fluid'
 'flow' 'heat' 'transfer']
Topic 6 :
['engage' 'laboratory' 'based' 'project' 'field' 'molecular' 'medicine'
 'neuroscience' 'bioengineering' 'projects']
Topic 7 :
['advanced' 'topics' 'structural' 'steel' 'seismic' 'topics' 'include'
 'bolted' 'welded' 'beam']
Topic 8 :
['goal' 'provide' 'main' 'formalisms' 'models' 'algorithms' 'required'
 'implementation' 'advanced' 'speech']
Topic 9 :
['engage' 'laboratory' 'based' 'project' 'field' 'mole

In [None]:
#TODO: find labels; is the thing above a combination?

## Exercise 4.6: Document similarity search in concept-space

In [19]:
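def term_document_similarity(query):
    """Sum, over the query terms, the cosine similarity between the term and each course in concept space."""
    similarities = np.zeros(len(courses))
    for term in query.split(' '):
        # Row of U for this term (terms.index raises ValueError for unknown terms)
        t_index = terms.index(term)
        u_t = U[t_index]

        for ix in range(len(courses)):
            # Concept vector of course ix, scaled by the singular values
            v_T_d = V_t[:, ix]
            s_v_T_d = np.dot(s, v_T_d)

            numerator = np.dot(u_t, s_v_T_d)
            denominator = np.linalg.norm(u_t) * np.linalg.norm(s_v_T_d)
            similarities[ix] += numerator / denominator
    return similarities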


def LSI_search(query, no_top=5):
    """Print the names and indices of the no_top courses most similar to the query."""
    similarities = term_document_similarity(query)
    top_results = np.argsort(similarities)[::-1][0:no_top]
    for top in top_results:
        print('{0}: {1}'.format(courses[top]['name'], top))
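
The double loop above can be slow for a few hundred courses and 300 concepts. As a sketch only, the same sum of cosine similarities can be computed with matrix operations. The function name `term_document_similarity_vec`, its arguments and the toy arrays are assumptions for illustration; it takes term indices and the SVD factors explicitly instead of reading the notebook's globals.

```python
import numpy as np

def term_document_similarity_vec(term_indices, U, S, V_t):
    """Sum over the query terms of cos(u_t, S v_d) for every document d."""
    docs = S[:, None] * V_t                           # (k, m): column d is diag(S) @ v_d
    docs = docs / np.linalg.norm(docs, axis=0)        # unit-length document columns
    q = U[term_indices]                               # (q, k): one row per query term
    q = q / np.linalg.norm(q, axis=1, keepdims=True)  # unit-length term rows
    return (q @ docs).sum(axis=0)                     # (m,): one score per document

# Hypothetical toy factors: 6 terms, 3 concepts, 5 documents
rng = np.random.default_rng(0)
U_toy = rng.standard_normal((6, 3))
S_toy = np.array([1.0, 2.0, 3.0])
V_t_toy = rng.standard_normal((3, 5))

scores = term_document_similarity_vec([0, 4], U_toy, S_toy, V_t_toy)
print(np.argsort(scores)[::-1])  # document indices, best match first
```

Scaling each row of `V_t` by the matching singular value is equivalent to multiplying by the diagonal matrix `s`, so the scores should match the loop version up to floating-point error.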

In [20]:
LSI_search('facebook', 5)

Computational Social Media: 798
Social media: 407
Studio MA2 (Escher et GuneWardena): 59
Human computer interaction: 521
Transport phenomena II: 29


In [21]:
LSI_search('markov chain', 5)

Applied stochastic processes: 80
Applied probability & stochastic processes: 398
Markov chains and algorithmic applications: 245
Supply chain management: 44
Mathematical models in supply chain management: 99


In [22]:
#TODO: compare with the results of the previous section

## Exercise 4.7: Document-document similarity

In [23]:
#TODO equation?
# Document-document similarity: compute the cosine similarity between the concept
# vector of a given document (its column of V_t) and the concept vectors of all other documents.

In [24]:
def cosine_sim(d1, d2):
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

In [28]:
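# Index of the course COM-308 in the course list
IX_id = next(index for (index, d) in enumerate(courses) if d["courseId"] == "COM-308")

# Cosine similarity between COM-308 and every course (the columns of V_t)
IX_sim = np.apply_along_axis(cosine_sim, 0, V_t, V_t[:, IX_id])
# Take the 6 best matches, because the course itself is among them
IX_sim5 = np.argsort(IX_sim)[::-1][0:6]

IX_top_courses = {}
for course_id in IX_sim5:
    if course_id != IX_id:
        # Look the score up by index so name and score always belong together
        IX_top_courses[courses[course_id]['name']] = IX_sim[course_id]

IX_top_courses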

{'A Network Tour of Data Science': 0.36356197773259208,
 'Distributed information systems': 0.53311533332506555,
 'Financial big data': 0.43771257172080819,
 'Graph theory': 0.25801553622868179,
 'Networks out of control': 0.34966640343643635}
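
The search above uses the cosine similarity $\frac{v_i \cdot v_j}{\lVert v_i \rVert \, \lVert v_j \rVert}$ between the columns $v_i$ and $v_j$ of $V^T$. As a sketch under the same definition, the ranking can be done without `apply_along_axis` by normalising the columns once. The helper name `most_similar_documents` and the toy matrix are hypothetical.

```python
import numpy as np

def most_similar_documents(V_t, doc_index, top_n=5):
    """Return (index, cosine similarity) pairs for the top_n documents closest to doc_index."""
    unit = V_t / np.linalg.norm(V_t, axis=0)  # normalise each document column
    sims = unit.T @ unit[:, doc_index]        # cosine similarity to every document
    sims[doc_index] = -np.inf                 # exclude the query document itself
    best = np.argsort(sims)[::-1][:top_n]
    return [(int(i), float(sims[i])) for i in best]

# Hypothetical toy concept-document matrix: 3 concepts, 8 documents
rng = np.random.default_rng(1)
V_t_toy = rng.standard_normal((3, 8))

print(most_similar_documents(V_t_toy, doc_index=0, top_n=3))
```

Masking the query document before sorting avoids having to request one extra result and filter it out afterwards.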

In [None]:
#TODO short comment?