# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *D*

**Names:**

* *Cyril Cadoux*
* *Marc Bickel*
* *Emma Lejal*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [31]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds

## Exercise 4.4: Latent semantic indexing

In [32]:
from utils import load_pkl

In [33]:
TFIDF = load_pkl('data/tfidf.pkl')

In [34]:
words = load_pkl('data/words.pkl')

In [35]:
words[:10]

['data',
 'comput',
 'engin',
 'energi',
 'physic',
 'mechan',
 'optim',
 'recommend',
 'research',
 'transvers']

In [36]:
docs = load_pkl('data/docs.pkl')

In [37]:
def indexOf(w):
    # Return the index of the (stemmed) term w in `words`, or None if it is absent.
    for i, word in enumerate(words):
        if word == w:
            return i
    return None

In [38]:
U, S, Vt = svds(TFIDF, 300)

#### Top 20 singular values:

In [39]:
S[-20:][::-1]

array([ 38.00094938,  27.33947905,  24.28309272,  23.37363376,
        22.71773375,  22.45765828,  21.75387861,  21.62794474,
        21.19462941,  20.77585478,  20.35596681,  20.26458478,
        20.11926194,  19.91355462,  19.80452182,  19.65565104,
        19.62574102,  19.38970516,  19.33123254,  19.2341546 ])
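
The slice `S[-20:][::-1]` relies on `svds` having returned the singular values in ascending order, which is what the output above shows for this run. A minimal sketch of a more defensive variant follows; it sorts the components explicitly instead of relying on the returned order. The helper name `truncated_svd_desc` and the random toy matrix are assumptions made for illustration and are not part of the lab code.

```python
import numpy as np
from scipy.sparse.linalg import svds


def truncated_svd_desc(matrix, k):
    """Hypothetical helper: rank-k SVD with components sorted by decreasing singular value."""
    U, S, Vt = svds(matrix, k=k)
    order = np.argsort(S)[::-1]          # indices of singular values, largest first
    return U[:, order], S[order], Vt[order, :]


# Toy matrix standing in for the TF-IDF matrix (rows: terms, columns: documents).
rng = np.random.default_rng(0)
toy = rng.random((8, 6))

U_k, S_k, Vt_k = truncated_svd_desc(toy, 3)
print(S_k)  # the three largest singular values, in decreasing order
```

With the components sorted this way, the strongest topic sits at index 0 in `U_k`, `S_k` and `Vt_k`, so no reversed slicing is needed later on.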

## Exercise 4.5: Topic extraction

In [43]:
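# S is in ascending order here (see the output above), so the last 10 components are the strongest.
topTenTerms = U[:,-10:]   # term loadings, one column per topic
topTenDocs = Vt[-10:, :]  # document loadings, one row per topic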

In [45]:
for i in range(10):
    topTenIndexesTerms = np.argsort(topTenTerms[:,i])[::-1]
    topTenIndexesDocs = np.argsort(topTenDocs[i,:])[::-1]
    topicTerms = ''
    topicDocs = ''
    for j in range(10):
        topicTerms += ' + ' + words[topTenIndexesTerms[j]] + ' * ' + str(np.round(topTenTerms[topTenIndexesTerms[j], i], 3))
        topicDocs += ' + ' + docs[topTenIndexesDocs[j]] + ' * ' + str(np.round(topTenDocs[i, topTenIndexesDocs[j]], 3))
        
    print("\nTopic ", i)
    print("Words : ")
    print(topicTerms)
    print("Docs : ")
    print(topicDocs)


Topic  0
Words : 
 + ena * 0.142 + multidisciplinari * 0.129 + mountain * 0.114 + landscap * 0.101 + interdisciplinari * 0.096 + sustain * 0.094 + renew * 0.092 + environ * 0.091 + situat * 0.081 + wetlab * 0.079
Docs : 
 + Renewable energy and solar architecture in Davos * 0.434 + Théorie et critique du projet MA1 (Gugger) * 0.145 + Interdisciplinary / disciplinary project for chemical master * 0.126 + Lab immersion III * 0.123 + Théorie et critique du projet MA2 (Gugger) * 0.118 + Lab immersion I * 0.092 + Lab immersion II * 0.074 + Principles of finance * 0.073 + Hydrogeophysics * 0.058 + Introduction to finance (IF master and minor only) * 0.058

Topic  1
Words : 
 + guidanc * 0.213 + theme * 0.204 + professor * 0.147 + chosen * 0.135 + subject * 0.096 + artist * 0.092 + ena * 0.086 + individu * 0.084 + multidisciplinari * 0.078 + assist * 0.075
Docs : 
 + Renewable energy and solar architecture in Davos * 0.276 + Difficult Double Double Histories * 0.265 + Project in computer sci

## Exercise 4.6: Document similarity search in concept-space

In [None]:
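def sim(t, d):
    # Cosine similarity between term index t and document index d in concept space.
    u_t, sv_d = U[t], S * Vt[:, d]
    return (u_t @ sv_d) / (np.linalg.norm(u_t) * np.linalg.norm(sv_d))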
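
A term-document similarity of this kind can be turned into a search by scoring every document against one query term and sorting the scores. The sketch below is one possible way to do that, assuming the similarity is the cosine between the term vector $u_t$ and the scaled document vector $S v_d$. The function name `rank_documents` and the 4×4 toy matrix are made up for illustration; in the notebook, `U`, `S` and `Vt` would come from `svds(TFIDF, 300)`.

```python
import numpy as np


def rank_documents(U, S, Vt, term_index, top_n=5):
    """Rank documents by cosine similarity to one term in concept space (illustrative)."""
    u_t = U[term_index]                  # concept-space vector of the term, shape (k,)
    doc_vecs = S[:, None] * Vt           # column d holds S * v_d, shape (k, n_docs)
    scores = (u_t @ doc_vecs) / (
        np.linalg.norm(u_t) * np.linalg.norm(doc_vecs, axis=0)
    )
    best = np.argsort(scores)[::-1][:top_n]
    return [(int(d), float(scores[d])) for d in best]


# Toy term-document matrix (rows: terms, columns: documents), invented for this example.
# Terms 0-1 only occur in documents 0-1, terms 2-3 only in documents 2-3.
toy = np.array([[2.0, 1.0, 0.0, 0.0],
                [1.0, 2.0, 0.0, 0.0],
                [0.0, 0.0, 3.0, 1.0],
                [0.0, 0.0, 1.0, 3.0]])
U_toy, S_toy, Vt_toy = np.linalg.svd(toy, full_matrices=False)

# Keep the two strongest concepts and search for term 0.
ranking = rank_documents(U_toy[:, :2], S_toy[:2], Vt_toy[:2, :], term_index=0, top_n=2)
print(ranking)
```

Computing all scores in one matrix product avoids calling a per-pair function once per document, which matters when the collection is large.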

## Exercise 4.7: Document-document similarity