# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *Y*

**Names:**

* *Kristian Aurlien*
* *Mateusz Paluchowski*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [2]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds
from utils import load_json, load_pkl

## Exercise 4.4: Latent semantic indexing

In [3]:
# n x m matrix where n is number of terms and m is number of documents
X = load_pkl('tfidx_matrix.pkl')
terms = load_pkl('terms.pkl')
courses = load_pkl('courses.pkl')

n, m = X.shape
print('Number of terms (n) =', n)
print('Number of courses (m) =', m)

Number of terms (n) = 10875
Number of courses (m) = 854


In [4]:
U, S, V_t = svds(X, k=300, which='LM')

In [5]:
'''
I'm in love with the shape of U
We push and pull like a magnet do
Although my heart is falling too
I'm in love with your body
'''

print('U:', U.shape)
print('S:', S.shape)
print('V^T:', V_t.shape)

U: (10875, 300)
S: (300,)
V^T: (300, 854)


    1. Describe the rows and columns of U and V , and the values of S.
#### $U$: Term-concept mapping

The $n$ rows of the $U$-matrix given by the SVD, gives us a mapping from term to concept. Each row is the mapping for one term, and each value $v_i$ in that row shows how strongly that term relates to concept $c_i$

#### $V^T$: Cource-concept mapping
Similarly, the $m$ columns of the $V^T$ matrix shows how strongly each course corresponds to each concept.

#### $S$: Concept-"strength"
The singular valies of $S$ shows how "strong" the concept is - the bigger the value is, the "stronger" the concept.


    2. Print the top-20 eigenvalues of X.  

In [6]:
top20_singular = S[::-1][:20]
top20_singular

array([ 349.98468286,  223.52636182,  211.60463826,  204.95465982,
        192.72277657,  191.73736178,  188.90578214,  186.88000684,
        182.26580104,  177.04396616,  172.45494585,  171.3265112 ,
        168.98098613,  164.52156741,  161.13242853,  160.75631113,
        158.58393064,  157.0648853 ,  155.26142247,  153.10302775])

In [7]:
top20_eigenvalues = [x*x for x in top20_singular]
top20_eigenvalues

[122489.27823503187,
 49964.034429859865,
 44776.522932370121,
 42006.412581026918,
 37142.068610248956,
 36763.215903532655,
 35685.394527661541,
 34924.136956224007,
 33220.822229130361,
 31344.56595271128,
 29740.708346997264,
 29352.773441369507,
 28554.573672492083,
 27067.346143503579,
 25963.659525275787,
 25842.591568733384,
 25148.863057568571,
 24669.378195278958,
 24106.1093074325,
 23440.537106715161]

## Exercise 4.5: Topic extraction

In [8]:
s = np.diag(S)

In [9]:
U[:,-1].shape

(10875,)

In [10]:
# combination of terms
for concept in range(-1,-11,-1):
    words = [terms[t] for t in np.argsort(U[:,concept])[-10:]]
    print('Concept', -concept,':')
    print(words)

Concept 1 :
['energi', 'electron', 'materi', 'data', 'project', 'optic', 'process', 'model', 'design', 'system']
Concept 2 :
['sensor', 'materi', 'print', 'imag', 'light', 'devic', 'microscopi', 'laser', 'electron', 'optic']
Concept 3 :
['excurs', 'form', 'report', 'design', 'week', 'urban', 'studio', 'project', 'architectur', 'wast']
Concept 4 :
['algorithm', 'robot', 'system', 'speech', 'design', 'data', 'urban', 'digit', 'studio', 'architectur']
Concept 5 :
['waveguid', 'citi', 'light', 'urban', 'studio', 'imag', 'architectur', 'laser', 'wast', 'optic']
Concept 6 :
['power', 'wast', 'studio', 'electron', 'devic', 'circuit', 'design', 'steel', 'print', 'architectur']
Concept 7 :
['common', 'polici', 'chemic', 'print', 'electron', 'protein', 'architectur', 'cell', 'energi', 'risk']
Concept 8 :
['algorithm', 'voic', 'model', 'code', 'robot', 'process', 'wast', 'signal', 'recognit', 'speech']
Concept 9 :
['electron', 'host', 'wearabl', 'circuit', 'report', 'lab', 'sensor', 'devic', 'pri

In [11]:
# combination of documents
for concept in range(-1,-11,-1):
    words = [courses[t]['courseId'] for t in np.argsort(V_t[concept])[-10:]]
    print('Concept', -concept,':')
    print(words)

Concept 1 :
['AR-402(y)', 'MICRO-505', 'HUM-370', 'FIN-404', 'AR-401(y)', 'MSE-803', 'EE-730', 'FIN-402', 'ENV-500', 'ENG-421']
Concept 2 :
['MICRO-562', 'BIOENG-445', 'MICRO-534', 'CH-448', 'MICRO-421', 'MICRO-504', 'MICRO-424', 'MICRO-618', 'MICRO-505', 'MSE-803']
Concept 3 :
['AR-401(b)', 'AR-402(c)', 'AR-401(c)', 'BIO-504', 'BIO-505', 'BIO-506', 'BIO-507', 'AR-402(y)', 'AR-401(y)', 'ENV-500']
Concept 4 :
['AR-401(b)', 'AR-476', 'EE-553', 'MICRO-453', 'EE-432', 'EE-730', 'AR-402(c)', 'AR-401(c)', 'AR-402(y)', 'AR-401(y)']
Concept 5 :
['MICRO-422', 'MICRO-421', 'AR-402(y)', 'BIOENG-445', 'CH-448', 'AR-401(y)', 'MICRO-424', 'AR-402(c)', 'AR-401(c)', 'ENV-500']
Concept 6 :
['EE-730', 'ENV-500', 'AR-402(y)', 'MSE-803', 'AR-402(c)', 'AR-401(c)', 'MICRO-618', 'AR-401(y)', 'MICRO-505', 'CIVIL-435']
Concept 7 :
['Caution, these contents corresponds to the coursebooks of last year', 'AR-401(b)', 'AR-401(y)', 'MICRO-505', 'FIN-402', 'MICRO-618', 'MSE-803', 'HUM-370', 'AR-402(c)', 'AR-401(c)']

## Exercise 4.6: Document similarity search in concept-space

In [74]:
U, S, V_t = svds(X, k=300, which='LM')
s = np.diag(S)

In [75]:
def term_document_similarity(terms):
    similarities = np.zeros(len(courses))
    for term in terms.split(' '):
        t_index = terms.index(term)
        u_t = U[t_index]
        
        for ix, d in enumerate(courses):
            v_T_d = V_t[:,ix]
    
            s_v_T_d = np.dot(s, v_T_d)
            nominator = np.dot(u_t, s_v_T_d)
            denominator = np.linalg.norm(u_t) * np.linalg.norm(s_v_T_d)
            sim = nominator / denominator
            similarities[ix] += sim
    return similarities

In [76]:
def LSI_search(terms, no_top=5):
    similiarities = term_document_similarity(terms)
    top_results = np.argsort(similiarities)[::-1][0:no_top]
    for top in top_results:
        print('{0}: {1}'.format(courses[top]['name'], top))

In [77]:
LSI_search('facebook', 5)

Microelectronics: 572
Semiconductor physics and fundamentals of electronic devices: 231
Project 2 (EDIC): 56
Project 1 (EDIC): 36
Properties of semiconductors and related nanostructures: 438


In [78]:
LSI_search('markov chains', 5)

Bioprocesses and downstream processing: 143
Biochemical engineering: 146
Microelectronics: 572
Semiconductor physics and fundamentals of electronic devices: 231
Project in Biotechnology: 733


### VSM

###### Query: "facebook"

'Computational Social Media': 0.17945984867925702,

'CCMX Advanced Course - Instrumented Nanoindentation': 0.0,

'Electronic properties of solids and superconductivity': 0.0,

'Hydrogeophysics': 0.0,

'Molecular and cellular biophysic II': 0.0


###### Query: "markov chains"


'Applied probability & stochastic processes', 0.55400769353626178,

'Applied stochastic processes', 0.55211833344995098,

'Markov chains and algorithmic applications', 0.38168653789318985,

'Supply chain management', 0.37852761365218429,

'Mathematical models in supply chain management', 0.31162506776787757

### VSM

###### Query: "facebook"




###### Query: "markov chains"




## Exercise 4.7: Document-document similarity