# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *Y*

**Names:**

* *Kristian Aurlien*
* *Mateusz Paluchowski*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds
from utils import load_json, load_pkl

## Exercise 4.4: Latent semantic indexing

In [2]:
# n x m matrix where n is number of terms and m is number of documents
X = load_pkl('tfidx_matrix.pkl')
terms = load_pkl('terms.pkl')
courses = load_pkl('courses.pkl')

n, m = X.shape
print('Number of terms (n) =', n)
print('Number of courses (m) =', m)

Number of terms (n) = 10875
Number of courses (m) = 854


In [3]:
U, S, V_t = svds(X, k=300, which='LM')

In [4]:
'''
I'm in love with the shape of U
We push and pull like a magnet do
Although my heart is falling too
I'm in love with your body
'''

print('U:', U.shape)
print('S:', S.shape)
print('V^T:', V_t.shape)

U: (10875, 300)
S: (300,)
V^T: (300, 854)


    1. Describe the rows and columns of U and V , and the values of S.
#### $U$: Term-concept mapping

The $n$ rows of the $U$-matrix given by the SVD, gives us a mapping from term to concept. Each row is the mapping for one term, and each value $v_i$ in that row shows how strongly that term relates to concept $c_i$

#### $V^T$: Cource-concept mapping
Similarly, the $m$ columns of the $V^T$ matrix shows how strongly each course corresponds to each concept.

#### $S$: Concept-"strength"
The singular valies of $S$ shows how "strong" the concept is - the bigger the value is, the "stronger" the concept.


    2. Print the top-20 eigenvalues of X.  

In [5]:
top20_singular = S[::-1][:20]
top20_singular

array([ 349.98468286,  223.52636182,  211.60463826,  204.95465982,
        192.72277657,  191.73736178,  188.90578214,  186.88000684,
        182.26580104,  177.04396616,  172.45494585,  171.3265112 ,
        168.98098613,  164.52156741,  161.13242853,  160.75631113,
        158.58393064,  157.0648853 ,  155.26142247,  153.10302775])

In [6]:
top20_eigenvalues = [x*x for x in top20_singular]
top20_eigenvalues

[122489.27823503192,
 49964.03442985985,
 44776.522932370077,
 42006.412581026954,
 37142.068610249044,
 36763.21590353275,
 35685.394527661418,
 34924.136956224233,
 33220.82222913039,
 31344.565952711291,
 29740.7083469973,
 29352.773441369587,
 28554.573672492083,
 27067.346143503568,
 25963.659525275776,
 25842.591568733347,
 25148.863057568589,
 24669.378195278983,
 24106.109307432598,
 23440.537106715179]

## Exercise 4.5: Topic extraction

In [7]:
s = np.diag(S)

In [30]:
U[:,-1].shape

(10875,)

In [35]:
# combination of terms
for concept in range(-1,-11,-1):
    words = [terms[t] for t in np.argsort(U[:,concept])[-10:]]
    print('Concept', -concept,':')
    print(words)

Concept 1 :
['energi', 'electron', 'materi', 'data', 'project', 'optic', 'process', 'model', 'design', 'system']
Concept 2 :
['sensor', 'materi', 'print', 'imag', 'light', 'devic', 'microscopi', 'laser', 'electron', 'optic']
Concept 3 :
['excurs', 'form', 'report', 'design', 'week', 'urban', 'studio', 'project', 'architectur', 'wast']
Concept 4 :
['algorithm', 'robot', 'system', 'speech', 'design', 'data', 'urban', 'digit', 'studio', 'architectur']
Concept 5 :
['waveguid', 'citi', 'light', 'urban', 'studio', 'imag', 'architectur', 'laser', 'wast', 'optic']
Concept 6 :
['power', 'wast', 'studio', 'electron', 'devic', 'circuit', 'design', 'steel', 'print', 'architectur']
Concept 7 :
['common', 'polici', 'chemic', 'print', 'electron', 'protein', 'architectur', 'cell', 'energi', 'risk']
Concept 8 :
['algorithm', 'voic', 'model', 'code', 'robot', 'process', 'wast', 'signal', 'recognit', 'speech']
Concept 9 :
['electron', 'host', 'wearabl', 'circuit', 'report', 'lab', 'sensor', 'devic', 'pri

In [37]:
# combination of documents
for concept in range(-1,-11,-1):
    words = [courses[t]['courseId] for t in np.argsort(V_t[concept])[-10:]]
    print('Concept', -concept,':')
    print(words)

Concept 1 :
[{'courseId': 'AR-402(y)', 'description': 'This studio explores meaningful form generating processes by the use of algorithmic and parametric tools and introduces the notion of growth typologies. Our studio site will be in Singapore, our programme the procedural design of an innovation building prototypology. Content The advent of new digital technologies has had a twofold impact on architectural thinking and urban design, transforming, on one hand, the processes for form generation and design production through algorithmic and parametric technologies, and, on the other hand, enabling an escape from the static fate of the built environment by facilitating dynamic interaction between inhabitants and their surrounding. Our interest in the orientation \'Artificial Morphogenesis\' is to explore meaningful form generating processes by the use of algorithmic and parametric tools and introduce the notion of growth typologies in architectural and urban design thinking. In particula

## Exercise 4.6: Document similarity search in concept-space

## Exercise 4.7: Document-document similarity