# Text 2: Latent semantic indexing
**Internet Analytics - Lab 4**

---

**Group:** *Y*

**Names:**

* *Kristian Aurlien*
* *Mateusz Paluchowski*

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse.linalg import svds
from utils import load_json, load_pkl

## Exercise 4.4: Latent semantic indexing

In [2]:
# n x m matrix where n is number of terms and m is number of documents
X = load_pkl('tfidx_matrix.pkl')
terms = load_pkl('terms.pkl')
courses = load_pkl('courses.pkl')

n, m = X.shape
print('Number of terms (n) =', n)
print('Number of courses (m) =', m)

Number of terms (n) = 10875
Number of courses (m) = 854


In [3]:
U, S, V_t = svds(X, k=300, which='LM')

In [4]:
# Check the shapes of the three factors returned by svds

print('U:', U.shape)
print('S:', S.shape)
print('V^T:', V_t.shape)

U: (10875, 300)
S: (300,)
V^T: (300, 854)


    1. Describe the rows and columns of U and V, and the values of S.
#### $U$: Term-concept mapping

The $n$ rows of the $U$ matrix given by the SVD map terms to concepts. Each row corresponds to one term, and the $i$-th value in that row shows how strongly the term relates to concept $c_i$. Each column, conversely, describes one concept as a weighted combination of terms.

#### $V^T$: Course-concept mapping
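Similarly, the $m$ columns of the $V^T$ matrix show how strongly each course relates to each concept: each column corresponds to one course, and each row describes one concept as a weighted combination of courses.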

#### $S$: Concept-"strength"
The singular values in $S$ show how "strong" each concept is: the larger the value, the "stronger" the concept, i.e. the more it contributes to reconstructing $X$.
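
As a minimal sketch of these three factors, assuming a small random matrix `X_toy` in place of the real TF-IDF matrix (all `*_toy` names are illustrative and not part of the lab code), the shapes and the rank-$k$ reconstruction $X \approx U \, \mathrm{diag}(S) \, V^T$ can be checked as follows. The singular values are sorted explicitly here rather than relying on the order in which `svds` returns them.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy stand-in for the term-document matrix: 8 terms x 6 documents (assumed data)
rng = np.random.default_rng(0)
X_toy = rng.random((8, 6))

k = 3
U_toy, S_toy, Vt_toy = svds(X_toy, k=k)

# Sort the concepts from the largest to the smallest singular value
order = np.argsort(S_toy)[::-1]
U_toy, S_toy, Vt_toy = U_toy[:, order], S_toy[order], Vt_toy[order, :]

print('U:', U_toy.shape)      # one row per term, one column per concept
print('S:', S_toy.shape)      # one singular value per concept
print('V^T:', Vt_toy.shape)   # one row per concept, one column per document

# Rank-k approximation of X_toy built from the three factors
X_approx = U_toy @ np.diag(S_toy) @ Vt_toy
print('Relative reconstruction error:',
      np.linalg.norm(X_toy - X_approx) / np.linalg.norm(X_toy))
```

Increasing `k` towards the rank of the matrix should drive the reconstruction error towards zero.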


    2. Print the top-20 eigenvalues of X.  

In [5]:
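# Reverse S and keep the first 20 values; this assumes svds returned the
# singular values in ascending order, so the largest come first after reversing
top20_singular = S[::-1][:20]
top20_singular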

array([ 349.98468286,  223.52636182,  211.60463826,  204.95465982,
        192.72277657,  191.73736178,  188.90578214,  186.88000684,
        182.26580104,  177.04396616,  172.45494585,  171.3265112 ,
        168.98098613,  164.52156741,  161.13242853,  160.75631113,
        158.58393064,  157.0648853 ,  155.26142247,  153.10302775])

In [6]:
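# X is not square, so we report the squared singular values,
# which are the eigenvalues of X^T X (and the non-zero eigenvalues of X X^T)
top20_eigenvalues = [x*x for x in top20_singular]
top20_eigenvalues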

[122489.27823503192,
 49964.03442985985,
 44776.522932370077,
 42006.412581026954,
 37142.068610249044,
 36763.21590353275,
 35685.394527661418,
 34924.136956224233,
 33220.82222913039,
 31344.565952711291,
 29740.7083469973,
 29352.773441369587,
 28554.573672492083,
 27067.346143503568,
 25963.659525275776,
 25842.591568733347,
 25148.863057568589,
 24669.378195278983,
 24106.109307432598,
 23440.537106715179]
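
Since $X$ is not square, the values above are best read as eigenvalues of $X^T X$, which equal the squared singular values of $X$. The following sketch checks that relation on a small random matrix (`X_toy` is assumed toy data, not the lab matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
X_toy = rng.random((7, 4))   # assumed toy data: 7 terms x 4 documents

# Singular values of X_toy, largest first
singular_values = np.linalg.svd(X_toy, compute_uv=False)

# Eigenvalues of the symmetric matrix X^T X; eigvalsh returns them in
# ascending order, so reverse to get the largest first
eigenvalues = np.linalg.eigvalsh(X_toy.T @ X_toy)[::-1]

print(singular_values ** 2)
print(eigenvalues)
```

Both printed arrays should agree up to floating-point error.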

## Exercise 4.5: Topic extraction

In [7]:
s = np.diag(S)

In [8]:
# Each column of U contains one value per token in the vocabulary
U[:,-1].shape

(10875,)

In [9]:
# combination of terms
for concept in range(-1,-11,-1):
    words = [terms[t] for t in np.argsort(U[:,concept])[-15:]]
    print('Concept', -concept,':')
    print(words)

Concept 1 :
['comput', 'architectur', 'imag', 'risk', 'engin', 'energi', 'electron', 'materi', 'data', 'project', 'optic', 'process', 'model', 'design', 'system']
Concept 2 :
['magnet', 'semiconductor', 'micro', 'mem', 'photon', 'sensor', 'materi', 'print', 'imag', 'light', 'devic', 'microscopi', 'laser', 'electron', 'optic']
Concept 3 :
['data', 'research', 'semest', 'develop', 'recycl', 'excurs', 'form', 'report', 'design', 'week', 'urban', 'studio', 'project', 'architectur', 'wast']
Concept 4 :
['lab', 'comput', 'imag', 'project', 'signal', 'algorithm', 'robot', 'system', 'speech', 'design', 'data', 'urban', 'digit', 'studio', 'architectur']
Concept 5 :
['common', 'roman', 'risk', 'territori', 'fiber', 'waveguid', 'citi', 'light', 'urban', 'studio', 'imag', 'architectur', 'laser', 'wast', 'optic']
Concept 6 :
['sensor', 'digit', 'seismic', 'system', 'urban', 'power', 'wast', 'studio', 'electron', 'devic', 'circuit', 'design', 'steel', 'print', 'architectur']
Concept 7 :
['reaction',

In [10]:
# combination of documents
for concept in range(-1,-11,-1):
    # Collect some tokens from each course
    words = np.concatenate([courses[t]['tokens'] \
                            for t in np.argsort(V_t[concept])[-2:]])
    print('Concept', -concept,':')
    print(words[:10])

Concept 1 :
['book' 'solid' 'waste' 'engineering' 'global' 'perspective' 'basis'
 'textbook' 'excellent' 'introduction']
Concept 2 :
['addresses' 'implementation' 'organic' 'printed' 'electronics'
 'technologies' 'large' 'area' 'manufacturing' 'techniques']
Concept 3 :
['studio' 'explores' 'meaningful' 'form' 'generating' 'processes'
 'algorithmic' 'parametric' 'tools' 'introduces']
Concept 4 :
['studio' 'explores' 'meaningful' 'form' 'generating' 'processes'
 'algorithmic' 'parametric' 'tools' 'introduces']
Concept 5 :
['commons' 'part' 'appia' 'novissima' 'tackle' 'urgent' 'issue'
 'rebuilding' 'shared' 'infrastructure']
Concept 6 :
['addresses' 'implementation' 'organic' 'printed' 'electronics'
 'technologies' 'large' 'area' 'manufacturing' 'techniques']
Concept 7 :
['commons' 'part' 'tackle' 'urgent' 'issue' 'rebuilding' 'shared'
 'infrastructure' 'european' 'territory']
Concept 8 :
['goal' 'provide' 'students' 'main' 'formalisms' 'models' 'algorithms'
 'required' 'implementation' 

For some of the concepts it is hard to see the connection, while for others, like $C_4$, it seems obvious. Nevertheless, we have done our best to give each concept a name:


- Concept 1: "Physics"- / "Administrative"+
- Concept 2: "Finance"
- Concept 3: "Optics"
- Concept 4: "Environmental engineering"+
- Concept 5: "Thermodynamics / Mathematics"
- Concept 6: "Electronics"
- Concept 7: "Signal Processing"
- Concept 8: "Computer Science / AI"
- Concept 9: "Chemistry"
- Concept 10: "Fluid Dynamics"
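
The "+" and "-" marks above refer to the sign of the weights. The SVD fixes each concept only up to a simultaneous sign flip of its term and document vectors, so it can help to look at both ends of a column of $U$. A minimal sketch, assuming a hypothetical helper `top_terms_both_signs` and made-up toy weights:

```python
import numpy as np

def top_terms_both_signs(U, terms, concept, n_top=3):
    """Return the terms with the largest and the most negative weights for a concept."""
    order = np.argsort(U[:, concept])                 # ascending by weight
    positive = [terms[i] for i in order[::-1][:n_top]]
    negative = [terms[i] for i in order[:n_top]]
    return positive, negative

# Toy vocabulary and a single made-up concept column (assumed data)
toy_terms = ['optic', 'laser', 'urban', 'studio', 'data']
toy_U = np.array([[0.6], [0.5], [-0.4], [-0.5], [0.1]])

positive, negative = top_terms_both_signs(toy_U, toy_terms, concept=0, n_top=2)
print('+', positive)   # ['optic', 'laser']
print('-', negative)   # ['studio', 'urban']
```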


## Exercise 4.6: Document similarity search in concept-space

In [11]:
U, S, V_t = svds(X, k=300, which='LM')
s = np.diag(S)

In [12]:
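def term_document_similarity(query):
    # For every course, sum over the query terms the cosine similarity between
    # the term's concept vector u_t and the course's scaled concept vector S v_d
    similarities = np.zeros(len(courses))
    for term in query.split(' '):
        # Raises ValueError if the term is not in the vocabulary
        t_index = terms.index(term)
        u_t = U[t_index]

        for ix in range(len(courses)):
            v_T_d = V_t[:,ix]

            s_v_T_d = np.dot(s, v_T_d)
            numerator = np.dot(u_t, s_v_T_d)
            denominator = np.linalg.norm(u_t) * np.linalg.norm(s_v_T_d)
            sim = numerator / denominator
            similarities[ix] += sim
    return similarities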
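
The double loop can also be written with matrix operations. The sketch below is one possible vectorised variant, not the code used for the results in this notebook. The function name, its arguments and the toy data are assumptions, and the query is passed as a list of term indices rather than as a string.

```python
import numpy as np

def term_document_similarity_vectorized(query_indices, U, S, V_t):
    """Summed cosine similarity between each query term and every document."""
    docs = S[:, None] * V_t                      # column d equals diag(S) @ V_t[:, d]
    docs = docs / np.linalg.norm(docs, axis=0)   # unit-length document columns
    query_rows = U[query_indices]                # one row of U per query term
    query_rows = query_rows / np.linalg.norm(query_rows, axis=1, keepdims=True)
    return (query_rows @ docs).sum(axis=0)       # add up the per-term cosines

# Toy factors: 5 terms, 3 concepts, 4 documents (assumed data)
rng = np.random.default_rng(2)
toy_U = rng.normal(size=(5, 3))
toy_S = np.array([3.0, 2.0, 1.0])
toy_V_t = rng.normal(size=(3, 4))

scores = term_document_similarity_vectorized([0, 2], toy_U, toy_S, toy_V_t)
print(scores)
```

Because the documents are normalised once instead of once per query term, this avoids recomputing the same norms inside the loop.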

In [13]:
def LSI_search(query, no_top=5):
    # Print the name and course index of the no_top best-matching courses
    similarities = term_document_similarity(query)
    top_results = np.argsort(similarities)[::-1][0:no_top]
    for top in top_results:
        print('{0}: {1}'.format(courses[top]['name'], top))

In [14]:
LSI_search('facebook', 5)

Computational Social Media: 798
Social media: 407
Studio MA2 (Escher et GuneWardena): 59
Transport phenomena II: 29
Human computer interaction: 521


In [15]:
LSI_search('markov chain', 5)

Applied stochastic processes: 80
Applied probability & stochastic processes: 398
Markov chains and algorithmic applications: 245
Supply chain management: 44
Mathematical models in supply chain management: 99


### VSM

###### Query: "facebook"

'Computational Social Media': 0.17945984867925702,

'CCMX Advanced Course - Instrumented Nanoindentation': 0.0,

'Electronic properties of solids and superconductivity': 0.0,

'Hydrogeophysics': 0.0,

'Molecular and cellular biophysic II': 0.0


###### Query: "markov chains"


'Applied probability & stochastic processes', 0.55400769353626178,

'Applied stochastic processes', 0.55211833344995098,

'Markov chains and algorithmic applications', 0.38168653789318985,

'Supply chain management', 0.37852761365218429,

'Mathematical models in supply chain management', 0.31162506776787757

### LSI

###### Query: "facebook"

Results in ranked order (the number after each name is the course index, not a score):

Computational Social Media: 798

Social media: 407

Studio MA2 (Escher et GuneWardena): 59

Transport phenomena II: 29

Human computer interaction: 521

###### Query: "markov chains"

Results in ranked order (the number after each name is the course index, not a score):

Applied stochastic processes: 80

Applied probability & stochastic processes: 398

Markov chains and algorithmic applications: 245

Supply chain management: 44

Mathematical models in supply chain management: 99


As the results above show, LSI search performs noticeably better than VSM, especially for queries containing terms that occur in only a few documents. For "facebook", VSM finds a single course with a non-zero score, whereas LSI also returns related courses such as "Social media" and "Human computer interaction". The reason is that a rare term is mapped to a more general concept shared by many documents, so a relevant document can be found in concept space even when it does not contain the term itself, which the naive term-frequency approach cannot do.
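
The effect can be reproduced on a tiny made-up corpus. In the sketch below the term-document counts are assumed toy data, and `np.linalg.svd` with truncation is used in place of `svds` because the matrix is so small. Document `d1` never mentions "facebook" but shares "social" and "media" with `d0`.

```python
import numpy as np

toy_terms = ['facebook', 'social', 'media', 'fluid']
# Columns: d0 mentions facebook, d1 only "social media", d2 is unrelated
X_toy = np.array([[1., 0., 0.],
                  [1., 1., 0.],
                  [1., 1., 0.],
                  [0., 0., 1.]])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Term-space (VSM-style) similarity: the query is a one-hot vector for "facebook"
query = np.zeros(len(toy_terms))
query[toy_terms.index('facebook')] = 1.0
vsm = np.array([cosine(query, X_toy[:, d]) for d in range(X_toy.shape[1])])

# Concept-space similarity using the two strongest concepts
U_full, S_full, Vt_full = np.linalg.svd(X_toy, full_matrices=False)
k = 2
U_k, S_k, Vt_k = U_full[:, :k], S_full[:k], Vt_full[:k, :]
u_t = U_k[toy_terms.index('facebook')]
lsi = np.array([cosine(u_t, S_k * Vt_k[:, d]) for d in range(X_toy.shape[1])])

print('VSM:', np.round(vsm, 2))
print('LSI:', np.round(lsi, 2))
```

In this toy corpus `d1` gets a term-space score of 0, yet in concept space it is as close to the query as `d0`, while the unrelated `d2` stays near 0.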

## Exercise 4.7: Document-document similarity

The simplest and most efficient way to compute document-document similarity is to take the cosine similarity between the concept-space representation of the given document (its column of `V_t`) and the concept-space representations of all other documents.

In [16]:
def cosine_sim(d1, d2):
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

In [17]:
COM_308_id = next(index for (index, d) in enumerate(courses) if d["courseId"] == "COM-308")
COM_308_id

43

In [18]:
COM_308_similarities = np.apply_along_axis(cosine_sim, 0, V_t, V_t[:, COM_308_id])

In [19]:
# Take the 6 highest similarities: COM-308 itself is among them and is filtered out below
top_5_COM_308 = np.argsort(COM_308_similarities)[::-1][0:6]

In [20]:
# Map each recommended course name to its similarity, skipping COM-308 itself
top_COM_308 = {}
for top in top_5_COM_308:
    if top != COM_308_id:
        top_COM_308[courses[top]['name']] = COM_308_similarities[top]

In [21]:
top_COM_308

{'A Network Tour of Data Science': 0.34372597801585447,
 'Data science for business': 0.26147945878886086,
 'Distributed information systems': 0.5356748810502423,
 'Financial big data': 0.43121459718942151,
 'Networks out of control': 0.35117786627061409}
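
The dictionary above is not displayed in order of similarity. A possible alternative, sketched here with a hypothetical helper `most_similar_documents` and made-up toy vectors, returns a ranked list of `(name, similarity)` pairs and computes all cosine similarities in one matrix product:

```python
import numpy as np

def most_similar_documents(V_t, doc_id, names, n_top=5):
    """Rank documents by cosine similarity to column doc_id of V_t."""
    unit = V_t / np.linalg.norm(V_t, axis=0)   # normalise every document column
    sims = unit.T @ unit[:, doc_id]            # cosine similarity to the query document
    ranked = [i for i in np.argsort(sims)[::-1] if i != doc_id]
    return [(names[i], float(sims[i])) for i in ranked[:n_top]]

# Toy concept-space representation: 2 concepts x 4 documents (assumed data)
toy_names = ['A', 'B', 'C', 'D']
toy_V_t = np.array([[1.0, 0.9, 0.0, 0.5],
                    [0.0, 0.1, 1.0, 0.5]])

toy_ranking = most_similar_documents(toy_V_t, 0, toy_names, n_top=3)
for name, score in toy_ranking:
    print(name, round(score, 3))
```

For the toy data, document B points in almost the same direction as A and is ranked first, followed by D and then C.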

As we can see, the recommended classes revolve around big data and data science. A closer look at the course descriptions confirms that the proposed classes are relevant.