# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *N*

**Names:**

* *Anh Nghia Khau (223613)*
* *Sandra Djambazovska (224638)*



---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
import json
import pickle
import math
import operator
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl

from nltk import pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import wordnet

In [2]:
data_wiki = sc.textFile("/ix/wikipedia-for-schools.txt").map(json.loads)

## Load data from previous exercise

In [17]:
def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((  loader['data'], loader['indices'], loader['indptr']),
                         shape = loader['shape'])

In [18]:
def load_data():
    tf_matrix    = load_sparse_csr("tf_matrix.npz")
    tfidf_matrix = load_sparse_csr("tfidf_matrix.npz")
    doc_indices  = load_json('doc_indices.txt')[0]
    term_indices = load_json('term_indices.txt')[0]
    indices_term = load_json('indices_term.txt')[0]
    doc_names    = load_json('doc_names.txt')[0]
    
    return tf_matrix, tfidf_matrix, doc_indices, term_indices, indices_term, doc_names 

## Exercise 4.8: Topics extraction (without parameters)

In [19]:
def topics_extraction_1(nbTopics, n_terms):
    tf_matrix, tfidf_matrix, doc_indices, term_indices, indices_term, doc_names = load_data()
    
    data = sc.parallelize([Vectors.dense(x) for x in tfidf_matrix.todense().T])
    corpus = data.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
    MAGIC_NUMBER = 42
    ldaModel = LDA.train(corpus, k=nbTopics, seed=MAGIC_NUMBER)
    topics = ldaModel.topicsMatrix()
    
    print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
    for i in range(nbTopics):
        print("Topic " + str(i) + ":")
        # Take the top n_terms terms of topic i
        score_abs = [np.abs(x) for x in topics[:,i]]
        topic = np.argsort(score_abs)[::-1][:n_terms]
        for y in topic:
            print("   '{k}'".format(k=indices_term[str(y)]))  
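
The top-term selection used in `topics_extraction_1` can be tried on a toy matrix without Spark. This is only a sketch: the matrix values, the vocabulary and the helper name `top_terms` are made up for illustration, and the layout (rows are terms, columns are topics) is assumed to match what `topicsMatrix()` returns.

```python
import numpy as np

# Toy topics matrix: rows are vocabulary terms, columns are topics.
# The weights are invented for illustration only.
toy_topics = np.array([
    [0.50, 0.05],
    [0.10, 0.60],
    [0.30, 0.05],
    [0.10, 0.30],
])
toy_vocab = {0: "algebra", 1: "market", 2: "geometry", 3: "finance"}


def top_terms(topics, vocab, topic_id, n_terms):
    """Return the n_terms highest-weighted terms of one topic."""
    # argsort is ascending, so reverse it and keep the first n_terms indices.
    order = np.argsort(topics[:, topic_id])[::-1][:n_terms]
    return [vocab[int(idx)] for idx in order]


for topic_id in range(toy_topics.shape[1]):
    print("Topic", topic_id, top_terms(toy_topics, toy_vocab, topic_id, n_terms=2))
```

Since the entries of a topic column are non-negative weights, sorting the raw values and sorting their absolute values give the same ranking.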

In [20]:
topics_extraction_1(nbTopics=10, n_terms=15)

Learned topics (as distributions over vocab of 10929 words):
Topic 0:
   'algebra'
   'geometry'
   'algebraic'
   'fracture'
   'decision'
   'digital'
   'goal'
   'finite'
   'mechanic'
   'heat'
   'critique'
   'interest'
   'topology'
   'ring'
   'elementary'
Topic 1:
   'film'
   'mechanical'
   'microscopy'
   'thin'
   'sensor'
   'electrochemical'
   'magnetic'
   'surface'
   'interface'
   'polymer'
   'device'
   'electron'
   'electrical'
   'electronics'
   'powder'
Topic 2:
   'doctoral'
   'note'
   'edms'
   'biology'
   'gene'
   'cancer'
   'year'
   'protein'
   'signal'
   'access'
   'tumor'
   'cycle'
   'priority'
   'module'
   'expression'
Topic 3:
   'optical'
   'equation'
   'optic'
   'regression'
   'reaction'
   'propagation'
   'nuclear'
   'reactor'
   'equilibrium'
   'electromagnetic'
   'light'
   'fluid'
   'image'
   'evolution'
   'description'
Topic 4:
   'market'
   'financial'
   'stochastic'
   'pricing'
   'finance'
   'probability'
   'ri

Compared with LSI: here we can recognise many more topics than with LSI.

T0: math?, T1: electronics, T2: biology, T3: physics?, T4: finance, T5: life science, T6: electrical engineering, T7: research, T8: physics, T9: ?

## Exercise 4.9: Dirichlet hyperparameters

##### Fix beta (distribution of words in topics)

In [21]:
def topics_extraction_2(nbTopics, beta, n_terms):
    tf_matrix, tfidf_matrix, doc_indices, term_indices, indices_term, doc_names = load_data()
    data = sc.parallelize([Vectors.dense(x) for x in tfidf_matrix.todense().T])
    corpus = data.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
    MAGIC_NUMBER = 42
    
    for alpha in [ 1.01, 2.01, 5.01, 10.01, 50.01, 100.01]:
        print("Alpha = {a}".format(a=alpha))
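        # In pyspark, docConcentration is the Dirichlet prior on each document's topic distribution
        # and topicConcentration is the Dirichlet prior on each topic's word distribution.
        ldaModel = LDA.train(corpus, k=nbTopics, seed=MAGIC_NUMBER, optimizer='em', docConcentration=beta, topicConcentration=alpha)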
        topics = ldaModel.topicsMatrix()
        for i in range(nbTopics):
            print("     Topic " + str(i) + ":")
            # Take the top n_terms terms of topic i
            topic = np.argsort(topics[:,i])[::-1][:n_terms]
            for y in topic:
                print("          '{k}'".format(k=indices_term[str(y)]))

In [22]:
topics_extraction_2(nbTopics=10, beta = 1.01, n_terms=15)

Alpha = 1.01
     Topic 0:
          'management'
          'firm'
          'innovation'
          'decision'
          'case'
          'plasma'
          'security'
          'urban'
          'supply'
          'business'
          'company'
          'risk'
          'analytics'
          'professional'
          'privacy'
     Topic 1:
          'sensor'
          'optical'
          'film'
          'thin'
          'electrochemical'
          'mechanical'
          'laser'
          'surface'
          'polymer'
          'magnetic'
          'device'
          'semiconductor'
          'ceramic'
          'electrical'
          'optic'
     Topic 2:
          'doctoral'
          'edms'
          'biology'
          'note'
          'gene'
          'development'
          'tissue'
          'cancer'
          'image'
          'microscopy'
          'year'
          'light'
          'business'
          'tumor'
          'mandate'
     Topic 3:
          'equation'
         

##### alpha = 1.01:
T0: management technology, T1: electrical engineering, T2: biology, T3: physics???, T4: finance, T5: ???, T6: electronics, T7: research, T8: ???, T9: math
##### alpha = 2.01:
T0: cryptography, T1: ???, T2: syscom, T3: chemical, T4: finance, T5: biology, T6: electrical engineering, T7: research, T8: computer science?, T9: syscom
##### alpha = 5.01:
T0: ???, T1: chemical, T2: ???, T3: physics, T4: ???, T5: ???, T6: syscom, T7: research, T8: ???, T9: syscom
##### alpha = 10.01, 50.01, 100.01:
All topics share the same keywords (in a different order), so we cannot recognise distinct topics.

With a higher alpha we get a nearly uniform distribution of topics in each document, so we can no longer distinguish or cluster the EPFL courses, which in reality cover different topics. We assume that EPFL's courses each have a few principal topics, so we choose a small alpha (2.01).
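
The general effect of a Dirichlet concentration value can be checked numerically, independently of the Spark runs above. The sketch below draws samples from a symmetric Dirichlet with NumPy and reports how much weight the dominant component receives on average. The helper name `mean_largest_share`, the sample size and the concentration values are assumptions chosen for illustration; they are not taken from the lab.

```python
import numpy as np


def mean_largest_share(concentration, n_topics=10, n_samples=2000, seed=42):
    """Average weight of the dominant component in symmetric Dirichlet samples."""
    rng = np.random.default_rng(seed)
    # Each row is one draw: a probability vector over n_topics components.
    samples = rng.dirichlet([concentration] * n_topics, size=n_samples)
    return samples.max(axis=1).mean()


for concentration in [0.1, 1.0, 10.0, 100.0]:
    share = mean_largest_share(concentration)
    print("concentration = {c}: mean largest share = {s:.3f}".format(c=concentration, s=share))
```

With 10 components a perfectly uniform vector gives every component a share of 0.1. The reported value moves toward 0.1 as the concentration grows, which is the flattening behaviour the argument above relies on.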

##### Fix alpha (distribution of topics in documents)

In [23]:
def topics_extraction_3(nbTopics, alpha, n_terms):
    tf_matrix, tfidf_matrix, doc_indices, term_indices, indices_term, doc_names = load_data()
    data = sc.parallelize([Vectors.dense(x) for x in tfidf_matrix.todense().T])
    corpus = data.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
    MAGIC_NUMBER = 42
    
    for beta in [1.01, 5.01, 10.01, 50.01, 100.01]:
        print("Beta = {a}".format(a=beta))
        ldaModel = LDA.train(corpus, k=nbTopics, seed=MAGIC_NUMBER, optimizer="online", docConcentration=beta, topicConcentration=alpha)
        topics = ldaModel.topicsMatrix()
        for i in range(nbTopics):
            print("     Topic " + str(i) + ":")
            # Take the top n_terms terms of topic i
            topic = np.argsort(topics[:,i])[::-1][:n_terms]
            for y in topic:
                print("          '{k}'".format(k=indices_term[str(y)]))
            




In [24]:
topics_extraction_3(nbTopics=10, alpha = 6.01, n_terms=15)

Beta = 1.01
     Topic 0:
          'linear'
          'image'
          'paper'
          'optimization'
          'semester'
          'fracture'
          'stability'
          'space'
          'performance'
          'practical'
          'week'
          'network'
          'case'
          'signal'
          'optical'
     Topic 1:
          'scale'
          'mechanical'
          'physic'
          'science'
          'question'
          'optical'
          'topic'
          'sensor'
          'image'
          'simulation'
          'case'
          'semester'
          'cover'
          'provide'
          'molecular'
     Topic 2:
          'regression'
          'linear'
          'network'
          'probability'
          'optimization'
          'machine'
          'stochastic'
          'classification'
          'image'
          'derivative'
          'random'
          'statistic'
          'signal'
          'risk'
          'real'
     Topic 3:
          'enac'
 

Changing beta: for each value of beta we obtain the same topics, with only a **few** terms differing.

## Exercise 4.10: EPFL's taught subjects

In [25]:
def topics_extraction(nbTopics, n_terms, alpha, beta):
    tf_matrix, tfidf_matrix, doc_indices, term_indices, indices_term, doc_names = load_data()
    data = sc.parallelize([Vectors.dense(x) for x in tfidf_matrix.todense().T])
    corpus = data.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
    MAGIC_NUMBER = 42
    ldaModel = LDA.train(corpus, k=nbTopics, docConcentration=beta, topicConcentration=alpha, seed=MAGIC_NUMBER, optimizer='em')
    topics = ldaModel.topicsMatrix()
    
    for i in range(nbTopics):
        print("Topic " + str(i) + ":")
        # Take the top n_terms terms of topic i
        score_abs = [np.abs(x) for x in topics[:,i]]
        topic = np.argsort(score_abs)[::-1][:n_terms]
        for y in topic:
            print("   '{k}'".format(k=indices_term[str(y)]))  




In [29]:
topics_extraction(nbTopics=9, n_terms=15, alpha=2.01, beta=2.01)

Topic 0:
   'drug'
   'development'
   'optic'
   'optical'
   'powder'
   'pharmacology'
   'case'
   'semester'
   'signal'
   'digital'
   'propagation'
   'precipitation'
   'industrial'
   'disease'
   'scale'
Topic 1:
   'doctoral'
   'edms'
   'note'
   'device'
   'sensor'
   'cancer'
   'circuit'
   'tumor'
   'contact'
   'electronics'
   'biology'
   'priority'
   'human'
   'module'
   'robot'
Topic 2:
   'electron'
   'microscopy'
   'spectroscopy'
   'image'
   'molecular'
   'protein'
   'magnetic'
   'reaction'
   'electronic'
   'tissue'
   'mechanical'
   'chemistry'
   'quantum'
   'interaction'
   'molecule'
Topic 3:
   'optimization'
   'financial'
   'market'
   'stochastic'
   'risk'
   'program'
   'linear'
   'discrete'
   'finance'
   'algorithm'
   'graph'
   'management'
   'algebra'
   'decision'
   'computer'
Topic 4:
   'water'
   'measurement'
   'environmental'
   'atmospheric'
   'transport'
   'fluid'
   'layer'
   'experimental'
   'physical'
   'pol

We pick a **low** value for alpha and beta because we assume that each EPFL course is composed of only a **few** topics.


T0: chemical, T1: electrical engineering, T2: physics, T3: finance, T4: life science, T5: machine learning, T6: electrical engineering?, T7: research, T8: semester project

## Exercise 4.11: Wikipedia structure

Intuition: analyse the page titles.

In [3]:
def split_words(word):
    """Transform HelloWord into Hello Word"""
    if (word.isupper() or word.islower()):
        return word
    else:
        pos_to_cut = []
        for i in range(1, len(word)):
            if (word[i].isupper()):
                pos_to_cut.append(i)
        curr = 0
        words = ''
        for pos in pos_to_cut:
            words += ' ' + word[curr: pos]
            curr = pos
        words += ' ' + word[curr:]
        return words
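
For comparison, the same camel-case split can be written with a regular expression. This is a sketch of an alternative and not the function used in this notebook: the name `split_camel_case` is hypothetical, and unlike `split_words` it does not prepend a leading space to the result.

```python
import re


def split_camel_case(word):
    """Insert a space before every uppercase letter that is not the first character."""
    # Leave all-uppercase or all-lowercase words untouched, as split_words does.
    if word.isupper() or word.islower():
        return word
    # (?<!^) excludes position 0; (?=[A-Z]) matches right before an uppercase letter.
    return re.sub(r"(?<!^)(?=[A-Z])", " ", word)


print(split_camel_case("HelloWord"))
```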

In [4]:
"""Pre-requisite to choosing indexing terms"""
"""Combine RegularExpr (Remove the punctuation) and word_tokenize"""
def tokenization(sentence):
    tokenizer = RegexpTokenizer(r'\w+')
    temp = ''
    for w in sentence.split():
        temp += ' ' + split_words(w)   
    temp = word_tokenize(temp)
    new_sentence = ''
    for grams in temp:
        new_sentence += ' ' + grams

    return  tokenizer.tokenize(new_sentence)

In [5]:
"""Remove unimportant words => smaller and more informative indexes"""
def stop_words(sentence):
    stopwords = load_pkl('data/stopwords.pkl')
    return [x.lower() for x in sentence if x.lower() not in stopwords]

In [6]:
def get_wordnet_pos(treebank_tag):
    """Map ['NN', 'NNS', 'NNP', 'NNPS'] to NOUN....."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

In [7]:
"""The goal of lemma and stemming is : reduces lexical variability 
                                      ⇒ reduces index size"""
def lemmatization(sentence):
    lemmatiser = WordNetLemmatizer()
    tokens_pos = pos_tag(sentence)
    tokens_pos = [(w,get_wordnet_pos(p)) for (w,p) in tokens_pos]
    
    return [lemmatiser.lemmatize(w, pos=p) for (w,p) in tokens_pos if p is not None]

In [8]:
def preprocessing(sentence):
    """Tokenization"""
    new_sentence = tokenization(sentence)
    """Stopwords"""
    new_sentence = stop_words(new_sentence)
    """POS and Lemmatization"""
    new_sentence = lemmatization(new_sentence)
    
    return new_sentence

In [9]:
"""Read data and return a term-frequency matrix"""
"""Weighting scheme for term t in doc d: 
   
   TF(t,d) =  # occurs of t in d / max {# occurs of t'} for all terms t' in d"""


def read_data(data):
    """Build a term-frequency matrix from (page id, title) pairs."""
    # Mapping from term to row index
    term_indices = {}
    # Mapping from row index to term
    indices_term = {}
    # Mapping from page id to column index
    doc_indices = {}
    # Mapping from page id to page title
    doc_names = {}
    
    values = []
    rows = []
    columns = []
    terms_count = 0
    docs_count = 0
    
    
    for d in data:
        id_   = d[0]
        title = d[1]
        doc_indices[id_] = docs_count
        doc_names[id_] = title

        processed = preprocessing(title)
        if (len(processed) == 0):
            continue
        for term in processed:
            # Skip terms of one or two characters
            if(len(term) <= 2):
                continue

            if term not in term_indices:
                term_indices[term] = terms_count
                indices_term[terms_count] = term
                terms_count += 1
            """Append a value to matrix(row, col)"""
            values.append(1.0)
            rows.append(term_indices[term])
            columns.append(docs_count)
        """Go to another doc"""
        docs_count += 1
    # Create the CSR matrix (duplicate (row, col) entries are summed into counts)
    tf_matrix = csr_matrix((values, (rows, columns)), shape=(terms_count, docs_count))
    # Normalise each column to obtain the TF matrix
    for col in range(docs_count):
        tf_matrix[:,col] /= np.max(tf_matrix.getcol(col))
        
    return tf_matrix, term_indices, indices_term, doc_indices, doc_names
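
The weighting scheme stated at the top of this cell divides each term count by the largest count in the same document. A minimal standard-library sketch on a made-up token list follows; the helper name `max_normalised_tf` is hypothetical and is not part of the notebook.

```python
from collections import Counter


def max_normalised_tf(tokens):
    """TF(t, d) = count of t in d / largest count of any term in d."""
    counts = Counter(tokens)
    if not counts:
        # An empty document has no terms and therefore no weights.
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


print(max_normalised_tf(["world", "war", "world", "cup"]))
```

Here `world` occurs twice and gets weight 1.0, while `war` and `cup` occur once and get weight 0.5.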

In [10]:
"""Inverse Document Frequency IDF(t,D): log(# documents / # documents containing term t)"""
"""TF_IDF = TF(t,d)*IDF(t,D)"""
def tf_idf(tf_matrix, doc_indices, term_indices):
    nbDocs = len(doc_indices)
    nbTerms= len(term_indices)
    tfidf_matrix = tf_matrix.copy()
    for i in range(nbTerms):
        tfidf_matrix[i,:] *= math.log(nbDocs/tfidf_matrix.getrow(i).nnz)
    return tfidf_matrix
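
The IDF formula above can be worked through on a few toy documents. This is a sketch under stated assumptions: the token lists are invented, the helper name `idf_scores` is hypothetical, and the natural logarithm is used as in `math.log`.

```python
import math


def idf_scores(docs):
    """IDF(t, D) = log(number of documents / number of documents containing t)."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        # Use a set so that a term is counted once per document.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / freq) for term, freq in doc_freq.items()}


toy_docs = [["world", "war"], ["world", "cup"], ["civil", "war"], ["world", "map"]]
for term, score in sorted(idf_scores(toy_docs).items()):
    print("{t}: {s:.3f}".format(t=term, s=score))
```

A term that appears in most documents, such as `world` here, receives a score close to zero, while a term found in a single document receives the largest score.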


In [None]:
data = data_wiki.map(lambda x : (x['page_id'] - 1, x['title'])).collect()

In [33]:
def wiki_structure(data):
    tf_matrix, term_indices, indices_term, doc_indices, doc_names = read_data(data)
    tfidf_matrix = tf_idf(tf_matrix, doc_indices, term_indices)
    data_ = sc.parallelize([Vectors.dense(x) for x in tfidf_matrix.todense().T])
    corpus = data_.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
    MAGIC_NUMBER = 42
    alpha = 6.01
    beta = 1.01
    nbTopics = 50
    n_terms = 15
    ldaModel = LDA.train(corpus, k=nbTopics, docConcentration=beta, topicConcentration=alpha, seed=MAGIC_NUMBER, optimizer='em')
    topics = ldaModel.topicsMatrix()
    for i in range(nbTopics):
        print("Topic " + str(i) + ":")
        # Take the top n_terms terms of topic i
        score_abs = [np.abs(x) for x in topics[:,i]]
        topic = np.argsort(score_abs)[::-1][:n_terms]
        for y in topic:
            print("   '{k}'".format(k=indices_term[y]))

In [34]:
wiki_structure(data)

Topic 0:
   'history'
   'russia'
   'wolf'
   'brother'
   'painting'
   'christianity'
   'japan'
   'peter'
   'slavery'
   'france'
   'agriculture'
   'buddhism'
   'europe'
   'saffron'
   'soviet'
Topic 1:
   'game'
   'theory'
   'olympic'
   'video'
   'card'
   'board'
   'introduction'
   'monopoly'
   'ultimatum'
   'winter'
   'kite'
   'commonwealth'
   'summer'
   'civilization'
   'trading'
Topic 2:
   'street'
   'vitamin'
   'theory'
   'super'
   'space'
   'mario'
   'probability'
   'sesame'
   'party'
   'relativity'
   'fleet'
   'bird'
   'introduction'
   'watling'
   'coronation'
Topic 3:
   'war'
   'world'
   'black'
   'american'
   'civil'
   'cup'
   'trade'
   'star'
   'gas'
   'present'
   'terrorism'
   'year'
   'map'
   'peace'
   'organization'
Topic 4:
   'john'
   'hurricane'
   'battle'
   'paul'
   'season'
   'kennedy'
   'lewis'
   'robinson'
   'atlantic'
   'muhammad'
   'milton'
   'keynes'
   'adam'
   'danny'
   'pope'
Topic 5:
   'engla

We take k = 50, beta = 1.01, alpha = 6.01