# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *J*

**Names:**

* *Dennis Gankin*
* *Name 2*
* *Name 3*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

from utils import load_pkl
import pandas as pd

## Exercise 4.8: Topics extraction

In [2]:
# load the data
tf_matrix = load_pkl('tfidf.pkl')
terms = load_pkl('terms.pkl')

In [3]:
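# Create the model following the MLlib docs.
# `sc` is the SparkContext provided by the notebook environment.
# The matrix is transposed so that each RDD element is one document
# (assuming tf_matrix is stored as terms x documents).
tf_dataframe = sc.parallelize(tf_matrix.T)

# Convert every document row to a dense MLlib vector.
data = tf_dataframe.map(lambda line: Vectors.dense([float(x) for x in line]))
# LDA.train expects [document id, vector] pairs.
corpus = data.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

ldaModel = LDA.train(corpus, k=10, optimizer='online')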

In [4]:
import numpy as np

def topic_descriptions(model, num_topics=10, num_labels=10):
    topics=model.topicsMatrix()
    
    for topic in range(num_topics):
        print("Topic " + str(topic) + ":")
        top_word_ids = np.argsort(topics[:,topic])[::-1][0:num_labels]
        labels = []
        
        for word_id in top_word_ids:
            labels.append(terms[word_id])
        print(labels)
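
The top-term selection used in `topic_descriptions` can be checked without Spark on a small hand-made matrix. The sketch below is only an illustration: `top_terms_per_topic`, `toy_vocabulary` and `toy_topics` are hypothetical names and values, not part of the lab data. It assumes, as in the cell above, that rows are terms and columns are topics.

```python
import numpy as np

def top_terms_per_topic(topics, vocabulary, num_labels=3):
    """Return, for each topic (column), its num_labels highest-weighted terms."""
    result = []
    for topic in range(topics.shape[1]):
        # argsort is ascending, so reverse it and keep the first num_labels ids
        top_word_ids = np.argsort(topics[:, topic])[::-1][:num_labels]
        result.append([vocabulary[i] for i in top_word_ids])
    return result

# Toy data (made up for illustration): 5 terms x 2 topics
toy_vocabulary = ["cell", "market", "laser", "protein", "price"]
toy_topics = np.array([
    [0.50, 0.01],
    [0.02, 0.60],
    [0.05, 0.04],
    [0.40, 0.02],
    [0.03, 0.33],
])

toy_top_terms = top_terms_per_topic(toy_topics, toy_vocabulary, num_labels=2)
print(toy_top_terms)
```

Returning the lists instead of printing them directly makes the selection easier to reuse, for instance when comparing topics across several models.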

In [5]:
topic_descriptions(ldaModel)

Topic 0:
['biometr', 'neuron', 'wast', 'recycl', 'wastewat', 'atmospher', 'treatment', 'synapt', 'water', 'pollut']
Topic 1:
['polici', 'risk', 'financi', 'financ', 'stochast', 'price', 'corpor', 'market', 'asset', 'volatil']
Topic 2:
['laser', 'studio', 'urban', 'architectur', 'spectroscopi', 'light', 'microscop', 'electron', 'sampl', 'optic']
Topic 3:
['cell', 'statist', 'probabl', 'regress', 'random', 'cancer', 'immun', 'anim', 'protein', 'express']
Topic 4:
['flow', 'project', 'ah', 'energi', 'case', 'plan', 'heat', 'report', 'busi', 'manag']
Topic 5:
['neutron', 'systemc', 'avion', 'systemverilog', 'equilibrium', 'arteri', 'powder', 'diffract', 'nonlinear', 'diffus']
Topic 6:
['food', 'print', 'biolog', 'manufactur', 'organ', 'area', 'seminar', 'synthet', 'process', 'electron']
Topic 7:
['fractur', 'seismic', 'soil', 'rock', 'geolog', 'tribolog', 'crack', 'plasmon', 'mx', 'deep']
Topic 8:
['model', 'circuit', 'system', 'imag', 'signal', 'algorithm', 'linear', 'numer', 'network', '

In [6]:
#TODO: How does it compare with LSI
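
For the comparison with LSI, one possible way to get term lists in the same format is to take a truncated SVD of the term-document matrix and rank terms by the absolute value of their loading on each concept. The sketch below is an assumption about how such a comparison could be set up, not the method used in the LSI part of the lab. `lsi_top_terms`, `toy_terms` and `toy_matrix` are hypothetical and use made-up data.

```python
import numpy as np

def lsi_top_terms(term_doc_matrix, vocabulary, num_concepts=2, num_labels=2):
    """Rank terms per LSI concept by absolute loading in the left singular vectors."""
    # Rows of term_doc_matrix are terms, columns are documents
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    concepts = []
    for c in range(num_concepts):
        # The sign of a singular vector is arbitrary, so rank by magnitude
        top_word_ids = np.argsort(np.abs(U[:, c]))[::-1][:num_labels]
        concepts.append([vocabulary[i] for i in top_word_ids])
    return concepts

# Toy data (made up): two groups of terms that never co-occur
toy_terms = ["market", "price", "cell", "protein"]
toy_matrix = np.array([
    [3.0, 3.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

lsi_concepts = lsi_top_terms(toy_matrix, toy_terms)
print(lsi_concepts)
```

Unlike LDA topics, LSI loadings can be negative and do not form a probability distribution over terms, which is worth keeping in mind when the two lists are read side by side.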

## Exercise 4.9: Dirichlet hyperparameters

In [7]:
#try for alphas and betas
params = [0.1, 1.0, 10.0, 1000.0]

In [8]:
def parameters_LDA(corpus, alpha=6.0, beta=1.01, k=10):
    """Train LDA and print its topics for one or several hyperparameter values.

    alpha is passed as docConcentration and beta as topicConcentration.
    If alpha (or beta) is a list, one model is trained per value in the list
    while the other parameter stays fixed.
    """
    if isinstance(alpha, list):
        print("beta =",beta,"\n")
        for a in alpha:
            print("\n alpha =", a)
            model = LDA.train(corpus, k=k, seed=1, topicConcentration=beta, docConcentration=a, optimizer='online')
            topic_descriptions(model,num_topics=k)
    elif isinstance(beta, list):
        print("alpha =",alpha,"\n")
        for b in beta:
            print("\nbeta =",b)
            model = LDA.train(corpus, k=k, seed=1, topicConcentration=b, docConcentration=alpha, optimizer='online')
            topic_descriptions(model,num_topics=k)
    else:
        # Single (alpha, beta) pair: train and describe one model
        print("alpha =",alpha)
        print("beta =",beta,"\n")
        model = LDA.train(corpus, k=k, seed=1, topicConcentration=beta, docConcentration=alpha, optimizer='online')
        topic_descriptions(model,num_topics=k)

In [9]:
parameters_LDA(corpus,alpha=params)

beta = 1.01 


 alpha = 0.1
Topic 0:
['project', 'data', 'research', 'group', 'materi', 'report', 'plan', 'manag', 'busi', 'object']
Topic 1:
['acoust', 'algebra', 'lie', 'curv', 'neuroprosthes', 'ring', 'neuroprosthesi', 'invas', 'tech', 'geometri']
Topic 2:
['mx', 'mse', 'crack', 'cardiac', 'arteri', 'ancient', 'ray', 'venou', 'degrad', 'blood']
Topic 3:
['risk', 'linear', 'statist', 'flow', 'model', 'probabl', 'control', 'robot', 'algorithm', 'numer']
Topic 4:
['magnet', 'electron', 'circuit', 'imag', 'sensor', 'devic', 'cmo', 'digit', 'nois', 'filter']
Topic 5:
['reaction', 'reactor', 'snow', 'chemic', 'energi', 'heat', 'kinet', 'thermodynam', 'chemistri', 'water']
Topic 6:
['protein', 'cell', 'solar', 'doctor', 'edm', 'motil', 'sensit', 'kinas', 'tem', 'assay']
Topic 7:
['laser', 'optic', 'radiat', 'photon', 'light', 'electromagnet', 'shield', 'fiber', 'grate', 'game']
Topic 8:
['food', 'architectur', 'wast', 'studio', 'urban', 'drug', 'wood', 'code', 'ferment', 'build']
Topic 9:


In [10]:
#TODO: how does it impact the topics
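
To build intuition for this question, the effect of a symmetric Dirichlet concentration can be looked at in isolation, without training any model. The sketch below draws samples from a Dirichlet distribution for each value in `params` and reports how much mass the largest component receives on average. `mean_max_share` is a hypothetical helper, and the experiment only illustrates the prior, not what the trained models above actually do.

```python
import numpy as np

def mean_max_share(concentration, k=10, num_samples=2000, seed=1):
    """Average mass of the largest component of a symmetric Dirichlet sample."""
    rng = np.random.RandomState(seed)
    # Each row is one k-dimensional distribution that sums to 1
    samples = rng.dirichlet([concentration] * k, size=num_samples)
    return samples.max(axis=1).mean()

for value in [0.1, 1.0, 10.0, 1000.0]:
    print(value, round(mean_max_share(value), 3))
```

Small concentrations favour sparse distributions where one component dominates, while large ones push all components towards the uniform share 1/k. Read with `k` as the number of topics, this is the role of alpha (documents over topics); the same reasoning applies to beta with the vocabulary size in place of `k` (topics over terms).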

In [11]:
parameters_LDA(corpus,beta=params)

alpha = 6.0 


beta = 0.1
Topic 0:
['risk', 'market', 'busi', 'laser', 'polici', 'financ', 'financi', 'suppli', 'price', 'case']
Topic 1:
['speech', 'algebra', 'imag', 'audio', 'video', 'acoust', 'sound', 'curv', 'tp', 'lie']
Topic 2:
['ancient', 'materi', 'degrad', 'absorpt', 'tem', 'ray', 'sensit', 'spectroscopi', 'paint', 'damag']
Topic 3:
['quantum', 'equat', 'molecular', 'approxim', 'phase', 'flow', 'dynam', 'arteri', 'statement', 'kohn']
Topic 4:
['circuit', 'electron', 'magnet', 'devic', 'sensor', 'cmo', 'nois', 'filter', 'imag', 'optic']
Topic 5:
['snow', 'reactor', 'protein', 'reaction', 'chemic', 'kinet', 'cell', 'radiat', 'heat', 'thermodynam']
Topic 6:
['crack', 'fractur', 'mx', 'mse', 'structur', 'cell', 'steel', 'model', 'mechan', 'electron']
Topic 7:
['optic', 'light', 'fluoresc', 'imag', 'organ', 'laser', 'reaction', 'secur', 'biophoton', 'biolog']
Topic 8:
['project', 'process', 'data', 'model', 'test', 'food', 'comput', 'control', 'robot', 'week']
Topic 9:
['steel', '

In [12]:
#TODO: how does it impact the topics

## Exercise 4.10: EPFL's taught subjects

In [13]:
# choose k as the number of schools at EPFL

"""Architecture, Civil and Environmental Engineering ENAC,
Basic Sciences SB,
Engineering STI,
Computer and Communication Sciences IC,
Life Sciences SV,
Management of Technology CDM,
College of Humanities CDH"""
k=7
# choose alpha and beta based on the experiments above?
#TODO: make and explain the alpha/beta choice
alpha=1.5
beta=0.01

In [14]:
parameters_LDA(corpus,alpha=alpha,beta=beta,k=k)

alpha = 1.5
beta = 0.01 

Topic 0:
['project', 'food', 'research', 'data', 'present', 'manag', 'architectur', 'group', 'risk', 'develop']
Topic 1:
['algebra', 'lie', 'curv', 'neuroprosthes', 'ring', 'invas', 'neuroprosthesi', 'geometri', 'de', 'planet']
Topic 2:
['solar', 'absorpt', 'mx', 'materi', 'cardiac', 'ancient', 'ray', 'light', 'sensit', 'arteri']
Topic 3:
['linear', 'statist', 'probabl', 'numer', 'model', 'robot', 'flow', 'algorithm', 'control', 'space']
Topic 4:
['imag', 'sensor', 'electron', 'magnet', 'circuit', 'optic', 'devic', 'digit', 'filter', 'cmo']
Topic 5:
['reactor', 'reaction', 'snow', 'energi', 'chemic', 'protein', 'kinet', 'heat', 'transport', 'thermodynam']
Topic 6:
['video', 'cell', 'secur', 'verif', 'cancer', 'speech', 'tumor', 'vhdl', 'mpeg', 'laboratori']


## Exercise 4.11: Wikipedia structure

In [15]:
from pyspark.sql import Row
from collections import OrderedDict
import json
from pyspark.ml.feature import CountVectorizer

In [16]:
wikipedia_raw = sc.textFile('/ix/wikipedia-for-schools.txt')
wikipedia_df = wikipedia_raw.map(lambda l: Row(**dict(json.loads(l)))).toDF()
wikipedia_df.show()

+-------+--------------------+--------------------+
|page_id|               title|              tokens|
+-------+--------------------+--------------------+
|      1|   Áedán mac Gabráin|[áedán, mac, gabr...|
|      2|       Åland Islands|[åland, islandssc...|
|      3|       Édouard Manet|[édouard, manetsc...|
|      4|                Éire|[éireschools, wik...|
|      5|     Évariste Galois|[évariste, galois...|
|      6|Óengus I of the P...|[óengus, pictssch...|
|      7|€2 commemorative ...|[€, commemorative...|
|      8|          0 (number)|[numberschools, w...|
|      9|   10 Downing Street|[downing, streets...|
|     10|        10th century|[centuryschools, ...|
|     11|        11th century|[centuryschools, ...|
|     12|        12th century|[centuryschools, ...|
|     13|        13th century|[centuryschools, ...|
|     14|        14th century|[centuryschools, ...|
|     15|        15th century|[centuryschools, ...|
|     16|            16 Cygni|[cygnischools, wi...|
|     17|   

In [17]:
cv = CountVectorizer(inputCol="tokens", outputCol="features")
model = cv.fit(wikipedia_df)
wikipedia_count_df = model.transform(wikipedia_df)

In [18]:
wiki_corpus = wikipedia_count_df.select("page_id", "features").map(list)

In [21]:
#https://en.wikipedia.org/wiki/Category:Main_topic_classifications --- 22 topics?
#or 12?

terms=model.vocabulary

def topic_descriptions(model, num_topics=10, num_labels=10):
    topics=model.topicsMatrix()
    
    for topic in range(num_topics):
        print("Topic " + str(topic) + ":")
        top_word_ids = np.argsort(topics[:,topic])[::-1][0:num_labels]
        labels = []
        
        for word_id in top_word_ids:
            labels.append(terms[word_id])
        print(labels)
                   

In [22]:
parameters_LDA(wiki_corpus,alpha=1.5,beta=0.01,k=12)

alpha = 1.5
beta = 0.01 



ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
    raise Py4JError("Answer from Java side is empty")
py4j.protocol.Py4JError: Answer from Java side is empty
ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
    raise Py4JError("Answer from Java side is empty")
py4j.protocol.Py4JError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 626, in send_command
    response = connection.send_command(command)
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 75

Topic 0:
['claudius', 'marx', '–', 'engels', 'power', 'pericles', 'states', 'years', 'world', 'war']
Topic 1:
['calgary', 'alcohol', 'toad', 'toads', 'fasd', 'city', 'wifi', 'fas', 'exposure', 'prenatal']
Topic 2:
['pluto', 'sheep', 'plutos', 'operatorexpression', 'patel', 'planet', 'charon', 'neptune', 'years', 'wine']
Topic 3:
['n−', 'zambia', 'fish', 'carp', 'sicklecell', 'skate', 'city', 'number', 'skates', 'fishes']
Topic 4:
['piłsudski', 'phalacrocorax', 'cormorant', 'urbankowski', '^', 'swan', 'piłsudskis', 'hänsel', 'gretel', 'shag']
Topic 5:
['leeds', 'acid', 'city', 'god', '–', 'allah', 'world', '·', 'years', 'time']
Topic 6:
['caffeine', 'inquisition', 'sleep', 'president', '·', 'united', 'world', 'century', 'de', '–']
Topic 7:
['calgary', 'iits', 'coupler', 'hippos', '–', 'fluid', 'years', 'city', 'number', 'time']
Topic 8:
['–', 'time', 'years', 'world', 'american', 'war', 'united', 'city', 'states', 'century']
Topic 9:
['oz', 'episcopus', 'sonic', 'game', 'sarajevo', 'pla