# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *Y*

**Names:**

* *Kristian Aurlien*
* *Mateusz Paluchowski*


---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import pandas as pd
import numpy as np
from utils import load_json, load_pkl

from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

In [2]:
sc

<pyspark.context.SparkContext at 0x7f55781f6240>

## Exercise 4.8: Topics extraction

Using your pre-processed courses dataset, extract topics using LDA.
    1. Print k = 10 topics extracted using LDA and give them labels. 
    2. How does it compare with LSI?
You can use the default values for all parameters.

In [3]:
# courses = load_pkl('courses.pkl')
tf_matrix = load_pkl('tf_matrix.pkl')
terms = load_pkl('terms.pkl')

In [4]:
#We want word count vectors not colums thus we need to transpose our TF_matrix 
tf_matrix_T = tf_matrix.T

In [5]:
tf_matrix_T.shape

(854, 10875)

In [6]:
#Pandas trick to format the data as in pyspark docs example
df = pd.DataFrame(data=tf_matrix_T)
df.to_csv('tf_matrix_T.csv', sep=' ', header=False, index=False)

In [7]:
data = sc.textFile("tf_matrix_T.csv")
parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))

# Index documents with unique IDs`
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

In [8]:
ldaModel = LDA.train(corpus, k=10, optimizer='online')

In [9]:
# Learn topics
topics_baseline = ldaModel.topicsMatrix()

In [10]:
# Print top words describing topics
def describe_topics(topics, no_of_topics=10, no_of_words=10):
    for topic in range(no_of_topics):
        print("Topic " + str(topic) + ":")
        top_words_idx = np.argsort(topics[:,topic])[::-1][0:no_of_words]
        words = []
        for word_id in top_words_idx:
            words.append(terms[word_id])
        print(words)

In [11]:
describe_topics(topics_baseline)

Topic 0:
['student', 'learn', 'content', 'method', 'enrol', 'imag', 'administr', 'analysi', 'system', 'lectur']
Topic 1:
['train', 'premium', 'materialdefinit', 'parabol', 'ductil', 'pet', 'bisboron', 'rotat', 'innov', 'door']
Topic 2:
['semest', 'show', 'depth', 'literaci', 'rough', 'kill', 'laboratori', 'franc', 'signalsrecogn', 'geostatist']
Topic 3:
['tpc', 'justif', 'student', 'physic', 'principl', 'nano', 'atomsdescrib', 'learn', 'photographi', 'ms']
Topic 4:
['student', 'learn', 'method', 'system', 'model', 'content', 'basic', 'design', 'assess', 'exercis']
Topic 5:
['student', 'symmetri', 'method', 'system', 'wastewaterpropos', 'concis', 'canton', 'naviunderstand', 'formul', 'calderon']
Topic 6:
['dq', 'ide', 'blockchain', 'amplifierfor', 'matplotlib', 'mapsth', 'erupt', 'incos', 'cellulair', 'functionder']
Topic 7:
['elastoplast', 'networksappli', 'medium', 'ilp', 'independ', 'investigationsappli', 'endow', 'devicesmass', 'codeunderstand', 'matroid']
Topic 8:
['method', 'examp

## Exercise 4.9: Dirichlet hyperparameters

1. Fix k = 10 and β = 1.01, and vary α. How does it impact the topics?

In [12]:
# LDA docConcentration must be > 1.0 (or -1 for auto) for EM 
alphas = [0.1, 1, 10, 1000] 

#### alpha=0.1

In [13]:
ldaModel_a_0_1 = LDA.train(corpus, k=10, seed=1, topicConcentration=1.01, docConcentration=0.1, optimizer='online')

In [14]:
topics_a_0_1 = ldaModel_a_0_1.topicsMatrix()

In [15]:
describe_topics(topics_a_0_1)

Topic 0:
['method', 'student', 'learn', 'basic', 'analysi', 'cours', 'model', 'imag', 'understand', 'theori']
Topic 1:
['student', 'method', 'learn', 'process', 'content', 'energi', 'disadvantg', 'statist', 'model', 'lectur']
Topic 2:
['student', 'method', 'learn', 'model', 'splinkerett', 'electrokineticsdielectrophresi', 'design', 'today', 'electromagnet', 'date']
Topic 3:
['fondament', 'chemic', 'provid', 'question', 'method', 'drug', 'compound', 'research', 'halt', 'credibl']
Topic 4:
['student', 'method', 'learn', 'model', 'content', 'system', 'design', 'project', 'assess', 'concept']
Topic 5:
['rotat', 'train', 'method', 'fsu', 'pocessesdescrib', 'optionsperform', 'disc', 'horizont', 'gener', 'galerkin']
Topic 6:
['student', 'method', 'learn', 'content', 'model', 'microwav', 'system', 'rapid', 'basic', 'chemic']
Topic 7:
['nucleu', 'remov', 'networkscharacter', 'gel', 'managermentassess', 'harper', 'toolsaccess', 'vital', 'bertseka', 'carri']
Topic 8:
['method', 'student', 'learn'

#### alpha=1

In [16]:
ldaModel_a_1 = LDA.train(corpus, k=10, seed=1, topicConcentration=1.01, docConcentration=0.1, optimizer='online')

In [17]:
topics_a_1 = ldaModel_a_1.topicsMatrix()

In [18]:
describe_topics(topics_a_1)

Topic 0:
['learn', 'method', 'student', 'flow', 'cours', 'understand', 'sweng', 'session', 'csr', 'respons']
Topic 1:
['disadvantg', 'statist', 'begin', 'lot', 'usual', 'thermogravimetr', 'affect', 'magnet', 'nco', 'experimentdescrib']
Topic 2:
['train', 'rotat', 'student', 'method', 'fondament', 'splinkerett', 'electrokineticsdielectrophresi', 'algebra', 'learn', 'critic']
Topic 3:
['method', 'student', 'design', 'research', 'halt', 'credibl', 'formalis', 'physicsidentifi', 'dbmsdesign', 'systemsconstruct']
Topic 4:
['method', 'student', 'learn', 'model', 'design', 'content', 'system', 'project', 'assess', 'analysi']
Topic 5:
['project', 'semest', 'fsu', 'pocessesdescrib', 'method', 'student', 'optionsperform', 'disc', 'horizont', 'geomechan']
Topic 6:
['student', 'learn', 'method', 'rapid', 'model', 'physic', 'structureselabor', 'chemic', 'content', 'ambient']
Topic 7:
['nucleu', 'remov', 'networkscharacter', 'managermentassess', 'harper', 'toolsaccess', 'gel', 'vital', 'method', 'be

#### alpha=10

In [19]:
ldaModel_a_10 = LDA.train(corpus, k=10, seed=1, topicConcentration=1.01, docConcentration=0.1, optimizer='online')

In [20]:
topics_a_10 = ldaModel_a_10.topicsMatrix()

In [21]:
describe_topics(topics_a_10)

Topic 0:
['learn', 'method', 'model', 'cours', 'flow', 'principl', 'understand', 'basic', 'session', 'sweng']
Topic 1:
['disadvantg', 'lot', 'begin', 'usual', 'thermogravimetr', 'affect', 'nco', 'hexa', 'experimentdescrib', 'polici']
Topic 2:
['student', 'method', 'splinkerett', 'electrokineticsdielectrophresi', 'today', 'date', 'electromagnet', 'distributor', 'webster', 'econom']
Topic 3:
['method', 'rotat', 'train', 'student', 'halt', 'system', 'keyword', 'credibl', 'learn', 'formalis']
Topic 4:
['student', 'method', 'learn', 'design', 'model', 'system', 'content', 'assess', 'project', 'basic']
Topic 5:
['method', 'fsu', 'pocessesdescrib', 'design', 'optionsperform', 'disc', 'horizont', 'mimick', 'galerkin', 'monoclon']
Topic 6:
['chemic', 'compound', 'drug', 'question', 'fondament', 'learn', 'provid', 'method', 'student', 'model']
Topic 7:
['nucleu', 'method', 'remov', 'networkscharacter', 'managermentassess', 'harper', 'toolsaccess', 'algebra', 'gel', 'vital']
Topic 8:
['student', 

#### alpha=1000

In [22]:
ldaModel_a_1000 = LDA.train(corpus, k=10, seed=1, topicConcentration=1.01, docConcentration=0.1, optimizer='online')

In [23]:
topics_a_1000 = ldaModel_a_1000.topicsMatrix()

In [24]:
describe_topics(topics_a_1000)

Topic 0:
['learn', 'student', 'project', 'flow', 'basic', 'understand', 'system', 'sweng', 'cours', 'method']
Topic 1:
['student', 'method', 'learn', 'disadvantg', 'content', 'lectur', 'lot', 'begin', 'statist', 'usual']
Topic 2:
['student', 'method', 'splinkerett', 'algebra', 'electrokineticsdielectrophresi', 'electromagnet', 'learn', 'date', 'critic', 'today']
Topic 3:
['method', 'student', 'halt', 'system', 'credibl', 'keyword', 'formalis', 'physicsidentifi', 'dbmsdesign', 'law']
Topic 4:
['method', 'learn', 'student', 'system', 'design', 'content', 'model', 'concept', 'project', 'assess']
Topic 5:
['rotat', 'train', 'method', 'student', 'learn', 'fsu', 'pocessesdescrib', 'gener', 'organ', 'design']
Topic 6:
['student', 'learn', 'method', 'model', 'content', 'system', 'process', 'design', 'assess', 'cours']
Topic 7:
['method', 'student', 'learn', 'nucleu', 'present', 'cours', 'comput', 'content', 'remov', 'networkscharacter']
Topic 8:
['learn', 'student', 'method', 'system', 'optic'

# TODO How does it impact the topics?


_______


2. Fix k=10 and α=6, and vary β. How does it impact the topics?

In [25]:
betas = [0.1, 1, 10, 1000] 

#### beta = 0.1

In [26]:
ldaModel_b_0_1 = LDA.train(corpus, k=10, seed=1, topicConcentration=0.1, docConcentration=6.0, optimizer='online')

In [27]:
topics_b_0_1 = ldaModel_b_0_1.topicsMatrix()

In [28]:
describe_topics(topics_b_0_1)

Topic 0:
['learn', 'method', 'cours', 'student', 'model', 'basic', 'analysi', 'system', 'comput', 'flow']
Topic 1:
['student', 'content', 'learn', 'method', 'model', 'lectur', 'statist', 'energi', 'basic', 'design']
Topic 2:
['student', 'method', 'learn', 'model', 'design', 'project', 'system', 'cours', 'outcom', 'electromagnet']
Topic 3:
['method', 'student', 'system', 'keyword', 'design', 'learn', 'research', 'introduct', 'physic', 'expect']
Topic 4:
['method', 'learn', 'student', 'concept', 'transvers', 'content', 'model', 'design', 'specif', 'system']
Topic 5:
['method', 'student', 'learn', 'design', 'project', 'system', 'model', 'analysi', 'teach', 'present']
Topic 6:
['student', 'learn', 'model', 'method', 'content', 'analysi', 'system', 'chemic', 'design', 'outcom']
Topic 7:
['method', 'student', 'content', 'learn', 'present', 'cours', 'comput', 'end', 'system', 'exercis']
Topic 8:
['student', 'learn', 'method', 'comput', 'content', 'system', 'applic', 'design', 'linear', 'exerc

#### beta = 1

In [29]:
ldaModel_b_1 = LDA.train(corpus, k=10, seed=1, topicConcentration=1.00, docConcentration=6.0, optimizer='online')

In [30]:
topics_b_1 = ldaModel_b_1.topicsMatrix()

In [31]:
describe_topics(topics_b_1)

Topic 0:
['learn', 'method', 'student', 'cours', 'model', 'basic', 'analysi', 'system', 'comput', 'flow']
Topic 1:
['student', 'method', 'learn', 'content', 'model', 'lectur', 'statist', 'design', 'system', 'basic']
Topic 2:
['student', 'method', 'learn', 'model', 'design', 'project', 'system', 'cours', 'outcom', 'electromagnet']
Topic 3:
['method', 'student', 'system', 'learn', 'keyword', 'design', 'research', 'model', 'introduct', 'halt']
Topic 4:
['method', 'learn', 'student', 'concept', 'content', 'transvers', 'model', 'design', 'system', 'specif']
Topic 5:
['method', 'student', 'learn', 'design', 'system', 'model', 'project', 'analysi', 'teach', 'present']
Topic 6:
['student', 'learn', 'method', 'model', 'content', 'analysi', 'system', 'design', 'chemic', 'assess']
Topic 7:
['method', 'student', 'learn', 'content', 'present', 'cours', 'comput', 'end', 'system', 'prerequisit']
Topic 8:
['student', 'learn', 'method', 'comput', 'content', 'system', 'design', 'applic', 'linear', 'exer

#### beta = 10

In [32]:
ldaModel_b_10 = LDA.train(corpus, k=10, seed=1, topicConcentration=10.0, docConcentration=6.0, optimizer='online')

In [33]:
topics_b_10 = ldaModel_b_10.topicsMatrix()

In [34]:
describe_topics(topics_b_10)

Topic 0:
['method', 'learn', 'student', 'model', 'cours', 'basic', 'system', 'analysi', 'comput', 'flow']
Topic 1:
['student', 'method', 'learn', 'content', 'model', 'lectur', 'system', 'design', 'statist', 'disadvantg']
Topic 2:
['student', 'method', 'learn', 'model', 'design', 'system', 'project', 'content', 'cours', 'splinkerett']
Topic 3:
['method', 'student', 'learn', 'system', 'design', 'keyword', 'model', 'research', 'content', 'halt']
Topic 4:
['method', 'learn', 'student', 'model', 'content', 'concept', 'system', 'design', 'transvers', 'specif']
Topic 5:
['method', 'student', 'learn', 'design', 'model', 'system', 'project', 'analysi', 'present', 'prerequisit']
Topic 6:
['student', 'method', 'learn', 'model', 'content', 'system', 'analysi', 'design', 'chemic', 'assess']
Topic 7:
['method', 'student', 'learn', 'content', 'present', 'cours', 'system', 'comput', 'model', 'end']
Topic 8:
['student', 'learn', 'method', 'content', 'system', 'comput', 'design', 'model', 'applic', 'exe

#### beta = 1000

In [35]:
ldaModel_b_1000 = LDA.train(corpus, k=10, seed=1, topicConcentration=1000.0, docConcentration=6.0, optimizer='online')

In [36]:
topics_b_1000 = ldaModel_b_1000.topicsMatrix()

In [37]:
describe_topics(topics_b_1000)

Topic 0:
['method', 'learn', 'student', 'model', 'cours', 'system', 'basic', 'analysi', 'comput', 'flow']
Topic 1:
['student', 'method', 'learn', 'content', 'model', 'lectur', 'system', 'design', 'statist', 'disadvantg']
Topic 2:
['student', 'method', 'learn', 'model', 'design', 'system', 'content', 'project', 'cours', 'splinkerett']
Topic 3:
['method', 'student', 'learn', 'system', 'design', 'model', 'keyword', 'content', 'research', 'halt']
Topic 4:
['method', 'learn', 'student', 'model', 'content', 'system', 'concept', 'design', 'transvers', 'specif']
Topic 5:
['method', 'student', 'learn', 'design', 'model', 'system', 'project', 'analysi', 'present', 'prerequisit']
Topic 6:
['student', 'method', 'learn', 'model', 'content', 'system', 'analysi', 'design', 'chemic', 'assess']
Topic 7:
['method', 'student', 'learn', 'content', 'system', 'present', 'cours', 'comput', 'model', 'end']
Topic 8:
['student', 'learn', 'method', 'content', 'system', 'comput', 'design', 'model', 'applic', 'exe

## Exercise 4.10: EPFL's taught subjects

For the symmetric distribution, a high alpha-value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically. A low alpha value puts less such constraints on documents and means that it is more likely that a document may contain mixture of just a few, or even only one, of the topics. Likewise, a high beta-value means that each topic is likely to contain a mixture of most of the words, and not any word specifically, while a low value means that a topic may contain a mixture of just a few of the words.

In [38]:
# Number of sections at EPFL
potential_k = len(['Architecture, Civil and Environmental Engineering ENAC',
'Basic Sciences SB',
'Engineering STI',
'Computer and Communication Sciences IC',
'Life Sciences SV',
'Management of Technology CDM',
'College of Humanities CDH'])

In [39]:
potential_alpha = 1.5
potential_beta = 0.5
potential_k

7

In [40]:
ldaModel_EPFL = LDA.train(corpus,
                          k=potential_k,
                          seed=1,
                          docConcentration=potential_alpha,
                          topicConcentration=potential_beta,
                          optimizer='online')

In [41]:
topics_EPFL = ldaModel_EPFL.topicsMatrix()

In [42]:
describe_topics(topics_EPFL, potential_k)

Topic 0:
['learn', 'method', 'student', 'cours', 'model', 'basic', 'system', 'analysi', 'comput', 'project']
Topic 1:
['student', 'learn', 'content', 'method', 'model', 'lectur', 'design', 'system', 'basic', 'outcom']
Topic 2:
['student', 'method', 'learn', 'model', 'design', 'project', 'system', 'cours', 'content', 'outcom']
Topic 3:
['method', 'student', 'learn', 'system', 'design', 'keyword', 'model', 'content', 'research', 'introduct']
Topic 4:
['method', 'learn', 'student', 'content', 'concept', 'model', 'design', 'system', 'transvers', 'work']
Topic 5:
['method', 'student', 'learn', 'design', 'system', 'model', 'project', 'analysi', 'present', 'teach']
Topic 6:
['student', 'learn', 'method', 'model', 'content', 'system', 'analysi', 'design', 'assess', 'keyword']


## Exercise 4.11: Wikipedia structure

In [43]:
from pyspark.sql import Row
from collections import OrderedDict
import json

In [44]:
wikipedia_raw = sc.textFile('/ix/wikipedia-for-schools.txt')

In [45]:
wikipedia_spark_df = wikipedia_raw.map(lambda l: Row(**dict(json.loads(l)))).toDF()

In [46]:
wikipedia_spark_df.show()

+-------+--------------------+--------------------+
|page_id|               title|              tokens|
+-------+--------------------+--------------------+
|      1|   Áedán mac Gabráin|[áedán, mac, gabr...|
|      2|       Åland Islands|[åland, islandssc...|
|      3|       Édouard Manet|[édouard, manetsc...|
|      4|                Éire|[éireschools, wik...|
|      5|     Évariste Galois|[évariste, galois...|
|      6|Óengus I of the P...|[óengus, pictssch...|
|      7|€2 commemorative ...|[€, commemorative...|
|      8|          0 (number)|[numberschools, w...|
|      9|   10 Downing Street|[downing, streets...|
|     10|        10th century|[centuryschools, ...|
|     11|        11th century|[centuryschools, ...|
|     12|        12th century|[centuryschools, ...|
|     13|        13th century|[centuryschools, ...|
|     14|        14th century|[centuryschools, ...|
|     15|        15th century|[centuryschools, ...|
|     16|            16 Cygni|[cygnischools, wi...|
|     17|   

In [47]:
from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="tokens", outputCol="features")
model = cv.fit(wikipedia_spark_df)
wikipedia_count_df = model.transform(wikipedia_spark_df)

In [48]:
wikipedia_count_df.show()

+-------+--------------------+--------------------+--------------------+
|page_id|               title|              tokens|            features|
+-------+--------------------+--------------------+--------------------+
|      1|   Áedán mac Gabráin|[áedán, mac, gabr...|(262144,[0,1,2,7,...|
|      2|       Åland Islands|[åland, islandssc...|(262144,[1,2,3,4,...|
|      3|       Édouard Manet|[édouard, manetsc...|(262144,[0,1,2,3,...|
|      4|                Éire|[éireschools, wik...|(262144,[7,10,13,...|
|      5|     Évariste Galois|[évariste, galois...|(262144,[0,1,2,9,...|
|      6|Óengus I of the P...|[óengus, pictssch...|(262144,[0,1,4,9,...|
|      7|€2 commemorative ...|[€, commemorative...|(262144,[0,2,3,5,...|
|      8|          0 (number)|[numberschools, w...|(262144,[1,3,5,9,...|
|      9|   10 Downing Street|[downing, streets...|(262144,[0,1,2,3,...|
|     10|        10th century|[centuryschools, ...|(262144,[0,1,2,3,...|
|     11|        11th century|[centuryschools, ...|

In [49]:
wiki_corpus = wikipedia_count_df.select("page_id", "features").map(list)

With k=22 we are aiming at describing main Wikipedia topics:
    https://en.wikipedia.org/wiki/Category:Main_topic_classifications

In [50]:
ldaModel_wikipedia = LDA.train(wiki_corpus,
                          k=22,
                          seed=1,
                          docConcentration=3.0,
                          topicConcentration=6.0,
                          optimizer='online')

In [51]:
topics_wikipedia = ldaModel_wikipedia.topicsMatrix()

In [52]:
no_of_topics=10
no_of_words=10
vocab = model.vocabulary

for topic in range(no_of_topics):
    print("Topic " + str(topic) + ":")
    top_words_idx = np.argsort(topics_wikipedia[:,topic])[::-1][0:no_of_words]
    words = []
    for word_id in top_words_idx:
        words.append(vocab[word_id])
    print(words)

Topic 0:
['–', 'years', 'war', 'time', 'city', 'states', 'world', 'french', 'united', 'made']
Topic 1:
['leeds', 'city', 'centre', 'park', '–', 'area', 'world', 'european', 'eu', 'north']
Topic 2:
['–', 'years', 'number', 'leibniz', 'war', 'city', 'people', 'system', 'time', 'century']
Topic 3:
['sheep', 'number', 'years', 'century', 'states', 'american', 'world', 'contact', 'time', 'system']
Topic 4:
['world', '–', 'time', 'water', 'century', 'years', 'states', 'found', 'form', 'god']
Topic 5:
['–', 'oz', 'time', 'world', 'years', 'city', 'states', '·', 'early', 'wine']
Topic 6:
['space', 'world', '–', 'number', 'years', 'system', 'time', 'united', 'international', 'measurement']
Topic 7:
['–', 'years', 'number', 'time', 'world', 'called', 'work', 'states', 'found', 'large']
Topic 8:
['–', 'time', 'years', 'world', 'american', 'war', 'united', 'states', 'city', 'century']
Topic 9:
['time', '–', 'people', 'number', 'called', 'years', 'including', 'world', 'city', 'states']
