# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *D*

**Names:**

* *Marc Bickel*
* *Cyil Cadoux*
* *Emma Lejal*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors

In [2]:
import pickle
import numpy as np
from utils import load_json
from collections import defaultdict

## Exercise 4.8: Topics extraction

First we load the preprocessed version of the courses that we saved in part 1.

In [3]:
processed_courses = load_json('data/preProcessedCourses.txt')

We create a dictionary that gives the number of occurrences of each word (named wordCount). 
***
We also give a unique identifier to every word (uid field in wordCount). 
***
We create the reverse dictionary that maps each id back to its word (i.e. idToWord).

In [4]:
wordCount = {}
idToWord = {}
unique_id = 0
for c in processed_courses:
    listOfWords = c['description'].split(' ')
    for w in listOfWords:
        try :
            wordCount[w]['count'] +=1
        except KeyError :
            wordCount[w] = {'count' : 1, 'uid' : unique_id}
            idToWord[unique_id] = w
            unique_id += 1

In [5]:
# Largest assigned word id; ids start at 0, so unique_id itself is the number of distinct words
unique_id -1

7861
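As a small illustration of how the id counter, the largest assigned id and the number of distinct words relate, here is a minimal sketch on a made-up corpus. The `toy_corpus` strings and the helper name `build_vocabulary` are assumptions for illustration only; they are not part of the lab data.

```python
def build_vocabulary(descriptions):
    """Assign consecutive ids (starting at 0) to words in order of first appearance."""
    word_count = {}
    id_to_word = {}
    next_id = 0
    for description in descriptions:
        for word in description.split(' '):
            if word in word_count:
                word_count[word]['count'] += 1
            else:
                word_count[word] = {'count': 1, 'uid': next_id}
                id_to_word[next_id] = word
                next_id += 1
    return word_count, id_to_word, next_id


# Hypothetical toy corpus, only for illustration.
toy_corpus = ["data network data", "network graph", "graph data model"]
toy_counts, toy_id_to_word, toy_next_id = build_vocabulary(toy_corpus)

print(toy_next_id)           # value of the counter after the loop
print(max(toy_id_to_word))   # largest id actually assigned
print(len(toy_counts))       # number of distinct words
```

On this toy corpus the counter ends at 4, which is the number of distinct words, while the largest assigned id is 3.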

We create a function that, given a course, returns the id of this course and its associated vector. 
* * *
For each word in wordCount, the vector gives how many times the word appears in the description of the given course.

In [6]:
def course_vector(course):
    id = course['courseId']
    counts = defaultdict(int)
    for token in course['description'].split(' '):
        token_id = wordCount[token]['uid']
        counts[token_id] += 1
    counts = sorted(counts.items())
    keys =[x[0] for x in counts]
    values = [x[1] for x in counts]
    return (id, Vectors.sparse(unique_id -1, keys, values))

Example of use of the course_vector function with the Internet Analytics course:

In [7]:
course_vector(processed_courses[43])

('COM-308',
 SparseVector(7861, {10: 2.0, 21: 3.0, 42: 2.0, 63: 6.0, 78: 2.0, 99: 2.0, 108: 2.0, 117: 2.0, 125: 1.0, 135: 1.0, 142: 1.0, 186: 1.0, 204: 1.0, 230: 2.0, 292: 2.0, 318: 1.0, 382: 4.0, 401: 1.0, 420: 1.0, 448: 1.0, 455: 1.0, 458: 2.0, 470: 2.0, 533: 1.0, 534: 1.0, 537: 2.0, 654: 1.0, 688: 1.0, 842: 2.0, 875: 5.0, 967: 1.0, 1113: 2.0, 1119: 1.0, 1174: 1.0, 1239: 1.0, 1280: 1.0, 1284: 1.0, 1398: 5.0, 1426: 2.0, 1427: 5.0, 1428: 5.0, 1429: 2.0, 1430: 1.0, 1431: 1.0, 1432: 1.0, 1433: 1.0, 1434: 2.0, 1435: 2.0, 1436: 2.0, 1437: 1.0, 1438: 4.0, 1439: 2.0, 1440: 1.0, 1441: 2.0, 1442: 1.0, 1443: 1.0, 1444: 1.0, 1445: 1.0, 1446: 1.0, 1447: 1.0, 1448: 1.0}))
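The same bag-of-words encoding can be sketched without Spark. The function below returns the sorted list of word ids and the matching list of counts, i.e. the two arrays that `course_vector` hands to `Vectors.sparse`. The names `sparse_counts` and `toy_vocab` are hypothetical and the vocabulary is invented for the example.

```python
from collections import Counter


def sparse_counts(description, word_to_id):
    """Return (indices, values): sorted word ids and how often each one occurs."""
    counts = Counter(word_to_id[token] for token in description.split(' '))
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return indices, values


# Hypothetical vocabulary mapping each word to its id.
toy_vocab = {'data': 0, 'network': 1, 'graph': 2, 'model': 3}
indices, values = sparse_counts("graph data graph model", toy_vocab)
print(indices, values)
```

Words that do not occur in the description (here `network`) simply have no entry, which is what keeps the representation sparse.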

We create the corpus of documents with the result of course_vector for every course, associated with an integer id

In [8]:
documents = []
doc_uid = 0
for c in processed_courses:
    id, vect = course_vector(c)
    documents.append([doc_uid, vect])
    doc_uid += 1

In [9]:
rdd = sc.parallelize(documents)

We train the model

In [118]:
lda_model = LDA.train(rdd, k = 10)

In [137]:
topics = lda_model.describeTopics()

In [145]:
def print_topics(k, topics):
    for i in range(k):
        topicWords = []
        for j in range(10):
            topicWords.append(idToWord[topics[i][0][j]])
        print(i+1, " : ", topicWords)

In [146]:
print_topics(10, topics)

1  :  ['treatment', 'manag', 'environment', 'physic', 'mass', 'role', 'impact', 'risk', 'wast', 'explain']
2  :  ['architectur', 'robot', 'polici', 'case', 'innov', 'organ', 'water', 'assist', 'urban', 'practic']
3  :  ['optim', 'linear', 'stochast', 'data', 'deriv', 'statist', 'supervis', 'continu', 'price', 'financi']
4  :  ['data', 'algorithm', 'engin', 'comput', 'linear', 'statist', 'practic', 'assign', 'function', 'space']
5  :  ['chemic', 'thermodynam', 'phase', 'properti', 'state', 'reaction', 'metal', 'physic', 'polym', 'transfer']
6  :  ['biolog', 'paper', 'molecular', 'protein', 'discuss', 'research', 'literatur', 'chemic', 'prepar', 'note']
7  :  ['mechan', 'physic', 'magnet', 'flow', 'properti', 'laser', 'chemistri', 'electron', 'reaction', 'organ']
8  :  ['imag', 'signal', 'circuit', 'digit', 'comput', 'filter', 'visual', 'code', 'analog', 'cod']
9  :  ['energi', 'network', 'power', 'manag', 'object', 'integr', 'comput', 'level', 'convers', 'languag']
10  :  ['optic', 'mic

Subjects for topics
1. Environmental sciences
2. Architecture
3. Mathematics
4. Communication Systems
5. Chemistry
6. Biology
7. Physics
8. Signal Processing
9. ?
10. Physics
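`describeTopics()` returns, for each topic, a pair of parallel arrays: the term indices and the corresponding term weights. `print_topics` only uses the indices. The sketch below shows one way the weights could be displayed next to the words, which can help when labelling an ambiguous topic. The `toy_topics` and `toy_id_to_word` values are invented to mimic that structure; they are not output of the trained model, and `top_words_with_weights` is a hypothetical helper.

```python
def top_words_with_weights(topics, id_to_word, n_words=3):
    """For each topic, pair its first n_words terms with their weights."""
    described = []
    for term_ids, term_weights in topics:
        pairs = [(id_to_word[term_id], weight)
                 for term_id, weight in zip(term_ids[:n_words], term_weights[:n_words])]
        described.append(pairs)
    return described


# Invented data with the same shape as the describeTopics() result.
toy_id_to_word = {0: 'data', 1: 'network', 2: 'graph', 3: 'model'}
toy_topics = [
    ([0, 3, 1], [0.5, 0.3, 0.2]),
    ([2, 1, 0], [0.6, 0.3, 0.1]),
]

toy_described = top_words_with_weights(toy_topics, toy_id_to_word, n_words=2)
for number, pairs in enumerate(toy_described, start=1):
    print(number, " : ", pairs)
```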

## Exercise 4.9: Dirichlet hyperparameters

Varying alpha means varying docConcentration.
***
Its default value is -1.0.
***
We set the seed manually to compare results.

In [158]:
lda_model_varying_alpha = LDA.train(rdd, k=10, docConcentration = 1.99 , topicConcentration = 1.01, seed = 1)

In [159]:
alpha_topics = lda_model_varying_alpha.describeTopics()

In [160]:
print_topics(10, alpha_topics) # alpha = 1.99

1  :  ['risk', 'market', 'price', 'innov', 'financi', 'financ', 'manag', 'econom', 'deriv', 'option']
2  :  ['chemistri', 'magnet', 'physic', 'robot', 'comput', 'organ', 'properti', 'reaction', 'drug', 'solid']
3  :  ['research', 'data', 'manag', 'engin', 'inform', 'object', 'architectur', 'team', 'perform', 'case']
4  :  ['signal', 'function', 'imag', 'recommend', 'cancer', 'rate', 'set', 'type', 'filter', 'acoust']
5  :  ['linear', 'algorithm', 'comput', 'statist', 'data', 'algebra', 'optim', 'numer', 'space', 'signal']
6  :  ['mechan', 'electron', 'properti', 'sensor', 'surfac', 'polym', 'micro', 'physic', 'measur', 'microscopi']
7  :  ['energi', 'circuit', 'electron', 'power', 'thermodynam', 'chemic', 'integr', 'quantum', 'state', 'semiconductor']
8  :  ['energi', 'data', 'optim', 'random', 'continu', 'network', 'chain', 'practic', 'function', 'map']
9  :  ['optic', 'imag', 'biolog', 'laser', 'paper', 'research', 'light', 'protein', 'experiment', 'literatur']
10  :  ['flow', 'heat'

In [154]:
print_topics(10, alpha_topics) # alpha = 1.1

1  :  ['risk', 'market', 'price', 'innov', 'financ', 'financi', 'econom', 'architectur', 'deriv', 'manag']
2  :  ['magnet', 'chemistri', 'robot', 'physic', 'comput', 'research', 'topic', 'reaction', 'organ', 'drug']
3  :  ['manag', 'data', 'object', 'research', 'engin', 'inform', 'architectur', 'perform', 'team', 'case']
4  :  ['signal', 'imag', 'function', 'cancer', 'recommend', 'acoust', 'rate', 'fourier', 'type', 'topic']
5  :  ['linear', 'algorithm', 'comput', 'data', 'statist', 'algebra', 'optim', 'signal', 'numer', 'space']
6  :  ['electron', 'mechan', 'properti', 'sensor', 'physic', 'micro', 'microscopi', 'surfac', 'polym', 'film']
7  :  ['energi', 'circuit', 'electron', 'power', 'thermodynam', 'chemic', 'integr', 'state', 'quantum', 'biolog']
8  :  ['energi', 'data', 'optim', 'random', 'stochast', 'function', 'continu', 'chain', 'mass', 'network']
9  :  ['optic', 'imag', 'biolog', 'laser', 'paper', 'protein', 'research', 'light', 'molecular', 'microscopi']
10  :  ['flow', 'heat

Impact of alpha: changing alpha changes the order of the words within each topic, i.e. the probability of each word appearing in that topic. With a larger alpha, words that look out of place in a topic are pushed out: for example, 'architectur' disappears from topic 1, which looks like a finance topic, and 'biolog' disappears from topic 7, which looks like a physics topic. This is consistent with the corpus: all the words broadly belong to science and are part of the same bigger subject, so they are not very different from one another and the distributions tend to be more uniform than sharp.
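To build intuition for what a Dirichlet concentration parameter such as alpha does, the following minimal sketch draws samples from a symmetric Dirichlet distribution using only the standard library (normalised Gamma draws) and reports the average weight of the largest component. It does not involve Spark's LDA; the helper names, the dimension of 10 and the number of samples are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Draw one sample from a symmetric Dirichlet via normalised Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_max_weight(concentration, dim=10, n_samples=2000, seed=0):
    """Average, over many samples, of the largest component of the sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += max(sample_dirichlet(concentration, dim, rng))
    return total / n_samples


# Concentration values taken from the experiments in this notebook.
for concentration in (1.1, 1.99, 6.0):
    print(concentration, round(mean_max_weight(concentration), 3))
```

A smaller concentration gives a larger average maximum weight, meaning sharper distributions dominated by a few components; a larger concentration gives flatter, more uniform distributions.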

Varying beta means varying topicConcentration.

In [164]:
lda_model_varying_beta = LDA.train(rdd, k=10, docConcentration = 6.0 , topicConcentration = 1.99, seed = 1)

In [165]:
beta_topics = lda_model_varying_beta.describeTopics()

In [166]:
print_topics(10, beta_topics) # beta = 1.99

1  :  ['risk', 'market', 'price', 'optim', 'financi', 'stochast', 'financ', 'measur', 'deriv', 'manag']
2  :  ['magnet', 'robot', 'chemistri', 'practic', 'physic', 'properti', 'organ', 'research', 'reaction', 'state']
3  :  ['architectur', 'data', 'inform', 'research', 'manag', 'tool', 'urban', 'engin', 'comput', 'algorithm']
4  :  ['rate', 'signal', 'imag', 'data', 'manag', 'tool', 'function', 'case', 'optim', 'continu']
5  :  ['data', 'algorithm', 'comput', 'linear', 'paper', 'optim', 'function', 'statist', 'inform', 'recommend']
6  :  ['mechan', 'electron', 'stabil', 'properti', 'metal', 'data', 'microscopi', 'fractur', 'dynam', 'imag']
7  :  ['energi', 'comput', 'circuit', 'power', 'data', 'quantum', 'engin', 'integr', 'thermodynam', 'inform']
8  :  ['energi', 'optic', 'function', 'physic', 'engin', 'recommend', 'transvers', 'integr', 'properti', 'comput']
9  :  ['optic', 'imag', 'laser', 'biolog', 'physic', 'light', 'protein', 'microscopi', 'spectroscopi', 'cover']
10  :  ['energi

In [163]:
print_topics(10, beta_topics) # beta = 1.01

1  :  ['risk', 'market', 'optim', 'price', 'manag', 'financi', 'financ', 'decis', 'deriv', 'econom']
2  :  ['magnet', 'chemistri', 'physic', 'properti', 'robot', 'solid', 'organ', 'drug', 'practic', 'state']
3  :  ['engin', 'research', 'architectur', 'data', 'inform', 'team', 'manag', 'case', 'build', 'object']
4  :  ['signal', 'function', 'continu', 'recommend', 'set', 'filter', 'analyz', 'rate', 'cancer', 'object']
5  :  ['linear', 'algorithm', 'comput', 'data', 'statist', 'algebra', 'network', 'numer', 'space', 'optim']
6  :  ['mechan', 'electron', 'properti', 'sensor', 'stabil', 'measur', 'polym', 'surfac', 'micro', 'sampl']
7  :  ['energi', 'circuit', 'chemic', 'reaction', 'quantum', 'thermodynam', 'power', 'state', 'integr', 'electron']
8  :  ['energi', 'protein', 'practic', 'data', 'transvers', 'network', 'assist', 'map', 'perform', 'resourc']
9  :  ['optic', 'imag', 'biolog', 'laser', 'paper', 'microscopi', 'research', 'light', 'experiment', 'spectroscopi']
10  :  ['flow', 'hea

Impact of beta: a low beta works better than a high one. For example, the word 'stochast' should not be in our finance topic (topic 1): it appears there with beta = 1.99 but not with beta = 1.01. On the other hand, in topic 5 the word 'linear' ranks first with the low beta although it is not the best descriptor of this data-science topic; with the high beta it drops to fourth place, behind 'data', 'algorithm' and 'comput'. As the documents are course descriptions, it is unlikely that a single document covers many topics, so we take a low value of beta to get sharper distributions.
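The comparison between two runs can be made more systematic by listing which top words they share. The sketch below does this for topic 1, with the two word lists copied from the outputs above; `compare_top_words` is a hypothetical helper introduced for this example.

```python
def compare_top_words(words_a, words_b):
    """Return (shared, only_a, only_b), each sorted, for two lists of top words."""
    set_a, set_b = set(words_a), set(words_b)
    return sorted(set_a & set_b), sorted(set_a - set_b), sorted(set_b - set_a)


# Topic 1 as printed above for beta = 1.99 and beta = 1.01.
topic1_beta_high = ['risk', 'market', 'price', 'optim', 'financi',
                    'stochast', 'financ', 'measur', 'deriv', 'manag']
topic1_beta_low = ['risk', 'market', 'optim', 'price', 'manag',
                   'financi', 'financ', 'decis', 'deriv', 'econom']

shared, only_high, only_low = compare_top_words(topic1_beta_high, topic1_beta_low)
print("shared:", shared)
print("only with beta = 1.99:", only_high)
print("only with beta = 1.01:", only_low)
```

For topic 1, eight of the ten words are shared; 'stochast' and 'measur' appear only with the high beta, while 'decis' and 'econom' appear only with the low beta.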

## Exercise 4.10: EPFL's taught subjects

We vary k, alpha and beta to find the best topics.

In [191]:
lda_model_courses = LDA.train(rdd, k=8, docConcentration = 6.0 , topicConcentration = 1.01, seed = 1)

In [192]:
courses_topics = lda_model_courses.describeTopics()

In [193]:
print_topics(8, courses_topics)

1  :  ['chemistri', 'reaction', 'organ', 'space', 'mechan', 'topic', 'inform', 'synthesi', 'data', 'includ']
2  :  ['data', 'case', 'particip', 'manag', 'polici', 'set', 'inform', 'analyz', 'busi', 'solv']
3  :  ['optim', 'risk', 'market', 'stochast', 'deriv', 'manag', 'price', 'financi', 'data', 'decis']
4  :  ['energi', 'flow', 'research', 'physic', 'transfer', 'heat', 'chemic', 'experiment', 'mass', 'thermodynam']
5  :  ['biolog', 'function', 'molecular', 'statist', 'protein', 'dynam', 'note', 'mechan', 'signal', 'molecul']
6  :  ['optic', 'imag', 'linear', 'algorithm', 'quantum', 'laser', 'light', 'comput', 'stabil', 'signal']
7  :  ['engin', 'electron', 'physic', 'properti', 'mechan', 'solid', 'integr', 'sensor', 'metal', 'semiconductor']
8  :  ['architectur', 'comput', 'circuit', 'digit', 'object', 'tool', 'practic', 'perform', 'languag', 'softwar']


We want slightly fewer than 10 topics and choose k = 8, because some of the 10 topics were unclear. We keep the high alpha value of 6.0: the vocabulary still belongs to a global science domain, so the distributions tend to be more uniform. However, these are course descriptions, so it is unlikely that all the topics are completely mixed; we therefore choose a low beta value of 1.01.
***
Subjects for topics
1. Chemistry
2. Management of technology
3. Financial engineering
4. Geology
5. Biology
6. Light physics
7. Electricity engineering
8. Computer sciences

## Exercise 4.11: Wikipedia structure

We retrieve the Wikipedia data.

In [3]:
import json

In [4]:
wiki_rdd = sc.textFile("/ix/wikipedia-for-schools.txt").map(json.loads)

We execute the same steps to have the corpus to train

In [5]:
#One instance of wiki_rdd contains 'tokens', 'title' and 'page_id'
wikiWordCount = {}
idToWikiWord = {}
unique_wiki_id = 0
for page in wiki_rdd.collect():
    for w in page['tokens']:
        try :
            wikiWordCount[w]['count'] +=1
        except KeyError :
            wikiWordCount[w] = {'count' : 1, 'uid' : unique_wiki_id}
            idToWikiWord[unique_wiki_id] = w
            unique_wiki_id += 1

In [6]:
wikiWordCount['britain']

{'count': 4178, 'uid': 11}

In [7]:
idToWikiWord[11]

'britain'

In [8]:
# Largest assigned word id; ids start at 0, so unique_wiki_id itself is the number of distinct words
unique_wiki_id -1

494493

In [9]:
def wiki_vector(wiki):
    counts = defaultdict(int)
    for token in wiki['tokens']:
        token_id = wikiWordCount[token]['uid']
        counts[token_id] += 1
    counts = sorted(counts.items())
    keys =[x[0] for x in counts]
    values = [x[1] for x in counts]
    return Vectors.sparse(unique_wiki_id -1, keys, values)

In [10]:
wikiDocuments = []
wiki_uid = 0
for wiki in wiki_rdd.collect():
    vect = wiki_vector(wiki)
    wikiDocuments.append([wiki_uid, vect])
    wiki_uid += 1

In [11]:
wiki_docs_rdd = sc.parallelize(wikiDocuments)

Here is our reasoning for choosing the Dirichlet hyperparameters:
- First, Wikipedia covers many different topics (the idea of an encyclopedia is indeed to define every topic), so increasing k would give many interesting, distinct topics. Because of computation time, however, we must keep k = 5.
- As a page is supposed to be a definition, we think that the topic distribution should be sharp, so we choose a low topicConcentration (beta).
- As the words come from a much wider dictionary than the scientific vocabulary of EPFL courses, we can use a lower value of docConcentration, as the distributions of words will be sharper.

In [13]:
lda_model_wiki = LDA.train(wiki_docs_rdd, k=5, docConcentration = 1.3 , topicConcentration = 1.01, seed = 1)

In [14]:
wiki_topics = lda_model_wiki.describeTopics()

In [15]:
def print_wiki_topics(k, topics):
    for i in range(k):
        topicWords = []
        for j in range(10):
            topicWords.append(idToWikiWord[topics[i][0][j]])
        print(i+1, " : ", topicWords)

In [17]:
print_wiki_topics(5, wiki_topics)

1  :  ['species', 'water', 'large', 'area', 'sea', 'years', 'north', 'found', 'called', 'small']
2  :  ['war', 'american', '–', 'british', 'united', 'states', 'government', 'march', 'january', 'french']
3  :  ['city', 'world', 'south', '–', 'time', 'united', 'national', 'international', 'government', 'years']
4  :  ['number', 'system', 'theory', 'form', 'called', 'time', 'human', 'common', 'energy', '=']
5  :  ['time', 'music', 'early', 'work', 'film', 'made', 'including', 'years', 'people', 'world']


Topics : 
1. nature
2. geopolitics
3. geography
4. sciences
5. culture

These topics are very general because k is small, but they reflect different parts of Wikipedia.