# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *H*

**Names:**

* *Jérémy Baffou*
* *Antoine Basseto*
* *Andrea Pinto*

---

### Regular Imports

In [1]:
import json
import numpy as np
from utils import load_json, load_pkl
from collections import Counter

In [2]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors, SparseVector

Useful `pyspark.mllib` docs for the lab:
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.clustering.LDA.html
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.clustering.LDAModel.html

### Import Data

In [3]:
# Clean data from lab4-1-vsm with all lemmatized terms 
courses_preprocessed = load_json('data/courses-lem-preprocessed.json')
courses_preprocessed_rdd = sc.parallelize(courses_preprocessed)

In [4]:
# Clean data from lab4-1-vsm without most frequent words (quantile 0.95)
courses_preprocessed_fw = load_json('data/courses-lem-fw-preprocessed.json')
courses_preprocessed_fw_rdd = sc.parallelize(courses_preprocessed_fw)

**Note:** We tried both datasets, and `courses_preprocessed` gives much better results. <br>
All the following computations therefore use it, and `courses_preprocessed_fw` is not used again.

---
## Prepare Data for Model

### Tools for Data Preparation

In [5]:
"""Create super set bag of words
@param: (rdd) RDD dataset of courses
@return: Set of all existing terms
"""
def bow(rdd):
    return set(rdd.flatMap(lambda c: c["description"]).collect())

In [6]:
"""Populate term count matrix
@param: (data) Term counts in a list format i.e. tuples (a, b)
with a: tuple (doc ID, term ID) and b: number of occurrences
@param: (n) Term count matrix rows i.e. number of documents
@param: (m) Term count matrix columns i.e. number of terms
@return: Full term count matrix
"""
def populate_tc(data, n, m):
    tc = np.zeros((n, m))
    for x in data:
        infos = x[0]
        value = x[1]
        
        doc_id = infos[0]
        word_id = infos[1]
        
        tc[doc_id][word_id] = value
    
    return tc

In [7]:
"""Create Sparse Vector from Dense Vector
@param: (dense_vect) Dense Vector input
@return: Same vector under Sparse Vector format
"""
def create_sparse_vector(dense_vect):
    d = len(dense_vect)
    indices = []
    values = []
    for i, e in enumerate(dense_vect):
        if e != 0:
            indices.append(i)
            values.append(e)
    return Vectors.sparse(d, indices, values)

### Preprocessing Data

The variables below are used throughout the notebook. Note that none of them are RDDs.
- **`superbow`** is the *set* of all terms in the corpus.
- **`words_ids_map`** is a *dict* mapping each term to a unique ID $\in \mathbb{N}$
- **`doc_ids_map`** is a *dict* mapping each document (here, a course description) to a unique ID $\in \mathbb{N}$
- **`tc`** is the term count *matrix*. It has one row per document and one column per term.
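
As a minimal sketch of how these four structures relate, the same construction can be reproduced in plain Python on a toy corpus. The two courses and every `toy_*` name below are hypothetical and exist only for illustration. The sketch sorts the vocabulary so that term IDs are reproducible, whereas the notebook assigns IDs in the iteration order of a set, which can differ between Python sessions.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the course descriptions
toy_courses = [
    {"courseId": "A-101", "description": ["model", "data", "model"]},
    {"courseId": "B-202", "description": ["data", "circuit"]},
]

# Vocabulary (the "superbow"); sorted here only to make the IDs reproducible
vocabulary = sorted({w for c in toy_courses for w in c["description"]})
toy_words_ids_map = {w: i for i, w in enumerate(vocabulary)}
toy_doc_ids_map = {c["courseId"]: i for i, c in enumerate(toy_courses)}

# Term count matrix: one row per document, one column per term
toy_tc = [[0] * len(vocabulary) for _ in toy_courses]
for c in toy_courses:
    row = toy_doc_ids_map[c["courseId"]]
    for word, count in Counter(c["description"]).items():
        toy_tc[row][toy_words_ids_map[word]] = count

print(toy_words_ids_map)
print(toy_tc)
```

Each row of `toy_tc` plays the role of one row of `tc`: the count stored at column `j` is the number of times the term with ID `j` occurs in that document.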

In [8]:
superbow = bow(courses_preprocessed_rdd)
words_ids_map = dict(zip(superbow, range(len(superbow))))

In [9]:
courses_ids = courses_preprocessed_rdd.map(lambda c: c['courseId']).collect()
doc_ids_map = dict(zip(courses_ids, range(len(courses_ids))))
doc_ids = list(doc_ids_map.values())

In [10]:
termcount = (
    courses_preprocessed_rdd
    .map(lambda c: [((doc_ids_map[c["courseId"]], words_ids_map[w]),1) for w in c["description"]])
    .flatMap(lambda c: c)
    .reduceByKey(lambda x, y : x + y)
)

In [11]:
tc = populate_tc(termcount.collect(), len(doc_ids), len(superbow))

### Model Parameters

**LDA Model** requires **`doc_tc`**, a list of pairs, each holding a document ID and the corresponding term counts as a `pyspark.mllib.linalg` vector.
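
The next cell wraps every row of `tc` in a dense vector, although most entries of a term count row are zero. The `create_sparse_vector` helper defined earlier builds the `(size, indices, values)` arguments of `Vectors.sparse` instead. The sketch below shows that conversion in plain Python, without Spark; `to_sparse_triple` is a hypothetical name used only here, and the sample row is made up.

```python
def to_sparse_triple(dense_row):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense_row) if v != 0]
    values = [dense_row[i] for i in indices]
    return len(dense_row), indices, values


# Made-up term count row: only two of the five terms occur in the document
row = [0.0, 2.0, 0.0, 0.0, 1.0]
size, indices, values = to_sparse_triple(row)
print(size, indices, values)
```

Because the indices are produced by a single left-to-right pass, they come out already sorted in increasing order, which is also what the Wikipedia cell further down enforces with `np.argsort`.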

In [12]:
tcs_dense = [Vectors.dense(wordcount) for wordcount in tc]
doc_tc_tup = list(zip(doc_ids, tcs_dense))
doc_tc = list(map(lambda x: list(x), doc_tc_tup))

In [13]:
rdd = sc.parallelize(doc_tc)

---
## Exercise 4.8: Topics extraction

In [14]:
"""Extract K topics from LDA Model
@param: (model) LDA Model on which we work
@param: (words_per_topic) Words to print per topic
@param: (words_ids_map) Dict mapping words to their ID
"""
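def extract_topic(model, words_per_topic, words_ids_map):
    # Invert the word -> ID map once instead of rebuilding a key list for every word
    id_to_word = {word_id: word for word, word_id in words_ids_map.items()}
    topics = model.describeTopics(words_per_topic)
    for i, topic in enumerate(topics):
        # topic is a (term IDs, term weights) pair; only the IDs are needed here
        words = [id_to_word[word_id] for word_id in topic[0]]
        print(f'For topic {i + 1 :>2} we have: {words}')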

### LDA Model and topic extraction

In [15]:
model_48 = LDA.train(rdd, k=10, seed=1)

In [16]:
extract_topic(model_48, 5, words_ids_map)

For topic  1 we have: ['material', 'energy', 'property', 'process', 'chemical']
For topic  2 we have: ['student', 'cell', 'method', 'molecular', 'learn']
For topic  3 we have: ['learn', 'model', 'method', 'processing', 'analysis']
For topic  4 we have: ['model', 'student', 'learn', 'method', 'exam']
For topic  5 we have: ['system', 'model', 'design', 'method', 'learn']
For topic  6 we have: ['student', 'presentation', 'paper', 'learn', 'development']
For topic  7 we have: ['student', 'method', 'problem', 'report', 'learn']
For topic  8 we have: ['project', 'student', 'design', 'work', 'group']
For topic  9 we have: ['method', 'model', 'system', 'image', 'electron']
For topic 10 we have: ['circuit', 'device', 'method', 'system', 'sensor']


### Label topics
- Topic 1 could be `material physics`
- Topic 2 could be `life sciences`
- Topic 3 could be `mathematics`
- Topic 4 could be `evaluation`
- Topic 5 could be `system design`
- Topic 6 could be `learning methods`
- Topic 7 could be `evaluation`
- Topic 8 could be `group evaluation`
- Topic 9 could be `imagery`
- Topic 10 could be `electrical engineering`

### Comparison with LSI
Topics are much more meaningful than with LSI, but some results still leave something to be desired, mainly because the same few words (`method`, `learn`, `student`, `model`) recur across many of the topics.
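
The repetition can be quantified directly from the printed output. The sketch below copies the ten top-5 word lists shown above and counts in how many topics each word appears; `shared_words` and its `min_topics` threshold are hypothetical names introduced for this illustration.

```python
from collections import Counter

# Top-5 words per topic, copied from the output of extract_topic(model_48, 5, words_ids_map)
topics = [
    ['material', 'energy', 'property', 'process', 'chemical'],
    ['student', 'cell', 'method', 'molecular', 'learn'],
    ['learn', 'model', 'method', 'processing', 'analysis'],
    ['model', 'student', 'learn', 'method', 'exam'],
    ['system', 'model', 'design', 'method', 'learn'],
    ['student', 'presentation', 'paper', 'learn', 'development'],
    ['student', 'method', 'problem', 'report', 'learn'],
    ['project', 'student', 'design', 'work', 'group'],
    ['method', 'model', 'system', 'image', 'electron'],
    ['circuit', 'device', 'method', 'system', 'sensor'],
]


def shared_words(topics, min_topics=2):
    """Return (word, topic count) pairs for words listed in at least min_topics topics."""
    counts = Counter(word for topic in topics for word in set(topic))
    return [(w, c) for w, c in counts.most_common() if c >= min_topics]


repeated = shared_words(topics, min_topics=4)
print(repeated)
# → [('method', 7), ('learn', 6), ('student', 5), ('model', 4)]
```

Four generic words account for 22 of the 50 top-word slots, which is why several topics are hard to tell apart from their top 5 words alone.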

---
## Exercise 4.9: Dirichlet hyperparameters

In [17]:
"""Create LDA Models with alpha variations
@param: (rdd) RDD on which to train the models
@param: (beta) Fixed beta (topic concentration) param
@return: models - LDA Models trained with moving alpha param
@return: alphas - Corresponding alpha parameters
@return: beta - The fixed beta param, passed through for reporting
"""
def moving_alpha(rdd, beta):
    models = []
    alphas = []
    for alpha in [1.01, 1.3, 1.7] + list(range(2, 21, 3)):
        print(f'alpha: {alpha} beta: {beta}', end='\r')
        m = LDA.train(rdd, k=10, docConcentration=float(alpha), topicConcentration=float(beta), seed=1)
        models.append(m)
        alphas.append(alpha)
    print(f'Finished to train the {len(models)} models', end='\r')
    return models, alphas, beta

In [18]:
"""Create LDA Models with beta variations
@param: (rdd) RDD on which to train the models
@param: (alpha) Fixed alpha (doc concentration) param
@return: models - LDA Models trained with moving beta param
@return: betas - Corresponding beta parameters
@return: alpha - The fixed alpha param, passed through for reporting
"""
def moving_beta(rdd, alpha):
    models = []
    betas = []
    for beta in [1.01, 1.3, 1.7] + list(range(2, 16, 2)):
        print(f'alpha: {alpha} beta: {beta}', end='\r')
        m = LDA.train(rdd, k=10, docConcentration=float(alpha), topicConcentration=float(beta), seed=1)
        models.append(m)
        betas.append(beta)
    print(f'Finished to train the {len(models)} models', end='\r')
    return models, betas, alpha

In [19]:
def evaluate_models(models, nargs, args, narg, arg, words_ids_map):
    # nargs/args: name and values of the varying parameter; narg/arg: the fixed one
    for i, m in enumerate(models):
        print(f'For model {i + 1} with parameter {nargs} {args[i]} and {narg} {arg}')
        extract_topic(m, 5, words_ids_map)
        print('\n')

In [20]:
models_49a, alphas, beta = moving_alpha(rdd, 1.01)

Finished to train the 10 models

In [21]:
models_49b, betas, alpha = moving_beta(rdd, 6)

Finished to train the 10 models

In [22]:
evaluate_models(models_49a, 'alpha', alphas, 'beta', beta, words_ids_map)

For model 1 with parameter alpha 1.01 and beta 1.01
For topic  1 we have: ['material', 'energy', 'process', 'property', 'chemical']
For topic  2 we have: ['student', 'cell', 'method', 'molecular', 'biology']
For topic  3 we have: ['learn', 'model', 'method', 'equation', 'analysis']
For topic  4 we have: ['model', 'student', 'method', 'learn', 'theory']
For topic  5 we have: ['system', 'design', 'model', 'method', 'learn']
For topic  6 we have: ['student', 'presentation', 'learn', 'innovation', 'method']
For topic  7 we have: ['student', 'method', 'learn', 'problem', 'analysis']
For topic  8 we have: ['project', 'student', 'design', 'data', 'plan']
For topic  9 we have: ['method', 'image', 'electron', 'content', 'basic']
For topic 10 we have: ['method', 'system', 'numerical', 'circuit', 'student']


For model 2 with parameter alpha 1.3 and beta 1.01
For topic  1 we have: ['material', 'energy', 'process', 'property', 'chemical']
For topic  2 we have: ['student', 'cell', 'method', 'molecu

We can see that as alpha grows, the top words of the different topics become more similar. This is consistent with the theory: a high alpha pushes every document toward a near-uniform mixture of topics, which leaves the topics less room to specialize.
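
To illustrate the effect of alpha on topic mixtures, the sketch below samples per-document topic mixtures from a symmetric Dirichlet prior for the smallest and largest alpha tried above, and reports the average weight of the dominant topic. It uses only the standard library and looks at the prior alone, not at the trained models; all function names are hypothetical.

```python
import random


def sample_symmetric_dirichlet(k, concentration, rng):
    """Draw one point from a symmetric Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(k, concentration, n_samples=500, seed=1):
    """Average weight of the dominant topic over n_samples simulated documents."""
    rng = random.Random(seed)
    total = sum(
        max(sample_symmetric_dirichlet(k, concentration, rng))
        for _ in range(n_samples)
    )
    return total / n_samples


low_alpha = mean_largest_share(k=10, concentration=1.01)
high_alpha = mean_largest_share(k=10, concentration=20)
print(f'alpha=1.01: {low_alpha:.2f}  alpha=20: {high_alpha:.2f}')
```

With 10 topics, a perfectly uniform mixture gives every topic a weight of 0.1, so the closer the reported value is to 0.1, the less any single topic dominates a document.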

In [23]:
evaluate_models(models_49b, 'beta', betas, 'alpha', alpha, words_ids_map)

For model 1 with parameter beta 1.01 and alpha 6
For topic  1 we have: ['material', 'energy', 'property', 'process', 'chemical']
For topic  2 we have: ['cell', 'student', 'method', 'molecular', 'learn']
For topic  3 we have: ['learn', 'model', 'method', 'analysis', 'system']
For topic  4 we have: ['model', 'student', 'theory', 'method', 'learn']
For topic  5 we have: ['system', 'model', 'design', 'method', 'learn']
For topic  6 we have: ['student', 'presentation', 'system', 'development', 'paper']
For topic  7 we have: ['student', 'method', 'problem', 'report', 'learn']
For topic  8 we have: ['project', 'student', 'design', 'data', 'work']
For topic  9 we have: ['method', 'model', 'electron', 'system', 'image']
For topic 10 we have: ['device', 'circuit', 'system', 'method', 'sensor']


For model 2 with parameter beta 1.3 and alpha 6
For topic  1 we have: ['material', 'energy', 'property', 'chemical', 'process']
For topic  2 we have: ['student', 'cell', 'molecular', 'method', 'biology']

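We can see that with a high beta, the word distribution of each topic becomes more uniform, and the topics quickly become almost indistinguishable. This agrees with the theory: beta is the concentration of the Dirichlet prior on the per-topic word distributions, and a large value smooths every topic toward the same distribution.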
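
For reference, the standard LDA generative model behind both observations can be written as follows, for $K$ topics, a vocabulary of size $V$ and symmetric priors. The mapping of $\alpha$ to `docConcentration` and of $\beta$ to `topicConcentration` follows the docstrings above; the symbols $\theta_d$, $\phi_k$, $z_{d,n}$ and $w_{d,n}$ are the usual textbook notation and do not appear elsewhere in this notebook.

$$
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha, \dots, \alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta, \dots, \beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word of document } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the word itself} \\
\mathrm{Var}[\theta_{d,k}] &= \frac{K - 1}{K^2 (K\alpha + 1)}, &
\mathrm{Var}[\phi_{k,w}] &= \frac{V - 1}{V^2 (V\beta + 1)}
\end{aligned}
$$

Both prior variances shrink as the corresponding concentration grows: a large $\alpha$ pulls every $\theta_d$ toward the uniform mixture $1/K$, and a large $\beta$ pulls every $\phi_k$ toward the uniform distribution $1/V$ over words.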

---
## Exercise 4.10: EPFL's taught subjects

From exercise `4.9`, we found that the parameters giving the most interpretable results are:

In [37]:
k = 15
alpha = 1.1
beta = 1.01

- k = 15 because EPFL has 13 sections, so we expect about that many course-related topics, plus a few more relating to administration or other matters.
- alpha = 1.1 because exercise `4.9` showed that a low alpha is needed
- beta = 1.01 because exercise `4.9` showed that a low beta is needed

### Run with optimal hyperparameters

In [38]:
model_410 = LDA.train(rdd, k=k, docConcentration=float(alpha), topicConcentration=float(beta), seed=1)

In [39]:
extract_topic(model_410, 5, words_ids_map)

For topic  1 we have: ['reaction', 'material', 'method', 'spectroscopy', 'chemistry']
For topic  2 we have: ['student', 'cell', 'biology', 'note', 'microscopy']
For topic  3 we have: ['method', 'numerical', 'equation', 'model', 'problem']
For topic  4 we have: ['student', 'project', 'report', 'learn', 'data']
For topic  5 we have: ['linear', 'model', 'method', 'image', 'probability']
For topic  6 we have: ['material', 'design', 'circuit', 'device', 'application']
For topic  7 we have: ['design', 'method', 'market', 'student', 'paper']
For topic  8 we have: ['student', 'work', 'project', 'design', 'learn']
For topic  9 we have: ['chemical', 'student', 'learn', 'process', 'method']
For topic 10 we have: ['energy', 'student', 'system', 'process', 'conversion']
For topic 11 we have: ['optical', 'optic', 'control', 'light', 'laser']
For topic 12 we have: ['model', 'method', 'analysis', 'signal', 'dynamic']
For topic 13 we have: ['student', 'data', 'learn', 'project', 'method']
For topic 14 

### Label topics

- Topic 1 `chemistry`
- Topic 2 `life sciences`
- Topic 3 `mathematics`
- Topic 4 `evaluation`
- Topic 5 `machine learning`
- Topic 6 `micro-technique`
- Topic 7 `MTE research`
- Topic 8 `student work`
- Topic 9 `chemistry experiments`
- Topic 10 `physics experiment`
- Topic 11 `optics`
- Topic 12 `signal processing`
- Topic 13 `learning methods`
- Topic 14 `energy policy`
- Topic 15 `modelling`

---
# Wikipedia structure

### Import Data

In [27]:
wikipedia_rdd = sc.textFile('/ix/wikipedia-for-schools.txt').map(json.loads)

### Tools for Data Preparation

In [28]:
"""Create super set bag of words
@param: (rdd) RDD dataset of Wikipedia pages
@return: Set of all existing tokens
"""
def bow_wiki(rdd):
    return set(rdd.flatMap(lambda c: c["tokens"]).collect())

### Preprocessing Data

In [29]:
wiki_superbow = bow_wiki(wikipedia_rdd)

In [30]:
dim = len(wiki_superbow)

In [31]:
tokens_ids_map = dict(zip(wiki_superbow, range(len(wiki_superbow))))

In [32]:
"""Create the sparse vector of term counts for the given page
@param: (page) Page dict whose 'tokens' entry lists the page's tokens
@return: Sparse term count vector
"""
def create_term_count_vector(page):
    terms_count = Counter(page['tokens'])
    indices = []
    values = []
    for w, c in terms_count.items():
        indices.append(tokens_ids_map[w])
        values.append(c)
         
    args_sorted = np.argsort(indices)
    values = np.array(values)[args_sorted]
    indices = np.array(indices)[args_sorted]
    return Vectors.sparse(dim, indices, values)

In [33]:
wiki_rdd = wikipedia_rdd.zipWithIndex().map(lambda t: [t[1], create_term_count_vector(t[0])])

---
## Exercise 4.11: Wikipedia structure

Based on our interpretation of the results, we think the parameters giving the most interpretable results are:

In [34]:
k = 20
alpha = 1.05
beta = 1.01

We choose:
- k = 20 because, as seen here (https://schools-wikipedia.org/wp/index/subject.html), there are 16 subjects covered in the dataset, and we expect a few subtleties to be uncovered
- alpha = 1.05 because we expect a strongly non-uniform topic distribution per article
- beta = 1.01 because we expect a strongly non-uniform word distribution per topic

### Run with optimal hyperparameters

In [35]:
model_411 = LDA.train(wiki_rdd, k=k, docConcentration=float(alpha), topicConcentration=float(beta), seed=1)

In [36]:
extract_topic(model_411, 10, tokens_ids_map)

For topic  1 we have: ['time', 'king', 'film', 'series', 'death', 'album', 'war', 'years', 'early', 'work']
For topic  2 we have: ['games', 'game', 'united', 'music', 'states', 'company', 'world', 'windows', 'country', 'computer']
For topic  3 we have: ['species', 'human', 'cells', 'blood', 'disease', 'common', 'cell', 'called', 'study', 'found']
For topic  4 we have: ['–', 'river', 'open', 'world', 'years', 'won', 'time', 'final', 'grand', 'year']
For topic  5 we have: ['water', 'sea', 'species', 'north', 'island', 'hurricane', 'lake', 'south', 'ocean', 'tropical']
For topic  6 we have: ['city', 'government', 'population', 'area', 'national', 'largest', 'capital', 'centre', 'economic', 'world']
For topic  7 we have: ['earth', 'sun', 'solar', 'years', 'stars', 'planet', 'star', 'planets', 'mass', 'church']
For topic  8 we have: ['energy', '·', 'law', 'theory', 'light', 'system', 'time', 'mass', 'universe', 'force']
For topic  9 we have: ['american', 'war', '–', 'british', 'french', 'ja

### Label topics

- Topic 1 `films and series`
- Topic 2 `computer games`
- Topic 3 `life sciences`
- Topic 4 `geology`
- Topic 5 `places relating to water`
- Topic 6 `geo-political`
- Topic 7 `solar system`
- Topic 8 `physics`
- Topic 9 `american independence war`
- Topic 10 `chemistry`
- Topic 11 `languages in history`
- Topic 12 `religion and politics`
- Topic 13 `unspecified`
- Topic 14 `arty films`
- Topic 15 `industrial revolution`
- Topic 16 `football`
- Topic 17 `wars`
- Topic 18 `mathematics`
- Topic 19 `space race`
- Topic 20 `music`

We are quite happy with the results, as most of the clusters are easily recognizable, but they could surely be improved.