# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *H*

**Names:**

* *Jérémy Baffou*
* *Antoine Basseto*
* *Andrea Pinto*

---

### Regular Imports

In [1]:
import json
import numpy as np
from utils import load_json, load_pkl

In [2]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors, SparseVector

Useful `pyspark.mllib` docs for the lab:
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.clustering.LDA.html
- https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.clustering.LDAModel.html

### Import Data

In [3]:
# Clean data from lab4-1-vsm with all lemmatized terms 
courses_preprocessed = load_json('data/courses-lem-preprocessed.json')
courses_preprocessed_rdd = sc.parallelize(courses_preprocessed)

In [4]:
# Clean data from lab4-1-vsm without most frequent words (quantile 0.95)
courses_preprocessed_fw = load_json('data/courses-lem-fw-preprocessed.json')
courses_preprocessed_fw_rdd = sc.parallelize(courses_preprocessed_fw)

**Note:** We tried both datasets, and `courses_preprocessed` gives much better results. <br>
All the following computations therefore use it, and `courses_preprocessed_fw` is not used again.

---
## Prepare Data for Model

### Tools for Data Preparation

In [5]:
"""Create super set bag of words
@param: (rdd) RDD dataset of courses
@return: Set of all existing terms
"""
def bow(rdd):
    return set(rdd.flatMap(lambda c: c["description"]).collect())

In [6]:
"""Populate term count matrix
@param: (data) Term counts as a list of tuples (a, b)
with a: tuple (doc ID, term ID) and b: number of occurrences
@param: (n) Term count matrix rows i.e. number of documents
@param: (m) Term count matrix columns i.e. number of terms
@return: Full term count matrix
"""
def populate_tc(data, n, m):
    tc = np.zeros((n, m))
    for x in data:
        infos = x[0]
        value = x[1]
        
        doc_id = infos[0]
        word_id = infos[1]
        
        tc[doc_id][word_id] = value
    
    return tc

In [7]:
"""Create Sparse Vector from Dense Vector
@param: (dense_vect) Dense Vector input
@return: Same vector under Sparse Vector format
"""
def create_sparse_vector(dense_vect):
    d = len(dense_vect)
    indices = []
    values = []
    for i, e in enumerate(dense_vect):
        if e != 0:
            indices.append(i)
            values.append(e)
    return Vectors.sparse(d, indices, values)
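
The dense-to-sparse conversion can be checked on a toy vector without Spark. The sketch below is only an illustration: `to_sparse_parts` is a hypothetical helper, not part of the lab code, and it returns the three pieces (size, indices, values) that `Vectors.sparse` is called with above.

```python
def to_sparse_parts(dense_vect):
    """Return (size, indices, values) for the non-zero entries of a dense vector."""
    indices = [i for i, e in enumerate(dense_vect) if e != 0]
    values = [dense_vect[i] for i in indices]
    return len(dense_vect), indices, values


# Toy vector, made up for illustration: only positions 1 and 4 are non-zero.
size, indices, values = to_sparse_parts([0.0, 2.0, 0.0, 0.0, 1.0])
print(size, indices, values)  # 5 [1, 4] [2.0, 1.0]
```

Only the non-zero counts are stored, which is what makes this format attractive when most terms are absent from a given document.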

### Preprocessing Data

The variables below are documented here. Note that none of them are RDDs.
- **`superbow`** is the *set* of all terms we are dealing with.
- **`words_ids_map`** is a *dict* mapping every term to a unique ID $\in \mathbb{N}$
- **`doc_ids_map`** is a *dict* mapping every document (here course descriptions) to a unique ID $\in \mathbb{N}$
- **`tc`** is the term count *matrix*. It has one row for every document and one column for every term.
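
To make these structures concrete, here is a minimal plain-Python sketch on a made-up two-document corpus. The `toy_*` names and the course IDs are hypothetical, and the vocabulary is sorted only so that the IDs are reproducible; the notebook itself assigns IDs in set iteration order.

```python
from collections import Counter

# Hypothetical toy corpus with the same fields as the course records.
toy_courses = [
    {"courseId": "A-101", "description": ["data", "model", "data"]},
    {"courseId": "B-202", "description": ["model", "signal"]},
]

# Vocabulary and ID maps (sorted here for reproducible IDs).
toy_bow = sorted({w for c in toy_courses for w in c["description"]})
toy_words_ids = {w: i for i, w in enumerate(toy_bow)}
toy_doc_ids = {c["courseId"]: i for i, c in enumerate(toy_courses)}

# Term count matrix: one row per document, one column per term.
toy_tc = [[0] * len(toy_bow) for _ in toy_courses]
for c in toy_courses:
    row = toy_doc_ids[c["courseId"]]
    for word, count in Counter(c["description"]).items():
        toy_tc[row][toy_words_ids[word]] = count

print(toy_words_ids)  # {'data': 0, 'model': 1, 'signal': 2}
print(toy_tc)         # [[2, 1, 0], [0, 1, 1]]
```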

In [8]:
superbow = bow(courses_preprocessed_rdd)
words_ids_map = dict(zip(superbow, range(len(superbow))))

In [9]:
courses_ids = courses_preprocessed_rdd.map(lambda c: c['courseId']).collect()
doc_ids_map = dict(zip(courses_ids, range(len(courses_ids))))
doc_ids = list(doc_ids_map.values())

In [10]:
termcount = (
    courses_preprocessed_rdd
    .map(lambda c: [((doc_ids_map[c["courseId"]], words_ids_map[w]),1) for w in c["description"]])
    .flatMap(lambda c: c)
    .reduceByKey(lambda x, y : x + y)
)

In [11]:
tc = populate_tc(termcount.collect(), len(doc_ids), len(superbow))

### Model Parameters

The **LDA model** requires **`doc_tc`**, a list of `[document ID, term counts]` pairs in which the term counts are stored as a `pyspark.mllib.linalg` vector.

In [12]:
tcs_dense = [Vectors.dense(wordcount) for wordcount in tc]
tcs_sparse = list(map(lambda x: create_sparse_vector(x), tcs_dense))
doc_tc_tup = list(zip(doc_ids, tcs_sparse))
doc_tc = list(map(lambda x: list(x), doc_tc_tup))

The cell above can take 1-2 s to run because we convert every row to the sparse vector format, which is cheaper for the model to process. <br>
Note also that the LDA model actually needs this `doc_tc` list as an `RDD`:

In [13]:
rdd = sc.parallelize(doc_tc)

---
## Exercise 4.8: Topics extraction

In [14]:
"""Extract K topics from LDA Model
@param: (model) LDA Model on which we work
@param: (words_per_topic) Words to print per topic
@param: (words_ids_map) Dict mapping words to their ID
"""
def extract_topic(model, words_per_topic, words_ids_map):
    topics = model.describeTopics(words_per_topic)
    for i, topic in enumerate(topics):
        words_ids = topic[0]
        words = []
        for word_id in words_ids:
            word = list(words_ids_map.keys())[word_id]
            words.append(word)
        print(f'For topic {i + 1 :>2} we have: {words}')
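
One way to translate term indices back to words is a reverse dictionary built once. The sketch below uses made-up data shaped like the result of `describeTopics`, i.e. one (term indices, term weights) pair per topic. `toy_words_ids`, `toy_topics` and `top_words` are hypothetical names, not part of the lab code.

```python
# Hypothetical vocabulary and a made-up result shaped like describeTopics output.
toy_words_ids = {"data": 0, "model": 1, "signal": 2, "image": 3}
toy_topics = [([3, 2], [0.4, 0.3]), ([0, 1], [0.5, 0.2])]

# Reverse mapping from term ID to word, built a single time.
toy_ids_words = {i: w for w, i in toy_words_ids.items()}


def top_words(topics, ids_words):
    """Translate each topic's term indices back to words."""
    return [[ids_words[i] for i in term_ids] for term_ids, _weights in topics]


print(top_words(toy_topics, toy_ids_words))  # [['image', 'signal'], ['data', 'model']]
```

A dictionary lookup does not depend on the insertion order of the vocabulary, so it stays correct even if the ID assignment changes.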

### LDA Model and topic extraction

In [15]:
model_48 = LDA.train(rdd, k=10, seed=1)

In [16]:
extract_topic(model_48, 5, words_ids_map)

For topic  1 we have: ['image', 'method', 'signal', 'system', 'process']
For topic  2 we have: ['data', 'student', 'method', 'report', 'analysis']
For topic  3 we have: ['material', 'structure', 'method', 'student', 'learn']
For topic  4 we have: ['student', 'project', 'work', 'system', 'design']
For topic  5 we have: ['model', 'method', 'theory', 'time', 'problem']
For topic  6 we have: ['learn', 'model', 'student', 'method', 'teach']
For topic  7 we have: ['student', 'method', 'optic', 'engineering', 'learn']
For topic  8 we have: ['student', 'cell', 'chemical', 'biology', 'molecular']
For topic  9 we have: ['system', 'design', 'circuit', 'device', 'technology']
For topic 10 we have: ['energy', 'student', 'learn', 'method', 'content']


### Label topics
***TODO.***
- Topic 1 could be `Image Processing`
- etc.

### Comparison with LSI
***TODO.***
- *Not yet written.*

---
## Exercise 4.9: Dirichlet hyperparameters

In [17]:
"""Create LDA Model with alpha variations
@param: (rdd) RDD on which to train the model
@param: (beta) Fixed beta (topic concentration) param
@return: models - LDA Models trained with moving alpha param
@return: alphas - Corresponding alpha's parameters
"""
def moving_alpha(rdd, beta):
    models = []
    alphas = []
    for alpha in range(2, 10):
        print(f'alpha: {alpha} beta: {beta}', end='\r')
        m = LDA.train(rdd, k=10, docConcentration=float(alpha), topicConcentration=float(beta), seed=1)
        models.append(m)
        alphas.append(alpha)
    print(f'Finished to train the {len(models)} models', end='\r')
    return models, alphas

In [18]:
"""Create LDA Model with beta variations
@param: (rdd) RDD on which to train the model
@param: (alpha) Fixed alpha (doc concentration) param
@return: models - LDA Models trained with moving beta param
@return: betas - Corresponding beta's parameters
"""
def moving_beta(rdd, alpha):
    models = []
    betas = []
    for beta in range(1, 10):
        beta = (beta / 10) + 1
        print(f'alpha: {alpha} beta: {beta}', end='\r')
        m = LDA.train(rdd, k=10, docConcentration=float(alpha), topicConcentration=float(beta), seed=1)
        models.append(m)
        betas.append(beta)
    print(f'Finished to train the {len(models)} models', end='\r')
    return models, betas

In [19]:
models_49a, alphas = moving_alpha(rdd, 1.01)

Finished to train the 8 models

In [20]:
models_49b, betas = moving_beta(rdd, 6)

Finished to train the 9 models
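
As background for interpreting these runs, the effect of a symmetric Dirichlet concentration can be illustrated without Spark. The sketch below samples topic mixtures by normalizing Gamma draws and reports the average weight of the dominant topic. It only illustrates the prior, not what `LDA.train` does internally, and `sample_symmetric_dirichlet` and `mean_largest_share` are hypothetical helpers. The concentrations are taken from the range used above, which stays above 1 (to our understanding of the MLlib documentation, the EM optimizer expects values greater than 1).

```python
import random


def sample_symmetric_dirichlet(concentration, k, rng):
    """Draw one sample from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(concentration, k=10, n_samples=500, seed=1):
    """Average weight of the dominant component over many samples."""
    rng = random.Random(seed)
    largest = [max(sample_symmetric_dirichlet(concentration, k, rng))
               for _ in range(n_samples)]
    return sum(largest) / n_samples


for concentration in (2.0, 5.0, 9.0):
    print(concentration, round(mean_largest_share(concentration), 3))
```

Larger concentrations pull the samples toward the uniform mixture, so the dominant share shrinks toward 1/k. Applied to `docConcentration`, this means documents are spread over more topics; applied to `topicConcentration`, topics are spread over more words.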

### Conclusion
***TODO.***
- *Not yet written.*

---
## Exercise 4.10: EPFL's taught subjects

From exercise `4.9`, we found that the parameters giving the most interpretable results are:

In [21]:
k = 10
alpha = 5
beta = 1.5

***TODO.*** Justify the choice of these values.

### Run with optimal hyperparameters

In [22]:
model_410 = LDA.train(rdd, k=k, docConcentration=float(alpha), topicConcentration=float(beta), seed=1)

In [23]:
extract_topic(model_410, 5, words_ids_map)

For topic  1 we have: ['image', 'method', 'process', 'electron', 'microscopy']
For topic  2 we have: ['student', 'data', 'project', 'report', 'method']
For topic  3 we have: ['material', 'structure', 'method', 'student', 'learn']
For topic  4 we have: ['student', 'model', 'learn', 'method', 'system']
For topic  5 we have: ['model', 'method', 'theory', 'basic', 'time']
For topic  6 we have: ['learn', 'method', 'student', 'model', 'analysis']
For topic  7 we have: ['method', 'learn', 'student', 'concept', 'exercise']
For topic  8 we have: ['student', 'cell', 'biology', 'molecular', 'chemical']
For topic  9 we have: ['system', 'design', 'circuit', 'device', 'student']
For topic 10 we have: ['energy', 'student', 'learn', 'method', 'technology']
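
One possible way to quantify how much a topic changed between two runs is the Jaccard similarity of its top words. The sketch below compares topic 1 of `model_48` and `model_410`, using the word lists printed in this notebook. It assumes that topic indices line up between the two runs, which the printed lists suggest but do not guarantee; `jaccard` is a hypothetical helper.

```python
def jaccard(a, b):
    """Share of distinct words that two word lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Top-5 words of topic 1, copied from the two printed outputs in this notebook.
topic1_model_48 = ['image', 'method', 'signal', 'system', 'process']
topic1_model_410 = ['image', 'method', 'process', 'electron', 'microscopy']

print(round(jaccard(topic1_model_48, topic1_model_410), 2))  # 0.43
```

The two lists share 3 of 7 distinct words, so the topic kept its core (`image`, `method`, `process`) while its remaining words shifted.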


### Label topics

- Topic 1: *TODO*
- etc.

---
# Wikipedia structure

### Import Data

In [None]:
wikipedia_rdd = sc.textFile('/ix/wikipedia-for-schools.txt').map(json.loads)

### Tools for Data Preparation

In [None]:
"""Create super set bag of words
@param: (rdd) RDD dataset of Wikipedia pages
@return: Set of all existing tokens
"""
def bow_wiki(rdd):
    return set(rdd.flatMap(lambda c: c["tokens"]).collect())

### Preprocessing Data

The variables below are documented here. Note that none of them are RDDs.
- **`wiki_superbow`** is the *set* of all tokens we are dealing with.
- **`tokens_ids_map`** is a *dict* mapping every token to a unique ID $\in \mathbb{N}$
- **`wiki_doc_ids_map`** is a *dict* mapping every document (here Wikipedia pages) to a unique ID $\in \mathbb{N}$
- **`wiki_tc`** is the term count *matrix*. It has one row for every document and one column for every token.

In [None]:
wiki_superbow = bow_wiki(wikipedia_rdd)
tokens_ids_map = dict(zip(wiki_superbow, range(len(wiki_superbow))))

In [None]:
page_ids = wikipedia_rdd.map(lambda c: c['page_id']).collect()
wiki_doc_ids_map = dict(zip(page_ids, range(len(page_ids))))
wiki_doc_ids = list(wiki_doc_ids_map.values())

In [None]:
wiki_termcount = (
    wikipedia_rdd
    .map(lambda c: [((wiki_doc_ids_map[c["page_id"]], tokens_ids_map[w]),1) for w in c["tokens"]])
    .flatMap(lambda c: c)
    .reduceByKey(lambda x, y : x + y)
)

In [None]:
wiki_tc = populate_tc(wiki_termcount.collect(), len(wiki_doc_ids), len(wiki_superbow))

In [None]:
wiki_tc.shape

### Model Parameters

The **LDA model** requires **`wiki_doc_tc`**, a list of `[document ID, term counts]` pairs in which the term counts are stored as a `pyspark.mllib.linalg` vector.

In [None]:
# Build the vectors from the Wikipedia matrix and IDs, not the course ones
wiki_tcs_dense = [Vectors.dense(wordcount) for wordcount in wiki_tc]
wiki_tcs_sparse = list(map(lambda x: create_sparse_vector(x), wiki_tcs_dense))
wiki_doc_tc_tup = list(zip(wiki_doc_ids, wiki_tcs_sparse))
wiki_doc_tc = list(map(lambda x: list(x), wiki_doc_tc_tup))

The cell above can take a while to run because we convert every row to the sparse vector format, which is cheaper for the model to process. <br>
Note also that the LDA model actually needs this `wiki_doc_tc` list as an `RDD`:

In [None]:
wiki_rdd = sc.parallelize(wiki_doc_tc)
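
For a larger corpus, the dense documents-by-tokens matrix may become expensive to hold in memory. A possible alternative, sketched below in plain Python, is to group the `((doc ID, term ID), count)` pairs directly into per-document sparse parts. `sparse_rows` is a hypothetical helper, not part of the lab code, and each resulting triple matches the arguments of the `Vectors.sparse(size, indices, values)` call used earlier.

```python
from collections import defaultdict


def sparse_rows(term_counts, n_terms):
    """Group ((doc_id, term_id), count) pairs into per-document sparse parts."""
    per_doc = defaultdict(dict)
    for (doc_id, term_id), count in term_counts:
        per_doc[doc_id][term_id] = count
    rows = {}
    for doc_id, counts in per_doc.items():
        indices = sorted(counts)  # sparse vectors expect increasing indices
        rows[doc_id] = (n_terms, indices, [float(counts[i]) for i in indices])
    return rows


# Made-up counts in the shape produced by the reduceByKey step above.
toy_counts = [((0, 2), 1), ((1, 0), 3), ((0, 0), 2)]
print(sparse_rows(toy_counts, n_terms=4))
# {0: (4, [0, 2], [2.0, 1.0]), 1: (4, [0], [3.0])}
```

This skips the intermediate dense matrix entirely, at the cost of one dictionary per document.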

---
## Exercise 4.11: Wikipedia structure

From our interpretations, we think the parameters giving the most interpretable results are:

In [None]:
k = ?
alpha = ?
beta = ?

***TODO.*** Explanations

### Run with optimal hyperparameters

In [None]:
model_411 = LDA.train(wiki_rdd, k=k, docConcentration=float(alpha), topicConcentration=float(beta), seed=1)

In [None]:
extract_topic(model_411, 5, tokens_ids_map)

### Label topics

- Topic 1: *TODO*
- etc.