# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *X*

**Names:**

* *Linqi Liu*
* *Yifei Song*
* *Ying Xu Dempster Tay*
* *Yuhang Yan*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from utils import load_json
import numpy as np
import json

## Exercise 4.8: Topics extraction

In [2]:
# Load data
courses = load_json('data/courses_processed.txt')
course_ids = [x['courseId'] for x in courses]
terms = np.load('terms_indices.npy', allow_pickle=True).tolist()

print('courses[0]:', courses[0], '\n')
print('course_ids[:5]:', course_ids[:5], '\n')
print('terms[:5]:', terms[:5])

courses[0]: {'courseId': 'MSE-440', 'name': 'Composites technology', 'description': 'latest develop process gener organ composit discuss nanocomposit adapt composit present product develop cost analysi studi market practic team work basic composit materi process composit design composit structur current develop nanocomposit composit biocomposit adapt composit applic drive forc market cost analysi aerospac keyword composit applic nanocomposit biocomposit adapt composit design cost prerequisit requir notion polym recommend polym composit propos suitabl product perform criteria product composit part appli basic equat process mechan properti model composit materi discuss main type composit applic transvers skill work methodolog task gener domain specif IT resourc tool commun effect profession disciplin evalu perform team receiv respond appropri feedback teach cathedra invit speaker group session exercis work project expect activ attend lectur design composit part bibliographi search assess

In [3]:
# Transfer the data to RDD format
courses_rdd = sc.parallelize(courses).map(
    lambda x: [course_ids.index(x['courseId']), Vectors.dense([x['description'].split(" ").count(t) for t in terms])]
)

print('courses_rdd:', courses_rdd)

courses_rdd: PythonRDD[1] at RDD at PythonRDD.scala:54
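
The per-term `count` call in the cell above rescans the description once for every vocabulary term. A minimal sketch of an equivalent single-pass count, assuming the same space-separated descriptions; `dense_counts` is a hypothetical helper and the toy data is made up for illustration, not taken from the course dataset:

```python
from collections import Counter

def dense_counts(description, vocabulary):
    """Return term counts ordered like `vocabulary`, scanning the text once."""
    counts = Counter(description.split(" "))
    # A Counter returns 0 for terms that never occur in the description
    return [counts[t] for t in vocabulary]

# Hypothetical toy data, not taken from the course dataset
toy_terms = ['composit', 'market', 'polym']
toy_description = 'composit design composit market'

print(dense_counts(toy_description, toy_terms))
```

The resulting list can be passed to `Vectors.dense` in the same way as the list comprehension used above.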


In [4]:
# Train the LDA model
lda_model = LDA.train(courses_rdd, k=10)

In [5]:
# Visualize the model: print the top 10 terms of each topic
def model_virtualization(model):
    lda_topic = model.describeTopics(maxTermsPerTopic=10)
    for i, topics in enumerate(lda_topic):
        print(f'Topic {i+1}:')
        print([terms[topic] for topic in topics[0]])
        
model_virtualization(lda_model)

Topic 1:
['materi', 'electron', 'energi', 'properti', 'physic', 'devic', 'applic', 'magnet', 'sensor', 'electr']
Topic 2:
['system', 'present', 'paper', 'assess', 'energi', 'process', 'engin', 'control', 'discuss', 'requir']
Topic 3:
['cell', 'structur', 'biolog', 'chemic', 'engin', 'mechan', 'design', 'chemistri', 'water', 'week']
Topic 4:
['problem', 'analysi', 'assess', 'concept', 'basic', 'data', 'exam', 'mass', 'probabl', 'teach']
Topic 5:
['model', 'imag', 'analysi', 'linear', 'process', 'signal', 'statist', 'exercis', 'basic', 'code']
Topic 6:
['data', 'comput', 'program', 'model', 'algorithm', 'commun', 'flow', 'system', 'softwar', 'assess']
Topic 7:
['project', 'report', 'research', 'architectur', 'semest', 'plan', 'work', 'develop', 'evalu', 'laboratori']
Topic 8:
['model', 'case', 'assess', 'develop', 'market', 'innov', 'stochast', 'present', 'group', 'work']
Topic 9:
['optic', 'basic', 'equat', 'theori', 'applic', 'laser', 'process', 'principl', 'mechan', 'quantum']
Topic 1
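
`describeTopics` returns, for each topic, a pair of term indices and matching term weights; the function above prints only the terms. A minimal sketch of how the weights could be displayed as well, using a hand-written stand-in for the `describeTopics` output (`format_topics`, `toy_described` and `toy_vocabulary` are hypothetical names, and the numbers are invented for illustration):

```python
def format_topics(described, vocabulary):
    """Pair each top term with its rounded weight, topic by topic."""
    formatted = []
    for term_indices, term_weights in described:
        formatted.append(
            [(vocabulary[i], round(w, 3)) for i, w in zip(term_indices, term_weights)]
        )
    return formatted

# Hypothetical stand-in for the output of describeTopics on a 2-topic model
toy_described = [([0, 2], [0.4, 0.1]), ([1, 2], [0.3, 0.2])]
toy_vocabulary = ['materi', 'data', 'model']

for i, topic in enumerate(format_topics(toy_described, toy_vocabulary)):
    print(f'Topic {i+1}: {topic}')
```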

In [6]:
# Manually add labels to topics
labels = [
    'Chemical Processes',
    'Management and Optimization',
    'Materials and Electronics',
    'Optics and Computation',
    'Architecture and Urban Development',
    'Statistical Analysis and Signal Processing',
    'Electronics and Material Properties',
    'Cellular Biology',
    'Project Management and Research Skills',
    'Energy and Thermodynamics'
]

for i, label in enumerate(labels):
    print(f'Topic {i+1}: {label}')

Topic 1: Chemical Processes
Topic 2: Management and Optimization
Topic 3: Materials and Electronics
Topic 4: Optics and Computation
Topic 5: Architecture and Urban Development
Topic 6: Statistical Analysis and Signal Processing
Topic 7: Electronics and Material Properties
Topic 8: Cellular Biology
Topic 9: Project Management and Research Skills
Topic 10: Energy and Thermodynamics


#### Compared with LSI:


## Exercise 4.9: Dirichlet hyperparameters

### Fix k = 10 and β = 1.01, and vary α

In [7]:
def test_with_a(a, b=1.01, k=10):
    print(f'k = {k}, a = {a}, b = {b}\n')
    lda_model = LDA.train(courses_rdd, docConcentration=a, topicConcentration=b, k=k)
    model_virtualization(lda_model)

In [8]:
a = 1.01
test_with_a(a)

k = 10, a = 1.01, b = 1.01

Topic 1:
['biolog', 'molecular', 'chemistri', 'cell', 'discuss', 'protein', 'present', 'exercis', 'lectur', 'properti']
Topic 2:
['project', 'data', 'assess', 'program', 'present', 'plan', 'comput', 'skill', 'learn', 'research']
Topic 3:
['model', 'analysi', 'report', 'process', 'mass', 'data', 'problem', 'project', 'experiment', 'experi']
Topic 4:
['optic', 'electron', 'materi', 'structur', 'laser', 'microscopi', 'mechan', 'devic', 'techniqu', 'applic']
Topic 5:
['materi', 'physic', 'magnet', 'cell', 'field', 'solid', 'mechan', 'properti', 'effect', 'state']
Topic 6:
['energi', 'process', 'product', 'assess', 'chemic', 'environment', 'problem', 'water', 'lectur', 'reaction']
Topic 7:
['group', 'present', 'work', 'project', 'develop', 'studi', 'assess', 'case', 'semest', 'evalu']
Topic 8:
['system', 'signal', 'process', 'model', 'commun', 'imag', 'lectur', 'digit', 'week', 'teach']
Topic 9:
['theori', 'control', 'statist', 'basic', 'probabl', 'model', 'measu

In [9]:
a = 2.0
test_with_a(a)

k = 10, a = 2.0, b = 1.01

Topic 1:
['project', 'manag', 'present', 'plan', 'assess', 'group', 'case', 'evalu', 'engin', 'work']
Topic 2:
['electron', 'chemic', 'energi', 'chemistri', 'reaction', 'cell', 'devic', 'organ', 'molecular', 'basic']
Topic 3:
['present', 'assess', 'lectur', 'paper', 'problem', 'basic', 'analysi', 'teach', 'data', 'discuss']
Topic 4:
['mechan', 'physic', 'energi', 'process', 'heat', 'mass', 'flow', 'applic', 'materi', 'transfer']
Topic 5:
['model', 'control', 'process', 'signal', 'system', 'code', 'simul', 'digit', 'assess', 'circuit']
Topic 6:
['model', 'theori', 'analysi', 'statist', 'probabl', 'basic', 'equat', 'stochast', 'problem', 'linear']
Topic 7:
['develop', 'present', 'research', 'semest', 'work', 'biolog', 'lectur', 'innov', 'protein', 'technolog']
Topic 8:
['product', 'inform', 'network', 'assess', 'applic', 'system', 'industri', 'commun', 'present', 'model']
Topic 9:
['data', 'project', 'report', 'problem', 'skill', 'algorithm', 'optim', 'learn', 

In [10]:
a = 6.0
test_with_a(a)

k = 10, a = 6.0, b = 1.01

Topic 1:
['project', 'assess', 'research', 'develop', 'discuss', 'present', 'work', 'gener', 'oral', 'reactor']
Topic 2:
['project', 'report', 'data', 'plan', 'skill', 'assess', 'work', 'evalu', 'research', 'scientif']
Topic 3:
['biolog', 'present', 'paper', 'cell', 'learn', 'protein', 'lectur', 'network', 'assess', 'molecular']
Topic 4:
['model', 'analysi', 'linear', 'statist', 'process', 'code', 'materi', 'test', 'assess', 'signal']
Topic 5:
['model', 'stochast', 'market', 'control', 'deriv', 'price', 'risk', 'financi', 'introduct', 'circuit']
Topic 6:
['chemic', 'chemistri', 'reaction', 'present', 'process', 'organ', 'engin', 'semest', 'lectur', 'kinet']
Topic 7:
['energi', 'case', 'technolog', 'power', 'engin', 'week', 'manag', 'assess', 'mass', 'present']
Topic 8:
['equat', 'mechan', 'numer', 'flow', 'physic', 'theori', 'concept', 'basic', 'simul', 'model']
Topic 9:
['optic', 'materi', 'imag', 'electron', 'applic', 'process', 'basic', 'properti', 'laser'

In [11]:
a = 10.0
test_with_a(a)

k = 10, a = 10.0, b = 1.01

Topic 1:
['optic', 'imag', 'basic', 'signal', 'process', 'principl', 'model', 'exercis', 'filter', 'techniqu']
Topic 2:
['process', 'structur', 'energi', 'lectur', 'analysi', 'week', 'assess', 'stabil', 'present', 'mass']
Topic 3:
['materi', 'electron', 'properti', 'chemic', 'mechan', 'reaction', 'structur', 'chemistri', 'magnet', 'process']
Topic 4:
['model', 'data', 'analysi', 'assess', 'present', 'problem', 'risk', 'develop', 'gener', 'linear']
Topic 5:
['cell', 'teach', 'basic', 'biolog', 'quantum', 'model', 'present', 'assess', 'analysi', 'lectur']
Topic 6:
['architectur', 'system', 'model', 'control', 'assess', 'water', 'final', 'present', 'work', 'develop']
Topic 7:
['model', 'probabl', 'theori', 'stochast', 'deriv', 'market', 'statist', 'price', 'time', 'analysi']
Topic 8:
['report', 'project', 'data', 'research', 'scientif', 'plan', 'experi', 'experiment', 'evalu', 'assess']
Topic 9:
['laser', 'applic', 'sensor', 'power', 'basic', 'electron', 'model

In [12]:
a = 20.0
test_with_a(a)

k = 10, a = 20.0, b = 1.01

Topic 1:
['assess', 'present', 'project', 'cell', 'work', 'model', 'analysi', 'activ', 'process', 'lectur']
Topic 2:
['model', 'process', 'materi', 'basic', 'assess', 'exercis', 'introduct', 'lectur', 'techniqu', 'devic']
Topic 3:
['materi', 'model', 'assess', 'applic', 'mechan', 'structur', 'basic', 'lectur', 'electron', 'analysi']
Topic 4:
['process', 'model', 'assess', 'analysi', 'basic', 'present', 'applic', 'data', 'concept', 'prerequisit']
Topic 5:
['system', 'engin', 'model', 'assess', 'concept', 'teach', 'project', 'lectur', 'learn', 'present']
Topic 6:
['optic', 'structur', 'imag', 'week', 'assess', 'basic', 'process', 'microscopi', 'lectur', 'present']
Topic 7:
['model', 'report', 'project', 'assess', 'risk', 'evalu', 'present', 'optim', 'market', 'problem']
Topic 8:
['model', 'project', 'assess', 'analysi', 'process', 'energi', 'lectur', 'present', 'work', 'evalu']
Topic 9:
['model', 'control', 'assess', 'robot', 'lectur', 'basic', 'concept', 'pre

In [13]:
a = 100.0
test_with_a(a)

k = 10, a = 100.0, b = 1.01

Topic 1:
['model', 'assess', 'lectur', 'process', 'present', 'analysi', 'project', 'concept', 'basic', 'work']
Topic 2:
['model', 'assess', 'present', 'analysi', 'lectur', 'process', 'basic', 'project', 'concept', 'exercis']
Topic 3:
['model', 'assess', 'process', 'present', 'basic', 'lectur', 'project', 'work', 'analysi', 'concept']
Topic 4:
['model', 'assess', 'present', 'process', 'lectur', 'basic', 'project', 'concept', 'work', 'analysi']
Topic 5:
['model', 'assess', 'process', 'present', 'analysi', 'basic', 'lectur', 'project', 'concept', 'materi']
Topic 6:
['model', 'assess', 'present', 'process', 'analysi', 'project', 'basic', 'lectur', 'concept', 'work']
Topic 7:
['model', 'assess', 'present', 'lectur', 'project', 'analysi', 'concept', 'process', 'work', 'basic']
Topic 8:
['model', 'assess', 'process', 'project', 'lectur', 'analysi', 'present', 'basic', 'concept', 'activ']
Topic 9:
['model', 'assess', 'present', 'process', 'analysi', 'lectur', 'basi

α is the Dirichlet prior on each document's topic mixture. A high α pushes every document towards a uniform mixture over the $k$ topics, so all topics end up fitted on roughly the same words. With α=100, every topic is dominated by general terms that appear in almost all course descriptions ('model', 'assess', 'lectur', 'present'). With α=20, a few domain-specific words persist ('optic', 'microscopi', 'robot'), but the overall effect is similar.

For α in {1.01, 2, 6}, there is no notable difference: the topics stay distinct and are dominated by domain-specific terms. From α=10 onwards, generic terms such as 'assess' and 'present' begin to displace the domain-specific ones, and the topics drift towards the uniform behaviour described above.
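
One way to see the effect of α in isolation is to sample topic mixtures directly from a symmetric Dirichlet prior and measure how close they are to uniform. The sketch below illustrates the prior only, not the fitted Spark model; it assumes the standard construction of a Dirichlet sample from normalised gamma draws, and `sample_dirichlet` and `mean_entropy` are hypothetical helpers:

```python
import math
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_entropy(alpha, k=10, n_docs=2000, seed=0):
    """Average entropy of sampled topic mixtures; log(k) is the uniform maximum."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_docs):
        theta = sample_dirichlet(alpha, k, rng)
        total += -sum(p * math.log(p) for p in theta if p > 0)
    return total / n_docs

for alpha in [1.01, 6.0, 20.0, 100.0]:
    print(f'alpha = {alpha}: mean entropy = {mean_entropy(alpha):.3f} '
          f'(uniform maximum = {math.log(10):.3f})')
```

The mean entropy rises towards $\log 10$ as α grows, which means the sampled mixtures approach the uniform distribution over the 10 topics.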

### Fix k = 10 and α = 6, and vary β

In [14]:
def test_with_b(b, a=6.0, k=10):
    print(f'k = {k}, a = {a}, b = {b}\n')
    lda_model = LDA.train(courses_rdd, docConcentration=a, topicConcentration=b, k=k)
    model_virtualization(lda_model)

In [15]:
b = 1.01
test_with_b(b)

k = 10, a = 6.0, b = 1.01

Topic 1:
['materi', 'electron', 'applic', 'properti', 'devic', '2', 'network', 'physic', 'field', 'organ']
Topic 2:
['energi', 'process', 'chemic', 'protein', 'biolog', 'reaction', 'cell', 'thermodynam', 'engin', 'concept']
Topic 3:
['work', 'learn', 'assess', 'teach', 'project', 'group', 'concept', 'code', 'commun', 'present']
Topic 4:
['optic', 'present', 'imag', 'project', 'research', 'develop', 'paper', 'assess', 'lectur', 'work']
Topic 5:
['report', 'project', 'scientif', 'present', 'evalu', 'plan', 'research', 'assess', 'laboratori', 'industri']
Topic 6:
['model', 'linear', 'numer', 'analysi', 'exercis', 'problem', 'statist', 'probabl', 'optim', 'data']
Topic 7:
['data', 'project', 'assess', 'technolog', 'evalu', 'present', 'develop', 'inform', 'polici', 'skill']
Topic 8:
['system', 'analysi', 'function', 'signal', 'control', 'problem', 'fourier', 'assess', 'theori', 'model']
Topic 9:
['mechan', 'structur', 'equat', 'physic', 'flow', 'quantum', 'dynam',

In [16]:
b = 2.0
test_with_b(b)

k = 10, a = 6.0, b = 2.0

Topic 1:
['model', 'project', 'assess', 'data', 'report', 'present', 'research', 'optim', 'lectur', 'skill']
Topic 2:
['optic', 'energi', 'electron', 'materi', 'applic', 'physic', 'devic', 'principl', 'properti', 'sensor']
Topic 3:
['structur', 'model', 'system', 'assess', 'week', 'engin', 'lectur', 'work', 'project', 'present']
Topic 4:
['cell', 'assess', 'present', 'develop', 'lectur', 'work', 'activ', 'concept', 'teach', 'project']
Topic 5:
['materi', 'assess', 'project', 'present', 'report', 'work', 'skill', 'robot', 'properti', 'plan']
Topic 6:
['model', 'data', 'process', 'analysi', 'algorithm', 'assess', 'project', 'signal', 'linear', 'code']
Topic 7:
['model', 'imag', 'equat', 'basic', 'concept', 'assess', 'flow', 'function', 'mechan', 'lectur']
Topic 8:
['model', 'analysi', 'problem', 'exercis', 'assess', 'process', 'basic', 'lectur', 'system', 'concept']
Topic 9:
['present', 'evalu', 'assess', 'project', 'technolog', 'paper', 'report', 'studi', 'acti

In [17]:
b = 5.0
test_with_b(b)

k = 10, a = 6.0, b = 5.0

Topic 1:
['assess', 'project', 'model', 'process', 'present', 'analysi', 'lectur', 'work', 'activ', 'evalu']
Topic 2:
['model', 'assess', 'process', 'present', 'lectur', 'basic', 'analysi', 'materi', 'concept', 'applic']
Topic 3:
['model', 'assess', 'process', 'basic', 'analysi', 'concept', 'lectur', 'materi', 'system', 'present']
Topic 4:
['model', 'assess', 'process', 'analysi', 'present', 'project', 'lectur', 'concept', 'work', 'basic']
Topic 5:
['optic', 'model', 'assess', 'present', 'basic', 'lectur', 'materi', 'concept', 'process', 'applic']
Topic 6:
['model', 'assess', 'project', 'present', 'process', 'data', 'analysi', 'lectur', 'concept', 'basic']
Topic 7:
['assess', 'model', 'present', 'lectur', 'work', 'project', 'concept', 'basic', 'analysi', 'activ']
Topic 8:
['model', 'assess', 'analysi', 'basic', 'lectur', 'present', 'process', 'concept', 'project', 'problem']
Topic 9:
['model', 'assess', 'present', 'lectur', 'process', 'basic', 'structur', 'pro

In [18]:
b = 10.0
test_with_b(b)

k = 10, a = 6.0, b = 10.0

Topic 1:
['model', 'assess', 'analysi', 'present', 'process', 'project', 'lectur', 'basic', 'concept', 'work']
Topic 2:
['model', 'assess', 'present', 'lectur', 'process', 'basic', 'project', 'analysi', 'concept', 'work']
Topic 3:
['model', 'assess', 'present', 'project', 'process', 'lectur', 'analysi', 'work', 'concept', 'basic']
Topic 4:
['model', 'assess', 'present', 'process', 'analysi', 'lectur', 'project', 'concept', 'basic', 'work']
Topic 5:
['model', 'assess', 'present', 'process', 'analysi', 'project', 'lectur', 'concept', 'basic', 'work']
Topic 6:
['model', 'assess', 'process', 'basic', 'analysi', 'lectur', 'present', 'concept', 'project', 'prerequisit']
Topic 7:
['model', 'assess', 'process', 'present', 'basic', 'lectur', 'analysi', 'concept', 'project', 'prerequisit']
Topic 8:
['model', 'assess', 'process', 'basic', 'present', 'lectur', 'analysi', 'concept', 'prerequisit', 'activ']
Topic 9:
['model', 'assess', 'process', 'present', 'lectur', 'basi

In [19]:
b = 20.0
test_with_b(b)

k = 10, a = 6.0, b = 20.0

Topic 1:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'basic', 'project', 'concept', 'work']
Topic 2:
['model', 'assess', 'process', 'present', 'lectur', 'analysi', 'basic', 'project', 'concept', 'work']
Topic 3:
['model', 'assess', 'present', 'process', 'lectur', 'basic', 'analysi', 'project', 'concept', 'work']
Topic 4:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'basic', 'project', 'concept', 'work']
Topic 5:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 6:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 7:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 8:
['model', 'assess', 'present', 'process', 'project', 'lectur', 'analysi', 'basic', 'concept', 'work']
Topic 9:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'basic', 'pro

In [20]:
b = 100.0
test_with_b(b)

k = 10, a = 6.0, b = 100.0

Topic 1:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 2:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 3:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 4:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 5:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 6:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 7:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 8:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 9:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', '

Similarly, β controls how similar the topics are to one another. β is the Dirichlet prior on each topic's distribution over terms: larger values smooth these distributions towards uniform over the vocabulary, so the topics become alike, whereas smaller values yield more distinct and diverse topics. In the runs above, the topics are already blurred by generic terms at β=2, nearly identical from β=5, and the printed topics share exactly the same top terms at β=100.
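
The similarity between topics can be quantified with the Jaccard overlap of their top-term lists. A small sketch, using only the first two topics printed above for β=1.01 and β=100 (`jaccard` and `mean_pairwise_jaccard` are hypothetical helpers, not part of the lab code):

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two lists of top terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(topics):
    """Average Jaccard similarity over all pairs of topics."""
    pairs = list(combinations(topics, 2))
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)

# Topics 1 and 2 copied from the printed output for b = 1.01
topics_b_low = [
    ['materi', 'electron', 'applic', 'properti', 'devic', '2', 'network', 'physic', 'field', 'organ'],
    ['energi', 'process', 'chemic', 'protein', 'biolog', 'reaction', 'cell', 'thermodynam', 'engin', 'concept'],
]
# Topics 1 and 2 copied from the printed output for b = 100.0
topics_b_high = [
    ['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work'],
    ['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work'],
]

print(mean_pairwise_jaccard(topics_b_low))   # 0.0: no shared top terms
print(mean_pairwise_jaccard(topics_b_high))  # 1.0: identical top terms
```

Applying the same function to all ten topics of each run would give one overlap score per β value.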

## Exercise 4.10: EPFL's taught subjects

As EPFL has 20 different sections, we choose $k=20$.

In [21]:
k = 20

As we do not want the distribution of topics per document to be close to uniform, we choose small values $a=b=1.01$.

In [22]:
a, b = 1.01, 1.01

In [23]:
print(f'k = {k}, a = {a}, b = {b}\n')
lda_epfl = LDA.train(courses_rdd, docConcentration=a, topicConcentration=b, k=k)
model_virtualization(lda_epfl)

k = 20, a = 1.01, b = 1.01

Topic 1:
['model', 'data', 'statist', 'problem', 'stochast', 'comput', 'analysi', 'linear', 'algorithm', 'theori']
Topic 2:
['model', 'linear', 'process', 'control', 'dynam', 'analysi', 'exercis', 'applic', 'machin', 'basic']
Topic 3:
['structur', 'materi', 'properti', 'magnet', 'physic', 'solut', 'electr', 'electron', 'stabil', 'applic']
Topic 4:
['analysi', 'theori', 'measur', 'fourier', 'problem', 'process', 'space', 'transform', 'model', 'signal']
Topic 5:
['system', 'control', 'program', 'circuit', 'model', 'digit', 'simul', 'commun', 'design', 'implement']
Topic 6:
['research', 'lectur', 'technolog', 'develop', 'present', 'week', 'semest', 'engin', 'network', 'field']
Topic 7:
['optic', 'laser', 'project', 'experi', 'work', 'learn', 'evalu', 'technolog', 'practic', 'light']
Topic 8:
['materi', 'structur', 'mechan', 'properti', 'polym', 'surfac', 'model', 'applic', 'metal', 'assess']
Topic 9:
['cell', 'biolog', 'protein', 'molecular', 'present', 'functi

In [24]:
# Manually add labels to topics
labels = [
    'Project',
    'Statistical Mechanics',
    'Finance',
    'Optics',
    'Biology',
    'Chemistry',
    'Material',
    'Project',
    'Risk Management',
    'Science',
    'Sensor Design',
    'Computer Science',
    'Architecture',
    'Energy',
    'Electronic Engineering',
    'Bio Engineering',
    'Data Science',
    'Physics',
    'Quantum Theory',
    'Project'
]

for i, label in enumerate(labels):
    print(f'Topic {i+1}: {label}')

Topic 1: Project
Topic 2: Statistical Mechanics
Topic 3: Finance
Topic 4: Optics
Topic 5: Biology
Topic 6: Chemistry
Topic 7: Material
Topic 8: Project
Topic 9: Risk Management
Topic 10: Science
Topic 11: Sensor Design
Topic 12: Computer Science
Topic 13: Architecture
Topic 14: Energy
Topic 15: Electronic Engineering
Topic 16: Bio Engineering
Topic 17: Data Science
Topic 18: Physics
Topic 19: Quantum Theory
Topic 20: Project


## Exercise 4.11: Wikipedia structure

Wikipedia covers many subject areas, but since memory is limited we keep $k=20$. As we again want each topic to be characterized by its most representative keywords, we keep small values $a=b=1.01$.

In [25]:
# Load the dataset
wiki_RDD = sc.textFile('/ix/wikipedia-for-schools.txt').map(json.loads)
wikipage_RDD = wiki_RDD.map(lambda p: p['page_id']).distinct()
N = wikipage_RDD.count()

In [26]:
# Map every page_id to a contiguous index in [0, N)
pageID = dict(zip(wikipage_RDD.collect(), range(N)))

# Record all words
words_RDD = wiki_RDD.flatMap(lambda p: p["tokens"]).distinct()
M = words_RDD.count() 

# Record all terms
terms = dict(zip(words_RDD.collect(), range(M)))
vectorized_term = np.vectorize(lambda x: terms[x])
termID = {v: k for k, v in terms.items()}

# Reduce wikipedia RDD with only indexes
red_wiki_RDD = wiki_RDD.map(lambda c: (pageID[c["page_id"]], vectorized_term(c["tokens"])))

In [27]:
def doc_to_vector(doc):
    vector = {}
    for term in doc[1]:
        vector[term] = vector.get(term, 0) + 1
    return (doc[0], Vectors.sparse(M, vector))

# Build the term-document matrix as [document index, sparse count vector] rows
term_doc_matrix = red_wiki_RDD.map(doc_to_vector).map(list)
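
The indexing and counting steps above can be sketched without Spark on a toy example. This is an illustration under stated assumptions: `toy_pages`, `word_to_index`, `index_to_word` and `sparse_counts` are hypothetical names, and the two pages are invented rather than taken from the Wikipedia dataset:

```python
from collections import Counter

# Hypothetical toy pages standing in for the Wikipedia records
toy_pages = [
    {'page_id': 7, 'tokens': ['river', 'lake', 'river']},
    {'page_id': 3, 'tokens': ['lake', 'city']},
]

# Word -> index in order of first appearance, plus the inverse lookup
word_to_index = {}
for page in toy_pages:
    for token in page['tokens']:
        word_to_index.setdefault(token, len(word_to_index))
index_to_word = {i: w for w, i in word_to_index.items()}

def sparse_counts(tokens):
    """Return {term index: count} for one page."""
    return dict(Counter(word_to_index[t] for t in tokens))

for page in toy_pages:
    print(page['page_id'], sparse_counts(page['tokens']))
```

A dictionary of this shape is one of the input forms accepted by `Vectors.sparse`, and the inverse lookup plays the role of `termID` when term indices have to be turned back into words.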

In [28]:
# Dirichlet hyperparameters
k = 20
a = 1.01
b = 1.01

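# Train the LDA model
lda_wiki = LDA.train(term_doc_matrix, k=k, docConcentration=a, topicConcentration=b)

# Print the top terms of each topic; termID maps a term index back to its word
for i, (term_indices, _) in enumerate(lda_wiki.describeTopics(maxTermsPerTopic=10)):
    print(f'Topic {i+1}:')
    print([termID[t] for t in term_indices])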

Topic 1:
['games', 'game', 'players', 'world', 'time', 'olympic', '–', 'cup', 'player', 'events']
Topic 2:
['theory', 'number', '=', 'numbers', 'work', 'set', 'called', 'form', 'written', 'century']
Topic 3:
['city', '·', 'centre', 'century', 'law', 'population', 'system', 'government', 'state', 'years']
Topic 4:
['blood', 'people', 'health', 'cancer', 'medical', 'treatment', 'risk', 'patients', 'high', 'years']
Topic 5:
['eruption', 'years', 'comet', 'lava', 'volcanic', 'volcano', 'india', 'soil', 'large', 'time']
Topic 6:
['island', 'islands', 'european', 'city', 'population', 'country', 'north', 'south', 'ireland', 'east']
Topic 7:
['south', 'lake', 'mi', 'area', 'river', 'city', 'oil', 'water', 'population', 'north']
Topic 8:
['gas', 'lens', 'game', 'time', 'earth', 'lenses', 'water', 'number', 'ds', 'haiku']
Topic 9:
['music', 'instruments', 'painting', 'art', 'made', 'popular', 'instrument', 'set', 'bass', 'style']
Topic 10:
['american', '–', 'calendar', 'january', 'march', 'febr

In [29]:
# Manually add labels to topics
labels = [
    "Olympic Games",
    "Maths",
    "City",
    "Disease",
    "Geography",
    "Europe",
    "Population",
    "Lenses and Light",
    "Art",
    "Calendar",
    "Computers and Software",
    "Earth's Energy",
    "Film",
    "Forms of Government",
    "Rivers and Lakes",
    "History",
    "Months",
    "Wars",
    "Environment",
    "Global Corporations",
]

for i, label in enumerate(labels):
    print(f'Topic {i+1}: {label}')

Topic 1: Olympic Games
Topic 2: Maths
Topic 3: City
Topic 4: Disease
Topic 5: Geography
Topic 6: Europe
Topic 7: Population
Topic 8: Lenses and Light
Topic 9: Art
Topic 10: Calendar
Topic 11: Computers and Software
Topic 12: Earth's Energy
Topic 13: Film
Topic 14: Forms of Government
Topic 15: Rivers and Lakes
Topic 16: History
Topic 17: Months
Topic 18: Wars
Topic 19: Environment
Topic 20: Global Corporations
