# Text 3: Latent Dirichlet allocation
**Internet Analytics - Lab 4**

---

**Group:** *X*

**Names:**

* *Linqi Liu*
* *Yifei Song*
* *Ying Xu Dempster Tay*
* *Yuhang Yan*

---

#### Instructions

*This is a template for part 3 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [None]:
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from utils import load_json
import numpy as np
import json

## Exercise 4.8: Topics extraction

In [4]:
# Load data
courses = load_json('data/courses_processed.txt')
course_ids = [x['courseId'] for x in courses]
terms = np.load('terms_indices.npy', allow_pickle=True).tolist()

print('courses[0]:', courses[0], '\n')
print('course_ids[:5]:', course_ids[:5], '\n')
print('terms[:5]:', terms[:5])

courses[0]: {'courseId': 'MSE-440', 'name': 'Composites technology', 'description': 'latest develop process gener organ composit discuss nanocomposit adapt composit present product develop cost analysi studi market practic team work basic composit materi process composit design composit structur current develop nanocomposit composit biocomposit adapt composit applic drive forc market cost analysi aerospac keyword composit applic nanocomposit biocomposit adapt composit design cost prerequisit requir notion polym recommend polym composit propos suitabl product perform criteria product composit part appli basic equat process mechan properti model composit materi discuss main type composit applic transvers skill work methodolog task gener domain specif IT resourc tool commun effect profession disciplin evalu perform team receiv respond appropri feedback teach cathedra invit speaker group session exercis work project expect activ attend lectur design composit part bibliographi search assess

In [5]:
# Transfer the data to RDD format
courses_rdd = sc.parallelize(courses).map(
    lambda x: [course_ids.index(x['courseId']), Vectors.dense([x['description'].split(" ").count(t) for t in terms])]
)

print('courses_rdd:', courses_rdd)

courses_rdd: PythonRDD[3] at RDD at PythonRDD.scala:54
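
The count vectors above call `list.count` once per vocabulary term for every course. As a possible alternative, the sketch below counts all tokens of a description in a single pass with `collections.Counter` and a term-to-column dictionary. The names `count_vector`, `toy_terms` and `toy_index` are hypothetical and not part of the lab code, and the toy description is made up for illustration.

```python
from collections import Counter


def count_vector(description, term_index):
    """Return a dense list of term counts for one description.

    term_index maps each vocabulary term to its column position.
    """
    counts = Counter(description.split(" "))
    vector = [0.0] * len(term_index)
    for term, n in counts.items():
        col = term_index.get(term)
        if col is not None:  # ignore tokens outside the vocabulary
            vector[col] = float(n)
    return vector


# Hypothetical toy vocabulary and description, for illustration only
toy_terms = ["composit", "polym", "cost"]
toy_index = {t: i for i, t in enumerate(toy_terms)}
toy_vector = count_vector("composit cost composit design", toy_index)
print(toy_vector)  # [2.0, 0.0, 1.0]
```

In the notebook, the same idea could be applied inside the `map` by building the dictionary once from `terms` and wrapping the result in `Vectors.dense`.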


In [18]:
# Train the LDA model
lda_model = LDA.train(courses_rdd, k=10)

In [10]:
# Visualize the model: print the top 10 terms of each topic
def model_virtualization(model):
    lda_topic = model.describeTopics(maxTermsPerTopic=10)
    for i, topics in enumerate(lda_topic):
        print(f'Topic {i+1}:')
        print([terms[topic] for topic in topics[0]])
        
model_virtualization(lda_model)

NameError: name 'lda_model' is not defined

In [25]:
# Manually add labels to topics
labels = [
    'Chemical Processes',
    'Management and Optimization',
    'Materials and Electronics',
    'Optics and Computation',
    'Architecture and Urban Development',
    'Statistical Analysis and Signal Processing',
    'Electronics and Material Properties',
    'Cellular Biology',
    'Project Management and Research Skills',
    'Energy and Thermodynamics'
]

for i, label in enumerate(labels):
    print(f'Topic {i+1}: {label}')

Topic 1: Chemical Processes
Topic 2: Management and Optimization
Topic 3: Materials and Electronics
Topic 4: Optics and Computation
Topic 5: Architecture and Urban Development
Topic 6: Statistical Analysis and Signal Processing
Topic 7: Electronics and Material Properties
Topic 8: Cellular Biology
Topic 9: Project Management and Research Skills
Topic 10: Energy and Thermodynamics


#### Compared with LSI:


## Exercise 4.9: Dirichlet hyperparameters

### Fix k = 10 and β = 1.01, and vary α

In [108]:
def test_with_a(a, b=1.01, k=10):
    print(f'k = {k}, a = {a}, b = {b}\n')
    lda_model = LDA.train(courses_rdd, docConcentration=a, topicConcentration=b, k=k)
    model_virtualization(lda_model)

In [109]:
a = 1.01
test_with_a(a)

k = 10, a = 1.01, b = 1.01

Topic 1:
['analysi', 'program', 'exercis', 'system', 'lectur', 'linear', 'optim', 'model', 'comput', 'learn']
Topic 2:
['present', 'microscopi', 'electron', 'research', 'field', 'scienc', 'semest', 'lectur', 'discuss', 'materi']
Topic 3:
['data', 'network', 'algorithm', 'theori', 'comput', 'model', 'problem', 'lectur', 'probabl', 'exercis']
Topic 4:
['model', 'process', 'imag', 'circuit', 'signal', 'basic', 'statist', 'system', 'digit', 'analysi']
Topic 5:
['biolog', 'report', 'project', 'protein', 'scientif', 'cell', 'experiment', 'experi', 'molecular', 'assess']
Topic 6:
['optic', 'model', 'physic', 'applic', 'mechan', 'sensor', 'magnet', 'devic', 'quantum', 'materi']
Topic 7:
['model', 'risk', 'price', 'stochast', 'polici', 'financi', 'market', 'analysi', 'assess', 'introduct']
Topic 8:
['process', 'chemistri', 'chemic', 'engin', 'mass', 'exercis', 'water', 'product', 'theori', 'assess']
Topic 9:
['project', 'work', 'assess', 'present', 'plan', 'evalu', '

In [110]:
a = 2.0
test_with_a(a)

k = 10, a = 2.0, b = 1.01

Topic 1:
['model', 'architectur', 'problem', 'semest', 'optim', 'work', 'mathemat', 'method', 'numer', 'studi']
Topic 2:
['optic', 'theori', 'electron', 'basic', 'quantum', 'principl', 'microscopi', 'imag', 'applic', 'techniqu']
Topic 3:
['circuit', 'design', 'analysi', 'model', 'space', 'digit', 'process', 'integr', 'reactor', 'assess']
Topic 4:
['model', 'concept', 'flow', 'energi', 'dynam', 'equat', 'process', 'thermodynam', 'mechan', 'structur']
Topic 5:
['project', 'data', 'report', 'comput', 'scientif', 'assess', 'research', 'present', 'program', 'skill']
Topic 6:
['materi', 'cell', 'devic', 'applic', 'organ', 'properti', 'electron', 'magnet', 'structur', 'lectur']
Topic 7:
['learn', 'technolog', 'manag', 'work', 'present', 'develop', 'teach', 'risk', 'assess', 'lectur']
Topic 8:
['chemic', 'chemistri', 'biolog', 'present', 'reaction', 'assess', 'imag', 'process', 'activ', 'keyword']
Topic 9:
['model', 'assess', 'evalu', 'present', 'class', 'materi', 'p

In [111]:
a = 6.0
test_with_a(a)

k = 10, a = 6.0, b = 1.01

Topic 1:
['program', 'learn', 'algorithm', 'teach', 'optim', 'comput', 'data', 'assess', 'system', 'exercis']
Topic 2:
['cell', 'biolog', 'molecular', 'research', 'present', 'protein', 'discuss', 'scientif', 'develop', 'project']
Topic 3:
['model', 'structur', 'analysi', 'mechan', 'energi', 'problem', 'assess', 'concept', 'studi', 'case']
Topic 4:
['project', 'present', 'plan', 'assess', 'group', 'work', 'develop', 'skill', 'semest', 'inform']
Topic 5:
['chemic', 'engin', 'lectur', 'process', 'problem', 'concept', 'space', 'analysi', 'assess', 'exercis']
Topic 6:
['imag', 'present', 'technolog', 'flow', 'develop', 'assess', 'polici', 'lectur', 'innov', 'evalu']
Topic 7:
['electron', 'optic', 'circuit', 'devic', 'laser', 'microscopi', 'organ', 'architectur', 'techniqu', 'imag']
Topic 8:
['data', 'model', 'report', 'analysi', 'statist', 'process', 'stochast', 'time', 'probabl', 'practic']
Topic 9:
['materi', 'energi', 'properti', 'applic', 'magnet', 'power', 'p

In [112]:
a = 10.0
test_with_a(a)

k = 10, a = 10.0, b = 1.01

Topic 1:
['report', 'model', 'problem', 'equat', 'numer', 'theori', 'optim', 'flow', 'project', 'comput']
Topic 2:
['system', 'architectur', 'engin', 'process', 'technolog', 'product', 'lectur', 'materi', '2', 'project']
Topic 3:
['present', 'biolog', 'paper', 'optic', 'assess', 'discuss', 'oral', 'develop', 'scientif', 'evalu']
Topic 4:
['teach', 'learn', 'electron', 'imag', 'data', 'magnet', 'applic', 'basic', 'scienc', 'assess']
Topic 5:
['model', 'control', 'system', 'assess', 'techniqu', 'signal', 'analysi', 'process', 'exercis', 'robot']
Topic 6:
['cell', 'data', 'assess', 'project', 'concept', 'present', 'molecular', 'metal', 'biolog', 'basic']
Topic 7:
['process', 'present', 'engin', 'assess', 'manag', 'model', 'chemistri', 'exercis', 'environment', 'work']
Topic 8:
['model', 'optic', 'stochast', 'introduct', 'deriv', 'assess', 'lectur', 'analysi', 'exam', 'protein']
Topic 9:
['structur', 'mechan', 'materi', 'process', 'mass', 'energi', 'applic', 'pr

In [113]:
a = 20.0
test_with_a(a)

k = 10, a = 20.0, b = 1.01

Topic 1:
['process', 'model', 'basic', 'assess', 'concept', 'chemic', 'exercis', 'analysi', 'lectur', 'activ']
Topic 2:
['model', 'system', 'assess', 'analysi', 'lectur', 'robot', 'concept', 'exercis', 'control', 'basic']
Topic 3:
['assess', 'data', 'model', 'analysi', 'project', 'process', 'present', 'evalu', 'develop', 'work']
Topic 4:
['energi', 'assess', 'magnet', 'model', 'process', 'present', 'materi', 'project', 'concept', 'basic']
Topic 5:
['assess', 'model', 'project', 'process', 'present', 'analysi', 'data', 'applic', 'report', 'basic']
Topic 6:
['engin', 'structur', 'model', 'lectur', 'present', 'assess', 'process', 'week', 'concept', 'basic']
Topic 7:
['optic', 'materi', 'devic', 'model', 'assess', 'applic', 'basic', 'lectur', 'concept', 'electron']
Topic 8:
['model', 'risk', 'stochast', 'assess', 'probabl', 'analysi', 'theori', 'linear', 'optim', 'price']
Topic 9:
['architectur', 'project', 'work', 'assess', 'present', 'develop', 'research', 'se

In [114]:
a = 100.0
test_with_a(a)

k = 10, a = 100.0, b = 1.01

Topic 1:
['model', 'assess', 'process', 'present', 'project', 'lectur', 'analysi', 'concept', 'basic', 'teach']
Topic 2:
['model', 'assess', 'process', 'lectur', 'present', 'concept', 'basic', 'project', 'exercis', 'analysi']
Topic 3:
['model', 'assess', 'lectur', 'process', 'analysi', 'project', 'present', 'basic', 'work', 'concept']
Topic 4:
['model', 'assess', 'process', 'present', 'analysi', 'lectur', 'basic', 'concept', 'project', 'teach']
Topic 5:
['model', 'assess', 'analysi', 'present', 'basic', 'concept', 'lectur', 'work', 'process', 'project']
Topic 6:
['model', 'assess', 'present', 'project', 'process', 'analysi', 'lectur', 'basic', 'concept', 'work']
Topic 7:
['model', 'assess', 'present', 'lectur', 'project', 'process', 'analysi', 'basic', 'exercis', 'work']
Topic 8:
['model', 'assess', 'present', 'lectur', 'process', 'project', 'analysi', 'basic', 'work', 'teach']
Topic 9:
['model', 'assess', 'present', 'basic', 'analysi', 'lectur', 'project',

A high α pushes each document's topic mixture towards uniform, so every topic is represented roughly equally in every document. With α=100, the topics are dominated by generic terms ('model', 'assess', 'lectur') that occur in almost all course descriptions. With α=20, a few domain-specific terms persist (for example 'optic', 'robot' and 'stochast'), but the overall effect is similar.

For α in {1.01, 2, 6}, we see no notable difference between the topics: the top terms stay domain-specific. From α=10 onwards, however, domain-specific terms begin to diminish and relevance scores decrease, as the document-topic distributions approach uniform.
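
To make the effect of α concrete, the following minimal sketch samples document-topic mixtures directly from a symmetric Dirichlet prior with NumPy and reports the average share of the dominant topic. It only illustrates the prior, not the posterior fitted by `LDA.train`; the helper name `mean_largest_share`, the seed and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10  # number of topics, as in the experiments above


def mean_largest_share(alpha, n_docs=2000):
    """Average weight of the dominant topic over sampled documents."""
    # One row per sampled document, one column per topic
    theta = rng.dirichlet([alpha] * k, size=n_docs)
    return theta.max(axis=1).mean()


shares = {alpha: mean_largest_share(alpha) for alpha in (1.01, 6.0, 100.0)}
for alpha, share in shares.items():
    print(f"alpha = {alpha}: mean largest topic share = {share:.3f}")
```

A uniform mixture over 10 topics gives every topic a share of 0.1, so the closer the printed value is to 0.1, the less any single topic stands out in a document.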

### Fix k = 10 and α = 6, and vary β

In [115]:
def test_with_b(b, a=6.0, k=10):
    print(f'k = {k}, a = {a}, b = {b}\n')
    lda_model = LDA.train(courses_rdd, docConcentration=a, topicConcentration=b, k=k)
    model_virtualization(lda_model)

In [116]:
b = 1.01
test_with_b(b)

k = 10, a = 6.0, b = 1.01

Topic 1:
['learn', 'assess', 'teach', 'evalu', 'work', 'structur', 'inform', 'group', 'innov', 'week']
Topic 2:
['imag', 'process', 'architectur', 'magnet', 'digit', 'project', 'model', 'quantum', 'visual', 'work']
Topic 3:
['assess', 'energi', 'present', 'properti', 'physic', 'materi', 'network', 'metal', 'electr', 'teach']
Topic 4:
['cell', 'molecular', 'energi', 'electron', 'materi', 'biolog', 'mechan', 'discuss', 'spectroscopi', 'principl']
Topic 5:
['present', 'biolog', 'develop', 'chemic', 'assess', 'activ', 'manag', 'chemistri', 'process', 'work']
Topic 6:
['model', 'theori', 'probabl', 'optim', 'stochast', 'market', 'time', 'deriv', 'risk', 'introduct']
Topic 7:
['problem', 'numer', 'equat', 'analysi', 'solv', 'simul', 'function', 'comput', 'flow', 'exercis']
Topic 8:
['electron', 'applic', 'materi', 'mechan', 'introduct', 'techniqu', 'control', 'model', 'process', 'basic']
Topic 9:
['project', 'report', 'optic', 'research', 'plan', 'scientif', 'evalu

In [117]:
b = 2.0
test_with_b(b)

k = 10, a = 6.0, b = 2.0

Topic 1:
['model', 'structur', 'assess', 'data', 'week', 'problem', 'analysi', 'project', 'lectur', 'theori']
Topic 2:
['electron', 'optic', 'imag', 'microscopi', 'assess', 'lectur', 'principl', 'biolog', 'protein', 'model']
Topic 3:
['model', 'theori', 'basic', 'process', 'stochast', 'assess', 'probabl', 'linear', 'optim', 'statist']
Topic 4:
['optic', 'model', 'assess', 'basic', 'lectur', 'present', 'concept', 'exercis', 'applic', 'activ']
Topic 5:
['model', 'basic', 'exercis', 'reaction', 'comput', 'process', 'assess', 'concept', 'theori', 'analysi']
Topic 6:
['model', 'process', 'equat', 'flow', 'exercis', 'engin', 'concept', 'assess', 'basic', 'numer']
Topic 7:
['materi', 'applic', 'properti', 'mechan', 'assess', 'physic', 'present', 'structur', 'lectur', 'devic']
Topic 8:
['technolog', 'energi', 'project', 'present', 'evalu', 'report', 'assess', 'polici', 'industri', 'plan']
Topic 9:
['cell', 'project', 'develop', 'assess', 'present', 'learn', 'work', 'b

In [118]:
b = 5.0
test_with_b(b)

k = 10, a = 6.0, b = 5.0

Topic 1:
['optic', 'model', 'assess', 'lectur', 'analysi', 'present', 'basic', 'concept', 'process', 'activ']
Topic 2:
['assess', 'present', 'model', 'project', 'process', 'work', 'lectur', 'concept', 'evalu', 'activ']
Topic 3:
['model', 'assess', 'data', 'analysi', 'project', 'present', 'lectur', 'process', 'basic', 'work']
Topic 4:
['model', 'assess', 'project', 'process', 'present', 'analysi', 'work', 'data', 'lectur', 'concept']
Topic 5:
['model', 'assess', 'present', 'analysi', 'lectur', 'project', 'concept', 'basic', 'work', 'process']
Topic 6:
['model', 'assess', 'analysi', 'process', 'basic', 'lectur', 'concept', 'present', 'exercis', 'system']
Topic 7:
['model', 'basic', 'assess', 'process', 'applic', 'materi', 'lectur', 'concept', 'prerequisit', 'analysi']
Topic 8:
['model', 'assess', 'basic', 'lectur', 'process', 'materi', 'concept', 'applic', 'exercis', 'theori']
Topic 9:
['assess', 'present', 'project', 'process', 'lectur', 'model', 'work', 'evalu

In [119]:
b = 10.0
test_with_b(b)

k = 10, a = 6.0, b = 10.0

Topic 1:
['model', 'assess', 'present', 'analysi', 'process', 'lectur', 'project', 'concept', 'basic', 'work']
Topic 2:
['model', 'assess', 'process', 'present', 'lectur', 'basic', 'analysi', 'concept', 'project', 'work']
Topic 3:
['model', 'assess', 'process', 'lectur', 'basic', 'present', 'analysi', 'concept', 'project', 'work']
Topic 4:
['model', 'assess', 'present', 'process', 'lectur', 'basic', 'project', 'analysi', 'concept', 'work']
Topic 5:
['model', 'assess', 'process', 'present', 'lectur', 'analysi', 'basic', 'project', 'concept', 'work']
Topic 6:
['model', 'assess', 'process', 'present', 'lectur', 'analysi', 'basic', 'project', 'concept', 'work']
Topic 7:
['model', 'assess', 'project', 'present', 'analysi', 'lectur', 'process', 'work', 'data', 'concept']
Topic 8:
['model', 'assess', 'present', 'process', 'lectur', 'basic', 'project', 'analysi', 'concept', 'work']
Topic 9:
['model', 'assess', 'process', 'lectur', 'present', 'basic', 'analysi', 'proj

In [120]:
b = 20.0
test_with_b(b)

k = 10, a = 6.0, b = 20.0

Topic 1:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 2:
['model', 'assess', 'process', 'present', 'lectur', 'basic', 'analysi', 'project', 'concept', 'work']
Topic 3:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'basic', 'project', 'concept', 'work']
Topic 4:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'basic', 'project', 'concept', 'work']
Topic 5:
['model', 'assess', 'present', 'process', 'lectur', 'project', 'analysi', 'basic', 'concept', 'work']
Topic 6:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 7:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 8:
['model', 'assess', 'present', 'process', 'project', 'lectur', 'analysi', 'basic', 'concept', 'work']
Topic 9:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'basic', 'pro

In [121]:
b = 100.0
test_with_b(b)

k = 10, a = 6.0, b = 100.0

Topic 1:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 2:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 3:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 4:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 5:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 6:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 7:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 8:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', 'basic', 'concept', 'work']
Topic 9:
['model', 'assess', 'present', 'process', 'lectur', 'analysi', 'project', '

Similarly, β controls how similar the topics are to each other. As β increases, the topics become more alike; as β decreases, they become more distinct and diverse. This is because β parameterises the Dirichlet prior over each topic's distribution over terms: larger values of β smooth these distributions towards uniform over the vocabulary, so the topics end up sharing the same frequent terms. In our runs, the topics are already hard to tell apart at β=5, and from β=10 onwards they are nearly identical.
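
One way to quantify how alike two topics are is the Jaccard similarity of their top-term lists. The sketch below applies it to Topics 1 and 2 as printed above for β=1.01 and β=100 (with α=6). The function name `jaccard` and the choice of this measure are illustrative assumptions, not part of the lab template.

```python
def jaccard(terms_a, terms_b):
    """Jaccard similarity between two lists of top terms."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b)


# Topics 1 and 2 copied from the outputs above (a = 6.0)
low_beta = [
    ['learn', 'assess', 'teach', 'evalu', 'work',
     'structur', 'inform', 'group', 'innov', 'week'],
    ['imag', 'process', 'architectur', 'magnet', 'digit',
     'project', 'model', 'quantum', 'visual', 'work'],
]
high_beta = [
    ['model', 'assess', 'present', 'process', 'lectur',
     'analysi', 'project', 'basic', 'concept', 'work'],
    ['model', 'assess', 'present', 'process', 'lectur',
     'analysi', 'project', 'basic', 'concept', 'work'],
]

sim_low = jaccard(*low_beta)    # b = 1.01: only 'work' is shared
sim_high = jaccard(*high_beta)  # b = 100.0: identical top terms
print(f'b = 1.01:  {sim_low:.3f}')
print(f'b = 100.0: {sim_high:.3f}')
```

Averaging this score over all pairs of topics would give a single number per value of β, which could be compared across runs.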

## Exercise 4.10: EPFL's taught subjects

As EPFL has 20 different sections, we choose $k=20$.

In [129]:
k = 20

As we want the per-document topic distribution to stay far from uniform, we choose small values, $a=b=1.01$.

In [135]:
a, b = 1.01, 1.01

In [136]:
print(f'k = {k}, a = {a}, b = {b}\n')
lda_epfl = LDA.train(courses_rdd, docConcentration=a, topicConcentration=b, k=k)
model_virtualization(lda_epfl)

k = 20, a = 1.01, b = 1.01

Topic 1:
['project', 'assess', 'plan', 'skill', 'present', 'evalu', 'work', 'group', 'research', 'report']
Topic 2:
['control', 'theori', 'model', 'probabl', 'time', '2', 'applic', 'hour', 'exam', 'studi']
Topic 3:
['model', 'price', 'stochast', 'fourier', 'deriv', 'financi', 'financ', 'risk', 'market', 'introduct']
Topic 4:
['optic', 'laser', 'imag', 'light', 'principl', 'basic', 'measur', 'process', 'microscopi', 'applic']
Topic 5:
['present', 'paper', 'biolog', 'protein', 'discuss', 'research', 'field', 'lectur', 'assess', 'report']
Topic 6:
['reaction', 'mechan', 'molecular', 'structur', 'organ', 'properti', 'solut', 'basic', 'assess', 'state']
Topic 7:
['materi', 'chemistri', 'magnet', 'applic', 'metal', 'properti', 'field', 'chemic', 'physic', 'reaction']
Topic 8:
['report', 'project', 'evalu', 'semest', 'activ', 'scientif', 'student', 'supervis', 'work', 'laboratori']
Topic 9:
['risk', 'assess', 'present', 'develop', 'busi', 'evalu', 'case', 'manag', 

In [None]:
# Manually add labels to topics
labels = [
    'Project',
    'Statistical Mechanics',
    'Finance',
    'Optics',
    'Biology',
    'Chemistry',
    'Material',
    'Project',
    'Risk Management',
    'Science',
    'Sensor Design',
    'Computer Science',
    'Architecture',
    'Energy',
    'Electronic Engineering',
    'Bio Engineering',
    'Data Science',
    'Physics',
    'Quantum Theory',
    'Project'
]

for i, label in enumerate(labels):
    print(f'Topic {i+1}: {label}')

## Exercise 4.11: Wikipedia structure

Wikipedia covers many subject areas, which calls for a large number of topics, but memory is limited, so we chose $k=30$. To obtain more representative keywords for each topic, we again chose small values, $a=b=1.01$.

In [7]:
# Load the dataset
wiki_RDD = sc.textFile('/ix/wikipedia-for-schools.txt').map(json.loads)
wikipage_RDD = wiki_RDD.map(lambda p: p['page_id']).distinct()
N = wikipage_RDD.count()

In [17]:
# Map each page id to a consecutive row index
pageID = dict(zip(wikipage_RDD.collect(), range(N)))

# Record all words
words_RDD = wiki_RDD.flatMap(lambda p: p["tokens"]).distinct()
M = words_RDD.count() 

# Record all terms
terms = dict(zip(words_RDD.collect(), range(M)))
vectorized_term = np.vectorize(lambda x: terms[x])
termID = {v: k for k, v in terms.items()}

# Reduce wikipedia RDD with only indexes
red_wiki_RDD = wiki_RDD.map(lambda c: (pageID[c["page_id"]], vectorized_term(c["tokens"])))

In [18]:
def doc_to_vector(doc):
    vector = {}
    for term in doc[1]:
        vector[term] = vector.get(term, 0) + 1
    return (doc[0], Vectors.sparse(M, vector))

# Build the term-document matrix
term_doc_matrix = red_wiki_RDD.map(lambda x: doc_to_vector(x)).map(list)

In [31]:
print("""Topic 1:
['games', 'game', 'players', 'world', 'time', 'olympic', '–', 'cup', 'player', 'events']
Topic 2:
['theory', 'number', '=', 'numbers', 'work', 'set', 'called', 'form', 'written', 'century']
Topic 3:
['city', '·', 'centre', 'century', 'law', 'population', 'system', 'government', 'state', 'years']
Topic 4:
['blood', 'people', 'health', 'cancer', 'medical', 'treatment', 'risk', 'patients', 'high', 'years']
Topic 5:
['eruption', 'years', 'comet', 'lava', 'volcanic', 'volcano', 'india', 'soil', 'large', 'time']
Topic 6:
['island', 'islands', 'european', 'city', 'population', 'country', 'north', 'south', 'ireland', 'east']
Topic 7:
['south', 'lake', 'mi', 'area', 'river', 'city', 'oil', 'water', 'population', 'north']
Topic 8:
['gas', 'lens', 'game', 'time', 'earth', 'lenses', 'water', 'number', 'ds', 'haiku']
Topic 9:
['music', 'instruments', 'painting', 'art', 'made', 'popular', 'instrument', 'set', 'bass', 'style']
Topic 10:
['american', '–', 'calendar', 'january', 'march', 'february', 'july', 'april', 'june', 'december']
Topic 11:
['computer', 'software', 'oil', 'language', 'acid', 'system', 'internet', 'apple', '^', 'version']
Topic 12:
['energy', 'water', 'mass', 'light', 'earth', 'surface', 'chemical', 'form', 'temperature', 'solar']
Topic 13:
['film', 'john', 'england', 'series', 'time', 'london', '–', 'years', 'house', 'george']
Topic 14:
['government', 'state', 'rights', 'states', 'china', 'united', 'president', 'national', 'city', 'republic']
Topic 15:
['river', 'sea', 'lake', 'area', 'north', 'world', 'large', 'south', 'years', 'al']
Topic 16:
['bc', 'gods', 'egyptian', 'horse', 'god', 'egypt', 'mythology', 'modern', 'greek', 'temple']
Topic 17:
['war', 'british', 'american', 'united', 'german', 'september', 'august', 'states', 'july', 'army']
Topic 18:
['ice', 'space', 'war', 'nuclear', 'soviet', 'forces', 'weapons', 'time', 'force', 'mission']
Topic 19:
['system', 'systems', 'energy', 'information', 'distribution', 'data', 'number', 'dna', 'engine', 'process']
Topic 20:
['company', '$', 'war', 'market', 'states', 'world', 'united', 'government', 'system', 'japanese']""")

Topic 1:
['games', 'game', 'players', 'world', 'time', 'olympic', '–', 'cup', 'player', 'events']
Topic 2:
['theory', 'number', '=', 'numbers', 'work', 'set', 'called', 'form', 'written', 'century']
Topic 3:
['city', '·', 'centre', 'century', 'law', 'population', 'system', 'government', 'state', 'years']
Topic 4:
['blood', 'people', 'health', 'cancer', 'medical', 'treatment', 'risk', 'patients', 'high', 'years']
Topic 5:
['eruption', 'years', 'comet', 'lava', 'volcanic', 'volcano', 'india', 'soil', 'large', 'time']
Topic 6:
['island', 'islands', 'european', 'city', 'population', 'country', 'north', 'south', 'ireland', 'east']
Topic 7:
['south', 'lake', 'mi', 'area', 'river', 'city', 'oil', 'water', 'population', 'north']
Topic 8:
['gas', 'lens', 'game', 'time', 'earth', 'lenses', 'water', 'number', 'ds', 'haiku']
Topic 9:
['music', 'instruments', 'painting', 'art', 'made', 'popular', 'instrument', 'set', 'bass', 'style']
Topic 10:
['american', '–', 'calendar', 'january', 'march', 'febr

In [30]:
# Manually add labels to topics
labels = [
    "Olympic Games",
    "Maths",
    "City",
    "Disease",
    "Geography",
    "Europe",
    "Population",
    "Lenses and Light",
    "Art",
    "Calendar",
    "Computers and Software",
    "Earth's Energy",
    "Film",
    "Forms of Government",
    "Rivers and Lakes",
    "History",
    "Months",
    "Wars",
    "Environment",
    "Global Corporations",
]

for i, label in enumerate(labels):
    print(f'Topic {i+1}: {label}')
