
# Topic Modeling with LDA (Latent Dirichlet Allocation) and Visualisation

This notebook applies classical topic modeling using **LDA** from `gensim` to the `full_text` column of one of the enhanced datasets. The same steps work on any text column, and longer texts generally give the model more to work with. The notebook also includes visualisations for exploring the topics and their distributions.

We will:
- Clean and preprocess the text
- Train an LDA model (5 topics)
- Visualise results with word clouds, bar charts, and pyLDAvis

The aim of the notebook is to give an overview of what the collection contains.



## 📦 Import Libraries

We use:
- `pandas` to load the dataset
- `nltk` and `spacy` for text preprocessing
- `gensim` for topic modeling
- `pyLDAvis` and `matplotlib` for visualizations


In [2]:
!pip install spacy



In [4]:
!pip install pyldavis



In [6]:
!pip install wordcloud

Collecting wordcloud
  Downloading wordcloud-1.9.4-cp39-cp39-macosx_10_9_x86_64.whl.metadata (3.4 kB)
Collecting numpy>=1.6.1 (from wordcloud)
  Downloading numpy-1.26.4-cp39-cp39-macosx_10_9_x86_64.whl.metadata (61 kB)
Downloading wordcloud-1.9.4-cp39-cp39-macosx_10_9_x86_64.whl (172 kB)
Downloading numpy-1.26.4-cp39-cp39-macosx_10_9_x86_64.whl (20.6 MB)
Installing collected packages: numpy, wordcloud
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently t

In [7]:

# Standard library
import string
import warnings
from collections import defaultdict

# Data loading
import pandas as pd

# Text preprocessing
import nltk
import spacy
from nltk.corpus import stopwords

# Topic modeling
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel

# Visualisation
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")


## 🔧 Download Resources

In [8]:

nltk.download('stopwords')
!python -m spacy download en_core_web_sm


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/andreakocsis/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')



## 📄 Load Dataset

We load the enhanced dataset, which contains the parsed information, and extract the `full_text` column, dropping rows where it is missing.


In [9]:

df = pd.read_csv("data/AoT5_enhanced.csv")
texts = df['full_text'].dropna().astype(str).tolist()
print(f"Loaded {len(texts)} documents.")


Loaded 1057 documents.



## 🧹 Preprocess Text

We keep only alphabetic tokens (which drops punctuation and numbers), remove stopwords, and reduce each remaining token to its lowercased lemma.


In [10]:

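stop_words = set(stopwords.words('english'))
# The parser and NER components are not needed for lemmatization, so disable them for speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(texts):
    """Return one list of lowercased lemmas per input document."""
    clean_texts = []
    for doc in nlp.pipe(texts, batch_size=500):
        # Keep alphabetic tokens that are not stopwords, then lowercase their lemma
        tokens = [token.lemma_.lower() for token in doc if token.is_alpha and token.text.lower() not in stop_words]
        clean_texts.append(tokens)
    return clean_texts

processed_texts = preprocess(texts)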



## 📚 Create Dictionary and Corpus for Gensim

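The dictionary maps each unique token to an integer ID, and the corpus stores every document as a bag of words: a list of `(token_id, count)` pairs. These are the input formats required by `gensim`'s LDA model.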


In [11]:

dictionary = corpora.Dictionary(processed_texts)
corpus = [dictionary.doc2bow(text) for text in processed_texts]
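
To make the bag-of-words format concrete, here is a minimal pure-Python sketch of what a dictionary and a corpus contain. It is a toy stand-in, not `gensim`'s implementation: the helper names `build_vocab` and `to_bow` and the two example documents are made up for illustration, and `gensim` may assign IDs in a different order.

```python
from collections import Counter

def build_vocab(tokenised_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenised_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Two made-up documents, already tokenised
toy_docs = [["vaccine", "study", "vaccine"], ["health", "study"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_corpus)
```

Word order is discarded in this representation; only how often each token occurs in each document is kept.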



## 🤖 Train LDA Model (5 Topics)

We use `gensim`'s LDA implementation to discover 5 latent topics in the text.
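
For reference, the standard LDA generative model that this step fits can be written as below. The notation is the usual textbook one and does not come from this notebook: $K$ is the number of topics (5 here), $D$ the number of documents, $\alpha$ corresponds to the `alpha` argument and $\eta$ to `gensim`'s `eta` argument.

$$
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Here $\phi_k$ is the word distribution of topic $k$, $\theta_d$ the topic mixture of document $d$, $z_{d,n}$ the topic assigned to the $n$-th word of document $d$, and $w_{d,n}$ the observed word. Training estimates $\phi$ and $\theta$ from the bag-of-words corpus.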


In [12]:

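lda_model = LdaModel(corpus=corpus,
                     id2word=dictionary,
                     num_topics=5,         # number of latent topics to fit
                     random_state=42,      # fixed seed for reproducible topics
                     update_every=1,       # update the model after every chunk
                     chunksize=100,        # documents per training chunk
                     passes=10,            # full passes over the corpus
                     alpha='auto',         # learn an asymmetric document-topic prior from the data
                     per_word_topics=True) # also compute the most likely topics for each word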



## 🔍 View Topics and Keywords


In [13]:

topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)


(0, '0.035*"learn" + 0.023*"r" + 0.022*"n" + 0.014*"cookie" + 0.009*"u" + 0.008*"egg" + 0.008*"void" + 0.008*"c" + 0.008*"skin" + 0.008*"register"')
(1, '0.014*"vaccine" + 0.008*"study" + 0.007*"covid" + 0.006*"report" + 0.006*"death" + 0.006*"base" + 0.006*"datum" + 0.006*"use" + 0.005*"cancer" + 0.005*"risk"')
(2, '0.013*"health" + 0.009*"people" + 0.007*"help" + 0.007*"use" + 0.007*"make" + 0.006*"support" + 0.006*"long" + 0.006*"need" + 0.006*"woman" + 0.006*"work"')
(3, '0.015*"people" + 0.013*"service" + 0.012*"gender" + 0.012*"care" + 0.012*"healthcare" + 0.010*"disability" + 0.009*"nhs" + 0.008*"support" + 0.008*"health" + 0.006*"england"')
(4, '0.009*"say" + 0.007*"go" + 0.007*"back" + 0.006*"pain" + 0.005*"police" + 0.005*"like" + 0.005*"sleep" + 0.005*"one" + 0.005*"year" + 0.004*"everything"')
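
Topic 0 above is dominated by single-letter tokens (`r`, `n`, `u`, `c`), which pass the `is_alpha` check but carry little meaning. One possible remedy, sketched here as an assumption rather than something this notebook does, is to drop very short tokens before building the dictionary. The helper name `drop_short_tokens` and the threshold of 3 characters are illustrative choices.

```python
def drop_short_tokens(tokenised_docs, min_len=3):
    """Keep only tokens with at least `min_len` characters in each document."""
    return [[tok for tok in doc if len(tok) >= min_len] for doc in tokenised_docs]

# Made-up documents containing some of the noisy tokens seen in topic 0
sample = [["learn", "r", "n", "cookie", "u"], ["nhs", "c", "care"]]
filtered = drop_short_tokens(sample)
print(filtered)
```

Applying such a filter to `processed_texts` would mean rebuilding the dictionary and corpus and retraining the model, so the topics printed above would change.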



## 📊 Interactive pyLDAvis Visualization

The interactive view shows how far apart the topics are, how prevalent each topic is in the corpus, and which terms are most relevant to each topic.
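
Term relevance in pyLDAvis follows the measure proposed by Sievert and Shirley (2014): a weighted mix of a term's log probability within a topic and its log lift over the term's overall probability, controlled by the slider value λ. The sketch below computes it for two hypothetical terms; the function name and all probabilities are made up for illustration.

```python
import math

def relevance(p_word_given_topic, p_word, lam=0.6):
    """Relevance = lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    log_prob = math.log(p_word_given_topic)
    log_lift = math.log(p_word_given_topic / p_word)
    return lam * log_prob + (1 - lam) * log_lift

# A term that is frequent in the topic but equally frequent everywhere else
common = relevance(p_word_given_topic=0.02, p_word=0.02, lam=0.6)
# A rarer term that is ten times more likely in this topic than overall
specific = relevance(p_word_given_topic=0.01, p_word=0.001, lam=0.6)

print(f"common term:   {common:.3f}")
print(f"specific term: {specific:.3f}")
```

With λ = 1 the ranking depends only on the within-topic probability, so the common term ranks first. Lower values of λ favour terms that are distinctive for the topic.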


In [14]:

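import pyLDAvis
pyLDAvis.enable_notebook()
# In this environment, the call with the default n_jobs=-1 failed with:
#   BrokenProcessPool: A task has failed to un-serialize. Please ensure that
#   the arguments of the function are all picklable.
# n_jobs=1 keeps the preparation in a single process, a common workaround for this error.
pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary, n_jobs=1)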


## ☁️ Word Clouds for Each Topic

These visualize the most relevant words per topic using `WordCloud`.


In [None]:

# Loop over however many topics the model was trained with
for topic_id in range(lda_model.num_topics):
    plt.figure(figsize=(10, 5))
    plt.title(f"Word Cloud for Topic #{topic_id}")
    # show_topic returns (word, probability) pairs; use the top 30 as frequencies
    word_freqs = dict(lda_model.show_topic(topic_id, 30))
    wc = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freqs)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()



## 📈 Topic Frequency Bar Chart

This chart shows how many documents are most strongly associated with each topic.


In [None]:

topic_counts = defaultdict(int)
for doc in corpus:
    # The dominant topic is the one with the highest probability for this document
    dominant_topic = max(lda_model.get_document_topics(doc), key=lambda x: x[1])[0]
    topic_counts[dominant_topic] += 1

# List every topic in a fixed order, including topics that are never dominant
topic_ids = list(range(lda_model.num_topics))
counts = [topic_counts[topic_id] for topic_id in topic_ids]

plt.figure(figsize=(8, 5))
plt.bar(topic_ids, counts, color='skyblue')
plt.xlabel("Topic")
plt.ylabel("Number of Documents")
plt.title("Dominant Topic Distribution")
plt.xticks(topic_ids)
plt.show()
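
Counting only the dominant topic ignores the fact that LDA assigns each document a mixture of topics. As a possible complement, the sketch below compares dominant-topic counts with summed topic probabilities on toy data. The input mimics the `(topic_id, probability)` pairs returned by `get_document_topics`; the helper names and all numbers are made up for illustration.

```python
from collections import Counter, defaultdict

def dominant_counts(doc_topics):
    """Count, per topic, the documents for which it has the highest probability."""
    winners = (max(dist, key=lambda pair: pair[1])[0] for dist in doc_topics)
    return dict(Counter(winners))

def expected_counts(doc_topics):
    """Sum each topic's probability over all documents (a 'soft' document count)."""
    totals = defaultdict(float)
    for dist in doc_topics:
        for topic_id, prob in dist:
            totals[topic_id] += prob
    return dict(totals)

# Three made-up documents with their topic mixtures
toy_doc_topics = [
    [(0, 0.60), (1, 0.40)],
    [(0, 0.55), (2, 0.45)],
    [(1, 0.90), (2, 0.10)],
]

hard = dominant_counts(toy_doc_topics)
soft = expected_counts(toy_doc_topics)
print(hard)
print({topic_id: round(total, 2) for topic_id, total in soft.items()})
```

In this toy data topic 0 is dominant in two documents, yet topic 1 carries more total probability mass, which is the kind of difference the dominant-topic chart cannot show.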
