**Topic Modeling on Protests Dataset using BERTopic**

This notebook applies BERTopic to the protest notes in the ACLED dataset for Iran. The goal is to identify coherent, distinct topics underlying the protest narratives and to evaluate the model with coherence and topic diversity metrics.
We compare three embedding models (`all-mpnet-base-v2`, `paraphrase-MiniLM-L6-v2`, and `all-MiniLM-L6-v2`) on coherence and diversity, then visualize the `all-mpnet-base-v2` results.

In [None]:
#%% Imports and Environment Setup
## Loading all the packages

import os  
import re  
import random 
from collections import Counter  

import numpy as np  
import pandas as pd 
import nltk 
from nltk.corpus import stopwords  

from sentence_transformers import SentenceTransformer

from umap import UMAP
from hdbscan import HDBSCAN

from bertopic import BERTopic

from sklearn.feature_extraction.text import CountVectorizer

from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.utils import simple_preprocess
import matplotlib.pyplot as plt  # plotting library

# Prevents parallel tokenizer warnings when using Hugging Face tokenizers
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [None]:
# Read protest notes from local CSV file
data_path = "/home/ubuntu/Capstone_Files/data/ACELD_Iran.csv"
iran_df = pd.read_csv(data_path, sep=';')

## Printing the first five rows of the dataset
print(iran_df.head())

In [None]:
def clean_notes(text):
    """Strip out explicit dates (e.g., '12 March 2020') and
    remove words like 'protest', 'rally', etc., to reduce noise."""
    if pd.isna(text):
        return ""
    # Remove date patterns
    text = re.sub(r'\b\d{1,2}\s+\w+\s+\d{4}\b', '', text)
    # Remove protest-related terms
    text = re.sub(
        r'\b(protest(?:ed|ing)?|rally|demonstrat(?:ed|ing)?|march|strike|held)\b',
        '',
        text,
        flags=re.IGNORECASE
    )
    return text

iran_df['clean_notes'] = iran_df['notes'].apply(clean_notes)
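
To make the effect of the two patterns in `clean_notes` concrete, here is a small standalone check on a made-up sentence. The patterns are copied from the function above; the sample text and the helper name `strip_noise` are illustrative assumptions, not part of the dataset or the notebook. It suggests that only the listed word forms are removed, so forms such as "protesters" or "demonstration" pass through unchanged.

```python
import re

# Patterns copied from clean_notes above
DATE_PATTERN = r'\b\d{1,2}\s+\w+\s+\d{4}\b'
PROTEST_PATTERN = r'\b(protest(?:ed|ing)?|rally|demonstrat(?:ed|ing)?|march|strike|held)\b'

def strip_noise(text):
    """Hypothetical helper: apply the date pattern, then the protest-term pattern."""
    text = re.sub(DATE_PATTERN, '', text)
    return re.sub(PROTEST_PATTERN, '', text, flags=re.IGNORECASE)

# Made-up sentence for illustration only
sample = "On 12 March 2020, workers protested in Tehran; protesters joined a demonstration."
cleaned = strip_noise(sample)
print(cleaned)
```

If the longer forms should also be dropped, the pattern would need extra alternatives; whether that helps the topics is a modeling decision left open here.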

In [None]:
nltk.download('stopwords')  # Downloads the stopwords corpus if not already present

# Define custom stopwords: several month names and other common, uninformative words
custom_stopwords = {"october", "january", "february", "may", "november", "december", "april", "march", "front", "plan", "outside", "building"}

# Combine NLTK's English stopwords with the custom list
stop_words = set(stopwords.words('english')).union(custom_stopwords)

In [None]:
def preprocess(text):
    """
    Tokenize and preprocess input text.
    Input:
        text (str): raw text string to process
    Output:
        List[str]: list of cleaned tokens
    """
    # Lowercase, strip accents, and tokenize into alphabetic words (punctuation and digits are dropped)
    tokens = simple_preprocess(text, deacc=True)
    # Filter tokens: remove stopwords and tokens shorter than three characters
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]
    return tokens

# Apply preprocessing to each note and obtain list of tokens
processed_texts = iran_df['clean_notes'].fillna("").apply(preprocess)

# Reconstruct documents by joining tokens for BERTopic input
docs = processed_texts.apply(lambda tokens: " ".join(tokens))


## UMAP  
**Purpose:** Uniform Manifold Approximation and Projection (UMAP) reduces high‑dimensional embeddings to a lower-dimensional space while preserving both local and global structure.  
- **`n_neighbors`**: The number of nearest neighbor points used to estimate the manifold structure (balances local vs. global structure).  
- **`n_components`**: The dimensionality of the target space (e.g. 5 for 5‑dimensional output).  
- **`min_dist`**: The minimum distance between points in the low‑dimensional embedding (controls how tightly UMAP packs points).  
- **`metric`**: The distance metric used to compute point‑to‑point similarity in the original space (e.g. “cosine” for cosine distance).  
- **`random_state`**: Seed for the random number generator to ensure reproducible embeddings.  
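
As a small illustration of the `metric` choice, the sketch below compares cosine and Euclidean distance on two made-up vectors that point in the same direction but differ in length. The vectors and helper names are assumptions for illustration; real sentence embeddings have hundreds of dimensions. One common reason to prefer cosine for embeddings is that it compares direction and ignores magnitude.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; depends only on the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance; sensitive to vector length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy vectors: same direction, different magnitude
short_vec = [1.0, 2.0, 3.0]
long_vec = [2.0, 4.0, 6.0]

cos_d = cosine_distance(short_vec, long_vec)     # approximately 0
euc_d = euclidean_distance(short_vec, long_vec)  # sqrt(14)
print(cos_d, euc_d)
```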

---

## HDBSCAN  
**Purpose:** Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) finds clusters of varying density and marks outliers as noise.  
- **`min_cluster_size`**: The smallest size grouping that should be considered a cluster.  
- **`min_samples`**: The number of nearby points required to consider a point part of a dense region (higher values → more conservative clusters).  
- **`metric`**: The distance metric used to measure pairwise distances (e.g. “euclidean” for straight‑line distance).  
- **`prediction_data`**: Whether to store additional data needed for assigning new points to existing clusters.  

---

## CountVectorizer  
**Purpose:** Converts a collection of text documents into a matrix of token (word or n‑gram) counts for feature extraction.  
- **`ngram_range=(1,2)`**: Extract both unigrams (single words) and bigrams (two‑word sequences).  
- **`stop_words="english"`**: Remove common English stopwords (e.g., “the”, “and”) before counting.  
- **`min_df=5`**: Ignore tokens that appear in fewer than 5 documents to reduce noise and dimensionality.  
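
The three settings can be mimicked in a few lines of plain Python to see what ends up in the vocabulary. This is a simplified sketch, not scikit-learn's implementation: the toy documents, the tiny stopword set, and the helper `ngrams` are assumptions, and `min_df` is lowered to 2 so the toy corpus is not filtered to nothing.

```python
from collections import Counter

TOY_STOPWORDS = {"the", "at", "over"}  # stand-in for the English stopword list

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams with n_min <= n <= n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

toy_docs = [
    "the workers gathered at city hall",
    "the teachers gathered at city hall",
    "farmers gathered over the water shortage",
]

# Document frequency: in how many documents does each n-gram occur?
doc_freq = Counter()
for doc in toy_docs:
    tokens = [t for t in doc.split() if t not in TOY_STOPWORDS]  # stopwords removed first
    doc_freq.update(set(ngrams(tokens)))

min_df = 2
vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)
print(vocab)
```

Terms that appear in only one toy document, such as "water shortage", are dropped by the `min_df` filter, while the recurring bigram "city hall" is kept alongside its unigrams.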


In [None]:
umap_params = {
    "n_neighbors": 15,
    "n_components": 5,
    "min_dist": 0.1,
    "metric": "cosine",
    "random_state": 42
}
hdbscan_params = {
    "min_cluster_size": 20,
    "min_samples": 10,
    "metric": "euclidean",
    "prediction_data": True
}

vectorizer = CountVectorizer(ngram_range=(1,2), stop_words="english", min_df=5)

## SentenceTransformer
**Purpose:** Converts sentences or documents into fixed-size dense vectors (embeddings) that capture semantic meaning, enabling downstream tasks such as clustering.

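The **C_V Coherence Score** assesses how semantically coherent the top words within a topic are, based on how often they appear together in the original texts.

A higher C_V score indicates that:

- The top words in a topic frequently occur together.
- The words appear in similar contexts, which suggests they are semantically related.

The C_V score (Röder et al., 2015) combines two key concepts:

- **Co-occurrence frequency:** Word and word-pair probabilities are estimated with a boolean sliding window (110 tokens by default in Gensim) moved across the original texts.
- **Indirect similarity:** Each top word is represented by a context vector holding its Normalized Pointwise Mutual Information (NPMI) with every top word of the topic, and these context vectors are compared using cosine similarity. No external word embeddings are involved.

The cosine similarities are averaged to produce a single coherence score for each topic, and the topic scores are averaged to give the score for the model.


### 🧮 C_V Coherence Score Formula

The C_V coherence score of a topic is calculated as:

$$
\text{NPMI}(w_i, w_j) = \frac{\log \dfrac{P(w_i, w_j) + \epsilon}{P(w_i)\, P(w_j)}}{-\log\left(P(w_i, w_j) + \epsilon\right)}
$$

$$
\vec{v}(w_i) = \left(\text{NPMI}(w_i, w_j)\right)_{w_j \in W}, \qquad \vec{v}(W) = \sum_{w_i \in W} \vec{v}(w_i)
$$

$$
C_V = \frac{1}{|W|} \sum_{w_i \in W} \cos\left(\vec{v}(w_i), \vec{v}(W)\right)
$$

Where:  
$W$ is the set of top words in a topic  
$P(w_i)$ and $P(w_i, w_j)$ are the fractions of sliding windows that contain $w_i$, or both $w_i$ and $w_j$  
$\epsilon$ is a small constant that avoids taking the logarithm of zero  
$\text{NPMI}(w_i, w_j)$ is the Normalized Pointwise Mutual Information between words $w_i$ and $w_j$  
$\vec{v}(w_i)$ is the context vector of $w_i$, and $\cos(\cdot, \cdot)$ is the cosine similarity between two context vectors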
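
For intuition, the following is a minimal pure-Python sketch of a C_V-style score in the context-vector formulation of Röder et al. (2015): boolean sliding windows, NPMI context vectors, cosine similarity against the summed vector, and an arithmetic mean. It is not Gensim's implementation. The toy texts, the helper names, and the handling of the edge cases (pairs that never or always co-occur) are assumptions made for this sketch.

```python
import math

def boolean_windows(texts, window_size=110):
    """Return one token set per sliding-window position (whole text if shorter)."""
    windows = []
    for tokens in texts:
        if len(tokens) <= window_size:
            windows.append(set(tokens))
            continue
        for start in range(len(tokens) - window_size + 1):
            windows.append(set(tokens[start:start + window_size]))
    return windows

def npmi(w1, w2, windows):
    """NPMI from window-level probabilities; edge cases are sketch conventions."""
    n = len(windows)
    p1 = sum(w1 in win for win in windows) / n
    p2 = sum(w2 in win for win in windows) / n
    p12 = sum(w1 in win and w2 in win for win in windows) / n
    if p12 == 0:
        return -1.0  # never co-occur: limiting value of NPMI
    if p12 == 1:
        return 0.0   # always co-occur: NPMI is undefined, treated as 0 here
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def c_v_topic(top_words, windows):
    """Mean cosine similarity between each word's context vector and their sum."""
    vectors = [[npmi(w, other, windows) for other in top_words] for w in top_words]
    total = [sum(column) for column in zip(*vectors)]
    return sum(cosine(vec, total) for vec in vectors) / len(vectors)

# Made-up tokenized documents (each shorter than the window, so one window each)
toy_texts = [
    ["teachers", "salary", "school"],
    ["teachers", "salary", "wages"],
    ["farmers", "water", "drought"],
    ["farmers", "water", "river"],
    ["teachers", "school"],
    ["farmers", "drought"],
]
windows = boolean_windows(toy_texts)

coherent = c_v_topic(["teachers", "salary", "school"], windows)
mixed = c_v_topic(["teachers", "water", "drought"], windows)
print(coherent, mixed)
```

The topic whose words share documents scores higher than the topic that mixes words from unrelated documents, which is the behavior the metric is meant to capture.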

### Topic Diversity

**What it is:**  
Topic Diversity measures how distinct the top words are across all topics. A higher score indicates less overlap in the most important words between topics, suggesting more distinctive topic representations.

**Formula:**  

$$
\text{Topic Diversity} = \frac{\lvert \bigcup_{t=1}^{T} W_t \rvert}{T \times k}
$$

- $T$ = number of topics (excluding the “noise” topic)  
- $k$ = number of top words considered per topic  
- $W_t$ = set of the top $k$ words for topic $t$
- $\bigcup_{t=1}^{T} W_t$ = union of all these top‑word sets  

So you collect the top‑$k$ words from each of the $T$ topics, count how many unique words appear in total, and divide by $T \times k$. A value of 1.0 means no overlap; lower values indicate more shared terms across topics.  
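
A short worked example of the ratio, using three made-up topics with $k = 3$. The helper name `topic_diversity` and the toy word lists are assumptions for illustration. One word ("salary") is shared by two topics, so 8 of the 9 top-word slots are unique.

```python
def topic_diversity(topics, k):
    """Unique top-k words across topics divided by (number of topics * k)."""
    top_sets = [set(words[:k]) for words in topics]
    unique_words = set().union(*top_sets)
    return len(unique_words) / (k * len(top_sets))

# Made-up topics: "salary" appears in the first two
toy_topics = [
    ["teachers", "salary", "school"],
    ["workers", "salary", "factory"],
    ["farmers", "water", "drought"],
]

score = topic_diversity(toy_topics, k=3)
print(round(score, 3))
```

The ratio only matches the formula when every topic actually has at least $k$ words; shorter word lists would lower the score without any real overlap.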


In [None]:
def evaluate_model(embedding_name):
    """
    Fit BERTopic with a given SentenceTransformer embedding model,
    compute topic coherence and diversity, and return the scores with the fitted model.
    """
    # Load the specified SentenceTransformer; with no explicit device it uses the GPU when one is available
    embed = SentenceTransformer(embedding_name)
    
    # Instantiate the BERTopic model:
    # - embedding_model: our sentence embeddings
    # - umap_model: reduces embeddings to lower dimensions
    # - hdbscan_model: clusters the reduced embeddings into topics
    # - vectorizer_model: extracts n‑gram features for topic representation
    topic_model = BERTopic(
        embedding_model=embed,
        umap_model=UMAP(**umap_params),
        hdbscan_model=HDBSCAN(**hdbscan_params),
        vectorizer_model=vectorizer,
        language='english',
        calculate_probabilities=True,
        verbose=False
    )
    
    # Fit BERTopic to our document list, retrieving topic assignments and probabilities
    topics, probs = topic_model.fit_transform(docs.tolist())

    # Reduce to 30 topics for clearer interpretability (the model is updated in place)
    topic_model.reduce_topics(docs.tolist(), nr_topics=30)
    
    # Build a list of top words for each topic (excluding the noise topic -1)
    topic_words = [
        [word for word, _ in topic_model.get_topic(topic_id)]
        for topic_id in topic_model.get_topic_info().Topic
        if topic_id != -1
    ]
    
    # Prepare a Gensim dictionary from our tokenized texts for coherence calculation
    dictionary = Dictionary(processed_texts.tolist())
    
    # Calculate c_v coherence using the topic word lists and original tokenized texts
    cv = CoherenceModel(
        topics=topic_words,
        texts=processed_texts.tolist(),
        dictionary=dictionary,
        coherence='c_v'
    ).get_coherence()
    
    # Compute topic diversity:
    # - For each topic, collect its top `topk` words
    # - Diversity = (unique top words) / (total possible top words)
    topk = 10
    top_sets = [
        set([w for w, _ in topic_model.get_topic(topic_id)[:topk]])
        for topic_id in topic_model.get_topic_info().Topic
        if topic_id != -1
    ]
    all_words = [word for s in top_sets for word in s]
    diversity = len(set(all_words)) / (topk * len(top_sets))
    
    # Return the rounded coherence score, diversity score, and the fitted model
    return round(cv, 3), round(diversity, 3), topic_model

# List of embedding models to compare
models = [
    "all-mpnet-base-v2",
    "paraphrase-MiniLM-L6-v2",
    "all-MiniLM-L6-v2"
]

# Evaluate each model and print results
results = {}
for name in models:
    # Compute coherence, diversity, and retrieve the trained model
    cv_score, div_score, mdl = evaluate_model(name)
    # Store the metrics and the fitted model for the later summary and visualizations
    results[name] = {"Coherence": cv_score, "Diversity": div_score, "model": mdl}
    # Output the evaluation summary for this embedding
    print(f"{name}: Coherence={cv_score}, Diversity={div_score}")

In [None]:
# Display the metrics in a DataFrame for clarity (a stored model column, if present, is dropped)
pd.DataFrame(results).T.drop(columns="model", errors="ignore").rename_axis("Embedding").reset_index()

In [None]:
mpnet_model = results["all-mpnet-base-v2"]["model"]  # fitted BERTopic model used in the visualizations below

### Visualization: Document Counts per Topic
This horizontal bar chart shows how many documents each topic contains after the reduction step, with the outlier topic (-1) excluded.
Bars are sorted by document count, so the most dominant themes sit at the top and the least common ones at the bottom.

In [None]:
# 1. Collect per-topic document counts
topic_info = mpnet_model.get_topic_info()                # full topic info
freq = topic_info[topic_info.Topic != -1]                # drop the outlier cluster (-1)
freq = freq.sort_values(by="Count", ascending=False)     # sort desc by document count

# 2. Plot as horizontal bars
plt.figure(figsize=(12, 8))                              # set size to 12×8 inches
plt.barh(freq['Name'], freq['Count'])                    # barh(names on y, counts on x)
plt.xlabel("Number of Documents")                        # label x‑axis
plt.title("Document Counts per Topic (all‑mpnet‑base‑v2)") # give chart a title
plt.gca().invert_yaxis()                                 # invert y so largest bar sits at top
plt.tight_layout()                                       # adjust margins
plt.show()                                               # render the plot

### Top 10 Words Before vs. After Preprocessing
Two bar charts, one above the other:
- **Top**: raw token frequencies from `clean_notes`
- **Bottom**: token frequencies after stop‑word and short-token removal

The comparison shows how preprocessing filters out frequent but uninformative tokens, which dominate the raw plot, and surfaces more content-specific words in the processed plot.


In [None]:
# Count raw tokens
raw_tokens = []
for text in iran_df['clean_notes'].dropna():     # iterate over cleaned notes
    raw_tokens.extend(simple_preprocess(text, deacc=True))  # tokenize, strip punctuation
raw_counts = Counter(raw_tokens)                          # tally frequencies
top10_raw = raw_counts.most_common(10)                    # pick ten most common
words_raw, freq_raw = zip(*top10_raw)                     # split into words & counts

# Count processed tokens
processed_tokens = [w for tokens in processed_texts for w in tokens]  # flatten list
proc_counts = Counter(processed_tokens)
top10_proc = proc_counts.most_common(10)
words_proc, freq_proc = zip(*top10_proc)

# Plot both
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10))    # two rows, one column

# Raw tokens plot
ax1.bar(words_raw, freq_raw)                              # bars: x=words_raw, height=freq_raw
ax1.set_title("Top 10 Words BEFORE Preprocessing")        # title
ax1.set_ylabel("Frequency")                               # y‑axis label
for i, v in enumerate(freq_raw):                          # annotate counts
    ax1.text(i, v, str(v), ha='center', va='bottom')

# Processed tokens plot
ax2.bar(words_proc, freq_proc)
ax2.set_title("Top 10 Words AFTER Preprocessing")
ax2.set_ylabel("Frequency")
for i, v in enumerate(freq_proc):
    ax2.text(i, v, str(v), ha='center', va='bottom')

plt.tight_layout()                                        # tidy spacing
plt.show()                                                # display

### Topic Distribution for 2 Random Documents

This chart displays the probability distribution over topics for two randomly chosen documents, highlighting which topics each document is most strongly associated with. Only topics with a probability above 0.1 are plotted.

**Purpose:** To demonstrate BERTopic’s soft clustering by showing how a single document can relate to multiple topics with varying strengths, which aids qualitative inspection of topic assignments.


In [None]:
# Select two random doc indices
sample_ids = random.sample(range(len(docs)), 2)

for doc_id in sample_ids:
    # 1. Compute topic probabilities for this document (positional lookup matches iran_df.iloc below)
    _, doc_probs = mpnet_model.transform([docs.iloc[doc_id]])
    probs = doc_probs[0]                                   # extract array of probabilities

    # 2. Filter for probability > 0.1
    significant = [(i, p) for i, p in enumerate(probs) if p > 0.1]
    if not significant:
        print(f"Document {doc_id}: No topics >0.1 probability.\n")
        continue

    # 3. Sort descending by probability
    significant.sort(key=lambda x: x[1], reverse=True)
    topic_labels = [f"Topic {i}" for i,_ in significant]   # e.g. "Topic 5"
    p_vals = [p for _,p in significant]                    # probabilities list

    # 4. Print the original text
    print(f"\nDocument {doc_id}:\n{iran_df.iloc[doc_id]['notes']}\n")

    # 5. Plot bar chart
    plt.figure(figsize=(10, 4))                            # wide and short figure
    bars = plt.bar(topic_labels, p_vals)                   # bars: x=topic_labels, height=p_vals
    plt.ylim(0,1)                                          # y‑axis from 0 to 1
    plt.xlabel("Topics")                                   # x‑axis label
    plt.ylabel("Probability")                              # y‑axis label
    plt.title(f"Topic Distribution for Document {doc_id}") # chart title
    plt.xticks(rotation=45, ha='right')                    # rotate xticks for readability

    # 6. Annotate each bar with its probability
    for bar, prob in zip(bars, p_vals):
        plt.text(bar.get_x() + bar.get_width()/2,
                 prob,
                 f"{prob:.2f}",
                 ha='center',
                 va='bottom',
                 fontsize=9)
    plt.tight_layout()                                     # adjust layout
    plt.show()                                             # render

The interactive BERTopic visualization creates an “inter‑topic distance map” where each topic is shown as a bubble (positioned by its similarity to other topics in a 2D projection and sized by its prevalence), and lets you:

- Hover over a topic to see its top words and its size
- Zoom and pan to explore how topics cluster or separate
- Use the slider below the plot to highlight an individual topic

Its purpose is to give you an intuitive, dynamic way to explore the model’s topics: which themes are most common, how they relate to one another, and what key words define each topic.

In [None]:
from IPython.display import display

# Generate the interactive BERTopic visualization object
topic_vis_data = mpnet_model.visualize_topics()

display(topic_vis_data)