# 📘 Protest Topic Modeling using BERTopic + LLaMA 2


## 🎯 Objective
This notebook applies **BERTopic** combined with **LLaMA 2** for labeling protest topics in the ACLED Iran dataset. 
The goal is to discover coherent topics and generate human-readable labels and explanations using LLaMA 2 text generation.

---


## 1. Install and Import Required Libraries

In [2]:

import numpy as np
import pandas as pd
import re
import nltk
import os
import torch
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.utils import simple_preprocess
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, TextGeneration

nltk.download("stopwords")
os.environ["TOKENIZERS_PARALLELISM"] = "false"


[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Load and Preview Dataset

In [3]:

df = pd.read_csv("/home/ubuntu/Capstone_Files/data/ACELD_Iran.csv", sep=';')
df[['notes']].head()


Unnamed: 0,notes
0,"On 6 February 2025, nurses and health workers ..."
1,"On 6 February 2025, Continental Plateau Oil Co..."
2,"On 5 February 2025, workers at the Telecommuni..."
3,"On 5 February 2025, investors of Cryptoland Di..."
4,"On 5 February 2025, landowners at the 33 lands..."


## 3. Preprocess Protest Notes

In [4]:

def clean_notes(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'\b(?:on\s+)?\d{1,2}\s+\w+\s+\d{4}\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\b(protest(ed|ing)?|rally|gather(ed|ing)?|demonstration|march|strike|held)\b', '', text, flags=re.IGNORECASE)
    return text

df['clean_notes'] = df['notes'].apply(clean_notes)

custom_stopwords = {"october", "february", "january", "may", "november", "december", "april", "march"}
stop_words = set(stopwords.words("english")).union(custom_stopwords)

def preprocess(text):
    tokens = simple_preprocess(text, deacc=True)
    return [word for word in tokens if word not in stop_words and len(word) > 2]

processed_texts = df['clean_notes'].fillna("").apply(preprocess).tolist()
docs = [" ".join(tokens) for tokens in processed_texts]


## 4. Generate Sentence Embeddings

In [5]:

embedding_model = SentenceTransformer("all-mpnet-base-v2", device='cuda')
embeddings = embedding_model.encode(docs, show_progress_bar=True)


Batches:   0%|          | 0/819 [00:00<?, ?it/s]

## 5. UMAP Dimensionality Reduction and HDBSCAN Clustering

In [6]:

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.1, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=10, metric='euclidean', prediction_data=True)
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=5)


## 6. Load LLaMA 2 Model for Few-Shot Labeling

In [7]:

llama_model_id = "meta-llama/Llama-2-7b-chat-hf"
llama_tokenizer = AutoTokenizer.from_pretrained(llama_model_id, trust_remote_code=True)
llama_model = AutoModelForCausalLM.from_pretrained(llama_model_id, trust_remote_code=True, device_map="auto", torch_dtype=torch.float16).eval()
llama_generator = pipeline("text-generation", model=llama_model, tokenizer=llama_tokenizer, temperature=0.1, max_new_tokens=256, repetition_penalty=1.1)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


## 7. Prepare Few-Shot Prompt Template for LLaMA 2

In [8]:

system_prompt = """<s>[INST] <<SYS>> You are a helpful, respectful, and honest assistant for identifying the main reason behind each protest topic. <</SYS>>"""

few_shot_examples = """
I have a topic that contains the following documents:
- Retired employees rallied for overdue pension payments and social security benefits.
- Elderly workers demonstrated at the national pension office demanding fair treatment.
- A crowd of retirees chanted for insurance premium reductions and increased monthly payments.

The topic is described by the following keywords: 'retirees, pension, insurance, payment, benefits, elderly, demand, treatment, office'.

Based on the information above, please provide:
Reason: Frustration over pension and benefit delays.
Explanation: Retirees are protesting for delayed pension payments and fairer treatment regarding their benefits.
Label: Retiree protests over pension issues
"""

main_prompt = """
[INST]
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: '[KEYWORDS]'.

Based on the information above, please provide:
Reason: The main reason for the protests.
Explanation: A brief explanation connecting the documents to the reason.
Label: A short, descriptive label for this topic.
[/INST]
"""

prompt_template = system_prompt + few_shot_examples + main_prompt


## 8. Build BERTopic Model and Fit

In [9]:

representation_model = {
    "KeyBERT": KeyBERTInspired(),
    "MMR": MaximalMarginalRelevance(diversity=0.3),
    "LLaMA": TextGeneration(model=llama_generator, prompt=prompt_template)
}

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    calculate_probabilities=True,
    language='english',
    representation_model=representation_model,
    verbose=False
)

topics, probs = topic_model.fit_transform(docs, embeddings)
topic_model = topic_model.reduce_topics(docs, nr_topics=30)
topics, probs = topic_model.transform(docs)
df["topic"] = topics


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


## 9. Evaluate Coherence Score and Topic Diversity

### 📈 C_V Coherence Score: What it is

The **C_V Coherence Score** evaluates how semantically meaningful and internally consistent a topic is. It combines two important dimensions:

- **Co-occurrence Frequency**: How often the top words appear together in the original texts (using a sliding window).
- **Semantic Similarity**: How closely related the top words are in meaning, based on cosine similarity of their word embeddings.

A **higher C_V score** means:
- The topic’s top words frequently occur together in the texts.
- The words are more semantically coherent.

The C_V score leverages **Normalized Pointwise Mutual Information (NPMI)** and cosine similarity to assess topic coherence.

---

### 🧮 Formula (Conceptual)

Let:
- \( W = \{w_1, w_2, \dots, w_k\} \) : the top-k words in a topic
- \( \text{NPMI}(w_i, w_j) \) : the Normalized Pointwise Mutual Information between \( w_i \) and \( w_j \)
- \( \text{Sim}(w_i, w_j) \) : the semantic similarity (e.g., cosine similarity) between word embeddings of \( w_i \) and \( w_j \)

Then the C_V score is computed as an **aggregated combination** of:

$$
\\text{C}_V = \\frac{1}{|W|^2} \\sum_{i,j} \\text{NPMI}(w_i, w_j) \\times \\text{Sim}(w_i, w_j)
$$

> *Note*: Exact implementation may vary depending on the library used (e.g., Gensim), but the conceptual components remain the same.


📊 Topic Diversity (Top2Vec)
What it is:

Topic Diversity measures how distinct the top words are across all discovered topics. A higher score suggests that each topic captures a unique theme, with less repetition of keywords between topics.

For Top2Vec, which automatically identifies the number of topics and generates top keywords per topic based on document embeddings, this metric is useful for evaluating how semantically diverse the topics are.

**Formula:**  

Let \( T \) = number of topics  
Let \( k \) = number of top words per topic  
Let \( W_t \) = set of top‑k words for topic \( t \)

Then, Topic Diversity is calculated as:

$$
\text{Diversity} = \frac{\left|\bigcup_{t=1}^{T} W_t\right|}{T \times k}
$$

A value of 1.0 means all topics have completely unique top words (no overlap), while lower values indicate overlapping or redundant topic terms.

In [10]:

topic_words = [[word for word, _ in topic_model.get_topic(i)] for i in range(len(topic_model.get_topics())) if topic_model.get_topic(i)]
dictionary = Dictionary(processed_texts)
coherence_model = CoherenceModel(topics=topic_words, texts=processed_texts, dictionary=dictionary, coherence='c_v')
print(f"CV Coherence Score: {round(coherence_model.get_coherence(), 3)}")

def compute_topic_diversity(model, topk=10):
    top_words = [set(word for word, _ in model.get_topic(i)[:topk]) for i in range(len(model.get_topics())) if model.get_topic(i)]
    all_words = [word for topic in top_words for word in topic]
    return len(set(all_words)) / (topk * len(top_words))

print(f"Topic Diversity Score: {round(compute_topic_diversity(topic_model), 3)}")


CV Coherence Score: 0.735
Topic Diversity Score: 0.855


## 🔍 Manual Review of Selected Topics

In [11]:

topics_to_review = [0, 1, 2]
rep_docs = topic_model.get_representative_docs()

for tid in topics_to_review:
    docs_for_topic = rep_docs.get(tid, [])[:3]
    if not docs_for_topic:
        continue

    topic_keywords = ", ".join([word for word, _ in topic_model.get_topic(tid)[:10]])
    doc_snippets = "\n".join([f"- {doc}" for doc in docs_for_topic])
    filled_prompt = prompt_template.replace("[DOCUMENTS]", doc_snippets).replace("[KEYWORDS]", topic_keywords)

    try:
        response = llama_generator(filled_prompt, return_full_text=False)
        raw_output = response[0]["generated_text"].strip()
        reason = re.search(r"Reason:\s*(.+?)\n", raw_output)
        explanation = re.search(r"Explanation:\s*(.+?)\n", raw_output)
        label = re.search(r"Label:\s*(.+)", raw_output)

        reason = reason.group(1).strip() if reason else "Not provided"
        explanation = explanation.group(1).strip() if explanation else "Not provided"
        label = label.group(1).strip() if label else f"Topic {tid}"

    except Exception as e:
        print(f"Error extracting for Topic {tid}: {e}")
        reason, explanation, label = "Error", "Error", f"Topic {tid}"

    print(f"\nTopic {tid}")
    print(f"Label: {label}")
    print(f"Reason: {reason}")
    print(f"Explanation: {explanation}")
    print("Representative Documents:")
    for doc in docs_for_topic:
        print(f"- {doc}\n")
    print("=" * 60)



Topic 0
Label: Worker protests over unpaid salaries
Reason: Delays in salary payments.
Explanation: Workers from various companies, including Haft Tapeh Sugarcane Company and Greenspace Municipal Workers in Hamidiyeh City, Khuzestan, are protesting due to unpaid salaries that have been accumulated for several months. This highlights the financial difficulties faced by these workers and their employers' failure to address the issue in a timely manner.
Representative Documents:
- workers haft tapeh sugarcane company located outside shush months unpaid salaries

- workers haft tapeh sugarcane company located outside shush months unpaid salaries

- greenspace municipal workers hamidiyeh city khuzestan months unpaid wages


Topic 1
Label: Iranian retirees protest insurance rate hikes, pension arrears.
Reason: Dispute over insurance premium rates and payment of pension arrears.
Explanation: Retirees in Iran are protesting against the recent changes in insurance rates and the delay in paymen