# Topic Modeling for the Debates Dataset

This notebook documents the **Topic Modeling pipeline for the debates dataset** for the BSc thesis:  
`Debates, Media, and Discourse: A Computational Analysis of Temporal Shifts in U.S. Presidential Debates and Media Framing Across the Political Spectrum`, written by **Emma Cristina Mora** (emma.mora@studbocconi.it) at **Bocconi University** under the supervision of **Professor Carlo Rasmus Schwarz**.

This notebook develops a **multi-stage pipeline** for analyzing the thematic structure of U.S. presidential debates (1960–2020). The objective is to identify both **broad political themes** and **finer-grained subcomponents** using hierarchical topic modeling. The process combines unsupervised clustering, manual labeling, and structured exports to create an enriched dataset that serves as input for framing, sentiment, and ideological analyses.

**Dataset Preparation**
- Debate transcripts were first parsed into **utterance-level units** with metadata.
- Preprocessing included **speaker normalization** (e.g., unifying “MR. OBAMA” → `Obama`), **noise removal** (applause, laughter, ultra-short moderator remarks), and **metadata enrichment** (`debate_id`, `cycle`, `party`, `decade`).
- Final cleaned dataset: ~6,300 utterances.
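
The preprocessing code itself is not part of this notebook, so the following is only a minimal sketch of what speaker normalization and noise removal could look like. The regular expressions and the helper names `normalize_speaker` and `strip_noise` are assumptions made for illustration, not the rules used to build the dataset.

```python
import re

# Hypothetical title prefixes; the actual normalization rules are not shown in this notebook.
TITLE_RE = re.compile(
    r"^(mr|mrs|ms|dr|sen|senator|gov|governor|president)\.?\s+", re.IGNORECASE
)
# Bracketed or parenthesized stage directions such as [applause] or (laughter).
NOISE_RE = re.compile(r"[\[(]\s*(applause|laughter|crosstalk)\s*[\])]", re.IGNORECASE)


def normalize_speaker(raw: str) -> str:
    """Strip a leading title and title-case the remaining name (a simplification)."""
    name = TITLE_RE.sub("", raw.strip())
    return name.title()


def strip_noise(text: str) -> str:
    """Remove stage-direction markers and collapse the leftover whitespace."""
    cleaned = NOISE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


print(normalize_speaker("MR. OBAMA"))
print(strip_noise("thank you. [applause] let me begin."))
```

A real pipeline would likely need a lookup table on top of this, since `str.title()` mishandles names with internal capitals.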

**Embedding and Theme Modeling**
- Each utterance was embedded with **SBERT (all-MiniLM-L6-v2)**, producing a 6,316 × 384 matrix.
- **BERTopic** was applied to discover coarse-grained **themes** using UMAP + HDBSCAN (16 labeled themes plus a `noise_or_unspecified` category, 17 in total).
- Manual relabeling ensured interpretability; incoherent clusters were reassigned or collapsed into `noise_or_unspecified`.

**Hierarchical Subthemes**
- A second round of BERTopic was run within large themes (≥200 utterances) to identify **subthemes**.  
- Example results:  
  - *foreign_policy_national_security* → 12 subthemes (`russia_soviet_union`, `military_budget`, …)  
  - *healthcare_social_security* → 4 subthemes (`affordable_care_health_insurance`, `medicare_prescription_drugs`, …)  
  - *tax_policy* → 2 subthemes (`tax_cuts_policy_proposals`, `tax_burden_fairness_inequality`)  

**Outputs**
- `debates_df_topic.csv` — utterance-level dataset with `theme_name` and `subtheme_name` labels  
- `debates_embeddings.npy` — SBERT embedding matrix  
- `topics_summary.csv` and `subtopics_summary.csv` — descriptive cluster overviews  

**Notebook Contribution**
This pipeline produces the **core debate dataset** used in downstream analyses, enabling:  
- Tracking of theme/subtheme frequency across time  
- Sentiment and stance analysis within themes  
- Framing and omission analysis in relation to media coverage  
- Embedding-based ideological scaling
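
As a rough illustration of the first downstream use, theme frequency over time can be computed by bucketing utterances into decades and taking each theme's share per bucket. The rows below are toy values invented for the example, and the dependency-free approach is an assumption; the downstream notebooks may well use pandas instead.

```python
from collections import Counter, defaultdict

# Toy (year, theme_name) pairs; illustrative values, not rows from the dataset.
rows = [
    (1960, "foreign_policy_national_security"),
    (1960, "tax_policy"),
    (1988, "foreign_policy_national_security"),
    (2016, "immigration_borders"),
    (2016, "immigration_borders"),
    (2016, "tax_policy"),
]


def decade_label(year: int) -> str:
    """Map a year to its decade label, e.g. 1988 -> '1980s'."""
    return f"{year // 10 * 10}s"


# Count themes within each decade.
counts = defaultdict(Counter)
for year, theme in rows:
    counts[decade_label(year)][theme] += 1

# Convert counts to within-decade shares.
shares = {
    decade: {theme: n / sum(c.values()) for theme, n in c.items()}
    for decade, c in counts.items()
}

for decade in sorted(shares):
    print(decade, shares[decade])
```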

## 1. Introduction and Config

In [3]:
# === SETUP ===

# standard libraries
from pathlib import Path
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
import json

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# reproducibility (used later for sampling etc.)
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

In [4]:
# === FILE PATHS ===

# set base repository path (assumes notebook is in repo/notebooks/)
REPO_DIR = Path(".").resolve().parents[0]

# key data paths
DATA_DIR = REPO_DIR / "data" 
DEBATES_DF_PATH = DATA_DIR / "debates_dataset.csv" 
METADATA_PATH = DATA_DIR / "debates_metadata.csv"

# confirm setup
print("Repository Path:", REPO_DIR)
print("Data Directory:", DATA_DIR)
print("Debates Dataset:", DEBATES_DF_PATH)
print("Metadata CSV:", METADATA_PATH)

# color palette
with open(REPO_DIR / "color_palette_config.json") as f:
    palette = json.load(f)

Repository Path: /Users/emmamora/Documents/GitHub/thesis
Data Directory: /Users/emmamora/Documents/GitHub/thesis/data
Debates Dataset: /Users/emmamora/Documents/GitHub/thesis/data/debates_dataset.csv
Metadata CSV: /Users/emmamora/Documents/GitHub/thesis/data/debates_metadata.csv


## 2. Load Data & Define Model Scope

In [57]:
# === LOAD DATA ===

debates_df = pd.read_csv(DEBATES_DF_PATH)

# basic sanity checks
expected_cols = {
    "text", "speaker_normalized", "speaker", "party", "winner", "winner_party", 
    "year", "debate_type", "debate_id", "utterance_id", "lemmatized_text"
}
missing = expected_cols - set(debates_df.columns)
assert not missing, f"Missing columns in dataset: {missing}"

# ensure text is string and non-empty
debates_df["text"] = debates_df["text"].astype(str).str.strip()
debates_df = debates_df[debates_df["text"].str.len() > 0].reset_index(drop=True)

print(f"Loaded {len(debates_df):,} utterances from {DEBATES_DF_PATH.name}")
debates_df.head(2)

Loaded 7,577 utterances from debates_dataset.csv


Unnamed: 0,text,speaker_normalized,speaker,party,winner,winner_party,year,debate_type,debate_id,utterance_id,lemmatized_text,token_count
0,good evening. the television and radio stations of the united states and their affiliated stations are proud to provide facilities for a discussio...,Moderator,Moderator,,Kennedy,Democrat,1960,presidential,1960_1_Presidential_Nixon_Kennedy,1960_1_Presidential_Nixon_Kennedy_001,good evening the television and radio station of the united states and their affiliated station be proud to provide facility for a discussion of i...,146
1,"mr. smith, mr. nixon. in the election of 1860, abraham lincoln said the question was whether this nation could exist half-slave or half-free. in t...",Candidate_D,Kennedy,Democrat,Kennedy,Democrat,1960,presidential,1960_1_Presidential_Nixon_Kennedy,1960_1_Presidential_Nixon_Kennedy_002,mr smith mr nixon in the election of 1860 abraham lincoln say the question be whether this nation could exist half slave or half free in the elect...,1290


In [58]:
# === TYPE FIXES & HELPER COLUMNS ===

# year as int
debates_df["year"] = debates_df["year"].astype(int)

# decade label (e.g., 1990 -> "1990s")
debates_df["decade"] = (debates_df["year"] // 10 * 10).astype(int).astype(str) + "s"

# short party code (handy for legends/colors later)
party_map = {"Republican": "R", "Democrat": "D", "Independent": "I"}
debates_df["party_code"] = debates_df["party"].map(party_map)

# quick peek
debates_df[["year", "decade", "party", "party_code"]].head(3)

Unnamed: 0,year,decade,party,party_code
0,1960,1960s,,
1,1960,1960s,Democrat,D
2,1960,1960s,,


## 3. Initial Topic Modeling

### 3.1. Generate Sentence Embeddings (SBERT)

In [7]:
# === FILTER UTTERANCES FOR TOPIC MODELING ===

TEXT_COL = "text"
MIN_CHAR_LEN = 85  # discard short moderator lines

# include all candidates
is_candidate = debates_df["party"].isin(["Republican", "Democrat", "Independent"])

# include moderator utterances only if they're long enough
is_long_moderator = debates_df["party"].isna() & (debates_df[TEXT_COL].str.len() >= MIN_CHAR_LEN)

# combine
include_mask = is_candidate | is_long_moderator
debates_df_topic = debates_df[include_mask].copy()

# clean and validate text
debates_df_topic[TEXT_COL] = debates_df_topic[TEXT_COL].astype(str).str.strip()
debates_df_topic = debates_df_topic[debates_df_topic[TEXT_COL].str.len() > 0].reset_index(drop=True)

# optional: print moderator filtering stats
n_total = len(debates_df)
n_total_mods = debates_df["party"].isna().sum()
n_kept_mods = is_long_moderator.sum()

print(f"Total utterances: {n_total:,}")
print(f"→ Moderator utterances: {n_total_mods:,}")
print(f"→ Moderator kept (≥{MIN_CHAR_LEN} chars): {n_kept_mods:,}")
print(f"Final modeling utterances: {len(debates_df_topic):,}")

debates_df_topic[[TEXT_COL, "year", "party", "speaker"]].head(3)

Total utterances: 7,577
→ Moderator utterances: 2,813
→ Moderator kept (≥85 chars): 1,552
Final modeling utterances: 6,316


Unnamed: 0,text,year,party,speaker
0,good evening. the television and radio station...,1960,,Moderator
1,"mr. smith, mr. nixon. in the election of 1860,...",1960,Democrat,Kennedy
2,"mr. smith, senator kennedy. the things that se...",1960,Republican,Nixon


In [11]:
# === LOAD SBERT MODEL & COMPUTE EMBEDDINGS ===

from sentence_transformers import SentenceTransformer
import torch

SBERT_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

# pick best available device
if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

sbert = SentenceTransformer(SBERT_MODEL_NAME, device=device)
print(f"SBERT loaded: {SBERT_MODEL_NAME} on {device.upper()}")

# encode utterances
texts = debates_df_topic[TEXT_COL].tolist()
embeddings = sbert.encode(
    texts,
    batch_size=64,
    convert_to_numpy=True,
    normalize_embeddings=True,
    show_progress_bar=True
)

print("Embeddings shape:", embeddings.shape)

# save embeddings 
TOPIC_MODELING_PATH = REPO_DIR / "results" / "topic_modeling" / "debates_embeddings.npy"
TOPIC_MODELING_PATH.parent.mkdir(parents=True, exist_ok=True)
np.save(TOPIC_MODELING_PATH, embeddings)

SBERT loaded: sentence-transformers/all-MiniLM-L6-v2 on MPS


Batches: 100%|██████████| 99/99 [00:22<00:00,  4.41it/s]

Embeddings shape: (6316, 384)
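
Because the embeddings are encoded with `normalize_embeddings=True`, every row has unit length, so the cosine similarity between two utterances reduces to a plain dot product. The sketch below checks this identity on toy vectors in pure Python; the helper names are illustrative and the vectors are not taken from the embedding matrix.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


u, v = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)

# After normalization the dot product equals the cosine similarity.
print(cosine(u, v), dot(u_hat, v_hat))

# The progress bar above is consistent with the batch size: ceil(6316 / 64) batches.
n_batches = math.ceil(6316 / 64)
print(n_batches)
```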





### 3.2. Run BERTopic Clustering (HDBSCAN)

In [59]:
# === PREPARE TEXT FOR TOPIC LABELING (EXCLUDE MODERATORS) ===

DESCR_COL = "lemmatized_text"

# remove moderators to avoid polluting topic labels
labeling_df = debates_df_topic[
    debates_df_topic["party"].isin(["Republican", "Democrat", "Independent"])
].copy()

rep_docs = (
    labeling_df[DESCR_COL]
    .fillna("")
    .astype(str)
    .str.lower()
    .str.replace(r"[^a-z\s]", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
    .tolist()
)

print(f"Utterances for topic labeling: {len(rep_docs):,}")

Utterances for topic labeling: 4,764


In [60]:
# === DEFINE CUSTOM STOPWORDS FOR TOPIC LABELING ===

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# noise from debate transcripts
fillers = {"well","just","im","youre","hes","shes","ive","weve","lets","thats","uh","um","okay","ok","yea","yeah",
           "oh","right","wait","fine","great","hey","gotta"}
artifacts = {"crosstalk","laughter","applause","pm","debate"}
truthiness = {"true","false","absolutely","simply","statement","fact","facts","agree","disagree","quote",
            "correct","incorrect"}
discourse = {"let","talk","say","want","going","got","get","make","think","know","ask","answer","question",
            "respond","reply","tell","told","listen","hear","mention","time","first","second","third","next",
            "also","another","point","ahead","hour","evening","tonight","thank","thanks","welcome","please",
            "minute","minutes","today","year","years","people","president","presidential","country",
            "opening","closing","introduce","introduction","moderator","moderators","candidate","candidates",
            "seconds","interrupt", "quick", "response", "responding", "response", "responses", "wait", "time"}
titles = {"mr","mrs","ms","dr","sen","senator","gov","governor","president","vice", "presidential", 
          "presidency","presidencial"}
moderator_first_names = {"martha","chris","jim","bob","elaine","lester","candy","gwen","judy","tom",
                        "anderson","frank","scott","susan","david","charlie","mike","john"}
# note: CountVectorizer filters stop words token by token before building n-grams,
# so the multi-word entries below never match; only the single names take effect
candidate_names = {
    "obama", "biden", "joe", "kamala", "harris", "trump", "donald", "clinton",
    "hillary", "bernie", "sanders", "romney", "mitt", "cheney", "pence",
    "bush", "george", "reagan", "nixon", "kerry", "dukakis", "perot", "dole",
    "gore", "edwards", "mccain", "palin", "ryan", "quayle", "mondale",
    "hillary clinton", "donald trump", "mitt romney", "bernie sanders", "kamala harris",
    "george bush", "george w", "george w bush", "don", "donnell", "tim"
    }

custom_stopwords = (
    ENGLISH_STOP_WORDS
    .union(fillers)
    .union(artifacts)
    .union(truthiness)
    .union(discourse)
    .union(titles)
    .union(moderator_first_names)
    .union(candidate_names)
)

vectorizer_model = CountVectorizer(
    stop_words=list(custom_stopwords),
    ngram_range=(1, 2),
    min_df=5,
    max_df=0.75
)
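
The `min_df=5` and `max_df=0.75` settings prune the vocabulary by document frequency: an integer `min_df` is an absolute document count, while a float `max_df` is a share of all documents. The following is a simplified, dependency-free re-implementation of that rule for illustration only. It tokenizes on whitespace, ignores stop words and n-grams, and `filter_by_document_frequency` is a hypothetical helper rather than part of scikit-learn.

```python
def filter_by_document_frequency(docs, min_df=5, max_df=0.75):
    """Return the sorted terms whose document frequency lies within the bounds.

    min_df is an absolute document count; max_df is a share of all documents.
    """
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc.split()):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


# 10 toy documents: "tax" appears in all 10, "cut" in 5, "energy" in 4.
toy_docs = ["tax cut"] * 5 + ["tax energy"] * 4 + ["tax"]
kept = filter_by_document_frequency(toy_docs, min_df=5, max_df=0.75)
print(kept)
```

In this toy corpus "tax" is dropped for being too common, "energy" for being too rare, and only "cut" survives.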

In [61]:
# === CONFIGURE & RUN BERTopic ===

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
import umap
import hdbscan

topic_model = BERTopic(
    embedding_model=sbert,
    umap_model=umap.UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=RANDOM_SEED
    ),
    hdbscan_model=hdbscan.HDBSCAN(
        min_cluster_size=40,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True
    ),
    vectorizer_model=vectorizer_model,
    representation_model=KeyBERTInspired(),
    calculate_probabilities=True,
    verbose=True
)

topics, probs = topic_model.fit_transform(debates_df_topic[TEXT_COL].tolist(), embeddings)
debates_df_topic["topic"] = topics
debates_df_topic["probability"] = probs.max(axis=1)

# use a distinct name so the utterance total (n_total) from the filtering cell is not overwritten
n_clusters = len(set(topics))
n_noise = sum(1 for t in topics if t == -1)
n_valid = n_clusters - (1 if -1 in topics else 0)

print(f"Total clusters (incl. noise): {n_clusters}")
print(f"Noise cluster (-1): {n_noise:,} utterances")
print(f"Valid topics discovered: {n_valid}")

2025-09-08 13:29:43,147 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-08 13:29:51,154 - BERTopic - Dimensionality - Completed ✓
2025-09-08 13:29:51,157 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-08 13:29:51,451 - BERTopic - Cluster - Completed ✓
2025-09-08 13:29:51,455 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-09-08 13:29:59,368 - BERTopic - Representation - Completed ✓


Total clusters (incl. noise): 20
Noise cluster (-1): 1,945 utterances
Valid topics discovered: 19


In [62]:
# === QUICK TOPIC WORD PREVIEW ===

TOP_N_WORDS = 10  # adjust as needed
topics_dict = topic_model.get_topics()

for t_id, word_weight_list in topics_dict.items():
    if t_id == -1:
        continue  # skip noise cluster
    top_terms = ", ".join([w for w, _ in word_weight_list[:TOP_N_WORDS]])
    print(f"[Topic {t_id:>2}] {top_terms}")

[Topic  0] nuclear weapons, nuclear, foreign policy, national security, iran, afghanistan, terrorism, threat, terrorist, terrorists
[Topic  1] happy, totally, answers, perfect, opposite, lucky, completely, glad, good job, check
[Topic  2] security medicare, social security, medicare, medicaid, obamacare, retirement, health insurance, bipartisan, tax cut, pension
[Topic  3] raising taxes, raise taxes, tax cuts, tax cut, tax reform, cut taxes, tax increase, tax rates, tax relief, paying taxes
[Topic  4] oil gas, energy policy, oil, natural gas, oil companies, coal, fuel, clean energy, gasoline, drilling
[Topic  5] supreme court, constitutional, judges, judge, appointed, constitution, courts, court, supreme, appoint
[Topic  6] debates, questioning, discussion, campaigns, election day, meeting, nominee, audience, rebuttal, moderate
[Topic  7] civil rights, law enforcement, racial, police officers, discrimination, race, justice, enforcement, african americans, police
[Topic  8] government s

In [63]:
# === REASSIGN BAD TOPICS TO 2ND-BEST CLUSTER ===

# topics you judged to be low quality or off-theme
bad_topics = {1, 12, 16}  # modify this set as needed

# copy for reassignment tracking
reassigned = 0
new_topics = topics.copy()

# iterate over each utterance and check if it needs reassignment
for i, (assigned_topic, topic_probs) in enumerate(zip(topics, probs)):
    if assigned_topic in bad_topics:
        # suppress bad topics in probability vector
        adjusted_probs = topic_probs.copy()
        adjusted_probs[list(bad_topics)] = -float("inf")

        # assign next highest topic
        new_topic = adjusted_probs.argmax()
        new_topics[i] = new_topic
        reassigned += 1

# update dataframe
debates_df_topic["topic"] = new_topics
print(f"Reassigned {reassigned:,} utterances from bad topics → next-best topic")

Reassigned 915 utterances from bad topics → next-best topic
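
The reassignment rule can be seen more easily on a toy example: for each utterance assigned to a rejected topic, the rejected topics are excluded and the most probable remaining topic is chosen. This is a pure-Python sketch with invented probabilities; `reassign_bad_topics` is an illustrative helper, and it assumes, like the cell above, that column `k` of the probability matrix corresponds to topic `k`.

```python
def reassign_bad_topics(topics, probs, bad_topics):
    """Move documents out of bad topics to their most probable remaining topic."""
    new_topics = list(topics)
    for i, (topic, row) in enumerate(zip(topics, probs)):
        if topic in bad_topics:
            # keep only (probability, topic_id) pairs for acceptable topics
            candidates = [(p, t) for t, p in enumerate(row) if t not in bad_topics]
            # on ties, max() keeps the first candidate, i.e. the lowest topic id
            new_topics[i] = max(candidates, key=lambda pt: pt[0])[1]
    return new_topics


toy_topics = [0, 1, 2, -1]
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
    [0.1, 0.1, 0.1],
]
result = reassign_bad_topics(toy_topics, toy_probs, bad_topics={1})
print(result)
```

Only the second document changes, moving from topic 1 to topic 2. Its new probability (0.3) is much lower than the original (0.6), which is why a next-best assignment deserves a confidence check afterwards.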


In [64]:
# === CHECK SAMPLE OF REASSIGNED UTTERANCES ===

# convert to arrays for safe indexing
topics = np.array(topics)
new_topics = np.array(new_topics)

# boolean mask of rows whose topic changed
reassigned_mask = topics != new_topics

# integer positions of reassigned rows
reassigned_indices = np.where(reassigned_mask)[0]

# build the df
reassigned_df = pd.DataFrame({
    "text": debates_df_topic.iloc[reassigned_indices][TEXT_COL].values,
    "original_topic": topics[reassigned_indices],
    "new_topic": new_topics[reassigned_indices],
    "speaker": debates_df_topic.iloc[reassigned_indices]["speaker"].values,
    "year": debates_df_topic.iloc[reassigned_indices]["year"].values,
    "prob_original": probs[reassigned_indices, topics[reassigned_indices]],
    "prob_new": probs[reassigned_indices, new_topics[reassigned_indices]]
})

# sample for inspection
sample_size = min(20, len(reassigned_df))
reassigned_df.sample(sample_size, random_state=RANDOM_SEED)

Unnamed: 0,text,original_topic,new_topic,speaker,year,prob_original,prob_new
380,is that what you're saying?,1,9,Romney,2012,0.683921,0.0221785
855,"well, you gotta talk-- you gotta talk 'em into it, joe. sometimes you gotta talk 'em into it.",16,9,Trump,2020,0.044663,0.01480894
355,"no, i had a question——",1,9,Romney,2012,0.533298,0.03673281
357,"all right, and it is?",1,9,Romney,2012,0.430262,0.03469385
362,"no, he got the first——",1,9,Romney,2012,0.139816,0.05924564
486,i'll be — i'll be very respectful.,1,9,Pence,2016,0.588778,0.03263134
677,he doesn't want to answer the question.,1,9,Trump,2020,0.154105,0.05172389
595,... but we believed that we could make the country better. and i was proud of that.,1,9,Trump,2016,0.032231,0.04641205
551,"well, first, let me start by saying that so much of what he's just said is not right, but he gets to run his campaign any way he chooses. he gets ...",12,7,Clinton Hillary,2016,0.124299,0.09702829
30,that's not bad. that's true.,1,9,Bush Sr,1988,1.0,2.662527e-308


In [65]:
# === FORCE LOW-CONFIDENCE + SHORT UTTERANCES TO NOISE CLUSTER ===

# convert to array if not already
topics = np.array(topics)
new_topics = np.array(new_topics)

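# define a low-probability + short utterance filter
# caveat: rows already labeled -1 index the last probability column (negative indices wrap);
# they are already noise, so labels are unaffected, but the printed count can include them
low_confidence_mask = (
    (probs[np.arange(len(probs)), topics] < 0.2) & 
    (probs[np.arange(len(probs)), new_topics] < 0.2) & 
    (debates_df_topic[TEXT_COL].str.len() < 50)
)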

print(f"Utterances forced to noise: {low_confidence_mask.sum():,}")

# apply it
new_topics[low_confidence_mask] = -1

# update dataframe
debates_df_topic["topic"] = new_topics

Utterances forced to noise: 438
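
The filter in the cell above looks up each row's probability by its topic id. Assuming the probability matrix has one column per non-noise topic, the noise label `-1` has no column of its own, and in both Python and NumPy an index of `-1` refers to the last element. The sketch below shows the wraparound and one possible guard; `assigned_probability` is a hypothetical helper, not something the notebook defines.

```python
def assigned_probability(prob_row, topic):
    """Probability of the assigned topic, or None for the noise label (-1)."""
    if topic == -1:
        return None  # noise has no column in the probability matrix
    return prob_row[topic]


row = [0.05, 0.10, 0.85]  # toy probabilities for topics 0, 1, 2

wrapped = row[-1]                        # plain indexing returns the last topic's value
guarded = assigned_probability(row, -1)  # explicit handling of the noise label

print(wrapped, guarded)
```

In the notebook this affects only the reported count, since rows that are already noise stay noise either way.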


### 3.3. Topic Labeling 

In [66]:
# === PREVIEW ONLY CLEANED TOPICS ===

TOP_N_WORDS = 10
N_EXAMPLES = 5

# make sure we use only your reassigned topics
final_topic_assignments = debates_df_topic["topic"]
valid_topic_ids = sorted(set(final_topic_assignments) - {-1})

# build topic preview table from scratch
manual_topic_preview = []

for topic_id in valid_topic_ids:
    subset_df = debates_df_topic[final_topic_assignments == topic_id]
    
    # get top tokens from CountVectorizer (cleaned stopwords already applied)
    transformed = vectorizer_model.transform(subset_df[TEXT_COL])
    total_counts = transformed.sum(axis=0).A1
    vocab = np.array(vectorizer_model.get_feature_names_out())
    top_indices = total_counts.argsort()[::-1][:TOP_N_WORDS]
    top_words = ", ".join(vocab[top_indices])
    
    # sample representative utterances
    samples = subset_df[TEXT_COL].sample(n=min(N_EXAMPLES, len(subset_df)), random_state=RANDOM_SEED).tolist()
    samples += [""] * (N_EXAMPLES - len(samples))  # pad to fixed number of columns

    manual_topic_preview.append({
        "topic_id": topic_id,
        "top_words": top_words,
        "sample_1": samples[0],
        "sample_2": samples[1],
        "sample_3": samples[2],
        "sample_4": samples[3],
        "sample_5": samples[4]
    })

topic_preview_df = pd.DataFrame(manual_topic_preview)

# display nicely
pd.set_option("display.max_colwidth", 150)
pd.set_option("display.max_rows", 30)
display(topic_preview_df)

Unnamed: 0,topic_id,top_words,sample_1,sample_2,sample_3,sample_4,sample_5
0,0,"nuclear, iraq, weapons, troops, iran, soviet, russia, peace, defense, countries","governor, president bush said we would leave iraq at the end of 2011. and, elaine, iraq didn't want our troops to stay, and they wouldn't give us ...","mr. mondale, in this general area, sir, of arms control, president carter's national security adviser, zbigniew brzezinski, said, ""a nuclear freez...","perhaps in no area do we disagree more than this administration's policies on human rights. i went to the philippines as vice president, pressed f...","mr. president, the eyes of the country tonight are on the hostages in iran. i realize this is a sensitive area, but the question of how we respond...","well, let me speak, first of all, to what the vice president just said, and then i'll answer that question. this, unfortunately -- what the vice p..."
1,2,"medicare, social, social security, insurance, tax, money, costs, seniors, cost, health insurance","well, i think it's pretty liberal; i'll put that label on it. when you take a look at all the programs you've advocated, mr. president, thank good...",let me just — let me just say this. we are not — we're saying don't change benefits for people 55 and above. they already organized their retireme...,what i support is no change for current retirees and near-retirees to medicare. and the president supports taking $716 billion out of that program.,"hal, president bush has had his health care reform agenda on capitol hill for 8 months. he's had parts of it up there for years. you talk about in...","well, of course, we're going to cover americans with pre-existing conditions. in fact, a lot of my family members have gotten health care, i belie..."
2,3,"tax, taxes, income, pay, class, middle class, tax cut, billion, cuts, tax cuts","governor, to follow up on your answer, in order for any kind of tax relief to really be felt by the middle- and lower-income people, according to ...",they're going to raise your taxes. we're going to cut your taxes.,"well gwen, where i come from, it's called fairness, just simple fairness. the middle class is struggling. the middle class under john mccain's tax...","no, mr. president, i'm asking you a question. will you tell us how much you paid in federal income taxes in 2016 and 2017?","look, we don't cut it. and i might add, this so-called — i know we don't want to use the fancy word ""sequester,"" this automatic cut — that was par..."
3,4,"energy, oil, clean, coal, climate, environment, gas, water, production, tax","... growth is unrealistic. and they say—you talk a lot about growing the energy industry. they say with oil prices as low as they are right now, t...","sure. so first of all, let's start with the hurricane because it's an unbelievable, unspeakable human tragedy. i just saw today, actually, a photo...","well, mr. greenberg, i simply cannot allow to go unpassed the statements that have just been made by mr. reagan, who once again, has demonstrated,...",you yourself said on multiple occasions when you were running for president that you would ban fracking. joe biden looked his supporter in the eye...,"you believe that human pollution, gas, greenhouse gas emissions contributes to the global warming of this planet?"
4,5,"court, supreme, supreme court, woman, women, pro, faith, church, child, amendment","oh, i'm it's my question. but whether i agree or disagree with some individual, or what he may say, or how he may say it, i don't think there's an...","vice president harris, i want to give you your time to respond. but i do want to ask, would you support any restrictions on a woman's right to an ...",-- issue of waffling. he's waffled on the abortion issue.,"i want to turn to the issue of abortion. president trump, you've often touted that you were able to kill roe v. wade. last year, you said that you...",i would consider anyone in their qualifications. i do not believe that someone who has supported roe v. wade that would be part of those qualifica...
5,6,"debates, news, university, commission, statements, audience, agreed, rules, night, nominee",we'll be right back with more from the cnn presidential debate live from georgia. [ commercial break ],and that brings us to the rules of tonight's debate: 90 minutes with two commercial breaks. no topics or questions have been shared with the campa...,good evening from the field house at washington university in st. louis. i'm jim lehrer of the news hour on pbs. and i welcome you to this third a...,"on behalf of the commission on presidential debates, i am pleased to welcome you to this vice presidential debate. i'm judy woodruff of pbs' macne...","good evening from the clark athletic center at the university of massachusetts in boston. i'm jim lehrer of the newshour on pbs, and i welcome you..."
6,7,"women, black, police, rights, african, race, racial, justice, enforcement, affirmative","time. the next question goes to you. gentlemen, this is the 21st century. yet on average an american working woman in our great nation earns 75 ce...",". . . if i might finish the question, what does re-imagining policing mean and do you support the black lives matter call uh, for uh, community co...","but we need—lester, we need law and order. and we need law and order in the inner cities, because the people that are most affected by what's happ...","thank you, vice president. in march, breonna taylor a 26-year-old emergency room technician in louisville, was shot and killed after police office...","well, first of all, those stories have been largely debunked. those people—i don't know those people. i have a feeling how they came. i believe it..."
7,8,"deficit, spending, budget, tax, billion, taxes, economy, debt, programs, money","well, the basic problem with it is it doesn't balance the budget. if you forecast it out, you still will have a significant deficit under each of ...","yes, i've been very specific about those, john. and let me lay out for you my own strategy for bring that deficit down, because as a chief executi...",that's the way you bring a deficit down and help to improve the quality of life for people at the same time.,"without any doubt, i have stood up and told the american people that that $263 billion deficit must come down. and i've done what no candidate for...","... the fact is, he's going to advocate for the largest tax cuts we've ever seen, three times more than the tax cuts under the bush administration..."
8,9,"experience, qualifications, role, happy, arms, finish, final, national security, explain, excuse","– and it's just – he interrupted me and i'd like to just finish, please. if you have a pre-existing condition, heart disease, diabetes, breast can...","– has to be taken by the federal government and when we took action, it had a favorable response.",this is the same man who told you-,"i may not be for your version, mr. vice president, but i'm for what i just described to the lady.",let me -- let me complete...
9,10,"economy, economic, tax, problems, opportunity, character, crisis, experience, unemployment, peace","i, i think what i'm going to have to do is i'm going to start correcting the vice-president's statistics. there are 6 million more people who have...","bill clinton's top priority is putting america back to work. bill clinton and i will create good, high-wage jobs for our people, the same way he h...","mr. perot, even if you've got what people say are the guts to take on changes in the most popular and the most sacred of the entitlements, medicar...","no, there's no difference on that. there is a difference, though, as to what the economy has meant. i think the economy has meant more for the gor...","i could run this string out a long time, but remember this, jim: those 209 americans last thursday night in richmond told us they wanted us to sto..."


In [67]:
# === MANUAL TOPIC LABELING  ===

topic_labels = {
    0: "foreign_policy_national_security",
    2: "healthcare_social_security",
    3: "tax_policy",
    4: "energy_environment",
    5: "judiciary_supreme_court",
    6: "debate_format_procedure",
    7: "civil_rights_law_enforcement",
    8: "government_spending_budget",
    9: "leadership_executive_experience",
    10: "partisan_gridlock_new_leadership",
    11: "education_public_schools",
    13: "immigration_borders",
    14: "gun_control",
    15: "electoral_politics_governance",
    17: "public_health_pandemics",
    18: "china_global_trade",
}

# fallback label for noise/unlabeled
DEFAULT_LABEL = "noise_or_unspecified"

# create new column with mapped labels
debates_df_topic["theme_name"] = debates_df_topic["topic"].map(topic_labels).fillna(DEFAULT_LABEL)

# check distribution
debates_df_topic["theme_name"].value_counts()

theme_name
noise_or_unspecified                2154
foreign_policy_national_security    1191
leadership_executive_experience      737
healthcare_social_security           393
tax_policy                           226
energy_environment                   222
judiciary_supreme_court              213
civil_rights_law_enforcement         193
debate_format_procedure              190
immigration_borders                  151
government_spending_budget           136
partisan_gridlock_new_leadership     124
education_public_schools             105
electoral_politics_governance         77
gun_control                           77
public_health_pandemics               66
china_global_trade                    61
Name: count, dtype: int64
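
The overview states that subtheme modeling is run within themes of at least 200 utterances. The sketch below applies that threshold to the counts printed above to see which themes qualify. Excluding `noise_or_unspecified` is an assumption made here, and the result says only which themes meet the size threshold, not which ones the notebook goes on to model.

```python
# Counts copied from the theme distribution printed above.
theme_counts = {
    "noise_or_unspecified": 2154,
    "foreign_policy_national_security": 1191,
    "leadership_executive_experience": 737,
    "healthcare_social_security": 393,
    "tax_policy": 226,
    "energy_environment": 222,
    "judiciary_supreme_court": 213,
    "civil_rights_law_enforcement": 193,
    "debate_format_procedure": 190,
    "immigration_borders": 151,
    "government_spending_budget": 136,
    "partisan_gridlock_new_leadership": 124,
    "education_public_schools": 105,
    "electoral_politics_governance": 77,
    "gun_control": 77,
    "public_health_pandemics": 66,
    "china_global_trade": 61,
}

MIN_THEME_SIZE = 200  # threshold stated in the notebook overview

# Themes large enough for a second round of topic modeling (noise excluded by assumption).
large_themes = [
    name
    for name, n in theme_counts.items()
    if name != "noise_or_unspecified" and n >= MIN_THEME_SIZE
]

total = sum(theme_counts.values())
noise_share = theme_counts["noise_or_unspecified"] / total

print(large_themes)
print(f"{total:,} utterances, noise share {noise_share:.1%}")
```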

## 4. Hierarchical Topic Modeling

In [101]:
# === REUSABLE FUNCTION FOR SUBTOPIC MODELING ===

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
import umap
import hdbscan

def run_subtopic_modeling(df, text_col, embeddings, min_cluster_size=15, min_df=2):
    """
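    Run BERTopic on a subset of utterances to discover subtopics within a main theme.
    Relies on the notebook-level `sbert`, `custom_stopwords`, `CountVectorizer`, and `RANDOM_SEED`.
    Returns the fitted model, the topic assignments, the probability matrix, and a copy of
    the DataFrame with `subtopic` and `subtopic_prob` columns added.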
    """
    # configure vectorizer
    vectorizer_model = CountVectorizer(
        stop_words=list(custom_stopwords),
        ngram_range=(1, 2),
        min_df=min_df,
        max_df=0.75
    )

    # configure BERTopic
    subtopic_model = BERTopic(
        embedding_model=sbert,
        umap_model=umap.UMAP(
            n_neighbors=10,
            n_components=3,
            min_dist=0.0,
            metric="cosine",
            random_state=RANDOM_SEED
        ),
        hdbscan_model=hdbscan.HDBSCAN(
            min_cluster_size=min_cluster_size,
            metric="euclidean",
            cluster_selection_method="eom",
            prediction_data=True
        ),
        vectorizer_model=vectorizer_model,
        representation_model=KeyBERTInspired(),
        calculate_probabilities=True,
        verbose=False
    )

    # run BERTopic
    subtopic_texts = df[text_col].tolist()
    topics, probs = subtopic_model.fit_transform(subtopic_texts, embeddings)

    # add to df
    df = df.copy()
    df["subtopic"] = topics
    df["subtopic_prob"] = probs.max(axis=1)

    return subtopic_model, topics, probs, df

# ensure destination columns exist exactly once
for _c in ("subtopic", "subtopic_prob", "subtheme_name"):
    if _c not in debates_df_topic.columns:
        debates_df_topic[_c] = pd.NA

### 4.1. Foreign Policy and National Security

In [103]:
# === RUN SUBTOPIC MODELING FOR "FOREIGN_POLICY_NATIONAL_SECURITY" ===

# filter data and embeddings
theme_name = "foreign_policy_national_security"
theme_mask = debates_df_topic["theme_name"] == theme_name
df_theme = debates_df_topic[theme_mask]
embeddings_theme = np.array(embeddings)[theme_mask]

# run subtopic discovery
subtopic_model, subtopics, subtopic_probs, df_theme_subtopics = run_subtopic_modeling(
    df=df_theme,
    text_col=TEXT_COL,
    embeddings=embeddings_theme,
    min_cluster_size=20,
    min_df=3
)

print(f"Subtopics found in '{theme_name}':", len(set(subtopics)) - (1 if -1 in subtopics else 0))

Subtopics found in 'foreign_policy_national_security': 12
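
The count printed above excludes the outlier label -1. The same arithmetic can be sketched on a small made-up list of assignments (`toy_topics` is a hypothetical stand-in for the `subtopics` list), together with the share of utterances left unassigned:

```python
from collections import Counter

# Hypothetical topic assignments as returned by fit_transform; -1 is the outlier label.
toy_topics = [0, 1, -1, 0, 2, -1, 1, 0]

counts = Counter(toy_topics)

# Number of real subtopics: distinct labels minus the outlier label, if present.
n_subtopics = len(counts) - (1 if -1 in counts else 0)

# Share of utterances that were not assigned to any subtopic.
outlier_share = counts[-1] / len(toy_topics)

print(n_subtopics)     # 3
print(outlier_share)   # 0.25
```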


In [104]:
# === PREVIEW SUBTOPICS ===

TOP_N_WORDS = 10
N_EXAMPLES = 5

subtopic_info = subtopic_model.get_topic_info()
valid_subtopic_ids = subtopic_info[subtopic_info.Topic != -1]["Topic"].tolist()

# build preview table
preview_rows = []
for subtopic_id in valid_subtopic_ids:
    top_words = ", ".join([w for w, _ in subtopic_model.get_topic(subtopic_id)[:TOP_N_WORDS]])
    sample_texts = (
        df_theme_subtopics[df_theme_subtopics["subtopic"] == subtopic_id][TEXT_COL]
        .head(N_EXAMPLES)
        .tolist()
    )
    preview_rows.append({
        "subtopic_id": subtopic_id,
        "top_words": top_words,
        "sample_1": sample_texts[0] if len(sample_texts) > 0 else "",
        "sample_2": sample_texts[1] if len(sample_texts) > 1 else "",
        "sample_3": sample_texts[2] if len(sample_texts) > 2 else "",
        "sample_4": sample_texts[3] if len(sample_texts) > 3 else "",
        "sample_5": sample_texts[4] if len(sample_texts) > 4 else "",
    })

subtopic_preview_df = pd.DataFrame(preview_rows)

# display nicely
pd.set_option("display.max_colwidth", 150)
pd.set_option("display.max_rows", 30)
display(subtopic_preview_df)

Unnamed: 0,subtopic_id,top_words,sample_1,sample_2,sample_3,sample_4,sample_5
0,0,"russia, russian, russians, putin, vladimir putin, crimea, ukraine, cold war, vladimir, syria",some democrats cringe at the words spying and covert activity. do you believe both of them have a legitimate role in countering terrorist activity...,—and the people of the soviet union want it to stop.,"again, if you're not rich, you're not a superpower, so we have two that i'd put as number one. i have a ""1"" and ""1a."" one is, we've got to have th...","well, it's cost-effective to help russia succeed in its revolution. it's pennies on the dollar compared to going back to the cold war. russia's st...","i'd rather answer her question first, and then i'll be glad to, because the question you ask is important. the end of the cold war brings an incre..."
1,1,"nuclear weapons, nuclear arms, strategic arms, nuclear proliferation, nuclear war, use nuclear, soviets, soviet union, treaty, missiles","mr. vice president, according to news dispatches soviet premier khrushchev said today that prime minister macmillan had assured him that there wou...","no, of course i haven't talked to prime minister macmillan. it would not be appropriate for me to do so. the president is still going to be presid...","mr. mcgee, we have a contractual right to be in berlin coming out of the conversations at potsdam and of world war ii. that has been reinforced by...","senator kennedy, last week you said that before we should hold another summit conference, that it was important that the united states build its s...","well i think we should st- strengthen our conventional forces, and we should attempt in january, february, and march of next year to increase the ..."
2,2,"sanctions iran, threat iran, iran nuclear, nuclear program, sanctions, nuclear weapons, north korea, diplomacy, nuclear weapon, getting nuclear","governor carter apparently doesn't realize that since i've been president, we have sold to the israelis over $4 billion in military hardware. we h...",i would hope that as we move to one area of the world from another--and the united states must not spread itself too thinly; that was one of the p...,"well, i firmly believe, mr. kraft, that it's unwise for a president to signal in advance what options he might exercise if any international probl...","now we have a chance. now we have a chance. and, so, i think that i'd leave it right there and say that you judge on the whole record. and let me ...","did he state your position correctly, you're not calling for eliminating the sanctions, are you?"
3,3,"laden, afghanistan pakistan, bin laden, qaeda, afghan, osama, al qaeda, osama bin, troops iraq, al qaida","well, the invasion of afghanistan didn't take place on our watch. i have described what has happened in iran, and we weren't here then either. i d...",i too thank the university of miami and say our prayers are with the good people of this state who've suffered a lot. september the 11th changed h...,"no, i don't believe it's going to happen. i believe i'm going to win because the american people know i know how to lead. i've shown the american ...","no, every life is precious. every life matters. you know, my hardest—the hardest part of the job is to know that i committed the troops in harm's ...",here's what it means: it means that saddam hussein needed to be confronted. john kerry and i have consistently said that. that's why we voted for ...
4,4,"defeat isis, isis, syria, iraqi, terrorist threat, assad, bashar assad, humanitarian crisis, american troops, bashar",r.c. east is the most dangerous place in the world.,"nobody is proposing to send troops to syria. american troops. now, let me say it this way. how would we do things differently? we wouldn't refer t...","let me — you don't go through the u.n. we are in the process now — and have been for months — in making sure that help, humanitarian aid, as well ...","well, we agree with the same red line, actually, they do on chemical weapons, but not putting american troops in, other than to secure those chemi...",all right. the president . that will help us maintain the kind of american leadership that we need. syria
5,5,"foreign relations, castro, khrushchev, kennedy, truman, treaty, vietnam, cuba, previous administration, eisenhower","mr. smith, mr. nixon. in the election of 1860, abraham lincoln said the question was whether this nation could exist half-slave or half-free. in t...","mr. smith, senator kennedy. the things that senator kennedy has said many of us can agree with. there is no question but that we cannot discuss ou...",it would be rather difficult to cover them in eight and- in two and a half minutes. i would suggest that these proposals could be mentioned. first...,"senator kennedy, on another subject, communism is so often described as an ideology or a belief that exists somewhere other than in the united sta...",i agree with senator kennedy's appraisal generally in this respect. the question of communism within the united states has been one that has worri...
6,6,"invasion iraq, saddam, war iraq, free iraq, saddam hussein, hussein, wrong war, war wrong, war terror, iraqi","sir, this question concerns your administrative performance as president. the other day, general george brown, the chairman of the joint chiefs of...","i have indicated to general brown that the words that he used in that interview, in that particular case, and in several others were very ill-advi...","well, just briefly, i think this is the second time that general brown has made a statement for which he did have to apologize--and i know that ev...","let's take mr. bush for the moment at his word. i mean, he's right, we don't have any evidence, at least, that our government did tell saddam huss...",it's awful easy when you're dealing with 90/90 hindsight. we did try to bring saddam hussein into the family of nations. he did have the fourth la...
7,7,"patriot act, homeland security, patriot, american citizens, terrorist attack, fbi, rights, intelligence surge, citizens, cia","mr. president, the real problem with the fbi-in fact, all of the intelligence agencies-is there are no real laws governing them. such laws as ther...","you are familiar, of course, with the fact that i am the first president in 30 years who has reorganized the intelligence agencies in the federal ...",there has been too much government secrecy and not enough respect for the personal privacy of american citizens.,"mr. president, the government [general] accounting office has just put out a report suggesting that you shot from the hip in the mayaguez rescue m...","the white house did not prevent the release of that report. on july 12 of this year, we gave full permission for the release of that report. i was..."
8,8,"budget, spending, national defense, trillion dollars, navy, spend, billion dollars, billions dollars, trillion, strategic",governor carter again is talking in broad generalities. let me take just one question that he raises--the military strength and capability of the ...,"well, mr. ford, unfortunately, has just made a statement that's not true. i have never advocated a communist government for italy; that would, obv...","governor, we always seem, in our elections, and maybe in between, too, to argue about who can be tougher in the world. give or take a few billion ...","well, always in the past we have had an ability to have a strong defense and also to have a strong domestic economy and also to be strong in our r...","if i hear you right, sir, you are saying guns and butter both, but president johnson also had trouble keeping up both vietnam and his domestic pro..."
9,9,"hamas, israeli, said israel, israelis, netanyahu, peace middle, humanitarian, hussein, arab, saddam","i'd like to go back just one moment to the previous question, where mr. ford, i think, confused the issue by trying to say that we're shipping isr...","governor carter, if the price of gaining influence among the arabs is closing our eyes a little bit to their boycott against israel, how would you...","what we've done is to support arab stores that want to stand up against international terror, quite different. we believe in supporting, without j...","well, governor, and vice president bush, you've both talked tonight about hard choices. let me try to give you one. somewhere in the middle east t...","we have a very consistent policy in the middle east: it is to support the peace process, to support the security of israel, and to support those w..."
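
The preview cell fills `sample_1` to `sample_5` with five separate conditional expressions. One possible way to express the same padding logic is a small helper; `build_preview_row` is a hypothetical name introduced here, not part of the notebook:

```python
def build_preview_row(subtopic_id, top_words, sample_texts, n_examples=5):
    """Return one preview row, padding missing samples with empty strings."""
    padded = list(sample_texts[:n_examples])
    padded += [""] * (n_examples - len(padded))
    row = {"subtopic_id": subtopic_id, "top_words": ", ".join(top_words)}
    # sample_1 ... sample_n, in the same order as the original columns
    row.update({f"sample_{i}": text for i, text in enumerate(padded, start=1)})
    return row


# Illustrative call with only two example utterances available.
row = build_preview_row(0, ["russia", "putin"], ["first utterance", "second utterance"])
print(list(row))
print(repr(row["sample_3"]))
```

A list of such rows can be passed to `pd.DataFrame` exactly as `preview_rows` is in the cell above.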


In [106]:
# === MANUAL SUBTOPIC LABELING: foreign_policy_national_security ===

foreign_policy_subthemes = {
    0: "russia_soviet_union",
    1: "nuclear_arms_treaties",
    2: "iran_nuclear_program",
    3: "afghanistan_alqaeda_binladen",
    4: "syria_isis_terrorism",
    5: "historical_foreign_policy_cuba_vietnam",
    6: "iraq_invasion",
    7: "patriot_act_homeland_security",
    8: "military_budget",
    9: "israel_middle_east_peace",
    10: "military_deployments",
    11: "humanitarian_interventions_bosnia_kosovo"
}

# === WRITE-BACK (safe alignment on df_theme.index) ===
debates_df_topic.loc[df_theme.index, "subtopic"] = (
    pd.Series(df_theme_subtopics["subtopic"].values, index=df_theme.index).astype("Int64")
)
debates_df_topic.loc[df_theme.index, "subtopic_prob"] = df_theme_subtopics["subtopic_prob"].values
debates_df_topic.loc[df_theme.index, "subtheme_name"] = (
    df_theme_subtopics["subtopic"].map(foreign_policy_subthemes).astype("object").values
)

print("Foreign Policy – subtopics written back.")
print(debates_df_topic.loc[df_theme.index, "subtheme_name"].value_counts(dropna=False))

Foreign Policy – subtopics written back.
subtheme_name
NaN                                         379
russia_soviet_union                         119
nuclear_arms_treaties                       110
iran_nuclear_program                         93
afghanistan_alqaeda_binladen                 83
syria_isis_terrorism                         69
historical_foreign_policy_cuba_vietnam       68
iraq_invasion                                68
patriot_act_homeland_security                60
military_budget                              46
israel_middle_east_peace                     41
military_deployments                         32
humanitarian_interventions_bosnia_kosovo     23
Name: count, dtype: int64
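
The `NaN` rows in the counts above follow from how the write-back works: `.map` returns `NaN` for any subtopic id without an entry in the label dictionary, and the outlier id -1 has none. A toy sketch of the same pattern, using a hypothetical four-row frame and made-up labels:

```python
import pandas as pd

# Hypothetical parent frame with a non-contiguous index, as after earlier filtering.
parent = pd.DataFrame(
    {"theme_name": ["a", "b", "a", "a"], "subtheme_name": [None] * 4},
    index=[10, 11, 12, 13],
)

# Rows of theme "a" and their made-up subtopic ids; -1 is the outlier label.
theme_rows = parent[parent["theme_name"] == "a"]
toy_subtopics = pd.Series([0, -1, 1], index=theme_rows.index)

toy_labels = {0: "label_zero", 1: "label_one"}  # deliberately no entry for -1

# .map leaves unmapped ids as NaN; .values writes positionally into the
# rows selected by theme_rows.index, so the other theme is not touched.
parent.loc[theme_rows.index, "subtheme_name"] = toy_subtopics.map(toy_labels).values

print(parent["subtheme_name"].isna().tolist())  # [False, True, True, False]
```

Row 11 stays empty because it belongs to another theme, while row 12 is empty because it is an outlier. If the two cases need to be distinguished later, the numeric `subtopic` column (-1 versus missing) carries that information.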


### 4.2. Leadership and Executive Experience (Failed: No Coherent Subthemes)

In [107]:
# === RUN SUBTOPIC MODELING FOR "leadership_executive_experience" ===

# filter data and embeddings
theme_name = "leadership_executive_experience"
theme_mask = debates_df_topic["theme_name"] == theme_name
df_theme = debates_df_topic[theme_mask]
embeddings_theme = np.array(embeddings)[theme_mask]

# run subtopic discovery
subtopic_model, subtopics, subtopic_probs, df_theme_subtopics = run_subtopic_modeling(
    df=df_theme,
    text_col=TEXT_COL,
    embeddings=embeddings_theme,
    min_cluster_size=20,
    min_df=3
)

print(f"Subtopics found in '{theme_name}':", len(set(subtopics)) - (1 if -1 in subtopics else 0))

Subtopics found in 'leadership_executive_experience': 3


In [108]:
# === PREVIEW SUBTOPICS ===

TOP_N_WORDS = 10
N_EXAMPLES = 5

subtopic_info = subtopic_model.get_topic_info()
valid_subtopic_ids = subtopic_info[subtopic_info.Topic != -1]["Topic"].tolist()

# build preview table
preview_rows = []
for subtopic_id in valid_subtopic_ids:
    top_words = ", ".join([w for w, _ in subtopic_model.get_topic(subtopic_id)[:TOP_N_WORDS]])
    sample_texts = (
        df_theme_subtopics[df_theme_subtopics["subtopic"] == subtopic_id][TEXT_COL]
        .head(N_EXAMPLES)
        .tolist()
    )
    preview_rows.append({
        "subtopic_id": subtopic_id,
        "top_words": top_words,
        "sample_1": sample_texts[0] if len(sample_texts) > 0 else "",
        "sample_2": sample_texts[1] if len(sample_texts) > 1 else "",
        "sample_3": sample_texts[2] if len(sample_texts) > 2 else "",
        "sample_4": sample_texts[3] if len(sample_texts) > 3 else "",
        "sample_5": sample_texts[4] if len(sample_texts) > 4 else "",
    })

subtopic_preview_df = pd.DataFrame(preview_rows)

# display nicely
pd.set_option("display.max_colwidth", 150)
pd.set_option("display.max_rows", 30)
display(subtopic_preview_df)

Unnamed: 0,subtopic_id,top_words,sample_1,sample_2,sample_3,sample_4,sample_5
0,0,"said, ve, read said, like ve, ll, , , , ,","can i have the summation time please? we've completed our questions and our comments, and in just a moment, we'll have the summation time.",i don't have any response.,i'll catch up with it later.,you remember the last time you said that?,"no, i might as well just go with-"
1,1,"speak, ve, law, decision, talking, change, like ve, gentlemen, follow, work","senator, the vice president in his campaign has said that you were naive and at times immature. he has raised the question of leadership. on this ...","mr. vice president, your campaign stresses the value of your eight year experience, and the question arises as to whether that experience was as a...","mr. vice president, do i take it then you believe that you can work better with democratic majorities in the house and senate than senator kennedy...","gentlemen, we have approximately four minutes remaining. may i ask you to make your questions and answers as brief as possible consistent with cla...","gentlemen, if i may remind you, time is growing short, so please keep your questions and answers as brief as possible consistent with clarity. mr...."
2,2,"law, talking, speak, gentlemen, follow, read said, decision, come, said, doesn","– has to be taken by the federal government and when we took action, it had a favorable response.","and, incidentally, may i say that that's the decision of the congress, and the president has concurred.","in 1987, you wrote a letter, and we'll pass this out to the media --","and then, when they were included in a plan that the congress passed, --",i don't use language like that and i don't think that we should.
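
The top words above consist largely of contraction fragments and empty strings, which is why no labels are assigned for this theme. A rough screening heuristic could flag such clusters automatically. This is only a sketch and not the procedure used in this notebook; the `uninformative_share` function and its `filler` set are assumptions chosen for illustration:

```python
def uninformative_share(top_words, filler=frozenset({"said", "ve", "ll", "like", "doesn"})):
    """Share of top terms that are empty or consist only of filler tokens."""
    def is_filler(term):
        tokens = term.split()
        return not tokens or all(tok in filler for tok in tokens)

    return sum(is_filler(term) for term in top_words) / len(top_words)


# Top words of leadership subtopic 0 as printed above (empty strings included).
leadership_0 = ["said", "ve", "read said", "like ve", "ll", "", "", "", "", ""]

# Top words of foreign-policy subtopic 0 from section 4.1, for contrast.
foreign_policy_0 = ["russia", "russian", "russians", "putin", "vladimir putin",
                    "crimea", "ukraine", "cold war", "vladimir", "syria"]

print(uninformative_share(leadership_0))      # 0.9
print(uninformative_share(foreign_policy_0))  # 0.0
```

Any threshold applied to this share would be a judgment call, and manual inspection of the sample utterances remains the deciding step.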


### 4.3. Healthcare and Social Security

In [111]:
# === RUN SUBTOPIC MODELING FOR "healthcare_social_security" ===

# filter data and embeddings
theme_name = "healthcare_social_security"
theme_mask = debates_df_topic["theme_name"] == theme_name
df_theme = debates_df_topic[theme_mask]
embeddings_theme = np.array(embeddings)[theme_mask]

# run subtopic discovery
subtopic_model, subtopics, subtopic_probs, df_theme_subtopics = run_subtopic_modeling(
    df=df_theme,
    text_col=TEXT_COL,
    embeddings=embeddings_theme,
    min_cluster_size=20,
    min_df=3
)

print(f"Subtopics found in '{theme_name}':", len(set(subtopics)) - (1 if -1 in subtopics else 0))

Subtopics found in 'healthcare_social_security': 4


In [112]:
# === PREVIEW SUBTOPICS ===

TOP_N_WORDS = 10
N_EXAMPLES = 5

subtopic_info = subtopic_model.get_topic_info()
valid_subtopic_ids = subtopic_info[subtopic_info.Topic != -1]["Topic"].tolist()

# build preview table
preview_rows = []
for subtopic_id in valid_subtopic_ids:
    top_words = ", ".join([w for w, _ in subtopic_model.get_topic(subtopic_id)[:TOP_N_WORDS]])
    sample_texts = (
        df_theme_subtopics[df_theme_subtopics["subtopic"] == subtopic_id][TEXT_COL]
        .head(N_EXAMPLES)
        .tolist()
    )
    preview_rows.append({
        "subtopic_id": subtopic_id,
        "top_words": top_words,
        "sample_1": sample_texts[0] if len(sample_texts) > 0 else "",
        "sample_2": sample_texts[1] if len(sample_texts) > 1 else "",
        "sample_3": sample_texts[2] if len(sample_texts) > 2 else "",
        "sample_4": sample_texts[3] if len(sample_texts) > 3 else "",
        "sample_5": sample_texts[4] if len(sample_texts) > 4 else "",
    })

subtopic_preview_df = pd.DataFrame(preview_rows)

# display nicely
pd.set_option("display.max_colwidth", 150)
pd.set_option("display.max_rows", 30)
display(subtopic_preview_df)

Unnamed: 0,subtopic_id,top_words,sample_1,sample_2,sample_3,sample_4,sample_5
0,0,"affordable care, health insurance, obamacare, cost health, care costs, care plan, buy health, insurance companies, private health, care act","mr. vice president, you've said you want a kinder, gentler presidency, one that helps the less fortunate. today, 37 million americans including ma...","well, no, it's no answer to those 37 million people, most of them members of working families who don't have a dime of health insurance and don't ...",i thought the oregon plan should at least have been allowed to be tried because at least the people in oregon were trying to do something. let me ...,-- going to have new taxes. i hope you talked to them about the fact that you were going to increase spending to $220 billion. i'm sure what you d...,"okay, let's move on now. i would like to remind the audience of one thing. trying to stop you from applauding may be a lost cause. i didn't say an..."
1,1,"security social, security fund, trust fund, reform social, pension, fiscal, cut benefits, funds, benefits senior, trillion dollars",i didn't indicate. i did not advocate reducing the federal debt because i don't believe that you're going to be able to reduce the federal debt ve...,"governor reagan, wage earners in this country — especially the young — are supporting a social security system that continues to affect their inco...","the social security system was based on a false premise, with regard to how fast the number of workers would increase and how fast the number of r...","yes, president carter. wage earners in this country, especially the young, are supporting a social security system that continues to affect their ...","as long as there's a democratic president in the white house, we will have a strong and viable social security system, free of the threat of bankr..."
2,2,"save medicare, medicare problem, medicare social, benefits senior, care elderly, prescription drugs, prescription drug, senior citizen, prescripti...","there you go again. [laughter] when i opposed medicare, there was another piece of legislation meeting the same problem before the congress. i hap...","yes. with regard to medicare, no, but it's time for us to say that medicare is in pretty much the same condition that social security was, and som...","governor clinton, ann compton has brought up medicare. i remember in 1965 when wilbur mills of arkansas, the chairman of ways and means, was pushi...","well, i must say i looked back at the vote on medicare in 1965—we had a program called eldercare that also provided drugs and was means-tested so ...","before i answer that, jim, let me just say it is disgraceful, the campaign being waged to scare the american senior citizens, in this state and my..."
3,3,"save medicare, tax cuts, affordable care, fiscal, medicare social, trillion debt, balanced budget, budget plan, bipartisan support, tax relief","- reducing the interest rate. in my judgment, the hard money, tight money policy, fiscal policy of this administration has contributed to the slow...",senator kennedy has indicated on several occasions in this program tonight that i have been misstating his record and his figures. i will issue a ...,perhaps the most pronounced difference that ref- separates the democratic and the republican candidates is reflected in the question that was just...,"well, i think both the presidential budget and our estimates agree that if we move back to full employment, as we intend to do, and achieve a five...","you said it when president carter said that you were going to cut medicare, and you said, ""oh, no, there you go again, mr. president."" and what di..."


In [113]:
# === MANUAL SUBTOPIC LABELING: healthcare_social_security ===

healthcare_subthemes = {
    0: "affordable_care_health_insurance",
    1: "social_security_pensions",
    2: "medicare_prescription_drugs",
    3: "budget_medicare_tax_policy"
}

# === WRITE-BACK (safe alignment on df_theme.index) ===
debates_df_topic.loc[df_theme.index, "subtopic"] = (
    pd.Series(df_theme_subtopics["subtopic"].values, index=df_theme.index).astype("Int64")
)
debates_df_topic.loc[df_theme.index, "subtopic_prob"] = df_theme_subtopics["subtopic_prob"].values
debates_df_topic.loc[df_theme.index, "subtheme_name"] = (
    df_theme_subtopics["subtopic"].map(healthcare_subthemes).astype("object").values
)

print("Healthcare & Social Security – subtopics written back.")
print(debates_df_topic.loc[df_theme.index, "subtheme_name"].value_counts(dropna=False))

Healthcare & Social Security – subtopics written back.
subtheme_name
NaN                                 175
affordable_care_health_insurance    100
social_security_pensions             61
medicare_prescription_drugs          34
budget_medicare_tax_policy           23
Name: count, dtype: int64
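
The label dictionaries are keyed by the integer ids of one particular model run, so a rerun that yields different ids would leave some subtopics unlabelled or attach labels to the wrong clusters. A small consistency check before the write-back can catch this; `check_label_coverage` is a hypothetical helper, shown here on made-up ids:

```python
def check_label_coverage(found_ids, labels, outlier_id=-1):
    """Return (ids without a label, label keys without a matching id)."""
    found = set(found_ids) - {outlier_id}
    missing = sorted(found - set(labels))
    unused = sorted(set(labels) - found)
    return missing, unused


# Hypothetical run: subtopic 3 was found but never labelled, label 4 matches nothing.
toy_found = [0, 1, -1, 2, 3, 0]
toy_labels = {0: "a", 1: "b", 2: "c", 4: "d"}

print(check_label_coverage(toy_found, toy_labels))  # ([3], [4])
```

In the notebook, the first argument would be the `subtopics` list and the second a dictionary such as `healthcare_subthemes`; two empty lists indicate that ids and labels match.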


### 4.4. Tax Policy

In [114]:
# === RUN SUBTOPIC MODELING FOR "tax_policy" ===

theme_name = "tax_policy"
theme_mask = debates_df_topic["theme_name"] == theme_name
df_theme = debates_df_topic[theme_mask]
embeddings_theme = np.array(embeddings)[theme_mask]

subtopic_model, subtopics, subtopic_probs, df_theme_subtopics = run_subtopic_modeling(
    df=df_theme,
    text_col=TEXT_COL,
    embeddings=embeddings_theme,
    min_cluster_size=20,
    min_df=2
)

print(f"Subtopics found in '{theme_name}':", len(set(subtopics)) - (1 if -1 in subtopics else 0))

Subtopics found in 'tax_policy': 2


In [115]:
# === PREVIEW SUBTOPICS ===

TOP_N_WORDS = 10
N_EXAMPLES = 5

subtopic_info = subtopic_model.get_topic_info()
valid_subtopic_ids = subtopic_info[subtopic_info.Topic != -1]["Topic"].tolist()

preview_rows = []
for subtopic_id in valid_subtopic_ids:
    top_words = ", ".join([w for w, _ in subtopic_model.get_topic(subtopic_id)[:TOP_N_WORDS]])
    sample_texts = (
        df_theme_subtopics[df_theme_subtopics["subtopic"] == subtopic_id][TEXT_COL]
        .head(N_EXAMPLES)
        .tolist()
    )
    preview_rows.append({
        "subtopic_id": subtopic_id,
        "top_words": top_words,
        "sample_1": sample_texts[0] if len(sample_texts) > 0 else "",
        "sample_2": sample_texts[1] if len(sample_texts) > 1 else "",
        "sample_3": sample_texts[2] if len(sample_texts) > 2 else "",
        "sample_4": sample_texts[3] if len(sample_texts) > 3 else "",
        "sample_5": sample_texts[4] if len(sample_texts) > 4 else "",
    })

subtopic_preview_df = pd.DataFrame(preview_rows)

pd.set_option("display.max_colwidth", 150)
pd.set_option("display.max_rows", 30)
display(subtopic_preview_df)

Unnamed: 0,subtopic_id,top_words,sample_1,sample_2,sample_3,sample_4,sample_5
0,0,"raise taxes, increase taxes, cut taxes, taxes middle, tax increase, tax rates, middle income, tax rate, new taxes, tax relief","mr. president, i would like to continue for a moment on this question of taxes which you have just raised. you have said that you favor more tax c...","at the time, mr. gannon, that i made the recommendation for a $28 billion tax cut-three-quarters of it to go to individual taxpayers and 25 percen...","mr. president, to follow up a moment, the congress has passed a tax bill which is before you now which did not meet exactly the sort of outline th...",that tax bill does not entirely meet the criteria that i established. i think the congress should have added another $10 billion reduction in pers...,"well, mr. ford is changing considerably his previous philosophy. the present tax structure is a disgrace to this country. it's just a welfare prog..."
1,1,"taxes paying, federal taxes, paying taxes, paid taxes, pay taxes, pay tax, eliminate tax, pay federal, tax credits, little tax","vice-president bush, last year you paid less than 13 percent of your income in federal taxes. according to the irs, someone in your bracket normal...","what that figure - and i kind of like the way mrs. ferraro and mr. zaccaro reported - because they reported federal taxes, state and local taxes -...","i want to respond to that, i want to respond to that. george bush, in case you've forgotten, dan, said ""read my lips -- no new taxes."" (laughter a...","i can see my wife and i think she's saying, ""i think he should go out into the private sector.""",this taxes a million small businesses. he keeps trying to make you think that it's just some movie star or hedge fund guy or an actor...


In [116]:
# === MANUAL SUBTOPIC LABELING: tax_policy ===

tax_policy_subthemes = {
    0: "tax_cuts_policy_proposals",
    1: "tax_burden_fairness_inequality"
}

# === WRITE-BACK (safe alignment on df_theme.index) ===
debates_df_topic.loc[df_theme.index, "subtopic"] = (
    pd.Series(df_theme_subtopics["subtopic"].values, index=df_theme.index).astype("Int64")
)
debates_df_topic.loc[df_theme.index, "subtopic_prob"] = df_theme_subtopics["subtopic_prob"].values
debates_df_topic.loc[df_theme.index, "subtheme_name"] = (
    df_theme_subtopics["subtopic"].map(tax_policy_subthemes).astype("object").values
)

print("Tax Policy – subtopics written back.")
print(debates_df_topic.loc[df_theme.index, "subtheme_name"].value_counts(dropna=False))

Tax Policy – subtopics written back.
subtheme_name
tax_cuts_policy_proposals         152
tax_burden_fairness_inequality     43
NaN                                31
Name: count, dtype: int64


### 4.5. Energy and Environment

In [117]:
# === RUN SUBTOPIC MODELING FOR "energy_environment" ===

theme_name = "energy_environment"
theme_mask = debates_df_topic["theme_name"] == theme_name
df_theme = debates_df_topic[theme_mask]
embeddings_theme = np.array(embeddings)[theme_mask]

subtopic_model, subtopics, subtopic_probs, df_theme_subtopics = run_subtopic_modeling(
    df=df_theme,
    text_col=TEXT_COL,
    embeddings=embeddings_theme,
    min_cluster_size=20,
    min_df=2
)

print(f"Subtopics found in '{theme_name}':", len(set(subtopics)) - (1 if -1 in subtopics else 0))

Subtopics found in 'energy_environment': 2


In [118]:
# === PREVIEW SUBTOPICS ===

TOP_N_WORDS = 10
N_EXAMPLES = 5

subtopic_info = subtopic_model.get_topic_info()
valid_subtopic_ids = subtopic_info[subtopic_info.Topic != -1]["Topic"].tolist()

preview_rows = []
for subtopic_id in valid_subtopic_ids:
    top_words = ", ".join([w for w, _ in subtopic_model.get_topic(subtopic_id)[:TOP_N_WORDS]])
    sample_texts = (
        df_theme_subtopics[df_theme_subtopics["subtopic"] == subtopic_id][TEXT_COL]
        .head(N_EXAMPLES)
        .tolist()
    )
    preview_rows.append({
        "subtopic_id": subtopic_id,
        "top_words": top_words,
        "sample_1": sample_texts[0] if len(sample_texts) > 0 else "",
        "sample_2": sample_texts[1] if len(sample_texts) > 1 else "",
        "sample_3": sample_texts[2] if len(sample_texts) > 2 else "",
        "sample_4": sample_texts[3] if len(sample_texts) > 3 else "",
        "sample_5": sample_texts[4] if len(sample_texts) > 4 else "",
    })

subtopic_preview_df = pd.DataFrame(preview_rows)

pd.set_option("display.max_colwidth", 150)
pd.set_option("display.max_rows", 30)
display(subtopic_preview_df)

Unnamed: 0,subtopic_id,top_words,sample_1,sample_2,sample_3,sample_4,sample_5
0,0,"oil industry, giving oil, foreign oil, oil gas, oil imports, oil coal, gas prices, price gasoline, big oil, alternative energy","well, among my other experiences in the past i've been a nuclear engineer, and i did graduate work in this field. i think i know the capabilities ...","mr. movers, in addition to saying that this is no time for a tax cut, in view of the incipient signs of renewed inflation, in addition to calling ...","well, i cannot see where a $.50 a gallon tax applied to gasoline would have changed the price of gasoline. it would still have gone up as much as ...","well, i believe that conservation, at course, is worthy in and of itself. anything that would preserve, or help us use less energy, that would be ...","well, mr. greenberg, i simply cannot allow to go unpassed the statements that have just been made by mr. reagan, who once again, has demonstrated,..."
1,1,"climate change, climate crisis, deal climate, environmental policy, climate accord, climate, water conservation, emissions, pollution, greenhouse","yes, i would. some of the things that can be done about this is a change in the rate structure of electric power companies. we now encourage peopl...","first, let me set the record straight. i vetoed the strip mining bill, mr. kraft, because it was the overwhelming consensus of knowledgeable peopl...","well, i might say i think the league of conservation voters is absolutely right. this administration's record of environment is very bad. i think ...","that is a misstatement, of course, of my position. i just happen to believe that free enterprise can do a better job of producing the things that ...",i have a very strong record on the environment in the united states senate. (laughter) i have a record where i voted for the superfund legislation...


In [119]:
# === MANUAL SUBTOPIC LABELING: energy_environment ===

energy_environment_subthemes = {
    0: "oil_gas_industry",
    1: "renewable_energy_climate_change"
}

# === WRITE-BACK (safe alignment on df_theme.index) ===
debates_df_topic.loc[df_theme.index, "subtopic"] = (
    pd.Series(df_theme_subtopics["subtopic"].values, index=df_theme.index).astype("Int64")
)
debates_df_topic.loc[df_theme.index, "subtopic_prob"] = df_theme_subtopics["subtopic_prob"].values
debates_df_topic.loc[df_theme.index, "subtheme_name"] = (
    df_theme_subtopics["subtopic"].map(energy_environment_subthemes).astype("object").values
)

print("Energy & Environment – subtopics written back.")
print(debates_df_topic.loc[df_theme.index, "subtheme_name"].value_counts(dropna=False))

Energy & Environment – subtopics written back.
subtheme_name
NaN                                145
oil_gas_industry                    41
renewable_energy_climate_change     36
Name: count, dtype: int64
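
The share of utterances left without a subtheme differs considerably between themes. The sketch below recomputes it from the `value_counts` outputs printed in sections 4.1, 4.3, 4.4, and 4.5; the numbers are copied from those outputs, while the variable names are introduced here for illustration:

```python
# (NaN rows, labelled rows per subtheme) copied from the printed value_counts outputs.
printed_counts = {
    "foreign_policy_national_security": (379, [119, 110, 93, 83, 69, 68, 68, 60, 46, 41, 32, 23]),
    "healthcare_social_security": (175, [100, 61, 34, 23]),
    "tax_policy": (31, [152, 43]),
    "energy_environment": (145, [41, 36]),
}

# Unlabelled share = NaN rows divided by all rows of the theme.
unlabelled_share = {
    theme: round(n_nan / (n_nan + sum(labelled)), 3)
    for theme, (n_nan, labelled) in printed_counts.items()
}

for theme, share in unlabelled_share.items():
    print(f"{theme}: {share}")
# foreign_policy_national_security: 0.318
# healthcare_social_security: 0.445
# tax_policy: 0.137
# energy_environment: 0.653
```

Assuming the label dictionaries cover every discovered id, these shares correspond to utterances that the clustering step marked as outliers (-1).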


### 4.6. Judiciary and Supreme Court

In [120]:
# === RUN SUBTOPIC MODELING FOR "judiciary_supreme_court" ===

theme_name = "judiciary_supreme_court"
theme_mask = debates_df_topic["theme_name"] == theme_name
df_theme = debates_df_topic[theme_mask]
embeddings_theme = np.array(embeddings)[theme_mask]

subtopic_model, subtopics, subtopic_probs, df_theme_subtopics = run_subtopic_modeling(
    df=df_theme,
    text_col=TEXT_COL,
    embeddings=embeddings_theme,
    min_cluster_size=20,
    min_df=1
)

print(f"Subtopics found in '{theme_name}':", len(set(subtopics)) - (1 if -1 in subtopics else 0))

Subtopics found in 'judiciary_supreme_court': 2


In [121]:
# === PREVIEW SUBTOPICS ===

TOP_N_WORDS = 10
N_EXAMPLES = 5

subtopic_info = subtopic_model.get_topic_info()
valid_subtopic_ids = subtopic_info[subtopic_info.Topic != -1]["Topic"].tolist()

preview_rows = []
for subtopic_id in valid_subtopic_ids:
    top_words = ", ".join([w for w, _ in subtopic_model.get_topic(subtopic_id)[:TOP_N_WORDS]])
    sample_texts = (
        df_theme_subtopics[df_theme_subtopics["subtopic"] == subtopic_id][TEXT_COL]
        .head(N_EXAMPLES)
        .tolist()
    )
    preview_rows.append({
        "subtopic_id": subtopic_id,
        "top_words": top_words,
        "sample_1": sample_texts[0] if len(sample_texts) > 0 else "",
        "sample_2": sample_texts[1] if len(sample_texts) > 1 else "",
        "sample_3": sample_texts[2] if len(sample_texts) > 2 else "",
        "sample_4": sample_texts[3] if len(sample_texts) > 3 else "",
        "sample_5": sample_texts[4] if len(sample_texts) > 4 else "",
    })

subtopic_preview_df = pd.DataFrame(preview_rows)

pd.set_option("display.max_colwidth", 150)
pd.set_option("display.max_rows", 30)
display(subtopic_preview_df)

Unnamed: 0,subtopic_id,top_words,sample_1,sample_2,sample_3,sample_4,sample_5
0,0,"issue abortion, birth abortion, birth abortions, term abortion, pro choice, unborn child, unborn, pro life, interpret constitution, democratic","governor carter, in the nearly 200-year history of the constitution, there have been only, i think it is, 25 amendments, most of them on issues of...","i would not work hard to support any of those. we have always had, i think, a lot of constitutional amendments proposed but the passage of them ha...",i support the republican platform which calls for a constitutional amendment that would outlaw abortions. i favor the particular constitutional am...,"governor, you've said the supreme court today is, as you put it, moving back in the proper direction in rulings that have limited the rights of cr...","while i was governor of georgia, although i am not a lawyer, we had complete reform of the georgia court system. we streamlined the structure of t..."
1,1,"religion politics, approve church, intolerant religion, article faith, church state, devout catholic, practice religion, separation church, religi...","i'd like to switch the focus from inflation to god. this week, cardinal medeiros of boston warned catholics that it's sinful to vote for candidate...","oh, i'm it's my question. but whether i agree or disagree with some individual, or what he may say, or how he may say it, i don't think there's an...","okay. i would point out that churches are tax-exempt institutions, and i'll repeat my question. do you approve the church's action this week in bo...","ms. golden, certainly the church has the right to take a position on moral issues. but to try, as occurred in the case that you mentioned - that s...","mr. president, would you describe your religious beliefs, noting particularly whether you consider yourself a born-again christian, and explain ho..."


In [122]:
# === MANUAL SUBTOPIC LABELING: judiciary_supreme_court ===

judiciary_supreme_court_subthemes = {
    0: "abortion_constitutional_amendments",
    1: "religion_church_state_debate"
}

# === WRITE-BACK (safe alignment on df_theme.index) ===
debates_df_topic.loc[df_theme.index, "subtopic"] = (
    pd.Series(df_theme_subtopics["subtopic"].values, index=df_theme.index).astype("Int64")
)
debates_df_topic.loc[df_theme.index, "subtopic_prob"] = df_theme_subtopics["subtopic_prob"].values
debates_df_topic.loc[df_theme.index, "subtheme_name"] = (
    df_theme_subtopics["subtopic"].map(judiciary_supreme_court_subthemes).astype("object").values
)

print("Judiciary & Supreme Court – subtopics written back.")
print(debates_df_topic.loc[df_theme.index, "subtheme_name"].value_counts(dropna=False))

Judiciary & Supreme Court – subtopics written back.
subtheme_name
abortion_constitutional_amendments    181
religion_church_state_debate           32
Name: count, dtype: int64


### 4.7. Civil Rights and Law Enforcement

In [123]:
# === RUN SUBTOPIC MODELING FOR "civil_rights_law_enforcement" ===

theme_name = "civil_rights_law_enforcement"
theme_mask = debates_df_topic["theme_name"] == theme_name
df_theme = debates_df_topic[theme_mask]
embeddings_theme = np.array(embeddings)[theme_mask]

subtopic_model, subtopics, subtopic_probs, df_theme_subtopics = run_subtopic_modeling(
    df=df_theme,
    text_col=TEXT_COL,
    embeddings=embeddings_theme,
    min_cluster_size=20,
    min_df=2
)

print(f"Subtopics found in '{theme_name}':", len(set(subtopics)) - (1 if -1 in subtopics else 0))

Subtopics found in 'civil_rights_law_enforcement': 4


In [124]:
# === PREVIEW SUBTOPICS ===

TOP_N_WORDS = 10
N_EXAMPLES = 5

subtopic_info = subtopic_model.get_topic_info()
valid_subtopic_ids = subtopic_info[subtopic_info.Topic != -1]["Topic"].tolist()

preview_rows = []
for subtopic_id in valid_subtopic_ids:
    top_words = ", ".join([w for w, _ in subtopic_model.get_topic(subtopic_id)[:TOP_N_WORDS]])
    sample_texts = (
        df_theme_subtopics[df_theme_subtopics["subtopic"] == subtopic_id][TEXT_COL]
        .head(N_EXAMPLES)
        .tolist()
    )
    preview_rows.append({
        "subtopic_id": subtopic_id,
        "top_words": top_words,
        "sample_1": sample_texts[0] if len(sample_texts) > 0 else "",
        "sample_2": sample_texts[1] if len(sample_texts) > 1 else "",
        "sample_3": sample_texts[2] if len(sample_texts) > 2 else "",
        "sample_4": sample_texts[3] if len(sample_texts) > 3 else "",
        "sample_5": sample_texts[4] if len(sample_texts) > 4 else "",
    })

subtopic_preview_df = pd.DataFrame(preview_rows)

pd.set_option("display.max_colwidth", 150)
pd.set_option("display.max_rows", 30)
display(subtopic_preview_df)

Unnamed: 0,subtopic_id,top_words,sample_1,sample_2,sample_3,sample_4,sample_5
0,0,"black families, issue race, black americans, racist, african american, american community, whites, racist person, ethnic, minority","yes, governor reagan. blacks and other nonwhites are increasing in numbers in our cities. many of them feel that they are facing a hostility from ...","i believe in it. i am eternally optimistic, and i happen to believe that we've made great progress from the days when i was young and when this co...","yes, president carter, i'd like to repeat the same followup to you. blacks and other nonwhites are increasing in numbers in our cities. many of th...","mr. perot, racial division continues to tear apart our great cities, the last episode being this spring in los angeles. why is this still happenin...","i grew up in the segregated south, thankfully raised by a grandfather with almost no formal education but with a heart of gold who taught me early..."
1,1,"discrimination, quotas, diversity, equal opportunity, civil rights, diverse, disabilities act, policy, equality, legislation","yes. howard, i'm a southerner, and i share the basic beliefs of my region about an excessive government intrusion into the private affairs of amer...","congresswoman ferraro, i would like to ask you about civil rights. you have in the past been a supporter of tuition tax credits for private paroch...","in the area of affirmative action, what steps do you think government can take to increase the representation of minorities and women in the work ...","i do not support the use of quotas. both mr. mondale and i feel very strongly about affirmative action to correct inequities, and we believe that ...","vice-president bush, many critics of your administration say that it is the most hostile to minorities in recent memory. have you inadvertently pe..."
2,2,"policing, police action, local police, police officers, justice reform, officers, cops, law order, police officer, enforcement implicit","yeah, i can't imagine what it would be like to be singled out because of race and stopped and harassed. that's just flat wrong, and that's not wha...","i would agree. and i also agree that most police officers, of course, are doing a good job and hate this practice also. i talked to an african-ame...","well, you're right. race remains a significant challenge in our country. unfortunately, race still determines too much, often determines where peo...","well, first of all, secretary clinton doesn't want to use a couple of words, and that's law and order. and we need law and order. if we don't have...","no, the argument is that we have to take the guns away from these people that have them and they are bad people that shouldn't have them. these ar..."
3,3,"white house, political opponent, election, debates, political opponents, attacked, voters, political, torches, carrying torches","secretary clinton has done an extraordinary job, but she works for me. i'm the president, and i'm always responsible. and that's why nobody is mor...","elaine, if i could — if i could jump in. i've heard senator scott make that eloquent plea. and look, criminal justice is about respecting the law ...","this tape is generating intense interest. in just 48 hours, it's become the single most talked about story of the entire 2016 election on facebook...","he went after mr. and mrs. khan, the parents of a young man who died serving our country, a gold star family, because of their religion. he went a...","well, chris, let me respond to that, because that's horrifying. you know, every time donald thinks things are not going in his direction, he claim..."


In [125]:
# === MANUAL SUBTOPIC LABELING: civil_rights_law_enforcement ===

civil_rights_law_enforcement_subthemes = {
    0: "racism_experiences",
    1: "dei_initiatives_civil_rights",
    2: "law_enforcement_police",
    3: "race_in_political_discourse"
}

# === WRITE-BACK (safe alignment on df_theme.index) ===
debates_df_topic.loc[df_theme.index, "subtopic"] = (
    pd.Series(df_theme_subtopics["subtopic"].values, index=df_theme.index).astype("Int64")
)
debates_df_topic.loc[df_theme.index, "subtopic_prob"] = df_theme_subtopics["subtopic_prob"].values
debates_df_topic.loc[df_theme.index, "subtheme_name"] = (
    df_theme_subtopics["subtopic"].map(civil_rights_law_enforcement_subthemes).astype("object").values
)

print("Civil Rights & Law Enforcement – subtopics written back.")
print(debates_df_topic.loc[df_theme.index, "subtheme_name"].value_counts(dropna=False))

Civil Rights & Law Enforcement – subtopics written back.
subtheme_name
NaN                             84
racism_experiences              34
dei_initiatives_civil_rights    30
law_enforcement_police          24
race_in_political_discourse     21
Name: count, dtype: int64
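
In the counts above, 84 of the 193 utterances in this theme carry no subtheme label, because the label dictionary has no entry for the outlier cluster (`-1`). The sketch below rebuilds a label column with the same counts and computes that share. `unassigned_share` is a hypothetical helper name.

```python
import pandas as pd

def unassigned_share(labels: pd.Series) -> float:
    """Fraction of rows whose subtheme label is missing."""
    return float(labels.isna().mean())

# Counts copied from the printed output above; None stands for the NaN row.
counts = {
    None: 84,
    "racism_experiences": 34,
    "dei_initiatives_civil_rights": 30,
    "law_enforcement_police": 24,
    "race_in_political_discourse": 21,
}
labels = pd.Series(
    [name for name, n in counts.items() for _ in range(n)],
    dtype="object",
)

share = unassigned_share(labels)
print(f"{share:.1%} of {len(labels)} utterances are unassigned")
```

Applied to the real column, the same function could be called on `debates_df_topic.loc[df_theme.index, "subtheme_name"]` to compare the unassigned share across themes.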


### 4.8. Debate Format and Procedure

In [126]:
# === RUN SUBTOPIC MODELING FOR "debate_format_procedure" ===

theme_name = "debate_format_procedure"
theme_mask = debates_df_topic["theme_name"] == theme_name
df_theme = debates_df_topic[theme_mask]
embeddings_theme = np.array(embeddings)[theme_mask]

subtopic_model, subtopics, subtopic_probs, df_theme_subtopics = run_subtopic_modeling(
    df=df_theme,
    text_col=TEXT_COL,
    embeddings=embeddings_theme,
    min_cluster_size=20,
    min_df=2
)

print(f"Subtopics found in '{theme_name}':", len(set(subtopics)) - (1 if -1 in subtopics else 0))

Subtopics found in 'debate_format_procedure': 3


In [127]:
# === PREVIEW SUBTOPICS ===

TOP_N_WORDS = 10
N_EXAMPLES = 5

subtopic_info = subtopic_model.get_topic_info()
valid_subtopic_ids = subtopic_info[subtopic_info.Topic != -1]["Topic"].tolist()

preview_rows = []
for subtopic_id in valid_subtopic_ids:
    top_words = ", ".join([w for w, _ in subtopic_model.get_topic(subtopic_id)[:TOP_N_WORDS]])
    sample_texts = (
        df_theme_subtopics[df_theme_subtopics["subtopic"] == subtopic_id][TEXT_COL]
        .head(N_EXAMPLES)
        .tolist()
    )
    preview_rows.append({
        "subtopic_id": subtopic_id,
        "top_words": top_words,
        "sample_1": sample_texts[0] if len(sample_texts) > 0 else "",
        "sample_2": sample_texts[1] if len(sample_texts) > 1 else "",
        "sample_3": sample_texts[2] if len(sample_texts) > 2 else "",
        "sample_4": sample_texts[3] if len(sample_texts) > 3 else "",
        "sample_5": sample_texts[4] if len(sample_texts) > 4 else "",
    })

subtopic_preview_df = pd.DataFrame(preview_rows)

pd.set_option("display.max_colwidth", 150)
pd.set_option("display.max_rows", 30)
display(subtopic_preview_df)

Unnamed: 0,subtopic_id,top_words,sample_1,sample_2,sample_3,sample_4,sample_5
0,0,"election day, tomorrow, nominee, washington, wednesday, st louis, tuesday, cbs news, november, cbs","thank you very much, gentlemen. this hour has gone by all too quickly. thank you very much for permitting us to present the next president of the ...","thank you very much. i'd like to thank vice-president bush, congresswoman ferraro, the members of our panel for joining us in this league of women...","thank you, mr. president. thank you, mr. mondale. our thanks also to the panel, finally, to our audience. we thank you, and the league of women vo...",good evening. on behalf of the commission on presidential debates i am pleased to welcome you to this first presidential debate of the 1988 campai...,"on behalf of the commission on presidential debates, i am pleased to welcome you to this vice presidential debate. i'm judy woodruff of pbs' macne..."
1,1,"debates sponsored, discussion, campaigns agreed, campaigns, topics, audience hall, republican nominee, democratic nominee, nominee, agreed represe...","thank you gentlemen. as we mentioned at the opening of this program, the candidates agreed that the clock alone would determine who had the last w...","well, one of the very serious things that's happened in our government in recent years and has continued up until now is a breakdown in the trust ...","ladies and gentlemen, probably it is not necessary for me to say that we had a technical failure during the debates. it was not a failure in the d...","good evening. i am pauline frederick of npr [national public radio], moderator of the second of the historic debates of the 1976 campaign between ...","thank you gentlemen. the subject matter of tonight's debate, like that of the first two presidential debates, covers domestic and economic policie..."
2,2,"end statements, toss elected, come end, segment secretary, statements, senators, statements order, statements kemp, come close, questions statements",this will allow three minutes and twenty seconds for the summation by each candidate.,we have time for only one or two more questions before the closing statements. now walter cronkite's question for senator kennedy.,"under the agreed rules, gentlemen, we've exhausted the time for questions. each candidate will now have four minutes and thirty seconds for his cl...","it is now time for the closing statements which are to be up to 4 minutes long. governor carter, by the same toss of the coin that directed the fi...","thank you, governor carter. that completes the questioning for this evening. each candidate now has up to 3 minutes for a closing statement. it wa..."


In [128]:
# === MANUAL SUBTOPIC LABELING: debate_format_procedure ===

debate_format_procedure_subthemes = {
    0: "debate_opening_closing_remarks",
    1: "debate_structure_and_sponsorship",
    2: "closing_statements_and_time_rules",
}

# === WRITE-BACK (safe alignment on df_theme.index) ===
debates_df_topic.loc[df_theme.index, "subtopic"] = (
    pd.Series(df_theme_subtopics["subtopic"].values, index=df_theme.index).astype("Int64")
)
debates_df_topic.loc[df_theme.index, "subtopic_prob"] = df_theme_subtopics["subtopic_prob"].values
debates_df_topic.loc[df_theme.index, "subtheme_name"] = (
    df_theme_subtopics["subtopic"].map(debate_format_procedure_subthemes).astype("object").values
)

print("Debate Format & Procedure – subtopics written back.")
print(debates_df_topic.loc[df_theme.index, "subtheme_name"].value_counts(dropna=False))

Debate Format & Procedure – subtopics written back.
subtheme_name
debate_opening_closing_remarks       53
debate_structure_and_sponsorship     52
NaN                                  45
closing_statements_and_time_rules    40
Name: count, dtype: int64
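
The label dictionaries in this section use integer keys, and they are later written to JSON. JSON object keys are always strings, so a dictionary reloaded from one of those files needs its keys converted back before it can be used with `.map()` on integer subtopic IDs. A minimal sketch of the round trip, using an in-memory string instead of a file:

```python
import json

debate_format_procedure_subthemes = {
    0: "debate_opening_closing_remarks",
    1: "debate_structure_and_sponsorship",
    2: "closing_statements_and_time_rules",
}

# json.dumps writes the integer key 0 as the string "0".
encoded = json.dumps(debate_format_procedure_subthemes, indent=2)
reloaded = json.loads(encoded)

# Convert the keys back to int before mapping integer subtopic IDs.
restored = {int(key): label for key, label in reloaded.items()}

print(sorted(reloaded))   # string keys
print(sorted(restored))   # integer keys
```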


### 4.9. Immigration and Borders (No Usable Subthemes)

In [129]:
# === RUN SUBTOPIC MODELING FOR "immigration_borders" ===

theme_name = "immigration_borders"
theme_mask = debates_df_topic["theme_name"] == theme_name
df_theme = debates_df_topic[theme_mask]
embeddings_theme = np.array(embeddings)[theme_mask]

subtopic_model, subtopics, subtopic_probs, df_theme_subtopics = run_subtopic_modeling(
    df=df_theme,
    text_col=TEXT_COL,
    embeddings=embeddings_theme,
    min_cluster_size=20,
    min_df=1
)

print(f"Subtopics found in '{theme_name}':", len(set(subtopics)) - (1 if -1 in subtopics else 0))

Subtopics found in 'immigration_borders': 2


In [130]:
# === PREVIEW SUBTOPICS ===

TOP_N_WORDS = 10
N_EXAMPLES = 5

subtopic_info = subtopic_model.get_topic_info()
valid_subtopic_ids = subtopic_info[subtopic_info.Topic != -1]["Topic"].tolist()

preview_rows = []
for subtopic_id in valid_subtopic_ids:
    top_words = ", ".join([w for w, _ in subtopic_model.get_topic(subtopic_id)[:TOP_N_WORDS]])
    sample_texts = (
        df_theme_subtopics[df_theme_subtopics["subtopic"] == subtopic_id][TEXT_COL]
        .head(N_EXAMPLES)
        .tolist()
    )
    preview_rows.append({
        "subtopic_id": subtopic_id,
        "top_words": top_words,
        "sample_1": sample_texts[0] if len(sample_texts) > 0 else "",
        "sample_2": sample_texts[1] if len(sample_texts) > 1 else "",
        "sample_3": sample_texts[2] if len(sample_texts) > 2 else "",
        "sample_4": sample_texts[3] if len(sample_texts) > 3 else "",
        "sample_5": sample_texts[4] if len(sample_texts) > 4 else "",
    })

subtopic_preview_df = pd.DataFrame(preview_rows)

pd.set_option("display.max_colwidth", 150)
pd.set_option("display.max_rows", 30)
display(subtopic_preview_df)

Unnamed: 0,subtopic_id,top_words,sample_1,sample_2,sample_3,sample_4,sample_5
0,0,"immigration reform, border security, illegal immigration, amnesty, deportations, coming border, deportation force, comprehensive immigration, depo...","mr. mondale, many analysts are now saying that actually our number one foreign policy problem today is one that remains almost totally unrecognize...","this is a very serious problem in our country, and it has to be dealt with. i object to that part of the simpson-mazzoli bill which i think is ver...","sir, people as well-balanced and just as father theodore hesburgh at notre dame, who headed the select commission on immigration, have pointed out...","i think you're right that the polls show that the majority of hispanics want that bill, so i'm not doing it for political reasons. i'm doing it be...","mr. president, you, too, have said that our borders are out of control. yet this fall you allowed the simpson-mazzoli bill—which would at least ha..."
1,1,"nancy pelosi, pelosi, politician, white house, nancy, won election, iran deal, tremendous, spending, run","i have none i'd like to ask of her, but i'd sure like to use the time to talk about the world series or something of that nature. let me put it th...","well, i assume she was supportive of the decision on mcdonnell douglas. i assume she was supporting me on the decision to sell those airplanes. i ...",they didn't get yours or mine? which one didn't they get?,"well, americans have gotten to know sarah palin. they know that she's a role model to women and other -- and reformers all over america. she's a r...","you know, i think it's -- that's going to be up to the american people. i think that, obviously, she's a capable politician who has, i think, exci..."
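
Subtopic 1 above groups remarks that are not about immigration, and no labeling or write-back cell follows for this theme. Its rows therefore end up with a missing `subtheme_name`, just like outlier rows in the themes that were labeled. One way to keep the two cases apart is sketched below. The `subtheme_status` column, the `rejected_themes` registry and the toy frame are all hypothetical and do not exist in the notebook.

```python
import pandas as pd

# Toy stand-in for debates_df_topic (assumption: same column names).
toy_df = pd.DataFrame({
    "theme_name": ["immigration_borders", "energy_environment",
                   "energy_environment", "immigration_borders"],
    "subtheme_name": [None, "oil_gas_industry", None, None],
})

# Hypothetical registry of themes whose subtopic run was rejected on inspection.
rejected_themes = {"immigration_borders": "second cluster is not about immigration"}

# Hypothetical status column recording why a row has, or lacks, a label.
toy_df["subtheme_status"] = "labeled"
toy_df.loc[toy_df["subtheme_name"].isna(), "subtheme_status"] = "unassigned"
toy_df.loc[
    toy_df["theme_name"].isin(list(rejected_themes)), "subtheme_status"
] = "subtopics_rejected"

print(toy_df)
```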


## 5. Export Processed Dataset

In [131]:
# === FINAL DATA QUALITY CHECKS ===

# preview the first rows and all columns
display(debates_df_topic.head(3))

# count total rows and missing values in key columns
print("Total rows:", len(debates_df_topic))
print("Missing theme_name:", debates_df_topic["theme_name"].isna().sum())
print("Missing subtheme_name:", debates_df_topic["subtheme_name"].isna().sum())

# value counts
print("\nThemes:")
print(debates_df_topic["theme_name"].value_counts())

print("\nSubthemes:")
print(debates_df_topic["subtheme_name"].value_counts(dropna=False))

Unnamed: 0,text,speaker_normalized,speaker,party,winner,winner_party,year,debate_type,debate_id,utterance_id,lemmatized_text,token_count,decade,party_code,topic,probability,theme_name,subtheme_name,subtopic,subtopic_prob
0,good evening. the television and radio stations of the united states and their affiliated stations are proud to provide facilities for a discussio...,Moderator,Moderator,,Kennedy,Democrat,1960,presidential,1960_1_Presidential_Nixon_Kennedy,1960_1_Presidential_Nixon_Kennedy_001,good evening the television and radio station of the united states and their affiliated station be proud to provide facility for a discussion of i...,146,1960s,,6,0.289106,debate_format_procedure,,-1,0.332321
1,"mr. smith, mr. nixon. in the election of 1860, abraham lincoln said the question was whether this nation could exist half-slave or half-free. in t...",Candidate_D,Kennedy,Democrat,Kennedy,Democrat,1960,presidential,1960_1_Presidential_Nixon_Kennedy,1960_1_Presidential_Nixon_Kennedy_002,mr smith mr nixon in the election of 1860 abraham lincoln say the question be whether this nation could exist half slave or half free in the elect...,1290,1960s,D,0,0.184597,foreign_policy_national_security,historical_foreign_policy_cuba_vietnam,5,1.0
2,"mr. smith, senator kennedy. the things that senator kennedy has said many of us can agree with. there is no question but that we cannot discuss ou...",Candidate_R,Nixon,Republican,Kennedy,Democrat,1960,presidential,1960_1_Presidential_Nixon_Kennedy,1960_1_Presidential_Nixon_Kennedy_004,mr smith senator kennedy the thing that senator kennedy have say many of we can agree with there be no question but that we can not discuss our in...,1406,1960s,R,0,0.154407,foreign_policy_national_security,historical_foreign_policy_cuba_vietnam,5,1.0


Total rows: 6316
Missing theme_name: 0
Missing subtheme_name: 3811

Themes:
theme_name
noise_or_unspecified                2154
foreign_policy_national_security    1191
leadership_executive_experience      737
healthcare_social_security           393
tax_policy                           226
energy_environment                   222
judiciary_supreme_court              213
civil_rights_law_enforcement         193
debate_format_procedure              190
immigration_borders                  151
government_spending_budget           136
partisan_gridlock_new_leadership     124
education_public_schools             105
electoral_politics_governance         77
gun_control                           77
public_health_pandemics               66
china_global_trade                    61
Name: count, dtype: int64

Subthemes:
subtheme_name
<NA>                                        2951
NaN                                          860
affordable_care_health_insurance             638
social_security_p
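
The subtheme counts above list missing labels twice, once as `<NA>` (2951) and once as `NaN` (860), which together give the 3811 missing values reported earlier. Both count as missing for `isna()`, but in an `object` column they are distinct values, so `value_counts(dropna=False)` reports them separately. A minimal sketch of collapsing them into a single marker, on a toy column rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy column mixing both missing markers, as in the counts above.
subtheme = pd.Series(
    ["oil_gas_industry", pd.NA, np.nan, "oil_gas_industry", pd.NA],
    dtype="object",
)

# Replace every missing entry with np.nan so only one marker remains.
normalized = subtheme.where(subtheme.notna(), np.nan)

print(subtheme.value_counts(dropna=False))
print(normalized.value_counts(dropna=False))
```

On the real frame the same `where` call could be applied to `debates_df_topic["subtheme_name"]` before the final counts and the CSV export.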

In [132]:
# === FINAL SAVE === 

TOPIC_RESULTS_DIR = REPO_DIR / "results" / "topic_modeling"
TOPIC_RESULTS_DIR.mkdir(parents=True, exist_ok=True)

# save full dataset
debates_df_topic.to_csv(DATA_DIR / "debates_df_themes.csv", index=False)

# save summary table
topic_preview_df.to_csv(TOPIC_RESULTS_DIR / "topics_summary.csv", index=False)

# save theme mapping
with open(TOPIC_RESULTS_DIR / "theme_name_dict.json", "w") as f:
    json.dump(topic_labels, f, indent=2)

# save subtheme mappings
with open(TOPIC_RESULTS_DIR / "foreign_policy_subthemes.json", "w") as f:
    json.dump(foreign_policy_subthemes, f, indent=2)

with open(TOPIC_RESULTS_DIR / "healthcare_subthemes.json", "w") as f:
    json.dump(healthcare_subthemes, f, indent=2)

with open(TOPIC_RESULTS_DIR / "tax_policy_subthemes.json", "w") as f:
    json.dump(tax_policy_subthemes, f, indent=2)

with open(TOPIC_RESULTS_DIR / "energy_environment_subthemes.json", "w") as f:
    json.dump(energy_environment_subthemes, f, indent=2)

with open(TOPIC_RESULTS_DIR / "judicial_supreme_court_subthemes.json", "w") as f:
    json.dump(judiciary_supreme_court_subthemes, f, indent=2)

with open(TOPIC_RESULTS_DIR / "civil_rights_law_enforcement_subthemes.json", "w") as f:
    json.dump(civil_rights_law_enforcement_subthemes, f, indent=2)

with open(TOPIC_RESULTS_DIR / "debate_format_procedure_subthemes.json", "w") as f:
    json.dump(debate_format_procedure_subthemes, f, indent=2)

# preview saved files
list(TOPIC_RESULTS_DIR.glob("*"))

[PosixPath('/Users/emmamora/Documents/GitHub/thesis/results/topic_modeling/.DS_Store'),
 PosixPath('/Users/emmamora/Documents/GitHub/thesis/results/topic_modeling/tax_policy_subthemes.json'),
 PosixPath('/Users/emmamora/Documents/GitHub/thesis/results/topic_modeling/healthcare_subthemes.json'),
 PosixPath('/Users/emmamora/Documents/GitHub/thesis/results/topic_modeling/energy_environment_subthemes.json'),
 PosixPath('/Users/emmamora/Documents/GitHub/thesis/results/topic_modeling/debates_dataset_with_topics.csv'),
 PosixPath('/Users/emmamora/Documents/GitHub/thesis/results/topic_modeling/civil_rights_law_enforcement_subthemes.json'),
 PosixPath('/Users/emmamora/Documents/GitHub/thesis/results/topic_modeling/topics_summary.csv'),
 PosixPath('/Users/emmamora/Documents/GitHub/thesis/results/topic_modeling/foreign_policy_subthemes.json'),
 PosixPath('/Users/emmamora/Documents/GitHub/thesis/results/topic_modeling/debate_format_procedure_subthemes.json'),
 PosixPath('/Users/emmamora/Documents/