# Ideological Drift - Word Drift Using Embeddings

This notebook documents the **Word Drift analysis pipeline** for the BSc thesis:  
`Debates, Media, and Discourse: A Computational Analysis of Temporal Shifts in U.S. Presidential Debates and Media Framing Across the Political Spectrum`, written by **Emma Cristina Mora** (emma.mora@studbocconi.it) at **Bocconi University** under the supervision of **Professor Carlo Rasmus Schwarz**.  

The objective of this stage is to explore how the **semantic meaning of key political anchors** (e.g., *freedom*, *security*, *immigration*) has evolved in U.S. presidential debates across decades, and how these anchors align with policy themes in both **debates** and **media coverage**. By tracking semantic drift and divergence, the analysis highlights how rhetorical and policy anchors shift, stabilize, or fragment in political discourse.  

**Dataset Preparation**  
- The input dataset is the **debates_df_themes.csv**, containing ~6,300 utterances enriched with speaker, party, year, decade, and thematic labels.  
- Utterances were paired with **SBERT embeddings (all-MiniLM-L6-v2, 384-dim)** previously computed during topic modeling.  
- Anchor terms were defined with **alias expansions** to capture linguistic variation (e.g., *security, safety, defense*).  
- Filtering removed moderator interventions and retained only **candidate utterances**.  

**Anchor Embedding Computation**  
- For each anchor, embeddings were grouped by **party × decade** to create centroid representations.  
- This enabled measurement of:  
  - **Semantic drift**: cosine distance of anchor centroids across consecutive decades.  
  - **Party divergence**: cosine distance between Democratic and Republican anchors within the same decade.  
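
Both measures reduce to a cosine distance between two centroid vectors. The following is a minimal sketch with made-up 2-D vectors, not the notebook's actual data; `cosine_distance` is a hypothetical helper written here for illustration (the notebook itself relies on scikit-learn's `cosine_distances`).

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy "utterance embeddings" for one anchor in two decades (illustrative values)
emb_1960s = np.array([[1.0, 0.0], [1.0, 0.0]])
emb_1970s = np.array([[0.0, 1.0], [0.0, 3.0]])

# a centroid is the mean of the utterance vectors in a group
centroid_1960s = emb_1960s.mean(axis=0)
centroid_1970s = emb_1970s.mean(axis=0)

# drift between decades: the two toy centroids are orthogonal, so the distance is 1
drift = cosine_distance(centroid_1960s, centroid_1970s)
print(drift)
```

Party divergence uses the same distance, applied to the Democratic and Republican centroids of one decade instead of one party's centroids in two decades.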

**Anchor–Theme Alignment**  
- Debate anchor centroids were compared to **debate theme centroids** (2010s–2020s) to identify the policy themes closest to each anchor.  
- Media theme centroids (NYT, WSJ, NYP, 2012–2024), built from balanced Factiva datasets, were included to compare how **debates vs. media** frame the same anchors.  
- Results capture both **policy anchors** (healthcare, immigration, taxes) that map consistently to stable themes, and **rhetorical anchors** (freedom, security, America) that drift between symbolic and policy uses.  

**Outputs**  
- `semantic_drift.csv` — decade-to-decade semantic distances for each anchor and party.  
- `party_divergence.csv` — cross-party divergence within the same decade.  
- `anchor_theme_alignment.csv` — mapping of anchors to closest debate and media themes.  

**Notebook Contribution**  
This pipeline provides a framework for analyzing **ideological drift in political discourse**, enabling:  
- Detection of stable vs. drifting anchors in debates.  
- Comparison of rhetorical vs. policy anchors.  
- Cross-domain framing alignment between **debates** and **media**.  
- Empirical evidence on how key political terms are contested, stabilized, or redefined across decades.  

## 1. Introduction and Config

In [1]:
# === SETUP ===

# standard libraries
from pathlib import Path
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
import json

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# reproducibility (used later for sampling etc.)
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

In [31]:
# === FILE PATHS ===

# set base repository path (assumes notebook is in repo/notebooks/)
REPO_DIR = Path(".").resolve().parents[0]

# data paths
DATA_DIR = REPO_DIR / "data"
DEBATES_DF_PATH = DATA_DIR / "debates_df_themes.csv"
EMBEDDINGS_PATH = DATA_DIR / "topic_modeling" / "debates_embeddings.npy"
OUTPUT_PATH = DATA_DIR / "ideological_drift" 

# color palette
with open(REPO_DIR / "color_palette_config.json") as f:
    palette = json.load(f)

print("Repository Path:", REPO_DIR)
print("Data Directory:", DATA_DIR)
print("Debates Dataset:", DEBATES_DF_PATH)
print("Embeddings:", EMBEDDINGS_PATH)

Repository Path: /Users/emmamora/Documents/GitHub/thesis
Data Directory: /Users/emmamora/Documents/GitHub/thesis/data
Debates Dataset: /Users/emmamora/Documents/GitHub/thesis/data/debates_df_themes.csv
Embeddings: /Users/emmamora/Documents/GitHub/thesis/data/topic_modeling/debates_embeddings.npy


## 2. Load Data

In [3]:
# === LOAD DATA ===

df = pd.read_csv(DEBATES_DF_PATH)
embeddings = np.load(EMBEDDINGS_PATH)

print(f"Debates shape: {df.shape}, Embeddings shape: {embeddings.shape}")
df.head(3)

Debates shape: (6316, 20), Embeddings shape: (6316, 384)


Unnamed: 0,text,speaker_normalized,speaker,party,winner,winner_party,year,debate_type,debate_id,utterance_id,lemmatized_text,token_count,decade,party_code,topic,probability,theme_name,subtheme_name,subtopic,subtopic_prob
0,good evening. the television and radio station...,Moderator,Moderator,,Kennedy,Democrat,1960,presidential,1960_1_Presidential_Nixon_Kennedy,1960_1_Presidential_Nixon_Kennedy_001,good evening the television and radio station ...,146,1960s,,6,0.289106,debate_format_procedure,,-1.0,0.332321
1,"mr. smith, mr. nixon. in the election of 1860,...",Candidate_D,Kennedy,Democrat,Kennedy,Democrat,1960,presidential,1960_1_Presidential_Nixon_Kennedy,1960_1_Presidential_Nixon_Kennedy_002,mr smith mr nixon in the election of 1860 abra...,1290,1960s,D,0,0.184597,foreign_policy_national_security,historical_foreign_policy_cuba_vietnam,5.0,1.0
2,"mr. smith, senator kennedy. the things that se...",Candidate_R,Nixon,Republican,Kennedy,Democrat,1960,presidential,1960_1_Presidential_Nixon_Kennedy,1960_1_Presidential_Nixon_Kennedy_004,mr smith senator kennedy the thing that senato...,1406,1960s,R,0,0.154407,foreign_policy_national_security,historical_foreign_policy_cuba_vietnam,5.0,1.0


## 3. Anchor Terms

In [15]:
# === ANCHOR TERMS WITH ALIASES ===

# aliases are used to capture variations of the anchor terms
# terms have been chosen based on common political themes in US debates that would be well represented in all decades
ANCHORS = {
    "freedom": ["freedom", "freedoms", "liberty", "liberties", "free"],
    "security": ["security", "safety", "protection", "defense"],
    "america": ["america", "american", "americans"],
    "taxes": ["tax", "taxes", "taxation", "taxpayer", "taxpayers"],
    "immigration": ["immigration", "immigrant", "immigrants", "migrant", "migrants", 
                "asylum", "refugee", "refugees", "border", "borders", 
                "border wall", "border security", "illegal alien", "illegal aliens", 
                "deportation", "deport", "visa", "visas"],
    "healthcare": ["healthcare", "health care", "medicare", "medicaid", "obamacare", "affordable care"],
}

print("[INFO] Anchor terms defined with aliases:")
for k, v in ANCHORS.items():
    print(f"  {k:<12} → {', '.join(v)}")

[INFO] Anchor terms defined with aliases:
  freedom      → freedom, freedoms, liberty, liberties, free
  security     → security, safety, protection, defense
  america      → america, american, americans
  taxes        → tax, taxes, taxation, taxpayer, taxpayers
  immigration  → immigration, immigrant, immigrants, migrant, migrants, asylum, refugee, refugees, border, borders, border wall, border security, illegal alien, illegal aliens, deportation, deport, visa, visas
  healthcare   → healthcare, health care, medicare, medicaid, obamacare, affordable care


In [16]:
# === CHECK PRESENCE OF ANCHORS IN TEXT ===

def match_anchor(text: str):
    """Return the list of anchor keys whose aliases occur in the text (case-insensitive substring match)."""
    text = str(text).lower()
    found = set()
    for anchor, aliases in ANCHORS.items():
        for alias in aliases:
            if alias in text:
                found.add(anchor)
    return list(found)

# quick test on sample rows
sample_texts = df["text"].sample(5, random_state=RANDOM_SEED)
for t in sample_texts:
    print(f"\nUtterance: {t}\nMatched anchors: {match_anchor(t)}")


Utterance: well, i've been a senator, donald...
Matched anchors: []

Utterance: no, let me go back and speak to the points that the president made, and let's get them correct. i did not say that the arizona law was a model for the nation in that aspect. i said that the e-verify portion of the arizona law, which is the portion of the law which says that employers could be able to determine whether someone is here illegally or not illegally, that that was a model for the nation. that's number one. number two, i asked the president a question, i think, hispanics and immigrants all over the nation have asked. he was asked this on univision the other day. why, when you said you'd file legislation in your first year, didn't you do it? and he didn't answer. he doesn't answer that question. he said the standard bearer wasn't for it. i'm glad you thought i was a standard bearer 4 years ago, but i wasn't. four years ago, you said in your first year, you would file legislation. in his first year
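
Because `match_anchor` tests `alias in text`, short aliases can also match inside longer words (for example, `free` inside `freeze`, or `visa` inside `advisable`). A stricter variant using word boundaries could look like the sketch below. `ANCHORS_DEMO`, `ANCHOR_PATTERNS`, and `match_anchor_strict` are hypothetical names introduced for illustration, and the alias lists are a shortened subset; this variant was not used to produce the counts reported in this notebook.

```python
import re

# shortened, illustrative subset of the anchor aliases
ANCHORS_DEMO = {
    "freedom": ["freedom", "liberty", "free"],
    "taxes": ["tax", "taxes"],
    "immigration": ["visa", "border"],
    "healthcare": ["health care", "medicare"],
}

# one compiled pattern per anchor; \b prevents matches inside longer words
ANCHOR_PATTERNS = {
    anchor: re.compile(r"\b(?:" + "|".join(re.escape(a) for a in aliases) + r")\b")
    for anchor, aliases in ANCHORS_DEMO.items()
}

def match_anchor_strict(text):
    """Return a sorted list of anchors with at least one whole-word alias match."""
    text = str(text).lower()
    return sorted(a for a, pat in ANCHOR_PATTERNS.items() if pat.search(text))

print(match_anchor_strict("We will freeze spending; it is advisable."))
print(match_anchor_strict("Cut the tax on health care."))
```

Returning a sorted list also makes the output order independent of set iteration order.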

In [17]:
# === FILTER DEBATES TO ONLY UTTERANCES WITH ANCHORS (NO MODERATORS) ===

df["matched_anchors"] = df["text"].apply(match_anchor)

mask_has_anchor = df["matched_anchors"].str.len() > 0
mask_is_candidate = df["party"].isin(["Republican", "Democrat", "Independent"])

anchors_df = df[mask_has_anchor & mask_is_candidate].copy()

print(f"\n[INFO] Utterances with at least one anchor term (candidates only): {len(anchors_df)} / {len(df)}")

# distribution of anchors across utterances
from collections import Counter
anchor_counter = Counter([a for anchors in anchors_df["matched_anchors"] for a in anchors])
print("\n[INFO] Anchor frequencies (across candidate utterances):")
for k, v in anchor_counter.items():
    print(f"  {k:<12}: {v}")


[INFO] Utterances with at least one anchor term (candidates only): 2116 / 6316

[INFO] Anchor frequencies (across candidate utterances):
  freedom     : 290
  america     : 1329
  security    : 562
  immigration : 166
  taxes       : 718
  healthcare  : 400


## 4. Compute Anchor Embeddings by Party and Decade

In [18]:
# === ALIGN EMBEDDINGS WITH FILTERED DATA ===

# make sure indices align between df and embeddings
assert len(df) == embeddings.shape[0], "DataFrame and embeddings misaligned!"

# extract only candidate utterances with anchors
anchor_idx = anchors_df.index.to_numpy()
anchor_embeddings = embeddings[anchor_idx]

print(f"[INFO] Anchored utterances embeddings shape: {anchor_embeddings.shape}")

[INFO] Anchored utterances embeddings shape: (2116, 384)
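
The length assertion above confirms that the two objects have the same number of rows, but indexing `embeddings` with `anchors_df.index` additionally assumes that `df` still carries its default `0..n-1` index, so that labels equal row positions. A minimal sketch of that extra check on toy data (`toy_df`, `toy_emb`, and `subset` are illustrative stand-ins):

```python
import numpy as np
import pandas as pd

# toy stand-ins for df and embeddings
toy_df = pd.DataFrame({"text": ["a", "b", "c", "d"]})
toy_emb = np.arange(8.0).reshape(4, 2)

# filtering keeps the original index labels (here 1 and 3)
subset = toy_df[toy_df["text"].isin(["b", "d"])]

# index labels equal row positions only while the default RangeIndex is intact
assert toy_df.index.equals(pd.RangeIndex(len(toy_df)))

# safe to use the labels as positions into the embedding matrix
subset_emb = toy_emb[subset.index.to_numpy()]
print(subset_emb)
```

If the frame had been re-sorted or re-indexed after loading, the check would fail and the positional lookup would need an explicit position column instead.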


In [19]:
# === GROUP BY PARTY × DECADE × ANCHOR ===

# df already has a string decade column (e.g. "1960s"); overwrite it with an integer decade for sorting
anchors_df["decade"] = (anchors_df["year"] // 10) * 10

# expand rows for multiple matched anchors (e.g., 'freedom' and 'america' in one utterance)
rows = []
for i, row in anchors_df.iterrows():
    emb = embeddings[i]
    for anchor in row["matched_anchors"]:
        rows.append({
            "utterance_id": row["utterance_id"],
            "party": row["party"],
            "decade": row["decade"],
            "anchor": anchor,
            "embedding": emb
        })

anchor_long = pd.DataFrame(rows)

print(f"[INFO] Expanded anchor rows: {len(anchor_long)}")

[INFO] Expanded anchor rows: 3465
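
The row-by-row expansion above can also be written with `DataFrame.explode`. The sketch below uses toy data; `toy_df`, `toy_embeddings`, and the `row_pos` column are hypothetical names, and in the notebook the original index labels of `anchors_df` would play the role of `row_pos` (assuming `df` keeps its default index).

```python
import numpy as np
import pandas as pd

# toy stand-ins for anchors_df and the embedding matrix (illustrative values)
toy_df = pd.DataFrame({
    "utterance_id": ["u1", "u2", "u3"],
    "party": ["Democrat", "Republican", "Democrat"],
    "decade": [1960, 1960, 1970],
    "matched_anchors": [["freedom", "america"], ["taxes"], ["america"]],
})
toy_embeddings = np.arange(12, dtype=float).reshape(3, 4)

# remember each utterance's row position before expanding
toy_df["row_pos"] = np.arange(len(toy_df))

# one row per (utterance, anchor) pair
toy_long = (
    toy_df.explode("matched_anchors")
          .rename(columns={"matched_anchors": "anchor"})
          .reset_index(drop=True)
)

# attach the utterance vector to every expanded row
toy_long["embedding"] = list(toy_embeddings[toy_long["row_pos"].to_numpy()])
print(toy_long[["utterance_id", "anchor", "row_pos"]])
```

An utterance that matches several anchors contributes the same vector to each of those anchors' groups, exactly as in the loop above.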


In [20]:
# === COMPUTE CENTROIDS PER GROUP ===

from numpy.linalg import norm

centroids = []
for (anchor, party, decade), group in anchor_long.groupby(["anchor", "party", "decade"]):
    group_embs = np.vstack(group["embedding"].values)
    centroid = group_embs.mean(axis=0)
    centroids.append({
        "anchor": anchor,
        "party": party,
        "decade": decade,
        "count": len(group),
        "centroid": centroid
    })

centroids_df = pd.DataFrame(centroids)

print(f"[INFO] Centroids table shape: {centroids_df.shape}")
centroids_df.head()

[INFO] Centroids table shape: (93, 5)


| | anchor | party | decade | count | centroid |
|---|---|---|---|---|---|
| 0 | america | Democrat | 1960 | 16 | [-0.018588753, -0.029245632, -0.0069335555, -0... |
| 1 | america | Democrat | 1970 | 36 | [-0.032048408, -0.045022864, 0.015777659, -0.0... |
| 2 | america | Democrat | 1980 | 102 | [-0.010267679, -0.011285566, 0.018477034, -0.0... |
| 3 | america | Democrat | 1990 | 109 | [-0.015049924, -0.015090035, 0.03112774, -0.03... |
| 4 | america | Democrat | 2000 | 191 | [-0.018042397, -0.008161504, 0.032586254, -0.0... |


## 5. Analysis per Decade and Party

### 5.1. Semantic Drift

In [27]:
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

# === COSINE DISTANCE BETWEEN DECADES (semantic drift over time) ===

drift_rows = []
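for anchor, group in centroids_df.groupby(["anchor", "party"]):
    # note: `anchor` is an (anchor, party) tuple here because we group by two columns
    # sort by decade so that adjacent rows are adjacent available decades
    group_sorted = group.sort_values("decade")
    decades = group_sorted["decade"].tolist()
    # use a local name: reusing `embeddings` would overwrite the utterance-level array loaded in section 2
    centroid_mat = np.vstack(group_sorted["centroid"].values)
    
    # compute distances between each pair of adjacent decades
    for i in range(1, len(decades)):
        d1, d2 = decades[i-1], decades[i]
        emb1, emb2 = centroid_mat[i-1], centroid_mat[i]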
        dist = cosine_distances([emb1], [emb2])[0][0]
        drift_rows.append({
            "anchor": anchor,
            "decade1": d1,
            "decade2": d2,
            "drift": dist
        })

drift_df = pd.DataFrame(drift_rows)
print(f"[INFO] Drift pairs computed: {len(drift_df)}")
display(drift_df.head())

[INFO] Drift pairs computed: 75


| | anchor | decade1 | decade2 | drift |
|---|---|---|---|---|
| 0 | (america, Democrat) | 1960 | 1970 | 0.228105 |
| 1 | (america, Democrat) | 1970 | 1980 | 0.080583 |
| 2 | (america, Democrat) | 1980 | 1990 | 0.077667 |
| 3 | (america, Democrat) | 1990 | 2000 | 0.073599 |
| 4 | (america, Democrat) | 2000 | 2010 | 0.073523 |
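
Since the grouping key is an `(anchor, party)` pair, the `anchor` column of the drift table holds tuples. If separate `anchor` and `party` columns are preferred for the exported CSV, one possible post-processing step is sketched below on a toy table that mirrors the first two rows above (`toy_drift` is an illustrative name; this step is not part of the original pipeline).

```python
import pandas as pd

# toy drift table with the same structure as drift_df
toy_drift = pd.DataFrame({
    "anchor": [("america", "Democrat"), ("america", "Democrat")],
    "decade1": [1960, 1970],
    "decade2": [1970, 1980],
    "drift": [0.228105, 0.080583],
})

# split each (anchor, party) tuple into two plain columns
pairs = toy_drift["anchor"].tolist()
toy_drift["party"] = [party for _, party in pairs]
toy_drift["anchor"] = [anchor for anchor, _ in pairs]

# reorder so the identifying columns come first
toy_drift = toy_drift[["anchor", "party", "decade1", "decade2", "drift"]]
print(toy_drift)
```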


### 5.2. Party Divergence

In [28]:
# === COSINE DISTANCE BETWEEN PARTIES (semantic divergence within a decade) ===

divergence_rows = []
for anchor, group in centroids_df.groupby(["anchor", "decade"]):
    anchor_name, decade = anchor
    parties = group["party"].unique()
    if "Republican" in parties and "Democrat" in parties:
        emb_r = group[group["party"]=="Republican"]["centroid"].values[0]
        emb_d = group[group["party"]=="Democrat"]["centroid"].values[0]
        dist = cosine_distances([emb_r],[emb_d])[0][0]
        divergence_rows.append({
            "anchor": anchor_name,
            "decade": decade,
            "divergence_RD": dist
        })

divergence_df = pd.DataFrame(divergence_rows)
print(f"[INFO] Party divergence rows: {len(divergence_df)}")
display(divergence_df.head())

[INFO] Party divergence rows: 41


| | anchor | decade | divergence_RD |
|---|---|---|---|
| 0 | america | 1960 | 0.18477 |
| 1 | america | 1970 | 0.061974 |
| 2 | america | 1980 | 0.050642 |
| 3 | america | 1990 | 0.035553 |
| 4 | america | 2000 | 0.026594 |


### 5.3. Exports

In [43]:
# === SAVE OUTPUTS ===

OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

drift_outfile = OUTPUT_PATH / "semantic_drift.csv"
divergence_outfile = OUTPUT_PATH / "party_divergence.csv"

drift_df.to_csv(drift_outfile, index=False)
divergence_df.to_csv(divergence_outfile, index=False)

print(f"[DONE] Saved drift results -> {drift_outfile}")
print(f"[DONE] Saved party divergence results -> {divergence_outfile}")

[DONE] Saved drift results -> /Users/emmamora/Documents/GitHub/thesis/data/ideological_drift/semantic_drift.csv
[DONE] Saved party divergence results -> /Users/emmamora/Documents/GitHub/thesis/data/ideological_drift/party_divergence.csv


## 6. Anchor-Theme Alignment

In [50]:
# === ADDITIONAL FILE PATHS ===

from sklearn.metrics.pairwise import cosine_similarity

THEME_IDX_PATH = DATA_DIR / "embeddings" / "debates_aligned_index.csv"
THEME_VEC_PATH = DATA_DIR / "embeddings" / "debates_aligned_sbert.npy"

MEDIA_IDX_PATH = DATA_DIR / "important" / "media_chunks_pred_balanced.csv"
MEDIA_VEC_PATH = DATA_DIR / "embeddings" / "debates_media_theme_alignment_processed" / "media_chunks_pred_balanced_sbert.npy"   

# load theme centroids
debate_theme_index = pd.read_csv(THEME_IDX_PATH)
debate_theme_vecs = np.load(THEME_VEC_PATH)

# load media chunks
media_index = pd.read_csv(MEDIA_IDX_PATH)
media_vecs = np.load(MEDIA_VEC_PATH)

print("[INFO] Debate themes (2010s–2020s):", debate_theme_index.shape)
print("[INFO] Media chunks (2010s–2020s):", media_index.shape)


[INFO] Debate themes (2010s–2020s): (2980, 5)
[INFO] Media chunks (2010s–2020s): (675, 20)


In [52]:
# === CHECK MEDIA INDEX-EMBEDDINGS ALIGNMENT ===

if len(media_index) == media_vecs.shape[0]:
    print("[SUCCESS] CSV and embeddings are aligned")
else:
    print("[WARNING] Mismatch! CSV rows vs embeddings do not align")

[SUCCESS] CSV and embeddings are aligned


In [54]:
# === PREP DEBATE THEME CENTROIDS (aligned) ===

from sklearn.metrics.pairwise import cosine_similarity

debate_theme_index = pd.read_csv(THEME_IDX_PATH)
debate_theme_vecs = np.load(THEME_VEC_PATH)

# add an integer decade column (no decade filter is applied in this cell)
debate_theme_index["decade"] = (debate_theme_index["year"] // 10) * 10
debate_themes = debate_theme_index.copy()
debate_themes["embedding"] = list(debate_theme_vecs)

print(f"[INFO] Debate theme centroids: {len(debate_themes)}")

[INFO] Debate theme centroids: 2980
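
The row count (2980) suggests that `debate_themes` holds one vector per aligned row rather than one vector per theme, whereas the media side is aggregated into theme-level centroids. If theme-level centroids are also wanted for debates, one way to build them is sketched below. The data are toy values, `toy_index`, `toy_vecs`, and `toy_centroids` are hypothetical names, and the sketch assumes the index has a `theme_name` column aligned row-by-row with the vector matrix.

```python
import numpy as np
import pandas as pd

# toy stand-ins for the aligned index and its vector matrix
toy_index = pd.DataFrame({
    "theme_name": ["taxes", "taxes", "immigration"],
    "year": [2012, 2016, 2016],
})
toy_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])

# GroupBy.indices maps each theme to the row positions of its members;
# averaging those rows of the matrix gives one centroid per theme
toy_centroids = pd.DataFrame([
    {"theme_name": theme, "embedding": toy_vecs[pos].mean(axis=0)}
    for theme, pos in toy_index.groupby("theme_name").indices.items()
])
print(toy_centroids)
```

With one row per theme, a top-3 lookup can no longer return the same theme label more than once.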


In [55]:
# === PREP MEDIA CHUNKS (balanced) ===

media_index = pd.read_csv(MEDIA_IDX_PATH)
media_vecs = np.load(MEDIA_VEC_PATH)

media_index["embedding"] = list(media_vecs)

# aggregate into theme-level centroids
media_themes = (
    media_index.groupby(["pred_theme", "year"])
    .apply(lambda g: np.mean(np.vstack(g["embedding"].values), axis=0))
    .reset_index()
    .rename(columns={0: "embedding"})
)
media_themes["decade"] = (media_themes["year"] // 10) * 10

print(f"[INFO] Media theme centroids: {len(media_themes)}")

[INFO] Media theme centroids: 36


In [56]:
# === FUNCTION TO COMPUTE TOP-N SIMILAR THEMES ===

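def top_similar(anchor_vec, theme_df, label_col="theme", topn=3):
    """Return the topn rows of theme_df most cosine-similar to anchor_vec (label and similarity columns)."""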
    sims = cosine_similarity([anchor_vec], np.vstack(theme_df["embedding"].values))[0]
    theme_df = theme_df.copy()
    theme_df["similarity"] = sims
    top_df = theme_df.sort_values("similarity", ascending=False).head(topn)
    return top_df[[label_col, "similarity"]]

In [57]:
# === MATCH ANCHORS TO THEMES ===

alignment_rows = []

for _, row in centroids_df.iterrows():
    anchor = row["anchor"]
    party = row["party"]
    decade = row["decade"]
    anchor_vec = row["centroid"]

    # debates
    debate_top = top_similar(anchor_vec, debate_themes, label_col="theme_name", topn=3)
    for _, drow in debate_top.iterrows():
        alignment_rows.append({
            "anchor": anchor,
            "party": party,
            "decade": decade,
            "domain": "debates",
            "theme": drow["theme_name"],
            "similarity": drow["similarity"]
        })

    # media
    media_top = top_similar(anchor_vec, media_themes, label_col="pred_theme", topn=3)
    for _, mrow in media_top.iterrows():
        alignment_rows.append({
            "anchor": anchor,
            "party": party,
            "decade": decade,
            "domain": "media",
            "theme": mrow["pred_theme"],
            "similarity": mrow["similarity"]
        })

alignment_df = pd.DataFrame(alignment_rows)


In [59]:
# === SAVE RESULTS ===

alignment_outfile = OUTPUT_PATH / "anchor_theme_alignment.csv"
alignment_df.to_csv(alignment_outfile, index=False)

print(f"[DONE] Saved anchor–theme alignment results -> {alignment_outfile}")
display(alignment_df.head(10))

[DONE] Saved anchor–theme alignment results -> /Users/emmamora/Documents/GitHub/thesis/data/ideological_drift/anchor_theme_alignment.csv


| | anchor | party | decade | domain | theme | similarity |
|---|---|---|---|---|---|---|
| 0 | america | Democrat | 1960 | debates | china_global_trade | 0.585159 |
| 1 | america | Democrat | 1960 | debates | china_global_trade | 0.57426 |
| 2 | america | Democrat | 1960 | debates | foreign_policy_national_security | 0.568403 |
| 3 | america | Democrat | 1960 | media | foreign_policy_national_security | 0.571128 |
| 4 | america | Democrat | 1960 | media | partisan_gridlock_new_leadership | 0.570139 |
| 5 | america | Democrat | 1960 | media | foreign_policy_national_security | 0.567073 |
| 6 | america | Democrat | 1970 | debates | government_spending_budget | 0.703842 |
| 7 | america | Democrat | 1970 | debates | partisan_gridlock_new_leadership | 0.669125 |
| 8 | america | Democrat | 1970 | debates | noise_or_unspecified | 0.661748 |
| 9 | america | Democrat | 1970 | media | partisan_gridlock_new_leadership | 0.716322 |
