# Aggregate Results for Final Findings Notebook

This notebook documents the **final aggregation pipeline** for debates and media datasets in the BSc thesis:  
`Debates, Media, and Discourse: A Computational Analysis of Temporal Shifts in U.S. Presidential Debates and Media Framing Across the Political Spectrum`, written by **Emma Cristina Mora** (emma.mora@studbocconi.it) at **Bocconi University** under the supervision of **Professor Carlo Rasmus Schwarz**.  

The objective of this notebook is to **consolidate all intermediate analyses** (themes, sentiment & emotions, framing, rhetoric, and ideology) into two coherent datasets:  
- **debates_full.csv** → utterance-level dataset for debates (1960–2024)  
- **media_full.csv** → chunk-level dataset for media coverage (2012–2024)  

This unified structure will support the **results and interpretation stage** of the thesis, enabling comparisons across domains, decades, and parties.  

**Dataset Preparation**  
- **Debates**:  
  - Base: `debates_df_themes.csv` (utterance-level topics & subtopics)  
  - Merged with:  
    - `debates_frames_simple.csv` (framing labels)  
    - `debates_sentiment_emotions.csv` (sentiment & emotions)  
    - `debates_rhetoric.csv` (Benoit model labels)  
    - `debates_ideology.csv` (Political Compass scores)  
  - Harmonization: token-based `utterance_id` (hash of normalized text) used as join key.  
  - Confidence-aware deduplication: kept rows with higher margins/probabilities when duplicates occurred.  
  - Final schema includes: `utterance_id, year, speaker, party_code, text, token_count, theme, subtheme, sentiment, emotion, framing, rhetoric, ideology_econ, ideology_soc`.  

- **Media**:  
  - Base: `media_chunks_pred_balanced.csv` (balanced per outlet and theme)  
  - Merged with:  
    - `media_sentiment_emotions.csv`  
    - `media_frames_simple.csv`  
  - Assigned chunk-level IDs (`chunk_id`) and harmonized theme/subtheme labels.  
  - Final schema includes: `chunk_id, year, outlet, outlet_leaning, text, source_theme, theme, subtheme, sentiment, emotion, framing`.  

**Outputs**  
- `debates_full.csv`: comprehensive debate dataset covering 6,316 utterance rows  
- `media_full.csv`: comprehensive balanced media dataset covering 675 chunks  
- Distribution snapshots exported separately for quick reporting:  
  - `distribution_frames.csv`  
  - `distribution_rhetoric.csv`  
  - `distribution_sentiment.csv`  
  - `distribution_emotion.csv`  

**Notebook Contribution**  
This notebook **concludes the preprocessing and aggregation stage** of the thesis by:  
- Producing **final clean datasets** for debates and media  
- Ensuring **alignment across analytical dimensions** (themes, sentiment and emotions, framing, rhetoric, ideology)  
- Creating **stable IDs and harmonized schemas** for consistent cross-domain analysis  

These datasets will serve as the **foundation for the Results chapter**, where the temporal, partisan, and cross-domain dynamics of political discourse will be interpreted.  

## 1. Introduction and Config

In [3]:
# === SETUP ===

# standard libraries
from pathlib import Path
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
import json

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns

In [27]:
# === FILE PATHS ===

# set base repository path (assumes notebook is in repo/notebooks/)
REPO_DIR = Path(".").resolve().parents[0]

# debates paths
DATA_DIR = REPO_DIR / "data" 
DEBATES_THEMES = DATA_DIR / "debates_df_themes.csv"
DEBATES_FRAMING = DATA_DIR / "frames" / "debates_frames_simple.csv"
DEBATES_SENTIMENT_EMOTION = DATA_DIR / "sentiment_emotion" / "debates_sentiment_emotions.csv"
DEBATES_RHETORIC = DATA_DIR / "rhetoric" / "debates_rhetoric.csv"
DEBATES_IDEOLOGY = DATA_DIR / "ideological_drift" / "debates_ideology.csv"

# media paths
MEDIA_THEMES = DATA_DIR / "important" / "media_chunks_pred_balanced.csv"
MEDIA_FRAMING = DATA_DIR / "frames" / "media_frames_simple.csv"
MEDIA_SENTIMENT_EMOTION = DATA_DIR / "sentiment_emotion" / "media_sentiment_emotions.csv"

# confirm setup
print("Repository Path:", REPO_DIR)
print("Data Directory:", DATA_DIR)
print("Debates Themes Path:", DEBATES_THEMES)
print("Media Themes Path:", MEDIA_THEMES)
print("Debates Framing Path:", DEBATES_FRAMING)
print("Media Framing Path:", MEDIA_FRAMING)
print("Debates Sentiment/Emotion Path:", DEBATES_SENTIMENT_EMOTION)
print("Media Sentiment/Emotion Path:", MEDIA_SENTIMENT_EMOTION)
print("Debates Rhetoric Path:", DEBATES_RHETORIC)
print("Debates Ideology Path:", DEBATES_IDEOLOGY)

Repository Path: /Users/emmamora/Documents/GitHub/thesis
Data Directory: /Users/emmamora/Documents/GitHub/thesis/data
Debates Themes Path: /Users/emmamora/Documents/GitHub/thesis/data/debates_df_themes.csv
Media Themes Path: /Users/emmamora/Documents/GitHub/thesis/data/important/media_chunks_pred_balanced.csv
Debates Framing Path: /Users/emmamora/Documents/GitHub/thesis/data/frames/debates_frames_simple.csv
Media Framing Path: /Users/emmamora/Documents/GitHub/thesis/data/frames/media_frames_simple.csv
Debates Sentiment/Emotion Path: /Users/emmamora/Documents/GitHub/thesis/data/sentiment_emotion/debates_sentiment_emotions.csv
Media Sentiment/Emotion Path: /Users/emmamora/Documents/GitHub/thesis/data/sentiment_emotion/media_sentiment_emotions.csv
Debates Rhetoric Path: /Users/emmamora/Documents/GitHub/thesis/data/rhetoric/debates_rhetoric.csv
Debates Ideology Path: /Users/emmamora/Documents/GitHub/thesis/data/ideological_drift/debates_ideology.csv


## 2. Debates_Full Dataset Creation

In [28]:
# === HELPER FUNCTIONS ===

# normalize and hash text into a stable token-based uid
import hashlib

def _normalize_text_for_uid(s: str) -> str:
    # lowercase, strip punctuation (keep apostrophes), collapse whitespace
    s = s.lower()
    s = re.sub(r"[^\w\s']", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def token_uid(text: str) -> str:
    # sha1 hash of normalized text to create a deterministic uid
    if pd.isna(text):
        return np.nan
    norm = _normalize_text_for_uid(str(text))
    return hashlib.sha1(norm.encode("utf-8")).hexdigest()

def make_text_index(s: str, k: int = 7) -> str:
    # short human-readable index from first k words (optional debugging)
    if pd.isna(s):
        return np.nan
    s = _normalize_text_for_uid(str(s))
    return " ".join(s.split()[:k])
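
As a quick illustration of why the hashed key is stable, the sketch below re-implements the same normalization with the standard library only. The names `normalize` and `uid` are hypothetical stand-ins for `_normalize_text_for_uid` and `token_uid`, and the sample strings are made up for the example.

```python
import hashlib
import re

def normalize(s: str) -> str:
    # same three steps as the helper above: lowercase, punctuation -> space, collapse whitespace
    s = s.lower()
    s = re.sub(r"[^\w\s']", " ", s)
    return re.sub(r"\s+", " ", s).strip()

def uid(text: str) -> str:
    # sha1 of the normalized text, as a 40-character hex string
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()

a = uid("Good evening.  The television and radio stations...")
b = uid("good evening the television and radio stations")
c = uid("Good evening, the radio and television stations")

print(a == b)  # True: case, punctuation and spacing do not affect the key
print(a == c)  # False: a different word order gives a different key
```

One consequence of this design is that two utterances with identical normalized text (for example, a repeated moderator phrase) receive the same key, which is why the duplicate checks further down matter.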

In [29]:
# === LOAD DEBATES DATA ===

# load the debates datasets
themes = pd.read_csv(DEBATES_THEMES)
frames = pd.read_csv(DEBATES_FRAMING)
sentiment = pd.read_csv(DEBATES_SENTIMENT_EMOTION)
rhetoric = pd.read_csv(DEBATES_RHETORIC)
ideology = pd.read_csv(DEBATES_IDEOLOGY)

# add token_uid and a human-friendly text_index to each df
for df in [themes, frames, sentiment, rhetoric, ideology]:
    if "text" in df.columns:
        df["token_uid"] = df["text"].map(token_uid)
        df["text_index"] = df["text"].map(make_text_index)
    else:
        df["token_uid"] = np.nan
        df["text_index"] = np.nan

print("themes:", themes.shape)
print("frames:", frames.shape)
print("sentiment:", sentiment.shape)
print("rhetoric:", rhetoric.shape)
print("ideology:", ideology.shape)

themes: (6316, 22)
frames: (6316, 28)
sentiment: (6316, 10)
rhetoric: (6316, 10)
ideology: (6316, 14)


In [30]:
# === QUICK SANITY CHECKS ON KEYS ===

# check nulls and duplicates of token_uid per dataset
def key_report(name, df):
    n = len(df)
    nulls = int(df["token_uid"].isna().sum())
    dups = int(df["token_uid"].duplicated(keep=False).sum())
    uniq = int(df["token_uid"].nunique(dropna=True))
    print(f"{name:>28} | rows={n:>7} | null_uid={nulls:>6} | dup_uid_rows={dups:>6} | unique_uids={uniq:>7}")

for name, df in [
    ("debates_df_themes.csv", themes),
    ("debates_frames_simple.csv", frames),
    ("debates_sentiment_emotions.csv", sentiment),
    ("debates_rhetoric.csv", rhetoric),
    ("debates_ideology.csv", ideology),
]:
    key_report(name, df)

       debates_df_themes.csv | rows=   6316 | null_uid=     0 | dup_uid_rows=    44 | unique_uids=   6293
   debates_frames_simple.csv | rows=   6316 | null_uid=     0 | dup_uid_rows=    44 | unique_uids=   6293
debates_sentiment_emotions.csv | rows=   6316 | null_uid=     0 | dup_uid_rows=    44 | unique_uids=   6293
        debates_rhetoric.csv | rows=   6316 | null_uid=     0 | dup_uid_rows=    44 | unique_uids=   6293
        debates_ideology.csv | rows=   6316 | null_uid=     0 | dup_uid_rows=    44 | unique_uids=   6293
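
Read together, the counts in the report imply 6316 − 6293 = 23 surplus rows and 44 − 23 = 21 distinct uids that occur more than once. A minimal sketch of how such duplicate groups could be inspected is given below; the toy frame and its values are invented for illustration and do not come from the thesis data.

```python
import pandas as pd

# toy data: u2 appears twice, u3 three times
toy = pd.DataFrame({
    "token_uid": ["u1", "u2", "u2", "u3", "u3", "u3"],
    "speaker": ["Kennedy", "Moderator", "Moderator", "Nixon", "Nixon", "Moderator"],
})

# rows whose uid occurs more than once (same definition as dup_uid_rows in key_report)
dup_rows = toy[toy["token_uid"].duplicated(keep=False)]

# size of each duplicate group
group_sizes = dup_rows.groupby("token_uid").size()

n_dup_rows = len(dup_rows)
n_unique = toy["token_uid"].nunique()
n_surplus = len(toy) - n_unique  # rows that a one-row-per-uid dedup would remove

print(group_sizes)
print(n_dup_rows, n_unique, n_surplus)
```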


In [31]:
# === PREPARE BASE DATA (THEMES) ===

# select base columns from themes and keep token_uid as the primary merge key
base_cols = [
    "token_uid", "utterance_id", "year", "speaker", "party_code",
    "text", "token_count", "theme_name", "subtheme_name", "debate_id"
]
base = themes[base_cols].copy()

# extract debate_number from debate_id when available
base["debate_number"] = base["debate_id"].astype(str).str.extract(r"(\d+)")

print("base rows:", base.shape[0])

base rows: 6316


In [32]:
# === DEDUP AUX TABLES ON TOKEN_UID (CONFIDENCE-AWARE WHERE POSSIBLE) ===

# helper to drop duplicates on token_uid keeping the most confident row when scores exist
def dedupe_on_uid(df, sort_cols=None, ascending=None):
    if sort_cols is not None:
        df = df.sort_values(by=sort_cols, ascending=ascending, na_position="last")
    # keep the first row per uid after sorting (i.e., highest confidence)
    return df.drop_duplicates(subset=["token_uid"], keep="first")

# frames -> prefer higher margin, then higher best_score
frames_cols_keep = ["token_uid", "frame_final", "best_score", "margin"]
frames_dedup = dedupe_on_uid(
    frames[frames_cols_keep].copy(),
    sort_cols=["margin", "best_score"], ascending=[False, False]
)

# sentiment -> prefer higher sentiment_score, then higher emotion_score
sent_cols_keep = ["token_uid", "sentiment_final", "sentiment_score", "emotion_final", "emotion_score"]
sentiment_dedup = dedupe_on_uid(
    sentiment[sent_cols_keep].copy(),
    sort_cols=["sentiment_score", "emotion_score"], ascending=[False, False]
)

# rhetoric -> no explicit score columns, just drop duplicate uids
rhet_cols_keep = ["token_uid", "rhetoric_label"]
rhetoric_dedup = dedupe_on_uid(rhetoric[rhet_cols_keep].copy())

# ideology -> prefer rows with both econ and soc present, then smaller std if available
ideo = ideology.copy()
ideo["has_both"] = ideo["econ"].notna().astype(int) + ideo["soc"].notna().astype(int)
if {"econ_std", "soc_std"}.issubset(ideo.columns):
    ideo["std_sum"] = ideo[["econ_std", "soc_std"]].sum(axis=1, min_count=1)
    sort_cols = ["has_both", "std_sum"]
    ascending = [False, True]
else:
    sort_cols = ["has_both"]
    ascending = [False]

ideo_cols_keep = ["token_uid", "econ", "soc"]
if "econ_std" in ideo.columns: ideo_cols_keep.append("econ_std")
if "soc_std" in ideo.columns: ideo_cols_keep.append("soc_std")

ideology_dedup = dedupe_on_uid(
    ideo[ideo_cols_keep + (["has_both", "std_sum"] if "std_sum" in ideo.columns else ["has_both"])].copy(),
    sort_cols=sort_cols, ascending=ascending
)[["token_uid", "econ", "soc"]]  # trim helper cols
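
To make the confidence-aware rule concrete, the sketch below applies the same sort-then-drop pattern to a small invented frames table. The column names follow the cell above; the rows and scores are made up.

```python
import numpy as np
import pandas as pd

toy_frames = pd.DataFrame({
    "token_uid":   ["u1", "u1", "u2", "u2"],
    "frame_final": ["unspecified", "economic consequences",
                    "security and safety", "morality and ethics"],
    "margin":      [0.10, 0.30, np.nan, 0.05],
    "best_score":  [0.90, 0.60, 0.80, 0.40],
})

kept = (
    toy_frames
    # highest margin first, ties broken by best_score; missing scores go last
    .sort_values(by=["margin", "best_score"], ascending=[False, False], na_position="last")
    # the first row per uid is now the most confident one
    .drop_duplicates(subset=["token_uid"], keep="first")
    .sort_values("token_uid")
    .reset_index(drop=True)
)

print(kept[["token_uid", "frame_final", "margin"]])
```

Here `u1` keeps the row with margin 0.30, and `u2` keeps the row with margin 0.05 because the competing row has no margin at all.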

In [33]:
# === MERGE ON TOKEN_UID ===

# start from base (themes) and left-join the other tables by token_uid
merged = base.merge(
    sentiment_dedup[["token_uid", "sentiment_final", "emotion_final"]],
    on="token_uid", how="left"
).merge(
    frames_dedup[["token_uid", "frame_final"]],
    on="token_uid", how="left"
).merge(
    rhetoric_dedup[["token_uid", "rhetoric_label"]],
    on="token_uid", how="left"
).merge(
    ideology_dedup[["token_uid", "econ", "soc"]],
    on="token_uid", how="left"
)

print("merged rows:", merged.shape[0])

merged rows: 6316
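
The row count is preserved because every auxiliary table holds at most one row per `token_uid`. As an optional safeguard, pandas can enforce that assumption at merge time. The sketch below uses invented toy tables and shows the `validate` and `indicator` arguments of `DataFrame.merge`; it is an illustration, not part of the original pipeline.

```python
import pandas as pd

toy_base = pd.DataFrame({
    "token_uid": ["u1", "u2", "u2", "u3"],
    "year": [1960, 1960, 1964, 2024],
})
toy_aux = pd.DataFrame({
    "token_uid": ["u1", "u2"],
    "rhetoric_label": ["acclaim", "attack"],
})

# many_to_one: the right-hand keys must be unique, otherwise MergeError is raised
toy_merged = toy_base.merge(
    toy_aux, on="token_uid", how="left",
    validate="many_to_one", indicator=True,
)

# base rows that found no match in the auxiliary table
n_unmatched = int((toy_merged["_merge"] == "left_only").sum())
print(len(toy_merged), n_unmatched)

# the same merge fails loudly when the auxiliary table still has duplicate keys
try:
    toy_base.merge(pd.concat([toy_aux, toy_aux]), on="token_uid",
                   how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```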


In [34]:
# === CLEAN / RENAME COLUMNS TO FINAL SCHEMA ===

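# rename to final names and map token_uid to the new utterance_id
# note: base already carries the original utterance_id from the themes file, so after
# this rename two columns share that name and the selection below returns both
# (hence 15 columns: the hash-based id first, the original id second)
merged = merged.rename(columns={
    "token_uid": "utterance_id",  # new, reliable id based on text tokens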
    "theme_name": "theme",
    "subtheme_name": "subtheme",
    "sentiment_final": "sentiment",
    "emotion_final": "emotion",
    "frame_final": "framing",
    "rhetoric_label": "rhetoric",
    "econ": "ideology_econ",
    "soc": "ideology_soc"
})

# select and order final columns
final_cols = [
    "utterance_id", "year", "speaker", "party_code",
    "text", "token_count", "theme", "subtheme",
    "sentiment", "emotion", "framing", "rhetoric",
    "ideology_econ", "ideology_soc"
]
debates_full = merged[final_cols].copy()

print("final debates_full shape:", debates_full.shape)

final debates_full shape: (6316, 15)


In [38]:
debates_full

Unnamed: 0,utterance_id,utterance_id.1,year,speaker,party_code,text,token_count,theme,subtheme,sentiment,emotion,framing,rhetoric,ideology_econ,ideology_soc
0,f25967e9ebde6a67d7e07f5e1f6dab708827bdfd,1960_1_Presidential_Nixon_Kennedy_001,1960,Moderator,,good evening. the television and radio station...,146,debate_format_procedure,,positive,joy,unspecified,,,
1,263f28271313bb0ea8d815486ae0b9bfc5403e66,1960_1_Presidential_Nixon_Kennedy_002,1960,Kennedy,D,"mr. smith, mr. nixon. in the election of 1860,...",1290,foreign_policy_national_security,historical_foreign_policy_cuba_vietnam,neutral,unspecified,unspecified,acclaim,-0.45,0.05
2,b0d8772ca1e4a0ef0bf21bbc46824a6b4f888812,1960_1_Presidential_Nixon_Kennedy_004,1960,Nixon,R,"mr. smith, senator kennedy. the things that se...",1406,foreign_policy_national_security,historical_foreign_policy_cuba_vietnam,neutral,unspecified,unspecified,defense,0.55,0.10
3,d67300aa39ec063e5dd36b5a05a0ef3121909386,1960_1_Presidential_Nixon_Kennedy_005,1960,Moderator,,"thank you, mr. nixon. that completes the openi...",66,debate_format_procedure,,positive,joy,unspecified,,,
4,f3cc3ad77b299c9d5dbf579b752f7af515d96397,1960_1_Presidential_Nixon_Kennedy_006,1960,Moderator,,"senator, the vice president in his campaign ha...",41,leadership_executive_experience,social_security_pensions,negative,unspecified,unspecified,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6311,b895f43a4439e264fdef041836fa2abeff861ae8,2024_3_Vice_presidential_Vance_Walz_136,2024,Walz,D,"please. yeah, well, i don't run facebook. what...",200,partisan_gridlock_new_leadership,,negative,unspecified,morality and ethics,attack,0.00,0.40
6312,3682ec0098e410d4ed834a526257b7c5eafd5a26,2024_3_Vice_presidential_Vance_Walz_137,2024,Moderator,,"governor, your time is up. thank you, gentleme...",68,debate_format_procedure,,positive,joy,unspecified,,,
6313,0effb049509ae9b2d35a557d8ff3f6cdb4127928,2024_3_Vice_presidential_Vance_Walz_138,2024,Walz,D,"well, thank you, senator vance. thank you to c...",404,partisan_gridlock_new_leadership,,positive,joy,unspecified,acclaim,-0.60,-0.40
6314,0f98f9bc9f9ff512fd2685b214122a7dcc8bd853,2024_3_Vice_presidential_Vance_Walz_140,2024,Vance,R,"well, i want to thank governor walz, you folks...",474,noise_or_unspecified,,neutral,joy,unspecified,attack,0.50,0.20


In [22]:
# === VALIDATION SNAPSHOT ===

# quick checks to ensure we didn't lose base rows and to inspect missing rates
print("base rows:", len(base))
print("debates_full rows:", len(debates_full))

missing_report = {
    "sentiment_missing": int(debates_full["sentiment"].isna().sum()),
    "emotion_missing": int(debates_full["emotion"].isna().sum()),
    "framing_missing": int(debates_full["framing"].isna().sum()),
    "rhetoric_missing": int(debates_full["rhetoric"].isna().sum()),
    "ideology_econ_missing": int(debates_full["ideology_econ"].isna().sum()),
    "ideology_soc_missing": int(debates_full["ideology_soc"].isna().sum()),
}
print("missing fields:", missing_report)

base rows: 6316
debates_full rows: 6316
missing fields: {'sentiment_missing': 0, 'emotion_missing': 0, 'framing_missing': 0, 'rhetoric_missing': 1552, 'ideology_econ_missing': 2473, 'ideology_soc_missing': 2473}
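
In the preview above, moderator rows carry no rhetoric or ideology values, which may account for part of the missing counts. A breakdown of missingness by speaker would confirm this; the sketch below shows one way to compute it on an invented toy frame (the values are illustrative only).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "speaker": ["Moderator", "Kennedy", "Nixon", "Moderator"],
    "rhetoric": [np.nan, "acclaim", "defense", np.nan],
    "ideology_econ": [np.nan, -0.45, 0.55, np.nan],
})

# share of missing values per speaker for each analytical column
missing_share = (
    toy[["rhetoric", "ideology_econ"]]
    .isna()
    .groupby(toy["speaker"])
    .mean()
)
print(missing_share)
```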


In [23]:
# === SAVE DEBATES_FULL ===

# write final merged file
OUTPUT_PATH = DATA_DIR / "debates_full.csv"
debates_full.to_csv(OUTPUT_PATH, index=False)
print("saved:", OUTPUT_PATH)

saved: /Users/emmamora/Documents/GitHub/thesis/data/debates_full.csv


In [24]:
# === SAVE DISTRIBUTIONS ===

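# compute and save distributions to csv for quick reporting
# note: these are computed on the deduplicated tables (one row per token_uid),
# so the counts sum to the number of unique uids rather than to len(debates_full)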
dist_frames = frames_dedup["frame_final"].value_counts(dropna=False).reset_index()
dist_frames.columns = ["frame", "count"]
dist_frames_path = DATA_DIR / "distribution_frames.csv"
dist_frames.to_csv(dist_frames_path, index=False)

dist_rhetoric = rhetoric_dedup["rhetoric_label"].value_counts(dropna=False).reset_index()
dist_rhetoric.columns = ["rhetoric", "count"]
dist_rhetoric_path = DATA_DIR / "distribution_rhetoric.csv"
dist_rhetoric.to_csv(dist_rhetoric_path, index=False)

dist_sentiment = sentiment_dedup["sentiment_final"].value_counts(dropna=False).reset_index()
dist_sentiment.columns = ["sentiment", "count"]
dist_sentiment_path = DATA_DIR / "distribution_sentiment.csv"
dist_sentiment.to_csv(dist_sentiment_path, index=False)

dist_emotion = sentiment_dedup["emotion_final"].value_counts(dropna=False).reset_index()
dist_emotion.columns = ["emotion", "count"]
dist_emotion_path = DATA_DIR / "distribution_emotion.csv"
dist_emotion.to_csv(dist_emotion_path, index=False)

print("saved distributions:")
print(" -", dist_frames_path)
print(" -", dist_rhetoric_path)
print(" -", dist_sentiment_path)
print(" -", dist_emotion_path)

saved distributions:
 - /Users/emmamora/Documents/GitHub/thesis/data/distribution_frames.csv
 - /Users/emmamora/Documents/GitHub/thesis/data/distribution_rhetoric.csv
 - /Users/emmamora/Documents/GitHub/thesis/data/distribution_sentiment.csv
 - /Users/emmamora/Documents/GitHub/thesis/data/distribution_emotion.csv
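
For reporting, relative shares are often easier to compare than raw counts. The sketch below adds a `share` column to a distribution table built the same way as above; the input labels are invented, and the `share` column is an addition that the exported files do not contain.

```python
import pandas as pd

labels = pd.Series(["neutral", "neutral", "negative", "positive"], name="sentiment_final")

# counts per label, including missing values if there are any
counts = labels.value_counts(dropna=False)
dist = pd.DataFrame({"sentiment": counts.index, "count": counts.values})

# relative frequency of each label
dist["share"] = dist["count"] / dist["count"].sum()
print(dist)
```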


## 3. Media_Balanced Dataset Creation

In [39]:
# === LOAD MEDIA DATA ===

# load base chunks, sentiment/emotion, and framing
media_chunks = pd.read_csv(MEDIA_THEMES)
media_sentiment = pd.read_csv(MEDIA_SENTIMENT_EMOTION)
media_frames = pd.read_csv(MEDIA_FRAMING)

print("Media Chunks Shape:", media_chunks.shape)
print("Media Sentiment Shape:", media_sentiment.shape)
print("Media Frames Shape:", media_frames.shape)

Media Chunks Shape: (675, 20)
Media Sentiment Shape: (675, 8)
Media Frames Shape: (675, 25)


In [44]:
# === CREATE ROW IDS FOR ALIGNMENT ===

# assign artificial row numbers
# note: joining on row position assumes all three files list the chunks in the same order
media_chunks = media_chunks.reset_index(drop=True)
media_sentiment = media_sentiment.reset_index(drop=True)
media_frames = media_frames.reset_index(drop=True)

media_chunks["row_id"] = media_chunks.index + 1
media_sentiment["row_id"] = media_sentiment.index + 1
media_frames["row_id"] = media_frames.index + 1
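
Because the join key here is the row position, it can be worth confirming that the files really are in the same order before merging. The sketch below is one possible check. It assumes the auxiliary files also carry the chunk text under a shared column name; `chunk_text` is confirmed only for the base file, so treat the column name as an assumption. The helper `rows_aligned` is hypothetical.

```python
import pandas as pd

def rows_aligned(left: pd.DataFrame, right: pd.DataFrame, col: str) -> bool:
    """True when both frames have the same length and identical values in `col`, row by row.

    Missing values never compare equal, so a NaN in `col` makes the check fail.
    """
    if len(left) != len(right):
        return False
    same = left[col].reset_index(drop=True) == right[col].reset_index(drop=True)
    return bool(same.all())

# toy data standing in for the media files
chunks = pd.DataFrame({"chunk_text": ["a b c", "d e f", "g h i"]})
same_order = pd.DataFrame({"chunk_text": ["a b c", "d e f", "g h i"]})
shuffled = pd.DataFrame({"chunk_text": ["d e f", "a b c", "g h i"]})

print(rows_aligned(chunks, same_order, "chunk_text"))  # True
print(rows_aligned(chunks, shuffled, "chunk_text"))    # False
```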

In [45]:
# === PREPARE BASE MEDIA ===

# keep relevant columns and rename text
media = media_chunks.rename(columns={"chunk_text": "text"})

# assign theme
media["theme"] = media.apply(
    lambda r: r["pred_theme"] if r["pred_sim"] > 0.5 else "noise_or_unspecified",
    axis=1
)

# assign subtheme
media["subtheme"] = media.apply(
    lambda r: r["pred_subtheme"]
    if (r["pred_sub_sim"] > 0.5 and r["theme"] != "noise_or_unspecified")
    else pd.NA,
    axis=1
)

# keep base columns
media = media[["row_id", "year", "outlet", "outlet_leaning", "text", "source_theme", "theme", "subtheme"]]
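
The two thresholding rules above can also be written without row-wise `apply`. The sketch below is a vectorized equivalent using `Series.where` on an invented toy frame; the column names and the 0.5 cut-off follow the cell above, and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "pred_theme": ["tax_policy", "healthcare_social_security", "tax_policy"],
    "pred_sim": [0.72, 0.41, 0.50],
    "pred_subtheme": ["tax_cuts_policy_proposals",
                      "affordable_care_health_insurance",
                      "tax_cuts_policy_proposals"],
    "pred_sub_sim": [0.80, 0.90, 0.40],
})

# keep the predicted theme only when similarity is strictly above 0.5
toy["theme"] = toy["pred_theme"].where(toy["pred_sim"] > 0.5, "noise_or_unspecified")

# keep the subtheme only when it is confident and the theme itself was kept
keep_sub = (toy["pred_sub_sim"] > 0.5) & (toy["theme"] != "noise_or_unspecified")
toy["subtheme"] = toy["pred_subtheme"].where(keep_sub, pd.NA)

print(toy[["theme", "subtheme"]])
```

Note that a similarity of exactly 0.50 falls below the cut-off, since the comparison is strict.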

In [46]:
# === MERGE SENTIMENT, EMOTION, FRAMING ===

# sentiment + emotion
media = media.merge(
    media_sentiment[["row_id", "sentiment_final", "emotion_final"]],
    on="row_id",
    how="left"
)

# framing
media = media.merge(
    media_frames[["row_id", "frame_final"]],
    on="row_id",
    how="left"
)

# rename
media = media.rename(columns={
    "sentiment_final": "sentiment",
    "emotion_final": "emotion",
    "frame_final": "framing"
})

In [48]:
# === FINALIZE MEDIA FULL ===

# assign sequential chunk ids (the helper row_id is dropped by the column selection below)
media = media.reset_index(drop=True)
media["chunk_id"] = media.index + 1
media["chunk_id"] = media["chunk_id"].apply(lambda x: f"med_{x:06d}")

# reorder columns
media_final = media[[
    "chunk_id", "year", "outlet", "outlet_leaning", "text",
    "source_theme", "theme", "subtheme", "sentiment", "emotion", "framing"
]]

print(media_final.shape)
media_final.head(7)

(675, 11)


Unnamed: 0,chunk_id,year,outlet,outlet_leaning,text,source_theme,theme,subtheme,sentiment,emotion,framing
0,med_000001,2016,nyp,R,"stone said. jonathan gruber, the mit professor...",healthcare_public_health,healthcare_social_security,affordable_care_health_insurance,unspecified,unspecified,unspecified
1,med_000002,2016,nyp,R,"""we're thinking of having him in the spin room...",healthcare_public_health,healthcare_social_security,affordable_care_health_insurance,neutral,unspecified,economic consequences
2,med_000003,2024,nyp,R,iowa caucuses. trump was right when he accused...,foreign_policy_national_security,judiciary_supreme_court,abortion_constitutional_amendments,neutral,unspecified,unspecified
3,med_000004,2016,nyp,R,if the government stops fighting a lawsuit tha...,healthcare_public_health,healthcare_social_security,affordable_care_health_insurance,negative,unspecified,economic consequences
4,med_000005,2020,nyp,R,foreign desk: stopping the ayatollahs' nukes n...,foreign_policy_national_security,foreign_policy_national_security,iran_nuclear_program,neutral,unspecified,economic consequences
5,med_000006,2024,nyp,R,"threats."" two days later, the russians request...",healthcare_public_health,foreign_policy_national_security,russia_soviet_union,negative,anger,unspecified
6,med_000007,2024,nyp,R,"two days later, the russians requested another...",foreign_policy_national_security,foreign_policy_national_security,russia_soviet_union,unspecified,unspecified,unspecified


In [50]:
# === RENUMBER CHUNK IDS WITH METADATA ===

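# reload the media dataset from disk; this assumes media_final was already written
# to media_full.csv, since no cell above saves it
df = pd.read_csv(DATA_DIR / "media_full.csv")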

# simplify source_theme (take prefix before "_")
df["theme_simple"] = df["source_theme"].str.split("_").str[0]

# sort for stable numbering
df = df.sort_values(["year", "outlet", "theme_simple"]).reset_index(drop=True)

# assign group counter
df["counter"] = df.groupby(["year", "outlet", "theme_simple"]).cumcount() + 1
df["counter"] = df["counter"].apply(lambda x: f"{x:03d}")

# build new chunk_id
df["chunk_id"] = (
    df["year"].astype(str)
    + "_" + df["outlet"]
    + "_" + df["theme_simple"]
    + "_" + df["counter"]
)

# drop helper cols
df = df.drop(columns=["theme_simple", "counter"])

# save updated file
MEDIA_FULL = DATA_DIR / "media_full.csv"
df.to_csv(MEDIA_FULL, index=False)

print("Updated media_full.csv with new chunk_id:", MEDIA_FULL)
df.head(10)

Updated media_full.csv with new chunk_id: /Users/emmamora/Documents/GitHub/thesis/data/media_full.csv


Unnamed: 0,chunk_id,year,outlet,outlet_leaning,text,source_theme,theme,subtheme,sentiment,emotion,framing
0,2012_nyp_foreign_001,2012,nyp,R,"the debate, in boca raton, fla., is supposed t...",foreign_policy_national_security,foreign_policy_national_security,patriot_act_homeland_security,neutral,unspecified,security and safety
1,2012_nyp_foreign_002,2012,nyp,R,"the debate, in boca raton, fla., is supposed t...",foreign_policy_national_security,foreign_policy_national_security,patriot_act_homeland_security,neutral,unspecified,security and safety
2,2012_nyp_foreign_003,2012,nyp,R,deductions and loopholes; he hasn't been able ...,foreign_policy_national_security,tax_policy,tax_cuts_policy_proposals,negative,unspecified,economic consequences
3,2012_nyp_foreign_004,2012,nyp,R,repeatedly hit romney for turning medicare int...,foreign_policy_national_security,healthcare_social_security,affordable_care_health_insurance,negative,unspecified,economic consequences
4,2012_nyp_foreign_005,2012,nyp,R,winning the first debate is one thing; the ele...,foreign_policy_national_security,debate_format_procedure,debate_opening_closing_remarks,unspecified,unspecified,economic consequences
5,2012_nyp_foreign_006,2012,nyp,R,"his term, saying iran ""saw weakness where it h...",foreign_policy_national_security,foreign_policy_national_security,iran_nuclear_program,unspecified,joy,security and safety
6,2012_nyp_foreign_007,2012,nyp,R,evidence the motive was terrorism. obama said ...,foreign_policy_national_security,foreign_policy_national_security,patriot_act_homeland_security,negative,anger,security and safety
7,2012_nyp_foreign_008,2012,nyp,R,evidence the motive was terrorism. obama said ...,foreign_policy_national_security,foreign_policy_national_security,patriot_act_homeland_security,negative,anger,security and safety
8,2012_nyp_foreign_009,2012,nyp,R,"""netanyahu isn't stupid. he isn't going to sta...",foreign_policy_national_security,foreign_policy_national_security,iran_nuclear_program,neutral,unspecified,unspecified
9,2012_nyp_foreign_010,2012,nyp,R,"in the last three years, food and gas prices h...",foreign_policy_national_security,partisan_gridlock_new_leadership,race_in_political_discourse,negative,unspecified,economic consequences
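
The id scheme above is `year_outlet_theme_counter`, with the counter restarting inside each (year, outlet, theme prefix) group. The sketch below reproduces it on an invented toy frame and adds a uniqueness check; zero-padding with `str.zfill` is a substitution for the `apply`-based formatting used in the cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2016, 2012, 2012, 2012],
    "outlet": ["nyp", "nyp", "nyp", "nyp"],
    "source_theme": [
        "healthcare_public_health",
        "foreign_policy_national_security",
        "foreign_policy_national_security",
        "healthcare_public_health",
    ],
})

# first token of the source theme, e.g. "foreign" or "healthcare"
toy["theme_simple"] = toy["source_theme"].str.split("_").str[0]
toy = toy.sort_values(["year", "outlet", "theme_simple"]).reset_index(drop=True)

# running counter within each (year, outlet, theme_simple) group, zero-padded to 3 digits
counter = toy.groupby(["year", "outlet", "theme_simple"]).cumcount() + 1
toy["chunk_id"] = (
    toy["year"].astype(str)
    + "_" + toy["outlet"]
    + "_" + toy["theme_simple"]
    + "_" + counter.astype(str).str.zfill(3)
)

print(toy["chunk_id"].tolist())
print(toy["chunk_id"].is_unique)
```

Because the counter depends on the row order within each group, the ids are reproducible only as long as the input file keeps the same order.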


In [51]:
# also save a copy of the same dataset under the name media_balanced.csv
df.to_csv(DATA_DIR / "media_balanced.csv", index=False)