# Sentiment Analysis (VADER) — Event-Based + Subreddit-Aware

This notebook applies sentiment analysis to the extracted event datasets using VADER, a sentiment tool designed for social media text.

Key ideas:
- Each comment gets a sentiment score (`compound`) between -1 and +1.
- We classify each comment as Positive / Neutral / Negative using standard VADER thresholds.
- We summarize sentiment:
  1) **within each event** (overall distribution)
  2) **within each event + subreddit** (to avoid one large subreddit dominating)

Outputs:
- `data/processed/event_sentiment_comment_level.csv` (optional and potentially large; the save cell in Section 8 does not write it)
- `data/processed/event_sentiment_subreddit_level.csv`
- `data/processed/event_sentiment_event_level.csv`


## 1. Imports and Setup

We load libraries and define paths to the extracted event CSV files created in `01_event_extraction.ipynb`.


In [2]:
import os
import pandas as pd
import numpy as np


In [7]:
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

# Download VADER lexicon 
nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/hyf/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## 2. Load Event Datasets

We load the five event CSV files from `data/processed/`.


In [9]:
base_dir = "../data/processed"

event_files = {
    "event1_kyiv": os.path.join(base_dir, "event1_kyiv.csv"),
    "event2_kherson": os.path.join(base_dir, "event2_kherson.csv"),
    "event3_stalemate": os.path.join(base_dir, "event3_stalemate.csv"),
    "event4_trump_election": os.path.join(base_dir, "event4_trump_election.csv"),
    "event5_white_house_meeting": os.path.join(base_dir, "event5_white_house_meeting.csv"),
}

event_files


{'event1_kyiv': '../data/processed/event1_kyiv.csv',
 'event2_kherson': '../data/processed/event2_kherson.csv',
 'event3_stalemate': '../data/processed/event3_stalemate.csv',
 'event4_trump_election': '../data/processed/event4_trump_election.csv',
 'event5_white_house_meeting': '../data/processed/event5_white_house_meeting.csv'}

In [11]:
dfs = []
for event_name, path in event_files.items():
    if not os.path.exists(path):
        print(f"Missing file: {path}")
        continue
    
    df = pd.read_csv(path, low_memory=False)
    df["event"] = event_name
    dfs.append(df)

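# pd.concat raises an unhelpful ValueError on an empty list, so fail early with a clear message.
if not dfs:
    raise FileNotFoundError("No event CSV files were found; run 01_event_extraction.ipynb first.")

all_df = pd.concat(dfs, ignore_index=True)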
print("Total rows loaded:", len(all_df))
all_df.head()


Total rows loaded: 614040


| | comment_id | created_time | self_text | subreddit | score | post_id | post_title | post_self_text | post_created_time | post_score | dt | is_us_focused | event |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | i1c74oa | 2022-03-19 23:01:46 | They probably had people aiming guns at them | InvasionOfUkraine | 1 | thlaxt | Russians at Putin's pro-war rally yesterday | | 2022-03-19 02:17:02 | 148 | 2022-03-19 23:01:46 | False | event1_kyiv |
| 1 | i1c2mtc | 2022-03-19 22:27:45 | As I started my comment with, I believe what P... | GTAorRussia | -1 | tgi6zf | A personal message to the Russian people from ... | | 2022-03-17 19:31:32 | 926 | 2022-03-19 22:27:45 | False | event1_kyiv |
| 2 | i1c1ebk | 2022-03-19 22:18:27 | Yeah, the soldier can absolutely do some damag... | UkraineWarFootage | 1 | tet4n1 | Ukrainian soldier chilling in a fox hole waiti... | | 2022-03-15 16:37:51 | 30 | 2022-03-19 22:18:27 | False | event1_kyiv |
| 3 | i1c02py | 2022-03-19 22:08:47 | The Russia's have more light armored vehicles ... | UkraineWarFootage | 1 | tet4n1 | Ukrainian soldier chilling in a fox hole waiti... | | 2022-03-15 16:37:51 | 30 | 2022-03-19 22:08:47 | False | event1_kyiv |
| 4 | i1by2r1 | 2022-03-19 21:53:52 | All it needed was a nice 'lil artillery strike... | UkraineWarFootage | 1 | tawdfo | Exposed Russian infantry column retreat under ... | | 2022-03-10 11:09:44 | 9 | 2022-03-19 21:53:52 | False | event1_kyiv |
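Since the combined table has more than 600,000 rows, it may help to read only the columns this notebook actually uses. The sketch below assumes the column names shown in the preview above and uses a small in-memory CSV as a stand-in for an event file; `NEEDED_COLS` and `load_event` are hypothetical names, not part of the notebook.

```python
import io

import pandas as pd

# Columns used downstream (an assumption based on the preview above).
NEEDED_COLS = ["comment_id", "created_time", "self_text", "subreddit", "is_us_focused"]


def load_event(source, event_name: str) -> pd.DataFrame:
    """Read only the needed columns and tag each row with its event name."""
    df = pd.read_csv(source, usecols=NEEDED_COLS)
    df["event"] = event_name
    return df


# Tiny in-memory stand-in for one of the event CSV files.
sample_csv = io.StringIO(
    "comment_id,created_time,self_text,subreddit,score,is_us_focused\n"
    "a1,2022-03-19 23:01:46,Example comment,InvasionOfUkraine,1,False\n"
)
sample_df = load_event(sample_csv, "event1_kyiv")
print(sample_df.columns.tolist())
```

Columns that are not listed in `usecols` (here `score`) are skipped at parse time, which reduces memory use for the larger event files.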


## 3. Basic Sanity Checks

We verify that we have the columns we need:
- `self_text` (comment text)
- `subreddit`
- `created_time` (optional for plotting later)
- `is_us_focused` (if available)


In [13]:
required_cols = ["self_text", "subreddit", "event"]
missing = [c for c in required_cols if c not in all_df.columns]
print("Missing required columns:", missing)

# How much text is missing?
print("Missing self_text:", all_df["self_text"].isna().sum())
print("Empty self_text:", (all_df["self_text"].astype(str).str.strip() == "").sum())


Missing required columns: []
Missing self_text: 0
Empty self_text: 0


## 4. Compute VADER Sentiment (Comment-Level)

VADER returns a `compound` score between -1 and +1:
- negative values → more negative tone
- positive values → more positive tone
- near zero → neutral / mixed tone

We also create a sentiment label using standard VADER thresholds:
- Positive: compound >= 0.05
- Negative: compound <= -0.05
- Neutral: otherwise
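The bounded range of `compound` comes from a normalization step applied to the summed word valences. As far as we understand the VADER reference implementation, the raw sum `x` is mapped to `x / sqrt(x^2 + alpha)` with `alpha = 15` and then clipped to [-1, 1]. The sketch below reproduces only that step; `normalize_valence` is a hypothetical name, and the default `alpha` is our assumption about the library default rather than something this notebook sets.

```python
import math


def normalize_valence(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clip defensively so rounding can never push the value outside the range.
    return max(-1.0, min(1.0, score))


# Larger raw sums approach, but do not pass, -1 and +1.
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize_valence(raw), 4))
```

Because the mapping flattens out for large inputs, long comments with many emotional words do not receive scores far beyond those of short, strongly worded ones.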


In [15]:
def vader_scores(text: str):
    if not isinstance(text, str):
        text = ""
    scores = sia.polarity_scores(text)
    return scores["compound"]

def label_sentiment(compound: float):
    if compound >= 0.05:
        return "positive"
    elif compound <= -0.05:
        return "negative"
    else:
        return "neutral"
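For large tables, the same thresholds can be applied to a whole column at once instead of row by row. This is a minimal sketch, assuming the same cut-offs as `label_sentiment`; `label_sentiment_vectorized` is a hypothetical helper that the notebook does not use.

```python
import numpy as np
import pandas as pd


def label_sentiment_vectorized(compound: pd.Series) -> pd.Series:
    """Label a whole column of compound scores using the +/-0.05 thresholds."""
    conditions = [compound >= 0.05, compound <= -0.05]
    labels = np.select(conditions, ["positive", "negative"], default="neutral")
    return pd.Series(labels, index=compound.index, name="sentiment_label")


# Includes the boundary values 0.05 and -0.05, which count as positive and negative.
demo_scores = pd.Series([0.0, -0.5262, -0.3102, 0.05, 0.3182, -0.05])
demo_labels = label_sentiment_vectorized(demo_scores)
print(demo_labels.tolist())
```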


In [17]:
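# Pass raw values so vader_scores can map non-string entries (e.g. NaN) to "";
# casting with .astype(str) first would score the literal string "nan" instead.
all_df["compound"] = all_df["self_text"].apply(vader_scores)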
all_df["sentiment_label"] = all_df["compound"].apply(label_sentiment)

all_df[["event", "subreddit", "compound", "sentiment_label"]].head()


| | event | subreddit | compound | sentiment_label |
|---|---|---|---|---|
| 0 | event1_kyiv | InvasionOfUkraine | 0.0 | neutral |
| 1 | event1_kyiv | GTAorRussia | -0.5262 | negative |
| 2 | event1_kyiv | UkraineWarFootage | -0.3102 | negative |
| 3 | event1_kyiv | UkraineWarFootage | 0.0 | neutral |
| 4 | event1_kyiv | UkraineWarFootage | 0.3182 | positive |


## 5. Event-Level Sentiment Summary

This gives a high-level view of sentiment in each event window:
- mean compound score
- median compound score
- % positive / neutral / negative

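Note: event sizes differ by orders of magnitude (from 795 to 359,088 comments), and a single large subreddit can dominate an event's average, so Section 6 also summarizes sentiment at the subreddit level.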


In [19]:
event_summary = (
    all_df.groupby("event")
    .agg(
        n_comments=("compound", "size"),
        mean_compound=("compound", "mean"),
        median_compound=("compound", "median"),
        pct_positive=("sentiment_label", lambda s: (s == "positive").mean() * 100),
        pct_neutral=("sentiment_label", lambda s: (s == "neutral").mean() * 100),
        pct_negative=("sentiment_label", lambda s: (s == "negative").mean() * 100),
    )
    .reset_index()
    .sort_values("event")
)

event_summary


| | event | n_comments | mean_compound | median_compound | pct_positive | pct_neutral | pct_negative |
|---|---|---|---|---|---|---|---|
| 0 | event1_kyiv | 2940 | -0.052446 | 0.0 | 34.013605 | 26.326531 | 39.659864 |
| 1 | event2_kherson | 795 | 0.014889 | 0.0 | 36.477987 | 30.188679 | 33.333333 |
| 2 | event3_stalemate | 9955 | 0.003329 | 0.0 | 38.503265 | 25.223506 | 36.27323 |
| 3 | event4_trump_election | 359088 | -0.02845 | 0.0 | 36.149356 | 25.318863 | 38.531781 |
| 4 | event5_white_house_meeting | 241262 | -0.013245 | 0.0 | 37.974484 | 23.641933 | 38.383583 |
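The three percentage columns can also be produced in one step with a row-normalized cross-tabulation. The sketch below is an alternative to the lambda-based aggregation, not the method used above; it runs on a made-up toy frame, and the names `toy` and `label_pct` are hypothetical.

```python
import pandas as pd

# Made-up labels for two hypothetical events.
toy = pd.DataFrame(
    {
        "event": ["e1", "e1", "e1", "e1", "e2", "e2"],
        "sentiment_label": [
            "positive", "negative", "negative", "neutral", "positive", "positive",
        ],
    }
)

# normalize="index" turns each row into shares that sum to 1; scale to percent.
label_pct = pd.crosstab(toy["event"], toy["sentiment_label"], normalize="index") * 100

# Fix the column order and fill in any label that never occurs.
label_pct = label_pct.reindex(columns=["positive", "neutral", "negative"], fill_value=0.0)
print(label_pct.round(1))
```

Each row sums to 100, which is a quick consistency check on the percentage columns.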


## 6. Subreddit-Level Aggregation (Imbalance-Aware)

If we only take a single average for an event, one large subreddit can dominate the result.

So we compute sentiment per (event, subreddit):
- comment_count
- mean_compound
- % positive / neutral / negative

This allows:
- fair comparisons across communities
- U.S.-focused vs non-U.S. comparisons where available


In [21]:
subreddit_summary = (
    all_df.groupby(["event", "subreddit"])
    .agg(
        n_comments=("compound", "size"),
        mean_compound=("compound", "mean"),
        pct_positive=("sentiment_label", lambda s: (s == "positive").mean() * 100),
        pct_neutral=("sentiment_label", lambda s: (s == "neutral").mean() * 100),
        pct_negative=("sentiment_label", lambda s: (s == "negative").mean() * 100),
    )
    .reset_index()
)

subreddit_summary.sort_values(["event", "n_comments"], ascending=[True, False]).head(15)


| | event | subreddit | n_comments | mean_compound | pct_positive | pct_neutral | pct_negative |
|---|---|---|---|---|---|---|---|
| 1 | event1_kyiv | InvasionOfUkraine | 1700 | -0.064994 | 34.117647 | 25.058824 | 40.823529 |
| 3 | event1_kyiv | RussiaReplacement | 664 | -0.025208 | 35.240964 | 26.807229 | 37.951807 |
| 2 | event1_kyiv | RussiaDenies | 306 | -0.055801 | 32.026144 | 29.411765 | 38.562092 |
| 0 | event1_kyiv | GTAorRussia | 185 | -0.008033 | 34.054054 | 30.810811 | 35.135135 |
| 5 | event1_kyiv | UkraineWarFootage | 69 | -0.176506 | 24.637681 | 27.536232 | 47.826087 |
| 4 | event1_kyiv | UkrainePics | 16 | 0.236119 | 50.0 | 25.0 | 25.0 |
| 10 | event2_kherson | RussiaDenies | 214 | -0.141495 | 23.831776 | 29.906542 | 46.261682 |
| 7 | event2_kherson | GTAorRussia | 166 | -0.052505 | 29.518072 | 37.349398 | 33.13253 |
| 15 | event2_kherson | volunteersForUkraine | 146 | 0.20401 | 50.684932 | 28.767123 | 20.547945 |
| 6 | event2_kherson | ArtForUkraine | 118 | 0.218264 | 52.542373 | 27.966102 | 19.491525 |
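To see how much a large subreddit can pull the event average, the `event1_kyiv` rows above can be summarized two ways: pooled over all comments (each subreddit weighted by its size) and balanced (each subreddit weighted equally). This is a sketch using the counts and means copied from the table; `pooled_mean` and `balanced_mean` are hypothetical names.

```python
import pandas as pd

# event1_kyiv rows copied from the subreddit-level table.
kyiv = pd.DataFrame(
    {
        "subreddit": [
            "InvasionOfUkraine", "RussiaReplacement", "RussiaDenies",
            "GTAorRussia", "UkraineWarFootage", "UkrainePics",
        ],
        "n_comments": [1700, 664, 306, 185, 69, 16],
        "mean_compound": [-0.064994, -0.025208, -0.055801, -0.008033, -0.176506, 0.236119],
    }
)

# Pooled mean: every comment counts once, so large subreddits dominate.
pooled_mean = (kyiv["n_comments"] * kyiv["mean_compound"]).sum() / kyiv["n_comments"].sum()

# Balanced mean: every subreddit counts once, regardless of size.
balanced_mean = kyiv["mean_compound"].mean()

print(f"pooled:   {pooled_mean:.4f}")
print(f"balanced: {balanced_mean:.4f}")
```

The pooled value is about -0.052, matching the event-level table, while the balanced value is about -0.016. Equal weighting has its own drawback: a subreddit with only 16 comments (UkrainePics) moves the balanced figure noticeably, so a minimum comment count may be worth considering before averaging.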


## 7. U.S.-focused vs Non-U.S. Comparison (Where Available)

Early events may have little or no U.S.-focused subreddit participation.
We will compute the comparison only for events that actually contain both groups.


In [25]:
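if "is_us_focused" in all_df.columns:
    us_summary = (
        all_df.groupby(["event", "is_us_focused"])
        .agg(
            n_comments=("compound", "size"),
            mean_compound=("compound", "mean"),
            pct_positive=("sentiment_label", lambda s: (s == "positive").mean() * 100),
            pct_neutral=("sentiment_label", lambda s: (s == "neutral").mean() * 100),
            pct_negative=("sentiment_label", lambda s: (s == "negative").mean() * 100),
        )
        .reset_index()
    )
    # Keep only events that contain both U.S.-focused and non-U.S. comments.
    n_groups = us_summary.groupby("event")["is_us_focused"].transform("nunique")
    us_summary = us_summary[n_groups == 2]
    # A bare expression inside an `if` block is not auto-displayed by Jupyter,
    # so print the table explicitly.
    print(us_summary)
else:
    print("Column `is_us_focused` not found in dataset.")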
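Once a grouped table of this shape exists, the two groups can be placed side by side to read off the gap per event. The sketch below uses made-up values rather than results from this dataset; `toy_us`, `group`, and `us_minus_non_us` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up grouped output: event_a has no U.S.-focused comments.
toy_us = pd.DataFrame(
    {
        "event": ["event_a", "event_b", "event_b"],
        "is_us_focused": [False, False, True],
        "mean_compound": [-0.05, -0.02, -0.04],
    }
)

# Use string labels so the pivoted columns have readable names.
toy_us["group"] = np.where(toy_us["is_us_focused"], "us", "non_us")

wide = toy_us.pivot(index="event", columns="group", values="mean_compound")
# Guarantee both columns exist even if one group is absent everywhere.
wide = wide.reindex(columns=["non_us", "us"])
# Drop events that lack either group, then compute the gap.
wide = wide.dropna(subset=["non_us", "us"])
wide["us_minus_non_us"] = wide["us"] - wide["non_us"]
print(wide)
```

A negative `us_minus_non_us` would indicate a more negative average tone in U.S.-focused subreddits for that event.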


## 8. Save Output Tables

We save summary tables to `data/processed/` so they can be reused in:
- the Streamlit dashboard
- REPORT.md tables
- presentation charts


In [27]:
out_dir = "../data/processed"
os.makedirs(out_dir, exist_ok=True)

event_summary_path = os.path.join(out_dir, "event_sentiment_event_level.csv")
subreddit_summary_path = os.path.join(out_dir, "event_sentiment_subreddit_level.csv")

event_summary.to_csv(event_summary_path, index=False)
subreddit_summary.to_csv(subreddit_summary_path, index=False)

print("Saved:", event_summary_path)
print("Saved:", subreddit_summary_path)


Saved: ../data/processed/event_sentiment_event_level.csv
Saved: ../data/processed/event_sentiment_subreddit_level.csv
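The cell above writes only the two summary tables. If the optional comment-level file is wanted, one possible approach is to keep just the scored columns and compress the output. This is a sketch on a toy frame written to a temporary directory; the `.csv.gz` suffix and the column selection are assumptions, not choices made in this notebook.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the scored comment table.
toy_comments = pd.DataFrame(
    {
        "event": ["event1_kyiv", "event1_kyiv"],
        "subreddit": ["InvasionOfUkraine", "GTAorRussia"],
        "compound": [0.0, -0.5262],
        "sentiment_label": ["neutral", "negative"],
    }
)

# Write to a temporary directory so the sketch does not touch data/processed/.
demo_dir = tempfile.mkdtemp()
comment_path = os.path.join(demo_dir, "event_sentiment_comment_level.csv.gz")
toy_comments.to_csv(comment_path, index=False, compression="gzip")

# pandas infers gzip compression from the .gz suffix when reading back.
reloaded = pd.read_csv(comment_path)
print(reloaded.shape)
```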


## 9. Notes for Presentation Slides (Draft)

Suggested slide framing:
- We **cannot** treat all events equally because subreddit composition changes over time.
- Early events are dominated by war-focused communities.
- Later events show stronger U.S.-political participation.
- To avoid misleading results, we summarize sentiment at:
  - the event level
  - and the subreddit level (imbalance-aware).
