<a href="https://colab.research.google.com/github/Yanzhi-002/Yanzhi/blob/main/youtube_vader_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YouTube + VADER Sentiment - Student Guide & Live Coding Notebook

> Run this end-to-end in Google Colab. Cells are heavily commented so you can follow what each line does.

## Part 0 - VADER Warm-Up (Strengths & Limitations)
We'll start with a small "sentiment sandbox" to see where VADER shines (short, social text; emojis/emphasis) and where it struggles (sarcasm, domain slang, negation edge cases).

### What does the score represent?
VADER's compound score = lexicon scores + rule-based adjustments → normalized into [-1, 1].

Because the score comes from a fixed word list and a small set of explicit rules, it is simple to compute and easy to interpret, and the lexicon was built with informal internet language in mind. That makes it a popular choice for policy, media, and social analysis.

### 0.1 Install & Imports

In [76]:

# Keep pandas (already in Colab). Ensure latest NLTK.
!pip -q install --upgrade nltk

import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon (only needs to run once per runtime)
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


### 0.2 Quick word tests

In [77]:

# Try a variety of words and short phrases.
# Notice how intensifiers, punctuation, and emojis change scores.
examples = [
    "good", "great!!!", "bad", "terrible", "okay", "not good", "not bad",
    "sick", "sick!", "sick 🤮", "sick 🤩",  # same token, different meaning via emoji
    "love", "hate", "meh", "LOL", "lol", "LOL!!!",  # casing & emphasis
]

results = []
for s in examples:
    scores = sia.polarity_scores(s)
    results.append({"text": s, **scores})

pd.DataFrame(results).sort_values("compound", ascending=False)


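| | text | neg | neu | pos | compound |
|---|---|---|---|---|---|
| 1 | great!!! | 0.0 | 0.0 | 1.0 | 0.7163 |
| 11 | love | 0.0 | 0.0 | 1.0 | 0.6369 |
| 16 | LOL!!! | 0.0 | 0.0 | 1.0 | 0.5684 |
| 0 | good | 0.0 | 0.0 | 1.0 | 0.4404 |
| 6 | not bad | 0.0 | 0.26 | 0.74 | 0.431 |
| 15 | lol | 0.0 | 0.0 | 1.0 | 0.4215 |
| 14 | LOL | 0.0 | 0.0 | 1.0 | 0.4215 |
| 4 | okay | 0.0 | 0.0 | 1.0 | 0.2263 |
| 13 | meh | 1.0 | 0.0 | 0.0 | -0.0772 |
| 5 | not good | 0.706 | 0.294 | 0.0 | -0.3412 |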


### 0.3 Sentence-level experiments

In [None]:

# Explore common edge cases: negation, contrastive conjunctions, sarcasm, domain slang.
edge_cases = [
    "This was good, not great.",
    "This was not good.",
    "This was good but also kind of annoying.",
    "Yeah right, amazing...",  # sarcasm
    "The movie was fire",      # slang (positive in many contexts)
    "The service was mad slow",# regional slang (negative)
    "I love how it crashes every five minutes", # sarcasm
]

pd.DataFrame([{ "text": s, **sia.polarity_scores(s)} for s in edge_cases])


**Things to note:**
- VADER is a **lexicon + rules** approach; it leverages booster words (e.g., *very*), punctuation, capitalization, emojis.
- It can miss **sarcasm** or domain meanings (e.g., *sick* can be good or bad).
- For short, social-style comments VADER performs surprisingly well, but you should always sanity-check results.
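
To make the "lexicon + rules" idea concrete, here is a deliberately tiny sketch. It is not VADER's implementation: the three-word lexicon, the `TOY_` constants, and the function name `toy_valence` are illustrative assumptions, and the real analyzer applies many more rules. It does show why sarcasm slips through: the scorer only sees word-level valence, so the sarcastic last phrase still comes out positive.

```python
# Toy illustration of a lexicon + rules scorer. NOT VADER's actual code:
# the lexicon, constants, and rules below are simplified assumptions.
TOY_LEXICON = {"good": 1.9, "bad": -2.5, "love": 3.2}  # illustrative valences
TOY_BOOSTERS = {"very": 0.3, "extremely": 0.3}          # illustrative boost size
TOY_NEGATIONS = {"not", "never"}
TOY_NEGATION_SCALE = -0.74                              # illustrative flip-and-dampen factor


def toy_valence(text: str) -> float:
    """Sum word valences, adjusting each by the word immediately before it."""
    tokens = text.lower().split()
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token)
        if valence is None:
            continue  # words outside the lexicon contribute nothing
        previous = tokens[i - 1] if i > 0 else ""
        if previous in TOY_BOOSTERS:
            # Push the valence further from zero, in its own direction.
            boost = TOY_BOOSTERS[previous]
            valence += boost if valence > 0 else -boost
        elif previous in TOY_NEGATIONS:
            valence *= TOY_NEGATION_SCALE
        total += valence
    return total


for phrase in ["good", "very good", "not good", "not bad",
               "i love how it crashes every five minutes"]:
    print(f"{phrase!r}: {toy_valence(phrase):+.2f}")
```
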

---

## Part 1 - Your YouTube API Key
You'll need a Google Cloud project with **YouTube Data API v3** enabled. Keep your key private.

**Steps**
1. Go to **Google Cloud Console** → create (or select) a **Project**.
2. **APIs & Services** → **Enable APIs and Services** → search **YouTube Data API v3** → **Enable**.
3. **Credentials** → **Create Credentials** → **API key**.
4. (Optional but recommended) **Restrict** the key (HTTP referrers or IP).
5. Paste it into the Colab runtime via a hidden prompt:

In [88]:

import os
from getpass import getpass

# Paste your API key when prompted (input will be hidden in Colab)
os.environ["YOUTUBE_API_KEY"] = getpass("Paste your API Key: ")

# Quick sanity check
assert os.environ.get("YOUTUBE_API_KEY"), "API key not set. Please run the cell and paste your key."


Paste your API Key: ··········


## Part 2 - Collect YouTube Data (Search → Video Details → Comments)
We'll use plain `requests` so you can see raw REST calls, params, and pagination. Then we'll tidy the results with `pandas`.
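
Each call in this part boils down to a URL with query parameters. As a minimal sketch of what gets sent (the helper name `build_url` and the `DEMO_KEY` placeholder are assumptions for illustration, and no request is actually made):

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/youtube/v3"


def build_url(resource: str, params: dict, api_key: str) -> str:
    """Return the full request URL for a resource such as 'search'."""
    query = {**params, "key": api_key}
    return f"{BASE_URL}/{resource}?{urlencode(query)}"


# "DEMO_KEY" is a placeholder, not a working key.
url = build_url(
    "search",
    {"part": "snippet", "q": "subway safety NYC", "maxResults": 5},
    "DEMO_KEY",
)
print(url)
```

Printing the URL shows that `urlencode` turns the spaces in the query into `+` and joins the parameters with `&`.
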

### 2.1 Install (if needed) & imports

In [79]:

!pip -q install requests tqdm

import os
import json
from urllib.parse import urlencode

import requests
import pandas as pd
from tqdm import tqdm


### 2.2 Helper: call the YouTube Data API

In [89]:

API_KEY = os.environ.get("YOUTUBE_API_KEY")
BASE_URL = "https://www.googleapis.com/youtube/v3"

if not API_KEY:
    raise ValueError("Missing API key. Set os.environ['YOUTUBE_API_KEY'] first.")

def yt_get(resource: str, params: dict) -> dict:
    """Call YouTube Data API v3.
    - resource: e.g., 'search', 'videos', 'commentThreads'
    - params: dict of query params (we append the API key here)
    Returns parsed JSON as a Python dict.
    """
    q = {**params, "key": API_KEY}
    url = f"{BASE_URL}/{resource}?{urlencode(q)}"
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # raise an HTTPError if the request failed
    return r.json()


### 2.3 Search videos for a topic (collect video IDs)

In [90]:

# ✅ Edit this query to explore your own topic
QUERY = "subway safety NYC"  # e.g., "climate policy", "Taylor Swift"
TARGET_VIDEOS = 60           # upper bound of total videos to collect (keep modest: quotas!)
MAX_RESULTS = 50             # per-page limit for search endpoint

video_hits = []              # will hold basic search results
page_token = None            # used for pagination

with tqdm(total=TARGET_VIDEOS, desc="Searching videos") as pbar:
    while len(video_hits) < TARGET_VIDEOS:
        # The 'search' resource finds videos; we request snippet data (title, channel, publishedAt).
        params = {
            "part": "snippet",
            "q": QUERY,
            "type": "video",
            "maxResults": MAX_RESULTS,
            "order": "relevance",
        }
        if page_token:
            params["pageToken"] = page_token

        data = yt_get("search", params)
        items = data.get("items", [])

        for it in items:
            vid = it.get("id", {}).get("videoId")
            if not vid:
                continue
            snip = it.get("snippet", {})
            video_hits.append({
                "video_id": vid,
                "publishedAt": snip.get("publishedAt"),
                "title": snip.get("title"),
                "channelId": snip.get("channelId"),
                "channelTitle": snip.get("channelTitle"),
            })
            pbar.update(1)
            if len(video_hits) >= TARGET_VIDEOS:
                break

        page_token = data.get("nextPageToken")
        if not page_token:
            break  # no more pages

videos_df = pd.DataFrame(video_hits)
videos_df.head(3)


Searching videos: 100%|██████████| 60/60 [00:00<00:00, 60.72it/s]


| | video_id | publishedAt | title | channelId | channelTitle |
|---|---|---|---|---|---|
| 0 | vh6WnkJUo7g | 2025-08-07T20:20:15Z | How safe is the NYC subway? | UC5iIQGb0YQhyA-ey2V4z2wg | Bloomberg Opinion |
| 1 | w45CP0x97Pw | 2022-12-12T23:10:10Z | NYC subway safety | UCIjSUWHWp6KohfnR5OQTXnQ | FOX 5 New York |
| 2 | KP4fZzHZq3g | 2025-01-04T01:15:01Z | Growing safety fears among New York subway rid... | UCeY0bbntWzzVIaj2z3QigXg | NBC News |
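
The search loop above follows `nextPageToken` from one response to the next until it hits the cap or runs out of pages. The same pattern can be tried offline with a stand-in for the API. `FAKE_PAGES`, `fetch_page`, and `collect` are hypothetical names, and the pages are made up:

```python
# Hypothetical in-memory "API": three pages linked by nextPageToken,
# standing in for real search responses.
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c", "d"], "nextPageToken": "p3"},
    "p3": {"items": ["e"]},  # last page: no nextPageToken
}


def fetch_page(page_token=None):
    """Stand-in for a call such as yt_get('search', params)."""
    return FAKE_PAGES[page_token]


def collect(limit):
    """Follow nextPageToken until the pages run out or `limit` items are collected."""
    collected = []
    page_token = None
    while len(collected) < limit:
        data = fetch_page(page_token)
        collected.extend(data.get("items", []))
        page_token = data.get("nextPageToken")
        if not page_token:
            break  # no more pages
    return collected[:limit]  # trim any overshoot from the final page


print(collect(3))   # stops early once the cap is reached
print(collect(10))  # runs out of pages before reaching the cap
```
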


### 2.4 Enrich videos: titles, descriptions, and stats

In [91]:

# We'll call 'videos.list' to fetch details for batches of IDs (up to 50 per call)
def chunked(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i+size]

video_ids = videos_df["video_id"].dropna().unique().tolist()

video_details = []
for batch in tqdm(list(chunked(video_ids, 50)), desc="Fetching video details"):
    params = {
        "part": "snippet,statistics",
        "id": ",".join(batch),
        "maxResults": 50,
    }
    data = yt_get("videos", params)
    for it in data.get("items", []):
        snip = it.get("snippet", {})
        stats = it.get("statistics", {})
        video_details.append({
            "video_id": it.get("id"),
            "title": snip.get("title"),
            "description": snip.get("description"),
            "publishedAt": snip.get("publishedAt"),
            "channelTitle": snip.get("channelTitle"),
            # Cast numeric strings to integers when possible
            "viewCount": int(stats.get("viewCount", 0) or 0),
            "likeCount": int(stats.get("likeCount", 0) or 0),
            "commentCount": int(stats.get("commentCount", 0) or 0),
        })

video_details_df = pd.DataFrame(video_details)
video_details_df.head(3)


Fetching video details: 100%|██████████| 2/2 [00:00<00:00, 10.29it/s]


| | video_id | title | description | publishedAt | channelTitle | viewCount | likeCount | commentCount |
|---|---|---|---|---|---|---|---|---|
| 0 | vh6WnkJUo7g | How safe is the NYC subway? | Is riding the #subway safer than riding in a c... | 2025-08-07T20:20:15Z | Bloomberg Opinion | 8543 | 330 | 15 |
| 1 | w45CP0x97Pw | NYC subway safety | Members of the City Council grilled New York C... | 2022-12-12T23:10:10Z | FOX 5 New York | 9645 | 95 | 55 |
| 2 | KP4fZzHZq3g | Growing safety fears among New York subway rid... | A series of violent attacks on New York's subw... | 2025-01-04T01:15:01Z | NBC News | 20215 | 128 | 109 |


### 2.5 Fetch top-level comments for each video (with pagination)

In [None]:

# Some videos disable comments. We'll handle HTTP errors gracefully and cap per-video volume.
all_comments = []

for vid in tqdm(video_details_df["video_id"].tolist(), desc="Fetching comments"):
    page_token = None
    fetched = 0
    try:
        while True:
            params = {
                "part": "snippet",
                "videoId": vid,
                "maxResults": 100,  # API max per page for commentThreads
                "order": "relevance",  # try 'time' if you want chronological
                # Per the API docs, 'textFormat' defaults to 'html'; we read textOriginal (the raw text) below
            }
            if page_token:
                params["pageToken"] = page_token

            data = yt_get("commentThreads", params)
            items = data.get("items", [])

            for it in items:
                top = it.get("snippet", {}).get("topLevelComment", {})
                s = top.get("snippet", {})
                all_comments.append({
                    "video_id": vid,
                    "comment_id": top.get("id"),
                    "author": s.get("authorDisplayName"),
                    "publishedAt": s.get("publishedAt"),
                    "likeCount": s.get("likeCount", 0),
                    "text": s.get("textOriginal", ""),
                })
                fetched += 1

            page_token = data.get("nextPageToken")
            if not page_token:
                break  # no more pages

            if fetched >= 300:
                break  # safety cap so a single video doesn't eat your quota

    except requests.HTTPError as e:
        print(f"Skipping {vid} due to HTTP error: {e}")
        continue

comments_df = pd.DataFrame(all_comments)
comments_df.head(3)
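
The loop above skips a video on any HTTP error. For transient failures it can be worth retrying before giving up. A dependency-free sketch of exponential backoff follows. `with_backoff` is a hypothetical helper, and it catches every exception to stay short, whereas a real version would probably retry only on server errors or rate limits:

```python
import time


def with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt, then try again.

    Hypothetical helper: `call` is any zero-argument function, and `sleep`
    is injectable so the waiting can be skipped when experimenting.
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake call that fails twice, then succeeds.
attempts = {"count": 0}
waits = []  # records the delays instead of actually sleeping


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

In the notebook you might wrap a call as `with_backoff(lambda: yt_get("commentThreads", params))`. A video with comments disabled fails on every attempt, so retrying that case costs time and possibly quota for no benefit.
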


## Part 3 - Sentiment Analysis with VADER
We will score titles, descriptions, and comments. Then we'll aggregate by video.

### 3.1 Set up VADER (if you skipped Part 0)

In [None]:

# Already upgraded earlier; safe to re-run if needed
!pip -q install --upgrade nltk
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()


### 3.2 Score text fields

In [None]:

# Helper to score a text string and return only the 'compound' score ([-1, 1])
def compound_score(text):
    return sia.polarity_scores(text or "")["compound"]

# Video titles & descriptions
video_details_df["title_compound"] = video_details_df["title"].fillna("").apply(compound_score)
video_details_df["description_compound"] = video_details_df["description"].fillna("").apply(compound_score)

# Comments (if any)
if not comments_df.empty:
    comments_df["compound"] = comments_df["text"].fillna("").apply(compound_score)


### 3.3 Aggregate to video level

In [None]:

# Common VADER thresholds
POS, NEG = 0.05, -0.05

if not comments_df.empty:
    comments_df["sentiment_label"] = comments_df["compound"].apply(
        lambda c: "pos" if c > POS else ("neg" if c < NEG else "neu")
    )

    agg = (comments_df.groupby("video_id").agg(
        n_comments=("comment_id", "count"),
        mean_compound=("compound", "mean"),
        pct_pos=("sentiment_label", lambda s: (s == "pos").mean()),
        pct_neg=("sentiment_label", lambda s: (s == "neg").mean()),
        pct_neu=("sentiment_label", lambda s: (s == "neu").mean()),
    ).reset_index())
else:
    # Empty placeholder so the merge below still works
    agg = pd.DataFrame(columns=["video_id", "n_comments", "mean_compound", "pct_pos", "pct_neg", "pct_neu"])

summary = (
    video_details_df.merge(agg, on="video_id", how="left")
    .assign(
        title_compound=lambda d: d["title_compound"].round(3),
        description_compound=lambda d: d["description_compound"].round(3),
        mean_compound=lambda d: d["mean_compound"].round(3),
        pct_pos=lambda d: (d["pct_pos"]*100).round(1),
        pct_neg=lambda d: (d["pct_neg"]*100).round(1),
        pct_neu=lambda d: (d["pct_neu"]*100).round(1),
    )
)

summary_cols = [
    "video_id", "channelTitle", "publishedAt", "viewCount", "likeCount", "commentCount",
    "title_compound", "description_compound", "n_comments", "mean_compound", "pct_pos", "pct_neg", "pct_neu", "title"
]

summary[summary_cols].sort_values(by=["mean_compound"], ascending=False).head(10)
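
The labeling and rollup logic can be checked by hand on a few made-up scores. This sketch mirrors the thresholds above using only the standard library. The `scored` data and the `label` helper are illustrative assumptions, not output from the notebook:

```python
from collections import Counter, defaultdict

POS, NEG = 0.05, -0.05  # same thresholds as the cell above


def label(compound: float) -> str:
    """Map a compound score to 'pos', 'neg', or 'neu'."""
    if compound > POS:
        return "pos"
    if compound < NEG:
        return "neg"
    return "neu"


# Made-up (video_id, compound) pairs for illustration only.
scored = [("vidA", 0.6), ("vidA", -0.4), ("vidA", 0.0), ("vidA", 0.2),
          ("vidB", -0.7), ("vidB", 0.05)]

# Group the scores by video.
by_video = defaultdict(list)
for video_id, compound in scored:
    by_video[video_id].append(compound)

# Compute the same per-video columns as the pandas aggregation.
rollup = {}
for video_id, values in by_video.items():
    counts = Counter(label(v) for v in values)
    rollup[video_id] = {
        "n_comments": len(values),
        "mean_compound": round(sum(values) / len(values), 3),
        "pct_pos": round(100 * counts["pos"] / len(values), 1),
        "pct_neg": round(100 * counts["neg"] / len(values), 1),
        "pct_neu": round(100 * counts["neu"] / len(values), 1),
    }

for video_id, row in rollup.items():
    print(video_id, row)
```

Note that a score of exactly 0.05 is labeled neutral, because both comparisons are strict.
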


### 3.4 Plotly Visualizations (offline-friendly & GitHub-ready)
We'll create a couple of interactive charts using Plotly and save them as **self-contained HTML** (works offline) and as **PNG** (renders in GitHub README previews).

**Tip:** GitHub won't render Plotly HTML inline in the repo view, but you can:
- Download the HTML file and open it in your browser (the **Raw** view shows the page source, not the chart), or
- Serve it via **GitHub Pages**, or
- Use the **PNG** in your README and link to the HTML for interactivity.


In [None]:

# Install Plotly + Kaleido (for saving static PNGs)
!pip -q install --upgrade plotly kaleido

import plotly.express as px
import plotly.io as pio

# Set a renderer suitable for Colab. Alternatives: 'notebook_connected', 'svg', 'png'
pio.renderers.default = "colab"

# --- 1) Bar chart: Top 10 videos by mean comment sentiment (requires comments) ---
import pandas as pd

if 'summary' in globals() and not summary.empty and summary['mean_compound'].notna().any():
    top10 = summary.sort_values("mean_compound", ascending=False).head(10).copy()
    # Truncate long titles for readability
    top10["title_short"] = top10["title"].str.slice(0, 60) + top10["title"].apply(lambda t: "…" if len(str(t)) > 60 else "")

    fig_bar = px.bar(
        top10,
        x="title_short",
        y="mean_compound",
        hover_data=["title", "channelTitle", "viewCount", "likeCount", "n_comments"],
        title="Top 10 videos by mean comment sentiment (compound)",
        labels={"title_short": "Video title (truncated)", "mean_compound": "Mean compound sentiment"},
    )
    fig_bar.update_layout(xaxis_tickangle=-30)
    fig_bar.show()

    # Save interactive HTML (self-contained: plotly.js is embedded, so the larger file works offline) and PNG (static preview-friendly)
    fig_bar.write_html("plot_top10_sentiment.html", include_plotlyjs=True, full_html=True)
    fig_bar.write_image("plot_top10_sentiment.png")

else:
    print("No comment sentiment available to plot. Make sure you fetched comments and computed 'mean_compound'.")


In [None]:

# --- 2) Scatter: Relationship between viewCount and mean comment sentiment ---
if 'summary' in globals() and not summary.empty and summary['mean_compound'].notna().any():
    scatter_df = summary.dropna(subset=["mean_compound"]).copy()
    # Use log scale for views if counts vary widely
    fig_scatter = px.scatter(
        scatter_df,
        x="viewCount",
        y="mean_compound",
        hover_name="title",
        hover_data=["channelTitle", "likeCount", "n_comments"],
        title="View count vs. mean comment sentiment",
        labels={"viewCount": "Views", "mean_compound": "Mean compound sentiment"},
    )
    fig_scatter.update_xaxes(type="log")

    fig_scatter.show()

    # Save HTML (self-contained: plotly.js is embedded, so it works offline) + PNG
    fig_scatter.write_html("plot_views_vs_sentiment.html", include_plotlyjs=True, full_html=True)
    fig_scatter.write_image("plot_views_vs_sentiment.png")
else:
    print("No sentiment summary to plot. Ensure the aggregation step ran successfully.")


### 3.5 Save your datasets

In [None]:

# Export tidy CSVs for later analysis or visualization
videos_df.to_csv("videos_search_hits.csv", index=False)
video_details_df.to_csv("video_details.csv", index=False)
comments_df.to_csv("video_comments.csv", index=False)
summary.to_csv("video_sentiment_summary.csv", index=False)

print("Saved: videos_search_hits.csv, video_details.csv, video_comments.csv, video_sentiment_summary.csv")


## Part 4 - Calculate Sentiment Scores for a State of the Union Address


In this section, we are going to calculate sentiment scores for President Biden's 2023 State of the Union Address.

First, we need to use web scraping tools to collect the transcript of the 2023 State of the Union Address. This White House [URL](https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/02/07/remarks-of-president-joe-biden-state-of-the-union-address-as-prepared-for-delivery/) contains the complete transcript.

To start, we import the `requests` library into our Python environment, and then we make our data request using the URL:

In [None]:
import requests

In [None]:
response = requests.get("https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/02/07/remarks-of-president-joe-biden-state-of-the-union-address-as-prepared-for-delivery/")

Next, we can check whether the request was successful. Displaying the response shows its status code, and `<Response [200]>` means success:

In [None]:
response

To get the text data from the response, we read its `.text` attribute and save the result in a new variable, `html_string`. The results from the data request will be in [HTML format](https://www.udacity.com/blog/2021/04/html-for-dummies.html).

In [None]:
html_string = response.text
print(html_string)

Let's bring in our [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Python library to help us clean up and decode this HTML text data:

In [None]:
from bs4 import BeautifulSoup

Let's run our html_string variable through the Beautiful Soup object and use the get_text() function to extract the text from the HTML data. Then, let's use the print function to visualize our results:

In [None]:
soup = BeautifulSoup(html_string, "html.parser")  # name the parser explicitly to avoid a warning
speech = soup.get_text()
print(speech)

Let's save our results in a text file:

In [None]:
with open("2023_union.txt", "w", encoding="utf-8") as file:
    file.write(speech)

Next, let's read in the text file and replace line breaks with spaces, because there are line breaks in the middle of sentences.

In [None]:
# Read in text file (the with-block closes the file for us)
with open("2023_union.txt", encoding="utf-8") as file:
    text = file.read()
# Replace line breaks with spaces
text = text.replace('\n', ' ')

### Import NLTK

Next we need to break the text into sentences.

An easy way to break text into sentences, or to "tokenize" it into sentences, is to use [NLTK](https://www.nltk.org/), a Python library for text analysis and natural language processing.

Let's import nltk and download the data that its sentence tokenizer needs.

In [None]:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

To break a string into individual sentences, we can use `nltk.sent_tokenize()`

In [None]:
nltk.sent_tokenize(text)

To get sentence numbers for each sentence, we can use `enumerate()`.

In [None]:
for number, sentence in enumerate(nltk.sent_tokenize(text)):
    print(number, sentence)

### Make DataFrame

For convenience, we can put all of the sentences into a pandas DataFrame. One easy way to make a DataFrame is to first make a list of dictionaries.

Below we loop through the sentences, calculate sentiment scores, and then create a dictionary with the sentence, sentence number, and compound score, which we append to the list `sentence_scores`.

In [None]:
# Break text into sentences
sentences = nltk.sent_tokenize(text)

# Make empty list
sentence_scores = []
# Get each sentence and sentence number, which is what enumerate does
for number, sentence in enumerate(sentences):
    # Use VADER to calculate sentiment
    scores = sia.polarity_scores(sentence)  # sia is the analyzer created in section 0.1 / 3.1
    # Make dictionary and append it to the previously empty list
    sentence_scores.append({'sentence': sentence, 'sentence_number': number+1, 'sentiment_score': scores['compound']})

To make this list of dictionaries into a DataFrame, we can simply use `pd.DataFrame()`

In [None]:
pd.DataFrame(sentence_scores)

Let's examine the 10 most negative sentences.

In [None]:
# Assign DataFrame to variable speech_df
speech_df = pd.DataFrame(sentence_scores)

# Sort by the column "sentiment_score" and slice for first 10 values
speech_df.sort_values(by='sentiment_score')[:10]

Let's examine the 10 most positive sentences.

In [None]:
# Sort by the column "sentiment_score," this time in descending order, and slice for first 10 values
speech_df.sort_values(by='sentiment_score', ascending=False)[:10]

### Make a Sentiment Plot

To visualize sentiment over the course of the 2023 State of the Union Address, we can plot the sentiment scores against sentence number, which stands in for time within the speech.

In [None]:
import plotly.express as px

In [None]:
fig = px.line(speech_df, x='sentence_number', y="sentiment_score",
             title= "Sentiment Analysis of 2023 State of the Union Address")
fig.show()

We can also get a smoother view by taking a rolling average over 5 sentences at a time, using the `.rolling()` method with a window of 5 and storing the result in a new column, `speech_roll`:

In [None]:
speech_df['speech_roll'] = speech_df.rolling(5)['sentiment_score'].mean()

In [None]:
speech_df[:25]

In [None]:
fig = px.line(speech_df, x='sentence_number', y="speech_roll",
             title= "Sentiment Analysis of 2023 State of the Union Address (5-sentence rolling average)")
fig.show()
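
The rolling average used above is a trailing window: each value is the mean of the current sentence and the ones just before it. A small standard-library sketch shows the mechanics. `rolling_mean` is a hypothetical helper, the scores are made up, and a window of 3 keeps the output short:

```python
def rolling_mean(values, window):
    """Trailing rolling mean; positions without a full window are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # pandas shows NaN here by default
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


scores = [1.0, 0.0, -1.0, 0.0, 1.0, 1.0]  # made-up sentence scores
print(rolling_mean(scores, 3))
```

With a window of 5, as in the cell above, the first four rows of `speech_roll` are `NaN` for the same reason: there are not yet five sentences to average.
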

## Part 5 - Student Work: Improve & Explore (Prompts you can use with a GenAI model)

## 🎬 What the "Student-Driven" Hour Is About
During the final hour of class, you'll take the working pipeline we built together and:
- choose your own topic (e.g., climate news, celebrity content, policy debates)
- collect a small YouTube dataset (videos + comments)
- run VADER sentiment
- **improve one part of the script** using a GenAI assistant (pagination, cleaning, retry logic, parameters, etc.)
- export your results & write a short insight summary

**Your goal isn't to build the perfect tool; it's to experiment, debug, and deepen your understanding.**

Success = a working query, a sentiment summary, one improvement to your code, and 3-5 bullet insights.

---
Copy/paste any of these and adapt to your notebook:

1) **Pagination helper**  
*"Given my YouTube comments code, write a loop that continues until `nextPageToken` is missing or I reach a cap (e.g., 500 comments). Add docstrings and basic error handling."*

2) **Retry + backoff**  
*"Add retries with exponential backoff for HTTP 5xx and a polite sleep for 403/429. Keep dependencies minimal."*

3) **Parameters refactor**  
*"Refactor my API calls so params live in well-named dicts with comments for each parameter."*

4) **CSV schema & types**  
*"Review my DataFrame dtypes for `videos` and `comments` and cast to sensible types (`Int64`, `datetime64[ns]`). Update the save step."*

5) **Text cleaning utility**  
*"Write `clean_text(s)` that removes URLs, collapses whitespace, strips markup, and optionally normalizes emojis."*

6) **Per-video rollup**  
*"Given a comments DataFrame with `video_id` + `compound`, return n, mean, std, and pos/neg/neu proportions per video using VADER thresholds."*

7) **Visualization**  
*"Plot the relationship between `viewCount` and `mean comment sentiment`, labeling outliers by a truncated title."*

8) **Time window**  
*"Modify `search.list` to include `publishedAfter` / `publishedBefore` (RFC3339) and pass them into the workflow."*

9) **Topic comparison**  
*"Compare two queries (A vs B) and test whether mean comment sentiment differs (Mann-Whitney U)."*

10) **Language handling**  
*"Detect non-English comments (e.g., `langdetect`) and either filter or tag them; update aggregation to show language mix."*
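
As a reference point for judging what an assistant returns, here is one possible shape for the text cleaning prompt, using only the standard library. It is a sketch under assumptions: the regular expressions are deliberately simple, emoji normalization is left out, and the sample string is made up.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")  # simple pattern; will miss bare domains
TAG_RE = re.compile(r"<[^>]+>")       # simple pattern; not a full HTML parser


def clean_text(s) -> str:
    """Strip markup and URLs, decode HTML entities, collapse whitespace."""
    s = TAG_RE.sub(" ", s or "")  # drop tags such as <br> or <b>
    s = html.unescape(s)          # &amp; -> &, &#39; -> '
    s = URL_RE.sub(" ", s)        # drop links
    return " ".join(s.split())    # collapse runs of whitespace


print(clean_text("Check  this <b>out</b> https://example.com &amp; tell me!"))
```

Tags are removed before entities are decoded so that an escaped `&lt;` in a comment is not mistaken for the start of a tag.
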

## Appendix A - How VADER's *compound* score is computed (and why it's favored)
**High level:** VADER uses a human-curated lexicon of words with positive/negative "valence" scores and a set of rules for emphasis, negation, punctuation, capitalization, and emojis. It computes four scores: `pos`, `neu`, `neg` (proportions) and `compound` (a single summary in [-1, 1]).

**Steps (simplified):**
1. **Tokenize** the text and look up each token's **valence** in the VADER lexicon.
2. Apply **rules** and **modifiers**:
   - **Booster/intensifier words** (e.g., *very, extremely*) scale valence.
   - **Negation** (e.g., *not, never*) flips or reduces polarity.
   - **Punctuation & capitalization** (e.g., `!!!`, ALL‚ÄëCAPS) amplify valence.
   - **Emojis/emoticons** also carry polarity.
3. **Sum** the adjusted valences across the text ‚Üí call this `S`.
4. **Normalize** to [-1, 1] using:

   \[ \textbf{compound} = \frac{S}{\sqrt{S^2 + \alpha}} \quad \text{with } \alpha = 15 \]
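
The normalization in step 4 is easy to try by hand. A minimal sketch, where `normalize` is a hypothetical helper and the valence sums are chosen for illustration:

```python
import math

ALPHA = 15  # normalization constant from the formula above


def normalize(s: float, alpha: float = ALPHA) -> float:
    """Squash an unbounded valence sum S into the open interval (-1, 1)."""
    return s / math.sqrt(s * s + alpha)


# Valence sums chosen for illustration.
for s in [0.0, 0.9, 1.9, -1.9, 10.0, 100.0]:
    print(f"S = {s:6.1f} -> compound = {normalize(s):+.4f}")
```

A sum of 1.9 gives about 0.4404, which matches the score shown for "good" in section 0.2, assuming that word's adjusted valence sum is 1.9. Large sums approach 1 or -1 without reaching it, so extra emphasis has a diminishing effect on the final score.
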

**Why people favor `compound`:**
- It's a **single number** summarizing overall sentiment (great for ranking, correlations, and plots).
- The normalization makes scores **comparable** across different sentence lengths and emphasis.
- It's easy to threshold: commonly, `> 0.05` → positive, `< -0.05` → negative, otherwise neutral.

**Caveats:** sarcasm, domain-specific slang, and mixed statements can still confuse any lexicon-based method.
