## Sentiment Analysis Using VADER
This notebook performs sentiment analysis on financial news headlines and summaries related to a specific stock (e.g., NVDA). The analysis is broken into two major steps:

### 1. Aggregating News to Trading Days
Stock markets are closed on weekends and holidays, but financial news is often published daily. To reflect investor behavior, where decisions are made when the market reopens, news from non-trading days is aggregated and assigned to the next trading day. This provides a more accurate mapping between sentiment and stock movement.

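As a minimal sketch of the mapping described above, the snippet below finds the first trading day on or after a given calendar date. The helper name `next_trading_day` and the three-day calendar are hypothetical and only for illustration; the notebook's own function further down uses a day-by-day buffer instead of a binary search.

```python
from bisect import bisect_left


def next_trading_day(date_str, trading_days):
    """Return the first trading day on or after date_str, or None if none exists.

    trading_days must be a sorted list of ISO "YYYY-MM-DD" strings; ISO dates
    sort lexicographically in chronological order, so string comparison is safe.
    """
    i = bisect_left(trading_days, date_str)
    return trading_days[i] if i < len(trading_days) else None


# Hypothetical calendar: Monday 2024-05-27 was a US market holiday (Memorial Day)
trading_days = ["2024-05-23", "2024-05-24", "2024-05-28"]

print(next_trading_day("2024-05-25", trading_days))  # Saturday maps to 2024-05-28
print(next_trading_day("2024-05-24", trading_days))  # a trading day maps to itself
```

News dated after the last known trading day has no target yet, which is why the helper returns `None` in that case.
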
### 2. Analyzing Sentiment with VADER
Sentiment analysis is performed using __VADER__ (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based tool designed for short, informal texts such as social media posts, which also makes it a reasonable fit for headlines. It requires no model training: each text is scored against a hand-curated sentiment lexicon, with rules that account for cues such as negation, capitalization, punctuation, slang, and emoticons. VADER ships with the Natural Language Toolkit (NLTK) for Python and is used through the `SentimentIntensityAnalyzer` class.

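The `compound` score that this notebook averages per day is bounded in [-1, 1]. As a rough sketch of why, the function below reproduces the normalization step as it appears in VADER's reference implementation, assuming the default smoothing constant `alpha = 15`. The input values are hypothetical valence sums, not outputs from this notebook.

```python
import math


def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1).

    Assumes the form x / sqrt(x^2 + alpha) with alpha = 15, as in VADER's
    reference implementation; larger alpha pulls scores toward zero.
    """
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


# Hypothetical valence sums, for illustration only
for x in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(f"{x:+.1f} -> {normalize(x):+.4f}")
```

Because the function saturates, a headline with many strongly worded terms cannot dominate a daily average the way an unbounded sum would.
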
### Aggregate and preprocess news data for sentiment analysis

In [None]:
# Aggregate News

import csv
import json
from datetime import datetime, timedelta


def aggregate_news_to_trading_days(news_path="../data/nvda_news.csv",
                                   prices_path="../data/nvda_open_prices.csv",
                                   output_path="../data/nvda_aggregated_news.csv"):
    """
    Aggregate news from non-trading days to the next trading day.
    Args:
        news_path (str): path to raw news file with date and JSON-encoded headlines/summaries
        prices_path (str): path to CSV file containing dates of trading days
        output_path (str): path to save the aggregated output file
    """

    # load all trading days into a set for O(1) membership checks;
    # no sorting is needed because the loop below walks the calendar in order
    with open(prices_path, mode="r", encoding="utf-8") as prices_file:
        reader = csv.reader(prices_file)
        next(reader)  # skip the header row
        trading_days = {row[0] for row in reader if row}  # ignore blank rows

    # load news data and group headlines/summaries by date
    news_by_date = {}
    with open(news_path, mode="r", encoding="utf-8") as news_file:
        reader = csv.DictReader(news_file)
        for row in reader:
            date = row["date"]
            try:
                content_list = json.loads(row["content"])
            except json.JSONDecodeError:
                print(f"Invalid JSON content on {date}, skipping.")
                continue

            news_by_date.setdefault(date, []).extend(content_list)

    # aggregate news from non-trading days to the next trading days
    aggregated_news = {}
    buffer = []

    # establish the date range to process (min/max would fail on empty input)
    if not news_by_date:
        print("No news found, nothing to aggregate.")
        return
    news_dates = [datetime.strptime(d, "%Y-%m-%d") for d in news_by_date]
    current_date, max_date = min(news_dates), max(news_dates)

    while current_date <= max_date:
        date_str = current_date.strftime("%Y-%m-%d")

        # accumulate news for each date, if available
        if date_str in news_by_date:
            buffer.extend(news_by_date[date_str])

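        # on a trading day, flush accumulated news into the final dictionary
        if date_str in trading_days:
            aggregated_news[date_str] = list(buffer)
            buffer.clear()

        current_date += timedelta(days=1)

    # news dated after the last trading day in range has no target day yet
    if buffer:
        print(f"{len(buffer)} trailing news items were not assigned to a trading day.")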

    # write aggregated results to CSV
    with open(output_path, mode="w", newline="", encoding="utf-8") as outfile:
        fieldnames = ["date", "content"]
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for date, texts in aggregated_news.items():
            writer.writerow({
                "date": date,
                "content": json.dumps(texts)  # save list back as JSON string for storage
            })

    print(f"Aggregated news written to {output_path}")

# execute the aggregation function
aggregate_news_to_trading_days()

Aggregated news written to ../data/nvda_aggregated_news.csv


### Perform sentiment analysis

In [None]:
import csv
import json
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

# download VADER lexicon
nltk.download('vader_lexicon')

In [9]:
def analyze_sentiment_vader(input_path="../data/nvda_aggregated_news.csv",
                            output_path="../data/nvda_sentiment_scores.csv"):
    """
    Compute daily sentiment scores using VADER and save the average score per day.
    Args:
        input_path (str): path to the aggregated news file
        output_path (str): path to save the daily sentiment scores
    """

    # initialize sentiment analyzer
    sid = SentimentIntensityAnalyzer()
    results = []

    with open(input_path, mode="r", encoding="utf-8") as infile:
        reader = csv.DictReader(infile)

        for row in reader:
            date = row["date"]

            # parse the JSON-encoded list of news texts for this day
            try:
                texts = json.loads(row["content"])
            except json.JSONDecodeError:
                print(f"Skipping malformed JSON for {date}")
                continue

            if not texts:
                continue  # skip if no content
            
            # analyze sentiment for each text using VADER
            scores = [sid.polarity_scores(text) for text in texts]

            # average the compound scores to get a daily sentiment indicator
            avg_compound = sum(s["compound"] for s in scores) / len(scores)

            results.append({
                "date": date,
                "sentiment_score": avg_compound
            })

    # write sentiment scores to CSV
    with open(output_path, mode="w", newline="", encoding="utf-8") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=["date", "sentiment_score"])
        writer.writeheader()
        writer.writerows(results)

    print(f"Saved sentiment scores for {len(results)} days to {output_path}")

# run the sentiment analysis function
analyze_sentiment_vader()

Saved sentiment scores for 37 days to ../data/nvda_sentiment_scores.csv

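To help read the `sentiment_score` column, the sketch below maps a compound score to a coarse label. The ±0.05 cutoff is the convention suggested by VADER's authors for individual texts; applying it to a daily average is an assumption made here, and the function name `label_sentiment` and the sample values are hypothetical.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to "positive", "neutral", or "negative".

    The 0.05 threshold is the commonly cited VADER convention for single
    texts; using it on daily averages is an assumption of this sketch.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical daily averages, for illustration only (not results from this notebook)
examples = {"2024-05-23": 0.31, "2024-05-24": -0.02, "2024-05-28": -0.18}
labels = {day: label_sentiment(score) for day, score in examples.items()}
print(labels)
```

Averaging many headlines tends to pull daily scores toward zero, so a narrower threshold may be more informative for daily data than for single headlines.
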