### Sentiment Analysis on the news

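- Use a finance-specific Hugging Face model such as FinBERT
- Add 2 new columns to the data from news_filtered.csv (saved as news_with_sentiment.csv)
  - Sentiment (positive, neutral, negative)
  - Sentiment_score (the model's confidence in the predicted label, between 0 and 1)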
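
For reference, the `score` returned by a Hugging Face text-classification pipeline is the softmax probability of the winning label, a single number between 0 and 1 rather than an interval. A minimal sketch of that calculation follows. The logits are made up, and the label order in `LABELS` is an assumption for illustration (the real order comes from the model's config).

```python
import math

# Assumed label order, for illustration only; the real mapping lives in the
# model's config (id2label).
LABELS = ["positive", "negative", "neutral"]

def label_and_score(logits):
    """Return (label, softmax probability) for the highest-scoring class."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Made-up logits for a single headline
label, score = label_and_score([2.1, -0.3, 0.4])
print(label, round(score, 3))
```

A score near 1.0 means the model strongly prefers one label; with three labels, the winning score cannot fall below 1/3.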

In [2]:
# Verify the transformers install (pip install transformers torch)
import transformers, torch
print("Transformers version:", transformers.__version__)
print("Torch version:", torch.__version__)

Transformers version: 4.56.2
Torch version: 2.8.0+cpu


In [3]:
import pandas as pd
from transformers import pipeline
import math

# ----------------------------
# 1. Load dataset
# ----------------------------
news = pd.read_csv("../data/raw/FNSPID/news_filtered.csv")

# Use a summary if available (faster + within token limit), otherwise full article
text_col = news["Textrank_summary"].fillna(news["Article"].astype(str))

# ----------------------------
# 2. Load FinBERT model
# ----------------------------
sentiment_pipeline = pipeline("sentiment-analysis", model="ProsusAI/finbert")

# ----------------------------
# 3. Batch inference
# ----------------------------
batch_size = 32   # larger batches need more memory; tune for your hardware
results = []

num_batches = math.ceil(len(text_col) / batch_size)
for i in range(0, len(text_col), batch_size):
    batch_texts = text_col.iloc[i:i+batch_size].tolist()
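    # Crude cut at 512 characters; FinBERT's limit is 512 tokens, so
    # truncation=True also lets the tokenizer enforce the real limit.
    batch_texts = [t[:512] for t in batch_texts]
    batch_results = sentiment_pipeline(batch_texts, truncation=True)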
    results.extend(batch_results)

    if (i // batch_size) % 20 == 0:  # log every ~20 batches
        print(f"Processed {i+len(batch_texts)} / {len(text_col)} articles")

# ----------------------------
# 4. Attach results to dataframe
# ----------------------------
news["Sentiment"] = [r["label"] for r in results]
news["Sentiment_score"] = [r["score"] for r in results]

# ----------------------------
# 5. Save output
# ----------------------------
out_path = "../data/processed/news_with_sentiment.csv"
news.to_csv(out_path, index=False)

print(f"✅ Saved sentiment results to {out_path}")


Device set to use cpu

Processed 32 / 13647 articles
Processed 672 / 13647 articles
Processed 1312 / 13647 articles
Processed 1952 / 13647 articles
Processed 2592 / 13647 articles
Processed 3232 / 13647 articles
Processed 3872 / 13647 articles
Processed 4512 / 13647 articles
Processed 5152 / 13647 articles
Processed 5792 / 13647 articles
Processed 6432 / 13647 articles
Processed 7072 / 13647 articles
Processed 7712 / 13647 articles
Processed 8352 / 13647 articles
Processed 8992 / 13647 articles
Processed 9632 / 13647 articles
Processed 10272 / 13647 articles
Processed 10912 / 13647 articles
Processed 11552 / 13647 articles
Processed 12192 / 13647 articles
Processed 12832 / 13647 articles
Processed 13472 / 13647 articles
✅ Saved sentiment results to ../data/processed/news_with_sentiment.csv
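
The batching arithmetic behind the progress log can be checked without loading the model. This is a small sketch in which `batches` is a hypothetical helper and the texts are placeholders standing in for the 13,647 articles:

```python
import math

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder texts standing in for the real articles
texts = [f"article {n}" for n in range(13647)]
chunks = list(batches(texts, 32))

print("batches:", len(chunks), "expected:", math.ceil(len(texts) / 32))
print("size of final batch:", len(chunks[-1]))
```

Because the loop logs only every 20th batch, the last progress line reads 13472 rather than 13647, even though the final partial batch is still processed.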


- Aggregate daily sentiment -> daily_sentiment.csv
- Merge with stock data -> stocks_news_merged.csv


In [1]:
import pandas as pd

# Load sentiment results
news = pd.read_csv("../data/processed/news_with_sentiment.csv")
news["Date"] = pd.to_datetime(news["Date"])

# Map sentiment labels to numeric values for averaging
label_map = {"positive": 1, "neutral": 0, "negative": -1}
news["Sentiment_numeric"] = news["Sentiment"].map(label_map)

# Daily sentiment per stock
daily_sentiment = (
    news.groupby(["Date", "Stock_symbol"])
        .agg(
            avg_sentiment_score=("Sentiment_score", "mean"),
            avg_sentiment_numeric=("Sentiment_numeric", "mean"),
            article_count=("Sentiment", "count")
        )
        .reset_index()
)

# Save
daily_sentiment.to_csv("../data/processed/daily_sentiment.csv", index=False)
print("✅ Saved daily_sentiment.csv with shape:", daily_sentiment.shape)


✅ Saved daily_sentiment.csv with shape: (853, 5)


Each row is one unique (Date, Stock_symbol) combination.

The dataset covers AAPL and AMZN from 2020-09-16 to 2025-09-15.

That is about 1,250 trading days per ticker, but only 853 (Date, Stock_symbol) pairs appear here because not every day has news articles.
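
One caveat about the aggregation: `avg_sentiment_score` averages the confidence of whichever label won, so a confidently negative day and a confidently positive day look alike in that column. A possible alternative, sketched here with made-up rows and a hypothetical `signed_score` column, is to weight the label direction by the confidence before averaging:

```python
import pandas as pd

# Made-up rows in the same shape as news_with_sentiment.csv
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02"] * 3),
    "Stock_symbol": ["AAPL"] * 3,
    "Sentiment": ["positive", "negative", "neutral"],
    "Sentiment_score": [0.90, 0.80, 0.60],
})

label_map = {"positive": 1, "neutral": 0, "negative": -1}
# Hypothetical column: label direction weighted by model confidence
toy["signed_score"] = toy["Sentiment"].map(label_map) * toy["Sentiment_score"]

daily = (
    toy.groupby(["Date", "Stock_symbol"])
       .agg(
           avg_sentiment_score=("Sentiment_score", "mean"),
           avg_signed_score=("signed_score", "mean"),
           article_count=("Sentiment", "count"),
       )
       .reset_index()
)
print(daily)
```

In this toy day the unsigned average is high while the signed average is close to zero, which reflects the mixed news more faithfully.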

### Merge with stock data

In [3]:
# Load filtered stock data
stocks = pd.read_csv("../data/processed/stocks_filtered.csv")

# Ensure both are tz-naive
stocks["Date"] = pd.to_datetime(stocks["Date"]).dt.tz_localize(None)
daily_sentiment["Date"] = pd.to_datetime(daily_sentiment["Date"]).dt.tz_localize(None)

merged = pd.merge(
    stocks,
    daily_sentiment,
    left_on=["Date", "Ticker"],
    right_on=["Date", "Stock_symbol"],
    how="left"
)

# Drop duplicate Stock_symbol column
merged = merged.drop(columns=["Stock_symbol"])

merged.to_csv("../data/processed/stocks_news_merged.csv", index=False)
print("✅ Merged dataset saved with shape:", merged.shape)


✅ Merged dataset saved with shape: (2510, 10)


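The merged file has 2,510 rows (every trading day x 2 tickers). Because this is a left join on the stock data, trading days without news keep their price rows and get NaN in the sentiment columns, while news dated on non-trading days is dropped.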
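
To see how the left join treats days with and without news, here is a minimal sketch on made-up data. The frames `stocks_toy` and `sentiment_toy` and the `Close` values are illustrative assumptions; `indicator=True` is a standard `pd.merge` option that adds a `_merge` column.

```python
import pandas as pd

# Made-up prices: two trading days for one ticker
stocks_toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
    "Ticker": ["AAPL", "AAPL"],
    "Close": [185.0, 184.0],
})
# Made-up sentiment: news on one trading day and on a Saturday
sentiment_toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-06"]),
    "Stock_symbol": ["AAPL", "AAPL"],
    "avg_sentiment_numeric": [0.5, -1.0],
})

merged_toy = pd.merge(
    stocks_toy,
    sentiment_toy,
    left_on=["Date", "Ticker"],
    right_on=["Date", "Stock_symbol"],
    how="left",
    indicator=True,  # adds a _merge column: "both" or "left_only"
).drop(columns=["Stock_symbol"])

print(merged_toy)
print("Trading days with news:", (merged_toy["_merge"] == "both").sum())
```

Counting the `both` rows on the real merge would show how many of the 2,510 trading-day rows actually carry sentiment, which is worth knowing before deciding how to fill or model the gaps.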