### Sentiment Benchmark with FinancialPhraseBank

This notebook evaluates sentiment classification accuracy on the FinancialPhraseBank dataset using three tools:
- NLTK VADER (lexicon-based)
- FinBERT (finance-domain transformer)
- DistilBERT (general-domain transformer)

Data files are loaded from `data/external/FinancialPhraseBank-v1.0`. We report accuracy on four annotator-agreement splits (AllAgree, 75%, 66%, 50%). In our runs, FinBERT achieved the highest accuracy on every split (0.89 to 0.97), VADER was moderate (0.54 to 0.57), and DistilBERT scored lowest (0.26 to 0.30), which we attribute to its binary label space and limited handling of "neutral".
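To illustrate why a binary label space limits accuracy on a three-class dataset, here is a minimal sketch. It assumes the `sentence@label` line layout of the FinancialPhraseBank text files; the sample sentences, `parse_line`, and `binary_ceiling` are hypothetical names made up for illustration and are not part of `models/sentiment_analysis.py`. It also assumes the binary model only ever predicts "positive" or "negative", which may not match how the benchmark maps DistilBERT outputs.

```python
from collections import Counter

# Toy lines in the "sentence@label" layout (illustrative sentences,
# NOT taken from the dataset).
SAMPLE_LINES = [
    "Operating profit rose compared with the previous year.@positive",
    "The company will publish its report in March.@neutral",
    "Net sales fell during the quarter.@negative",
    "The share capital is unchanged.@neutral",
]


def parse_line(line):
    """Split one line into (sentence, label) at the last '@'."""
    sentence, label = line.rsplit("@", 1)
    return sentence.strip(), label.strip().lower()


def binary_ceiling(labels, binary_classes=("positive", "negative")):
    """Best possible accuracy for a model that can only emit `binary_classes`.

    Every gold label outside that set (here: "neutral") is a guaranteed miss.
    """
    counts = Counter(labels)
    reachable = sum(counts[c] for c in binary_classes)
    return reachable / len(labels)


labels = [parse_line(line)[1] for line in SAMPLE_LINES]
ceiling = binary_ceiling(labels)
print(f"Upper bound on accuracy for a binary model: {ceiling:.2f}")
```

On this toy sample half of the labels are neutral, so even a perfect positive/negative classifier could reach at most 0.50. The same calculation on a real split would give the ceiling against which a binary model's score should be read.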

In [1]:
# Delegate to models/sentiment_analysis.py
import sys
from pathlib import Path

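# Absolute path to the project root on the author's machine; adjust it before running elsewhere.
PROJECT_ROOT = Path('/Users/hwang-yejin/Desktop/Summer1/Proposal/code_cleaning/Financial Time Series Forecasting with Deep Learning Models and Social Media Sentiment')
# Put the project root on sys.path so the `models` package can be imported.
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))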

from models.sentiment_analysis import run_sentiment_benchmark

DATASET_DIR = PROJECT_ROOT / 'data' / 'external' / 'FinancialPhraseBank-v1.0'

metrics = run_sentiment_benchmark(
    dataset_dir=str(DATASET_DIR),
    focus_filename='Sentences_AllAgree.txt',
    extra_filenames=('Sentences_75Agree.txt', 'Sentences_66Agree.txt', 'Sentences_50Agree.txt'),
    run_vader=True,
    run_finbert=True,
    run_distilbert=True,
    verbose=True,
)

metrics



--- Evaluating models on Sentences_AllAgree.txt ---
VADER Accuracy: 0.5707


Device set to use mps:0
Device set to use mps:0


FinBERT Accuracy: 0.9717
DistilBERT Accuracy: 0.2584 (binary model, neutral handling may differ)

--- Evaluating models on Sentences_75Agree.txt ---
VADER Accuracy: 0.5627


Device set to use mps:0
Device set to use mps:0


FinBERT Accuracy: 0.9473
DistilBERT Accuracy: 0.2667 (binary model, neutral handling may differ)

--- Evaluating models on Sentences_66Agree.txt ---
VADER Accuracy: 0.5563


Device set to use mps:0
Device set to use mps:0


FinBERT Accuracy: 0.9182
DistilBERT Accuracy: 0.2912 (binary model, neutral handling may differ)

--- Evaluating models on Sentences_50Agree.txt ---
VADER Accuracy: 0.5429


Device set to use mps:0


FinBERT Accuracy: 0.8896


Device set to use mps:0


DistilBERT Accuracy: 0.2992 (binary model, neutral handling may differ)


{'Sentences_AllAgree.txt': {'vader': 0.5706713780918727,
  'finbert': 0.9717314487632509,
  'distilbert': 0.2583922261484099},
 'Sentences_75Agree.txt': {'vader': 0.562699102229945,
  'finbert': 0.9472922096727483,
  'distilbert': 0.26672458731537796},
 'Sentences_66Agree.txt': {'vader': 0.5563196585250177,
  'finbert': 0.9181882855110268,
  'distilbert': 0.29120227649988145},
 'Sentences_50Agree.txt': {'vader': 0.5429219975237309,
  'finbert': 0.8895996698307883,
  'distilbert': 0.2992158481221626}}
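
The returned dictionary can be summarised per split. The following is a minimal sketch, assuming only the nested `{filename: {model: accuracy}}` shape shown above; the values are copied from this run and rounded to four decimals, and `best_per_split` and `finbert_drop` are hypothetical names introduced for illustration.

```python
# Accuracies copied from the output above, rounded to four decimals.
metrics = {
    "Sentences_AllAgree.txt": {"vader": 0.5707, "finbert": 0.9717, "distilbert": 0.2584},
    "Sentences_75Agree.txt": {"vader": 0.5627, "finbert": 0.9473, "distilbert": 0.2667},
    "Sentences_66Agree.txt": {"vader": 0.5563, "finbert": 0.9182, "distilbert": 0.2912},
    "Sentences_50Agree.txt": {"vader": 0.5429, "finbert": 0.8896, "distilbert": 0.2992},
}

# Best-scoring model on each split.
best_per_split = {
    split: max(scores, key=scores.get) for split, scores in metrics.items()
}

# How much FinBERT accuracy falls from the strictest to the loosest split.
finbert_drop = round(
    metrics["Sentences_AllAgree.txt"]["finbert"]
    - metrics["Sentences_50Agree.txt"]["finbert"],
    4,
)

# Print a small fixed-width table, one row per split.
models = ["vader", "finbert", "distilbert"]
print(f"{'split':<26}" + "".join(f"{m:>12}" for m in models))
for split, scores in metrics.items():
    print(f"{split:<26}" + "".join(f"{scores[m]:>12.4f}" for m in models))

print("Best model per split:", best_per_split)
print("FinBERT drop, AllAgree to 50Agree:", finbert_drop)
```

FinBERT's accuracy declines as the agreement threshold is loosened, which is consistent with the lower-agreement splits containing more ambiguous sentences.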