# 06 — Sentiment Analysis (Phase 2)

This notebook demonstrates the full Phase 2 sentiment pipeline:
1. Fetch financial news from multiple sources
2. Score headlines with FinBERT
3. Aggregate into daily sentiment features
4. Merge with price data and evaluate impact on predictions

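**Prerequisites:** `pip install transformers torch newsapi-python requests`, plus the packages this notebook imports directly (`pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`).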

In [None]:
import sys, os
sys.path.insert(0, os.path.abspath('..'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
%matplotlib inline

## 1. Fetch News Headlines

We use three free sources:
- **yfinance**: built in, returns roughly 10 to 15 recent headlines per ticker
- **GDELT**: about three months of history, no API key required
- **NewsAPI** (optional): requires a free key from newsapi.org
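
The internals of `AggregateNewsFetcher` are not shown in this notebook. As a rough sketch of what merging several sources into "unique headlines" can look like, the helper below concatenates headline lists and drops near-identical titles. The function name `merge_headlines`, the `published` key, and the normalisation rule are assumptions for illustration, not the project's actual implementation.

```python
import pandas as pd


def merge_headlines(*sources):
    """Combine headline lists from several sources and drop duplicates.

    Each source is a list of dicts with 'title', 'source' and 'published'
    keys. This item schema is an assumption for illustration only.
    """
    rows = [item for items in sources for item in items]
    df = pd.DataFrame(rows, columns=["title", "source", "published"])
    df["published"] = pd.to_datetime(df["published"], utc=True)
    # Normalise titles so case and whitespace differences still match.
    df["title_key"] = (
        df["title"].str.lower().str.replace(r"\s+", " ", regex=True).str.strip()
    )
    # Keep the earliest copy of each headline.
    return (
        df.sort_values("published")
        .drop_duplicates(subset="title_key", keep="first")
        .drop(columns="title_key")
        .reset_index(drop=True)
    )


yf_items = [
    {"title": "Apple unveils new chip", "source": "yfinance",
     "published": "2024-05-01T13:00:00Z"},
]
gdelt_items = [
    {"title": "apple  unveils new chip", "source": "gdelt",
     "published": "2024-05-01T15:30:00Z"},
    {"title": "Apple faces EU antitrust fine", "source": "gdelt",
     "published": "2024-05-02T09:00:00Z"},
]
merged = merge_headlines(yf_items, gdelt_items)
print(merged[["published", "source", "title"]])
```

Keeping the earliest copy is one reasonable choice; keeping the copy from a preferred source would work just as well.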

In [None]:
from src.data.news_fetcher import (
    YFinanceNewsFetcher,
    GDELTNewsFetcher,
    AggregateNewsFetcher,
)

TICKER = 'AAPL'
COMPANY = 'Apple'  # Broader search term for news APIs

# Individual sources
yf_fetcher = YFinanceNewsFetcher()
yf_news = yf_fetcher.fetch_news(TICKER)
print(f'yfinance: {len(yf_news)} headlines')
for item in yf_news[:3]:
    print(f'  [{item["source"]}] {item["title"][:80]}')

In [None]:
gdelt_fetcher = GDELTNewsFetcher()
gdelt_news = gdelt_fetcher.fetch_news(
    COMPANY, 
    start_date=(datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
    max_results=50
)
print(f'GDELT: {len(gdelt_news)} headlines')
for item in gdelt_news[:3]:
    print(f'  [{item["source"]}] {item["title"][:80]}')

In [None]:
# Aggregate all sources
# Set your NewsAPI key here (optional — get free at newsapi.org/register)
NEWSAPI_KEY = None  # e.g. 'your_key_here'

agg_fetcher = AggregateNewsFetcher(newsapi_key=NEWSAPI_KEY, use_gdelt=True)
news_df = agg_fetcher.fetch_all(
    query=COMPANY,
    start_date=(datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
)

print(f'\nTotal unique headlines: {len(news_df)}')
print(f'Sources: {news_df["source"].value_counts().to_dict()}')
news_df.head(10)

## 2. FinBERT Sentiment Scoring

FinBERT (ProsusAI/finbert) classifies each headline as:
- **Positive** (+1.0): bullish sentiment
- **Negative** (-1.0): bearish sentiment  
- **Neutral** (0.0): no directional signal

The first run downloads the ~420 MB model, which is cached locally afterwards.
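
To make the label-to-score mapping concrete, here is a minimal sketch of how three class logits could be turned into a label, a confidence, and a polarity score. It is not the code inside `FinBERTSentimentScorer`. The class order in `LABELS` is an assumption and should be checked against `model.config.id2label` on the loaded checkpoint. Whether the project's `sentiment_score` is the bare polarity or a probability-weighted value is not shown here.

```python
import numpy as np

# Assumed class order; verify against model.config.id2label.
LABELS = ("positive", "negative", "neutral")
POLARITY = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}


def logits_to_sentiment(logits):
    """Map one row of three class logits to (label, confidence, score)."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()  # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()  # softmax
    idx = int(probs.argmax())
    label = LABELS[idx]
    confidence = float(probs[idx])  # probability of the winning class
    return label, confidence, POLARITY[label]


label, confidence, score = logits_to_sentiment([0.2, 3.1, 0.5])
print(label, round(confidence, 3), score)
```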

In [None]:
from src.features.sentiment import FinBERTSentimentScorer

scorer = FinBERTSentimentScorer(device='cpu')  # 'cuda' if you have GPU

# Score all headlines
scored_df = scorer.score_dataframe(news_df)

print(f'Scored {len(scored_df)} headlines')
print(f'\nSentiment distribution:')
print(scored_df['sentiment_label'].value_counts())
print(f'\nMean sentiment score: {scored_df["sentiment_score"].mean():.3f}')

In [None]:
# Show the highest-confidence headlines for each polarity

print('\n--- Most POSITIVE headlines ---')
pos = scored_df[scored_df['sentiment_label'] == 'positive'].nlargest(5, 'confidence')
for _, row in pos.iterrows():
    print(f'  [{row["confidence"]:.2f}] {row["title"][:90]}')

print('\n--- Most NEGATIVE headlines ---')
neg = scored_df[scored_df['sentiment_label'] == 'negative'].nlargest(5, 'confidence')
for _, row in neg.iterrows():
    print(f'  [{row["confidence"]:.2f}] {row["title"][:90]}')

In [None]:
# Sentiment distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
counts = scored_df['sentiment_label'].value_counts()
colors = {'positive': '#2ecc71', 'negative': '#e74c3c', 'neutral': '#95a5a6'}
axes[0].pie(
    counts.values, labels=counts.index, 
    colors=[colors.get(label, '#ccc') for label in counts.index],
    autopct='%1.1f%%', startangle=90
)
axes[0].set_title(f'Sentiment Distribution ({COMPANY})')

# Confidence histogram
for label in ['positive', 'negative', 'neutral']:
    subset = scored_df[scored_df['sentiment_label'] == label]
    if not subset.empty:
        axes[1].hist(subset['confidence'], bins=20, alpha=0.6, 
                     label=label, color=colors.get(label, '#ccc'))
axes[1].set_xlabel('Confidence')
axes[1].set_ylabel('Count')
axes[1].set_title('FinBERT Confidence Distribution')
axes[1].legend()

plt.tight_layout()
plt.show()

## 3. Daily Sentiment Aggregation

Convert per-headline scores into daily trading features.
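
The aggregation step can be sketched with a plain pandas `groupby`. This is an illustration only: `aggregate_daily_sentiment` may compute its features differently. The `published` column name and the helper name `daily_sentiment_sketch` are assumptions. The output column names follow the features listed in this notebook.

```python
import pandas as pd


def daily_sentiment_sketch(scored):
    """Aggregate per-headline scores into one row per calendar day.

    Expects 'published', 'sentiment_label' and 'sentiment_score' columns;
    the 'published' column name is an assumption.
    """
    df = scored.assign(
        date=pd.to_datetime(scored["published"]).dt.normalize(),
        is_pos=scored["sentiment_label"].eq("positive"),
        is_neg=scored["sentiment_label"].eq("negative"),
    )
    daily = df.groupby("date").agg(
        sentiment_mean=("sentiment_score", "mean"),
        sentiment_std=("sentiment_score", "std"),
        sentiment_count=("sentiment_score", "size"),
        sentiment_positive_ratio=("is_pos", "mean"),
        sentiment_negative_ratio=("is_neg", "mean"),
    )
    # std is NaN on single-headline days; treat that as zero dispersion.
    daily["sentiment_std"] = daily["sentiment_std"].fillna(0.0)
    return daily


example = pd.DataFrame(
    {
        "published": ["2024-05-01 09:00", "2024-05-01 14:00", "2024-05-02 10:00"],
        "sentiment_label": ["positive", "negative", "positive"],
        "sentiment_score": [1.0, -1.0, 1.0],
    }
)
daily_example = daily_sentiment_sketch(example)
print(daily_example)
```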

In [None]:
from src.features.sentiment import aggregate_daily_sentiment

daily_sentiment = aggregate_daily_sentiment(scored_df)
print(f'Daily sentiment features: {daily_sentiment.shape}')
daily_sentiment.tail(10)

In [None]:
# Plot daily sentiment over time
fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

# Mean sentiment
axes[0].bar(daily_sentiment.index, daily_sentiment['sentiment_mean'],
            color=['#2ecc71' if x > 0 else '#e74c3c' for x in daily_sentiment['sentiment_mean']],
            alpha=0.7)
axes[0].axhline(y=0, color='black', linewidth=0.5)
axes[0].set_ylabel('Mean Sentiment')
axes[0].set_title(f'Daily Sentiment Features — {COMPANY}')

# Positive / negative ratio
axes[1].fill_between(daily_sentiment.index, daily_sentiment['sentiment_positive_ratio'],
                     alpha=0.5, color='#2ecc71', label='Positive ratio')
axes[1].fill_between(daily_sentiment.index, -daily_sentiment['sentiment_negative_ratio'],
                     alpha=0.5, color='#e74c3c', label='Negative ratio')
axes[1].axhline(y=0, color='black', linewidth=0.5)
axes[1].set_ylabel('Sentiment Ratio')
axes[1].legend()

# News volume & momentum
axes[2].bar(daily_sentiment.index, daily_sentiment['sentiment_count'],
            alpha=0.4, color='gray', label='Headline count')
ax2 = axes[2].twinx()
ax2.plot(daily_sentiment.index, daily_sentiment['sentiment_momentum_5d'],
         color='blue', linewidth=2, label='5d Momentum')
axes[2].set_ylabel('Headlines')
ax2.set_ylabel('5d Sentiment Momentum')
axes[2].legend(loc='upper left')
ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

## 4. Merge with Price Data & Evaluate Impact
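
Before running the real merge, it helps to see what joining daily sentiment onto a price index involves. The sketch below is a hypothetical stand-in for `merge_sentiment_features`, whose actual gap-filling behaviour is not shown in this notebook. It left-joins on the date index, fills days without news with zero, and adds a `has_sentiment` flag (an assumed column) so that "no news" can be told apart from "neutral news".

```python
import pandas as pd


def merge_sentiment_sketch(prices, daily):
    """Left-join daily sentiment onto a price frame indexed by trading day."""
    merged = prices.join(daily, how="left")
    sentiment_cols = list(daily.columns)
    # Flag rows that actually had news before filling the gaps.
    merged["has_sentiment"] = merged[sentiment_cols[0]].notna()
    merged[sentiment_cols] = merged[sentiment_cols].fillna(0.0)
    return merged


prices_example = pd.DataFrame(
    {"close": [170.0, 172.5, 171.0]},
    index=pd.date_range("2024-05-01", periods=3, freq="B"),
)
sentiment_example = pd.DataFrame(
    {"sentiment_mean": [0.4], "sentiment_count": [3]},
    index=pd.to_datetime(["2024-05-02"]),
)
merged_example = merge_sentiment_sketch(prices_example, sentiment_example)
coverage = merged_example["has_sentiment"].mean()
print(merged_example)
print(f"Sentiment coverage: {coverage:.0%}")
```

Checking coverage matters here because the news window in this notebook is about 30 days while the price history spans five years, so most merged rows carry no real sentiment observation.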

In [None]:
from src.data.fetcher import YFinanceFetcher
from src.data.preprocessing import preprocess_ohlcv
from src.features.technical import compute_technical_indicators
from src.features.returns import compute_return_features
from src.features.labels import generate_labels, get_clean_features_and_labels
from src.features.sentiment import merge_sentiment_features
from src.data.market_config import load_market_config, load_strategy_config

market_config = load_market_config('stocks')
strategy_config = load_strategy_config('short_term')

# Fetch price data
fetcher = YFinanceFetcher()
end = datetime.now().strftime('%Y-%m-%d')
start = (datetime.now() - timedelta(days=5*365)).strftime('%Y-%m-%d')

raw = fetcher.fetch(TICKER, start=start, end=end)
df = preprocess_ohlcv(raw, market_config=market_config)
df = compute_technical_indicators(df)
df = compute_return_features(df)

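# Merge sentiment. News covers only the last ~30 days while prices span
# 5 years, so most rows have no real sentiment observations.
df = merge_sentiment_features(df, daily_sentiment)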

# Labels
df = generate_labels(df, horizon=1, label_type='classification', num_classes=3, threshold=0.01)
X, y = get_clean_features_and_labels(df)

print(f'Features: {X.shape[1]} (technical + return + sentiment)')
sentiment_cols = [c for c in X.columns if 'sentiment' in c]
print(f'Sentiment features: {sentiment_cols}')

In [None]:
# Compare: with vs without sentiment features
from src.models.xgboost_classifier import MarketPulseXGBClassifier
from src.utils.validation import WalkForwardValidator
from sklearn.metrics import f1_score

validator = WalkForwardValidator.from_strategy_config(strategy_config)
folds = validator.split(X)


def walk_forward_f1(features, labels, folds):
    """Train a fresh classifier per fold and return the macro-F1 scores."""
    scores = []
    for fold in folds:
        X_tr, y_tr, X_te, y_te = validator.get_fold_data(features, labels, fold)
        m = MarketPulseXGBClassifier(num_classes=3)
        m.fit(X_tr, y_tr, balance_classes=True)
        pred = m.predict(X_te)
        scores.append(f1_score(y_te.astype(int), pred, average='macro', zero_division=0))
    return scores


# Evaluate both feature sets on the same 10 most recent folds
recent_folds = folds[-10:]
X_no_sent = X.drop(columns=sentiment_cols, errors='ignore')
scores_no = walk_forward_f1(X_no_sent, y, recent_folds)
scores_with = walk_forward_f1(X, y, recent_folds)

print(f'Without sentiment: F1 = {np.mean(scores_no):.4f}')
print(f'With sentiment:    F1 = {np.mean(scores_with):.4f}')
print(f'Delta:             {np.mean(scores_with) - np.mean(scores_no):+.4f}')
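
The comparison above relies on walk-forward folds, where each test window lies strictly after its training window. `WalkForwardValidator` is not shown in this notebook, so the generator below is only a sketch of the general idea. The rolling (fixed-size) training window and the parameter names are assumptions.

```python
def walk_forward_splits(n_samples, train_size, test_size, step=None):
    """Yield (train_idx, test_idx) ranges for rolling walk-forward validation."""
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = range(start, start + train_size)
        test_idx = range(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += step  # slide both windows forward in time


demo_folds = list(walk_forward_splits(n_samples=10, train_size=4, test_size=2))
for train_idx, test_idx in demo_folds:
    print(list(train_idx), "->", list(test_idx))
```

Because every test index comes after every training index in its fold, the model never sees future rows during fitting.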

In [None]:
# Sentiment feature importance in the model
model = MarketPulseXGBClassifier(num_classes=3)
model.fit(X, y, balance_classes=True)
importance = model.get_feature_importance()

# Highlight sentiment features
top20 = importance.head(20)
colors = ['#e74c3c' if 'sentiment' in f else '#3498db' for f in top20.index]

fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(range(len(top20)), top20.values, color=colors)
ax.set_yticks(range(len(top20)))
ax.set_yticklabels(top20.index)
ax.set_title('Top 20 Feature Importance (red = sentiment)')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

## 5. Sentiment Surprise as a Trading Signal

**Sentiment surprise** = today's sentiment − 5-day rolling mean.

A large positive surprise means unexpectedly bullish news → potential UP signal.
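
As a worked example of the definition, the sketch below subtracts a trailing 5-day mean from each day's sentiment. The baseline here excludes the current day (via `shift(1)`), which is one possible reading of "5-day rolling mean". The project's `sentiment_surprise` may include the current day instead, and the helper name is hypothetical.

```python
import pandas as pd


def sentiment_surprise_sketch(daily_mean, window=5):
    """Return daily sentiment minus the mean of the previous `window` days."""
    # shift(1) keeps today's value out of its own baseline (an assumption).
    baseline = daily_mean.shift(1).rolling(window, min_periods=window).mean()
    return daily_mean - baseline


series = pd.Series(
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.75],
    index=pd.date_range("2024-05-01", periods=6, freq="D"),
)
surprise = sentiment_surprise_sketch(series)
print(surprise)
```

The first five days have no full baseline and come out as NaN. The sixth day's jump against a flat history shows up as a large positive surprise.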

In [None]:
# Sentiment surprise vs next-day returns
if 'sentiment_surprise' in df.columns and 'fwd_return_1d' in df.columns:
    analysis = df[['sentiment_surprise', 'fwd_return_1d']].dropna()
    
    if len(analysis) > 10:
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        
        # Scatter
        axes[0].scatter(analysis['sentiment_surprise'], analysis['fwd_return_1d'],
                        alpha=0.3, s=10)
        axes[0].set_xlabel('Sentiment Surprise')
        axes[0].set_ylabel('Next-Day Return')
        axes[0].set_title('Sentiment Surprise vs Forward Return')
        
        # Bucketed analysis. Ranking first breaks ties, so qcut always
        # yields five bins and the five labels stay valid.
        analysis['surprise_bucket'] = pd.qcut(
            analysis['sentiment_surprise'].rank(method='first'), q=5,
            labels=['Very Neg', 'Neg', 'Neutral', 'Pos', 'Very Pos']
        )
        bucket_returns = analysis.groupby('surprise_bucket', observed=False)['fwd_return_1d'].mean()
        bucket_returns.plot(kind='bar', ax=axes[1], color='#3498db')
        axes[1].set_xlabel('Sentiment Surprise Quintile')
        axes[1].set_ylabel('Mean Next-Day Return')
        axes[1].set_title('Forward Returns by Sentiment Surprise Quintile')
        axes[1].tick_params(axis='x', rotation=0)
        
        plt.tight_layout()
        plt.show()
    else:
        print('Not enough overlapping data between sentiment and returns.')
else:
    print('Columns not available. This works best with more news history.')

## Summary

**Phase 2 adds 8 sentiment features** to the model:
- `sentiment_mean/std`: daily polarity and dispersion
- `sentiment_positive/negative_ratio`: bullish/bearish headline percentages
- `sentiment_count`: news volume (heavier coverage often coincides with higher volatility)
- `sentiment_momentum_3d/5d`: rolling sentiment trends
- `sentiment_surprise`: deviation of today's sentiment from its 5-day rolling mean (unexpected shifts)

**Usage in production:**
```bash
python -m src.models.trainer --tickers AAPL --sentiment
```