# Task 3: Correlation Analysis between News Sentiment and Stock Price Movements

This notebook demonstrates how to analyze the relationship between news sentiment and stock price movements using the modular codebase.  
We will:
- Score news headlines for sentiment
- Aggregate sentiment by date
- Merge with stock price and indicator data
- Compute and visualize correlations

In [5]:
import sys
import os

# Add the src directory to sys.path
src_path = os.path.abspath(os.path.join(os.getcwd(), '..', 'src'))
if src_path not in sys.path:
    sys.path.insert(0, src_path)

In [6]:
import pandas as pd
from sentiment_analysis import score_headlines_vader
from correlation_analysis import (
    merge_sentiment_price,
    compute_correlations,
    plot_correlation_heatmap
)
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Load News and Price Data

Replace the file paths and column names as needed for your data.

In [None]:
# Load news data (update the path as needed)
news_df = pd.read_csv('../data/raw_analys_ratings.csv')

# Load price/indicator data (update the path as needed; read_csv expects a single CSV file)
price_df = pd.read_csv('../data/yfinance_data')

# Preview the data
display(news_df.head())
display(price_df.head())

## 2. Score News Headlines for Sentiment

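We use the VADER sentiment analyzer to score each headline.  
This adds four columns: `sentiment_neg`, `sentiment_neu`, and `sentiment_pos` (the proportions of negative, neutral, and positive text) and `sentiment_compound` (a normalized overall score between -1 and +1).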

In [None]:
# Score sentiment for each headline (update 'headline' if your column is named differently)
news_df = score_headlines_vader(news_df, text_col='headline')

# Preview sentiment columns
news_df[['headline', 'sentiment_neg', 'sentiment_neu', 'sentiment_pos', 'sentiment_compound']].head()
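
To read the compound score at a glance, it can be bucketed into coarse labels. The sketch below is illustrative only: `label_compound` is a hypothetical helper that is not part of the project code, the ±0.05 cut-offs are the thresholds conventionally suggested in the VADER documentation, and the scores are toy values rather than real output.

```python
import pandas as pd


def label_compound(score: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Toy scores for illustration only
demo = pd.DataFrame({"sentiment_compound": [0.62, -0.30, 0.01]})
demo["sentiment_label"] = demo["sentiment_compound"].apply(label_compound)
print(demo)
```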

## 3. Aggregate Sentiment by Date

Aggregate sentiment scores to the daily level to match the granularity of price data.

In [None]:
# Ensure 'date' is datetime and drop any time of day, so that grouping is per calendar day
news_df['date'] = pd.to_datetime(news_df['date']).dt.normalize()

# Aggregate: mean sentiment per day
daily_sentiment = news_df.groupby('date').agg({
    'sentiment_neg': 'mean',
    'sentiment_neu': 'mean',
    'sentiment_pos': 'mean',
    'sentiment_compound': 'mean'
}).reset_index()

daily_sentiment.head()
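
Daily aggregation only works if every headline from the same day shares one key. The sketch below uses toy data (column names follow the notebook, values are made up) to show timestamps being floored to midnight before grouping. It also records how many headlines fed each daily mean, which helps judge how reliable that mean is.

```python
import pandas as pd

# Toy headlines: two on the same day at different times, one on the next day
toy_news = pd.DataFrame({
    "date": ["2020-06-01 09:30:00", "2020-06-01 15:45:00", "2020-06-02 10:00:00"],
    "sentiment_compound": [0.4, -0.2, 0.1],
})

# Floor timestamps to midnight so both 2020-06-01 rows share one group key
toy_news["date"] = pd.to_datetime(toy_news["date"]).dt.normalize()

toy_daily = (
    toy_news.groupby("date")
    .agg(
        sentiment_compound=("sentiment_compound", "mean"),
        headline_count=("sentiment_compound", "count"),
    )
    .reset_index()
)
print(toy_daily)
```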

## 4. Prepare Price Data

Ensure price data has a 'date' column and relevant indicators/returns.

In [None]:
# Ensure 'date' column is datetime
price_df['date'] = pd.to_datetime(price_df['date'])

# Preview columns to select which indicators/returns to use
print(price_df.columns)
price_df.head()
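
The correlation step later in this notebook expects columns such as `daily_return` and `SMA_20`. If the price file holds only raw prices, they can be derived first. This is a minimal sketch on toy data; the `Close` column name is an assumption (typical of yfinance exports), and a 3-day window stands in for the 20-day one to keep the example short.

```python
import pandas as pd

# Toy closing prices for illustration only
toy_prices = pd.DataFrame({
    "date": pd.date_range("2020-06-01", periods=5, freq="D"),
    "Close": [100.0, 102.0, 101.0, 103.0, 104.0],
})

# Returns and rolling indicators depend on row order, so sort by date first
toy_prices = toy_prices.sort_values("date").reset_index(drop=True)

# Simple daily return: (close_t / close_{t-1}) - 1; the first row has no predecessor
toy_prices["daily_return"] = toy_prices["Close"].pct_change()

# Simple moving average over a 3-row window (use window=20 for SMA_20)
toy_prices["SMA_3"] = toy_prices["Close"].rolling(window=3).mean()
print(toy_prices)
```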

## 5. Merge Sentiment and Price Data

Use the provided merge function to align sentiment and price data by date.

In [None]:
# Merge on 'date'
merged_df = merge_sentiment_price(daily_sentiment, price_df, date_col='date')

# Preview merged data
merged_df.head()
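
`merge_sentiment_price` lives in the project's `correlation_analysis` module and its body is not shown here. One plausible equivalent, offered as an assumption rather than the actual implementation, is an inner join on the date column so that only days present in both tables survive. `merge_on_date_sketch` and the toy frames below are hypothetical.

```python
import pandas as pd


def merge_on_date_sketch(sentiment_df, price_df, date_col="date"):
    """Inner-join daily sentiment and price rows on a shared date column."""
    merged = pd.merge(
        sentiment_df,
        price_df,
        on=date_col,
        how="inner",              # keep only dates present in both frames
        validate="one_to_one",    # raise if either side has duplicate dates
    )
    return merged.sort_values(date_col).reset_index(drop=True)


# Toy inputs: only 2020-06-01 and 2020-06-02 appear in both frames
sent = pd.DataFrame({
    "date": pd.to_datetime(["2020-06-01", "2020-06-02", "2020-06-06"]),
    "sentiment_compound": [0.1, -0.2, 0.3],
})
px = pd.DataFrame({
    "date": pd.to_datetime(["2020-06-01", "2020-06-02", "2020-06-03"]),
    "daily_return": [0.01, -0.005, 0.002],
})
toy_merged = merge_on_date_sketch(sent, px)
print(toy_merged)
```

Comparing `len(merged_df)` with the lengths of the two inputs is a quick way to see how many days were dropped because they had news but no trading, or trading but no news.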

## 6. Compute Correlations

Choose which sentiment and price columns to correlate.  
We will compute the Pearson correlation matrix.

In [None]:
# Define columns to correlate
sentiment_cols = ['sentiment_neg', 'sentiment_neu', 'sentiment_pos', 'sentiment_compound']
# Update these with your actual indicator/return column names
price_cols = ['daily_return', 'SMA_20', 'EMA_20', 'RSI_14', 'MACD']

# Compute Pearson correlation
corr_matrix = compute_correlations(merged_df, sentiment_cols, price_cols, method='pearson')
corr_matrix
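
For readers without access to the module, the following sketch shows one way a sentiment-by-price correlation block could be computed with plain pandas. It is an assumption about what `compute_correlations` might do, not its actual code; `cross_correlation_sketch` is a hypothetical name and the data are toy values.

```python
import pandas as pd


def cross_correlation_sketch(df, row_cols, col_cols, method="pearson"):
    """Correlate every column in row_cols with every column in col_cols."""
    full = df[row_cols + col_cols].corr(method=method)
    # Keep only the rectangular block: sentiment rows by price columns
    return full.loc[row_cols, col_cols]


# Toy data: daily_return is an exact multiple of sentiment_compound
toy = pd.DataFrame({
    "sentiment_compound": [0.1, 0.4, -0.3, 0.2, -0.1],
    "daily_return": [0.002, 0.008, -0.006, 0.004, -0.002],
    "RSI_14": [55.0, 60.0, 40.0, 52.0, 47.0],
})
toy_corr = cross_correlation_sketch(
    toy, ["sentiment_compound"], ["daily_return", "RSI_14"]
)
print(toy_corr)
```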

## 7. Visualize Correlations

Plot a heatmap for easy interpretation of the correlation matrix.

In [None]:
plot_correlation_heatmap(corr_matrix, title='Sentiment vs Price Indicator Correlations (Pearson)')

## 8. (Optional) Spearman Correlation

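Spearman correlation works on ranks, so it captures monotonic relationships even when they are not linear, and it is less sensitive to outliers than Pearson. It does not detect non-monotonic patterns (e.g., U-shaped ones).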

In [None]:
# Compute Spearman correlation
corr_matrix_spearman = compute_correlations(merged_df, sentiment_cols, price_cols, method='spearman')
display(corr_matrix_spearman)

# Plot
plot_correlation_heatmap(corr_matrix_spearman, title='Sentiment vs Price Indicator Correlations (Spearman)')
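
The difference between the two coefficients is easiest to see on synthetic data. In this sketch (toy series, not market data), `y` rises strictly with `x` but far from linearly. Spearman is computed as the Pearson correlation of the ranks, which is its definition.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly increasing but strongly non-linear relationship
x = pd.Series(np.linspace(0.0, 5.0, 50))
y = np.exp(x)

pearson = x.corr(y)                   # measures linear association
spearman = x.rank().corr(y.rank())    # Pearson on ranks = Spearman

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman reports a perfect monotonic relationship here, while Pearson is noticeably lower because the points do not lie on a straight line.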

## 9. Lagged Correlation Analysis

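Lagged correlation helps determine whether sentiment on day *t* is correlated with price movements on day *t+1*, *t+2*, etc.
We shift the sentiment columns down by `lag_days` rows, so that each row pairs earlier sentiment with that row's price indicators. A "day" here is one row of the merged data, so gaps such as weekends are skipped.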

In [None]:
# Number of days to lag (e.g., 1 for next day, 2 for two days ahead)
lag_days = 1

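# Rows must be in date order for the shift to mean "earlier"
merged_df = merged_df.sort_values('date').reset_index(drop=True)

# Create lagged sentiment columns: shift(lag_days) moves values down, so each row
# holds sentiment from lag_days rows earlier (sentiment leads price)
for col in sentiment_cols:
    merged_df[f'{col}_lag{lag_days}'] = merged_df[col].shift(lag_days)

# Drop rows with NaN due to shifting
lagged_df = merged_df.dropna(subset=[f'{col}_lag{lag_days}' for col in sentiment_cols])

# Preview (display() is needed because this is not the last expression in the cell)
display(lagged_df[[f'{col}_lag{lag_days}' for col in sentiment_cols] + price_cols].head())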

# Compute correlation between lagged sentiment and price indicators
lagged_sentiment_cols = [f'{col}_lag{lag_days}' for col in sentiment_cols]
lagged_corr_matrix = compute_correlations(lagged_df, lagged_sentiment_cols, price_cols, method='pearson')
lagged_corr_matrix

In [None]:
plot_correlation_heatmap(
    lagged_corr_matrix,
    title=f'Lagged Sentiment (t-{lag_days}) vs Price Indicator Correlations (Pearson)'
)

### Try Multiple Lags

You can repeat the above for different values of `lag_days` (e.g., 1, 2, 3) to see how the strength of the relationship changes with the lag.
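
Repeating the steps by hand gets tedious, so a loop over lags is one option. The sketch below is a minimal illustration with a hypothetical helper, `lagged_correlations`, on toy data built so that the return on each row depends on the previous row's sentiment. It assumes rows are already sorted by date.

```python
import pandas as pd


def lagged_correlations(df, sentiment_col, price_col, lags=(0, 1, 2, 3), method="pearson"):
    """Correlate sentiment from `lag` rows earlier with the current price column."""
    out = {}
    for lag in lags:
        shifted = df[sentiment_col].shift(lag)  # row t now holds sentiment from t-lag
        out[lag] = shifted.corr(df[price_col], method=method)  # NaN pairs are ignored
    return pd.Series(out, name="correlation")


# Toy data: the return equals 1% of the previous row's sentiment
toy = pd.DataFrame({"sentiment_compound": [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 0.4]})
toy["daily_return"] = toy["sentiment_compound"].shift(1) * 0.01

lag_table = lagged_correlations(toy, "sentiment_compound", "daily_return")
print(lag_table)
```

On the toy data the lag-1 entry stands out, as constructed. With real data, keep in mind that scanning many lags and columns increases the chance of finding a large correlation by luck.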

## 10. Granger Causality Test

Granger causality tests whether past values of sentiment help predict future values of a price indicator, beyond what past values of the indicator itself can predict.

We use the `statsmodels` library for this test.

In [None]:
from statsmodels.tsa.stattools import grangercausalitytests

# Choose a price indicator and a sentiment column (not lagged)
target_price_col = 'daily_return'  # Update as needed
sentiment_col = 'sentiment_compound'

# Prepare the data for Granger causality test
# The test expects a 2D array: [target, predictor]
gc_df = merged_df[[target_price_col, sentiment_col]].dropna()

# Maximum number of lags to test (e.g., 1 to 3 days)
max_lag = 3

# Run Granger causality test
print(f'Granger causality test: Does {sentiment_col} Granger-cause {target_price_col}?')
granger_result = grangercausalitytests(gc_df, maxlag=max_lag, verbose=True)

### Interpreting Granger Causality Results

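- For each lag *k*, look at the p-value for the F-test (`ssr_ftest`). The test at lag *k* includes all lags from 1 to *k* jointly.
- The null hypothesis is that sentiment (the second column) does **not** Granger-cause the price indicator (the first column). If the p-value is **less than 0.05**, there is evidence that past sentiment improves the prediction when up to *k* lags are included. This indicates predictive value, not proof of causation.
- The test assumes both series are stationary. Returns are usually closer to this than price levels or moving averages.
- Try different sentiment and price columns for a comprehensive analysis.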
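
`grangercausalitytests` returns a dictionary keyed by lag. Each value is a tuple whose first element maps test names to result tuples, with the p-value in second position. A small helper can collect those p-values into one place; `granger_pvalues` below is a hypothetical name and is not part of statsmodels or the project code.

```python
def granger_pvalues(result, test="ssr_ftest"):
    """Return {lag: p-value} from the dict produced by grangercausalitytests."""
    # result[lag] is a tuple: (dict of test results, list of fitted models).
    # Each test entry is a tuple whose second element is the p-value.
    return {lag: tests[test][1] for lag, (tests, _models) in result.items()}
```

With the cell above already run, `granger_pvalues(granger_result)` gives one p-value per tested lag, which is easier to scan than the printed report.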

---

## Summary

- We explored both contemporaneous and lagged correlations between news sentiment and stock price indicators.
- We used Granger causality to test if sentiment helps predict future price movements.
- Next, interpret your findings in the context of your business question and consider further modeling (e.g., regression, machine learning) if predictive relationships are found.

---