# EDA: Sentiment and Return Analysis

This notebook performs exploratory data analysis (EDA) on engineered features that relate news sentiment to next-day stock returns. It covers descriptive statistics, visualizations, correlation coefficients, and a rolling-correlation analysis.


## 1. Import Required Libraries

Import libraries for data manipulation (`pandas`, `numpy`), visualization (`matplotlib`, `seaborn`), and correlation computation (`scipy.stats`).


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr, spearmanr

## 2. Set Plotting Style

Configure visual styles for consistent and clean plots.


In [None]:
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
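
The cells in this notebook write to `../reports`, `../reports/figures`, and `../data/processed`. Both `plt.savefig` and `DataFrame.to_csv` raise an error if the target directory does not exist, so it can help to create the folders up front. The following is a minimal sketch; `ensure_dirs` is a hypothetical helper, and the demo runs in a temporary directory so it has no side effects.

```python
from pathlib import Path
import tempfile


def ensure_dirs(base: Path, subdirs) -> list:
    """Create each sub-directory under base (hypothetical helper) and return the paths."""
    created = []
    for sub in subdirs:
        target = base / sub
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created


# Demonstrated in a temporary directory; in the notebook, base would be Path("..")
with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_dirs(Path(tmp), ["reports/figures", "data/processed"])
    all_exist = all(d.is_dir() for d in dirs)

print(all_exist)
```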

## 3. Load Engineered Dataset

Load the final engineered features CSV file, parsing the `Date` column as a datetime object.


In [None]:
data_frame = pd.read_csv("../data/final/engineered_features.csv", parse_dates=["Date"])
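
Before plotting, it may be worth confirming that the loaded frame has the columns the later cells rely on. The sketch below assumes the column names `Date`, `sentiment`, and `next_day_return` (taken from the later cells, not verified against the CSV header); `missing_columns` is a hypothetical helper, and the sample frame is synthetic.

```python
import pandas as pd

# Column names assumed from the later cells; adjust to the real CSV header.
REQUIRED_COLUMNS = ["Date", "sentiment", "next_day_return"]


def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Tiny stand-in frame with synthetic values, deliberately missing one column
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
    "sentiment": [0.1, -0.2],
})

absent = missing_columns(sample)
print(absent)  # ['next_day_return']
```

In the notebook itself, the same check would be called as `missing_columns(data_frame)`.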

## 4. Descriptive Statistics

Generate and export summary statistics of the dataset to understand basic properties like mean, standard deviation, and quartiles.


In [None]:
describe_data_frame = data_frame.describe(percentiles=[.25, .5, .75])
describe_data_frame.to_csv("../reports/descriptive_stats.csv")
describe_data_frame

## 5. Scatter Plot: Sentiment vs. Next Day Return

Create a scatter plot to visualize the relationship between average sentiment and next day return.


In [None]:
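# Use the DataFrame loaded in section 3 (`merged` was never defined).
# Column names are assumed to match the ones used in the later cells.
sns.scatterplot(data=data_frame, x="sentiment", y="next_day_return")
plt.title("Sentiment vs. Next Day Return")
# Save alongside the other figures in reports/figures
plt.savefig("../reports/figures/scatter_sentiment_return.png")
plt.clf()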

## 6. Time Series Plot: Sentiment and Return Over Time

Plot sentiment and next day return values over time using a dual-axis line chart.


In [None]:
fig, ax1 = plt.subplots()

# Left axis: sentiment (uses the DataFrame loaded in section 3)
ax1.set_xlabel('Date')
ax1.set_ylabel('Sentiment', color='tab:blue')
ax1.plot(data_frame['Date'], data_frame['sentiment'], color='tab:blue', label='Sentiment')
ax1.tick_params(axis='y', labelcolor='tab:blue')

# Right axis: next-day return, sharing the same x-axis
ax2 = ax1.twinx()
ax2.set_ylabel('Next Day Return', color='tab:red')
ax2.plot(data_frame['Date'], data_frame['next_day_return'], color='tab:red', alpha=0.6, label='Return')
ax2.tick_params(axis='y', labelcolor='tab:red')

plt.title("Sentiment & Next Day Return over Time")
plt.savefig("../reports/figures/sentiment_return_over_time.png")
plt.clf()
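
The dual-axis cell above passes `label=` to both `plot` calls but never draws a legend. Because the two lines live on different axes, calling `legend()` on one axis would list only that axis's line. One common approach, sketched here on synthetic data (the values are made up for illustration), is to collect the line handles from both axes and pass them to a single legend.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in series, not the real dataset
dates = pd.date_range("2024-01-01", periods=5, freq="D")
sentiment = np.array([0.1, 0.3, -0.2, 0.0, 0.4])
returns = np.array([0.01, -0.02, 0.005, 0.0, 0.015])

fig, ax1 = plt.subplots()
(line1,) = ax1.plot(dates, sentiment, color="tab:blue", label="Sentiment")

ax2 = ax1.twinx()
(line2,) = ax2.plot(dates, returns, color="tab:red", alpha=0.6, label="Return")

# Merge the handles from both axes into one legend
handles = [line1, line2]
labels = [h.get_label() for h in handles]
legend = ax1.legend(handles, labels, loc="upper left")

fig.autofmt_xdate()  # tilt the date tick labels so they do not overlap
plt.close(fig)
```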

## 7. Distribution Analysis

Plot histograms with KDE to visualize the distributions of sentiment and next day return.


In [None]:

# Histogram + KDE of sentiment (uses the DataFrame loaded in section 3)
sns.histplot(data_frame['sentiment'], kde=True, bins=30)
plt.title("Sentiment Distribution")
plt.savefig("../reports/figures/sentiment_distribution.png")
plt.clf()

# Histogram + KDE of next-day return
sns.histplot(data_frame['next_day_return'], kde=True, bins=30)
plt.title("Next Day Return Distribution")
plt.savefig("../reports/figures/return_distribution.png")
plt.clf()
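
Histograms show the shape of a distribution visually; skewness and excess kurtosis summarize the same shape as numbers. The sketch below uses pandas' `Series.skew()` and `Series.kurt()` on two small synthetic series (illustrative values, not taken from the dataset). In the notebook, the same calls could be applied to the sentiment and return columns.

```python
import pandas as pd

# Synthetic examples: one symmetric series and one with a long right tail
symmetric = pd.Series([-0.02, -0.01, 0.0, 0.01, 0.02])
right_tailed = pd.Series([0.0, 0.0, 0.0, 0.0, 0.10])

summary = pd.DataFrame(
    {
        # skew() is about 0 for symmetric data, positive for a right tail
        "skew": [symmetric.skew(), right_tailed.skew()],
        # kurt() reports excess kurtosis (a normal distribution scores 0)
        "excess_kurtosis": [symmetric.kurt(), right_tailed.kurt()],
    },
    index=["symmetric", "right_tailed"],
)

print(summary.round(3))
```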

## 8. Correlation Coefficients

Compute Pearson and Spearman correlation between sentiment and next day return.


In [None]:
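# Drop rows with missing values first; NaN would otherwise propagate into the coefficients
pairs = data_frame[['sentiment', 'next_day_return']].dropna()

pearson_corr, _ = pearsonr(pairs['sentiment'], pairs['next_day_return'])
spearman_corr, _ = spearmanr(pairs['sentiment'], pairs['next_day_return'])

print(f"Pearson correlation: {pearson_corr:.3f}")
print(f"Spearman correlation: {spearman_corr:.3f}")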
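
The second value returned by `pearsonr` and `spearmanr`, discarded above as `_`, is a p-value for the null hypothesis of no correlation. Keeping it makes it easier to judge whether a small coefficient is distinguishable from noise. A minimal sketch on synthetic data follows; the trailing NaN is an assumption about what a next-day return column might look like on its last row.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Synthetic stand-in data; the last return is NaN on purpose
df = pd.DataFrame({
    "sentiment": [0.1, 0.4, -0.3, 0.2, -0.1, 0.5],
    "next_day_return": [0.01, 0.02, -0.015, 0.005, -0.002, np.nan],
})

# Keep only rows where both values are present
clean = df[["sentiment", "next_day_return"]].dropna()

pearson_r, pearson_p = pearsonr(clean["sentiment"], clean["next_day_return"])
spearman_r, spearman_p = spearmanr(clean["sentiment"], clean["next_day_return"])

print(f"Pearson:  r={pearson_r:.3f}, p={pearson_p:.3f}")
print(f"Spearman: rho={spearman_r:.3f}, p={spearman_p:.3f}")
```

With only a handful of points, as in this toy example, the p-values are not very informative; they become more useful on the full dataset.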

## 9. Rolling Correlation (30-Day Window)

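Calculate and plot a rolling correlation between sentiment and next-day return to see how the relationship changes over time. Note that `rolling(30)` uses a window of 30 rows, which corresponds to 30 days only if the dataset has exactly one row per day.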


In [None]:
# Rolling correlation over a 30-row window (uses the DataFrame loaded in section 3)
data_frame['rolling_corr'] = data_frame['sentiment'].rolling(30).corr(data_frame['next_day_return'])

plt.plot(data_frame['Date'], data_frame['rolling_corr'])
plt.title("30-Day Rolling Correlation: Sentiment vs Return")
plt.ylabel("Correlation")
plt.xlabel("Date")
plt.savefig("../reports/figures/rolling_corr_sentiment_return.png")
plt.clf()
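
With an integer window, pandas sets `min_periods` to the window size by default, so the first 29 values of a 30-row rolling correlation are NaN. This is why the plotted line starts later than the data. The sketch below illustrates this on random synthetic data and checks that the first defined value matches an ordinary correlation over the first 30 rows.

```python
import numpy as np
import pandas as pd

# Synthetic, uncorrelated series purely for illustration
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "sentiment": rng.normal(size=n),
    "next_day_return": rng.normal(size=n),
})

window = 30
rolling_corr = df["sentiment"].rolling(window).corr(df["next_day_return"])

# The first window - 1 entries have too few observations and are NaN
warmup_nans = int(rolling_corr.isna().sum())
print(warmup_nans)  # 29

# The first defined value equals the plain correlation of rows 0..29
manual = df["sentiment"].iloc[:window].corr(df["next_day_return"].iloc[:window])
print(np.isclose(rolling_corr.iloc[window - 1], manual))
```

Passing a smaller `min_periods` to `rolling` would produce values earlier, at the cost of noisier estimates from short windows.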

## 10. Save Processed Data

Export the dataset, including the rolling-correlation column, to a new CSV file for further analysis or modeling.


In [None]:
# Write the DataFrame loaded in section 3 (`merged` was never defined)
data_frame.to_csv("../data/processed/merged_with_corr.csv", index=False)

print("EDA complete ✅ Figures saved to the reports/figures folder.")