# 01: Exploratory Data Analysis for Volatility Forecasting

This notebook performs comprehensive exploratory data analysis to understand:
- Time series properties of financial returns
- Autocorrelation structure (ACF, PACF)
- Optimal lag selection for forecasting
- Stylized facts (volatility clustering, leverage effects, heavy tails)
- Data quality and preprocessing needs

**Key Output:** Identification of important lags: [1, 2, 6, 11, 16]

In [None]:
# Core imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Statistical analysis
from scipy import stats
from statsmodels.tsa.stattools import acf, pacf, adfuller
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Local imports
import sys
sys.path.append('..')
from src.config import *
from src.data.features import garman_klass, realized_vol_from_daily

# Set random seed
set_seeds()

# Plotting style
plt.style.use(PLOT_STYLE)
sns.set_palette('husl')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

print("✓ Environment loaded")

## 1. Data Loading

In [None]:
import yfinance as yf

# Download data
ticker = DEFAULT_TICKER
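df = yf.download(ticker, start=DEFAULT_START, end=DEFAULT_END, progress=False)
# Recent yfinance releases may return MultiIndex columns (field, ticker) even for
# a single ticker; keep only the field level before lower-casing.
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.get_level_values(0)
df.columns = [str(c).lower() for c in df.columns]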

# Calculate returns
df['returns'] = df['close'].pct_change()
df['log_returns'] = np.log(df['close'] / df['close'].shift(1))

# Realized volatility
df['rv'] = realized_vol_from_daily(df)

df = df.dropna()

print(f"Data: {ticker}")
print(f"Period: {df.index[0].date()} to {df.index[-1].date()}")
print(f"Observations: {len(df)}")
print(f"\nColumns: {list(df.columns)}")

In [None]:
# Basic statistics
returns = df['returns'].dropna()

stats_dict = {
    'Mean (daily)': returns.mean(),
    'Mean (annualized %)': returns.mean() * TRADING_DAYS * 100,
    'Std (daily)': returns.std(),
    'Std (annualized %)': returns.std() * np.sqrt(TRADING_DAYS) * 100,
    'Skewness': stats.skew(returns),
    'Excess kurtosis': stats.kurtosis(returns),  # Fisher definition: normal = 0
    'Min': returns.min(),
    'Max': returns.max(),
    # Sharpe ratio computed with a zero risk-free rate
    'Sharpe (annualized)': (returns.mean() / returns.std()) * np.sqrt(TRADING_DAYS)
}

stats_df = pd.DataFrame(stats_dict, index=['Value']).T
print("\n=== RETURN STATISTICS ===")
print(stats_df.to_string())

## 2. Visual Exploration

In [None]:
fig, axes = plt.subplots(4, 1, figsize=(15, 12))

# Price
axes[0].plot(df.index, df['close'], linewidth=1)
axes[0].set_title(f'{ticker} Price', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Price ($)')
axes[0].grid(True, alpha=0.3)

# Returns
axes[1].plot(df.index, df['returns'] * 100, linewidth=0.5, alpha=0.7)
axes[1].axhline(y=0, color='black', linestyle='--', linewidth=0.5)
axes[1].set_title('Daily Returns', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Return (%)')
axes[1].grid(True, alpha=0.3)

# Absolute returns (volatility proxy)
axes[2].plot(df.index, np.abs(df['returns']) * 100, linewidth=0.5, color='orange', alpha=0.7)
axes[2].set_title('Absolute Returns (Volatility Proxy)', fontsize=12, fontweight='bold')
axes[2].set_ylabel('|Return| (%)')
axes[2].grid(True, alpha=0.3)

# Realized volatility
axes[3].plot(df.index, df['rv'] * 100, linewidth=1, color='red')
axes[3].set_title('Realized Volatility (Garman-Klass)', fontsize=12, fontweight='bold')
axes[3].set_ylabel('Volatility (% ann.)')
axes[3].set_xlabel('Date')
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Distribution Analysis

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Histogram
axes[0, 0].hist(returns, bins=50, density=True, alpha=0.7, color='steelblue', edgecolor='black')

# Overlay normal distribution
mu, sigma = returns.mean(), returns.std()
x = np.linspace(returns.min(), returns.max(), 100)
axes[0, 0].plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, label='Normal')
axes[0, 0].set_title('Return Distribution', fontweight='bold')
axes[0, 0].set_xlabel('Return')
axes[0, 0].set_ylabel('Density')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Q-Q plot
stats.probplot(returns, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot vs Normal', fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# Log returns histogram
log_returns = df['log_returns'].dropna()
axes[1, 0].hist(log_returns, bins=50, density=True, alpha=0.7, color='green', edgecolor='black')
mu_log, sigma_log = log_returns.mean(), log_returns.std()
x_log = np.linspace(log_returns.min(), log_returns.max(), 100)
axes[1, 0].plot(x_log, stats.norm.pdf(x_log, mu_log, sigma_log), 'r-', linewidth=2, label='Normal')
axes[1, 0].set_title('Log Return Distribution', fontweight='bold')
axes[1, 0].set_xlabel('Log Return')
axes[1, 0].set_ylabel('Density')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Box plot by year
df_box = df[['returns']].copy()
df_box['year'] = df_box.index.year
df_box.boxplot(column='returns', by='year', ax=axes[1, 1])
axes[1, 1].set_title('Returns by Year', fontweight='bold')
axes[1, 1].set_xlabel('Year')
axes[1, 1].set_ylabel('Return')
axes[1, 1].get_figure().suptitle('')  # Remove auto title

plt.tight_layout()
plt.show()

# Statistical tests
jb_stat, jb_pval = stats.jarque_bera(returns)
print("\n=== NORMALITY TEST ===")
print(f"Jarque-Bera statistic: {jb_stat:.2f}")
print(f"P-value: {jb_pval:.6f}")
if jb_pval < 0.05:
    print(f"✓ Reject normality (excess kurtosis: {stats.kurtosis(returns):.2f}, skewness: {stats.skew(returns):.2f})")
else:
    print("Cannot reject normality")
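
The Jarque-Bera statistic combines skewness and excess kurtosis, so a rejection alone does not say which of the two drives it. The sketch below is an illustration on a synthetic Student-t sample, not on the notebook's data. The helper name `jarque_bera_parts` is hypothetical, and the value 5.99 is the 5% critical value of a chi-square distribution with 2 degrees of freedom.

```python
import numpy as np


def jarque_bera_parts(x):
    """Return (jb, skew_part, kurt_part), where jb = skew_part + kurt_part.

    Uses JB = n/6 * (S^2 + K^2 / 4) with S the sample skewness and
    K the sample excess kurtosis (both from biased central moments).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    skew = np.mean(d ** 3) / m2 ** 1.5
    excess_kurt = np.mean(d ** 4) / m2 ** 2 - 3.0
    skew_part = n / 6.0 * skew ** 2
    kurt_part = n / 24.0 * excess_kurt ** 2
    return skew_part + kurt_part, skew_part, kurt_part


# Synthetic heavy-tailed sample (assumption: Student-t with 4 degrees of freedom)
rng = np.random.default_rng(7)
heavy = rng.standard_t(df=4, size=5000)

jb, skew_part, kurt_part = jarque_bera_parts(heavy)
CHI2_2_CRIT_5PCT = 5.99  # 5% critical value, chi-square with 2 d.o.f.

print(f"JB = {jb:.1f} (skewness part {skew_part:.1f}, kurtosis part {kurt_part:.1f})")
print("Reject normality" if jb > CHI2_2_CRIT_5PCT else "Cannot reject normality")
```

Inspecting the two parts separately indicates whether the rejection is mainly a tail effect or an asymmetry effect.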

## 4. Autocorrelation Analysis (ACF & PACF)

### Key for Lag Selection

We'll examine:
- **ACF (Autocorrelation)** - Linear dependence at different lags
- **PACF (Partial Autocorrelation)** - Direct effect of each lag after controlling for all shorter lags
- **Squared returns** - Volatility clustering
- **Absolute returns** - Alternative volatility measure
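
As a rough illustration of why squared and absolute returns matter, the following sketch simulates a GARCH(1,1) series and compares lag-1 autocorrelations. It uses simulated data only. The parameters `omega`, `alpha`, `beta` are made-up values, and `simulate_garch11` and `sample_acf` are hypothetical helpers written for this example rather than `statsmodels` functions.

```python
import numpy as np


def simulate_garch11(n, omega=1e-6, alpha=0.15, beta=0.80, seed=0):
    """Simulate n returns from a GARCH(1,1) process (illustrative parameters)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    r = np.empty(n)
    var = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    for t in range(n):
        r[t] = np.sqrt(var) * z[t]
        var = omega + alpha * r[t] ** 2 + beta * var
    return r


def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])


sim_returns = simulate_garch11(5000)
acf_ret = sample_acf(sim_returns, 10)
acf_sq = sample_acf(sim_returns ** 2, 10)
acf_abs = sample_acf(np.abs(sim_returns), 10)

# Approximate 95% band under the white-noise null
band = 1.96 / np.sqrt(len(sim_returns))
print(f"white-noise band: +/-{band:.3f}")
print(f"lag-1 ACF  returns: {acf_ret[1]:+.3f}  "
      f"squared: {acf_sq[1]:+.3f}  absolute: {acf_abs[1]:+.3f}")
```

In a series with volatility clustering, the raw returns stay close to the band while the squared and absolute returns sit well outside it.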

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

max_lags = 40

# ACF of returns
plot_acf(returns, lags=max_lags, ax=axes[0, 0], alpha=0.05)
axes[0, 0].set_title('ACF: Returns', fontweight='bold', fontsize=12)
axes[0, 0].set_xlabel('Lag')

# PACF of returns
plot_pacf(returns, lags=max_lags, ax=axes[0, 1], alpha=0.05, method='ywm')
axes[0, 1].set_title('PACF: Returns', fontweight='bold', fontsize=12)
axes[0, 1].set_xlabel('Lag')

# ACF of squared returns (volatility clustering)
returns_sq = returns ** 2
plot_acf(returns_sq, lags=max_lags, ax=axes[1, 0], alpha=0.05)
axes[1, 0].set_title('ACF: Squared Returns (Volatility Clustering)', fontweight='bold', fontsize=12)
axes[1, 0].set_xlabel('Lag')

# PACF of squared returns
plot_pacf(returns_sq, lags=max_lags, ax=axes[1, 1], alpha=0.05, method='ywm')
axes[1, 1].set_title('PACF: Squared Returns', fontweight='bold', fontsize=12)
axes[1, 1].set_xlabel('Lag')

plt.tight_layout()
plt.show()

In [None]:
# Compute ACF and PACF values to identify significant lags
acf_vals = acf(returns, nlags=max_lags, fft=False)
pacf_vals = pacf(returns, nlags=max_lags, method='ywm')

# Approximate 95% band under the white-noise null: +/- 1.96 / sqrt(n)
conf_interval = 1.96 / np.sqrt(len(returns))

# Find significant lags (beyond confidence interval)
significant_acf = np.where(np.abs(acf_vals[1:]) > conf_interval)[0] + 1
significant_pacf = np.where(np.abs(pacf_vals[1:]) > conf_interval)[0] + 1

print("\n=== SIGNIFICANT LAGS ===")
print(f"\nACF significant lags (|corr| > {conf_interval:.3f}):")
print(f"  {significant_acf[:20]}")
print(f"\nPACF significant lags:")
print(f"  {significant_pacf[:20]}")

# Top PACF lags by absolute value
top_pacf_indices = np.argsort(np.abs(pacf_vals[1:]))[::-1][:10] + 1
top_pacf_values = pacf_vals[top_pacf_indices]

print(f"\n=== TOP 10 PACF LAGS (by absolute value) ===")
for lag, val in zip(top_pacf_indices, top_pacf_values):
    print(f"  Lag {lag:2d}: {val:+.4f}")

## 5. Volatility Clustering Analysis

In [None]:
# Ljung-Box test for autocorrelation
lb_returns = acorr_ljungbox(returns, lags=[10, 20, 30], return_df=True)
lb_squared = acorr_ljungbox(returns_sq, lags=[10, 20, 30], return_df=True)

print("=== LJUNG-BOX TEST ===")
print("\nReturns:")
print(lb_returns[['lb_stat', 'lb_pvalue']])
print("\nSquared Returns (Volatility Clustering):")
print(lb_squared[['lb_stat', 'lb_pvalue']])

if lb_squared['lb_pvalue'].iloc[0] < 0.05:
    print("\n✓ Volatility clustering detected (p < 0.05)")
    print("  → GARCH-type models appropriate")
else:
    print("\n✗ No significant volatility clustering")
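
For reference, the Ljung-Box statistic can be written out by hand as Q = n(n+2) Σ ρ_k² / (n−k) over lags k = 1..h. The sketch below applies it to a simulated GARCH(1,1) series with made-up parameters. The helpers `sample_acf` and `ljung_box_q` are hypothetical names for this example, and 18.31 is the 5% critical value of a chi-square distribution with 10 degrees of freedom.

```python
import numpy as np


def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])


def ljung_box_q(x, nlags):
    """Ljung-Box Q statistic using lags 1..nlags."""
    n = len(x)
    rho = sample_acf(x, nlags)[1:]
    lags = np.arange(1, nlags + 1)
    return n * (n + 2) * np.sum(rho ** 2 / (n - lags))


# Simulated GARCH(1,1) returns (assumption: illustrative parameters only)
rng = np.random.default_rng(1)
n_obs = 5000
omega, alpha, beta = 1e-6, 0.15, 0.80
sim_r = np.empty(n_obs)
var = omega / (1.0 - alpha - beta)
for t in range(n_obs):
    sim_r[t] = np.sqrt(var) * rng.standard_normal()
    var = omega + alpha * sim_r[t] ** 2 + beta * var

q_ret = ljung_box_q(sim_r, 10)
q_sq = ljung_box_q(sim_r ** 2, 10)
CHI2_10_CRIT_5PCT = 18.31  # 5% critical value, chi-square with 10 d.o.f.

print(f"Q(10) returns: {q_ret:.1f}, squared returns: {q_sq:.1f}")
print("Clustering detected" if q_sq > CHI2_10_CRIT_5PCT else "No clustering detected")
```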

## 6. Leverage Effect Analysis

In [None]:
# Correlation between returns and future volatility
max_lag = 20
lag_corr = []

for lag in range(1, max_lag + 1):
    corr = df['returns'].corr(df['rv'].shift(-lag))
    lag_corr.append(corr)

fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(range(1, max_lag + 1), lag_corr, alpha=0.7, color='steelblue')
ax.axhline(y=0, color='black', linestyle='--', linewidth=0.5)
ax.set_xlabel('Lag (days)')
ax.set_ylabel('Correlation')
ax.set_title('Leverage Effect: Correlation(Return_t, Vol_t+lag)', fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\n=== LEVERAGE EFFECT ===")
print(f"Corr(return_t, vol_t+1): {lag_corr[0]:.4f}")
if lag_corr[0] < -0.1:
    print("✓ Strong leverage effect detected")
    print("  → EGARCH recommended (captures asymmetry)")
elif lag_corr[0] < -0.05:
    print("✓ Moderate leverage effect")
else:
    print("✗ Weak leverage effect")
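
The direction of the shift is easy to get wrong, so here is a small sketch of the alignment used above: `shift(-lag)` moves future values back to the current row, so row t of `rv.shift(-lag)` holds the value from t+lag. The toy series below are synthetic, and the relation between `toy_returns` and `toy_vol` is an assumption built in for illustration.

```python
import numpy as np
import pandas as pd

# shift(-1): the value at position t is the original value at t+1
s = pd.Series([10, 20, 30, 40])
shifted = s.shift(-1)
print(shifted.tolist())  # [20.0, 30.0, 40.0, nan]

# Toy data with a built-in leverage effect (assumed relationship):
# volatility at t+1 is high after a negative return at t.
rng = np.random.default_rng(3)
toy_returns = pd.Series(rng.standard_normal(1000) * 0.01)
toy_vol = (0.2 * np.exp(-50.0 * toy_returns)).shift(1)

corr_same_day = toy_returns.corr(toy_vol)            # return_t vs vol_t
corr_next_day = toy_returns.corr(toy_vol.shift(-1))  # return_t vs vol_t+1

print(f"Corr(return_t, vol_t):   {corr_same_day:+.3f}")
print(f"Corr(return_t, vol_t+1): {corr_next_day:+.3f}")
```

With this construction only the next-day correlation is strongly negative, which is the pattern the bar chart above is meant to reveal.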

## 7. Lag Selection for Forecasting Models

Based on PACF analysis and domain knowledge, we select lags that:
1. Show significant partial correlation
2. Are practically meaningful (1-day, weekly, bi-weekly)
3. Avoid multicollinearity issues
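
Criterion 1 can be illustrated on a series whose true lag structure is known. The sketch below simulates an AR(2) process (the coefficients 0.5 and 0.3 are arbitrary) and computes partial autocorrelations with a hand-written Durbin-Levinson recursion. The helper names are hypothetical, and the result should be close to, but not necessarily identical to, what `statsmodels`' `pacf` returns.

```python
import numpy as np


def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])


def pacf_durbin_levinson(x, nlags):
    """Partial autocorrelations for lags 0..nlags (nlags >= 1)."""
    rho = sample_acf(x, nlags)
    out = np.zeros(nlags + 1)
    out[0] = 1.0
    out[1] = rho[1]
    phi_prev = np.array([rho[1]])  # coefficients of the fitted AR(1)
    for k in range(2, nlags + 1):
        num = rho[k] - np.dot(phi_prev, rho[k - 1:0:-1])
        den = 1.0 - np.dot(phi_prev, rho[1:k])
        phi_kk = num / den
        # Update AR(k) coefficients from the AR(k-1) ones
        phi_prev = np.append(phi_prev - phi_kk * phi_prev[::-1], phi_kk)
        out[k] = phi_kk
    return out


# Simulated AR(2): x_t = 0.5 x_{t-1} + 0.3 x_{t-2} + noise (illustrative only)
rng = np.random.default_rng(42)
n_obs = 5000
eps = rng.standard_normal(n_obs)
x = np.zeros(n_obs)
for t in range(2, n_obs):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + eps[t]

pacf_sim = pacf_durbin_levinson(x, 20)
band = 1.96 / np.sqrt(n_obs)  # approximate 95% band under the white-noise null
significant = [k for k in range(1, 21) if abs(pacf_sim[k]) > band]

print(f"PACF lag 1: {pacf_sim[1]:+.3f}, lag 2: {pacf_sim[2]:+.3f}")
print(f"Lags outside +/-{band:.3f}: {significant}")
```

Lags 1 and 2 stand out clearly here. A few higher lags may also cross the band by chance, which is one reason the PACF evidence is combined with the other two criteria.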

In [None]:
# Analyze lagged return predictiveness
from sklearn.linear_model import LassoCV

# Create lagged features
max_test_lag = 30
X_lags = pd.DataFrame(index=df.index)

for lag in range(1, max_test_lag + 1):
    X_lags[f'lag_{lag}'] = df['returns'].shift(lag)

# Target: absolute return at time t (volatility proxy), predicted from returns at t-1 ... t-30
y = np.abs(df['returns'])

# Align data
data = pd.concat([X_lags, y.rename('target')], axis=1).dropna()
X = data.drop('target', axis=1)
y = data['target']

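# Lasso for feature selection
# Note: cv=5 uses contiguous, unshuffled K-fold splits, so some training folds lie
# after the validation fold; sklearn.model_selection.TimeSeriesSplit avoids this.
lasso = LassoCV(cv=5, random_state=RANDOM_SEED, max_iter=10000)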
lasso.fit(X, y)

# Get important lags
coefficients = pd.Series(lasso.coef_, index=X.columns)
important_lags_lasso = coefficients[coefficients != 0].abs().sort_values(ascending=False)

print("=== LASSO FEATURE SELECTION ===")
print(f"\nAlpha (selected): {lasso.alpha_:.6f}")
print(f"\nTop 10 most important lags:")
for lag, coef in important_lags_lasso.head(10).items():
    lag_num = int(lag.split('_')[1])
    print(f"  Lag {lag_num:2d}: |coef| = {coef:.6f}")

In [None]:
# Visual comparison of lag importance
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# PACF values
axes[0].bar(range(1, 21), np.abs(pacf_vals[1:21]), alpha=0.7, color='steelblue')
axes[0].axhline(y=conf_interval, color='red', linestyle='--', linewidth=1, label='95% CI')
axes[0].set_xlabel('Lag')
axes[0].set_ylabel('|PACF|')
axes[0].set_title('Lag Importance: PACF', fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')

# Lasso coefficients
lasso_top20 = coefficients.abs().sort_values(ascending=False).head(20)
lag_nums = [int(l.split('_')[1]) for l in lasso_top20.index]
axes[1].bar(lag_nums, lasso_top20.values, alpha=0.7, color='green')
axes[1].set_xlabel('Lag')
axes[1].set_ylabel('|Lasso Coefficient|')
axes[1].set_title('Lag Importance: Lasso', fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 8. Final Lag Selection

Based on:
1. **PACF analysis** - Significant partial correlations
2. **Lasso selection** - Predictive power for volatility
3. **Practical considerations** - Interpretability and meaningful time scales

### Selected Lags: [1, 2, 6, 11, 16]

Rationale:
- **Lag 1, 2**: Recent momentum (1-2 days)
- **Lag 6**: Roughly one trading week (5 trading days plus one)
- **Lag 11, 16**: Roughly two and three trading weeks
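
A minimal sketch of how these lags might be turned into model features, using a toy series rather than the notebook's data. The helper `make_lag_features` is hypothetical. It follows the `lag_<k>` column naming used earlier in this notebook, and the actual feature engineering in `src/` may differ.

```python
import numpy as np
import pandas as pd

SELECTED_LAGS_EXAMPLE = [1, 2, 6, 11, 16]


def make_lag_features(series, lags):
    """Return one column per lag, named lag_<k>, with incomplete rows dropped."""
    features = pd.concat({f"lag_{k}": series.shift(k) for k in lags}, axis=1)
    return features.dropna()


# Toy series 0, 1, ..., 19 on business days, so lagged values are easy to read
toy = pd.Series(
    np.arange(20, dtype=float),
    index=pd.bdate_range("2024-01-01", periods=20),
)
lagged = make_lag_features(toy, SELECTED_LAGS_EXAMPLE)
print(lagged)
```

The largest lag determines how many initial rows are lost: with a maximum lag of 16, the first 16 observations have no complete feature row.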

In [None]:
# Validate selected lags
SELECTED_LAGS = [1, 2, 6, 11, 16]

print("=== VALIDATION OF SELECTED LAGS ===")
print(f"\nSelected lags: {SELECTED_LAGS}")
print("\nPACF values:")
for lag in SELECTED_LAGS:
    print(f"  Lag {lag:2d}: {pacf_vals[lag]:+.4f} {'*' if np.abs(pacf_vals[lag]) > conf_interval else ''}")

print("\nLasso coefficients:")
for lag in SELECTED_LAGS:
    coef = coefficients[f'lag_{lag}']
    print(f"  Lag {lag:2d}: {coef:+.6f}")

print("\n✓ These lags will be used in feature engineering (src/config.py)")

## 9. Stationarity Check

In [None]:
# Augmented Dickey-Fuller test
adf_result = adfuller(returns, autolag='AIC')

print("=== AUGMENTED DICKEY-FULLER TEST ===")
print(f"\nTest for: Returns")
print(f"ADF Statistic: {adf_result[0]:.4f}")
print(f"P-value: {adf_result[1]:.6f}")
print(f"Lags used: {adf_result[2]}")
print("\nCritical values:")
for key, value in adf_result[4].items():
    print(f"  {key}: {value:.3f}")

if adf_result[1] < 0.05:
    print("\n✓ Unit root rejected (p < 0.05): returns look stationary")
    print("  → Can use for modeling")
else:
    print("\n⚠ Returns may not be stationary")
    print("  → Consider differencing or transformations")

## 10. Summary and Conclusions

### Key Findings:

1. **Distribution Properties:**
   - Returns exhibit heavy tails (excess kurtosis)
   - Not normally distributed (Jarque-Bera test rejects normality)
   - A heavy-tailed distribution such as Student-t is likely a better fit than the normal (not fitted in this notebook)

2. **Time Series Properties:**
   - Returns are stationary (ADF test)
   - Weak autocorrelation in returns
   - **Strong autocorrelation in squared/absolute returns → Volatility clustering**

3. **Leverage Effect:**
   - Negative correlation between returns and future volatility
   - Asymmetric models (EGARCH, GJR-GARCH) recommended

4. **Optimal Lags for Forecasting:**
   - **[1, 2, 6, 11, 16]** identified through PACF and Lasso analysis
   - Capture short-term (1-2 days) and medium-term (1-3 weeks) dynamics
   - Balance between information and multicollinearity

### Implications for Modeling:

**GARCH Models:**
- ✓ Volatility clustering confirms GARCH is appropriate
- ✓ Use EGARCH to capture leverage effects
- ✓ Consider Student-t distribution for heavy tails

**LSTM/Deep Learning:**
- ✓ Use lags [1, 2, 6, 11, 16] as key features
- ✓ 30-day sequence length (LSTM_SEQ_LEN) captures relevant history
- ✓ Asymmetry and volatility clustering hint at non-linear dynamics that ML models may capture

### Next Steps:

1. **Notebook 02:** Implement GARCH/EGARCH baselines
2. **Notebook 03:** Train LSTM with selected lags
3. **Notebook 04:** Compare forecasts and backtests

In [None]:
# Collect key findings in a dictionary for reference (kept in memory only)
findings = {
    'ticker': ticker,
    'period': f"{df.index[0].date()} to {df.index[-1].date()}",
    'n_observations': len(df),
    'selected_lags': SELECTED_LAGS,
    'volatility_clustering': lb_squared['lb_pvalue'].iloc[0] < 0.05,
    'leverage_effect': lag_corr[0],
    'stationary': adf_result[1] < 0.05,
    'heavy_tails': bool(jb_pval < 0.05 and stats.kurtosis(returns) > 0),
    'mean_return_annual': returns.mean() * TRADING_DAYS * 100,
    'volatility_annual': returns.std() * np.sqrt(TRADING_DAYS) * 100
}

print("\n=== EXPLORATORY ANALYSIS COMPLETE ===")
print("\nKey findings stored in the in-memory 'findings' dictionary")
print("\nProceed to:")
print("  - notebooks/02_garch_baselines.ipynb for GARCH models")
print("  - notebooks/03_lstm_transformer.ipynb for deep learning")