# 01 — Data Exploration

**MarketPulse Phase 1**

This notebook explores the raw financial data we use as inputs to the model.  
We will:
1. Fetch OHLCV data for multiple stocks using the project's own data pipeline
2. Inspect data quality (missing values, date gaps)
3. Visualize price history, volume, and return distributions
4. Compare tickers across sectors (normalized prices, summary statistics, correlations)

In [None]:
import sys, os
sys.path.insert(0, os.path.abspath('..'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from datetime import datetime, timedelta

from src.data.market_config import load_market_config
from src.data.fetcher import YFinanceFetcher
from src.data.preprocessing import preprocess_ohlcv

sns.set_theme(style='whitegrid')
%matplotlib inline

## 1. Load Configuration & Fetch Data

Our YAML-based config system drives everything. Let's load the US stocks config and fetch 5 years of data for a handful of tickers.

In [None]:
# Load market config from YAML
stock_config = load_market_config('stocks')

print(f"Market:           {stock_config.display_name}")
print(f"Ticker format:    {stock_config.ticker_format}")
print(f"Trading days/yr:  {stock_config.trading_days_per_year}")
print(f"Calendar:         {stock_config.calendar}")
print(f"Use adj close:    {stock_config.use_adjusted_close}")
print(f"Default tickers:  {stock_config.default_tickers}")

In [None]:
# Fetch data for selected tickers
tickers = ['AAPL', 'MSFT', 'TSLA', 'JPM', 'XOM']
fetcher = YFinanceFetcher(market_config=stock_config)

end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (datetime.now() - timedelta(days=5*365)).strftime('%Y-%m-%d')

raw_data = fetcher.fetch_multiple(tickers, start=start_date, end=end_date)

for ticker, df in raw_data.items():
    print(f"{ticker}: {len(df)} rows, {df.index[0].date()} → {df.index[-1].date()}")

## 2. Data Quality Inspection

Before modelling, we check for missing values, duplicate dates, and suspicious values (negative prices, zero volume).
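
The checks listed above can be sketched on a small synthetic frame. This is a minimal illustration, not part of the project pipeline: `find_suspicious_rows` is a hypothetical helper, and the lowercase column names (`open`, `high`, `low`, `close`, `volume`) are an assumption that may not match the raw fetcher output.

```python
import numpy as np
import pandas as pd


def find_suspicious_rows(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per sanity check, indexed like the input.

    Hypothetical helper: assumes lowercase open/high/low/close/volume columns.
    """
    price_cols = ["open", "high", "low", "close"]
    flags = pd.DataFrame(index=ohlcv.index)
    flags["non_positive_price"] = (ohlcv[price_cols] <= 0).any(axis=1)
    flags["zero_volume"] = ohlcv["volume"] == 0
    flags["high_below_low"] = ohlcv["high"] < ohlcv["low"]
    flags["missing_value"] = ohlcv.isnull().any(axis=1)
    flags["duplicate_date"] = ohlcv.index.duplicated()
    return flags


# Toy data: the first row is clean, each later row carries one deliberate defect.
toy = pd.DataFrame(
    {
        "open":   [10.0, 10.5, -1.0, 11.0, np.nan],
        "high":   [10.8, 10.9, 11.2, 10.0, 11.5],
        "low":    [9.9, 10.2, 10.1, 10.7, 11.0],
        "close":  [10.5, 10.4, 11.0, 10.9, 11.3],
        "volume": [1000, 0, 1200, 900, 1100],
    },
    index=pd.bdate_range("2024-01-01", periods=5),
)

flags = find_suspicious_rows(toy)
suspicious_counts = flags.sum()  # number of flagged rows per check
print(suspicious_counts)
```

On real data, the same flags could be applied to `df_raw` once its column names are confirmed.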

In [None]:
# Pick one ticker for detailed inspection
df_raw = raw_data['AAPL'].copy()

print("=== AAPL Raw Data Quality ===")
print(f"Shape: {df_raw.shape}")
print(f"\nMissing values per column:")
print(df_raw.isnull().sum())
print(f"\nDuplicate dates: {df_raw.index.duplicated().sum()}")
print(f"\nDate range: {df_raw.index[0].date()} to {df_raw.index[-1].date()}")
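# pd.bdate_range counts every Mon-Fri, including exchange holidays,
# so coverage slightly below 100% is expected even for clean data
expected_days = len(pd.bdate_range(df_raw.index[0], df_raw.index[-1]))
print(f"Expected business days: {expected_days}")
print(f"Actual trading days: {len(df_raw)}")
print(f"Coverage: {len(df_raw) / expected_days:.1%}")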

print(f"\nDescriptive Statistics:")
df_raw.describe().round(2)

In [None]:
# Check for gaps longer than a holiday weekend.
# Friday -> Tuesday (Monday holiday) is 4 calendar days, so anything above that is flagged.
date_diffs = df_raw.index.to_series().diff().dt.days
large_gaps = date_diffs[date_diffs > 4]  # > 4 calendar days = potential issue

print(f"Gaps > 4 calendar days: {len(large_gaps)}")
if len(large_gaps) > 0:
    print(large_gaps.head(10))

## 3. Preprocessing

Our preprocessing pipeline handles deduplication, forward-filling, adjusted close correction, volume normalization, and returns computation.
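
To make some of these steps concrete, here is a rough sketch of deduplication, forward-filling, and returns computation on toy data. It is not the project's `preprocess_ohlcv`: `sketch_preprocess` is a hypothetical stand-in, the choice to keep the last duplicate and to use simple (not log) returns is an assumption, and adjusted close correction and volume normalization are left out because their exact rules are not described here.

```python
import numpy as np
import pandas as pd


def sketch_preprocess(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only; NOT the project's preprocess_ohlcv."""
    # 1. Deduplicate: keep the last row seen for each date, then sort by date.
    out = ohlcv[~ohlcv.index.duplicated(keep="last")].sort_index()
    # 2. Forward-fill missing values from the previous trading day.
    out = out.ffill()
    # 3. Simple daily returns from the close column (assumed definition).
    out["returns"] = out["close"].pct_change()
    return out


toy = pd.DataFrame(
    {
        "close": [100.0, 101.0, 102.0, np.nan],
        "volume": [10, 20, 30, 40],
    },
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-03", "2024-01-04"]),
)

clean = sketch_preprocess(toy)
print(clean)
```

The duplicated 2024-01-03 row collapses to its last value (102), and the missing close on 2024-01-04 is filled with 102, which yields a return of 0 on that day.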

In [None]:
# Preprocess all tickers
processed = {}
for ticker, df in raw_data.items():
    processed[ticker] = preprocess_ohlcv(df, market_config=stock_config)
    print(f"{ticker}: {len(raw_data[ticker])} raw → {len(processed[ticker])} clean rows")

df = processed['AAPL']
print(f"\nColumns after preprocessing: {list(df.columns)}")
df.tail()

## 4. Price History Visualization

In [None]:
# Interactive candlestick chart for AAPL
df_plot = processed['AAPL'].tail(252)  # Last ~1 trading year (252 sessions)

fig = go.Figure(data=[go.Candlestick(
    x=df_plot.index,
    open=df_plot['open'], high=df_plot['high'],
    low=df_plot['low'], close=df_plot['close'],
    name='AAPL'
)])
fig.update_layout(
    title='AAPL — Last 12 Months (Candlestick)',
    yaxis_title='Price ($)', xaxis_title='Date',
    template='plotly_white', height=500
)
fig.show()

In [None]:
# Normalized price comparison (all tickers rebased to 100)
fig, ax = plt.subplots(figsize=(14, 6))

for ticker, df_t in processed.items():
    normalized = df_t['close'] / df_t['close'].iloc[0] * 100
    ax.plot(normalized.index, normalized, label=ticker, linewidth=1.5)

ax.set_title('Normalized Price Comparison (Base = 100)', fontsize=14)
ax.set_ylabel('Normalized Price')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 5. Return Distribution Analysis

Understanding the distribution of daily returns is fundamental. Financial returns are known to be:
- Approximately normal but with **fat tails** (excess kurtosis)
- Often slightly **negatively skewed**, most visibly for indices (large drops tend to be sharper than large gains)
- **Non-stationary** in variance (volatility changes over time and tends to cluster)
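
As a reference point for reading the skewness and kurtosis numbers below, the following sketch compares a normal sample with a fat-tailed one. The data are synthetic (a Student-t distribution with 5 degrees of freedom is an assumed stand-in for fat-tailed returns, not a fitted model). Both `scipy.stats.kurtosis` (with its default Fisher definition) and pandas `Series.kurtosis()` report excess kurtosis, so a normal distribution scores about 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 100_000

# Synthetic daily returns with roughly 1% scale.
normal_returns = rng.normal(loc=0.0, scale=0.01, size=n)
# Student-t with 5 degrees of freedom: theoretical excess kurtosis is 6.
fat_tailed_returns = 0.01 * rng.standard_t(df=5, size=n)

normal_kurt = stats.kurtosis(normal_returns)    # Fisher definition: normal -> ~0
fat_kurt = stats.kurtosis(fat_tailed_returns)   # clearly positive for fat tails
fat_skew = stats.skew(fat_tailed_returns)

# Jarque-Bera tests the joint hypothesis of zero skewness and zero excess kurtosis.
jb_stat, jb_pvalue = stats.jarque_bera(fat_tailed_returns)

print(f"normal excess kurtosis:     {normal_kurt:.2f}")
print(f"fat-tailed excess kurtosis: {fat_kurt:.2f}")
print(f"fat-tailed skewness:        {fat_skew:.2f}")
print(f"Jarque-Bera p-value:        {jb_pvalue:.3g}")
```

A small Jarque-Bera p-value rejects normality; on real return series of this length, rejection is the usual outcome.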

In [None]:
from scipy.stats import norm

fig, axes = plt.subplots(2, 3, figsize=(16, 10))

for i, (ticker, df_t) in enumerate(processed.items()):
    ax = axes[i // 3, i % 3]
    returns = df_t['returns'].dropna()
    
    ax.hist(returns, bins=80, density=True, alpha=0.7, color='steelblue', edgecolor='white')
    
    # Overlay a normal density with the same mean and standard deviation
    x = np.linspace(returns.min(), returns.max(), 200)
    ax.plot(x, norm.pdf(x, returns.mean(), returns.std()), 'r-', lw=2, label='Normal')
    
    ax.set_title(f'{ticker}  (skew={returns.skew():.2f}, kurt={returns.kurtosis():.2f})')
    ax.legend(fontsize=8)
    ax.set_xlabel('Daily Return')

# Hide any unused subplots (5 tickers on a 2x3 grid leaves one empty)
for unused_ax in axes.ravel()[len(processed):]:
    unused_ax.set_visible(False)

plt.suptitle('Daily Return Distributions vs Normal', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Summary statistics table
stats = []
for ticker, df_t in processed.items():
    r = df_t['returns'].dropna()
    cum = (1 + r).cumprod()  # growth of 1 unit invested, assuming simple returns
    stats.append({
        'Ticker': ticker,
        'Ann. Return': f"{r.mean() * 252:.1%}",  # arithmetic annualization
        'Ann. Vol': f"{r.std() * np.sqrt(252):.1%}",
        'Sharpe': f"{(r.mean() / r.std()) * np.sqrt(252):.2f}",  # risk-free rate taken as 0
        'Skewness': f"{r.skew():.3f}",
        'Kurtosis': f"{r.kurtosis():.2f}",  # pandas reports excess kurtosis (normal = 0)
        'Max DD': f"{(cum / cum.cummax() - 1).min():.1%}",
        'Best Day': f"{r.max():.1%}",
        'Worst Day': f"{r.min():.1%}",
    })

pd.DataFrame(stats).set_index('Ticker')

## 6. Volume Analysis

In [None]:
df_aapl = processed['AAPL'].tail(252)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 8), sharex=True,
                                gridspec_kw={'height_ratios': [3, 1]})

ax1.plot(df_aapl.index, df_aapl['close'], color='steelblue', linewidth=1.5)
ax1.set_ylabel('Close Price ($)')
ax1.set_title('AAPL — Price and Volume (Last 12 Months)')
ax1.grid(True, alpha=0.3)

colors = ['green' if r > 0 else 'red' for r in df_aapl['returns']]
ax2.bar(df_aapl.index, df_aapl['volume'] / 1e6, color=colors, alpha=0.7, width=1)
ax2.set_ylabel('Volume (M)')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Rolling Statistics

Volatility, Sharpe ratio, and correlation change over time. These time-varying dynamics are what our model tries to capture.
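
Before running these on real data, it helps to see the drawdown and rolling-volatility calculations on a series small enough to check by hand. The prices below are made up for illustration, and the 3-day window is chosen only to keep the toy example short.

```python
import numpy as np
import pandas as pd

# Hand-checkable toy prices: up 10%, down 10%, up ~22%, down 10%.
prices = pd.Series([100.0, 110.0, 99.0, 121.0, 108.9])
returns = prices.pct_change().dropna()

# Drawdown: distance of cumulative growth from its running peak.
cum_ret = (1 + returns).cumprod()
drawdown = cum_ret / cum_ret.cummax() - 1
max_drawdown = drawdown.min()

# Rolling volatility, annualized with sqrt(252); the first window-1 values are NaN.
window = 3
rolling_vol = returns.rolling(window).std() * np.sqrt(252)

print(drawdown.round(4).tolist())
print(f"Max drawdown: {max_drawdown:.1%}")
print(rolling_vol.round(3).tolist())
```

The price falls 10% from each peak (110 to 99, then 121 to 108.9), so the maximum drawdown is -10%. `cummax()` gives the same running peak as `expanding().max()`.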

In [None]:
df_aapl = processed['AAPL']

fig, axes = plt.subplots(3, 1, figsize=(14, 12), sharex=True)

# Rolling 20-day volatility (annualized)
rolling_vol = df_aapl['returns'].rolling(20).std() * np.sqrt(252)
axes[0].plot(rolling_vol.index, rolling_vol, color='orange', linewidth=1)
axes[0].set_ylabel('Annualized Volatility')
axes[0].set_title('AAPL Rolling Statistics')
axes[0].axhline(rolling_vol.mean(), color='gray', linestyle='--', alpha=0.5)
axes[0].grid(True, alpha=0.3)

# Rolling 60-day Sharpe ratio
rolling_sharpe = (
    df_aapl['returns'].rolling(60).mean() / 
    df_aapl['returns'].rolling(60).std()
) * np.sqrt(252)
axes[1].plot(rolling_sharpe.index, rolling_sharpe, color='green', linewidth=1)
axes[1].axhline(0, color='red', linestyle='-', alpha=0.3)
axes[1].set_ylabel('Rolling Sharpe (60d)')
axes[1].grid(True, alpha=0.3)

# Drawdown from the running peak of cumulative returns
cum_ret = (1 + df_aapl['returns']).cumprod()
peak = cum_ret.expanding().max()
drawdown = cum_ret / peak - 1
axes[2].fill_between(drawdown.index, drawdown, 0, color='red', alpha=0.3)
axes[2].set_ylabel('Drawdown')
axes[2].set_xlabel('Date')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Correlation Matrix

Correlations between stocks affect portfolio diversification and clustering behavior.
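
The diversification effect can be illustrated with the standard two-asset portfolio variance formula, $\sigma_p^2 = w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2 w_1 w_2 \rho \sigma_1 \sigma_2$. The sketch below uses made-up volatilities of 20% and equal weights; `two_asset_portfolio_vol` is a hypothetical helper, not a project function.

```python
import math


def two_asset_portfolio_vol(w1: float, vol1: float, vol2: float, rho: float) -> float:
    """Volatility of a two-asset portfolio with weights w1 and 1 - w1."""
    w2 = 1.0 - w1
    variance = (w1 * vol1) ** 2 + (w2 * vol2) ** 2 + 2 * w1 * w2 * rho * vol1 * vol2
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(variance, 0.0))


vol_perfect = two_asset_portfolio_vol(0.5, 0.20, 0.20, rho=1.0)
vol_uncorrelated = two_asset_portfolio_vol(0.5, 0.20, 0.20, rho=0.0)
vol_opposite = two_asset_portfolio_vol(0.5, 0.20, 0.20, rho=-1.0)

for label, vol in [("rho = +1", vol_perfect), ("rho =  0", vol_uncorrelated), ("rho = -1", vol_opposite)]:
    print(f"{label}: portfolio vol = {vol:.1%}")
```

With perfect correlation the portfolio is as volatile as its components (20%); with zero correlation volatility drops to about 14.1%; with perfect negative correlation it vanishes. The lower the pairwise correlations in the heatmap, the more a mixed portfolio benefits.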

In [None]:
# Build returns matrix
returns_df = pd.DataFrame({
    ticker: df_t['returns'] for ticker, df_t in processed.items()
}).dropna()

corr = returns_df.corr()

fig, ax = plt.subplots(figsize=(8, 6))
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
            center=0, vmin=-1, vmax=1, ax=ax, square=True)
ax.set_title('Return Correlation Matrix', fontsize=14)
plt.tight_layout()
plt.show()

## Key Takeaways

1. **Data quality** is high — yfinance provides clean OHLCV data with minimal gaps for US stocks.
2. **Return distributions** have fat tails (excess kurtosis), confirming that extreme events are more common than a normal distribution would predict.
3. **Volatility clusters** — periods of high volatility tend to be followed by more high volatility (autocorrelation in variance).
4. **Cross-stock correlations** exist, which is why our clustering module can find meaningful groups.

These properties inform our feature engineering choices (rolling vol, z-scores, momentum) in notebook 02.
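
Takeaway 3 can be checked numerically rather than by eye: if volatility clusters, squared returns are positively autocorrelated even when the returns themselves are not. The sketch below demonstrates the diagnostic on synthetic data with alternating calm and turbulent regimes (an assumed toy process, not a fitted volatility model); the same two `autocorr` calls could be applied to any ticker's `returns` column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy regime-switching volatility: 250-day calm blocks (0.5% daily vol)
# alternate with 250-day turbulent blocks (3% daily vol).
block = 250
n_blocks = 20
daily_vol = np.tile(np.repeat([0.005, 0.03], block), n_blocks // 2)

# Independent shocks scaled by the regime volatility.
returns = pd.Series(rng.standard_normal(daily_vol.size) * daily_vol)

acf_returns = returns.autocorr(lag=1)         # close to zero: direction is unpredictable
acf_squared = (returns ** 2).autocorr(lag=1)  # positive: magnitude is persistent

print(f"Lag-1 autocorrelation of returns:         {acf_returns:+.3f}")
print(f"Lag-1 autocorrelation of squared returns: {acf_squared:+.3f}")
```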