# Notebook 00 — Historical Backfill Engine (Polygon → DuckDB)

This notebook builds a **real historical dataset** for the Volatility Alpha Engine (VAE).

We pull 60–90 days of market data from **Polygon**, compute basic volatility stats,
and store everything in **DuckDB**. The other notebooks (01–06) and the RL system
will read from this same DuckDB file.

**Why this matters**

- Moves the project from single-day toy data to a multi-month history, which is closer to real quant research.
- Gives us enough history for EDA, features, regimes, and RL training.
- Shows we can build a reproducible data pipeline.

## 1. Imports and DuckDB connection

**What this cell does**

- Imports Python libraries we need (dates, dataframes, DuckDB).
- Imports our Polygon helper functions from `src/polygon_client.py`.
- Connects to the main DuckDB database file for VAE.

**Why this matters**

- DuckDB is our **single source of truth** for all notebooks.
- Using one DB file keeps the pipeline clean and reproducible.
- It replaces ad-hoc CSV files with a standard data-engineering pattern: one database file shared by every notebook.
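
`Path.cwd().parent` in the cell below assumes the notebook is launched from `notebooks/`. A minimal sketch of a more forgiving lookup, assuming the project root is the nearest ancestor that contains a `src` directory; `find_project_root` is a hypothetical helper, not part of the VAE codebase:

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' above {start}")


# Hypothetical usage: works from the project root or from notebooks/
# PROJECT_ROOT = find_project_root(Path.cwd())
```

With this lookup the same cell runs whether the kernel starts in the project root or in a subdirectory.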

In [1]:
from pathlib import Path
import sys
from datetime import datetime, timedelta

import duckdb
import pandas as pd

# --- Make sure we can import from the project root ---
PROJECT_ROOT = Path.cwd().parent  # notebooks/ -> project root
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

# Now this import will work
from src.polygon_client import get_underlying_bars, compute_realized_vol

# --- Use the SAME DuckDB file as notebooks 1–6 ---
DB_PATH = (PROJECT_ROOT / "data" / "volatility_alpha.duckdb").as_posix()
con = duckdb.connect(DB_PATH)

print("Using DB:", DB_PATH)

Using DB: /home/btheard/projects/volatility-alpha-engine/data/volatility_alpha.duckdb


## 2. Choose tickers and backfill window

**What this cell does**

- Defines the small universe of tickers we care about right now.
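- Computes a reference date window covering the last 90 calendar days. Note that the fetch in section 3 passes `days=90` directly and does not use `start_date_str` or `end_date_str`.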

**Why this matters**

- 90 calendar days hold roughly 60 to 64 trading days, which is about the minimum needed for:
  - 20-day and 60-day realized volatility
  - Stable features and RL training episodes
- We keep the universe small at first to avoid hitting Polygon rate limits.
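
To see how many trading days a calendar window can hold, here is a rough sketch that counts weekdays only. It ignores exchange holidays, so the true number of sessions is slightly lower. `count_weekdays` is an illustrative helper, not part of the project:

```python
from datetime import date, timedelta


def count_weekdays(start: date, end: date) -> int:
    """Count Monday-Friday dates in the inclusive range [start, end]."""
    n_days = (end - start).days + 1
    return sum(
        1
        for offset in range(n_days)
        if (start + timedelta(days=offset)).weekday() < 5
    )


# The window printed by the cell in this section: 2025-09-04 to 2025-12-03
print(count_weekdays(date(2025, 9, 4), date(2025, 12, 3)))  # 65 weekdays, before holidays
```

Subtracting exchange holidays leaves slightly fewer sessions, so a 60-day volatility window has very little slack inside a 90-calendar-day backfill.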

In [2]:
tickers = ["SPY", "QQQ", "TSLA", "NVDA", "AMD"]

end_date = datetime.now()
start_date = end_date - timedelta(days=90)

start_date_str = start_date.strftime("%Y-%m-%d")
end_date_str = end_date.strftime("%Y-%m-%d")

start_date_str, end_date_str, tickers

('2025-09-04', '2025-12-03', ['SPY', 'QQQ', 'TSLA', 'NVDA', 'AMD'])

## 3. Download daily OHLC bars from Polygon

**What this cell does**

- Loops over each ticker.
- Calls `get_underlying_bars(symbol, days=90)` to fetch daily bars; the recorded run returned 90 bars per ticker, which is 90 trading days rather than 90 calendar days.
- Attaches a `ticker` column and collects all rows in a list.

**Why this matters**

- OHLC bars plus volume (Open, High, Low, Close, Volume) are the core price data
  used in most trading and RL systems.
- This is our **raw market tape** that everything else builds on
  (features, regimes, signals, RL environment).
- We print basic status so we can see which tickers succeeded or failed.

In [3]:
all_rows = []

for symbol in tickers:
    try:
        bars = get_underlying_bars(symbol, days=90)
        if bars is None or bars.empty:
            print(f"⚠ No bars for {symbol}")
            continue

        bars = bars.copy()
        bars["ticker"] = symbol
        all_rows.append(bars)

        print(f"✓ Loaded {len(bars)} bars for {symbol}")

    except Exception as e:
        print(f"❌ Error loading {symbol}: {e}")

✓ Loaded 90 bars for SPY
✓ Loaded 90 bars for QQQ
✓ Loaded 90 bars for TSLA
✓ Loaded 90 bars for NVDA
✓ Loaded 90 bars for AMD
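
The loop above reports a failure and moves on. If Polygon rate limits turn out to be the usual cause, one option is to retry with exponential backoff. A minimal sketch, where `fetch_with_retry` and its parameters are hypothetical and `fetch` stands in for any callable such as `get_underlying_bars`:

```python
import time
from typing import Any, Callable


def fetch_with_retry(
    fetch: Callable[..., Any],
    *args: Any,
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    **kwargs: Any,
) -> Any:
    """Call fetch(*args, **kwargs), retrying on any exception with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fetch(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults


# Hypothetical usage inside the loop:
# bars = fetch_with_retry(get_underlying_bars, symbol, days=90)
```

The `sleep` parameter exists only so the delay can be swapped out in tests. Catching every exception is a simplification; a real client would probably retry only on rate-limit or network errors.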


## 4. Combine and clean the raw bar data

**What this cell does**

- Concatenates all per-ticker DataFrames into a single `df_bars`.
- Converts Polygon `timestamp` to a clean `date` column.
- Shows a preview of the data.

**Why this matters**

- Having all tickers in one DataFrame makes it easy to write to DuckDB.
- A clean `date` column is essential for:
  - Grouping by day
  - Computing returns
  - Aligning features and RL transitions
- This is the **canonical raw table** for downstream notebooks.
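
When the time field arrives as Unix epoch milliseconds (the case the `unit="ms"` conversion in the cell below targets), turning it into a calendar date is a two-step conversion. A standard-library sketch, assuming UTC is the intended calendar; `ms_to_utc_date` is an illustrative name:

```python
from datetime import date, datetime, timezone


def ms_to_utc_date(epoch_ms: int) -> date:
    """Convert Unix epoch milliseconds to a calendar date in UTC."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).date()


print(ms_to_utc_date(1_753_660_800_000))  # 2025-07-28
```

A bar stamped late in the UTC day can fall on a different date in New York time, so the timezone used for the `date` column is worth choosing deliberately.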

In [4]:
# Combine and clean the raw bar data

# What this cell does
# - Concatenate all per-ticker DataFrames into a single df_bars
# - Move Polygon’s time field (index or column) into a proper 'date' column
# - Normalize to calendar dates (YYYY-MM-DD)

if not all_rows:
    raise RuntimeError("No data returned from Polygon. Check API key or rate limits.")

# 1) Combine all tickers' bars into one DataFrame WITHOUT dropping the index
df_bars = pd.concat(all_rows)

print("Backfill bars columns:", list(df_bars.columns))
print("Backfill bars index name:", df_bars.index.name)

cols = df_bars.columns

# 2) Ensure we have a 'date' column from whatever time field Polygon gave us
if df_bars.index.name in ("timestamp", "t"):
    # Time is stored in the index (common with Polygon)
    df_bars = df_bars.reset_index()
    time_col = df_bars.columns[0]          # former index column
    df_bars.rename(columns={time_col: "date"}, inplace=True)
elif "timestamp" in cols:
    df_bars["date"] = df_bars["timestamp"]
elif "t" in cols:
    df_bars["date"] = df_bars["t"]
elif "date" in cols:
    # Already have a date-like column; nothing to copy
    pass
else:
    raise RuntimeError(
        f"Expected a time column or index in bars, but got columns={list(cols)}, index={df_bars.index.name}"
    )

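# 3) Convert to proper datetime, then to calendar date
# Branch on dtype up front: coercing first and retrying afterwards would only
# re-parse a column that is already all NaT.
raw_time = df_bars["date"]
if pd.api.types.is_numeric_dtype(raw_time):
    # Epoch milliseconds, e.g. Polygon's raw "t" field
    df_bars["date"] = pd.to_datetime(raw_time, unit="ms", errors="coerce")
else:
    # Already datetime-like (or date strings), so unit="ms" does not apply
    df_bars["date"] = pd.to_datetime(raw_time, errors="coerce")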

if df_bars["date"].isna().all():
    raise RuntimeError("Failed to convert time field to datetime; inspect df_bars.head().")

df_bars["date"] = df_bars["date"].dt.date

df_bars.head()


Backfill bars columns: ['open', 'high', 'low', 'close', 'volume', 'ticker']
Backfill bars index name: timestamp


| | date | open | high | low | close | volume | ticker |
|---|---|---|---|---|---|---|---|
| 0 | 2025-07-28 | 637.48 | 638.04 | 635.54 | 636.94 | 54917102.0 | SPY |
| 1 | 2025-07-29 | 638.35 | 638.67 | 634.335 | 635.26 | 60556278.0 | SPY |
| 2 | 2025-07-30 | 635.92 | 637.68 | 631.54 | 634.46 | 80418851.0 | SPY |
| 3 | 2025-07-31 | 639.455 | 639.85 | 630.765 | 632.08 | 103385246.0 | SPY |
| 4 | 2025-08-01 | 626.3 | 626.34 | 619.29 | 621.72 | 140103572.0 | SPY |


## 5. Compute 20-day and 60-day realized volatility per ticker

**What this cell does**

- For each ticker, sorts the bars by date.
- Uses `compute_realized_vol()` to estimate:
  - 20-day realized volatility (RV20)
  - 60-day realized volatility (RV60)
- Stores one row per ticker in a `df_rv` snapshot table.

**Why this matters**

- Volatility is the heart of this project:
  - It feeds our **Edge Score**.
  - It defines **volatility regimes**.
  - It shapes the **RL state and reward**.
- Having a per-ticker RV snapshot is useful for:
  - Feature sanity checks
  - Comparisons across names
  - UI metrics (e.g., “Avg RV20 across universe”).
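
The implementation of `compute_realized_vol()` lives in `src/polygon_client.py` and is not shown in this notebook. A common definition, offered here as an assumption rather than a description of that function, is the sample standard deviation of the last `window` daily log returns, annualized with 252 trading days and expressed in percent:

```python
import math
import statistics
from typing import Sequence


def realized_vol_pct(
    closes: Sequence[float], window: int, periods_per_year: int = 252
) -> float:
    """Annualized realized volatility (percent) from the last `window` log returns.

    Returns NaN when there are fewer than window + 1 closes.
    """
    if window < 2 or len(closes) < window + 1:
        return float("nan")
    recent = closes[-(window + 1):]
    log_returns = [math.log(b / a) for a, b in zip(recent, recent[1:])]
    return 100.0 * statistics.stdev(log_returns) * math.sqrt(periods_per_year)
```

Under this definition a 60-day estimate needs 61 closes, so a ticker with a shorter history would yield NaN instead of a misleading number.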

In [5]:
rv_rows = []

for symbol in tickers:
    sub = df_bars[df_bars["ticker"] == symbol].sort_values("date")

    if sub.empty:
        print(f"⚠ No data for {symbol}, skipping RV calc.")
        continue

    try:
        rv20 = compute_realized_vol(sub, window=20)
        rv60 = compute_realized_vol(sub, window=60)
    except Exception as e:
        print(f"❌ RV calc failed for {symbol}: {e}")
        rv20, rv60 = float("nan"), float("nan")

    rv_rows.append(
        {
            "ticker": symbol,
            "rv20": rv20,
            "rv60": rv60,
            "as_of": max(sub["date"]),
        }
    )

df_rv = pd.DataFrame(rv_rows)
df_rv

| | ticker | rv20 | rv60 | as_of |
|---|---|---|---|---|
| 0 | SPY | 15.045246 | 12.448073 | 2025-12-02 |
| 1 | QQQ | 21.56085 | 17.365192 | 2025-12-02 |
| 2 | TSLA | 50.368334 | 50.858019 | 2025-12-02 |
| 3 | NVDA | 41.892757 | 37.772929 | 2025-12-02 |
| 4 | AMD | 68.514826 | 73.059756 | 2025-12-02 |


## 6. Write OHLC and volatility tables to DuckDB

**What this cell does**

- Creates `ohlc_bars` table (if it doesn't exist) matching `df_bars` schema.
- Clears any old rows from `ohlc_bars` and inserts the new data.
- Creates `daily_rv` table (if it doesn't exist) matching `df_rv`.
- Clears and refills `daily_rv`.

**Why this matters**

- DuckDB now holds our **canonical raw tables**:
  - `ohlc_bars` = price and volume history
  - `daily_rv` = per-ticker vol snapshot
- All other notebooks (01–06) can reliably read from the same DB.
- This mirrors a small data-warehouse layout (one database file, named tables) rather than ad-hoc CSV dumps.

In [6]:
# Create ohlc_bars if it is missing, then reload all of its rows
con.execute("""
CREATE TABLE IF NOT EXISTS ohlc_bars AS
SELECT * FROM df_bars LIMIT 0;
""")

con.execute("DELETE FROM ohlc_bars;")
con.execute("INSERT INTO ohlc_bars SELECT * FROM df_bars;")

# Create daily_rv if it is missing, then reload all of its rows
con.execute("""
CREATE TABLE IF NOT EXISTS daily_rv AS
SELECT * FROM df_rv LIMIT 0;
""")

con.execute("DELETE FROM daily_rv;")
con.execute("INSERT INTO daily_rv SELECT * FROM df_rv;")

con.execute("SELECT COUNT(*) AS n_rows FROM ohlc_bars;").fetchdf()

| | n_rows |
|---|---|
| 0 | 450 |


## 7. Quick sanity checks

**What this cell does**

- Checks the date range stored in `ohlc_bars`.
- Counts the number of rows per ticker.

**Why this matters**

- Confirms we **actually** have ~60–90 days of data.
- Confirms that all tickers are present and non-empty.
- If something looks off here, we know to fix the data before touching RL.
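
The checks in this section are read by eye. They can also be expressed as a function that returns a list of problems, so that a rerun fails loudly when a ticker is missing or short. A minimal sketch using plain tuples; `coverage_problems` and the `min_rows` threshold are assumptions, not part of the project:

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Tuple


def coverage_problems(
    rows: Iterable[Tuple[str, date]],
    expected_tickers: Iterable[str],
    min_rows: int,
) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    counts = Counter(ticker for ticker, _ in rows)
    problems = []
    for ticker in expected_tickers:
        n = counts.get(ticker, 0)
        if n == 0:
            problems.append(f"{ticker}: missing")
        elif n < min_rows:
            problems.append(f"{ticker}: only {n} rows, expected at least {min_rows}")
    return problems


# Hypothetical usage against the table written above:
# rows = con.execute("SELECT ticker, date FROM ohlc_bars").fetchall()
# assert not coverage_problems(rows, tickers, min_rows=61)
```

A threshold of 61 rows corresponds to the 60 daily returns a 60-day window would need under a returns-based volatility definition.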

In [7]:
con.execute("""
SELECT 
    MIN(date) AS min_date,
    MAX(date) AS max_date,
    COUNT(*) AS n_rows
FROM ohlc_bars;
""").fetchdf()

| | min_date | max_date | n_rows |
|---|---|---|---|
| 0 | 2025-07-28 | 2025-12-02 | 450 |


In [8]:
con.execute("""
SELECT 
    ticker,
    COUNT(*) AS n_rows
FROM ohlc_bars
GROUP BY ticker
ORDER BY ticker;
""").fetchdf()

| | ticker | n_rows |
|---|---|---|
| 0 | AMD | 90 |
| 1 | NVDA | 90 |
| 2 | QQQ | 90 |
| 3 | SPY | 90 |
| 4 | TSLA | 90 |


## 8. Wrap-up: what this notebook proves

**What we did**

- Pulled 90 daily OHLC bars per ticker from Polygon for a small ticker universe (2025-07-28 to 2025-12-02 in the recorded run).
- Computed simple realized volatility summaries (RV20, RV60).
- Stored everything in DuckDB tables:
  - `ohlc_bars`
  - `daily_rv`

**Why it matters for VAE and RL**

- All downstream notebooks (01–06) and the RL engine now run on **real history**.
- This enables:
  - More honest EDA and signal analysis
  - Realistic regime detection
  - RL training on multi-day episodes instead of toy examples

**What it demonstrates**

- Shows we can:
  - Ingest external market data APIs
  - Design a small but solid data model
  - Use DuckDB for analytics
  - Build a reproducible research pipeline

Next step:  
Run notebooks **01 → 06** so the entire VAE pipeline uses this history.
