# DATA PIPELINE
**Goal:** Produce a single clean dataset that the environment (and, later, the models) can consume.

## Deliverables
- `data/prices_returns.csv` with columns:
  - `date` (`YYYY-MM-DD`, timezone-naive)
  - `asset` (ticker, e.g., SPY, TLT, GLD, QQQ)
  - `close` (Adjusted Close)
  - `ret` (log return `ln(close_t/close_{t-1})`)
- A printed summary: assets, date range, row count, and per-asset mean and volatility (standard deviation) of `ret`.
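
As a rough illustration of the summary deliverable, the sketch below prints the listed fields from a table that already has the four columns. The function name `summarize` and the toy numbers are hypothetical, and "vol" is assumed here to mean the daily sample standard deviation of `ret` (not annualized).

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Print the summary fields and return per-asset mean/vol of `ret`.

    Assumption: vol = sample standard deviation of daily log returns.
    """
    stats = df.groupby("asset")["ret"].agg(mean="mean", vol="std")
    print("assets:    ", sorted(df["asset"].unique()))
    print("date range:", df["date"].min(), "to", df["date"].max())
    print("rows:      ", len(df))
    print(stats)
    return stats

# Toy rows with made-up values, only to exercise the function.
example = pd.DataFrame({
    "date": ["2024-01-03", "2024-01-03", "2024-01-04", "2024-01-04"],
    "asset": ["SPY", "TLT", "SPY", "TLT"],
    "close": [101.0, 99.0, 104.0, 96.0],
    "ret": [0.01, -0.01, 0.03, -0.03],
})
stats = summarize(example)
```

Returning the stats table as well as printing it makes the same numbers easy to assert on in a notebook cell.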

## Scope (MVP)
- Daily data on a **common NYSE calendar** (exclude BTC for now, since it trades every day and does not follow the NYSE calendar).
- Date range: **2016-01-01 → 2025-10-31** (configurable).
- **Inner-join** dates across assets so every asset shares the exact same dates.
- Strict validation: **no NaN or inf values** in `ret`; fail fast if any are found.
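
One possible way to enforce the two hard rules above (identical dates for every asset, finite `ret`) is a single validation function that raises on the first violation. This is a minimal sketch: the name `validate`, the extra duplicate-row check, and the error messages are assumptions, not part of the spec.

```python
import numpy as np
import pandas as pd

def validate(df: pd.DataFrame) -> None:
    """Raise ValueError if the table breaks the MVP scope rules."""
    # Extra guard (assumption): a (date, asset) pair should appear only once.
    if df.duplicated(["date", "asset"]).any():
        raise ValueError("duplicate (date, asset) rows")
    # Every asset must cover the exact same set of dates.
    date_sets = {frozenset(dates) for _, dates in df.groupby("asset")["date"]}
    if len(date_sets) != 1:
        raise ValueError("assets do not share the exact same dates")
    # Fail fast on NaN or +/-inf returns.
    if not np.isfinite(df["ret"].to_numpy(dtype=float)).all():
        raise ValueError("NaN or inf found in ret")

# Toy frames with made-up values.
good = pd.DataFrame({
    "date": ["2024-01-03", "2024-01-03", "2024-01-04", "2024-01-04"],
    "asset": ["SPY", "TLT", "SPY", "TLT"],
    "ret": [0.01, -0.01, 0.02, -0.02],
})
bad_nan = good.assign(ret=[0.01, -0.01, 0.02, np.nan])
bad_dates = good.iloc[:3]  # TLT is missing 2024-01-04

validate(good)  # passes silently
for name, frame in [("bad_nan", bad_nan), ("bad_dates", bad_dates)]:
    try:
        validate(frame)
    except ValueError as err:
        print(name, "rejected:", err)
```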

## What to build (step-by-step)
1. **Config cell:** `TICKERS`, `START`, `END`, `OUT_PATH`.
2. **Loader** `load_prices(tickers, start, end) → DataFrame[date, asset, close]`
   - Use your data source of choice (e.g., yfinance). Return a **long** table (one row per date and asset).
3. **Calendar aligner** `align_calendar(df) → DataFrame`
   - Keep only dates present for **all** assets (intersection).
   - Sort by `date, asset`.
4. **Returns** `to_returns(df) → DataFrame[date, asset, close, ret]`
   - Compute **log returns** per asset.
   - Drop the first row per asset (its return is NaN because there is no previous close).
   - Assert that no NaN/inf values remain.
5. **Writer & summary**
   - Save to `data/prices_returns.csv`.
   - Print summary: assets list, date range, row count, per-asset mean/vol of `ret`.
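
The steps above can be sketched as follows, assuming pandas and NumPy. The loader is deliberately left out because it depends on the chosen data source; a small hand-made long table stands in for its output. The helper name `write_csv` and the toy prices are hypothetical.

```python
from pathlib import Path

import numpy as np
import pandas as pd

# Step 1: config. Values mirror the spec; adjust as needed.
TICKERS = ["SPY", "TLT", "GLD", "QQQ"]
START, END = "2016-01-01", "2025-10-31"
OUT_PATH = Path("data/prices_returns.csv")

def align_calendar(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only dates on which every asset has a row (intersection)."""
    n_assets = df["asset"].nunique()
    per_date = df.groupby("date")["asset"].nunique()
    common = per_date.index[per_date == n_assets]
    out = df[df["date"].isin(common)]
    return out.sort_values(["date", "asset"]).reset_index(drop=True)

def to_returns(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-asset log returns and drop each asset's first row."""
    out = df.sort_values(["asset", "date"]).copy()
    # ln(close_t / close_{t-1}) computed as a difference of logs per asset.
    out["ret"] = np.log(out["close"]).groupby(out["asset"]).diff()
    # Drop only the first row of each asset, so other NaNs are not hidden.
    out = out[out.groupby("asset").cumcount() > 0]
    assert np.isfinite(out["ret"]).all(), "NaN/inf in ret"
    return out.sort_values(["date", "asset"]).reset_index(drop=True)

def write_csv(df: pd.DataFrame, out_path: Path) -> None:
    """Hypothetical writer: create the folder if needed, then save."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)

# Stand-in for load_prices(...): made-up prices, TLT lacks 2024-01-03.
raw = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-03", "2024-01-04",
             "2024-01-02", "2024-01-04"],
    "asset": ["SPY", "SPY", "SPY", "TLT", "TLT"],
    "close": [100.0, 110.0, 121.0, 50.0, 55.0],
})
result = to_returns(align_calendar(raw))
print(result)
# write_csv(result, OUT_PATH)  # uncomment to save the file
```

Note one consequence of aligning before computing returns: when a date is removed for all assets, the next return spans the gap (here SPY's return covers 2024-01-02 to 2024-01-04).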


## Quick checks
- Plot the closing price of each asset over time and look for gaps, spikes, or flat segments that suggest bad data.
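
A minimal plotting sketch for this check, assuming matplotlib is available and the table is in the long format described earlier. The function name `plot_closes` and the toy prices are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_closes(df: pd.DataFrame):
    """Draw one line per asset from a long [date, asset, close] table."""
    # Pivot long -> wide so each asset becomes its own column.
    wide = df.pivot(index="date", columns="asset", values="close")
    wide.index = pd.to_datetime(wide.index)
    fig, ax = plt.subplots(figsize=(10, 4))
    wide.plot(ax=ax, title="Closing prices")
    ax.set_xlabel("date")
    ax.set_ylabel("close")
    return ax

# Made-up prices, only to show the call.
toy = pd.DataFrame({
    "date": ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"],
    "asset": ["SPY", "TLT", "SPY", "TLT"],
    "close": [100.0, 50.0, 101.0, 49.5],
})
ax = plot_closes(toy)
# In a notebook the figure renders automatically; in a script, call plt.show().
```

Assets with very different price levels can be hard to compare on one axis; dividing each column by its first value before plotting is a common workaround.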



