# Notebook 01 — Data & Feature Engineering

This notebook checks whether the processed dataset is **ready for RL training**.

We do *not* train a model here. We verify:
- Files load correctly and paths are reproducible
- Universe symbols and panel symbols line up
- Time-series ordering looks sane (no obvious gaps / reversals)
- The TradingEnv-required columns exist and contain reasonable values
- Daily return distributions don’t look corrupted (NaNs/Infs/outliers)

## Imports & reproducible project paths

Purpose:
- Import the core libraries used throughout the notebook.
- Define `ROOT` as the repo root so the notebook runs correctly whether launched from `/notebooks` or the project root.
- Build absolute paths to the processed artifacts (`panel.parquet`, `universe_top200.csv`) and confirm they exist.

Why it matters:
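RL pipelines fail in annoying ways when paths are brittle. This cell resolves every path from the repo root, so the notebook is reproducible, and prints up front whether the processed data is present. Note that it only reports existence; it does not raise if a file is missing.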

In [5]:
from pathlib import Path
import os
import pandas as pd
import numpy as np

ROOT = Path.cwd().resolve()
if ROOT.name == "notebooks":
    ROOT = ROOT.parent

DATA_DIR = ROOT / "data" / "processed"

panel_path = DATA_DIR / "panel.parquet"
universe_path = DATA_DIR / "universe_top200.csv"

os.chdir(ROOT)

print("ROOT:", ROOT)
print("panel exists:", panel_path.exists())
print("universe exists:", universe_path.exists())

ROOT: /home/btheard/projects/earningsedge-rl
panel exists: True
universe exists: True
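The existence checks above only print `True`/`False`. A minimal sketch of a stricter, fail-fast variant follows; `require_files` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def require_files(*paths):
    """Raise FileNotFoundError naming every path that does not exist.

    Hypothetical helper for illustration; not part of the project code.
    """
    missing = [str(p) for p in paths if not Path(p).exists()]
    if missing:
        raise FileNotFoundError("Missing processed artifacts: " + ", ".join(missing))


# Assumed usage in this notebook:
# require_files(panel_path, universe_path)
```

Raising here stops the notebook at the first cell instead of surfacing the problem later as a less obvious read error.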


## Load processed panel + universe

Purpose:
- Load the processed panel dataset (prices + engineered earnings features).
- Load the universe (the ticker pool used for training/evaluation sampling).

Why it matters:
If these don’t load cleanly, nothing downstream (environment, training, evaluation) is trustworthy. We also inspect the first few rows to confirm schema and basic sanity.

In [6]:
panel = pd.read_parquet(panel_path)
universe = pd.read_csv(universe_path)

panel.head(), universe.head()

(  symbol       date   open   high    low  close  adj_close    volume  \
 0      A 1999-11-18  45.50  50.00  40.00  44.00    29.6303  44739900   
 1      A 1999-11-19  42.94  43.00  39.81  40.38    27.1926  10897100   
 2      A 1999-11-22  41.31  44.00  40.06  44.00    29.6303   4705200   
 3      A 1999-11-23  42.50  43.63  40.25  40.25    27.1050   4274400   
 4      A 1999-11-24  40.13  41.94  40.00  41.06    27.6505   3464400   
 
    split_coefficient next_earnings_date prev_earnings_date  days_to_earnings  \
 0                1.0         2009-05-14                NaT              3465   
 1                1.0         2009-05-14                NaT              3464   
 2                1.0         2009-05-14                NaT              3461   
 3                1.0         2009-05-14                NaT              3460   
 4                1.0         2009-05-14                NaT              3459   
 
    days_since_earnings  is_earnings_window  
 0                99999   

## Dataset overview (size + schema)

Purpose:
- Confirm dataset size (rows/columns) is in the expected range.
- Inspect column names to verify key engineered features exist (especially those used by `TradingEnv`).

Why it matters:
Schema drift (renames, missing columns, wrong dtypes) is a top cause of silent RL bugs. This is the quick checkpoint before deeper validation.

In [7]:
print("Panel shape:", panel.shape)
print("Universe symbols:", universe.shape[0])

panel.columns

Panel shape: (24335095, 14)
Universe symbols: 200


Index(['symbol', 'date', 'open', 'high', 'low', 'close', 'adj_close', 'volume',
       'split_coefficient', 'next_earnings_date', 'prev_earnings_date',
       'days_to_earnings', 'days_since_earnings', 'is_earnings_window'],
      dtype='object')
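Column names alone do not reveal wrong dtypes. The sketch below checks both presence and dtype family on a toy frame. The expectations in `EXPECTED_KINDS` are assumptions made for illustration, not the documented requirements of `TradingEnv`.

```python
import pandas as pd
from pandas.api import types as ptypes

# Assumed expectations (column -> dtype predicate); adjust to the real env contract.
EXPECTED_KINDS = {
    "date": ptypes.is_datetime64_any_dtype,
    "adj_close": ptypes.is_numeric_dtype,
    "volume": ptypes.is_numeric_dtype,
    "days_to_earnings": ptypes.is_numeric_dtype,
    "days_since_earnings": ptypes.is_numeric_dtype,
    "is_earnings_window": ptypes.is_bool_dtype,
}


def schema_problems(df, expected=EXPECTED_KINDS):
    """Return a list of human-readable schema problems (empty list means OK)."""
    problems = []
    for col, check in expected.items():
        if col not in df.columns:
            problems.append(f"{col}: missing")
        elif not check(df[col]):
            problems.append(f"{col}: unexpected dtype {df[col].dtype}")
    return problems


# Toy frame with two deliberate problems: a string date column and a missing column.
toy = pd.DataFrame({
    "date": ["1999-11-18"],
    "adj_close": [29.6303],
    "volume": [44739900],
    "days_to_earnings": [3465],
    "is_earnings_window": [False],
})
print(schema_problems(toy))
```

On the real data the call would be `schema_problems(panel)`, where an empty list is the desired result.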

## Symbol coverage (universe vs panel)

Purpose:
- Compare the set of tickers in the universe file vs the tickers that actually exist in the panel.
- Compute the intersection of the two sets.

Why it matters:
`TradingEnv` samples tickers from the universe. If the universe includes symbols missing from the panel, training/evaluation can crash or silently skew results. This ensures sampling is valid.

In [8]:
panel_symbols = set(panel["symbol"].unique())
universe_symbols = set(universe["symbol"].astype(str))

common = panel_symbols & universe_symbols

print("Symbols in panel:", len(panel_symbols))
print("Symbols in universe:", len(universe_symbols))
print("Symbols in both:", len(common))

Symbols in panel: 7485
Symbols in universe: 200
Symbols in both: 200
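The counts above show full coverage, but if they ever diverge it helps to know which symbols are missing. A small sketch on toy inputs; `coverage_report` is a hypothetical helper.

```python
def coverage_report(panel_symbols, universe_symbols):
    """Compare two symbol collections and list universe symbols absent from the panel."""
    panel_set = {str(s) for s in panel_symbols}
    universe_set = {str(s) for s in universe_symbols}
    return {
        "common": len(panel_set & universe_set),
        "missing_from_panel": sorted(universe_set - panel_set),
    }


# Toy example: "ZZZZ" is in the universe but has no rows in the panel.
report = coverage_report(["A", "AAPL", "MSFT"], ["A", "MSFT", "ZZZZ"])
print(report)  # {'common': 2, 'missing_from_panel': ['ZZZZ']}
```

In this notebook the inputs would be `panel["symbol"].unique()` and `universe["symbol"]`, and a non-empty `missing_from_panel` is a reason to stop before training.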


## Time-series integrity spot check

Purpose:
- Pull a single symbol’s first few rows for a quick sanity check:
  - dates look ordered
  - price/volume columns are populated
  - earnings-related columns exist

Why it matters:
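This is not a full audit, but it catches obvious data issues fast (e.g., missing dates, duplicate rows). Because the rows are sorted by date before display, this view cannot reveal ordering problems in the stored file itself.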

In [14]:
sample_sym = panel["symbol"].iloc[0]
sample_df = panel.loc[panel["symbol"] == sample_sym].sort_values("date")
sample_df[["symbol", "date", "adj_close", "volume", "days_to_earnings", "is_earnings_window"]].head(10)

| | symbol | date | adj_close | volume | days_to_earnings | is_earnings_window |
|---|---|---|---|---|---|---|
| 0 | A | 1999-11-18 | 29.6303 | 44739900 | 3465 | False |
| 1 | A | 1999-11-19 | 27.1926 | 10897100 | 3464 | False |
| 2 | A | 1999-11-22 | 29.6303 | 4705200 | 3461 | False |
| 3 | A | 1999-11-23 | 27.105 | 4274400 | 3460 | False |
| 4 | A | 1999-11-24 | 27.6505 | 3464400 | 3459 | False |
| 5 | A | 1999-11-26 | 27.738 | 1237100 | 3457 | False |
| 6 | A | 1999-11-29 | 28.371 | 2914700 | 3454 | False |
| 7 | A | 1999-11-30 | 28.4114 | 3083000 | 3453 | False |
| 8 | A | 1999-12-01 | 28.9165 | 2115400 | 3452 | False |
| 9 | A | 1999-12-02 | 29.7179 | 2195900 | 3451 | False |
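A spot check on one symbol does not scale to the whole panel. The sketch below computes per-symbol ordering, duplicate-date, and largest-gap flags on rows in their stored order. `date_integrity` is a hypothetical helper, and the toy frame stands in for `panel`.

```python
import pandas as pd


def date_integrity(df):
    """Per-symbol date checks, computed without sorting the input first."""
    grouped = df.groupby("symbol")["date"]
    return pd.DataFrame({
        # Non-strict check: repeated dates still count as "sorted".
        "is_sorted": grouped.apply(lambda s: s.is_monotonic_increasing),
        "n_duplicate_dates": grouped.apply(lambda s: int(s.duplicated().sum())),
        # Largest calendar gap between consecutive dates once sorted.
        "max_gap_days": grouped.apply(lambda s: s.sort_values().diff().dt.days.max()),
    })


toy = pd.DataFrame({
    "symbol": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime([
        "1999-11-18", "1999-11-19", "1999-11-22",  # A: ordered, weekend gap
        "2000-01-05", "2000-01-04", "2000-01-04",  # B: reversed, one duplicate
    ]),
})
report = date_integrity(toy)
print(report)
```

Weekends and holidays produce gaps of a few days, so only unusually large `max_gap_days` values are worth investigating.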


## Required feature availability (environment readiness)

Purpose:
- Define the minimum set of columns required by `TradingEnv`.
- Report any missing columns.

Why it matters:
The environment assumes these fields exist. Missing columns can break training or (worse) create incorrect observations/rewards without throwing obvious errors.

In [10]:
required_cols = [
    "adj_close",
    "volume",
    "days_to_earnings",
    "days_since_earnings",
    "is_earnings_window",
]

missing = [c for c in required_cols if c not in panel.columns]
missing

[]

## Earnings feature sanity checks (summary statistics)

Purpose:
- Generate descriptive statistics for:
  - `days_to_earnings`
  - `days_since_earnings`
  - `is_earnings_window`

Why it matters:
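These features directly drive the earnings-aware behavior in the environment (observations + penalties). This confirms they’re not empty or constant. Two caveats when reading the output: `describe()` summarizes only numeric columns by default, so the boolean `is_earnings_window` does not appear; and 99999 appears to be a placeholder for a missing earnings date (the first rows of the panel pair `prev_earnings_date = NaT` with `days_since_earnings = 99999`). Because 99999 is the median of `days_since_earnings`, the means and standard deviations mostly reflect the placeholder rather than real day counts.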

In [11]:
panel[["days_to_earnings", "days_since_earnings", "is_earnings_window"]].describe()

| | days_to_earnings | days_since_earnings |
|---|---|---|
| count | 24335100.0 | 24335100.0 |
| mean | 25095.12 | 54833.33 |
| std | 42576.75 | 49731.09 |
| min | 0.0 | 0.0 |
| 25% | 53.0 | 54.0 |
| 50% | 764.0 | 99999.0 |
| 75% | 5126.0 | 99999.0 |
| max | 99999.0 | 99999.0 |
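The 99999 values dominate the statistics above. Assuming 99999 is a placeholder for "no earnings date known" (an inference from the output, not something the source states), a sketch like the following masks it before summarizing and reports how often it occurs. `earnings_summary` and the toy data are hypothetical.

```python
import pandas as pd

SENTINEL = 99999  # assumed placeholder for a missing earnings date


def earnings_summary(df, cols=("days_to_earnings", "days_since_earnings")):
    """describe() with sentinel values masked, plus the share of sentinel rows."""
    values = df[list(cols)]
    masked = values.where(values != SENTINEL)  # sentinel -> NaN, ignored by describe()
    share = (values == SENTINEL).mean().rename("sentinel_share")
    return pd.concat([masked.describe(), share.to_frame().T])


toy = pd.DataFrame({
    "days_to_earnings": [10, 20, 99999, 30],
    "days_since_earnings": [99999, 99999, 5, 15],
    "is_earnings_window": [False, False, True, False],
})
summary = earnings_summary(toy)
print(summary)

# Boolean columns are better summarized by their value frequencies.
print(toy["is_earnings_window"].value_counts(normalize=True))
```

If the environment feeds these columns into observations directly, the sentinel share also indicates how many rows carry an extreme, non-informative value.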


## Return distribution sanity check (numerical stability)

Purpose:
- Sort the panel by (symbol, date) to ensure correct time ordering.
- Compute daily returns (`pct_change`) per symbol.
- Count NaNs/Infs and summarize the cleaned return distribution.

Why it matters:
RL training is extremely sensitive to NaNs/Infs. This cell catches corrupted price series (e.g., zeros, missing values) that can silently poison rewards and destabilize learning.

In [13]:
panel = panel.sort_values(["symbol", "date"]).copy()

# Daily returns per symbol, computed from adj_close (the env's canonical price)
panel["ret_1"] = panel.groupby("symbol")["adj_close"].pct_change()

# Count bad values
n_nan = panel["ret_1"].isna().sum()
n_inf = np.isinf(panel["ret_1"]).sum()

print("ret_1 NaNs:", n_nan)
print("ret_1 Infs:", n_inf)

# Clean view (ignore first row per symbol + any bad points)
clean = panel["ret_1"].replace([np.inf, -np.inf], np.nan).dropna()

print(clean.describe(percentiles=[0.01, 0.05, 0.5, 0.95, 0.99]))

ret_1 NaNs: 8680
ret_1 Infs: 72
count    2.432634e+07
mean     6.065009e+00
std      1.826368e+04
min     -4.996910e+03
1%      -1.000000e-01
5%      -4.568902e-02
50%      0.000000e+00
95%      4.729730e-02
99%      1.148309e-01
max      7.314262e+07
Name: ret_1, dtype: float64
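The extreme minimum and maximum above suggest corrupted price points rather than real moves: an infinite return needs a previous price of zero, and a simple return below -1 needs a negative price. A sketch that counts such cases follows; `return_diagnostics`, its threshold, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def return_diagnostics(df, price_col="adj_close", max_abs_ret=1.0):
    """Count price and return values that cannot come from valid positive prices."""
    df = df.sort_values(["symbol", "date"])
    ret = df.groupby("symbol")[price_col].pct_change()
    finite = ret.replace([np.inf, -np.inf], np.nan)
    return {
        "non_positive_prices": int((df[price_col] <= 0).sum()),
        "inf_returns": int(np.isinf(ret).sum()),
        "below_minus_100pct": int((finite < -1.0).sum()),
        # max_abs_ret is an arbitrary illustrative cutoff, not a project setting.
        "abs_above_threshold": int((finite.abs() > max_abs_ret).sum()),
    }


toy = pd.DataFrame({
    "symbol": ["A"] * 4 + ["B"] * 3,
    "date": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06",
        "2020-01-01", "2020-01-02", "2020-01-03",
    ]),
    # A contains a zero price; B contains a negative price.
    "adj_close": [10.0, 11.0, 0.0, 5.0, 2.0, -4.0, 8.0],
})
diag = return_diagnostics(toy)
print(diag)
```

Grouping the same flags by symbol would show whether the bad points are concentrated in a few tickers, and whether any of them belong to the 200-symbol universe.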


## Summary

What we confirmed:
- The panel + universe load correctly using repo-relative paths
- The dataset contains the expected columns and symbol coverage
- Time-series ordering looks sane for a sample symbol
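- Earnings timing fields exist and are populated, although 99999 (apparently a placeholder for a missing earnings date) is the median of `days_since_earnings`
- The central return percentiles look plausible (1st to 99th percentile between about -10% and +11.5%), but the tails do not: 72 infinite returns, a minimum of about -4997 (a simple return below -1 implies a negative price), and a maximum of about 7.3e7. These points need cleaning or filtering before training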

Next:
- Notebook 02 will validate **environment behavior** (reset/step), reward mechanics, and baseline policies.