# Notebook 04: RL Environment Skeleton

This notebook turns the engineered feature table into a tiny reinforcement learning (RL) playground.

We will:

- Load the `screener_features` table from DuckDB.
- Build a clean RL dataset where:
  - **State** = volatility + edge features for a single ticker-day.
  - **Action** = 0 (stay flat) or 1 (take a one-day long position).
  - **Reward** = next-day return for that ticker.
- Implement a simple `VAETradingEnv` class with `reset()` and `step()` methods.
- Smoke-test the environment with a random policy.

---

### What this notebook proves

1. The Volatility Alpha Engine can expose its signals as an RL-ready dataset.
2. We can simulate trades day-by-day using engineered features only (no live API calls).
3. The environment is modular, so later we can plug in:
   - Rule-based policies,
   - Supervised models,
   - Full RL agents (e.g., DQN, PPO).

## 0. Imports and Setup

In [1]:
from pathlib import Path
import duckdb
import numpy as np
import pandas as pd

# Use the exact same DB as Notebooks 1–3
DB_PATH = (Path.cwd().parent / "data" / "volatility_alpha.duckdb").as_posix()
print("Using DB:", DB_PATH)

# Close a connection left over from a previous run of this cell, if any
try:
    con.close()  # type: ignore
except NameError:
    # `con` is not defined on the first run
    pass

con = duckdb.connect(DB_PATH)

# Sanity check: list tables
con.sql("SHOW TABLES").df()

Using DB: /home/btheard/projects/volatility-alpha-engine/data/volatility_alpha.duckdb


Unnamed: 0,name
0,daily_rv
1,ohlc_bars
2,screener_features
3,screener_returns
4,screener_returns_with_target
5,screener_signals
6,screener_snapshots


## 1. Inspect engineered feature table

**What this block does**

- Reads the schema for `screener_features`.
- Confirms which columns are available as inputs to the RL state and reward.
- Sanity-checks that types (DOUBLE, VARCHAR, TIMESTAMP) look reasonable.

**Why this matters**

Before we build an RL dataset, we need to know exactly what the environment "sees":

- Numeric features (returns, volatility, rolling stats).
- Categorical features (regimes, buckets).
- The time index (`run_date`) and identifier (`ticker`).

This prevents silent bugs later where the agent is missing key signals or using the wrong target column.
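
One way to make this check explicit is to compare the schema against the columns that later cells rely on. The sketch below is illustrative only: `missing_columns` is a hypothetical helper, and the toy `schema` frame stands in for the output of `PRAGMA table_info`, which has a `name` column.

```python
import pandas as pd

def missing_columns(schema: pd.DataFrame, required: list) -> list:
    """Return the required column names that are absent from a table_info-style schema."""
    present = set(schema["name"])
    return [c for c in required if c not in present]

# Toy schema standing in for con.sql("PRAGMA table_info('screener_features')").df()
schema = pd.DataFrame({
    "name": ["run_date", "ticker", "day_pct", "move_vs_rv20"],
    "type": ["TIMESTAMP", "VARCHAR", "DOUBLE", "DOUBLE"],
})

# Columns we expect to use for state and reward (illustrative list)
required = ["run_date", "ticker", "day_pct", "vol_regime"]

missing = missing_columns(schema, required)
print("Missing columns:", missing)
```

An empty list means every expected column is present; anything else is worth resolving before building the dataset.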

In [2]:
# Look at schema
con.sql("PRAGMA table_info('screener_features')").df()

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,run_date,TIMESTAMP,False,,False
1,1,ticker,VARCHAR,False,,False
2,2,last_price,DOUBLE,False,,False
3,3,day_pct,DOUBLE,False,,False
4,4,volume,DOUBLE,False,,False
5,5,rv_20d,DOUBLE,False,,False
6,6,rv_60d,DOUBLE,False,,False
7,7,edge_score,DOUBLE,False,,False
8,8,move_vs_rv20,DOUBLE,False,,False
9,9,rv_trend,DOUBLE,False,,False


## 2. Build the RL dataset

**What this block does**

- Loads all engineered features from `screener_features`.
- Sorts the data by `ticker` and `run_date`.
- Creates `next_day_pct` as the label:
  - For each ticker, we pull the next day's `day_pct` onto the current row (`shift(-1)` within the ticker group).
  - Drops rows where we don't yet know tomorrow's return (the last date per ticker).

**Columns used**

- Features from Notebook 02 (volatility, trend, edge, liquidity).
- Target: `next_day_pct` = next-day percentage move.

**Why this matters**

This is the core RL dataset:

- Each row is a **decision point** (ticker, run_date).
- Features describe the current regime.
- `next_day_pct` tells us the payoff for taking a one-day long position.

Every later notebook (baseline policies, Q-learning) assumes this table is correct, so we validate it here.
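
To see the label construction in isolation, here is a small sketch on a made-up two-ticker frame (the tickers `AAA` / `BBB` and their returns are invented for illustration). It applies the same per-ticker `shift(-1)` and then drops the last date of each ticker.

```python
import pandas as pd

# Invented toy data: 3 days for AAA, 2 days for BBB
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "run_date": pd.to_datetime(
        ["2025-01-02", "2025-01-03", "2025-01-06", "2025-01-02", "2025-01-03"]
    ),
    "day_pct": [0.5, -1.0, 2.0, 0.3, -0.4],
}).sort_values(["ticker", "run_date"]).reset_index(drop=True)

# shift(-1) within each ticker pulls tomorrow's return onto today's row,
# so a ticker's label never leaks in from another ticker
toy["next_day_pct"] = toy.groupby("ticker")["day_pct"].shift(-1)

# The last date of each ticker has no "tomorrow", so its label is NaN and it is dropped
toy_rl = toy.dropna(subset=["next_day_pct"]).reset_index(drop=True)
print(toy_rl)
```

One row is lost per ticker, which matches the drop from 1200 to 1190 rows seen in this notebook.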

In [3]:
# Load all engineered features
df = con.sql("""
    SELECT *
    FROM screener_features
    ORDER BY ticker, run_date
""").df()

# Make sure run_date is proper datetime
df["run_date"] = pd.to_datetime(df["run_date"])

print("Rows:", len(df))
df.head()

Rows: 1200


Unnamed: 0,run_date,ticker,last_price,day_pct,volume,rv_20d,rv_60d,edge_score,move_vs_rv20,rv_trend,day_pct_ma_5,day_pct_vol_5,vol_regime,edge_bucket,liquidity_bucket
0,2025-06-12,AAPL,199.2,0.211289,43904635.0,20.597902,51.247261,1.025779,0.010258,-30.649358,,,normal,quiet,normal
1,2025-06-13,AAPL,196.45,-1.380522,51447349.0,20.945016,51.249474,6.591172,-0.065912,-30.304458,-0.584617,1.12558,normal,quiet,normal
2,2025-06-16,AAPL,198.42,1.0028,43020691.0,21.483382,51.291068,4.667793,0.046678,-29.807686,-0.055478,1.213849,normal,quiet,normal
3,2025-06-17,AAPL,195.64,-1.401068,38856152.0,21.620035,51.185641,6.480417,-0.064804,-29.565606,-0.391875,1.19789,normal,quiet,normal
4,2025-06-18,AAPL,196.58,0.480474,45394689.0,21.672431,51.134823,2.216984,0.02217,-29.462392,-0.217406,1.108334,normal,quiet,normal


In [4]:
# Create next-day return per ticker as reward target
# Assumes there's a 'day_pct' column in screener_features
df["next_day_pct"] = (
    df.groupby("ticker")["day_pct"].shift(-1)
)

# Drop rows where we don't have a next day yet
rl_df = df.dropna(subset=["next_day_pct"]).reset_index(drop=True)

print("RL rows after dropping last days:", len(rl_df))
rl_df.head()

RL rows after dropping last days: 1190


Unnamed: 0,run_date,ticker,last_price,day_pct,volume,rv_20d,rv_60d,edge_score,move_vs_rv20,rv_trend,day_pct_ma_5,day_pct_vol_5,vol_regime,edge_bucket,liquidity_bucket,next_day_pct
0,2025-06-12,AAPL,199.2,0.211289,43904635.0,20.597902,51.247261,1.025779,0.010258,-30.649358,,,normal,quiet,normal,-1.380522
1,2025-06-13,AAPL,196.45,-1.380522,51447349.0,20.945016,51.249474,6.591172,-0.065912,-30.304458,-0.584617,1.12558,normal,quiet,normal,1.0028
2,2025-06-16,AAPL,198.42,1.0028,43020691.0,21.483382,51.291068,4.667793,0.046678,-29.807686,-0.055478,1.213849,normal,quiet,normal,-1.401068
3,2025-06-17,AAPL,195.64,-1.401068,38856152.0,21.620035,51.185641,6.480417,-0.064804,-29.565606,-0.391875,1.19789,normal,quiet,normal,0.480474
4,2025-06-18,AAPL,196.58,0.480474,45394689.0,21.672431,51.134823,2.216984,0.02217,-29.462392,-0.217406,1.108334,normal,quiet,normal,2.248448


## 3. Define RL state features

**What this block does**

Builds the actual feature vector the RL agent will observe.

- **Numeric features**
  - `day_pct_ma_5`  – 5-day rolling mean of daily returns (short-term trend).
  - `day_pct_vol_5` – 5-day rolling volatility of daily returns (choppiness).
  - `move_vs_rv20`  – how big today's move is vs 20-day realized volatility.

- **Categorical features (one-hot encoded)**
  - `vol_regime_*`  – low / normal / high volatility regimes.
  - `edge_bucket_*` – quiet / active / hot edge zones.

We then:

- Assemble all state columns into `state_df`.
- Keep `next_day_pct` as the reward column.
- Print the final **state dimension** and a preview row.

**Why this matters**

This block freezes the RL "view of the world":

- Only these columns will be visible to the agent.
- Everything else in DuckDB is just raw plumbing.
- When we later swap in different agents, they all see the *same* state vector, which makes experiments comparable.
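
For the state vector to stay comparable across experiments, its width should not depend on which categories happen to appear in a given slice of data. A possible way to guarantee that, sketched here under the assumption that the category levels are exactly those listed above, is to declare the levels up front before one-hot encoding. `encode_state` and the two constants are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Assumed category levels, taken from the feature description above
VOL_REGIMES = ["low", "normal", "high"]
EDGE_BUCKETS = ["quiet", "active", "hot"]

def encode_state(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode the regime columns with a fixed set of levels."""
    out = df.copy()
    out["vol_regime"] = pd.Categorical(out["vol_regime"], categories=VOL_REGIMES)
    out["edge_bucket"] = pd.Categorical(out["edge_bucket"], categories=EDGE_BUCKETS)
    # Categorical columns yield one dummy per declared level, observed or not
    return pd.get_dummies(out, columns=["vol_regime", "edge_bucket"], dtype=float)

# Toy slice in which "low", "high" and "active" never occur
sample = pd.DataFrame({
    "move_vs_rv20": [0.01, -0.07],
    "vol_regime": ["normal", "normal"],
    "edge_bucket": ["quiet", "hot"],
})

encoded = encode_state(sample)
print(list(encoded.columns))
```

With this approach a slice that never sees a `high` regime still produces a `vol_regime_high` column of zeros, so the observation dimension is stable.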

In [5]:
# Adjust these names if your columns differ
numeric_features = [
    "day_pct_ma_5",
    "day_pct_vol_5",
    "move_vs_rv20",
]

categorical_features = [
    "vol_regime",
    "edge_bucket",
]

reward_col = "next_day_pct"

# One-hot encode categoricals
state_df = rl_df.copy()
state_df = pd.get_dummies(
    state_df,
    columns=categorical_features,
    prefix=categorical_features
)

# Build final state column list
state_cols = numeric_features + [
    c for c in state_df.columns
    if c.startswith("vol_regime_") or c.startswith("edge_bucket_")
]

print("State dimension:", len(state_cols))
state_df[state_cols + [reward_col]].head()


State dimension: 9


Unnamed: 0,day_pct_ma_5,day_pct_vol_5,move_vs_rv20,vol_regime_high,vol_regime_low,vol_regime_normal,edge_bucket_quiet,edge_bucket_active,edge_bucket_hot,next_day_pct
0,,,0.010258,False,False,True,True,False,False,-1.380522
1,-0.584617,1.12558,-0.065912,False,False,True,True,False,False,1.0028
2,-0.055478,1.213849,0.046678,False,False,True,True,False,False,-1.401068
3,-0.391875,1.19789,-0.064804,False,False,True,True,False,False,0.480474
4,-0.217406,1.108334,0.02217,False,False,True,True,False,False,2.248448
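
The first preview row has `NaN` in the rolling features because the window has no history yet. Many learning algorithms do not tolerate `NaN` observations, so one option (an assumption on our part, not something this notebook does) is to drop those warm-up rows before building the environment. The sketch uses a toy frame with rounded values.

```python
import numpy as np
import pandas as pd

numeric_features = ["day_pct_ma_5", "day_pct_vol_5", "move_vs_rv20"]

# Toy frame: the first row mimics a warm-up row with no rolling history
toy_state = pd.DataFrame({
    "day_pct_ma_5": [np.nan, -0.58, -0.06],
    "day_pct_vol_5": [np.nan, 1.13, 1.21],
    "move_vs_rv20": [0.01, -0.07, 0.05],
    "next_day_pct": [-1.38, 1.00, -1.40],
})

# Keep only rows where every numeric state feature is defined
clean_state = toy_state.dropna(subset=numeric_features).reset_index(drop=True)
print(len(toy_state), "->", len(clean_state), "rows")
```

Imputing a neutral value instead of dropping is another option; which is preferable depends on how much history each ticker has.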


## 4. RL framing: state, action, reward

We now define the trading game formally:

- **State 𝑠ₜ**
  - One row from `state_df` for a given `ticker`, `run_date`.
  - Encodes trend, volatility, and edge regime on that day.

- **Action 𝑎ₜ (discrete)**
  - `0` → stay flat (no position).
  - `1` → take a one-day long position.

- **Reward 𝑟ₜ**
  - `rₜ = action * next_day_pct`
  - `next_day_pct` is stored in percentage points, so rewards are too.
  - If we stay flat: reward = 0.
  - If we go long and the next day is +1.2%: reward = +1.2.
  - If we go long and the next day is -0.8%: reward = -0.8.
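
As a quick arithmetic check of the reward rule: the tables in this notebook store returns in percentage points (for example `-1.380522`), so the raw reward is in percentage points and dividing by 100 gives the decimal fraction. The function names below are hypothetical.

```python
def reward_pct_points(action: int, next_day_pct: float) -> float:
    """Reward in percentage points: zero when flat, next-day return when long."""
    return float(action * next_day_pct)

def to_fraction(pct_points: float) -> float:
    """Convert percentage points to a decimal fraction (1.2 -> 0.012)."""
    return pct_points / 100.0

flat = reward_pct_points(0, 1.2)         # stay flat on a +1.2% day
long_up = reward_pct_points(1, 1.2)      # long on a +1.2% day
long_down = reward_pct_points(1, -0.8)   # long on a -0.8% day

print(flat, long_up, long_down)
print(to_fraction(long_up), to_fraction(long_down))
```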

**What the `VAETradingEnv` class does**

- `reset()`  
  - Starts a new episode at the first row.
  - Returns the initial state vector.

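- `step(action)`  
  - Takes an action (0 or 1) and raises `ValueError` for anything else.
  - Computes reward from `next_day_pct` of the current row.
  - Advances to the next row. Because the data is sorted by `ticker`, then `run_date`, this is the same ticker's next day, except at a ticker boundary, where it is the next ticker's first day.
  - Flags `done=True` when we reach the last row or hit `max_steps`.
  - Returns `(next_state, reward, done, info)`; `next_state` is `None` once `done` is `True`.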

**Why this matters**

This is the production-style contract for the RL agent:

- Any algorithm that understands `reset/step` can be dropped in later.
- The environment is backed by DuckDB data but exposes a clean gym-like API.
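
To illustrate what "any algorithm that understands `reset/step`" means in practice, here is a minimal sketch of a policy-agnostic episode loop. Everything in it is hypothetical scaffolding: `StepEnv`, `run_episode`, `ToyEnv`, and `long_if_positive` are invented for the example, and `ToyEnv` is a three-step stand-in rather than the real environment.

```python
from typing import Callable, Optional, Protocol, Tuple

import numpy as np

class StepEnv(Protocol):
    """The reset/step contract described above."""
    def reset(self) -> np.ndarray: ...
    def step(self, action: int) -> Tuple[Optional[np.ndarray], float, bool, dict]: ...

def run_episode(env: StepEnv, policy: Callable[[np.ndarray], int]) -> float:
    """Run one episode and return the total reward."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        next_state, reward, done, _ = env.step(policy(state))
        total += reward
        state = next_state
    return total

class ToyEnv:
    """Three-step stand-in: state = [signal], reward = action * payoff."""
    def __init__(self):
        self.signals = [0.5, -0.2, 0.9]
        self.payoffs = [1.0, -2.0, 0.5]
        self.i = 0

    def reset(self):
        self.i = 0
        return np.array([self.signals[0]], dtype="float32")

    def step(self, action):
        reward = float(action * self.payoffs[self.i])
        self.i += 1
        done = self.i >= len(self.payoffs)
        state = None if done else np.array([self.signals[self.i]], dtype="float32")
        return state, reward, done, {}

def long_if_positive(state: np.ndarray) -> int:
    """Rule-based policy: go long only when the signal is positive."""
    return int(state[0] > 0)

def always_long(state: np.ndarray) -> int:
    return 1

print("rule-based :", run_episode(ToyEnv(), long_if_positive))
print("always long:", run_episode(ToyEnv(), always_long))
```

The loop never inspects the environment's internals, so swapping the policy function is the only change needed to compare strategies.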

In [6]:
class VAETradingEnv:
    """
    Minimal trading environment for the Volatility Alpha Engine.

    - Observations: state vector built from engineered features.
    - Actions: 0 = flat, 1 = long for one day.
    - Reward: action * next_day_pct.
    """

    def __init__(self, data, state_cols, reward_col="next_day_pct", max_steps=None):
        self.data = data.reset_index(drop=True)
        self.state_cols = state_cols
        self.reward_col = reward_col

        self.n_steps_total = len(self.data)
        self.max_steps = max_steps or (self.n_steps_total - 1)

        # Expose spaces (Gym-style, but kept lightweight)
        self.action_space_n = 2
        self.observation_dim = len(self.state_cols)

        # Internal
        self.idx = 0
        self.step_count = 0

    def _get_state(self):
        row = self.data.loc[self.idx, self.state_cols]
        return row.values.astype("float32")

    def reset(self):
        """Start a new episode from the beginning."""
        self.idx = 0
        self.step_count = 0
        return self._get_state()

    def step(self, action: int):
        """
        Advance one step.

        Parameters
        ----------
        action : int
            0 = flat, 1 = long.

        Returns
        -------
        next_state, reward, done, info
        """
        # Validate the action (invalid values raise rather than being clipped)
        action = int(action)
        if action not in (0, 1):
            raise ValueError(f"Invalid action {action}, expected 0 or 1.")

        # Reward based on next day's return
        reward = float(action * self.data.loc[self.idx, self.reward_col])

        # Move forward
        self.idx += 1
        self.step_count += 1

        done = False
        if self.idx >= self.n_steps_total - 1:
            done = True
        if self.step_count >= self.max_steps:
            done = True

        next_state = self._get_state() if not done else None
        info = {
            "run_date": self.data.loc[self.idx - 1, "run_date"],
            "ticker": self.data.loc[self.idx - 1, "ticker"],
        }

        return next_state, reward, done, info
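
Because the table is sorted by ticker and then by date, a single episode over the whole frame walks from one ticker's last day straight into the next ticker's first day. If per-ticker episodes are preferred, one possible approach is to split the frame first and build one environment per ticker. `split_by_ticker` is a hypothetical helper, and the toy frame is invented.

```python
import pandas as pd

def split_by_ticker(data: pd.DataFrame) -> dict:
    """Return {ticker: date-sorted frame}, one entry per ticker."""
    return {
        ticker: grp.sort_values("run_date").reset_index(drop=True)
        for ticker, grp in data.groupby("ticker")
    }

# Invented toy data with two tickers
toy = pd.DataFrame({
    "ticker": ["AAA", "BBB", "AAA"],
    "run_date": pd.to_datetime(["2025-01-03", "2025-01-02", "2025-01-02"]),
    "next_day_pct": [2.0, -0.4, -1.0],
})

episodes = split_by_ticker(toy)
for ticker, frame in episodes.items():
    print(ticker, len(frame), "rows")
```

Each frame could then be passed as the `data` argument when constructing an environment, so that no episode crosses a ticker boundary.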

## 5. Smoke test: random policy

**What this block does**

- Instantiates `VAETradingEnv` with our RL dataset.
- Runs one short episode where actions are random:
  - Flip a coin each step → 0 (flat) or 1 (long).
- Tracks:
  - Number of steps taken,
  - Total reward from the random policy,
  - A small sample of `(action, reward, ticker, run_date)` rows.

**Why this matters**

This is a sanity test that the environment is wired correctly:

- `reset()` and `step()` run without crashing.
- Rewards equal `next_day_pct` when long and 0 when flat.
- Dates advance one row at a time within a ticker.

If a *random* policy can run end-to-end, we're safe to plug in smarter strategies next (baseline rules, Q-learning, deep RL).
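
A random policy gives a different total on every run unless the generator is seeded. The sketch below shows one way to make the smoke test repeatable and to put its total next to two trivial baselines. The seed value is arbitrary, and the five returns are the first five `next_day_pct` values shown for AAPL earlier in this notebook.

```python
import numpy as np

# First five next_day_pct values for AAPL from the tables above
next_day_pct = np.array([-1.380522, 1.0028, -1.401068, 0.480474, 2.248448])

# Seeded generator: the same seed reproduces the same action sequence
rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily
random_actions = rng.integers(0, 2, size=len(next_day_pct))  # each entry is 0 or 1

random_total = float(np.sum(random_actions * next_day_pct))
always_long_total = float(np.sum(next_day_pct))  # action = 1 every day
always_flat_total = 0.0                          # action = 0 every day

print("random     :", round(random_total, 4))
print("always long:", round(always_long_total, 4))
print("always flat:", always_flat_total)
```

Comparing against always-long and always-flat gives a rough reference for judging whether a later policy adds anything beyond holding the position.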

In [7]:
env = VAETradingEnv(state_df, state_cols, reward_col=reward_col, max_steps=50)

state = env.reset()
rewards = []
actions = []
infos = []

for t in range(50):
    action = np.random.randint(0, 2)  # 0 or 1
    next_state, reward, done, info = env.step(action)

    actions.append(action)
    rewards.append(reward)
    infos.append(info)

    if done:
        break
    state = next_state

print(f"Steps taken: {len(rewards)}")
print(f"Total reward from random policy: {np.nansum(rewards):.4f}")
pd.DataFrame({
    "action": actions,
    "reward": rewards,
    "ticker": [i["ticker"] for i in infos],
    "run_date": [i["run_date"] for i in infos],
}).head()

Steps taken: 50
Total reward from random policy: 2.1979


Unnamed: 0,action,reward,ticker,run_date
0,1,-1.380522,AAPL,2025-06-12
1,0,0.0,AAPL,2025-06-13
2,1,-1.401068,AAPL,2025-06-16
3,1,0.480474,AAPL,2025-06-17
4,1,2.248448,AAPL,2025-06-18


In [8]:
con.close()

## Skills shown

**What this notebook demonstrates**

- Turning a DuckDB feature table into a structured RL dataset.
- Defining a clear state / action / reward mapping for a trading system.
- Implementing a minimal RL environment (`VAETradingEnv`) with `reset()` / `step()` semantics.
- Validating the environment with a simple random policy and inspecting the resulting rewards.

**How this scales in production**

- This same environment class could sit behind:
  - Rule-based strategies (signal thresholds),
  - Supervised models (signal scoring),
  - Deep RL agents (DQN, PPO, etc.).
- The data source could be swapped from DuckDB to:
  - BigQuery views,
  - Kafka / PubSub streams,
  - or live broker APIs,

with no changes to the agent code itself. Only the data plumbing would change.