# Notebook 04 — RL Environment Skeleton

This notebook turns the engineered feature table into a small reinforcement learning (RL) playground.

We will:

- Load the `screener_features` table from DuckDB.
- Build a clean RL dataset with:
  - **State** = volatility / edge features for a single ticker-day.
  - **Action** = 0 (stay flat) or 1 (take a long position for 1 day).
  - **Reward** = next-day return for that ticker when long, 0 when flat.
- Implement a simple `VAETradingEnv` class with `reset()` and `step()`.
- Smoke-test the environment with a random policy.

---

### What this notebook proves

1. The Volatility Alpha Engine can expose its signals as an RL-ready dataset.
2. We can simulate trades day-by-day using engineered features only (no live API calls).
3. The environment is modular, so later we can plug in:
   - Rule-based policies,
   - Supervised models,
   - Full RL agents (e.g. DQN, PPO).

In [1]:
from pathlib import Path
import duckdb
import numpy as np
import pandas as pd

# Use the exact same DB as Notebooks 1–3
DB_PATH = (Path.cwd().parent / "data" / "volatility_alpha.duckdb").as_posix()
print("Using DB:", DB_PATH)

# Close any connection left over from a previous run of this cell
try:
    con.close()  # type: ignore
except NameError:
    pass  # first run: no earlier connection exists

con = duckdb.connect(DB_PATH)

# Sanity check: list tables
con.sql("SHOW TABLES").df()

Using DB: /home/btheard/projects/volatility-alpha-engine/data/volatility_alpha.duckdb


Unnamed: 0,name
0,screener_features
1,screener_returns
2,screener_returns_with_target
3,screener_signals
4,screener_snapshots


## 1. Inspect engineered feature table

First we confirm which columns are available in `screener_features`. These become inputs to our RL state and reward.

In [2]:
# Look at schema
con.sql("PRAGMA table_info('screener_features')").df()

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,run_date,TIMESTAMP,False,,False
1,1,ticker,VARCHAR,False,,False
2,2,last_price,DOUBLE,False,,False
3,3,day_pct,DOUBLE,False,,False
4,4,volume,DOUBLE,False,,False
5,5,rv_20d,DOUBLE,False,,False
6,6,rv_60d,DOUBLE,False,,False
7,7,edge_score,DOUBLE,False,,False
8,8,move_vs_rv20,DOUBLE,False,,False
9,9,rv_trend,DOUBLE,False,,False


## 2. Build the RL dataset

We load the engineered features, sort by `ticker` and `run_date`, and create:

- `next_day_pct` = next day's percentage move for each ticker.
- A set of numeric and categorical features that will form the RL **state**.
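
The per-ticker shift used to build `next_day_pct` can be illustrated on a toy frame. This is a minimal sketch: the tickers `AAA` / `BBB` and all values are made up for illustration and are not taken from `screener_features`.

```python
import pandas as pd

# Toy frame with made-up values (not taken from screener_features).
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "run_date": pd.to_datetime(
        ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-01", "2025-01-02"]
    ),
    "day_pct": [1.0, -2.0, 0.5, 3.0, -1.0],
})

# Sort so that "next row within a ticker" really means "next date".
toy = toy.sort_values(["ticker", "run_date"]).reset_index(drop=True)

# shift(-1) inside each ticker group pulls the following day's move up one row,
# so values never leak from one ticker into another.
toy["next_day_pct"] = toy.groupby("ticker")["day_pct"].shift(-1)

# The last row of each ticker has no following day, so it is NaN and dropped.
labelled = toy.dropna(subset=["next_day_pct"]).reset_index(drop=True)
print(labelled)
```

Each ticker loses exactly one row (its most recent day), which is why the RL dataset is always smaller than the feature table by the number of tickers.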


In [3]:
# Load all engineered features
df = con.sql("""
    SELECT *
    FROM screener_features
    ORDER BY ticker, run_date
""").df()

# Make sure run_date is proper datetime
df["run_date"] = pd.to_datetime(df["run_date"])

print("Rows:", len(df))
df.head()

Rows: 10


Unnamed: 0,run_date,ticker,last_price,day_pct,volume,rv_20d,rv_60d,edge_score,move_vs_rv20,rv_trend,day_pct_ma_5,day_pct_vol_5,vol_regime,edge_bucket,liquidity_bucket
0,2025-11-30,AMD,217.529999,1.535658,18658000.0,68.69167,74.422502,35.113664,0.022356,-5.730832,,,high,hot,normal
1,2025-12-01,AMD,215.660004,-0.859649,3312950.0,68.69167,74.422502,34.77566,-0.012515,-5.730832,0.338004,1.693738,high,hot,thin
2,2025-11-30,NVDA,177.0,-1.808496,121332800.0,41.973659,38.08171,21.891077,-0.043086,3.891949,,,normal,active,thick
3,2025-12-01,NVDA,176.332199,-0.377289,22401443.0,41.973659,38.08171,21.175474,-0.008989,3.891949,-1.092892,1.012016,normal,active,normal
4,2025-11-30,QQQ,619.25,0.810715,23034400.0,21.501747,17.302496,11.156231,0.037705,4.199251,,,low,active,thick


In [4]:
# Create next-day return per ticker as reward target
# Assumes there's a 'day_pct' column in screener_features
df["next_day_pct"] = (
    df.groupby("ticker")["day_pct"].shift(-1)
)

# Drop rows where we don't have a next day yet
rl_df = df.dropna(subset=["next_day_pct"]).reset_index(drop=True)

print("RL rows after dropping last days:", len(rl_df))
rl_df.head()

RL rows after dropping last days: 5


Unnamed: 0,run_date,ticker,last_price,day_pct,volume,rv_20d,rv_60d,edge_score,move_vs_rv20,rv_trend,day_pct_ma_5,day_pct_vol_5,vol_regime,edge_bucket,liquidity_bucket,next_day_pct
0,2025-11-30,AMD,217.529999,1.535658,18658000.0,68.69167,74.422502,35.113664,0.022356,-5.730832,,,high,hot,normal,-0.859649
1,2025-11-30,NVDA,177.0,-1.808496,121332800.0,41.973659,38.08171,21.891077,-0.043086,3.891949,,,normal,active,thick,-0.377289
2,2025-11-30,QQQ,619.25,0.810715,23034400.0,21.501747,17.302496,11.156231,0.037705,4.199251,,,low,active,thick,-0.952769
3,2025-11-30,SPY,683.390015,0.545848,49212000.0,14.996082,12.424457,7.770965,0.036399,2.571625,,,low,quiet,thick,-0.61312
4,2025-11-30,TSLA,430.170013,0.841584,36252900.0,53.373477,51.377442,27.10753,0.015768,1.996035,,,high,hot,thick,-0.578846


## 3. Define RL state features

For each ticker-day, we use:

- **Numeric**:
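  - `day_pct_ma_5` – short-term trend in returns (rolling mean of `day_pct`; NaN on a ticker's first row).
  - `day_pct_vol_5` – short-term choppiness (rolling standard deviation of `day_pct`; NaN on a ticker's first row).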
  - `move_vs_rv20` – how big today's move is vs recent vol.

- **Categorical (one-hot encoded)**:
  - `vol_regime` – low / normal / high volatility regime.
  - `edge_bucket` – quiet / active / hot edge zones.

These together form the RL **state vector**.
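
One possible hardening of this encoding, sketched under stated assumptions: the level lists below are copied from the feature descriptions above and are assumed to be complete, the helper name `encode_state` is hypothetical, and filling missing rolling features with `0.0` is only one of several reasonable choices. Declaring the levels up front keeps the state dimension fixed even when a level is absent from a given slice of data.

```python
import numpy as np
import pandas as pd

# Levels copied from the feature descriptions; treating them as the complete
# set is an assumption.
VOL_REGIMES = ["low", "normal", "high"]
EDGE_BUCKETS = ["quiet", "active", "hot"]


def encode_state(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one-hot encode with a fixed set of levels."""
    out = frame.copy()
    out["vol_regime"] = pd.Categorical(out["vol_regime"], categories=VOL_REGIMES)
    out["edge_bucket"] = pd.Categorical(out["edge_bucket"], categories=EDGE_BUCKETS)
    # With categorical dtype, get_dummies emits one column per declared level,
    # including levels that never occur in this frame.
    out = pd.get_dummies(out, columns=["vol_regime", "edge_bucket"], dtype="float32")
    # Rolling features are NaN until enough history exists; 0.0 is an assumed
    # placeholder, not necessarily the right fill for every feature.
    rolling = ["day_pct_ma_5", "day_pct_vol_5"]
    out[rolling] = out[rolling].fillna(0.0)
    return out


toy = pd.DataFrame({
    "day_pct_ma_5": [np.nan, 0.3],
    "day_pct_vol_5": [np.nan, 1.7],
    "move_vs_rv20": [0.02, -0.01],
    "vol_regime": ["high", "high"],    # "low" and "normal" never appear
    "edge_bucket": ["hot", "active"],  # "quiet" never appears
})
encoded = encode_state(toy)
print(list(encoded.columns))
```

With this approach an agent trained on one date range sees the same observation layout on any other date range.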

In [5]:
# Adjust these names if your columns differ
numeric_features = [
    "day_pct_ma_5",
    "day_pct_vol_5",
    "move_vs_rv20",
]

categorical_features = [
    "vol_regime",
    "edge_bucket",
]

reward_col = "next_day_pct"

# One-hot encode categoricals
state_df = rl_df.copy()
state_df = pd.get_dummies(
    state_df,
    columns=categorical_features,
    prefix=categorical_features
)

# Build final state column list
state_cols = numeric_features + [
    c for c in state_df.columns
    if c.startswith("vol_regime_") or c.startswith("edge_bucket_")
]

print("State dimension:", len(state_cols))
state_df[state_cols + [reward_col]].head()


State dimension: 9


Unnamed: 0,day_pct_ma_5,day_pct_vol_5,move_vs_rv20,vol_regime_high,vol_regime_low,vol_regime_normal,edge_bucket_quiet,edge_bucket_active,edge_bucket_hot,next_day_pct
0,,,0.022356,True,False,False,False,False,True,-0.859649
1,,,-0.043086,False,False,True,False,True,False,-0.377289
2,,,0.037705,False,True,False,False,True,False,-0.952769
3,,,0.036399,False,True,False,True,False,False,-0.61312
4,,,0.015768,True,False,False,False,False,True,-0.578846


## 4. RL framing: state, action, reward

- **State** `s_t`:
  - The row `state_df.loc[t, state_cols]` for a given `ticker`, `run_date`.
  - Encodes recent trend, volatility, and edge regime.

- **Action** `a_t` (discrete):
  - `0` = stay flat (no position).
  - `1` = take a 1-day long position in this ticker.

- **Reward** `r_t`:
  - `r_t = a_t * next_day_pct`, in the same units as `day_pct` (percentage points).
  - If we go long and the next day is +1.2%, reward = `+1.2`.
  - If we go long and the next day is -0.8%, reward = `-0.8`.
  - If we stay flat, reward is `0`.

The environment walks forward through the feature table row by row (ordered by `ticker`, then `run_date`, so one episode
runs through every ticker in turn) and lets an agent decide when to take a one-day long bet based on volatility and edge signals.
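
The reward rule can be written as a small standalone function. This is a sketch only: `step_reward` and `pct_to_fraction` are hypothetical names, and the example moves are illustrative. The `day_pct` values in the tables above are expressed in percent (for example `1.535658`), so rewards computed from them are in percentage points; dividing by 100 gives a fractional return.

```python
def step_reward(action: int, next_day_pct: float) -> float:
    """Reward in the same units as next_day_pct (percentage points)."""
    if action not in (0, 1):
        raise ValueError(f"Invalid action {action}, expected 0 or 1.")
    return float(action) * float(next_day_pct)


def pct_to_fraction(pct: float) -> float:
    """Hypothetical convenience: convert a percent value to a fraction."""
    return pct / 100.0


long_up = step_reward(1, 1.2)     # long into a +1.2% day
long_down = step_reward(1, -0.8)  # long into a -0.8% day
flat = step_reward(0, -0.8)       # flat: the move is ignored

print(long_up, long_down, flat)
print(pct_to_fraction(long_up))
```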

In [6]:
class VAETradingEnv:
    """
    Minimal trading environment for the Volatility Alpha Engine.

    - Observations: state vector built from engineered features.
    - Actions: 0 = flat, 1 = long for one day.
    - Reward: action * next_day_pct.
    """

    def __init__(self, data, state_cols, reward_col="next_day_pct", max_steps=None):
        self.data = data.reset_index(drop=True)
        self.state_cols = state_cols
        self.reward_col = reward_col

        self.n_steps_total = len(self.data)
        self.max_steps = max_steps or (self.n_steps_total - 1)

        # Expose spaces (Gym-style, but kept lightweight)
        self.action_space_n = 2
        self.observation_dim = len(self.state_cols)

        # Internal
        self.idx = 0
        self.step_count = 0

    def _get_state(self):
        row = self.data.loc[self.idx, self.state_cols]
        return row.values.astype("float32")

    def reset(self):
        """Start a new episode from the beginning."""
        self.idx = 0
        self.step_count = 0
        return self._get_state()

    def step(self, action: int):
        """
        Advance one step.

        Parameters
        ----------
        action : int
            0 = flat, 1 = long.

        Returns
        -------
        next_state, reward, done, info
        """
        # Validate the action: anything other than 0 or 1 raises
        action = int(action)
        if action not in (0, 1):
            raise ValueError(f"Invalid action {action}, expected 0 or 1.")

        # Reward based on next day's return
        reward = float(action * self.data.loc[self.idx, self.reward_col])

        # Move forward
        self.idx += 1
        self.step_count += 1

        # The episode ends once only the final row remains or the step budget
        # is used up, so the last row of `data` is never acted on.
        done = (
            self.idx >= self.n_steps_total - 1
            or self.step_count >= self.max_steps
        )

        next_state = self._get_state() if not done else None
        info = {
            "run_date": self.data.loc[self.idx - 1, "run_date"],
            "ticker": self.data.loc[self.idx - 1, "ticker"],
        }

        return next_state, reward, done, info

## 5. Smoke test: random policy

To confirm the environment works end-to-end, we run one short episode
with a random policy (coin flip between flat and long) and inspect:

- Number of steps
- Total reward
- A few sample transitions
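
A rollout loop like this can be factored into a reusable helper with a seeded policy, so that a smoke test gives the same actions on every run. The sketch below is an assumption-laden illustration: `StubEnv`, `run_episode`, and `make_random_policy` are hypothetical names, the stub's rewards are made-up numbers, and its termination rule is simplified relative to `VAETradingEnv`.

```python
import random
from typing import Callable, List, Optional, Tuple


class StubEnv:
    """Stand-in with the same reset()/step() shape as the environment above.

    Rewards are made-up numbers used only to exercise the loop, and the
    episode simply ends when every row has been consumed.
    """

    def __init__(self, rewards: List[float]):
        self.rewards = rewards
        self.idx = 0

    def reset(self) -> int:
        self.idx = 0
        return self.idx

    def step(self, action: int) -> Tuple[Optional[int], float, bool, dict]:
        reward = float(action) * self.rewards[self.idx]
        self.idx += 1
        done = self.idx >= len(self.rewards)
        next_state = None if done else self.idx
        return next_state, reward, done, {"row": self.idx - 1}


def run_episode(env, policy: Callable[[object], int], max_steps: int = 50):
    """Roll a policy through one episode; return (actions, rewards)."""
    state = env.reset()
    actions, rewards = [], []
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done, _info = env.step(action)
        actions.append(action)
        rewards.append(reward)
        if done:
            break
    return actions, rewards


def make_random_policy(seed: int) -> Callable[[object], int]:
    rng = random.Random(seed)  # private RNG, so runs are repeatable
    return lambda _state: rng.randint(0, 1)


env = StubEnv([1.0, -2.0, 0.5])
a1, r1 = run_episode(env, make_random_policy(seed=0))
a2, r2 = run_episode(env, make_random_policy(seed=0))
print(a1, sum(r1))
```

Because the helper only relies on `reset()` and `step()`, any object exposing those two methods with the same return shape could be passed in place of the stub.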

In [7]:
env = VAETradingEnv(state_df, state_cols, reward_col=reward_col, max_steps=50)

state = env.reset()
rewards = []
actions = []
infos = []

for t in range(50):
    action = np.random.randint(0, 2)  # 0 or 1
    next_state, reward, done, info = env.step(action)

    actions.append(action)
    rewards.append(reward)
    infos.append(info)

    if done:
        break
    state = next_state

print(f"Steps taken: {len(rewards)}")
print(f"Total reward from random policy: {np.nansum(rewards):.4f}")
pd.DataFrame({
    "action": actions,
    "reward": rewards,
    "ticker": [i["ticker"] for i in infos],
    "run_date": [i["run_date"] for i in infos],
}).head()

Steps taken: 4
Total reward from random policy: -0.3773


Unnamed: 0,action,reward,ticker,run_date
0,0,-0.0,AMD,2025-11-30
1,1,-0.377289,NVDA,2025-11-30
2,0,-0.0,QQQ,2025-11-30
3,0,-0.0,SPY,2025-11-30


In [8]:
con.close()

## Skills Shown

**What this notebook demonstrates**

- I can turn an engineered DuckDB feature table into a **structured RL dataset**.
- I define a clear **state / action / reward** mapping for a trading system.
- I implement a reusable RL environment (`VAETradingEnv`) with `reset()` and `step()` semantics.
- I validate the environment with a simple **random policy** and inspect the resulting rewards.

**How this would scale in production**

- The same environment can sit behind:
  - Rule-based policies (for baseline systems),
  - Supervised models (signal scoring),
  - Deep RL agents (DQN, PPO, etc.).
- The environment consumes a pandas DataFrame, which this notebook loads from DuckDB; in production we could swap in:
  - BigQuery views,
  - Kafka / PubSub streams,
  - or live broker APIs, with no change to the agent code.
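
The rule-based baseline mentioned above can be sketched as a plain function from state vector to action. This is an illustration rather than a recommended strategy: `make_threshold_policy` is a hypothetical name, the feature index assumes the numeric features come first in the state vector (as in `state_cols`), and the `0.02` cut-off is arbitrary.

```python
from typing import Callable, Sequence

# A policy is anything that maps a state vector to an action (0 or 1).
Policy = Callable[[Sequence[float]], int]


def make_threshold_policy(feature_index: int, threshold: float) -> Policy:
    """Go long (1) when one state feature exceeds a threshold, else stay flat (0)."""
    def policy(state: Sequence[float]) -> int:
        return 1 if state[feature_index] > threshold else 0
    return policy


# Index 2 is assumed to be `move_vs_rv20` (third numeric feature);
# the cut-off value is an arbitrary illustration.
policy = make_threshold_policy(feature_index=2, threshold=0.02)

example_states = [
    [0.0, 0.0, 0.037705],   # move_vs_rv20 above the cut-off
    [0.0, 0.0, -0.043086],  # move_vs_rv20 below the cut-off
]
actions = [policy(s) for s in example_states]
print(actions)
```

Since the policy depends only on the state vector, the same function could be reused whichever data source feeds the environment, provided the state layout stays the same.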
