Notebook 03 — Build Returns + Targets (Production Version)

Purpose:
This notebook takes the engineered features from Notebook 2 and builds the next-day returns + training labels needed for all future ML/RL work.

It produces two key tables:

1. screener_returns — daily price + next-day return

2. screener_returns_with_target — full feature row + target return

Everything downstream (Notebooks 4–6) depends on these tables.
This version is backfill-safe and works with 2 days or 200 days of history.
It answers: **Is there any measurable directional edge in the feature set?**


In [1]:
from pathlib import Path
import duckdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Consistent chart size
plt.rcParams["figure.figsize"] = (10, 6)

# Use same DB as earlier notebooks
DB_PATH = (Path.cwd().parent / "data" / "volatility_alpha.duckdb").as_posix()
print("Using DB:", DB_PATH)

# Close old connection if it exists (e.g. when re-running this cell)
try:
    con.close()  # type: ignore
except NameError:
    # First run: no connection has been opened yet
    pass

con = duckdb.connect(DB_PATH)

# Quick check
con.sql("SHOW TABLES;").df()

Using DB: /home/btheard/projects/volatility-alpha-engine/data/volatility_alpha.duckdb


Unnamed: 0,name
0,screener_features
1,screener_returns
2,screener_returns_with_target
3,screener_signals
4,screener_snapshots


## 1. Build clean daily price + return table (screener_returns)

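We compute the next-day return for each ticker by comparing its last price on one run_date with its last price on the following run_date (LEAD over run_date, partitioned by ticker):
next_day_return_pct = (next_last_price - last_price) / last_price * 100.
If no later row exists for a ticker, the return is NULL, which is normal for the latest run.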

In [2]:
con.sql("""
CREATE OR REPLACE TABLE screener_returns AS
SELECT
    run_date,
    ticker,
    last_price,
    edge_score,
    rv_20d,
    rv_60d,
    vol_regime,
    edge_bucket,
    liquidity_bucket,
    
    -- Tomorrow's last price
    LEAD(last_price) OVER (
        PARTITION BY ticker
        ORDER BY run_date
    ) AS next_last_price,

    -- Compute next-day % return
    ((LEAD(last_price) OVER (
        PARTITION BY ticker
        ORDER BY run_date
    ) - last_price) / last_price) * 100 AS next_day_return_pct

FROM screener_features
ORDER BY run_date, ticker;
""")

# Quick preview
con.sql("SELECT * FROM screener_returns LIMIT 10;").df()

Unnamed: 0,run_date,ticker,last_price,edge_score,rv_20d,rv_60d,vol_regime,edge_bucket,liquidity_bucket,next_last_price,next_day_return_pct
0,2025-11-30,AMD,217.529999,35.113664,68.69167,74.422502,high,hot,normal,215.660004,-0.859649
1,2025-11-30,NVDA,177.0,21.891077,41.973659,38.08171,normal,active,thick,176.332199,-0.377289
2,2025-11-30,QQQ,619.25,11.156231,21.501747,17.302496,low,active,thick,613.349976,-0.952769
3,2025-11-30,SPY,683.390015,7.770965,14.996082,12.424457,low,quiet,thick,679.200012,-0.61312
4,2025-11-30,TSLA,430.170013,27.10753,53.373477,51.377442,high,hot,thick,427.679993,-0.578846
5,2025-12-01,AMD,215.660004,34.77566,68.69167,74.422502,high,hot,thin,,
6,2025-12-01,NVDA,176.332199,21.175474,41.973659,38.08171,normal,active,normal,,
7,2025-12-01,QQQ,613.349976,11.227258,21.501747,17.302496,low,active,thin,,
8,2025-12-01,SPY,679.200012,7.804601,14.996082,12.424457,low,quiet,thin,,
9,2025-12-01,TSLA,427.679993,26.976161,53.373477,51.377442,high,hot,normal,,
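
As a cross-check on the window logic, the same calculation can be sketched in pandas with a per-ticker shift. This is a minimal illustration, not part of the pipeline: the prices are copied from the preview above, and the frame name `prices` is chosen here for the example.

```python
import pandas as pd

# Two run dates for two tickers, copied from the preview above.
prices = pd.DataFrame({
    "run_date": pd.to_datetime(
        ["2025-11-30", "2025-11-30", "2025-12-01", "2025-12-01"]
    ),
    "ticker": ["AMD", "SPY", "AMD", "SPY"],
    "last_price": [217.529999, 683.390015, 215.660004, 679.200012],
})

# Equivalent of LEAD(last_price) OVER (PARTITION BY ticker ORDER BY run_date):
# sort within ticker, then pull the next row's price up by one position.
prices = prices.sort_values(["ticker", "run_date"]).reset_index(drop=True)
prices["next_last_price"] = prices.groupby("ticker")["last_price"].shift(-1)

# Percentage change from this run's price to the next run's price.
prices["next_day_return_pct"] = (
    (prices["next_last_price"] - prices["last_price"]) / prices["last_price"] * 100
)

print(prices)
```

The rows for the latest run_date end up with NaN in both new columns, which mirrors the NULLs in the SQL output.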


## 2. Join features + future return → Final training table

This step merges all engineered features with next-day returns, producing the full row each model/RL agent will train on.

In [3]:
con.sql("""
CREATE OR REPLACE TABLE screener_returns_with_target AS
SELECT
    f.*,
    r.next_last_price,
    r.next_day_return_pct
FROM screener_features f
LEFT JOIN screener_returns r
    ON f.run_date = r.run_date AND f.ticker = r.ticker
ORDER BY f.run_date, f.ticker;
""")

# Verify
con.sql("SELECT * FROM screener_returns_with_target LIMIT 10;").df()

Unnamed: 0,run_date,ticker,last_price,day_pct,volume,rv_20d,rv_60d,edge_score,move_vs_rv20,rv_trend,day_pct_ma_5,day_pct_vol_5,vol_regime,edge_bucket,liquidity_bucket,next_last_price,next_day_return_pct
0,2025-11-30,AMD,217.529999,1.535658,18658000.0,68.69167,74.422502,35.113664,0.022356,-5.730832,,,high,hot,normal,215.660004,-0.859649
1,2025-11-30,NVDA,177.0,-1.808496,121332800.0,41.973659,38.08171,21.891077,-0.043086,3.891949,,,normal,active,thick,176.332199,-0.377289
2,2025-11-30,QQQ,619.25,0.810715,23034400.0,21.501747,17.302496,11.156231,0.037705,4.199251,,,low,active,thick,613.349976,-0.952769
3,2025-11-30,SPY,683.390015,0.545848,49212000.0,14.996082,12.424457,7.770965,0.036399,2.571625,,,low,quiet,thick,679.200012,-0.61312
4,2025-11-30,TSLA,430.170013,0.841584,36252900.0,53.373477,51.377442,27.10753,0.015768,1.996035,,,high,hot,thick,427.679993,-0.578846
5,2025-12-01,AMD,215.660004,-0.859649,3312950.0,68.69167,74.422502,34.77566,-0.012515,-5.730832,0.338004,1.693738,high,hot,thin,,
6,2025-12-01,NVDA,176.332199,-0.377289,22401443.0,41.973659,38.08171,21.175474,-0.008989,3.891949,-1.092892,1.012016,normal,active,normal,,
7,2025-12-01,QQQ,613.349976,-0.952769,5581522.0,21.501747,17.302496,11.227258,-0.044311,4.199251,-0.071027,1.246972,low,active,thin,,
8,2025-12-01,SPY,679.200012,-0.61312,4587375.0,14.996082,12.424457,7.804601,-0.040885,2.571625,-0.033636,0.819515,low,quiet,thin,,
9,2025-12-01,TSLA,427.679993,-0.578846,6627841.0,53.373477,51.377442,26.976161,-0.010845,1.996035,0.131369,1.004395,high,hot,normal,,
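
The join assumes one row per (run_date, ticker) on both sides; a duplicate key in either table would silently multiply rows in the training table. A hedged way to test that assumption is pandas' `validate` argument to `merge`, sketched below on a tiny hand-made sample (the frames `features` and `returns` are stand-ins for the two DuckDB tables, not the real data load).

```python
import pandas as pd

keys = ["run_date", "ticker"]

# Stand-in for screener_features (values taken from the preview above).
features = pd.DataFrame({
    "run_date": ["2025-11-30", "2025-12-01"],
    "ticker": ["AMD", "AMD"],
    "edge_score": [35.113664, 34.77566],
})

# Stand-in for screener_returns; the latest run has no target yet.
returns = pd.DataFrame({
    "run_date": ["2025-11-30", "2025-12-01"],
    "ticker": ["AMD", "AMD"],
    "next_day_return_pct": [-0.859649, None],
})

# validate="one_to_one" raises pandas.errors.MergeError if either side
# has duplicate keys, instead of quietly producing extra rows.
merged = features.merge(returns, on=keys, how="left", validate="one_to_one")

# A left join on unique keys must preserve the feature row count.
assert len(merged) == len(features)
print(merged)
```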


## 3. Basic Sanity Checks

We’re checking:

- no missing run_date or ticker

- returns exist for all but last date

- features match the expected ranges

In [4]:
df = con.sql("SELECT * FROM screener_returns_with_target").df()

print("Rows:", len(df))
df.describe(include="all")


Rows: 10


Unnamed: 0,run_date,ticker,last_price,day_pct,volume,rv_20d,rv_60d,edge_score,move_vs_rv20,rv_trend,day_pct_ma_5,day_pct_vol_5,vol_regime,edge_bucket,liquidity_bucket,next_last_price,next_day_return_pct
count,10,10,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,5.0,5.0,10,10,10,5.0,5.0
unique,,5,,,,,,,,,,,3,3,3,,
top,,AMD,,,,,,,,,,,high,active,thick,,
freq,,2,,,,,,,,,,,4,4,4,,
mean,2025-11-30 12:00:00,,423.956221,-0.145636,29100120.0,40.107327,38.721721,20.499862,-0.00484,1.385606,-0.145636,1.155327,,,,422.444437,-0.676335
min,2025-11-30 00:00:00,,176.332199,-1.808496,3312950.0,14.996082,12.424457,7.770965,-0.044311,-5.730832,-1.092892,0.819515,,,,176.332199,-0.952769
25%,2025-11-30 00:00:00,,216.127502,-0.798017,5843102.0,21.501747,17.302496,11.173988,-0.033793,1.996035,-0.071027,1.004395,,,,215.660004,-0.859649
50%,2025-11-30 12:00:00,,428.925003,-0.478067,20529720.0,41.973659,38.08171,21.533276,-0.009917,2.571625,-0.033636,1.012016,,,,427.679993,-0.61312
75%,2025-12-01 00:00:00,,617.774994,0.744498,32948280.0,53.373477,51.377442,27.074688,0.020709,3.891949,0.131369,1.246972,,,,613.349976,-0.578846
max,2025-12-01 00:00:00,,683.390015,1.535658,121332800.0,68.69167,74.422502,35.113664,0.037705,4.199251,0.338004,1.693738,,,,679.200012,-0.377289
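
The describe() output has to be read by eye. The three checks listed above could also be made explicit, roughly as sketched below. The helper name `check_training_table` and the bounds in `ASSUMED_RANGES` are hypothetical and only illustrate the idea; the real acceptable ranges depend on how the features were built in the earlier notebooks.

```python
import pandas as pd

def check_training_table(df, ranges):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # 1. Keys must be present on every row.
    for col in ("run_date", "ticker"):
        n_missing = int(df[col].isna().sum())
        if n_missing:
            problems.append(f"{n_missing} rows missing {col}")

    # 2. Only rows on the latest run_date may lack a target.
    #    (A ticker that drops out before the latest date is flagged too.)
    last_date = df["run_date"].max()
    earlier = df[df["run_date"] < last_date]
    n_unlabeled = int(earlier["next_day_return_pct"].isna().sum())
    if n_unlabeled:
        problems.append(f"{n_unlabeled} rows before {last_date.date()} have no target")

    # 3. Non-null values must fall inside the supplied (low, high) bounds.
    for col, (low, high) in ranges.items():
        values = df[col].dropna()
        n_out = int((~values.between(low, high)).sum())
        if n_out:
            problems.append(f"{n_out} values of {col} outside [{low}, {high}]")

    return problems

# Illustrative bounds only (assumptions, not taken from the pipeline).
ASSUMED_RANGES = {"rv_20d": (0.0, 500.0), "next_day_return_pct": (-50.0, 50.0)}

sample = pd.DataFrame({
    "run_date": pd.to_datetime(
        ["2025-11-30", "2025-11-30", "2025-12-01", "2025-12-01"]
    ),
    "ticker": ["AMD", "SPY", "AMD", "SPY"],
    "rv_20d": [68.69167, 14.996082, 68.69167, 14.996082],
    "next_day_return_pct": [-0.859649, -0.61312, None, None],
})

print(check_training_table(sample, ASSUMED_RANGES))
```

In the notebook, the same function could be called on df from the cell above in place of the hand-made sample.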


## 4. Quick Visual: Target Return Distribution

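See how noisy next-day returns are. RL expects this. The cell below summarizes the count, mean, and standard deviation of next_day_return_pct for each value of signal_long in the existing screener_signals table.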

In [5]:
summary = con.sql("""
    SELECT
        signal_long,
        COUNT(*) AS n,
        AVG(next_day_return_pct) AS avg_next_day_return,
        STDDEV(next_day_return_pct) AS std_next_day_return
    FROM screener_signals
    WHERE next_day_return_pct IS NOT NULL
    GROUP BY signal_long
    ORDER BY signal_long DESC;
""").df()

summary

Unnamed: 0,signal_long,n,avg_next_day_return,std_next_day_return
0,1,2,-0.719248,0.198558
1,0,3,-0.647726,0.289297
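
To look at the target distribution itself, a histogram is one option. The sketch below uses the five non-null returns from the preview in section 1; in the notebook, df["next_day_return_pct"] could be passed instead. The bin count, labels, and the headless backend line are illustrative choices, not part of the original pipeline.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# The five available next-day returns (percent), copied from section 1.
returns = pd.Series(
    [-0.859649, -0.377289, -0.952769, -0.61312, -0.578846],
    name="next_day_return_pct",
)

fig, ax = plt.subplots(figsize=(10, 6))

# Drop NaN targets (latest run) before binning.
ax.hist(returns.dropna(), bins=5, edgecolor="black")

# Mark the sample mean so any drift away from zero is easy to see.
mean_return = returns.mean()
ax.axvline(mean_return, linestyle="--", color="red",
           label=f"mean = {mean_return:.3f}%")

ax.set_xlabel("Next-day return (%)")
ax.set_ylabel("Count")
ax.set_title("Distribution of next-day returns")
ax.legend()
```

With only five observations the shape carries little information; the plot becomes more useful as more run dates are backfilled.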


## What this table unlocks

Plain English:
This table is your final training dataset.
You now have:

- engineered features

- rolling volatility features

- volatility regime

- edge buckets

- liquidity buckets

- tomorrow’s return

This feeds directly into:

- Notebook 4 → RL Environment Setup

- Notebook 5 → Baseline Policies

- Notebook 6 → RL Training
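
Before training, rows without a target need to be set aside, since the latest run_date has no next-day return yet. One simple way to split the table is sketched below on a small hand-made sample; the names `table`, `labeled`, and `latest` are chosen for this example, and how the downstream notebooks actually consume the table is not shown here.

```python
import pandas as pd

# Small stand-in for screener_returns_with_target (values from the previews).
table = pd.DataFrame({
    "run_date": pd.to_datetime(
        ["2025-11-30", "2025-11-30", "2025-12-01", "2025-12-01"]
    ),
    "ticker": ["AMD", "SPY", "AMD", "SPY"],
    "edge_score": [35.113664, 7.770965, 34.77566, 7.804601],
    "next_day_return_pct": [-0.859649, -0.61312, None, None],
})

# Rows with a known next-day return can be used for training and evaluation.
labeled = table.dropna(subset=["next_day_return_pct"])

# Rows without a target (the latest run) can only be scored, not trained on.
latest = table[table["next_day_return_pct"].isna()]

print(len(labeled), "labeled rows;", len(latest), "unlabeled rows")
```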

In [6]:
con.close()
