# 03 – Target Engineering for Sentinel

This notebook defines **what Sentinel is trying to predict**.

Using the modeling-ready TradyFlow dataset from Notebook 02, we create supervised learning targets that represent short-horizon price movement and volatility regimes. These targets will be used in the next notebook to train and evaluate forecasting models.

## 1. Goal of This Notebook

Supervised learning needs a clear target to learn from.

In Sentinel, we want to understand how options sweep activity translates into short-term movement in the underlying. In this notebook we:

- Load the modeling dataset from `data/processed/`
- Define a forecasting horizon
- Create three labels:
  - `next_return_1d` – short-horizon return, expressed as a fraction (0.05 = 5%)
  - `direction_up` – binary up/down label
  - `vol_regime` – high vs. normal volatility regime
- Run sanity checks on the targets
- Save a new training dataset for modeling

## 2. Load Modeling Dataset

We load the `tradyflow_modeling.parquet` file produced in Notebook 02.

This dataset already contains:

- Cleaned numeric fields (volumes, premiums, OI, spreads)
- Engineered features (moneyness, liquidity, log-scaled quantities, DTE)
- Parsed timestamps (`Time_dt`, `Exp_dt`)

Section 3 then sorts the data by symbol and time so that “next move” labels are well-defined.

In [1]:
import pandas as pd
from pathlib import Path

# Paths
DATA_PATH = Path("../data/processed/tradyflow_modeling.parquet")

# Load modeling dataset
df = pd.read_parquet(DATA_PATH)

df.head()

Unnamed: 0,Time,Sym,C/P,Exp,Strike,Spot,BidAsk,Orders,Vol,Prems,...,Time_dt,Exp_dt,moneyness,spread_pct,flow_intensity,log_vol,log_prems,dte,is_call,is_put
0,6/17/2022 15:07,ISEE,Call,10/21/2022,10.0,9.54,5.05,7,360.0,183600.0,...,2022-06-17 15:07:00,2022-10-21,-0.048218,0.52935,66096000.0,5.888878,12.12052,125,1,0
1,6/17/2022 15:05,CVNA,Call,1/19/2024,60.0,23.52,4.6,7,634.0,310660.0,...,2022-06-17 15:05:00,2024-01-19,-1.55102,0.195578,196958400.0,6.453625,12.646458,580,1,0
2,6/17/2022 14:51,PTLO,Put,2/17/2023,15.0,15.19,3.5,7,800.0,281000.0,...,2022-06-17 14:51:00,2023-02-17,0.012508,0.230415,224800000.0,6.685861,12.546114,244,0,1
3,6/17/2022 14:39,TWLO,Call,6/24/2022,86.0,84.51,2.95,5,722.0,198800.0,...,2022-06-17 14:39:00,2022-06-24,-0.017631,0.034907,143533600.0,6.583409,12.20006,6,1,0
4,6/17/2022 13:56,ATUS,Put,9/16/2022,7.0,8.62,0.68,5,6270.0,501840.0,...,2022-06-17 13:56:00,2022-09-16,0.187935,0.078886,3146537000.0,8.743691,13.126039,90,0,1
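Before building targets, it can help to confirm that the columns used later in this notebook are actually present and that `Time_dt` is a datetime column. The following is a minimal sketch on a toy frame, not part of the original pipeline: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and `Diff(%)` is left out of the toy frame on purpose to show what a failure looks like.

```python
import pandas as pd

# Columns read by later cells in this notebook (hypothetical constant name).
REQUIRED_COLUMNS = {"Sym", "Time_dt", "Spot", "Diff(%)"}


def missing_columns(frame: pd.DataFrame, required: set) -> list:
    """Return the required column names that are absent from `frame`, sorted."""
    return sorted(required - set(frame.columns))


# Toy stand-in for the loaded dataset; values copied from the preview above.
# "Diff(%)" is deliberately omitted so the check has something to report.
toy = pd.DataFrame({
    "Sym": ["ISEE", "CVNA"],
    "Time_dt": pd.to_datetime(["2022-06-17 15:07:00", "2022-06-17 15:05:00"]),
    "Spot": [9.54, 23.52],
})

missing = missing_columns(toy, REQUIRED_COLUMNS)
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["Time_dt"])

print("missing columns:", missing)
print("Time_dt is datetime:", is_datetime)
```

On the real dataset the same call would be `missing_columns(df, REQUIRED_COLUMNS)`, which should return an empty list.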


## 3. Define Forecasting Objective

Sentinel’s first objective is to learn how option sweeps relate to **short-horizon price movement** in the underlying.

Because we are still wiring in external OHLC data, we approximate a short-horizon move as the change between the current sweep’s spot price and the **next sweep on the same symbol**:

> If the next sweep on this ticker trades at a higher spot price, the short-horizon move is considered **up**; otherwise it is **down or flat**.

This gives us a realistic, sequence-based target to train on while keeping the pipeline fully self-contained.

In [2]:
# Ensure a well-defined time ordering within each symbol
df = df.sort_values(["Sym", "Time_dt"]).reset_index(drop=True)

df[["Sym", "Time_dt", "Spot"]].head()

Unnamed: 0,Sym,Time_dt,Spot
0,A,2021-08-19 11:12:00,168.16
1,A,2021-09-03 11:41:00,179.49
2,AA,2021-06-22 12:01:00,33.87
3,AA,2021-07-06 12:11:00,36.62
4,AA,2021-07-08 10:10:00,35.01
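The “next sweep” is only unambiguous when no two sweeps on the same symbol share a timestamp, and `Time_dt` has minute resolution. The sketch below counts such ties on a toy frame. The second `AA` row at 12:11 (spot 36.70) is invented for illustration; the other values come from the preview above.

```python
import pandas as pd

# Toy frame: one hypothetical duplicate timestamp on AA (the 36.70 row).
toy = pd.DataFrame({
    "Sym": ["AA", "AA", "AA", "A"],
    "Time_dt": pd.to_datetime([
        "2021-07-06 12:11:00",
        "2021-06-22 12:01:00",
        "2021-07-06 12:11:00",
        "2021-08-19 11:12:00",
    ]),
    "Spot": [36.62, 33.87, 36.70, 168.16],
})

# Same ordering as the notebook cell above.
toy_sorted = toy.sort_values(["Sym", "Time_dt"]).reset_index(drop=True)

# Mark every row whose (Sym, Time_dt) pair occurs more than once.
tied = toy_sorted.duplicated(subset=["Sym", "Time_dt"], keep=False)
n_tied = int(tied.sum())

print(toy_sorted[tied])
print("rows involved in timestamp ties:", n_tied)
```

If ties do occur in the real data, one option is to add a deterministic tie-breaker column to the sort keys so that the labels are reproducible across runs.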


### 3.1 `next_return_1d` (Short-Horizon Return)

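`next_return_1d` measures the fractional change (0.05 = 5%) between the current sweep’s spot price and the next sweep’s spot price on the **same symbol**.

Despite the `_1d` suffix, the horizon is not a fixed trading day: it is the time until the next recorded sweep on that symbol, which varies from row to row (in the sorted preview above, the two sweeps on `A` are about 15 days apart). The target is therefore a continuous proxy for how the underlying moved after the flow, measured over an irregular horizon.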

In [3]:
# 1) Next-spot price within each symbol
df["next_spot"] = df.groupby("Sym")["Spot"].shift(-1)

# 2) Short-horizon return: next spot vs current spot
df["next_return_1d"] = (df["next_spot"] - df["Spot"]) / df["Spot"]

# Drop rows where we don't have a "next" observation
df = df.dropna(subset=["next_return_1d"]).reset_index(drop=True)

df[["Sym", "Spot", "next_spot", "next_return_1d"]].head()

Unnamed: 0,Sym,Spot,next_spot,next_return_1d
0,A,168.16,179.49,0.067376
1,AA,33.87,36.62,0.081193
2,AA,36.62,35.01,-0.043965
3,AA,35.01,31.82,-0.091117
4,AA,31.82,31.82,0.0


> The first rows contain positive, negative, and zero returns, so the target captures moves in both directions as well as unchanged prices.
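Because the horizon is defined by the next sweep rather than by the calendar, it is worth measuring how long that horizon actually is. The sketch below computes the elapsed time to the next sweep per symbol on a toy frame built from the sorted preview shown earlier. `gap_days` is a hypothetical column name, and on the real data the calculation would need to run before rows without a next observation are dropped.

```python
import pandas as pd

# Rows copied from the sorted preview (Sym, Time_dt, Spot).
toy = pd.DataFrame({
    "Sym": ["A", "A", "AA", "AA", "AA"],
    "Time_dt": pd.to_datetime([
        "2021-08-19 11:12:00",
        "2021-09-03 11:41:00",
        "2021-06-22 12:01:00",
        "2021-07-06 12:11:00",
        "2021-07-08 10:10:00",
    ]),
    "Spot": [168.16, 179.49, 33.87, 36.62, 35.01],
})
toy = toy.sort_values(["Sym", "Time_dt"]).reset_index(drop=True)

# Timestamp of the next sweep on the same symbol (NaT for the last sweep).
next_time = toy.groupby("Sym")["Time_dt"].shift(-1)

# Elapsed time to the next sweep, in days.
toy["gap_days"] = (next_time - toy["Time_dt"]).dt.total_seconds() / 86400

print(toy[["Sym", "Time_dt", "gap_days"]])
print(toy["gap_days"].describe())
```

A wide spread in `gap_days` would suggest either filtering to rows with a short gap or passing the gap to the model as a feature, since a return over 15 days is not directly comparable to one over 2 days.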

### 3.2 `direction_up` (Binary Direction Label)

`direction_up` is a simplified classification target:

- `1` – the next sweep on this symbol occurs at a **higher** spot price  
- `0` – the next sweep occurs at a **lower or equal** spot price  

This label suits baseline classifiers such as Logistic Regression, Random Forest, or Gradient Boosted Trees, and it maps onto a simple trading question: *“Did the underlying trade higher at the next sweep?”*

In [4]:
# Binary up / down label
df["direction_up"] = (df["next_return_1d"] > 0).astype(int)

df["direction_up"].value_counts(normalize=True)

direction_up
0    0.526253
1    0.473747
Name: proportion, dtype: float64

> The classes are reasonably balanced (≈53/47). Flat moves (`next_return_1d == 0`) fall into class `0`, so that class covers both lower and unchanged prices.
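Since class `0` mixes down moves with flat moves, a three-way breakdown shows how much of it is really “unchanged”. The sketch below uses the five returns from the Section 3.1 output; the `move` labels (`up`, `flat`, `down`) are hypothetical names introduced only for this check.

```python
import numpy as np
import pandas as pd

# The five returns shown in the Section 3.1 output.
returns = pd.Series(
    [0.067376, 0.081193, -0.043965, -0.091117, 0.0],
    name="next_return_1d",
)

# Sign of each return: 1 (up), 0 (flat), -1 (down), mapped to readable labels.
move = np.sign(returns).astype(int).map({1: "up", 0: "flat", -1: "down"})
counts = move.value_counts()

# The notebook's binary label, for comparison: flat rows end up in class 0.
direction_up = (returns > 0).astype(int)

print(counts)
print(direction_up.tolist())
```

If flat moves turn out to be a large share of class `0` on the full dataset, dropping them or modeling them as a third class are both reasonable alternatives to consider.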

### 3.3 `vol_regime` (Volatility Regime Label)

`vol_regime` classifies each observation into:

- `0` – **normal** movement (within the lower 75% of absolute `Diff(%)` values)  
- `1` – **high-volatility** regime (top 25% of absolute `Diff(%)` values)  

This target is useful for models that need to behave differently in calm vs. turbulent markets and for understanding how certain types of flow cluster in high-volatility environments.

In [5]:
# Use absolute Diff(%) as a volatility proxy
df["abs_diff_pct"] = df["Diff(%)"].abs()

vol_threshold = df["abs_diff_pct"].quantile(0.75)
df["vol_regime"] = (df["abs_diff_pct"] > vol_threshold).astype(int)

df["vol_regime"].value_counts(normalize=True)

vol_regime
0    0.750149
1    0.249851
Name: proportion, dtype: float64

> Roughly 25% of observations fall into the high-volatility regime, matching the quantile rule and confirming the split behaves as intended.
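The threshold above is computed over the full dataset. Because the next notebook splits the data into train/validation/test sets, one alternative is to derive the threshold from the training portion only and then apply it to later rows, so the label definition does not use information from the holdout period. The following is a sketch under that assumption: the toy `Diff(%)` values, the 75% chronological split, and the `train`/`holdout` names are all illustrative.

```python
import pandas as pd

# Toy data with invented Diff(%) values, one row per day.
toy = pd.DataFrame({
    "Time_dt": pd.date_range("2021-01-01", periods=8, freq="D"),
    "Diff(%)": [1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -20.0],
})
toy = toy.sort_values("Time_dt").reset_index(drop=True)

# Chronological split: the earliest 75% of rows form the training portion.
cutoff = int(len(toy) * 0.75)
train = toy.iloc[:cutoff].copy()
holdout = toy.iloc[cutoff:].copy()

# Threshold learned from training rows only.
threshold = train["Diff(%)"].abs().quantile(0.75)

# Apply the same fixed threshold to both portions.
train["vol_regime"] = (train["Diff(%)"].abs() > threshold).astype(int)
holdout["vol_regime"] = (holdout["Diff(%)"].abs() > threshold).astype(int)

print("threshold:", threshold)
print(train["vol_regime"].tolist(), holdout["vol_regime"].tolist())
```

With a train-only threshold the high-volatility share in later periods is no longer pinned to 25%, which is itself informative about regime drift.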

## 4. Target Sanity Checks

We check the basic statistics and class balance of our targets to ensure they are usable for modeling:

- `next_return_1d` should have a mean near zero with both positive and negative values.
- `direction_up` should not be extremely imbalanced (e.g., not 99% of one class).
- `vol_regime` should roughly match the chosen quantile split (around 25% in the high-vol regime).

These checks confirm that the labels are well-formed and suitable for both regression and classification experiments.

In [6]:
summary = df[["next_return_1d", "direction_up", "vol_regime"]].describe()
class_balance = {
    "direction_up": df["direction_up"].value_counts(normalize=True),
    "vol_regime": df["vol_regime"].value_counts(normalize=True),
}

summary, class_balance

(       next_return_1d  direction_up   vol_regime
 count     6704.000000   6704.000000  6704.000000
 mean        -0.011375      0.473747     0.249851
 std          0.159735      0.499348     0.432959
 min         -0.786242      0.000000     0.000000
 25%         -0.048614      0.000000     0.000000
 50%         -0.001148      0.000000     0.000000
 75%          0.033325      1.000000     0.000000
 max          6.853992      1.000000     1.000000,
 {'direction_up': direction_up
  0    0.526253
  1    0.473747
  Name: proportion, dtype: float64,
  'vol_regime': vol_regime
  0    0.750149
  1    0.249851
  Name: proportion, dtype: float64})

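> The class proportions match expectations and all 6,704 rows have a value for every target. The return distribution is heavy-tailed, though: the mean is about -1.1% and the maximum is 6.85 (+685%), so extreme values should be inspected before `next_return_1d` is used as a regression target.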
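One way to look at the tail is to flag returns beyond a cutoff and, if they are kept, clip them before regression. The sketch below uses the min, quartiles, and max from the summary table as a toy series. `EXTREME_CUTOFF = 1.0` (a move of ±100%) is an arbitrary illustrative choice, not a value from the original pipeline.

```python
import pandas as pd

# Min, quartiles, and max of next_return_1d from the summary table above.
returns = pd.Series(
    [-0.786242, -0.048614, -0.001148, 0.033325, 6.853992],
    name="next_return_1d",
)

# Hypothetical cutoff: treat moves beyond +/-100% as extreme.
EXTREME_CUTOFF = 1.0

# Flag extreme rows for manual inspection.
is_extreme = returns.abs() > EXTREME_CUTOFF

# Optional: cap returns at the cutoff instead of dropping the rows.
clipped = returns.clip(lower=-EXTREME_CUTOFF, upper=EXTREME_CUTOFF)

print("extreme rows:", int(is_extreme.sum()))
print(clipped.tolist())
```

Clipping leaves `direction_up` unchanged because it only alters the magnitude of a return, never its sign.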

## 5. Save Training Dataset

We save the enriched dataset (features + targets) to:

`../data/processed/tradyflow_training.parquet`

This file will be the primary input for the modeling notebook, where we will train and evaluate baseline forecasting models.

In [7]:
OUT_PATH = Path("../data/processed/tradyflow_training.parquet")

# The helper columns `next_spot` and `abs_diff_pct` are saved as well.
# Exclude them from the feature set downstream: `next_spot` determines
# `next_return_1d`, and `abs_diff_pct` determines `vol_regime`.
df.to_parquet(OUT_PATH)

OUT_PATH

PosixPath('../data/processed/tradyflow_training.parquet')

---

## 6. Summary & Next Notebook

In this notebook we:

- Loaded the modeling-ready TradyFlow dataset.
- Defined Sentinel’s short-horizon forecasting objective using consecutive sweeps per symbol.
- Engineered three targets:
  - `next_return_1d` – continuous short-horizon return
  - `direction_up` – binary up/down label
  - `vol_regime` – high vs. normal volatility regime
- Performed sanity checks on target distributions.
- Saved a training dataset ready for machine learning.

**Next:** Notebook 04 will split the data into train/validation/test sets, train baseline models (Logistic Regression, Random Forest, Gradient Boosted Trees), and evaluate how well options flow features forecast these targets.