# Instruction 01: Phase 0 (data and engineering), “v1-ready”

## Stage objective

Produce a single “v1-ready” dataset for all tickers (Close+Volume + IBOV regime), with:

* basic sanitation,
* the minimal additional features of Plano v1,
* training targets (D_t and U_t),
* no CLV,
* no temporal leakage.

## Inputs (required)

* Base file: `gold_rl_tabular.{csv|parquet}` containing at least:
  `date, ticker, close, volume, z1, z2, z3, z5, vol21, vol21_pct, rvol_pct, rvol_chg, med_vol21, dist_peak20_sigma, pct_z2_le_m1, ret1, ret3, ret5`.
* IBOV series inside the same file, identified by `ticker == "_BVSP"`.
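
The two input requirements above can be checked before any feature work starts. The following is a minimal sketch, not the notebook's own loader: the function name `validate_inputs` and the toy frame are hypothetical, and the check assumes the IBOV ticker is spelled exactly `_BVSP`.

```python
from typing import List

import pandas as pd

REQUIRED_COLS = [
    "date", "ticker", "close", "volume",
    "z1", "z2", "z3", "z5", "vol21", "vol21_pct",
    "rvol_pct", "rvol_chg", "med_vol21", "dist_peak20_sigma",
    "pct_z2_le_m1", "ret1", "ret3", "ret5",
]
IBOV_TICKER = "_BVSP"  # assumption: exact spelling used in the base file


def validate_inputs(df: pd.DataFrame) -> List[str]:
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    # The IBOV series must live in the same file as the other tickers.
    if "ticker" in df.columns and not df["ticker"].astype(str).eq(IBOV_TICKER).any():
        problems.append(f"no rows with ticker == {IBOV_TICKER!r}")
    return problems


# Toy frame with every required column and one IBOV row (values are made up).
toy = pd.DataFrame({c: [1.0, 2.0] for c in REQUIRED_COLS})
toy["date"] = pd.to_datetime(["2024-01-02", "2024-01-02"])
toy["ticker"] = ["_BVSP", "PETR4.SA"]
print(validate_inputs(toy))
```

Returning a list of problems instead of raising on the first one lets the caller report every missing piece in a single pass.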

## Outputs

1. Consolidated “v1-ready” file (same directory as the input), named:

   * `gold_rl_tabular_v1_ready.parquet` and `gold_rl_tabular_v1_ready.csv`
2. Minimal additional schema (per row, per ticker≠_BVSP):

   * `t_slope_10_logclose`, `t_slope_20_logclose`
   * `vol21_pct_ibov` (percentile of the IBOV 21d vol, configurable long window)
   * `dd_ibov_20d_sigma` (IBOV 20d drawdown scaled by the 21d sigma)
   * Targets:

     * `D1_min`, `D3_min`, `D5_min`  (worst return within the forward window of 1/3/5 days)
     * `U_comp` = `1.0*ret1 + 0.6*ret3 + 0.3*ret5 - cost_entry`
   * Quality flags: `flag_imputed`, `flag_window_warmup`
3. Control report (stdout): counts per ticker, min/max dates, % of rows removed per rule.

## Acceptance criteria (Go/No-Go)

* No row with `close<=0` or `volume<=0` remains.
* There are no NaN values in derived columns outside window warmup periods (those rows must be flagged and/or excluded from training).
* `_BVSP` does **not** appear in the output rows (it serves only as the source for regime features).
* Samples of `t_slope_10/20`, `vol21_pct_ibov`, `dd_ibov_20d_sigma`, `D*_min`, `U_comp` are coherent (non-degenerate values).
* Final sizes per ticker and period match the trading calendar, except for warmup cuts.
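
The last criterion (sizes matching the trading calendar) could be approximated as sketched below. This is an illustration only: the helper `calendar_coverage` is hypothetical, and it assumes the set of IBOV dates can stand in for the trading calendar, which the source does not state.

```python
import pandas as pd


def calendar_coverage(out_df: pd.DataFrame, calendar) -> pd.DataFrame:
    """Compare observed rows per ticker with calendar days inside its date range."""
    cal = pd.DatetimeIndex(calendar).unique().sort_values()
    rows = []
    for ticker, g in out_df.groupby("ticker"):
        lo, hi = g["date"].min(), g["date"].max()
        expected = int(((cal >= lo) & (cal <= hi)).sum())  # calendar days in [lo, hi]
        observed = int(g["date"].nunique())
        rows.append({"ticker": ticker, "observed": observed,
                     "expected": expected, "missing": expected - observed})
    return pd.DataFrame(rows).set_index("ticker")


# Toy data: ticker "B" lacks one session in the middle of its range.
cal = pd.bdate_range("2024-01-01", periods=5)
toy = pd.DataFrame({
    "ticker": ["A"] * 5 + ["B"] * 4,
    "date": list(cal) + [cal[0], cal[1], cal[3], cal[4]],
})
coverage = calendar_coverage(toy, cal)
print(coverage)
```

A non-zero `missing` count is expected for tickers that lost rows to the zero-volume filter, so the number is a prompt for inspection and not a hard failure.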

## Parameters (fix as constants at the top)

* `W_TREND = [10, 20]` (slopes on log-close).
* `W_VOL_PCTL = 252` (long window for the IBOV vol percentile; if history is insufficient, use 126).
* `W_DD = 20`, `W_SIGMA = 21` (20d DD, 21d sigma for scaling).
* `COST_ENTRY_BPS = 25` (entry cost for `U_comp`; keep equal to Plano v1).
* `EXCLUDE_ZEROES = True` (removes close/volume ≤ 0).
* `DROP_CLV = True`.
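
As a worked example of how `COST_ENTRY_BPS` enters `U_comp`: 25 bps is 0.0025 in decimal terms, so returns of 1%, 2% and 3% at horizons 1/3/5 give 0.01 + 0.012 + 0.009 - 0.0025 = 0.0285. The helper `u_comp` below is a hypothetical name used only for this illustration, and the input returns are made up.

```python
COST_ENTRY_BPS = 25  # entry cost in basis points, as in the parameter list


def u_comp(ret1: float, ret3: float, ret5: float, cost_bps: float = COST_ENTRY_BPS) -> float:
    """Composite utility: weighted horizon returns minus the entry cost in decimal form."""
    return 1.0 * ret1 + 0.6 * ret3 + 0.3 * ret5 - cost_bps / 10000.0


example = u_comp(0.01, 0.02, 0.03)  # made-up returns for illustration
print(round(example, 6))
```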

## Implementation rules (no leakage)

1. **Trend slopes**: transform `close` into `log_close`; for each window `w∈{10,20}`, run a simple linear regression of `log_close` on `t` (0..w-1) **using only the past**; store the coefficient (slope/day).
2. **IBOV volatility (21d) and percentile**:

   * In `_BVSP`, compute the daily return `r_ibov = log(close_t/close_{t-1})`.
   * `vol21_ibov = std(r_ibov, window=21)` (rolling, past only).
   * `vol21_pct_ibov = percent_rank(vol21_ibov, window=W_VOL_PCTL)` (past only).
3. **Scaled IBOV 20d drawdown**:

   * In `_BVSP`, for each day t, compute the **maximum** of `close` over the last `W_DD` days (including t). As a contemporaneous feature without lookahead, use `dd_ibov_20d = (close_t / rolling_max(close, W_DD) - 1)`.
   * Scale by `sigma_ibov = rolling_std(r_ibov, W_SIGMA)`;
     `dd_ibov_20d_sigma = dd_ibov_20d / sigma_ibov`.
4. **Propagate IBOV features** to the rows of the other tickers by `date` (join on date).
5. **Targets**: for each ticker≠_BVSP:

   * `D_h_min` (h∈{1,3,5}): worst return within horizon h counted **forward**:
     `D_h_min(t) = min_{k=1..h} (close_{t+k}/close_t - 1)`; ignore days with an incomplete window.
   * `U_comp = 1.0*ret1 + 0.6*ret3 + 0.3*ret5 - COST_ENTRY_BPS/10000`.
6. **Sanitation**:

   * If `EXCLUDE_ZEROES`: drop rows with `close<=0` or `volume<=0`.
   * Rows affected by window warmup (slopes, vol, percentile, DD, targets) must receive `flag_window_warmup=1`; optionally, exclude them from training in later phases.
7. **Remove CLV**: if a `clv` column exists, remove it from the output.
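
Rules 1, 3 and 5 can be sketched on a single toy series to show which direction each computation looks: features use trailing windows only, while the target looks forward and stays NaN when the forward window is incomplete. This is a minimal single-ticker illustration with made-up prices. The function names are chosen for this example, and the short windows (3 instead of 10/20) are only there to keep the toy data small.

```python
import numpy as np
import pandas as pd


def rolling_slope(y: pd.Series, w: int) -> pd.Series:
    """OLS slope of y on t = 0..w-1 over the trailing w observations."""
    t_c = np.arange(w, dtype=float) - (w - 1) / 2.0  # centered time index
    ssx = (t_c ** 2).sum()
    return y.rolling(w, min_periods=w).apply(
        lambda a: float((t_c * (a - a.mean())).sum() / ssx), raw=True
    )


def trailing_drawdown(close: pd.Series, w: int) -> pd.Series:
    """close_t divided by the max of the last w closes (including t), minus 1."""
    return close / close.rolling(w, min_periods=w).max() - 1.0


def forward_worst_return(close: pd.Series, h: int) -> pd.Series:
    """min over k=1..h of close_{t+k}/close_t - 1; NaN if any t+k is missing."""
    fwd = pd.concat([close.shift(-k) / close - 1.0 for k in range(1, h + 1)], axis=1)
    return fwd.min(axis=1, skipna=False)


close = pd.Series([100.0, 110.0, 99.0, 121.0, 108.9, 120.0])  # made-up prices
log_close = np.log(close)

slope3 = rolling_slope(log_close, 3)
dd3 = trailing_drawdown(close, 3)
d3_min = forward_worst_return(close, 3)

# Leakage check: dropping future rows must not change already-computed features.
slope3_truncated = rolling_slope(log_close.iloc[:4], 3)
print(pd.DataFrame({"close": close, "slope3": slope3, "dd3": dd3, "D3_min": d3_min}))
```

Recomputing a feature on a truncated history and comparing it with the full-history values is a cheap way to confirm that nothing from the future feeds into it. The target fails that comparison by construction, which is why it must never be used as an input.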

## Validation checklist (print at the end)

* Final rows, per ticker and total.
* Min/max dates per ticker.
* % of rows excluded for zero/negative values.
* % of rows flagged as `flag_window_warmup`.
* Sample (5 rows) of each new derived column.
* Confirmation that `_BVSP` is not in the final output.

## Minimal telemetry (stdout)

* `CHECKLIST_OK` or `CHECKLIST_FAILURE` with reasons.
* On failure, list the keys that failed and stop.
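
One possible shape for this telemetry is a dictionary of named boolean checks, so that the failing keys can be listed directly. The sketch below is an assumption about structure, not the notebook's implementation; `failed_keys`, `emit_verdict` and the check names are hypothetical.

```python
from typing import Dict, List


def failed_keys(checks: Dict[str, bool]) -> List[str]:
    """Names of the checks that did not pass, in insertion order."""
    return [name for name, ok in checks.items() if not ok]


def emit_verdict(checks: Dict[str, bool]) -> str:
    """Print the verdict; stop the run with SystemExit if any check failed."""
    failed = failed_keys(checks)
    if failed:
        print("CHECKLIST_FAILURE", failed)
        raise SystemExit(f"CHECKLIST_FAILURE: {failed}")
    print("CHECKLIST_OK")
    return "CHECKLIST_OK"


# Hypothetical check names; real values would come from the validation step.
verdict = emit_verdict({
    "no_nonpositive_close_or_volume": True,
    "no_nan_outside_warmup": True,
    "ibov_excluded": True,
})
```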

---

In [4]:
# Imports, constants, and notebook settings
import os
import sys
import math
import warnings
from pathlib import Path
from typing import Optional, Tuple

import numpy as np
import pandas as pd

pd.set_option("display.width", 160)
pd.set_option("display.max_columns", 60)
warnings.filterwarnings("ignore", category=FutureWarning)

# ---- Parameters (Plano v1) ----
W_TREND = [10, 20]              # slopes on log-close
W_VOL_PCTL = 252                # long window for IBOV vol percentile (fallback to 126 if insufficient)
W_DD = 20                       # drawdown lookback
W_SIGMA = 21                    # sigma window for scaling
COST_ENTRY_BPS = 25             # entry cost (bps)
EXCLUDE_ZEROES = True           # drop rows with close<=0 or volume<=0
DROP_CLV = True                 # drop 'clv' column if present

# Derived
COST_ENTRY = COST_ENTRY_BPS / 10000.0

# Globals (filled during run)
INPUT_FILE: Optional[Path] = None
INPUT_DIR: Optional[Path] = None
TOTAL_ROWS: int = 0
REMOVED_ZERO_NEG: int = 0

print("[CONFIG]")
print({
    "W_TREND": W_TREND,
    "W_VOL_PCTL": W_VOL_PCTL,
    "W_DD": W_DD,
    "W_SIGMA": W_SIGMA,
    "COST_ENTRY_BPS": COST_ENTRY_BPS,
    "EXCLUDE_ZEROES": EXCLUDE_ZEROES,
    "DROP_CLV": DROP_CLV,
})

[CONFIG]
{'W_TREND': [10, 20], 'W_VOL_PCTL': 252, 'W_DD': 20, 'W_SIGMA': 21, 'COST_ENTRY_BPS': 25, 'EXCLUDE_ZEROES': True, 'DROP_CLV': True}


In [6]:
# Optional input path override (set to a string path or leave as None)
# Example: r"G:\\Drives compartilhados\\BOLSA_2026\\a_bolsa2026_gemini\\00_data\\03_final\\gold_rl_tabular.parquet"
INPUT_FILE_OVERRIDE: Optional[str] = r"G:\Drives compartilhados\BOLSA_2026\a_bolsa2026_gemini\00_data\03_final\gold_rl_tabular.parquet"  # or set via env GOLD_RL_TABULAR_PATH

In [None]:
# Helper functions: loading, rolling slope, percent rank, telemetry
from dataclasses import dataclass


def find_input_file(preferred_dir: Optional[Path] = None) -> Path:
    # Honor explicit override via variable or environment
    override = INPUT_FILE_OVERRIDE or os.environ.get("GOLD_RL_TABULAR_PATH")
    if override:
        p = Path(override)
        if p.exists():
            return p
        raise FileNotFoundError(f"Override path not found: {p}")

    candidates = []
    search_dirs = [preferred_dir] if preferred_dir else []
    # default search path: project root's 00_data/03_final first, then 00_data/02_curado, then repo root
    root = Path.cwd()
    for p in [root / "00_data" / "03_final", root / "00_data" / "02_curado", root]:
        if p.exists():
            search_dirs.append(p)
    for d in search_dirs:
        candidates.extend([
            d / "gold_rl_tabular.parquet",
            d / "gold_rl_tabular.csv",
        ])
    for c in candidates:
        if c.exists():
            return c
    raise FileNotFoundError("gold_rl_tabular.{parquet|csv} not found in default locations. Set INPUT_FILE_OVERRIDE or env GOLD_RL_TABULAR_PATH.")


def read_base_file(path: Path) -> pd.DataFrame:
    if path.suffix.lower() == ".parquet":
        return pd.read_parquet(path)
    if path.suffix.lower() == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"Unsupported extension: {path.suffix}")


def ensure_datetime(df: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    # is_datetime64_any_dtype also handles tz-aware columns, which np.issubdtype rejects
    if not pd.api.types.is_datetime64_any_dtype(df[date_col]):
        df[date_col] = pd.to_datetime(df[date_col])
    return df


def percent_rank_rolling(s: pd.Series, window: int) -> pd.Series:
    # Rolling percent rank: position of last element within the window
    def pr(x: pd.Series) -> float:
        v = x.iloc[-1]
        if not np.isfinite(v):
            return np.nan
        x_valid = x[np.isfinite(x)]
        if len(x_valid) == 0:
            return np.nan
        return (x_valid <= v).mean()
    return s.rolling(window, min_periods=window).apply(pr, raw=False)


def rolling_slope(series: pd.Series, window: int) -> pd.Series:
    # linear regression slope of y ~ t for t=0..w-1 using only past values
    # returns slope per day
    y = series
    idx = np.arange(window)
    # Precompute design stats
    x_mean = idx.mean()
    ssx = ((idx - x_mean) ** 2).sum()

    def slope_last_w(x: np.ndarray) -> float:
        if np.any(~np.isfinite(x)):
            return np.nan
        y_mean = x.mean()
        cov = ((idx - x_mean) * (x - y_mean)).sum()
        return cov / ssx if ssx != 0 else np.nan

    return (
        y.rolling(window=window, min_periods=window)
        .apply(lambda arr: slope_last_w(arr), raw=True)
    )


@dataclass
class Checklist:
    removed_zero_neg: int = 0
    total_rows_initial: int = 0
    warmup_count: int = 0
    final_rows: int = 0

    def report(self):
        pct_removed = (self.removed_zero_neg / self.total_rows_initial * 100.0) if self.total_rows_initial else 0.0
        print("[CHECKLIST]")
        print({
            "removed_zero_or_negative": self.removed_zero_neg,
            "pct_removed": round(pct_removed, 4),
            "warmup_flagged": self.warmup_count,
            "final_rows": self.final_rows,
        })

In [11]:
# Load dataset, basic sanitation, and base metrics

# Locate and read base file
INPUT_FILE = find_input_file()
INPUT_DIR = INPUT_FILE.parent
print(f"[IO] Input file: {INPUT_FILE}")

# Read
base = read_base_file(INPUT_FILE)
TOTAL_ROWS = len(base)
print(f"[IO] Loaded rows: {TOTAL_ROWS}")

# Ensure required columns exist
required_cols = [
    "date", "ticker", "close", "volume",
    "z1", "z2", "z3", "z5", "vol21", "vol21_pct",
    "rvol_pct", "rvol_chg", "med_vol21", "dist_peak20_sigma",
    "pct_z2_le_m1", "ret1", "ret3", "ret5",
]
missing = [c for c in required_cols if c not in base.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

# Basic coercions
base = ensure_datetime(base, "date")
base = base.sort_values(["ticker", "date"]).reset_index(drop=True)

# Sanitation: remove close<=0 or volume<=0
if EXCLUDE_ZEROES:
    mask = (base["close"] <= 0) | (base["volume"] <= 0)
    REMOVED_ZERO_NEG = int(mask.sum())
    base = base.loc[~mask].copy()
    print(f"[CLEAN] Removed non-positive close/volume rows: {REMOVED_ZERO_NEG}")

# Split IBOV and others (case-insensitive match for '_bvsp')
ibov_mask = base["ticker"].astype(str).str.lower().eq("_bvsp")
ibov = base.loc[ibov_mask, ["date", "ticker", "close"]].copy()
others = base.loc[~ibov_mask].copy()

# Precompute per-ticker log_close
others["log_close"] = np.log(others["close"])

[IO] Input file: G:\Drives compartilhados\BOLSA_2026\a_bolsa2026_gemini\00_data\03_final\gold_rl_tabular.parquet
[IO] Loaded rows: 105089
[CLEAN] Removed non-positive close/volume rows: 12466


In [12]:
# IBOV features: r_ibov, vol21_ibov, vol21_pct_ibov, dd_ibov_20d_sigma

# Re-load raw base to compute IBOV features without volume filtering (avoid degeneracy)
_base_raw = read_base_file(INPUT_FILE)
_base_raw = ensure_datetime(_base_raw, "date")
ibov = _base_raw.loc[_base_raw["ticker"].astype(str).str.lower().eq("_bvsp"), ["date", "ticker", "close"]].copy()
# Keep only positive close for IBOV
ibov = ibov.loc[ibov["close"] > 0].sort_values("date").reset_index(drop=True)

# Compute r_ibov (log returns)
ibov["r_ibov"] = np.log(ibov["close"]).diff()

# Rolling vol 21d (past only)
ibov["vol21_ibov"] = (
    ibov["r_ibov"].rolling(window=21, min_periods=21).std()
)

# Percentile rank over long window (W_VOL_PCTL; fall back to 126, or to the available history, if insufficient)
win_pctl = W_VOL_PCTL
if len(ibov) < W_VOL_PCTL:
    win_pctl = max(2, min(126, len(ibov)))
ibov["vol21_pct_ibov"] = percent_rank_rolling(ibov["vol21_ibov"], window=win_pctl)

# Drawdown over last W_DD days (current vs rolling max), scaled by sigma over W_SIGMA
roll_max = ibov["close"].rolling(window=W_DD, min_periods=W_DD).max()
ibov["dd_ibov_20d"] = ibov["close"] / roll_max - 1.0
ibov["sigma_ibov"] = ibov["r_ibov"].rolling(window=W_SIGMA, min_periods=W_SIGMA).std()
ibov["dd_ibov_20d_sigma"] = ibov["dd_ibov_20d"] / ibov["sigma_ibov"]

# Keep only date + needed columns for join
ibov_feat = ibov[["date", "vol21_pct_ibov", "dd_ibov_20d_sigma"]].copy()
print("[IBOV] Features prepared:", ibov_feat.columns.tolist(), "; rows:", len(ibov_feat), "; non-null:", ibov_feat.notna().sum().to_dict())

[IBOV] Features prepared: ['date', 'vol21_pct_ibov', 'dd_ibov_20d_sigma'] ; rows: 3407 ; non-null: {'date': 3407, 'vol21_pct_ibov': 3386, 'dd_ibov_20d_sigma': 3386}


In [15]:
# Join IBOV features to others; compute slopes, targets, and flags

# Join by date
df = others.merge(ibov_feat, on="date", how="left")

# Compute trend slopes on log_close per ticker
for w in W_TREND:
    col = f"t_slope_{w}_logclose"
    df[col] = (
        df.sort_values(["ticker", "date"])  # ensure order
          .groupby("ticker", group_keys=False)["log_close"]
          .apply(lambda s: rolling_slope(s, window=w))
    )

# Targets: worst forward return within horizon h in {1,3,5}
# D_h_min(t) = min_{k=1..h} (close_{t+k}/close_t - 1)
# We'll compute via groupby + shifting close

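for h in [1, 3, 5]:
    # Forward returns close_{t+k}/close_t - 1 for k = 1..h, shifted within each ticker
    # (df is already ordered by ticker and date)
    future_rets = [
        df.groupby("ticker")["close"].shift(-k) / df["close"] - 1.0
        for k in range(1, h + 1)
    ]
    stacked = pd.concat(future_rets, axis=1)
    # skipna=False keeps NaN when the forward window is incomplete,
    # so those rows are picked up by flag_window_warmup
    df[f"D{h}_min"] = stacked.min(axis=1, skipna=False)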

# U_comp = 1.0*ret1 + 0.6*ret3 + 0.3*ret5 - COST_ENTRY
# Use existing ret1/ret3/ret5 in the dataset
if not set(["ret1", "ret3", "ret5"]).issubset(df.columns):
    raise ValueError("ret1/ret3/ret5 not found in base columns.")

df["U_comp"] = 1.0 * df["ret1"] + 0.6 * df["ret3"] + 0.3 * df["ret5"] - COST_ENTRY

# Flags: warmup and imputed
# Warmup where any newly created rolling value is NaN because of insufficient history
warmup_cols = [
    *(f"t_slope_{w}_logclose" for w in W_TREND),
    "vol21_pct_ibov",
    "dd_ibov_20d_sigma",
    "D1_min", "D3_min", "D5_min",
    # Treat missing ret horizons as warmup as well (ensures U_comp non-NaN outside warmup)
    "ret1", "ret3", "ret5",
]

df["flag_window_warmup"] = df[warmup_cols].isna().any(axis=1).astype(int)

# No explicit imputation here; placeholder flag 0
# (If later we impute, set 1 accordingly.)
df["flag_imputed"] = 0



In [16]:
# Finalize: drop CLV, exclude IBOV rows (case-insensitive), write outputs, and validation checklist

# Remove CLV if requested
if DROP_CLV and "clv" in df.columns:
    df = df.drop(columns=["clv"])  # remove CLV from output

# Exclude IBOV from output (case-insensitive)
out_df = df.loc[~df["ticker"].astype(str).str.lower().eq("_bvsp")].copy()

# Determine warmup rows to exclude from training (kept here but flagged)
check = Checklist(
    removed_zero_neg=REMOVED_ZERO_NEG,
    total_rows_initial=TOTAL_ROWS,
    warmup_count=int(out_df["flag_window_warmup"].sum()),
    final_rows=len(out_df),
)

# Validate acceptance criteria
fail_reasons = []
if (out_df["close"] <= 0).any() or (out_df["volume"] <= 0).any():
    fail_reasons.append("Found non-positive close/volume in output")

# Ensure no NaNs in derived columns outside warmup
derived_cols = [
    *(f"t_slope_{w}_logclose" for w in W_TREND),
    "vol21_pct_ibov", "dd_ibov_20d_sigma",
    "D1_min", "D3_min", "D5_min", "U_comp",
]
mask_non_warmup = out_df["flag_window_warmup"].eq(0)
for c in derived_cols:
    if out_df.loc[mask_non_warmup, c].isna().any():
        fail_reasons.append(f"NaN in {c} outside warmup")

# Ensure IBOV does not appear
if out_df["ticker"].astype(str).str.lower().eq("_bvsp").any():
    fail_reasons.append("_BVSP present in output")

# Sample sanity checks (non-degenerate stats)
for c in [*(f"t_slope_{w}_logclose" for w in W_TREND),
          "vol21_pct_ibov", "dd_ibov_20d_sigma",
          "D1_min", "D3_min", "D5_min", "U_comp"]:
    if out_df[c].dropna().nunique() <= 1:
        fail_reasons.append(f"Degenerate values in {c}")

# Write outputs
out_parquet = INPUT_DIR / "gold_rl_tabular_v1_ready.parquet"
out_csv = INPUT_DIR / "gold_rl_tabular_v1_ready.csv"

out_df.to_parquet(out_parquet, index=False)
out_df.to_csv(out_csv, index=False)
print(f"[IO] Wrote: {out_parquet}")
print(f"[IO] Wrote: {out_csv}")

# Report per ticker counts and date ranges
print("[SUMMARY] Per-ticker counts and date ranges")
summary = out_df.groupby("ticker").agg(
    rows=("date", "size"),
    date_min=("date", "min"),
    date_max=("date", "max"),
)
print(summary)

# Checklist
check.final_rows = len(out_df)
check.report()

# Additional checklist details
pct_removed_zero_neg = (REMOVED_ZERO_NEG / TOTAL_ROWS * 100.0) if TOTAL_ROWS else 0.0
pct_warmup = (out_df["flag_window_warmup"].mean() * 100.0) if len(out_df) else 0.0
print("[CHECKLIST-DETAILS]")
print({
    "%_removed_zero_neg": round(pct_removed_zero_neg, 4),
    "%_warmup_flagged": round(pct_warmup, 4),
})

# Show samples of new columns
print("[SAMPLES]")
sample_cols = [
    *(f"t_slope_{w}_logclose" for w in W_TREND),
    "vol21_pct_ibov", "dd_ibov_20d_sigma",
    "D1_min", "D3_min", "D5_min", "U_comp",
    "flag_window_warmup",
]
print(out_df.sort_values(["ticker", "date"]).loc[:, ["ticker", "date", *sample_cols]].groupby("ticker").head(5))

# Telemetry: verdict
if fail_reasons:
    print("CHECKLIST_FAILURE", fail_reasons)
    raise SystemExit(f"CHECKLIST_FAILURE: {fail_reasons}")
else:
    print("CHECKLIST_OK")

[IO] Wrote: G:\Drives compartilhados\BOLSA_2026\a_bolsa2026_gemini\00_data\03_final\gold_rl_tabular_v1_ready.parquet
[IO] Wrote: G:\Drives compartilhados\BOLSA_2026\a_bolsa2026_gemini\00_data\03_final\gold_rl_tabular_v1_ready.csv
[SUMMARY] Per-ticker counts and date ranges
                rows   date_min   date_max
ticker                                    
ABEV3.SA        3400 2012-01-02 2025-10-01
B3SA3.SA        3402 2012-01-02 2025-10-01
BBAS3.SA        3400 2012-01-02 2025-10-01
CPLE6.SA        3398 2012-01-02 2025-10-01
CSNA3.SA        3397 2012-01-02 2025-10-01
ELET3.SA        3396 2012-01-02 2025-10-01
GGBR4.SA        3379 2012-01-02 2025-10-01
HAPV3.SA        1845 2018-04-26 2025-10-01
ITUB4.SA        3388 2012-01-02 2025-10-01
LREN3.SA        3395 2012-01-02 2025-10-01
PETR4.SA        3396 2012-01-02 2025-10-01
PRIO3.SA        3380 2012-01-02 2025-10-01
PSSA3.SA        3400 2012-01-02 2025-10-01
RAIL3.SA        2595 2015-04-02 2025-10-01
RDOR3.SA        1193 2020-12-15 2025-1