Data Pipeline

# 📊 **Trend Surgeon: Full Feature Engineering Pipeline**

This document describes the **end-to-end data pipeline** used to construct the final, fully aligned, leakage-safe feature dataset for multi-step recursive forecasting.

---

# 🧱 **1. Inputs**

| Parameter         | Description                                           |
| ----------------- | ----------------------------------------------------- |
| `TARGET_TICKER`   | Ticker being predicted (e.g. `"PPH"`)                 |
| `SUPPORT_TICKERS` | Related tickers used as features                      |
| `START_DATE`      | Historical start date                                 |
| `END_DATE`        | Historical end date                                   |
| `HORIZON`         | Number of forecast steps used in recursive prediction |

**All data is downloaded directly from Yahoo Finance.**

---

# 🗃️ **2. Data Download & Cleaning (Step A)**

1. Download OHLCV for target + support tickers.
2. Flatten MultiIndex columns.
3. Drop `"Adj Close"` columns.
4. Identify earliest date where **all tickers** have complete OHLCV.
5. Trim dataset to only fully-overlapping period.

Result:
**A clean base dataframe of synchronized OHLCV across all tickers.**
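
Steps 4 and 5 can be sketched on a toy frame as follows. This is a minimal illustration rather than the pipeline's own code: `trim_to_common_window` and the `AAA`/`BBB` tickers are hypothetical, and the sketch assumes that missing history shows up as leading NaNs.

```python
import numpy as np
import pandas as pd


def trim_to_common_window(df: pd.DataFrame) -> pd.DataFrame:
    """Drop leading rows until every column has a value (hypothetical helper)."""
    complete = df.notna().all(axis=1)
    if not complete.any():
        raise ValueError("No date has complete data for all tickers.")
    first_full = complete.idxmax()  # label of the first fully populated row
    return df.loc[first_full:]


# Toy data: BBB starts trading two sessions after AAA
idx = pd.bdate_range("2024-01-01", periods=6)
demo = pd.DataFrame(
    {
        "AAA_Close": [10.0, 10.1, 10.2, 10.3, 10.4, 10.5],
        "BBB_Close": [np.nan, np.nan, 20.0, 20.1, 20.2, 20.3],
    },
    index=idx,
)

trimmed = trim_to_common_window(demo)
print(trimmed.index[0].date(), len(trimmed))
```

Only the leading gap is removed here; gaps in the middle of the sample are a separate concern, handled in the notebook by the Step A.5 cleaning function.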

---

# 🔑 **3. Feature Registry (Step B)**

A global dictionary that stores feature alignment rules:

```python
feature_registry = {
    "PPH_Return_1d": "shift_1",
    "day_of_week": "no_shift",
    "is_cpi_day": "shift_plus_k",
    ...
}
```

Allowed rules:

| Rule             | Meaning                                           |
| ---------------- | ------------------------------------------------- |
| `"shift_1"`      | Feature uses OHLCV(t) and must be shifted to t−1  |
| `"no_shift"`     | Feature known before t (calendar, lagged)         |
| `"shift_plus_k"` | Future event feature, expanded to t+1 … t+HORIZON |

This guarantees **explicit control** of temporal alignment and prevents leakage.
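
Because the registry drives all later alignment, a typo in a rule name would silently skip a shift. A small validation pass is one way to guard against that. The sketch below is an assumption about how such a check could look; `ALLOWED_RULES` and `validate_registry` are hypothetical names that are not part of the pipeline.

```python
# Hypothetical guard: the three rule names come from the table above,
# the helper itself is illustrative only.
ALLOWED_RULES = {"shift_1", "no_shift", "shift_plus_k"}


def validate_registry(registry: dict) -> None:
    """Raise if any feature carries a rule the shift engine does not know."""
    bad = {name: rule for name, rule in registry.items() if rule not in ALLOWED_RULES}
    if bad:
        raise ValueError(f"Unknown alignment rules: {bad}")


example_registry = {
    "PPH_Return_1d": "shift_1",
    "day_of_week": "no_shift",
    "is_cpi_day": "shift_plus_k",
}
validate_registry(example_registry)  # passes silently

try:
    validate_registry({"PPH_RSI_14": "shift_2"})  # misspelled rule
    rejected = False
except ValueError:
    rejected = True

print("bad rule rejected:", rejected)
```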

---

# 📈 **4. Per-Ticker Technical Features (Step C)**

Generated for **each** ticker:

### Momentum

* Returns (1d, 5d, 10d, 20d)
* SMA-10, SMA-50, EMA-20, EMA-50
* MACD, MACD signal, MACD histogram
* RSI-14
* Stochastics (%K, %D)

### Volatility

* Rolling vol-20
* Parkinson
* Garman-Klass
* Bollinger Band width

### Volume

* Volume ROC
* Volume Z-score
* OBV

### Entropy

* Rolling return energy

All of these depend on OHLCV(t) →
**registered with `"shift_1"`**

---

# 🔗 **5. Cross-Ticker Features (Step D)**

For each support ticker:

* Price ratio: `ticker_Close / target_Close`
* RS-20: 20-day relative strength
* 60-day rolling correlation

Also depend on OHLCV(t) →
**registered `"shift_1"`**
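
The three cross-ticker features can be sketched on synthetic prices as below. The ratio column name follows the `{ticker}_Ratio_{target}` pattern used in Step D; the `_RS_20` and `_Corr_60` names, and the choice to define RS-20 as the difference between the two 20-day returns, are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices; values are not real market data
rng = np.random.default_rng(0)
idx = pd.bdate_range("2023-01-02", periods=120)
prices = pd.DataFrame(
    {
        "PPH_Close": 80 * np.exp(np.cumsum(rng.normal(0, 0.01, len(idx)))),
        "XLV_Close": 130 * np.exp(np.cumsum(rng.normal(0, 0.01, len(idx)))),
    },
    index=idx,
)

target, ticker = "PPH", "XLV"
t_close = prices[f"{target}_Close"]
s_close = prices[f"{ticker}_Close"]

cross = pd.DataFrame(index=idx)

# Price ratio: support close over target close
cross[f"{ticker}_Ratio_{target}"] = s_close / t_close

# ASSUMPTION: RS-20 taken as support 20d return minus target 20d return
cross[f"{ticker}_RS_20"] = s_close.pct_change(20) - t_close.pct_change(20)

# 60-day rolling correlation of 1-day returns
cross[f"{ticker}_Corr_60"] = (
    s_close.pct_change().rolling(60).corr(t_close.pct_change())
)

print(cross.dropna().head())
```

All three columns use the close of day t, which is why they would need the one-row lag before being used to predict day t.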

---

# 🧬 **6. PCA Latent Factors (Step E)**

PCA on all tickers' 1-day returns produces:

* `PCA_1`
* `PCA_2`
* `PCA_3`

Captures market-level latent structure.
Depends on returns(t) →
**registered `"shift_1"`**
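
A minimal sketch of the factor extraction, using a plain NumPy SVD so that it has no extra dependencies. The helper `pca_factors`, the standardization step, and the decision to fit once on the whole sample are all assumptions; the text above does not say which PCA implementation or fitting window the pipeline uses.

```python
import numpy as np
import pandas as pd


def pca_factors(returns: pd.DataFrame, n_components: int = 3) -> pd.DataFrame:
    """Project standardized returns onto their first principal axes (hypothetical helper)."""
    clean = returns.dropna()
    z = (clean - clean.mean()) / clean.std(ddof=0)
    # Rows of vt are the principal axes, ordered by explained variance
    _, _, vt = np.linalg.svd(z.to_numpy(), full_matrices=False)
    scores = z.to_numpy() @ vt[:n_components].T
    cols = [f"PCA_{i + 1}" for i in range(n_components)]
    # Re-align to the original index; rows dropped above come back as NaN
    return pd.DataFrame(scores, index=clean.index, columns=cols).reindex(returns.index)


# Synthetic returns: one shared driver plus ticker-specific noise
rng = np.random.default_rng(42)
idx = pd.bdate_range("2023-01-02", periods=200)
market = rng.normal(0, 0.01, len(idx))
returns = pd.DataFrame(
    {f"T{i}_Return_1d": market + rng.normal(0, 0.005, len(idx)) for i in range(5)},
    index=idx,
)
returns.iloc[0] = np.nan  # pct_change leaves the first row empty

factors = pca_factors(returns)
print(factors.var())
```

Fitting the axes on the full sample means the loadings reflect data from the whole period; fitting on a training window only would be the stricter choice if that matters for the forecast evaluation.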

---

# 📅 **7. Calendar & Macro-Event Features (Step F)**

### Calendar Basics

`day_of_week`, `day_of_month`, `month`, `quarter`,
`is_month_end`, `is_month_start`, `is_year_end`
→ **registered `"no_shift"`**

### Holidays

`is_holiday_adjacent`
→ **`"no_shift"`**

### OPEX Week

→ **`"no_shift"`**

### FOMC Week

→ **`"no_shift"`**

### Macro-Events

Day-level & week-level flags:

* `is_cpi_day`
* `is_nfp_day`
* `is_ppi_day`
* `is_gdp_day`
* …and their `_week` variants

→ **registered `"shift_plus_k"`**

Later expanded into:

```
is_cpi_day_t+1
is_cpi_day_t+2
...
is_cpi_day_t+HORIZON
```
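
The day-level and week-level flags themselves could be derived from a list of release dates, as in the sketch below. The date in `cpi_dates` is a placeholder for illustration, not a maintained release calendar, and the source of event dates used by the pipeline is not stated here.

```python
import pandas as pd

idx = pd.bdate_range("2024-01-01", "2024-01-31")

# PLACEHOLDER release date for illustration only
cpi_dates = pd.to_datetime(["2024-01-11"])

cal = pd.DataFrame(index=idx)

# Day-level flag: 1 on the release date, 0 otherwise
cal["is_cpi_day"] = idx.isin(cpi_dates).astype(int)

# Week-level flag: 1 for every session in a week that contains a release
weeks = idx.to_period("W")
event_weeks = weeks[cal["is_cpi_day"].to_numpy() == 1]
cal["is_cpi_week"] = weeks.isin(event_weeks).astype(int)

print(cal[cal["is_cpi_week"] == 1])
```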

---

# 🕒 **8. Shift Engine (Step G)**

Applies temporal alignment **after** all features are created:

### `"shift_1"`

→ `df[col] = df[col].shift(1)`

### `"no_shift"`

→ unchanged

### `"shift_plus_k"`

→ automatically creates shifted columns for k = 1..HORIZON:

```
col_t+1 = df[col].shift(-1)
col_t+2 = df[col].shift(-2)
...
col_t+HORIZON = df[col].shift(-HORIZON)
```

Guarantees **zero leakage** and proper alignment for recursive multi-step forecasting.
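
The three rules can be sketched as a single pass over the registry. This is an illustrative version under stated assumptions, not the pipeline's shift engine: `apply_shift_rules` is a hypothetical name, and the ladder uses negative shifts on the assumption that row t should carry the event flag for day t+k.

```python
import pandas as pd


def apply_shift_rules(df: pd.DataFrame, registry: dict, horizon: int) -> pd.DataFrame:
    """Return a new frame with each registered column aligned by its rule."""
    out = pd.DataFrame(index=df.index)
    for col, rule in registry.items():
        if rule == "shift_1":
            # Row t only sees the value computed from day t-1
            out[col] = df[col].shift(1)
        elif rule == "no_shift":
            out[col] = df[col]
        elif rule == "shift_plus_k":
            # ASSUMPTION: row t receives the flag of day t+k, hence shift(-k)
            for k in range(1, horizon + 1):
                out[f"{col}_t+{k}"] = df[col].shift(-k)
        else:
            raise ValueError(f"Unknown rule {rule!r} for feature {col!r}")
    return out


idx = pd.bdate_range("2024-01-01", periods=5)
raw = pd.DataFrame(
    {
        "PPH_Return_1d": [0.1, 0.2, 0.3, 0.4, 0.5],
        "day_of_week": [0, 1, 2, 3, 4],
        "is_cpi_day": [0, 0, 1, 0, 0],
    },
    index=idx,
)
registry = {
    "PPH_Return_1d": "shift_1",
    "day_of_week": "no_shift",
    "is_cpi_day": "shift_plus_k",
}

aligned = apply_shift_rules(raw, registry, horizon=2)
print(aligned)
```

The last `horizon` rows of each ladder column are NaN, because the events beyond the end of the frame are unknown; this is the reason the download step fetches extra days past the requested end date.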

---

# 📜 **9. Markdown Documentation Generator (Step H)**

Optional step that exports a table:

| Feature | Rule | Resulting Columns |
| ------- | ---- | ----------------- |

Uses registry + horizon to show how each feature is aligned.

Pure documentation; not used by code.
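
One way such a table could be rendered from the registry and the horizon is sketched below. `registry_to_markdown` is a hypothetical name, and the wording of the "Resulting Columns" cell is an assumption about what the generator prints.

```python
def registry_to_markdown(registry: dict, horizon: int) -> str:
    """Render the alignment rules as a markdown table (illustrative helper)."""
    lines = [
        "| Feature | Rule | Resulting Columns |",
        "| ------- | ---- | ----------------- |",
    ]
    for name, rule in registry.items():
        if rule == "shift_plus_k":
            # One column per forecast step, shown as a range
            result = f"`{name}_t+1` ... `{name}_t+{horizon}`"
        elif rule == "shift_1":
            result = f"`{name}` (lagged one row)"
        else:
            result = f"`{name}`"
        lines.append(f"| `{name}` | `{rule}` | {result} |")
    return "\n".join(lines)


table = registry_to_markdown(
    {
        "PPH_Return_1d": "shift_1",
        "day_of_week": "no_shift",
        "is_cpi_day": "shift_plus_k",
    },
    horizon=30,
)
print(table)
```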

---

# 🧩 **10. Master Entry Function**

All pieces are tied together in:

```python
build_feature_dataset(
    target="PPH",
    support_tickers=[...],
    start_date="2011-01-04",
    end_date="2025-12-01",
    horizon=30
)
```

What it does:

1. Download & clean data
2. Trim to earliest common date
3. Run technicals
4. Run cross-ticker features
5. Run PCA
6. Add calendar + macro events
7. Apply shift engine
8. (Optional) write markdown
9. Return final feature matrix

---

# 📦 **Final Output**

A fully aligned, leakage-safe dataframe with:

* Target ticker
* Support tickers
* All technical features
* Cross-ticker features
* PCA factors
* Calendar features
* Holiday effects
* Macro event ladders (t+1 → t+HORIZON)
* Proper shifts applied
* Documentation table


In [1]:
# ============================================================
# STEP A - RAW DATA PREPARATION WITH SAFE WINDOW EXPANSION
# ============================================================

import pandas as pd
import yfinance as yf
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger(__name__)


def _compute_safe_window(start_date, end_date, min_history=20, horizon=30):
    """
    Expands the user's requested date range so that:
      - rolling windows (SMA50, corr60, etc.) have enough initial history
      - shift_plus_k has future coverage
    """
    start = pd.to_datetime(start_date)
    end   = pd.to_datetime(end_date)

    safe_start = start - pd.Timedelta(days=int(min_history * 2.5))
    safe_end   = end + pd.Timedelta(days=horizon + 10)

    return safe_start, safe_end


def download_and_prepare_data(target, support_tickers,
                              start_date, end_date,
                              min_history=20, horizon=30):
    """
    Download OHLCV for the target and support tickers over an expanded window.

    Combines:
      ✔ The original OHLCV preparation logic
      ✔ Automatic expansion of the date window
      ✔ Clear validation errors if the requested date range is unavailable
      ✔ Full safe dataset for all later steps
    """

    tickers = [target] + list(support_tickers)
    logger.info(f"Downloading data for: {tickers}")

    # ------------------------------------------------------------
    # 1. Compute expanded safe window
    # ------------------------------------------------------------
    safe_start, safe_end = _compute_safe_window(start_date, end_date,
                                                min_history, horizon)

    logger.info(f"Safe fetch window: {safe_start.date()} → {safe_end.date()}")

    # ------------------------------------------------------------
    # 2. Download from yfinance inside safe window
    # ------------------------------------------------------------
    raw = yf.download(
        tickers=tickers,
        start=safe_start.strftime("%Y-%m-%d"),
        end=safe_end.strftime("%Y-%m-%d"),
        auto_adjust=False,
        group_by="ticker",
        progress=False
    )

    if raw.empty:
        raise ValueError("❌ yfinance returned no data.")

    # ------------------------------------------------------------
    # 3. Flatten columns
    # ------------------------------------------------------------
    raw.columns = [f"{t}_{f}" for t, f in raw.columns]

    # ------------------------------------------------------------
    # 4. Drop Adj Close if any
    # ------------------------------------------------------------
    raw = raw[[c for c in raw.columns if "Adj_Close" not in c and "Adj Close" not in c]]

    # ------------------------------------------------------------
    # 5. Ensure datetime index
    # ------------------------------------------------------------
    if "date" in raw.columns:
        raw["date"] = pd.to_datetime(raw["date"], errors="coerce")
        raw = raw.set_index("date")

    raw.index = pd.to_datetime(raw.index)

    # ------------------------------------------------------------
    # 6. Validate that requested range is covered
    # ------------------------------------------------------------
    earliest = raw.index.min()
    latest   = raw.index.max()

    req_start = pd.to_datetime(start_date)
    req_end   = pd.to_datetime(end_date)

    if earliest > req_start:
        raise ValueError(
            f"❌ Requested start {req_start.date()} is too early.\n"
            f"   Earliest fetched data is {earliest.date()}"
        )

    if latest < req_end:
        raise ValueError(
            f"❌ Requested end {req_end.date()} is too late.\n"
            f"   Latest fetched data is {latest.date()}"
        )

    # ------------------------------------------------------------
    # 7. DO NOT trim early; keep full safe window
    #    Feature generation needs the earlier history!
    # ------------------------------------------------------------

    logger.info(f"Data prepared. Shape after safe fetch: {raw.shape}")

    return raw


**Test of Step A**

In [2]:
# ============================
# TEST STEP A: RAW DATA PREP
# ============================

print(">>> Running Step A - download_and_prepare_data\n")

# --- PARAMETERS ---
TARGET = "PPH"
SUPPORT = [
    "XPH", "IHE", "IBB", "XBI", "XLV", "VHT", "SPY", "VIXY"
]
START = "2011-01-04"
END   = "2025-12-01"

# --- RUN STEP A ---
df_raw = download_and_prepare_data(
    target=TARGET,
    support_tickers=SUPPORT,
    start_date=START,
    end_date=END
)

print("\n>>> DONE. Output summary:\n")
print("Shape:", df_raw.shape)
print("\nIndex dtype:", df_raw.index.dtype)
print("\nColumns:", list(df_raw.columns)[:20], "...\n")
print("\nDataFrame info():\n")
print(df_raw.info())
print("\nHead:\n")
print(df_raw.head())

Downloading data for: ['PPH', 'XPH', 'IHE', 'IBB', 'XBI', 'XLV', 'VHT', 'SPY', 'VIXY']
Safe fetch window: 2010-11-15 → 2026-01-10


>>> Running Step A - download_and_prepare_data



Data prepared. Shape after safe fetch: (3786, 45)



>>> DONE. Output summary:

Shape: (3786, 45)

Index dtype: datetime64[ns]

Columns: ['VIXY_Open', 'VIXY_High', 'VIXY_Low', 'VIXY_Close', 'VIXY_Volume', 'XPH_Open', 'XPH_High', 'XPH_Low', 'XPH_Close', 'XPH_Volume', 'XBI_Open', 'XBI_High', 'XBI_Low', 'XBI_Close', 'XBI_Volume', 'VHT_Open', 'VHT_High', 'VHT_Low', 'VHT_Close', 'VHT_Volume'] ...


DataFrame info():

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3786 entries, 2010-11-15 to 2025-12-03
Data columns (total 45 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   VIXY_Open    3752 non-null   float64
 1   VIXY_High    3752 non-null   float64
 2   VIXY_Low     3752 non-null   float64
 3   VIXY_Close   3752 non-null   float64
 4   VIXY_Volume  3752 non-null   float64
 5   XPH_Open     3786 non-null   float64
 6   XPH_High     3786 non-null   float64
 7   XPH_Low      3786 non-null   float64
 8   XPH_Close    3786 non-null   float64
 9   XPH_Volume   3786 non-null   int64  
 10  XB

In [3]:
# ============================================================
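# STEP A.5 - EARLY CLEANING OF RAW DATA
# ============================================================

# numpy is used below for the inf replacement; import it in this cell
import numpy as np
import pandas as pd
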
def clean_raw_ohlcv(df, tickers):
    """
    Early cleaning applied immediately after downloading and trimming OHLCV data.
    Ensures stable inputs for feature generation.

    Steps:
    - Drop duplicate index entries
    - Ensure index is sorted
    - Replace infinite values
    - Forward-fill *only* OHLCV values to handle non-trading gaps
    - Remove rows where some tickers have partial OHLCV (e.g. trading halts)
    - Enforce numeric dtypes
    """

    df = df.copy()

    # --------------------------------------------
    # 1. Remove duplicate dates & sort index
    # --------------------------------------------
    df = df[~df.index.duplicated(keep='first')].sort_index()

    # --------------------------------------------
    # 2. Replace infinite values
    # --------------------------------------------
    df = df.replace([np.inf, -np.inf], np.nan)

    # --------------------------------------------
    # 3. Forward-fill ONLY OHLCV fields across non-trading gaps
    # --------------------------------------------
    ohlcv_cols = []
    for t in tickers:
        for field in ["Open", "High", "Low", "Close", "Volume"]:
            col = f"{t}_{field}"
            if col in df.columns:
                ohlcv_cols.append(col)

    # Forward-fill OHLCV missing values (safe for market closures)
    df[ohlcv_cols] = df[ohlcv_cols].ffill()

    # --------------------------------------------
    # 4. Remove rows where *not all* tickers have full OHLCV
    # --------------------------------------------
    mask = df[ohlcv_cols].notna().all(axis=1)
    df = df[mask]

    # --------------------------------------------
    # 5. Enforce numeric dtype for OHLCV columns
    # --------------------------------------------
    for col in ohlcv_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # --------------------------------------------
    # 6. Remove any row that still contains NaNs in OHLCV
    # --------------------------------------------
    df = df.dropna(subset=ohlcv_cols, how="any")

    return df


In [4]:
# =====================================================================
# STEP A.5 - AUTOMATED TEST BLOCK FOR clean_raw_ohlcv
# =====================================================================

import pandas as pd
import numpy as np

def test_step_A5(df_raw, target, support_tickers):
    print("\n==============================")
    print("🔍 TESTING STEP A.5 - CLEAN RAW OHLCV")
    print("==============================\n")

    tickers = [target] + support_tickers

    # ---------------------------------------------------------
    # Run the cleaning step
    # ---------------------------------------------------------
    df_clean = clean_raw_ohlcv(df_raw, tickers=tickers)

    # ---------------------------------------------------------
    # 1. Index must be datetime, sorted, unique
    # ---------------------------------------------------------
    assert isinstance(df_clean.index, pd.DatetimeIndex), \
        "❌ Index must remain DatetimeIndex"

    assert df_clean.index.is_monotonic_increasing, \
        "❌ Index must be sorted after cleaning"

    assert df_clean.index.is_unique, \
        "❌ Duplicate dates detected after cleaning"

    print("✓ Index: datetime, sorted, unique")

    # ---------------------------------------------------------
    # 2. Collect expected OHLCV columns
    # ---------------------------------------------------------
    expected_cols = []
    for t in tickers:
        for field in ["Open", "High", "Low", "Close", "Volume"]:
            expected_cols.append(f"{t}_{field}")

    # Ensure all expected columns exist
    for col in expected_cols:
        assert col in df_clean.columns, f"❌ Missing OHLCV column: {col}"

    print(f"✓ All expected OHLCV columns present ({len(expected_cols)} columns)")

    # ---------------------------------------------------------
    # 3. OHLCV must contain only numeric values
    # ---------------------------------------------------------
    for col in expected_cols:
        assert pd.api.types.is_numeric_dtype(df_clean[col]), \
            f"❌ OHLCV column {col} is not numeric dtype"

    print("✓ All OHLCV columns are numeric")

    # ---------------------------------------------------------
    # 4. No NaNs allowed in OHLCV after cleanup
    # ---------------------------------------------------------
    na_counts = df_clean[expected_cols].isna().sum()
    assert na_counts.sum() == 0, \
        f"❌ NaNs remain in OHLCV after clean_raw_ohlcv:\n{na_counts[na_counts > 0]}"

    print("✓ No NaNs in OHLCV after cleanup")

    # ---------------------------------------------------------
    # 5. No infinite values
    # ---------------------------------------------------------
    assert not np.isinf(df_clean[expected_cols].values).any(), \
        "❌ Infinite values remain in OHLCV"

    print("✓ No infinite values in OHLCV")

    # ---------------------------------------------------------
    # 6. Forward-fill must not alter the first row
    # ---------------------------------------------------------
    for col in expected_cols:
        assert df_clean[col].iloc[0] == df_raw[col].loc[df_clean.index[0]], \
            f"❌ First-row OHLCV changed for column {col}; unsafe ffill"

    print("✓ Forward-fill did not alter first valid OHLCV row")

    # ---------------------------------------------------------
    # 7. No partial rows allowed (every ticker must have all 5 fields)
    # ---------------------------------------------------------
    row_validity = df_clean[expected_cols].notna().all(axis=1)
    assert row_validity.all(), \
        "❌ Some rows still have incomplete OHLCV"

    print("✓ All rows contain complete OHLCV data")

    # ---------------------------------------------------------
    # 8. df_clean must have fewer or equal rows (never more)
    # ---------------------------------------------------------
    assert len(df_clean) <= len(df_raw), \
        "❌ clean_raw_ohlcv added rows; this should never happen"

    print("✓ Row count is valid (no unexpected new rows)")

    # ---------------------------------------------------------
    # 9. Shape and summary
    # ---------------------------------------------------------
    print("\n🎉 STEP A.5 PASSED - Raw OHLCV cleaned safely.\n")
    print(f"Final shape after Step A.5: {df_clean.shape}")

    return df_clean


In [5]:
# GLOBAL
feature_registry = {}

def register_feature(name, shift):
    global feature_registry
    feature_registry[name] = shift
    print(f"Registered feature: {name} with shift: {shift}")

# ============================================================
# STEP B - FEATURE REGISTRY
# ============================================================

def initialize_feature_registry(df, target, support_tickers):

    global feature_registry

    # (1) RESET
    feature_registry.clear()

    # (2) Add canonical target_close
    raw_target_close = f"{target}_Close"
    df["target_close"] = df[raw_target_close].copy()

    # (3) Register canonical target
    register_feature("target_close", "no_shift")

    # (4) Loop through all columns
    for col in df.columns:

        if col == "target_close":
            continue  # already handled

        if col.endswith("_Open"):
            register_feature(col, "no_shift")
        else:
            register_feature(col, "shift_1")

    print("FINAL REGISTRY:", feature_registry)
    return feature_registry


In [6]:
# ============================================================
# STEP C - PER-TICKER TECHNICAL INDICATORS (FINAL VERSION)
# ============================================================

import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from hmmlearn.hmm import GaussianHMM


# ------------------------------------------------------------
# OHLCV Helper
# ------------------------------------------------------------
def get_ohlcv(df, ticker):
    """
    Returns the [Open, High, Low, Close, Volume] column names for a ticker,
    or None if any of the five columns is missing from df.
    """
    cols = [f"{ticker}_{x}" for x in ["Open", "High", "Low", "Close", "Volume"]]
    return cols if all(c in df.columns for c in cols) else None


# ------------------------------------------------------------
# MOMENTUM FEATURES
# ------------------------------------------------------------
def feat_returns(df, ticker, close):
    feats = {}
    for w in [1, 5, 10, 20]:
        name = f"{ticker}_Return_{w}d"
        feats[name] = df[close].pct_change(w)
        register_feature(name, "shift_1")
    return feats


# ------------------------------------------------------------
# TREND FEATURES (SMA/EMA)
# ------------------------------------------------------------
def feat_sma_ema(df, ticker, close):
    feats = {}

    name = f"{ticker}_SMA_10"
    feats[name] = df[close].rolling(10).mean()
    register_feature(name, "shift_1")

    # name = f"{ticker}_SMA_50"
    # feats[name] = df[close].rolling(50).mean()
    # register_feature(name, "shift_1")

    name = f"{ticker}_EMA_20"
    feats[name] = df[close].ewm(span=20).mean()
    register_feature(name, "shift_1")

    return feats


# ------------------------------------------------------------
# MACD FEATURES
# ------------------------------------------------------------
def feat_macd(df, ticker, close):
    feats = {}

    ema12 = df[close].ewm(span=12).mean()
    ema26 = df[close].ewm(span=26).mean()
    macd = ema12 - ema26

    name = f"{ticker}_MACD"
    feats[name] = macd
    register_feature(name, "shift_1")

    name = f"{ticker}_MACD_sig"
    feats[name] = macd.ewm(span=9).mean()
    register_feature(name, "shift_1")

    name = f"{ticker}_MACD_hist"
    feats[name] = feats[f"{ticker}_MACD"] - feats[f"{ticker}_MACD_sig"]
    register_feature(name, "shift_1")

    return feats


# ------------------------------------------------------------
# RSI FEATURE
# ------------------------------------------------------------
def feat_rsi(df, ticker, close):
    feats = {}

    delta = df[close].diff()
    up = delta.clip(lower=0).rolling(14).mean()
    down = (-delta.clip(upper=0)).rolling(14).mean()
    rs = up / down

    name = f"{ticker}_RSI_14"
    feats[name] = 100 - (100 / (1 + rs))
    register_feature(name, "shift_1")

    return feats


# ------------------------------------------------------------
# STOCHASTICS FEATURES
# ------------------------------------------------------------
def feat_stochastics(df, ticker, close):
    feats = {}

    low14 = df[close].rolling(14).min()
    high14 = df[close].rolling(14).max()

    name = f"{ticker}_StochK"
    feats[name] = 100 * (df[close] - low14) / (high14 - low14)
    register_feature(name, "shift_1")

    name = f"{ticker}_StochD"
    feats[name] = feats[f"{ticker}_StochK"].rolling(3).mean()
    register_feature(name, "shift_1")

    return feats


# ------------------------------------------------------------
# VOLATILITY FEATURES
# ------------------------------------------------------------
def feat_volatility(df, ticker, o, h, l, c, returns_dict):
    feats = {}
    ret1 = returns_dict[f"{ticker}_Return_1d"]

    name = f"{ticker}_Vol_20"
    feats[name] = ret1.rolling(20).std()
    register_feature(name, "shift_1")

    name = f"{ticker}_Parkinson_20"
    feats[name] = ((np.log(df[h]/df[l])**2).rolling(20).mean() * (1/(4*np.log(2))))
    register_feature(name, "shift_1")

    # Garman-Klass: 0.5*ln(H/L)^2 - (2*ln(2) - 1)*ln(C/O)^2
    name = f"{ticker}_GK_20"
    feats[name] = (
        0.5*(np.log(df[h]/df[l])**2)
        - (2*np.log(2)-1)*(np.log(df[c]/df[o])**2)
    ).rolling(20).mean()
    register_feature(name, "shift_1")

    sma20 = df[c].rolling(20).mean()
    std20 = df[c].rolling(20).std()

    name = f"{ticker}_BB_width"
    feats[name] = (2 * std20) / sma20
    register_feature(name, "shift_1")

    return feats


# ------------------------------------------------------------
# VOLUME FEATURES
# ------------------------------------------------------------
def feat_volume(df, ticker, v, c):
    feats = {}

    name = f"{ticker}_Volume_ROC"
    feats[name] = df[v].pct_change(5)
    register_feature(name, "shift_1")

    name = f"{ticker}_Volume_Z"
    feats[name] = (df[v] - df[v].rolling(20).mean()) / df[v].rolling(20).std()
    register_feature(name, "shift_1")

    name = f"{ticker}_OBV"
    feats[name] = (np.sign(df[c].diff()) * df[v]).fillna(0).cumsum()
    register_feature(name, "shift_1")

    return feats


# ------------------------------------------------------------
# ENTROPY FEATURE
# ------------------------------------------------------------
def feat_entropy(df, ticker):
    feats = {}

    col = f"{ticker}_Return_1d"
    if col in df.columns:
        name = f"{ticker}_Entropy_20"
        feats[name] = (df[col]**2).rolling(20).sum()
        register_feature(name, "shift_1")

    return feats


# ------------------------------------------------------------
# HMM BLOCK (optional)
# ------------------------------------------------------------
def feat_hmm(df, ticker, returns_dict, n_states=3):
    """
    Fit a simple Gaussian HMM on 1-day returns and output hidden state sequence.
    Output MUST be same length as df; rows without a return stay NaN.

    Note: the model is fit on every available return, so the state labels
    are estimated from the full sample.
    """
    feats = {}
    col = f"{ticker}_HMM"
    ret = returns_dict.get(f"{ticker}_Return_1d")

    if ret is None:
        return feats

    # Drop NaNs for fitting, but keep the index for re-alignment
    clean = ret.dropna()

    if len(clean) < 50:
        # Not enough data for HMM
        feats[col] = pd.Series(np.nan, index=df.index)
        register_feature(col, "shift_1")
        return feats

    X = clean.values.reshape(-1, 1)

    # Fit HMM
    model = GaussianHMM(n_components=n_states, covariance_type="full", n_iter=200)
    model.fit(X)

    # Predict states and align them by date rather than by position,
    # so a NaN anywhere in the return series cannot offset the labels
    hidden_states = pd.Series(model.predict(X), index=clean.index, dtype=float)
    feats[col] = hidden_states.reindex(df.index)

    register_feature(col, "shift_1")
    return feats



# ------------------------------------------------------------
# MASTER WRAPPER (FIXED)
# ------------------------------------------------------------
def generate_per_ticker_features(df, tickers, use_hmm=True):
    """
    Computes all technical indicators for each ticker.
    Ensures correct sequencing:

        returns → trend → MACD → RSI → Stochastics → volatility → volume → entropy → HMM
    """
    all_feats = {}

    # We make a working copy so df is never modified outside Step C
    df_local = df.copy()

    for ticker in tqdm(tickers, desc="Per-ticker technicals"):
        ohlcv = get_ohlcv(df_local, ticker)
        if not ohlcv:
            continue

        o, h, l, c, v = ohlcv

        # --------------------------------------------
        # 1. RETURNS (must run first)
        # --------------------------------------------
        returns_dict = feat_returns(df_local, ticker, c)
        all_feats.update(returns_dict)

        # Insert returns into df_local so entropy & others can see them
        for k, s in returns_dict.items():
            df_local[k] = s

        # --------------------------------------------
        # 2. TREND
        # --------------------------------------------
        feats = feat_sma_ema(df_local, ticker, c)
        all_feats.update(feats)

        # --------------------------------------------
        # 3. MACD
        # --------------------------------------------
        feats = feat_macd(df_local, ticker, c)
        all_feats.update(feats)

        # --------------------------------------------
        # 4. MOMENTUM
        # --------------------------------------------
        feats = feat_rsi(df_local, ticker, c)
        all_feats.update(feats)

        feats = feat_stochastics(df_local, ticker, c)
        all_feats.update(feats)

        # --------------------------------------------
        # 5. VOLATILITY (depends on Return_1d)
        # --------------------------------------------
        feats = feat_volatility(df_local, ticker, o, h, l, c, returns_dict)
        all_feats.update(feats)

        # --------------------------------------------
        # 6. VOLUME
        # --------------------------------------------
        feats = feat_volume(df_local, ticker, v, c)
        all_feats.update(feats)

        # --------------------------------------------
        # 7. ENTROPY  (now works, because Return_1d is in df_local)
        # --------------------------------------------
        feats = feat_entropy(df_local, ticker)
        all_feats.update(feats)

        # --------------------------------------------
        # 8. OPTIONAL HMM
        # --------------------------------------------
        if use_hmm:
            feats = feat_hmm(df_local, ticker, returns_dict)
            all_feats.update(feats)

    return all_feats


In [7]:
# =====================================================================
# STEP C - AUTOMATED TEST BLOCK FOR generate_per_ticker_features
# =====================================================================

import pandas as pd
import numpy as np

def test_step_C(df_cleanA, target, support_tickers, use_hmm=True):

    print("\n==============================================")
    print("🔍 TESTING STEP C - PER-TICKER TECHNICALS")
    print("==============================================\n")

    tickers = [target] + support_tickers

    # ---------------------------------------------------------
    # 1. Run Step C
    # ---------------------------------------------------------
    tech_feats = generate_per_ticker_features(
        df_cleanA,
        tickers=tickers,
        use_hmm=use_hmm
    )

    assert isinstance(tech_feats, dict), \
        "❌ Step C must return a dict of feature_name → Series"

    print(f"✓ Returned a dict with {len(tech_feats)} features")

    # ---------------------------------------------------------
    # 2. Check all expected features exist for each ticker
    # ---------------------------------------------------------
    # Suffixes emitted by generate_per_ticker_features
    expected_suffixes = [
        "Return_1d", "Return_5d", "Return_10d", "Return_20d",
        "SMA_10",
        "EMA_20",
        "MACD", "MACD_sig", "MACD_hist",
        "RSI_14",
        "StochK", "StochD",
        "Vol_20",
        "Parkinson_20",
        "GK_20",
        "BB_width",
        "Volume_ROC",
        "Volume_Z",
        "OBV",
        "Entropy_20",
    ]

    if use_hmm:
        expected_suffixes.append("HMM")

    for t in tickers:
        for suf in expected_suffixes:
            feat = f"{t}_{suf}"
            assert feat in tech_feats, f"❌ Missing feature: {feat}"

    print(f"✓ All expected per-ticker features present for {len(tickers)} tickers")

    # ---------------------------------------------------------
    # 3. Verify each feature is a Pandas Series aligned to df_cleanA
    # ---------------------------------------------------------
    for name, series in tech_feats.items():
        assert isinstance(series, pd.Series), \
            f"❌ Feature {name} is not a Pandas Series"

        assert len(series) == len(df_cleanA), \
            f"❌ Feature {name} length mismatch"

    print("✓ All features are Series and correctly aligned")

    # ---------------------------------------------------------
    # 4. Check numeric dtype for all features (HMM states included)
    # ---------------------------------------------------------
    for name, series in tech_feats.items():
        if name.endswith("_HMM"):
            # HMM states are stored as floats so that NaN padding is possible
            assert pd.api.types.is_numeric_dtype(series), \
                f"❌ HMM feature {name} must be numeric"
        else:
            assert pd.api.types.is_numeric_dtype(series), \
                f"❌ Feature {name} must have numeric dtype"

    print("✓ All features have numeric dtype")

    # ---------------------------------------------------------
    # 5. Quick sanity check for NaN explosion (expected early NaNs OK)
    # ---------------------------------------------------------
    # Rolling features = NaNs at top (OK)
    # But the entire column cannot be NaN.
    for name, series in tech_feats.items():
        assert series.notna().sum() > 10, \
            f"❌ Feature {name} seems to be all-NaN (bad rolling window?)"

    print("✓ Each feature has valid non-NaN data")

    # ---------------------------------------------------------
    # 6. Check registry integration (all shift_1)
    # ---------------------------------------------------------
    all_rules = get_feature_registry()

    # Every feature in tech_feats must exist in registry
    missing_in_registry = [
        f for f in tech_feats.keys() if f not in all_rules
    ]
    assert len(missing_in_registry) == 0, \
        f"❌ Missing registry entries:\n{missing_in_registry}"

    # All rules must be shift_1 for technicals
    bad_rules = {
        f: r for f, r in all_rules.items()
        if f in tech_feats and r != "shift_1"
    }
    assert len(bad_rules) == 0, \
        f"❌ Some technical features have incorrect shift rules:\n{bad_rules}"

    print("✓ Registry entries valid and all marked shift_1")

    print("\n🎉 STEP C PASSED - Technical Indicators Valid\n")

    return tech_feats


In [8]:
# ============================================================
# STEP D - CROSS-TICKER FEATURES
# ============================================================

def compute_cross_ticker_features(df, target, support_tickers):
    """
    Builds cross-ticker features comparing each support ticker to the target.
    Returns a dict of {feature_name: Series}.
    """

    feats = {}

    target_close = f"{target}_Close"
    if target_close not in df.columns:
        raise ValueError(f"Target {target_close} not found in dataframe.")

    target_ret_20 = f"{target}_Return_20d"
    if target_ret_20 not in df.columns:
        # Should never happen because Step C adds it
        raise ValueError(f"{target_ret_20} missing: Step C must run first.")

    for ticker in support_tickers:
        close_col = f"{ticker}_Close"
        ret20_col = f"{ticker}_Return_20d"

        if close_col not in df.columns:
            continue  # skip incomplete tickers

        # ------------------------------
        # Price Ratio
        # ------------------------------
        name = f"{ticker}_Ratio_{target}"
        feats[name] = df[close_col] / df[target_close]
        register_feature(name, "shift_1")

        # ------------------------------
        # RS-20 (Relative Strength)
        # ------------------------------
        if ret20_col in df.columns:
            name = f"{ticker}_RS_20"
            feats[name] = df[ret20_col] - df[target_ret_20]
            register_feature(name, "shift_1")

        # ------------------------------
        # 60-Day Rolling Correlation
        # ------------------------------
        # name = f"{ticker}_Corr_{target}_60"
        # feats[name] = df[close_col].rolling(60).corr(df[target_close])
        # register_feature(name, "shift_1")

    return feats
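
As a worked illustration of the two cross-ticker features above, the sketch below computes a price ratio and a relative-strength spread on toy price lists, without pandas. The helper names `n_day_return` and `relative_strength` and the toy prices are hypothetical, and the sketch assumes that `Return_20d` is a simple n-period percentage return (Step C defines the real one).

```python
def n_day_return(closes, n):
    """Simple n-period return; None until n earlier observations exist."""
    return [
        None if i < n else closes[i] / closes[i - n] - 1.0
        for i in range(len(closes))
    ]


def relative_strength(support_closes, target_closes, n):
    """Support return minus target return over the same n-period window."""
    sup = n_day_return(support_closes, n)
    tgt = n_day_return(target_closes, n)
    return [
        None if s is None or t is None else s - t
        for s, t in zip(sup, tgt)
    ]


# Toy prices (hypothetical), using a 2-period window instead of 20
target = [100.0, 102.0, 101.0, 105.0]
support = [50.0, 50.5, 52.0, 51.0]

ratio = [s / t for s, t in zip(support, target)]  # analogue of <ticker>_Ratio_<target>
rs_2 = relative_strength(support, target, n=2)    # analogue of <ticker>_RS_20

print(ratio[0])  # 0.5
print(rs_2)      # first two entries are None (not enough history)
```

Both features use the close at time t, which is why the pipeline registers them as `shift_1`.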


In [9]:
# ============================================================
# STEP E - PCA LATENT FACTORS
# ============================================================

import logging

import pandas as pd
from sklearn.decomposition import PCA

def compute_pca_features(df, tickers):
    """
    Computes PCA on the 1-day returns of all tickers.
    Returns a dict of {feature_name: Series}.
    """

    feats = {}

    # Collect all 1-day return columns
    ret_cols = [f"{t}_Return_1d" for t in tickers if f"{t}_Return_1d" in df.columns]

    if len(ret_cols) == 0:
        # Should never happen (Step C must generate them)
        return feats

    # Extract data for PCA
    ret_df = df[ret_cols].dropna()

    # Need enough rows for PCA stability
    if len(ret_df) < 200:
        return feats

    try:
        # NOTE: the PCA is fitted on the full sample, so the loadings use
        # information from the whole history, including dates after each row.
        pca = PCA(n_components=3)
        pca_values = pca.fit_transform(ret_df)

        # Store each component as its own feature: PCA_1, PCA_2, PCA_3
        for i in range(pca_values.shape[1]):
            name = f"PCA_{i + 1}"
            feats[name] = pd.Series(pca_values[:, i], index=ret_df.index)
            register_feature(name, "shift_1")

    except Exception as e:
        logging.warning(f"PCA failed: {e}")

    return feats
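
Because `compute_pca_features` fits the PCA on the full return history, the component loadings at any date depend on later data. One possible way to avoid that, sketched here with NumPy only, is to estimate the mean and loadings on an initial training window and then project every row with those fixed loadings. The function name `fit_pca_on_window`, the `n_train` cut-off, and the random toy returns are assumptions for illustration, not part of the pipeline.

```python
import numpy as np


def fit_pca_on_window(returns, n_train, n_components=3):
    """Fit PCA on the first n_train rows only, then project all rows.

    returns : 2-D array, rows = dates (ascending), columns = tickers.
    """
    train = returns[:n_train]
    mean = train.mean(axis=0)

    # Principal axes are the right singular vectors of the centered training data
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    components = vt[:n_components]

    # Rows after n_train are projected with loadings that never saw them
    scores = (returns - mean) @ components.T
    return scores, components


rng = np.random.default_rng(0)
toy_returns = rng.normal(scale=0.01, size=(300, 9))  # hypothetical 9-ticker panel

scores, components = fit_pca_on_window(toy_returns, n_train=200)
print(scores.shape)  # (300, 3)
```

A walk-forward variant would refit at fixed intervals. Component signs can flip between refits, so some sign convention would be needed in that case.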


In [None]:
# ============================================================
# STEP F - CALENDAR + MACRO FEATURES (COMPRESSED, NO 'date')
# ============================================================

import pandas as pd
import holidays

# -----------------------------
# FOMC Calendar Fetcher
# -----------------------------
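def fetch_fomc_dates():
    """Return FOMC meeting dates as a sorted DatetimeIndex.

    Tries to scrape the Federal Reserve calendar page; on any failure,
    falls back to a hard-coded list of 2024 meeting dates.
    """
    url = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"
    try:
        tables = pd.read_html(url)
        dates = pd.to_datetime(tables[0].iloc[:, 0], errors="coerce").dropna()
        if len(dates) == 0:
            raise ValueError("No parseable dates found on the FOMC page.")
        # Return a DatetimeIndex so the distance helpers can index it by position
        return pd.DatetimeIndex(dates).sort_values()
    except Exception:
        fallback = pd.to_datetime([
            "2024-01-31","2024-03-20","2024-05-01",
            "2024-06-12","2024-07-31","2024-09-18",
            "2024-11-07","2024-12-18"
        ])
        return fallback.sort_values()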

# -----------------------------
# Macro Calendar
# -----------------------------
def macro_calendar():
    """Return hard-coded macro release dates (January-April 2024 only), keyed by event name."""
    events = {
        "cpi": ["2024-01-11","2024-02-13","2024-03-12","2024-04-10"],
        "nfp": ["2024-01-05","2024-02-02","2024-03-08","2024-04-05"],
        "ppi": ["2024-01-12","2024-02-16","2024-03-14","2024-04-11"],
        "gdp": ["2024-01-25","2024-02-28","2024-03-28","2024-04-25"],
    }
    return {k: pd.to_datetime(v).sort_values() for k, v in events.items()}


# ============================================================
# HELPER - Compute distances to events
# ============================================================

def _days_to_next(idx, event_dates):
    """Return Series: days until the next event (0 on an event day).

    Assumes idx and event_dates are both sorted ascending.
    """
    out = []
    j = 0

    for d in idx:
        while j < len(event_dates) and event_dates[j] < d:
            j += 1
        if j == len(event_dates):
            out.append(None)  # No next event
        else:
            out.append((event_dates[j] - d).days)

    return pd.Series(out, index=idx)


def _days_since_prev(idx, event_dates):
    """Return Series: days since the previous event (0 on an event day).

    Assumes idx and event_dates are both sorted ascending.
    """
    out = []
    j = 0

    for d in idx:
        # Find last event <= d
        while j < len(event_dates) and event_dates[j] <= d:
            j += 1
        if j == 0:
            out.append(None)
        else:
            out.append((d - event_dates[j-1]).days)

    return pd.Series(out, index=idx)


# ============================================================
# CALENDAR BASICS (NO SHIFT)
# ============================================================

def add_calendar_basics(df):
    idx = df.index

    features = {
        "day_of_week":    idx.dayofweek,
        "day_of_month":   idx.day,
        "month":          idx.month,
        "quarter":        idx.quarter,
        "is_month_end":   idx.is_month_end.astype(int),
        "is_month_start": idx.is_month_start.astype(int)
    }

    for name, series in features.items():
        df[name] = series
        register_feature(name, "no_shift")

    return df


# ============================================================
# HOLIDAYS (NO SHIFT)
# ============================================================

def add_holiday_features(df):
    us_holidays = holidays.US()
    idx = df.index

    df["is_holiday_adjacent"] = [
        int((d + pd.Timedelta(days=1) in us_holidays) or
            (d - pd.Timedelta(days=1) in us_holidays))
        for d in idx
    ]
    register_feature("is_holiday_adjacent", "no_shift")

    return df


# ============================================================
# OPEX WEEK (NO SHIFT)
# ============================================================

def add_opex_features(df):
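    idx = df.index
    # Flags the third Friday of the month (the only Friday with day 15-21),
    # i.e. the monthly options-expiration day itself rather than the whole week.
    df["is_opex_week"] = [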
        int((d.weekday() == 4) and (15 <= d.day <= 21))
        for d in idx
    ]
    register_feature("is_opex_week", "no_shift")
    return df


# ============================================================
# FOMC (COMPRESSED)
# ============================================================

def add_fomc_features(df):
    idx = df.index
    dates = fetch_fomc_dates()

    df["is_fomc_day"]        = idx.isin(dates).astype(int)
    df["days_to_fomc"]       = _days_to_next(idx, dates)
    df["days_since_fomc"]    = _days_since_prev(idx, dates)

    for col in ["is_fomc_day", "days_to_fomc", "days_since_fomc"]:
        register_feature(col, "no_shift")

    return df


# ============================================================
# MACRO EVENTS (COMPRESSED)
# ============================================================

def add_macro_features(df):
    idx = df.index
    macros = macro_calendar()

    for name, dates in macros.items():

        df[f"is_{name}_day"]          = idx.isin(dates).astype(int)
        df[f"days_to_{name}"]        = _days_to_next(idx, dates)
        df[f"days_since_{name}"]     = _days_since_prev(idx, dates)

        register_feature(f"is_{name}_day",      "no_shift")
        register_feature(f"days_to_{name}",     "no_shift")
        register_feature(f"days_since_{name}",  "no_shift")

    return df


# ============================================================
# MASTER WRAPPER
# ============================================================

def generate_calendar_and_macro_features(df):
    df = df.copy()
    df = add_calendar_basics(df)
    df = add_holiday_features(df)
    df = add_opex_features(df)
    df = add_fomc_features(df)
    df = add_macro_features(df)

    # ------------------------------------------------------------
    # FIX: Macro distance features produce unavoidable NaNs
    # ------------------------------------------------------------
    # Any column like days_since_* or days_to_* will have NaNs:
    #   - days_since_* → NaN before the FIRST macro event
    #   - days_to_*    → NaN after the LAST macro event
    # These NaNs should NOT cause row deletion in the cleaning step.
    #
    # Strategy:
    #   Impute with a large sentinel value (999) so the model
    #   interprets "far from event" properly.
    # ------------------------------------------------------------

    dist_cols = [c for c in df.columns
                 if c.startswith("days_since_") or c.startswith("days_to_")]

    if dist_cols:
        df[dist_cols] = df[dist_cols].fillna(999)
    return df
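
The two distance helpers above walk the event list with a moving pointer, which requires a sorted index. An equivalent lookup for a single date can be sketched with the standard-library `bisect` module. The function names and the query dates below are illustrative, and the `999` fallback mirrors the sentinel used in `generate_calendar_and_macro_features`.

```python
from bisect import bisect_left, bisect_right
from datetime import date

SENTINEL = 999  # same "far from event" value used by the pipeline


def days_to_next(d, events):
    """Days until the first event on or after d (0 on an event day)."""
    j = bisect_left(events, d)
    return SENTINEL if j == len(events) else (events[j] - d).days


def days_since_prev(d, events):
    """Days since the last event on or before d (0 on an event day)."""
    j = bisect_right(events, d)
    return SENTINEL if j == 0 else (d - events[j - 1]).days


# CPI release dates taken from macro_calendar(); must be sorted ascending
cpi = [date(2024, 1, 11), date(2024, 2, 13), date(2024, 3, 12), date(2024, 4, 10)]

print(days_to_next(date(2024, 2, 1), cpi))     # 12
print(days_since_prev(date(2024, 2, 1), cpi))  # 21
print(days_to_next(date(2024, 5, 1), cpi))     # 999 (no later event)
```

With a calendar that covers only part of the sample, most rows fall outside the event range and receive the sentinel, so the distance columns carry information only near the listed dates.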


In [11]:
# ============================================================
# STEP G - SHIFT ENGINE (Column-Replacing, Registry-Driven)
# ============================================================

def apply_shift_engine(df, horizon):
    """
    Applies registry-driven temporal alignment AND updates the registry.

    For each column:
        - no_shift:       keep col(t)
        - shift_1:        replace col   → col_t-1
        - shift_plus_k:   replace col   → col_t+H

    Additional improvements:
        ✔ Registry is rewritten to match final column names
        ✔ Markdown output will now be accurate
        ✔ No duplicate columns
        ✔ No original (pre-shift) columns remain
    """

    global feature_registry

    df = df.copy()
    out = {}
    new_registry = {}     # fully rebuild registry from scratch

    # ----------------------------------------------------------
    # 1. Validate registry vs dataframe columns
    # ----------------------------------------------------------
    df_cols = set(df.columns)
    reg_cols = set(feature_registry.keys())

    missing = df_cols - reg_cols
    extra   = reg_cols - df_cols

    if missing:
        raise ValueError(f"Registry missing {len(missing)} columns: {sorted(missing)}")

    if extra:
        raise ValueError(f"Registry contains columns not in df: {sorted(extra)}")

    # ----------------------------------------------------------
    # 2. Apply transformations + rebuild registry
    # ----------------------------------------------------------
    for col in df.columns:
        rule = feature_registry[col]
        s = df[col]

        # ----- no_shift → keep original name -----
        if rule == "no_shift":
            new_name = col
            out[new_name] = s
            new_registry[new_name] = "no_shift"

        # ----- shift_1 → output col_t-1 -----
        elif rule == "shift_1":
            new_name = f"{col}_t-1"
            out[new_name] = s.shift(1)
            new_registry[new_name] = "shift_1"    # final shift label

        # ----- shift_plus_k → output col_t+H -----
        elif rule == "shift_plus_k":
            new_name = f"{col}_t+{horizon}"
            out[new_name] = s.shift(-horizon)
            new_registry[new_name] = f"shift_plus_{horizon}"

        else:
            raise ValueError(f"Unknown shift rule: {rule}")

    # ----------------------------------------------------------
    # 3. Replace registry with the new post-shift registry
    # ----------------------------------------------------------
    feature_registry = new_registry

    # ----------------------------------------------------------
    # 4. Build final DataFrame
    # ----------------------------------------------------------
    aligned_df = pd.DataFrame(out, index=df.index)

    return aligned_df
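
To make the three alignment rules concrete, here is a small pandas-free sketch that applies them to plain lists. `shift_list` imitates `Series.shift` by padding with `None`; it, `align`, and the toy data are hypothetical stand-ins, not the pipeline's own objects. The registry entries reuse the example rules from the Feature Registry section.

```python
def shift_list(values, k):
    """Imitate pandas Series.shift(k): positive k lags, negative k leads."""
    n = len(values)
    if k == 0:
        return list(values)
    if abs(k) >= n:
        return [None] * n
    if k > 0:
        return [None] * k + list(values[: n - k])
    return list(values[-k:]) + [None] * (-k)


def align(columns, registry, horizon):
    """Apply the registry rules and rename columns the way Step G does."""
    out = {}
    for name, values in columns.items():
        rule = registry[name]
        if rule == "no_shift":
            out[name] = list(values)
        elif rule == "shift_1":
            out[f"{name}_t-1"] = shift_list(values, 1)
        elif rule == "shift_plus_k":
            out[f"{name}_t+{horizon}"] = shift_list(values, -horizon)
        else:
            raise ValueError(f"Unknown shift rule: {rule}")
    return out


toy_columns = {
    "PPH_Return_1d": [0.01, -0.02, 0.03, 0.00],
    "day_of_week": [0, 1, 2, 3],
    "is_cpi_day": [0, 0, 1, 0],
}
toy_registry = {
    "PPH_Return_1d": "shift_1",
    "day_of_week": "no_shift",
    "is_cpi_day": "shift_plus_k",
}

aligned = align(toy_columns, toy_registry, horizon=2)
print(aligned["PPH_Return_1d_t-1"])  # [None, 0.01, -0.02, 0.03]
print(aligned["is_cpi_day_t+2"])     # [1, 0, None, None]
```

The `None` padding shows where the real engine produces NaNs: one row at the top for `shift_1` and `horizon` rows at the bottom for `shift_plus_k`.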


In [12]:
# ============================================================
# STEP H - MARKDOWN FEATURE DOCUMENTATION GENERATOR
# ============================================================

import os

def generate_markdown_feature_doc(path, horizon):
    """
    Creates a Markdown file documenting:
    - every feature registered
    - the shift rule
    - the resulting final columns after Step G

    Parameters
    ----------
    path : str
        Output path to write markdown file (e.g. "docs/feature_table.md")
    horizon : int
        Forecast horizon, used for shift_plus_k expansion
    """

    lines = []
    lines.append("# Feature Transformation Table\n")
    lines.append("This table is auto-generated from the feature pipeline.\n")
    lines.append("\n")
    lines.append("| Feature | Rule | Output Columns |\n")
    lines.append("|---------|------|----------------|\n")

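    for feature, rule in feature_registry.items():

        # ----------------------------------------------
        # shift_1
        # ----------------------------------------------
        if rule == "shift_1":
            # After Step G the registry keys already carry the "_t-1" suffix
            output_cols = feature if feature.endswith("_t-1") else f"{feature}_t-1"

        # ----------------------------------------------
        # no_shift
        # ----------------------------------------------
        elif rule == "no_shift":
            output_cols = feature

        # ----------------------------------------------
        # shift_plus_k (pre-shift registry)
        # ----------------------------------------------
        elif rule == "shift_plus_k":
            shifted = [f"{feature}_t+{k}" for k in range(1, horizon + 1)]
            output_cols = ", ".join(shifted)

        # ----------------------------------------------
        # shift_plus_<H> (post-shift label written by Step G)
        # ----------------------------------------------
        elif rule.startswith("shift_plus_"):
            output_cols = feature

        else:
            output_cols = "ERROR_UNKNOWN_RULE"

        # add row
        lines.append(f"| `{feature}` | `{rule}` | `{output_cols}` |\n")

    # Ensure directory exists (skip when path has no directory component)
    out_dir = os.path.dirname(path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    # Write file
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines)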

    print(f"Markdown feature documentation written to: {path}")


In [13]:
# ============================================================
# FINAL PRODUCTION CLEANER FOR THE FEATURE DATASET
# ============================================================

import numpy as np
import pandas as pd

def clean_final_dataset(df, target, min_history=20):
    """
    Cleans the aligned dataset after the shift engine.

    Actions performed:
        1. Drop rows where absolutely no features exist
        2. Drop rows with fewer than `min_history` non-NaN feature values
        3. (Not yet implemented) Drop last rows with NaNs caused by shift_plus_k
        4. Remove columns that are all-NaN or constant
           AND update the registry accordingly
        5. Ensure index is clean, sorted, unique
    """

    global feature_registry

    df = df.copy()

    # ---------------------------------------------------------
    # 1. Drop rows that are literally all NaN
    # ---------------------------------------------------------
    df = df.dropna(how="all")

    # ---------------------------------------------------------
    # 2. Drop early rows lacking enough usable feature history
    #    (min_history = minimum number of non-NaN feature values per row)
    # ---------------------------------------------------------
    non_target_cols = [
        c for c in df.columns
        if c not in {target, "target_close"}
    ]

    df["valid_feature_count"] = df[non_target_cols].notna().sum(axis=1)
    df = df[df["valid_feature_count"] >= min_history]
    df = df.drop(columns=["valid_feature_count"], errors="ignore")

    # ---------------------------------------------------------
    # 3. Drop trailing NaN rows (shift_plus_k consequences)
    # ---------------------------------------------------------
    # Not implemented here: no rows are dropped in this step.

    # ---------------------------------------------------------
    # 4. Remove all-NaN columns and constant columns
    #    AND update the registry BEFORE dropping
    # ---------------------------------------------------------

    cols_all_nan = df.columns[df.isna().all()].tolist()

    # Constant columns (e.g., is_year_end when dataset has no Dec 31)
    cols_constant = [c for c in df.columns if df[c].nunique() <= 1]

    # Combine everything to remove
    cols_to_remove = set(cols_all_nan + cols_constant)

    # ----- 1) Remove from registry -----
    for col in cols_to_remove:
        feature_registry.pop(col, None)

    # ----- 2) Remove from dataframe safely -----
    df = df.drop(columns=list(cols_to_remove), errors="ignore")


    # ---------------------------------------------------------
    # 5. Final index validations
    # ---------------------------------------------------------
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("Final dataset index must be DatetimeIndex.")

    df = df.sort_index()

    if df.index.duplicated().any():
        raise ValueError("Duplicate timestamps detected.")

    return df


In [14]:
# ============================================================
# STEP J - MASTER PIPELINE - COMPLETE DATASET BUILDER
# ============================================================

def build_feature_dataset(
    target,
    support_tickers,
    start_date,
    end_date,
    horizon,
    markdown_output_path=None,
    USE_HMM=False
):
    """
    Full leakage-safe feature engineering pipeline.
    """

    # ---------------------------------------------------------
    # 1. DOWNLOAD RAW PRICE DATA
    # ---------------------------------------------------------
    df = download_and_prepare_data(
        target=target,
        support_tickers=support_tickers,
        start_date=start_date,
        end_date=end_date
    )
    print(df.shape)
    tickers = [target] + support_tickers



    # ---------------------------------------------------------
    # 2. EARLY RAW DATA CLEANUP
    # ---------------------------------------------------------
    def clean_raw_ohlcv(df, tickers):
        """
        Ensures raw OHLCV downloaded from yfinance is clean, sorted, and valid.
        """
        df = df.copy()

        # Ensure index is sorted and unique
        df = df[~df.index.duplicated()].sort_index()

        # Replace inf
        df = df.replace([np.inf, -np.inf], np.nan)

        # Drop any row where ALL tickers have missing OHLCV
        required_cols = []
        for t in tickers:
            for f in ["Open", "High", "Low", "Close", "Volume"]:
                required_cols.append(f"{t}_{f}")

        df = df.dropna(subset=required_cols, how="all")

        return df

    # Apply the cleanup defined above before any features are computed
    df = clean_raw_ohlcv(df, tickers)
    print(df.shape)


    # ---------------------------------------------------------
    # 3. RESET FEATURE REGISTRY
    # ---------------------------------------------------------
    initialize_feature_registry(df, target, support_tickers)

    # ---------------------------------------------------------
    # 4. PER-TICKER TECHNICAL FEATURES  (Step C)
    # ---------------------------------------------------------
    tech_feats = generate_per_ticker_features(df, tickers, use_hmm=USE_HMM)

    # Merge tech features so that Step D & Step E can see them
    df_with_tech = pd.concat(
        [df, pd.DataFrame(tech_feats, index=df.index)], axis=1
    )

    print(df_with_tech.shape)

    # ---------------------------------------------------------
    # 5. CROSS-TICKER FEATURES (Step D)
    # ---------------------------------------------------------
    cross_feats = compute_cross_ticker_features(df_with_tech, target, support_tickers)

    # ---------------------------------------------------------
    # 6. PCA LATENT FACTORS (Step E)
    # ---------------------------------------------------------
    pca_feats = compute_pca_features(df_with_tech, tickers)


    # ---------------------------------------------------------
    # 7. CALENDAR + MACRO FEATURES
    # ---------------------------------------------------------
    df_calendar = generate_calendar_and_macro_features(df)

    print(df_calendar.shape)


    # ---------------------------------------------------------
    # 8. MERGE ALL FEATURE BLOCKS
    # ---------------------------------------------------------
    all_feats = {}
    all_feats.update(tech_feats)
    all_feats.update(cross_feats)
    all_feats.update(pca_feats)

    df_all = pd.concat(
        [df_calendar, pd.DataFrame(all_feats, index=df.index)],
        axis=1
    )

    print(df_all.shape)

    # ---------------------------------------------------------
    # 9. SHIFT ENGINE (APPLY TEMPORAL ALIGNMENT)
    # ---------------------------------------------------------
    df_aligned = apply_shift_engine(
        df_all,
        horizon=horizon
    )
    print(df_aligned.shape)

    # ---------------------------------------------------------
    # 10. CLEAN FINAL DATASET
    # ---------------------------------------------------------

    df_cleaned = clean_final_dataset(df_aligned, target, min_history=20)
    print(df_cleaned.shape)

    # ---------------------------------------------------------
    # 11. OPTIONAL: FEATURE DOCUMENTATION
    # ---------------------------------------------------------
    if markdown_output_path is not None:
        generate_markdown_feature_doc(
            path=markdown_output_path,
            horizon=horizon
        )

    # ---------------------------------------------------------
    # 12. TRIM AND RETURN MODEL-READY DATAFRAME
    # ---------------------------------------------------------

    df = df_cleaned.loc[start_date : end_date]
    print(df.shape)
    return df

In [15]:
TARGET_TICKER = "PPH"
SUPPORT_TICKERS = [
    "XPH", "IHE", "IBB", "XBI", "XLV", "VHT", "SPY", "VIXY"
]

START_DATE = "2014-02-04"
END_DATE = "2025-08-03"
HORIZON = 30
USE_HMM = True  # NOTE: never passed to build_feature_dataset below, so the default USE_HMM=False applies
df_final = build_feature_dataset(
    target=TARGET_TICKER,
    support_tickers=SUPPORT_TICKERS,
    start_date=START_DATE,
    end_date=END_DATE,
    horizon=HORIZON,
    markdown_output_path="docs/features_no_clean.md"    # Optional
)

Downloading data for: ['PPH', 'XPH', 'IHE', 'IBB', 'XBI', 'XLV', 'VHT', 'SPY', 'VIXY']
Safe fetch window: 2013-12-16 → 2025-09-12
Data prepared. Shape after safe fetch: (2952, 45)


(2952, 45)
(2952, 45)
Registered feature: target_close with shift: no_shift
Registered feature: VIXY_Open with shift: no_shift
Registered feature: VIXY_High with shift: shift_1
Registered feature: VIXY_Low with shift: shift_1
Registered feature: VIXY_Close with shift: shift_1
Registered feature: VIXY_Volume with shift: shift_1
Registered feature: IHE_Open with shift: no_shift
Registered feature: IHE_High with shift: shift_1
Registered feature: IHE_Low with shift: shift_1
Registered feature: IHE_Close with shift: shift_1
Registered feature: IHE_Volume with shift: shift_1
Registered feature: IBB_Open with shift: no_shift
Registered feature: IBB_High with shift: shift_1
Registered feature: IBB_Low with shift: shift_1
Registered feature: IBB_Close with shift: shift_1
Registered feature: IBB_Volume with shift: shift_1
Registered feature: XBI_Open with shift: no_shift
Registered feature: XBI_High with shift: shift_1
Registered feature: XBI_Low with shift: shift_1
Registered feature: XBI_Clos

Per-ticker technicals:   0%|          | 0/9 [00:00<?, ?it/s]

Registered feature: PPH_Return_1d with shift: shift_1
Registered feature: PPH_Return_5d with shift: shift_1
Registered feature: PPH_Return_10d with shift: shift_1
Registered feature: PPH_Return_20d with shift: shift_1
Registered feature: PPH_SMA_10 with shift: shift_1
Registered feature: PPH_EMA_20 with shift: shift_1
Registered feature: PPH_MACD with shift: shift_1
Registered feature: PPH_MACD_sig with shift: shift_1
Registered feature: PPH_MACD_hist with shift: shift_1
Registered feature: PPH_RSI_14 with shift: shift_1
Registered feature: PPH_StochK with shift: shift_1
Registered feature: PPH_StochD with shift: shift_1
Registered feature: PPH_Vol_20 with shift: shift_1
Registered feature: PPH_Parkinson_20 with shift: shift_1
Registered feature: PPH_GK_20 with shift: shift_1
Registered feature: PPH_BB_width with shift: shift_1
Registered feature: PPH_Volume_ROC with shift: shift_1
Registered feature: PPH_Volume_Z with shift: shift_1
Registered feature: PPH_OBV with shift: shift_1
Regi

In [16]:
df_final.head(5)

| Date | VIXY_Open | VIXY_High_t-1 | VIXY_Low_t-1 | VIXY_Close_t-1 | VIXY_Volume_t-1 | IHE_Open | IHE_High_t-1 | IHE_Low_t-1 | IHE_Close_t-1 | IHE_Volume_t-1 | ... | XLV_RS_20_t-1 | VHT_Ratio_PPH_t-1 | VHT_RS_20_t-1 | SPY_Ratio_PPH_t-1 | SPY_RS_20_t-1 | VIXY_Ratio_PPH_t-1 | VIXY_RS_20_t-1 | PCA_1_t-1 | PCA_2_t-1 | PCA_3_t-1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2014-02-04 | 55728.0 | 57232.0 | 52784.0 | 56624.0 | 3322.0 | 39.639999 | 40.543331 | 39.310001 | 39.356667 | 135900.0 | ... | -0.012533 | 1.892965 | -0.007081 | 3.276336 | -0.052974 | 1065.161779 | 0.218426 | 0.105332 | -0.025382 | -0.001905 |
| 2014-02-05 | 56640.0 | 56688.0 | 54736.0 | 55504.0 | 4969.0 | 39.803333 | 39.869999 | 39.496666 | 39.813332 | 71100.0 | ... | -0.0111 | 1.893256 | -0.005572 | 3.267325 | -0.056044 | 1033.97913 | 0.196813 | -0.029754 | 0.010988 | -0.003979 |
| 2014-02-06 | 56880.0 | 59056.0 | 56096.0 | 57536.0 | 862.0 | 39.82 | 39.869999 | 39.246666 | 39.633331 | 138600.0 | ... | -0.018914 | 1.876672 | -0.014781 | 3.253529 | -0.054854 | 1068.647842 | 0.277941 | 0.044851 | -0.000998 | -0.013096 |
| 2014-02-07 | 50448.0 | 56880.0 | 51792.0 | 51792.0 | 744.0 | 39.776669 | 39.886665 | 39.473331 | 39.613335 | 64800.0 | ... | -0.014334 | 1.880096 | -0.01268 | 3.289103 | -0.034769 | 959.822107 | 0.155041 | -0.082527 | -0.05576 | -0.001468 |
| 2014-02-10 | 48416.0 | 50688.0 | 47568.0 | 48720.0 | 714.0 | 40.450001 | 40.450001 | 39.776669 | 40.130001 | 52200.0 | ... | -0.015535 | 1.880459 | -0.013784 | 3.269287 | -0.034092 | 886.462897 | 0.073808 | -0.091241 | 0.035934 | 0.017027 |


In [17]:
import numpy as np
import pandas as pd


def validate_final_dataframe(
    df,
    feature_registry,
    horizon,
    requested_start_date,
    requested_end_date,
    target_col="target_close"
):
    """
    Validates the completed post-shift feature matrix.

    Checks:
        - NaNs / Inf
        - DatetimeIndex integrity
        - Start & end dates correct (relative to user input)
        - Expected number of columns based on registry + horizon
        - No all-NaN or constant columns
        - Sampled shift sanity checks
        - Target column validity
    """

    print("\n" + "="*80)
    print("🔍 VALIDATING FINAL DATAFRAME")
    print("="*80)

    ok = True

    def good(msg):
        print(f"  ✅ {msg}")

    def bad(msg):
        nonlocal ok
        ok = False
        print(f"  ❌ {msg}")

    # -------------------------------------------------------------
    # 1. Basic Data Health
    # -------------------------------------------------------------
    if df.isna().any().any():
        bad("DataFrame contains NaN values.")
    else:
        good("No NaN values found.")

    if np.isinf(df.select_dtypes(include=[np.number])).any().any():
        bad("DataFrame contains infinite values.")
    else:
        good("No infinite values detected.")

    # -------------------------------------------------------------
    # 2. Index Integrity
    # -------------------------------------------------------------
    if not isinstance(df.index, pd.DatetimeIndex):
        bad("Index is not a DatetimeIndex.")
    else:
        good("Index type OK (DatetimeIndex).")

    if not df.index.is_monotonic_increasing:
        bad("Index not sorted.")
    else:
        good("Index sorted ascending.")

    if df.index.duplicated().any():
        bad("Duplicate timestamps present in index.")
    else:
        good("No duplicate timestamps.")

    # -------------------------------------------------------------
    # 3. EXACT START AND END DATE VALIDATION
    # -------------------------------------------------------------
    req_start = pd.Timestamp(requested_start_date)
    req_end   = pd.Timestamp(requested_end_date)

    first_df_date = df.index.min()
    last_df_date  = df.index.max()

    # ---- START DATE MUST MATCH EXACTLY ----
    if first_df_date != req_start:
        bad(
            f"Start date mismatch: DF begins at {first_df_date.date()}, "
            f"but requested start was {req_start.date()}. "
            f"This indicates Step A failed to expand the lookback window "
            f"or the ticker lacks sufficient historical data."
        )
    else:
        good(f"Start date correct: {first_df_date.date()}")

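    # ---- END DATE MUST MATCH EXACTLY ----
    # NOTE: this fails whenever the requested end date is not a trading day
    # (weekend or market holiday), because the index only holds trading days.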
    if last_df_date != req_end:
        bad(
            f"End date mismatch: DF ends at {last_df_date.date()}, "
            f"but requested end was {req_end.date()}."
        )
    else:
        good(f"End date correct: {last_df_date.date()}")


    # -------------------------------------------------------------
    # 4. EXPECTED COLUMN COUNT (Registry + Horizon)
    # -------------------------------------------------------------
    # Registry entries that still carry the pre-shift "shift_plus_k" rule.
    # When the post-shift registry from Step G is passed in, this list is
    # empty, because Step G relabels those entries as "shift_plus_<H>".
    shift_plus_k_cols = [
        f for f, rule in feature_registry.items()
        if rule == "shift_plus_k"
    ]

    expected_final_columns = (
        # each registry entry (no_shift + shift_1) becomes exactly 1 column
        len(feature_registry)
        +
        # each shift_plus_k feature creates H additional columns
        len(shift_plus_k_cols) * horizon
    )

    actual_columns = df.shape[1]

    if expected_final_columns != actual_columns:
        bad(
            f"Column count mismatch:\n"
            f"  Expected (registry+horizon): {expected_final_columns}\n"
            f"  Actual df columns:           {actual_columns}"
        )
    else:
        good(
            f"Column count correct: {actual_columns} "
            f"(matches registry+horizon projection)"
        )

    # -------------------------------------------------------------
    # 5. No all-NaN columns
    # -------------------------------------------------------------
    all_nan_cols = df.columns[df.isna().all()]
    if len(all_nan_cols) > 0:
        bad(f"{len(all_nan_cols)} columns are all NaN.")
    else:
        good("No empty (all-NaN) columns.")

    # -------------------------------------------------------------
    # 6. Constant columns
    # -------------------------------------------------------------
    const_cols = [c for c in df.columns if df[c].nunique() <= 1]
    if len(const_cols) > 0:
        bad(f"{len(const_cols)} constant columns detected: {const_cols}")
    else:
        good("No constant columns.")

    # -------------------------------------------------------------
    # 7. SHIFT SANITY CHECK (sample)
    # -------------------------------------------------------------
    sample_shifted = [c for c in df.columns if c.endswith("_t-1")][:8]

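    for col in sample_shifted:
        base = col[:-len("_t-1")]  # remove _t-1 suffix
        if base in df.columns:
            # Row t of the shifted column must equal row t-1 of the base.
            # Compare overlapping rows only: shift(-1) would leave NaN in the
            # last row and make .equals() fail even when alignment is correct.
            lagged = df[col].to_numpy()[1:]
            unshifted = df[base].to_numpy()[:-1]
            if not (lagged == unshifted).all():
                bad(f"Shift alignment error: {col} is not base shifted by 1.")
            else:
                good(f"{col} correctly aligned with base {base}.")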
        else:
            good(f"{col}: no base column found (OK for calendar/macro).")

    # -------------------------------------------------------------
    # 8. Target sanity
    # -------------------------------------------------------------
    if target_col in df.columns:
        tgt = df[target_col]
        if tgt.isna().any() or (tgt <= 0).any():
            bad(f"Target column '{target_col}' contains NaN or non-positive values.")
        else:
            good(f"Target column '{target_col}' looks valid.")
    else:
        bad(f"Target column '{target_col}' not found in final dataset.")

    # -------------------------------------------------------------
    # SUMMARY
    # -------------------------------------------------------------
    print("\n" + "="*80)
    if ok:
        print("🎉 FINAL DATAFRAME VALID - All checks passed.")
    else:
        print("⚠️ FINAL DATAFRAME **NOT** VALID - See issues above.")
    print("="*80)

    return ok
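
Check 7 in the function above verifies a lagged column by comparing row t of the `_t-1` column with row t-1 of its base. The sketch below, on a small hypothetical frame (the column names `X` and `X_t-1` and all values are made up for illustration), shows why `shift(-1)` followed by `.equals` is unreliable for this purpose: the shift leaves a NaN in the last row, so the comparison fails even when the lag is correct. A position-based comparison over the overlapping rows avoids that.

```python
import pandas as pd

# Hypothetical data: "X" is a base series, "X_t-1" is its one-step lag.
idx = pd.date_range("2024-01-01", periods=5, freq="D")
demo = pd.DataFrame({"X": [10.0, 11.0, 12.0, 13.0, 14.0]}, index=idx)
demo["X_t-1"] = demo["X"].shift(1)
demo = demo.dropna()  # mimic trimming of the warm-up row created by the lag

# Shifting back leaves NaN in the last row, so .equals() is False
# even though the lag itself is correct.
naive_ok = demo["X_t-1"].shift(-1).equals(demo["X"])

# Position-based comparison over the overlapping rows only.
lagged = demo["X_t-1"].to_numpy()[1:]
base = demo["X"].to_numpy()[:-1]
aligned_ok = bool((lagged == base).all())

print(naive_ok, aligned_ok)  # False True
```

The slicing approach assumes the rows are consecutive observations and that NaN values were already ruled out by an earlier check, since NaN never compares equal to itself.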


In [18]:
validate_final_dataframe(
    df_final,
    feature_registry,
    horizon=HORIZON,
    requested_start_date=START_DATE,
    requested_end_date=END_DATE,
    target_col="target_close"
)


🔍 VALIDATING FINAL DATAFRAME
  ✅ No NaN values found.
  ✅ No infinite values detected.
  ✅ Index type OK (DatetimeIndex).
  ✅ Index sorted ascending.
  ✅ No duplicate timestamps.
  ✅ Start date correct: 2014-02-04
  ❌ End date mismatch: DF ends at 2025-08-01, but requested end was 2025-08-03.
  ✅ Column count correct: 268 (matches registry+horizon projection)
  ✅ No empty (all-NaN) columns.
  ✅ No constant columns.
  ✅ VIXY_High_t-1: no base column found (OK for calendar/macro).
  ✅ VIXY_Low_t-1: no base column found (OK for calendar/macro).
  ✅ VIXY_Close_t-1: no base column found (OK for calendar/macro).
  ✅ VIXY_Volume_t-1: no base column found (OK for calendar/macro).
  ✅ IHE_High_t-1: no base column found (OK for calendar/macro).
  ✅ IHE_Low_t-1: no base column found (OK for calendar/macro).
  ✅ IHE_Close_t-1: no base column found (OK for calendar/macro).
  ✅ IHE_Volume_t-1: no base column found (OK for calendar/macro).
  ✅ Target column '

False
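
The end-date failure in the output above may be a calendar effect rather than missing data: 2025-08-03 is a Sunday, and the last row, 2025-08-01, is the preceding Friday. Below is a minimal sketch of a more tolerant comparison. It assumes weekends are the only non-trading days (exchange holidays are ignored), and the helper name `last_weekday_on_or_before` is hypothetical, not part of the validator.

```python
from datetime import date, timedelta


def last_weekday_on_or_before(d: date) -> date:
    """Roll a date back to the nearest Monday-Friday (holidays are not handled)."""
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d -= timedelta(days=1)
    return d


requested_end = date(2025, 8, 3)  # requested end date from the output above
df_end = date(2025, 8, 1)         # last index value reported by the validator

print(last_weekday_on_or_before(requested_end) == df_end)  # True
```

With a rule like this, the validator would compare the DataFrame's last index value against the rolled-back date instead of the raw `END_DATE`. A full solution would also need an exchange holiday calendar, and would need to account for a download API that treats the end date as exclusive.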

In [19]:
%pip install pyarrow



In [20]:
df_final.to_parquet("data/final_features.parquet")

In [21]:
df_final.shape

(2891, 268)
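
The 268 columns reported here are what check 4 of the validator compares against the registry. As a worked example of that arithmetic, the sketch below applies the same formula to a small hypothetical registry. The registry contents and the horizon value are illustrative assumptions, not the notebook's actual values. As in the validator, it assumes each `shift_plus_k` feature keeps its base column and adds one column per forecast step.

```python
# Hypothetical registry for illustration only (not the real one).
toy_registry = {
    "PPH_Return_1d": "shift_1",
    "day_of_week": "no_shift",
    "is_cpi_day": "shift_plus_k",
}
toy_horizon = 5  # assumed forecast horizon


def expected_column_count(registry: dict, horizon: int) -> int:
    """One column per registry entry, plus `horizon` extra per shift_plus_k entry."""
    n_plus_k = sum(1 for rule in registry.values() if rule == "shift_plus_k")
    return len(registry) + n_plus_k * horizon


print(expected_column_count(toy_registry, toy_horizon))  # 3 + 1 * 5 = 8
```

If the final DataFrame has a different number of columns than this projection, either a feature was created without being registered or a registered feature was dropped somewhere in the pipeline.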

In [22]:

with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.width', None):
    display(df_final.isna().sum().sort_values(ascending=False))

VIXY_Open                0
XLV_BB_width_t-1         0
XLV_Return_5d_t-1        0
XLV_Return_10d_t-1       0
XLV_Return_20d_t-1       0
XLV_SMA_10_t-1           0
XLV_EMA_20_t-1           0
XLV_MACD_t-1             0
XLV_MACD_sig_t-1         0
XLV_MACD_hist_t-1        0
XLV_RSI_14_t-1           0
XLV_StochK_t-1           0
XLV_StochD_t-1           0
XLV_Vol_20_t-1           0
XLV_Parkinson_20_t-1     0
XLV_GK_20_t-1            0
XLV_Volume_ROC_t-1       0
XBI_Entropy_20_t-1       0
XLV_Volume_Z_t-1         0
XLV_OBV_t-1              0
XLV_Entropy_20_t-1       0
VHT_Return_1d_t-1        0
VHT_Return_5d_t-1        0
VHT_Return_10d_t-1       0
VHT_Return_20d_t-1       0
VHT_SMA_10_t-1           0
VHT_EMA_20_t-1           0
VHT_MACD_t-1             0
VHT_MACD_sig_t-1         0
VHT_MACD_hist_t-1        0
VHT_RSI_14_t-1           0
VHT_StochK_t-1           0
XLV_Return_1d_t-1        0
XBI_OBV_t-1              0
VIXY_High_t-1            0
XBI_Return_5d_t-1        0
IBB_MACD_sig_t-1         0
I