# Renaissance Bot — Retrain All Models (98-dim Cross-Asset Features)

This notebook retrains all 7 ML models with the new 98-dimension cross-asset feature pipeline:
- **46 scale-invariant single-pair features**: returns, ratios, z-scores (no raw prices)
- **15 cross-asset features**: lead signals, correlations, spreads, market-wide
- **7 derivatives features**: funding rate, OI, long/short ratio, taker ratio, Fear & Greed
- **Padded to 98**: the 68 real features (46 + 15 + 7) are padded to a fixed input width of 98
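
As a rough illustration of the layout above, the 68 real features can sit in the leading columns of a 98-wide matrix. The sketch below assumes the padding is zeros appended on the right; `pad_features` is a hypothetical helper, not the pipeline's own function.

```python
import numpy as np

INPUT_DIM = 98          # fixed model input width used in this notebook
N_REAL = 46 + 15 + 7    # single-pair + cross-asset + derivatives = 68

def pad_features(real: np.ndarray, input_dim: int = INPUT_DIM) -> np.ndarray:
    """Right-pad an (N, F) feature matrix with zeros to (N, input_dim)."""
    n_rows, n_feats = real.shape
    if n_feats > input_dim:
        raise ValueError(f'{n_feats} features exceed input_dim={input_dim}')
    # Assumption: padding columns are zeros appended after the real features.
    return np.pad(real, ((0, 0), (0, input_dim - n_feats)), mode='constant')

demo = pad_features(np.ones((4, N_REAL), dtype=np.float32))
print(demo.shape)
```

Keeping the input width fixed means new features can later take over padding columns without changing the model's input layer.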

## Training order
1. Upload historical CSVs via Google Drive (6 pairs x 5+ years of 5-min bars)
2. (Optional) Upload derivatives CSVs for 7 additional features
3. Phase 1: Train 5 base models (QT, BiLSTM, DilatedCNN, CNN, GRU)
4. Phase 2: Train Meta-Ensemble (stacking layer over 5 base models)
5. Phase 3: Train VAE anomaly detector
6. Download trained `.pth` weight files

## Key training fixes (v7)
- **Soft labels**: 6-bar (30-min) forward return × 100, clipped to [-1, 1]
- **v6 loss**: BCE(pred × 20) + 10 × separation_margin(0.10) + 5 × magnitude_floor
- **v7 QT optimizer**: weight_decay=0 (not 1e-5), differential LR (attention 0.1x), collapse recovery
- **Other models**: weight_decay=1e-4, LR=3e-4
- **LR warmup**: 3-epoch linear warmup + cosine decay
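
The warmup-plus-cosine schedule in the last bullet could be expressed as a per-epoch LR multiplier. This is a minimal sketch, not the notebook's own scheduler: `warmup_cosine_factor`, the once-per-epoch stepping, and the 20-epoch total are assumptions for illustration.

```python
import math

def warmup_cosine_factor(epoch: int, warmup_epochs: int = 3, total_epochs: int = 20) -> float:
    """LR multiplier: linear ramp over the warmup epochs, then cosine decay toward 0."""
    if epoch < warmup_epochs:
        # Epochs 0, 1, 2 -> 1/3, 2/3, 1.0 of the base LR.
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

BASE_LR = 3e-4  # LR listed above for the non-QT models
factors = [warmup_cosine_factor(e) for e in range(20)]
lrs = [BASE_LR * f for f in factors]
for epoch in (0, 1, 2, 3, 10, 19):
    print(f'epoch {epoch:2d}: lr = {lrs[epoch]:.2e}')
```

A function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, provided the scheduler is stepped once per epoch.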

**Why v7?** v6 fixed the loss, but QT still collapsed at epoch 5 on the full dataset.
Root cause: weight_decay=1e-5 per step × 10,600 batches/epoch ≈ **10.6% total
weight shrinkage per epoch**, 13.5× more than in the local 50K-sample test.
This steadily shrinks the attention Q/K/V weights → softmax becomes uniform →
mean pooling produces a constant output → pred=0. Fix: wd=0 for QT, 0.1× LR
on attention params, and automatic recovery from collapse.
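
The arithmetic in the paragraph above can be checked directly. The sketch assumes, as the paragraph's own numbers imply, that every optimizer step shrinks the weights by a factor of (1 - weight_decay). The real per-step shrinkage depends on the optimizer and learning rate, so treat the result as an upper-level estimate.

```python
wd_per_step = 1e-5          # QT weight_decay in v6
batches_per_epoch = 10_600  # full-data run, from the text above

# Linear approximation used in the text: per-step shrinkage simply adds up.
linear = wd_per_step * batches_per_epoch

# Compounded version, assuming each step multiplies the weights by (1 - wd_per_step).
compound = 1.0 - (1.0 - wd_per_step) ** batches_per_epoch

# Batches per epoch implied for the local test by the 13.5x ratio.
local_batches = batches_per_epoch / 13.5

print(f'linear shrinkage per epoch:   {linear:.1%}')
print(f'compound shrinkage per epoch: {compound:.1%}')
print(f'implied local batches/epoch:  {local_batches:.0f}')
```

Both the linear and the compounded figure come out at roughly 10% per epoch, so the approximation in the text is close enough for the argument.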

**Runtime**: Select **GPU -> T4** (Runtime -> Change runtime type -> T4 GPU)

## 0. Setup

In [1]:
# Install dependencies
!pip install -q torch numpy pandas scikit-learn

In [2]:
import os
import sys
import glob
import time
import math
import shutil
import logging
import zipfile
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(message)s')
logger = logging.getLogger('retrain')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Device: {device}')
if torch.cuda.is_available():
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')

Device: cuda
GPU: NVIDIA A100-SXM4-80GB
Memory: 85.1 GB


## 1. Upload Training Data via Google Drive

1. Upload the 6 CSV files to a folder in your Google Drive (e.g., `My Drive/training_data/`)
2. Run cell 1a below — it mounts Drive and copies files to local storage
3. Run cell 1b. It **automatically fetches** derivatives data from Binance Futures
   (the API works from Colab but is geo-blocked when running locally in the US) and saves the CSVs to Drive for reuse.

**Files needed** (from `data/training/`):
- `BTC-USD_5m_historical.csv` (~49 MB)
- `ETH-USD_5m_historical.csv` (~45 MB)
- `SOL-USD_5m_historical.csv` (~27 MB)
- `DOGE-USD_5m_historical.csv` (~38 MB)
- `AVAX-USD_5m_historical.csv` (~25 MB)
- `LINK-USD_5m_historical.csv` (~34 MB)

**Derivatives** (auto-fetched by cell 1b, or reused from Drive on subsequent runs):
- `BTC-USD_derivatives.csv`, `ETH-USD_derivatives.csv`, etc.
- `fear_greed_history.csv`

In [3]:
# Mount Google Drive and copy CSV files to local storage
# Change DRIVE_FOLDER if you put them somewhere else.

DRIVE_FOLDER = 'training_data'  # folder name inside My Drive

from google.colab import drive
drive.mount('/content/drive')

os.makedirs('data/training', exist_ok=True)
os.makedirs('models/trained', exist_ok=True)

# Search for CSV files in the specified folder
drive_path = f'/content/drive/My Drive/{DRIVE_FOLDER}'
if not os.path.exists(drive_path):
    print(f'Folder "{DRIVE_FOLDER}" not found in My Drive.')
    print('Searching for *_5m_historical.csv files in entire Drive...')
    found = glob.glob('/content/drive/My Drive/**/*_5m_historical.csv', recursive=True)
    if found:
        drive_path = os.path.dirname(found[0])
        print(f'Found files in: {drive_path}')
    else:
        raise FileNotFoundError(
            'No *_5m_historical.csv files found in Google Drive.\n'
            'Please upload them to a folder in Drive first.'
        )

# Copy CSV files from Drive to local storage (much faster I/O)
csv_files = glob.glob(os.path.join(drive_path, '*_5m_historical.csv'))
print(f'\nFound {len(csv_files)} CSV files in {drive_path}:')
for src in sorted(csv_files):
    fname = os.path.basename(src)
    dst = os.path.join('data/training', fname)
    size_mb = os.path.getsize(src) / 1e6
    print(f'  Copying {fname} ({size_mb:.1f} MB)...', end=' ')
    shutil.copy2(src, dst)
    print('done')

print(f'\nAll files copied to data/training/')

# Copy derivatives data if available (optional — enables 7 extra features)
deriv_drive_path = os.path.join(drive_path, 'derivatives')
if os.path.exists(deriv_drive_path):
    os.makedirs('data/training/derivatives', exist_ok=True)
    deriv_files = [f for f in os.listdir(deriv_drive_path) if f.endswith('.csv')]
    for fname in sorted(deriv_files):
        src = os.path.join(deriv_drive_path, fname)
        shutil.copy2(src, os.path.join('data/training/derivatives', fname))
    print(f'\nCopied {len(deriv_files)} derivatives files to data/training/derivatives/')
else:
    print(f'\nNo derivatives/ subfolder found (optional — 7 features will be zero-padded)')

Mounted at /content/drive

Found 6 CSV files in /content/drive/My Drive/training_data:
  Copying AVAX-USD_5m_historical.csv (26.4 MB)... done
  Copying BTC-USD_5m_historical.csv (51.6 MB)... done
  Copying DOGE-USD_5m_historical.csv (39.5 MB)... done
  Copying ETH-USD_5m_historical.csv (47.6 MB)... done
  Copying LINK-USD_5m_historical.csv (35.8 MB)... done
  Copying SOL-USD_5m_historical.csv (28.1 MB)... done

All files copied to data/training/

Copied 7 derivatives files to data/training/derivatives/


### 1b. Fetch Derivatives Data (runs on Colab)

Binance Futures API is **geo-restricted in the US** but works from Colab (Google Cloud).
This cell fetches funding rate, open interest, long/short ratio, taker volume,
and Fear & Greed history — then saves CSVs to `data/training/derivatives/`.

**Skip this cell** if you already uploaded derivatives CSVs via Google Drive.
Takes ~5-10 minutes for 6 pairs × up to 180 days (`DERIV_DAYS`) of 5-min data.
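
The fetch cell splits the requested range into 30-day windows and walks them newest-first, because older windows may no longer be available. A stripped-down version of that windowing, with no network calls, might look like this; `build_windows` and the example end timestamp are hypothetical.

```python
DAY_MS = 86_400 * 1000
WINDOW_MS = 30 * DAY_MS   # same window size as the fetch cell

def build_windows(start_ms: int, end_ms: int, window_ms: int = WINDOW_MS):
    """Split [start_ms, end_ms) into contiguous windows, ordered newest first."""
    windows = []
    ws = start_ms
    while ws < end_ms:
        we = min(ws + window_ms, end_ms)
        windows.append((ws, we))
        ws = we
    windows.reverse()  # newest window comes first
    return windows

end_ms = 1_700_000_000_000          # arbitrary example timestamp in ms
start_ms = end_ms - 180 * DAY_MS    # DERIV_DAYS = 180
windows = build_windows(start_ms, end_ms)
print(len(windows), windows[0][1] == end_ms)
```

Iterating newest-first means the most recent data is collected even if the loop stops early on unavailable history.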

In [5]:
# ── Fetch derivatives data from Binance Futures (works on Colab, blocked in US) ──
# Skip this cell if you already uploaded derivatives CSVs to Google Drive.

import time as _time

import requests

DERIV_PAIRS = ['BTC-USD', 'ETH-USD', 'SOL-USD', 'DOGE-USD', 'AVAX-USD', 'LINK-USD']
DERIV_DAYS = 180  # ~6 months -- Binance OI/LS endpoints limited to ~30d windows
DERIV_PERIOD = '5m'
DERIV_DIR = 'data/training/derivatives'
os.makedirs(DERIV_DIR, exist_ok=True)

BINANCE_FAPI = 'https://fapi.binance.com'
BINANCE_FUTURES = 'https://fapi.binance.com/futures/data'
FNG_API = 'https://api.alternative.me/fng/'
PAIR_MAP = {'BTC-USD':'BTCUSDT','ETH-USD':'ETHUSDT','SOL-USD':'SOLUSDT',
            'DOGE-USD':'DOGEUSDT','AVAX-USD':'AVAXUSDT','LINK-USD':'LINKUSDT'}
REQ_DELAY = 0.2

WINDOW_MS = 30 * 86400 * 1000  # 30-day windows for OI/LS endpoints

def _paginate(url, symbol, period, start_ms, end_ms, val_key, label):
    # Reverse-windowed pagination -- iterate newest-to-oldest so we get
    # recent data first.  Binance OI/LS endpoints only retain ~30 days of
    # 5m data; older windows return HTTP 400.  After 2 consecutive 400s
    # we stop and return what we have.
    rows = []
    consecutive_400 = 0
    # Build window list (newest first)
    windows = []
    ws = start_ms
    while ws < end_ms:
        we = min(ws + WINDOW_MS, end_ms)
        windows.append((ws, we))
        ws = we
    windows.reverse()  # newest first
    for window_start, window_end in windows:
        if consecutive_400 >= 2:
            print(f'    {label}: skipping older windows (data unavailable)')
            break
        cur = window_start
        got_data = False
        while cur < window_end:
            resp = requests.get(url, params={
                'symbol': symbol, 'period': period,
                'startTime': int(cur), 'endTime': int(window_end), 'limit': 500,
            }, timeout=15)
            if resp.status_code == 400:
                consecutive_400 += 1
                # Try a fallback: fetch latest data without time params
                if not rows and consecutive_400 == 1:
                    fb = requests.get(url, params={
                        'symbol': symbol, 'period': period, 'limit': 500,
                    }, timeout=15)
                    if fb.status_code == 200:
                        data = fb.json()
                        for e in data:
                            rows.append({'timestamp': int(e['timestamp'])//1000, 'value': float(e[val_key])})
                        print(f'    {label}: got {len(data)} entries via latest-only fallback')
                        got_data = True
                break
            if resp.status_code != 200:
                print(f'    {label} HTTP {resp.status_code}: {resp.text[:100]}')
                break
            data = resp.json()
            if not data:
                break
            consecutive_400 = 0  # reset on success
            got_data = True
            for e in data:
                rows.append({'timestamp': int(e['timestamp'])//1000, 'value': float(e[val_key])})
            cur = int(data[-1]['timestamp']) + 1
            _time.sleep(REQ_DELAY)
    return pd.DataFrame(rows).drop_duplicates('timestamp').sort_values('timestamp').reset_index(drop=True) if rows else pd.DataFrame()

def _fetch_funding(symbol, start_ms, end_ms):
    rows = []
    cur = start_ms
    while cur < end_ms:
        resp = requests.get(f'{BINANCE_FAPI}/fapi/v1/fundingRate',
                            params={'symbol':symbol,'startTime':cur,'endTime':end_ms,'limit':1000}, timeout=15)
        if resp.status_code != 200:
            print(f'    Funding HTTP {resp.status_code}')
            break
        data = resp.json()
        if not data:
            break
        for e in data:
            rows.append({'timestamp': int(e['fundingTime'])//1000, 'value': float(e['fundingRate'])})
        cur = int(data[-1]['fundingTime']) + 1
        _time.sleep(REQ_DELAY)
    return pd.DataFrame(rows).drop_duplicates('timestamp').sort_values('timestamp').reset_index(drop=True) if rows else pd.DataFrame()

def _fetch_taker(symbol, period, start_ms, end_ms):
    rows = []
    cur = start_ms
    while cur < end_ms:
        resp = requests.get(f'{BINANCE_FUTURES}/takeBuySellVol',
                            params={'symbol':symbol,'period':period,'startTime':cur,'endTime':end_ms,'limit':500}, timeout=15)
        if resp.status_code != 200:
            print(f'    Taker HTTP {resp.status_code}')
            break
        data = resp.json()
        if not data:
            break
        for e in data:
            rows.append({'timestamp': int(e['timestamp'])//1000,
                         'taker_buy_vol': float(e['buyVol']), 'taker_sell_vol': float(e['sellVol'])})
        cur = int(data[-1]['timestamp']) + 1
        _time.sleep(REQ_DELAY)
    return pd.DataFrame(rows).drop_duplicates('timestamp').sort_values('timestamp').reset_index(drop=True) if rows else pd.DataFrame()

end_ms = int(datetime.now(timezone.utc).timestamp() * 1000)
start_ms = end_ms - (DERIV_DAYS * 86400 * 1000)

# ── Quick connectivity test ──
print('Testing Binance Futures API connectivity...')
test_resp = requests.get(f'{BINANCE_FAPI}/fapi/v1/fundingRate',
                         params={'symbol':'BTCUSDT','limit':1}, timeout=10)
if test_resp.status_code != 200:
    print(f'ERROR: Binance Futures returned {test_resp.status_code}.')
    print('This API is geo-restricted. If running locally in the US, use Colab or a VPN.')
    print('Skipping derivatives fetch — features will be zero-padded.')
else:
    print(f'OK — connected to Binance Futures\n')

    for pair in DERIV_PAIRS:
        symbol = PAIR_MAP[pair]
        print(f'\n{"="*50}')
        print(f'{pair} ({symbol}) — {DERIV_DAYS}d, period={DERIV_PERIOD}')
        print(f'{"="*50}')

        funding_df = _fetch_funding(symbol, start_ms, end_ms)
        print(f'  Funding rate: {len(funding_df)} entries')

        oi_df = _paginate(f'{BINANCE_FUTURES}/openInterestHist',
                          symbol, DERIV_PERIOD, start_ms, end_ms, 'sumOpenInterest', 'OI')
        print(f'  Open interest: {len(oi_df)} entries')

        ls_df = _paginate(f'{BINANCE_FUTURES}/globalLongShortAccountRatio',
                          symbol, DERIV_PERIOD, start_ms, end_ms, 'longShortRatio', 'LS')
        print(f'  Long/Short ratio: {len(ls_df)} entries')

        taker_df = _fetch_taker(symbol, DERIV_PERIOD, start_ms, end_ms)
        print(f'  Taker volume: {len(taker_df)} entries')

        # Merge
        dfs = []
        if not funding_df.empty:
            dfs.append(funding_df.rename(columns={'value':'funding_rate'}))
        if not oi_df.empty:
            dfs.append(oi_df.rename(columns={'value':'open_interest'}))
        if not ls_df.empty:
            dfs.append(ls_df.rename(columns={'value':'long_short_ratio'}))
        if not taker_df.empty:
            dfs.append(taker_df)

        if dfs:
            result = dfs[0]
            for df in dfs[1:]:
                result = pd.merge(result, df, on='timestamp', how='outer')
            result = result.sort_values('timestamp').reset_index(drop=True)
            if 'funding_rate' in result.columns:
                result['funding_rate'] = result['funding_rate'].ffill()

            out_path = os.path.join(DERIV_DIR, f'{pair}_derivatives.csv')
            result.to_csv(out_path, index=False)
            print(f'  -> Saved {out_path} ({len(result):,} rows, {os.path.getsize(out_path)/1e6:.1f} MB)')
        else:
            print(f'  -> No data for {pair}')

    # ── Fear & Greed ──
    print(f'\n{"="*50}')
    print('Fear & Greed Index (all history)')
    print(f'{"="*50}')
    try:
        fng_resp = requests.get(FNG_API, params={'format':'json','limit':0}, timeout=30)
        fng_data = fng_resp.json().get('data', [])
        if fng_data:
            fng_rows = [{'timestamp': int(e['timestamp']), 'fear_greed': int(e['value'])} for e in fng_data]
            fng_df = pd.DataFrame(fng_rows).sort_values('timestamp').drop_duplicates('timestamp').reset_index(drop=True)
            fng_path = os.path.join(DERIV_DIR, 'fear_greed_history.csv')
            fng_df.to_csv(fng_path, index=False)
            print(f'  Saved {fng_path} ({len(fng_df):,} daily values)')
        else:
            print('  No Fear & Greed data returned')
    except Exception as e:
        print(f'  Fear & Greed fetch failed: {e}')

    # ── Summary ──
    print(f'\n{"="*50}')
    print('DERIVATIVES FETCH SUMMARY')
    print(f'{"="*50}')
    total_size = 0
    for f in sorted(os.listdir(DERIV_DIR)):
        if f.endswith('.csv'):
            p = os.path.join(DERIV_DIR, f)
            sz = os.path.getsize(p)
            total_size += sz
            rows = len(pd.read_csv(p))
            print(f'  {f}: {rows:,} rows ({sz/1e6:.1f} MB)')
    print(f'  Total: {total_size/1e6:.1f} MB')

    # Also save to Google Drive for reuse
    drive_deriv = f'/content/drive/My Drive/{DRIVE_FOLDER}/derivatives'
    os.makedirs(drive_deriv, exist_ok=True)
    for f in os.listdir(DERIV_DIR):
        if f.endswith('.csv'):
            shutil.copy2(os.path.join(DERIV_DIR, f), os.path.join(drive_deriv, f))
    print(f'\nAlso saved to Google Drive: {drive_deriv}/')
    print('Next time, cell 1a will auto-copy these from Drive (no re-fetch needed).')

Testing Binance Futures API connectivity...
OK — connected to Binance Futures


BTC-USD (BTCUSDT) — 180d, period=5m
  Funding rate: 540 entries


KeyboardInterrupt: 

In [6]:
# Load and validate data
ALL_PAIRS = ['BTC-USD', 'ETH-USD', 'SOL-USD', 'DOGE-USD', 'AVAX-USD', 'LINK-USD']

pair_dfs = {}
for pair in ALL_PAIRS:
    csv_path = f'data/training/{pair}_5m_historical.csv'
    if os.path.exists(csv_path):
        df = pd.read_csv(csv_path)
        pair_dfs[pair] = df
        first_ts = datetime.fromtimestamp(
            df['timestamp'].iloc[0]/1000 if df['timestamp'].iloc[0] > 1e12 else df['timestamp'].iloc[0],
            tz=timezone.utc)
        last_ts = datetime.fromtimestamp(
            df['timestamp'].iloc[-1]/1000 if df['timestamp'].iloc[-1] > 1e12 else df['timestamp'].iloc[-1],
            tz=timezone.utc)
        print(f'{pair}: {len(df):>10,} bars  ({first_ts.strftime("%Y-%m-%d")} -> {last_ts.strftime("%Y-%m-%d")})')
    else:
        print(f'{pair}: NOT FOUND')

total = sum(len(df) for df in pair_dfs.values())
print(f'\nTotal: {total:,} bars across {len(pair_dfs)} pairs')
assert len(pair_dfs) >= 2, 'Need at least 2 pairs for cross-asset features'

# Load derivatives data (optional — 7 features: funding_rate_z, oi_change_pct,
# long_short_ratio, taker_buy_sell_ratio, fear_greed_norm, fear_greed_roc, has_derivatives_data)
derivatives_dfs = {}
fear_greed_df = None
deriv_dir = 'data/training/derivatives'

if os.path.exists(deriv_dir):
    for fname in sorted(os.listdir(deriv_dir)):
        if fname.endswith('_derivatives.csv'):
            pair = fname.replace('_derivatives.csv', '')
            ddf = pd.read_csv(os.path.join(deriv_dir, fname))
            if len(ddf) > 0:
                derivatives_dfs[pair] = ddf
                print(f'  Derivatives {pair}: {len(ddf):,} rows')

    fng_path = os.path.join(deriv_dir, 'fear_greed_history.csv')
    if os.path.exists(fng_path):
        fear_greed_df = pd.read_csv(fng_path)
        if len(fear_greed_df) > 0:
            print(f'  Fear & Greed: {len(fear_greed_df):,} daily values')
        else:
            fear_greed_df = None

    if derivatives_dfs or fear_greed_df is not None:
        print(f'\nDerivatives: {len(derivatives_dfs)} pairs + {"yes" if fear_greed_df is not None else "no"} Fear & Greed')
    else:
        print('\nDerivatives CSVs empty — 7 derivatives features will be zero-padded')
else:
    print(f'\nNo derivatives data directory — 7 derivatives features will be zero-padded')

BTC-USD:    888,637 bars  (2017-09-01 -> 2026-02-17)
ETH-USD:    888,639 bars  (2017-09-01 -> 2026-02-17)
SOL-USD:    574,424 bars  (2020-09-01 -> 2026-02-17)
DOGE-USD:    696,063 bars  (2019-07-05 -> 2026-02-17)
AVAX-USD:    568,299 bars  (2020-09-22 -> 2026-02-17)
LINK-USD:    744,842 bars  (2019-01-16 -> 2026-02-17)

Total: 4,360,904 bars across 6 pairs
  Derivatives AVAX-USD: 2,190 rows
  Derivatives BTC-USD: 2,190 rows
  Derivatives DOGE-USD: 2,190 rows
  Derivatives ETH-USD: 2,190 rows
  Derivatives LINK-USD: 2,190 rows
  Derivatives SOL-USD: 2,190 rows
  Fear & Greed: 2,936 daily values

Derivatives: 6 pairs + yes Fear & Greed
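
The loader above treats a timestamp greater than 1e12 as milliseconds, while the derivatives fetch cell stores seconds. If the two frames are later joined with `pd.merge_asof`, both sides need the same unit. A minimal sketch of normalizing first; `to_seconds` and the toy frames are hypothetical, and the 1e12 threshold is the same heuristic the loader uses.

```python
import pandas as pd

def to_seconds(ts: pd.Series) -> pd.Series:
    """Convert epoch timestamps to integer seconds; values above 1e12 are treated as ms."""
    ts = ts.astype('int64')
    # Keep values that already look like seconds, integer-divide the rest by 1000.
    return ts.where(ts <= 1_000_000_000_000, ts // 1000)

# Toy data: prices in milliseconds, derivatives in seconds.
price = pd.DataFrame({'timestamp': [1_700_000_000_000, 1_700_000_300_000, 1_700_000_600_000]})
deriv = pd.DataFrame({'timestamp': [1_700_000_000, 1_700_000_600],
                      'funding_rate': [0.0001, 0.0002]})

price['timestamp'] = to_seconds(price['timestamp'])
deriv['timestamp'] = to_seconds(deriv['timestamp'])

# Backward as-of join: each price bar takes the latest derivatives row at or before it.
aligned = pd.merge_asof(price.sort_values('timestamp'), deriv.sort_values('timestamp'),
                        on='timestamp', direction='backward')
print(aligned)
```

Without the normalization, millisecond price timestamps would all sort after every derivatives row, and a backward join would silently attach the last derivatives value to every bar.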


## 2. Constants & Feature Pipeline

In [7]:
# Constants
INPUT_DIM = 98
N_CROSS_FEATURES = 15
N_DERIVATIVES_FEATURES = 7
SEQ_LEN = 30
N_BASE_MODELS = 6
BASE_MODEL_NAMES = ['quantum_transformer', 'bidirectional_lstm', 'dilated_cnn', 'cnn', 'gru', 'lightgbm']

# Label configuration — predict 30-min forward return, not next-bar direction
LABEL_HORIZON = 6   # 6 bars × 5 min = 30 min lookahead
LABEL_SCALE = 100   # Scaling: 0.5% return → label 0.5, clipped to [-1, 1]

LEAD_SIGNALS = {
    'BTC-USD':  {'primary': 'ETH-USD',  'secondary': 'SOL-USD'},
    'ETH-USD':  {'primary': 'BTC-USD',  'secondary': 'LINK-USD'},
    'SOL-USD':  {'primary': 'BTC-USD',  'secondary': 'ETH-USD'},
    'LINK-USD': {'primary': 'ETH-USD',  'secondary': 'BTC-USD'},
    'AVAX-USD': {'primary': 'ETH-USD',  'secondary': 'BTC-USD'},
    'DOGE-USD': {'primary': 'BTC-USD',  'secondary': 'ETH-USD'},
}

print(f'INPUT_DIM = {INPUT_DIM}  (46 single-pair + 15 cross-asset + 7 derivatives = 68 real, padded to 98)')
print(f'SEQ_LEN = {SEQ_LEN}')
print(f'LABEL_HORIZON = {LABEL_HORIZON} bars ({LABEL_HORIZON * 5} min)')
print(f'LABEL_SCALE = {LABEL_SCALE}')
print(f'Lead signals: {list(LEAD_SIGNALS.keys())}')

INPUT_DIM = 98  (46 single-pair + 15 cross-asset + 7 derivatives = 68 real, padded to 98)
SEQ_LEN = 30
LABEL_HORIZON = 6 bars (30 min)
LABEL_SCALE = 100
Lead signals: ['BTC-USD', 'ETH-USD', 'SOL-USD', 'LINK-USD', 'AVAX-USD', 'DOGE-USD']
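
With `LABEL_HORIZON` and `LABEL_SCALE` as defined above, a soft label can be computed from the close series alone. The following is a sketch under assumptions: `make_soft_labels` is a hypothetical name, and a simple (not log) forward return is assumed.

```python
import pandas as pd

LABEL_HORIZON = 6   # bars ahead (6 x 5 min = 30 min)
LABEL_SCALE = 100   # 0.5% return -> label 0.5

def make_soft_labels(close: pd.Series) -> pd.Series:
    """Forward return over LABEL_HORIZON bars, scaled and clipped to [-1, 1]."""
    # Assumption: simple return; the last LABEL_HORIZON bars have no future price.
    fwd_ret = close.shift(-LABEL_HORIZON) / close - 1.0
    return (fwd_ret * LABEL_SCALE).clip(-1.0, 1.0)

close = pd.Series([100.0, 100.1, 100.2, 100.3, 100.4, 100.5, 100.5, 103.0])
labels = make_soft_labels(close)
print(labels.round(3).tolist())
```

The trailing bars come out as NaN because their 30-minute outcome is not yet known, so they would need to be dropped before training.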


In [8]:
# ================================================================
# CROSS-ASSET FEATURE BUILDER (15 features)
# ================================================================

def _build_cross_features(close, volume, cross_data, pair_name):
    """Compute 15 cross-asset features (all returns/correlations/z-scores)."""
    feats = {}
    log_ret = np.log(close / close.shift(1))

    lead_cfg = LEAD_SIGNALS.get(pair_name, {})
    for role, leader_pair, horizons in [
        ('primary', lead_cfg.get('primary'), [1, 3, 6]),
        ('secondary', lead_cfg.get('secondary'), [1, 3]),
    ]:
        if leader_pair and leader_pair in cross_data:
            lc = cross_data[leader_pair]['close'].astype(float)
            lr = np.log(lc / lc.shift(1))
            for h in horizons:
                feats[f'lead_{role}_ret_{h}'] = lr.rolling(h).sum()
        else:
            for h in horizons:
                feats[f'lead_{role}_ret_{h}'] = pd.Series(0.0, index=close.index)

    for ref_name, ref_label in [('BTC-USD', 'btc'), ('ETH-USD', 'eth')]:
        if ref_name in cross_data and ref_name != pair_name:
            rc = cross_data[ref_name]['close'].astype(float)
            rr = np.log(rc / rc.shift(1))
            corr_50 = log_ret.rolling(50).corr(rr)
            feats[f'corr_{ref_label}_50'] = corr_50
            cm = corr_50.rolling(200).mean()
            cs = corr_50.rolling(200).std()
            feats[f'corr_z_{ref_label}'] = (corr_50 - cm) / (cs + 1e-10)
        else:
            feats[f'corr_{ref_label}_50'] = pd.Series(0.0, index=close.index)
            feats[f'corr_z_{ref_label}'] = pd.Series(0.0, index=close.index)

    for ref_name, ref_label in [('BTC-USD', 'btc'), ('ETH-USD', 'eth')]:
        if ref_name in cross_data and ref_name != pair_name:
            rc = cross_data[ref_name]['close'].astype(float)
            ls = np.log(close / (rc + 1e-10))
            sm = ls.rolling(100).mean()
            ss = ls.rolling(100).std()
            feats[f'spread_{ref_label}_z'] = (ls - sm) / (ss + 1e-10)
        else:
            feats[f'spread_{ref_label}_z'] = pd.Series(0.0, index=close.index)

    all_rets, all_vz = [], []
    for p, cdf in cross_data.items():
        c = cdf['close'].astype(float)
        all_rets.append(np.log(c / c.shift(1)))
        if 'volume' in cdf.columns:
            v = cdf['volume'].astype(float)
            vm = v.rolling(20).mean()
            all_vz.append(v / (vm + 1e-10) - 1.0)
    all_rets.append(log_ret)
    if volume is not None:
        vm = volume.rolling(20).mean()
        all_vz.append(volume / (vm + 1e-10) - 1.0)

    if all_rets:
        rdf = pd.concat(all_rets, axis=1)
        feats['mkt_avg_ret'] = rdf.mean(axis=1)
        feats['mkt_dispersion'] = rdf.std(axis=1)
        feats['mkt_breadth'] = (rdf > 0).mean(axis=1)
    else:
        feats['mkt_avg_ret'] = pd.Series(0.0, index=close.index)
        feats['mkt_dispersion'] = pd.Series(0.0, index=close.index)
        feats['mkt_breadth'] = pd.Series(0.5, index=close.index)
    feats['mkt_avg_vol_z'] = pd.concat(all_vz, axis=1).mean(axis=1) if all_vz else pd.Series(0.0, index=close.index)

    return feats

print('Cross-asset feature builder defined (15 features)')

Cross-asset feature builder defined (15 features)
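
To see why the spread features do not depend on price level, the rolling z-score of the log price ratio can be run on synthetic data. This is an illustrative sketch that mirrors the `spread_*_z` computation above; `spread_z` and the random-walk prices are hypothetical.

```python
import numpy as np
import pandas as pd

def spread_z(close: pd.Series, ref_close: pd.Series, window: int = 100) -> pd.Series:
    """Rolling z-score of the log price ratio between a pair and a reference pair."""
    log_spread = np.log(close / (ref_close + 1e-10))
    mean = log_spread.rolling(window).mean()
    std = log_spread.rolling(window).std()
    return (log_spread - mean) / (std + 1e-10)

# Synthetic geometric random walks standing in for two price series.
rng = np.random.default_rng(0)
ref = pd.Series(30_000 * np.exp(np.cumsum(rng.normal(0, 1e-3, 500))))
alt = pd.Series(2_000 * np.exp(np.cumsum(rng.normal(0, 1e-3, 500))))

z = spread_z(alt, ref)
z_scaled = spread_z(alt * 10, ref)  # same path at 10x the price level

print(int(z.isna().sum()), np.allclose(z.dropna(), z_scaled.dropna(), atol=1e-6))
```

Multiplying a price by a constant only shifts the log spread by a constant, which the rolling mean removes, so the z-score is unchanged up to floating-point error.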


In [9]:
# ================================================================
# SCALE-INVARIANT FEATURE PIPELINE (46 single-pair + 15 cross-asset + 7 derivatives)
#
# CRITICAL: No feature depends on absolute price level or raw volume.
# Every feature is a return, ratio, z-score, or bounded indicator.
# This ensures identical feature distributions whether BTC is $3K or $130K.
# ================================================================

def _compute_single_pair_features(df):
    """46 scale-invariant features from OHLCV data."""
    close = df['close'].astype(float)
    _open = df['open'].astype(float) if 'open' in df.columns else close
    high = df['high'].astype(float) if 'high' in df.columns else close
    low = df['low'].astype(float) if 'low' in df.columns else close
    vol = df['volume'].astype(float) if 'volume' in df.columns else None

    features = {}

    # Group 1: Candle shape (5)
    features['open_gap'] = np.log(_open / (close.shift(1) + 1e-10))
    features['upper_wick'] = (high - np.maximum(_open, close)) / (close + 1e-10)
    features['lower_wick'] = (np.minimum(_open, close) - low) / (close + 1e-10)
    features['body'] = (close - _open) / (close + 1e-10)
    if vol is not None:
        vm = vol.rolling(100, min_periods=10).mean()
        vs = vol.rolling(100, min_periods=10).std()
        features['volume_z'] = (vol - vm) / (vs + 1e-10)
    else:
        features['volume_z'] = close * 0.0

    # Group 2: Returns (7)
    for w in [1, 2, 3, 5, 10, 20]:
        features[f'ret_{w}'] = close.pct_change(w)
    features['log_ret'] = np.log(close / close.shift(1))

    # Group 3: SMA distance + slope (8)
    for w in [5, 10, 20, 50]:
        sma = close.rolling(w).mean()
        features[f'sma_dist_{w}'] = (close - sma) / (sma + 1e-10)
        features[f'sma_slope_{w}'] = sma.pct_change(3)

    # Group 4: EMA distance + slope (6)
    for w in [5, 10, 20]:
        ema = close.ewm(span=w, adjust=False).mean()
        features[f'ema_dist_{w}'] = (close - ema) / (ema + 1e-10)
        features[f'ema_slope_{w}'] = ema.pct_change(3)

    # Group 5: Realized volatility (3)
    pct_ret = close.pct_change()
    for w in [5, 10, 20]:
        features[f'vol_{w}'] = pct_ret.rolling(w).std()

    # Group 6: RSI [-1, 1] (1)
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss_s = (-delta.clip(upper=0)).rolling(14).mean()
    rs = gain / (loss_s + 1e-10)
    features['rsi_norm'] = (100 - (100 / (1 + rs)) - 50) / 50

    # Group 7: MACD / price (3)
    ema12 = close.ewm(span=12, adjust=False).mean()
    ema26 = close.ewm(span=26, adjust=False).mean()
    macd = ema12 - ema26
    macd_signal = macd.ewm(span=9, adjust=False).mean()
    features['macd_pct'] = macd / (close + 1e-10)
    features['macd_signal_pct'] = macd_signal / (close + 1e-10)
    features['macd_hist_pct'] = (macd - macd_signal) / (close + 1e-10)

    # Group 8: Bollinger Bands (4)
    sma20 = close.rolling(20).mean()
    std20 = close.rolling(20).std()
    bb_upper = sma20 + 2 * std20
    bb_lower = sma20 - 2 * std20
    bb_range = bb_upper - bb_lower + 1e-10
    features['bb_pct'] = (close - bb_lower) / bb_range
    features['bb_width'] = bb_range / (sma20 + 1e-10)
    features['bb_upper_dist'] = (bb_upper - close) / (close + 1e-10)
    features['bb_lower_dist'] = (close - bb_lower) / (close + 1e-10)

    # Group 9: ATR / price (1)
    tr = pd.concat([
        high - low,
        (high - close.shift(1)).abs(),
        (low - close.shift(1)).abs(),
    ], axis=1).max(axis=1)
    features['atr_pct'] = tr.rolling(14).mean() / (close + 1e-10)

    # Group 10: Volume ratios (3)
    if vol is not None:
        features['vol_ratio'] = vol / (vol.rolling(10, min_periods=1).mean() + 1e-10)
        features['vol_change'] = vol.pct_change()
        features['vol_trend'] = vol.rolling(5, min_periods=1).mean() / (vol.rolling(20, min_periods=1).mean() + 1e-10)
    else:
        features['vol_ratio'] = close * 0.0
        features['vol_change'] = close * 0.0
        features['vol_trend'] = close * 0.0 + 1.0

    # Group 11: Momentum (3)
    features['momentum_5'] = close / close.shift(5) - 1
    features['momentum_10'] = close / close.shift(10) - 1
    features['momentum_20'] = close / close.shift(20) - 1

    # Group 12: Range (2)
    features['hl_range'] = (high - low) / (close + 1e-10)
    features['hl_range_norm'] = features['hl_range'] / (features['hl_range'].rolling(10, min_periods=1).mean() + 1e-10)

    return features


def _build_derivatives_features(n_rows, derivatives_data=None):
    """Compute 7 derivatives + sentiment features.

    Args:
        n_rows: Number of rows in the price DataFrame
        derivatives_data: Dict with optional keys:
            - 'funding_rate': pd.Series of raw funding rates
            - 'open_interest': pd.Series of open interest values
            - 'long_short_ratio': pd.Series of long/short account ratios
            - 'taker_buy_vol': pd.Series of taker buy volume
            - 'taker_sell_vol': pd.Series of taker sell volume
            - 'fear_greed': pd.Series of Fear & Greed index (0-100)

    Returns:
        Dict of 7 feature_name -> pd.Series
    """
    idx = pd.RangeIndex(n_rows)
    feats = {}
    has_deriv = False

    if derivatives_data is not None:
        # Funding rate z-score (50-bar window)
        fr = derivatives_data.get('funding_rate')
        if fr is not None and len(fr) > 0:
            fr = fr.astype(float)
            fr_mean = fr.rolling(50, min_periods=5).mean()
            fr_std = fr.rolling(50, min_periods=5).std()
            feats['funding_rate_z'] = (fr - fr_mean) / (fr_std + 1e-10)
            has_deriv = True
        else:
            feats['funding_rate_z'] = pd.Series(0.0, index=idx)

        # Open interest 5-bar % change
        oi = derivatives_data.get('open_interest')
        if oi is not None and len(oi) > 0:
            oi = oi.astype(float)
            feats['oi_change_pct'] = oi.pct_change(5)
            has_deriv = True
        else:
            feats['oi_change_pct'] = pd.Series(0.0, index=idx)

        # Long/short ratio (raw, already scale-invariant)
        ls = derivatives_data.get('long_short_ratio')
        if ls is not None and len(ls) > 0:
            feats['long_short_ratio'] = ls.astype(float)
            has_deriv = True
        else:
            feats['long_short_ratio'] = pd.Series(0.0, index=idx)

        # Taker buy/sell ratio
        buy_vol = derivatives_data.get('taker_buy_vol')
        sell_vol = derivatives_data.get('taker_sell_vol')
        if buy_vol is not None and sell_vol is not None and len(buy_vol) > 0:
            bv = buy_vol.astype(float)
            sv = sell_vol.astype(float)
            feats['taker_buy_sell_ratio'] = bv / (sv + 1e-10)
            has_deriv = True
        else:
            feats['taker_buy_sell_ratio'] = pd.Series(0.0, index=idx)

        # Fear & Greed (normalized + 3-day ROC)
        fg = derivatives_data.get('fear_greed')
        if fg is not None and len(fg) > 0:
            fg = fg.astype(float)
            feats['fear_greed_norm'] = fg / 100.0
            feats['fear_greed_roc'] = fg.diff(864) / 100.0  # 3 days in 5-min bars
        else:
            feats['fear_greed_norm'] = pd.Series(0.0, index=idx)
            feats['fear_greed_roc'] = pd.Series(0.0, index=idx)
    else:
        feats['funding_rate_z'] = pd.Series(0.0, index=idx)
        feats['oi_change_pct'] = pd.Series(0.0, index=idx)
        feats['long_short_ratio'] = pd.Series(0.0, index=idx)
        feats['taker_buy_sell_ratio'] = pd.Series(0.0, index=idx)
        feats['fear_greed_norm'] = pd.Series(0.0, index=idx)
        feats['fear_greed_roc'] = pd.Series(0.0, index=idx)

    # Binary flag: model knows when derivatives data is present
    feats['has_derivatives_data'] = pd.Series(1.0 if has_deriv else 0.0, index=idx)

    return feats


def _align_derivatives(price_df, pair, derivatives_dfs, fear_greed_df):
    """Align derivatives data to price DataFrame timestamps via merge_asof.

    Returns dict suitable for _build_derivatives_features(), or None if no data.
    """
    if not derivatives_dfs and fear_greed_df is None:
        return None

    result = {}

    # Align per-pair derivatives (funding_rate, OI, LS, taker volumes)
    if derivatives_dfs and pair in derivatives_dfs:
        deriv_df = derivatives_dfs[pair].copy()
        if 'timestamp' in deriv_df.columns and 'timestamp' in price_df.columns:
            price_ts = price_df[['timestamp']].copy()
            price_ts['timestamp'] = price_ts['timestamp'].astype(int)
            deriv_df['timestamp'] = deriv_df['timestamp'].astype(int)
            price_ts = price_ts.sort_values('timestamp')
            deriv_df = deriv_df.sort_values('timestamp')
            merged = pd.merge_asof(price_ts, deriv_df, on='timestamp', direction='backward')
            for col in ['funding_rate', 'open_interest', 'long_short_ratio',
                        'taker_buy_vol', 'taker_sell_vol']:
                if col in merged.columns:
                    result[col] = merged[col].reset_index(drop=True)

    # Align Fear & Greed (daily -> forward-fill to 5-min bars)
    if fear_greed_df is not None and 'timestamp' in price_df.columns:
        fng = fear_greed_df.copy()
        fng['timestamp'] = fng['timestamp'].astype(int)
        fng = fng.sort_values('timestamp')
        price_ts = price_df[['timestamp']].copy()
        price_ts['timestamp'] = price_ts['timestamp'].astype(int)
        price_ts = price_ts.sort_values('timestamp')
        merged_fng = pd.merge_asof(price_ts, fng, on='timestamp', direction='backward')
        if 'fear_greed' in merged_fng.columns:
            result['fear_greed'] = merged_fng['fear_greed'].reset_index(drop=True)

    return result if result else None


def build_full_feature_matrix(price_df, cross_data=None, pair_name=None, derivatives_data=None):
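    """Return an (N, INPUT_DIM) float32 matrix of all features for the whole DataFrame.

    Returns None if price_df has fewer than 30 rows or no 'close' column.
    Columns are zero-padded or truncated to INPUT_DIM; no standardization is applied here.
    """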
    if price_df is None or len(price_df) < 30:
        return None
    df = price_df.copy().reset_index(drop=True)
    if cross_data is not None:
        cross_data = {p: cdf.copy().reset_index(drop=True) for p, cdf in cross_data.items() if p != pair_name}
        if not cross_data:
            cross_data = None
    close = df['close'].astype(float) if 'close' in df.columns else None
    if close is None:
        return None
    vol = df['volume'].astype(float) if 'volume' in df.columns else None

    features = _compute_single_pair_features(df)
    if cross_data is not None and pair_name is not None:
        features.update(_build_cross_features(close, vol, cross_data, pair_name))

    # Add derivatives features (7 features — zeros if no data)
    features.update(_build_derivatives_features(len(df), derivatives_data))

    feat_df = pd.DataFrame(features, index=df.index)
    feat_df = feat_df.replace([np.inf, -np.inf], np.nan).ffill().bfill().fillna(0)
    feat_arr = feat_df.values.astype(np.float32)
    n_feat = feat_arr.shape[1]
    if n_feat < INPUT_DIM:
        feat_arr = np.concatenate([feat_arr, np.zeros((len(feat_arr), INPUT_DIM - n_feat), dtype=np.float32)], axis=1)
    elif n_feat > INPUT_DIM:
        feat_arr = feat_arr[:, :INPUT_DIM]
    return feat_arr


def build_feature_sequence(price_df, seq_len=30, cross_data=None, pair_name=None, derivatives_data=None):
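    """Return a (seq_len, INPUT_DIM) array for the latest window, standardized per window.

    Features are computed on the last seq_len + 50 rows (50 extra rows of history
    ahead of the window); only the final seq_len rows are kept.
    """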
    if price_df is None or len(price_df) < seq_len:
        return None
    df = price_df.tail(seq_len + 50).copy()
    if cross_data is not None:
        cross_data = {p: cdf.tail(seq_len + 50).copy().reset_index(drop=True) for p, cdf in cross_data.items() if p != pair_name}
        if not cross_data:
            cross_data = None
    df = df.reset_index(drop=True)
    close = df['close'].astype(float) if 'close' in df.columns else None
    if close is None:
        return None
    vol = df['volume'].astype(float) if 'volume' in df.columns else None

    features = _compute_single_pair_features(df)
    if cross_data is not None and pair_name is not None:
        features.update(_build_cross_features(close, vol, cross_data, pair_name))

    # Add derivatives features
    features.update(_build_derivatives_features(len(df), derivatives_data))

    feat_df = pd.DataFrame(features, index=df.index)
    feat_df = feat_df.replace([np.inf, -np.inf], np.nan).ffill().bfill().fillna(0)
    feat_arr = feat_df.tail(seq_len).values.astype(np.float32)

    # Per-window standardization
    mean = feat_arr.mean(axis=0, keepdims=True)
    std = feat_arr.std(axis=0, keepdims=True) + 1e-8
    feat_arr = (feat_arr - mean) / std

    n_feat = feat_arr.shape[1]
    if n_feat < INPUT_DIM:
        feat_arr = np.concatenate([feat_arr, np.zeros((seq_len, INPUT_DIM - n_feat), dtype=np.float32)], axis=1)
    elif n_feat > INPUT_DIM:
        feat_arr = feat_arr[:, :INPUT_DIM]
    return feat_arr


# Verification
test_pair = list(pair_dfs.keys())[0]
test_df = pair_dfs[test_pair]

f = _compute_single_pair_features(test_df.head(200))
print(f'Single-pair features: {len(f)} (expected 46)')

# Named deriv_feats (not df) so it is not mistaken for a price DataFrame
deriv_feats = _build_derivatives_features(200, None)
print(f'Derivatives features: {len(deriv_feats)} (expected 7)')

feat_seq = build_feature_sequence(test_df.head(200), seq_len=30)
print(f'Per-window shape: {feat_seq.shape} (expected (30, {INPUT_DIM}))')
feat_full = build_full_feature_matrix(test_df.head(1000))
print(f'Full-matrix shape: {feat_full.shape} (expected (1000, {INPUT_DIM}))')

early = build_feature_sequence(test_df.head(200), seq_len=30)
late = build_feature_sequence(test_df.tail(200), seq_len=30)
print(f'\nScale invariance (per-window standardized):')
print(f'  Early abs mean: {np.abs(early).mean():.4f}')
print(f'  Late abs mean:  {np.abs(late).mean():.4f}')
print(f'  Ratio: {np.abs(late).mean() / np.abs(early).mean():.2f}x (should be ~1.0)')

assert feat_seq.shape == (30, INPUT_DIM)
assert feat_full.shape[1] == INPUT_DIM
print('\nAll feature pipeline tests passed!')

Single-pair features: 46 (expected 46)
Derivatives features: 7 (expected 7)
Per-window shape: (30, 98) (expected (30, 98))
Full-matrix shape: (1000, 98) (expected (1000, 98))

Scale invariance (per-window standardized):
  Early abs mean: 0.3809
  Late abs mean:  0.3730
  Ratio: 0.98x (should be ~1.0)

All feature pipeline tests passed!
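
`_align_derivatives` relies on `pd.merge_asof(..., direction='backward')`, which attaches to each price bar the most recent derivatives row whose timestamp is at or before that bar. The matching rule can be sketched in plain Python as below. The helper `asof_backward` and the sample timestamps are hypothetical and only illustrate the rule; pandas itself would produce `NaN` where this sketch returns `None`.

```python
from bisect import bisect_right


def asof_backward(left_ts, right_ts, right_vals):
    """For each left timestamp, return the value of the last right row
    whose timestamp is <= the left timestamp, or None if there is none.
    Both timestamp lists must be sorted ascending."""
    out = []
    for t in left_ts:
        i = bisect_right(right_ts, t)  # count of right rows with ts <= t
        out.append(right_vals[i - 1] if i > 0 else None)
    return out


# Hypothetical data: 5-minute bars (in seconds) and two funding-rate updates
bars = [0, 300, 600, 900, 1200]
funding_ts = [300, 1000]
funding = [0.0001, 0.0003]
aligned = asof_backward(bars, funding_ts, funding)
print(aligned)  # [None, 0.0001, 0.0001, 0.0001, 0.0003]
```

Because only rows at or before each bar are considered, a bar never sees a derivatives value published after it.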


## 3. Model Architectures

In [11]:
# ================================================================
# ATTENTION & POSITIONAL ENCODING
# ================================================================

class _TrainedAttention(nn.Module):
    def __init__(self, d_model, n_heads, qkv_dim):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = qkv_dim // n_heads
        self.w_q = nn.Linear(d_model, qkv_dim)
        self.w_k = nn.Linear(d_model, qkv_dim)
        self.w_v = nn.Linear(d_model, qkv_dim)
        self.w_o = nn.Linear(qkv_dim, d_model)
        self.attention_temperature = nn.Parameter(torch.ones(1))
        self.quantum_enhancement_scale = nn.Parameter(torch.ones(n_heads))

    def forward(self, x):
        B, S, _ = x.shape
        q = self.w_q(x).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        scale = math.sqrt(self.d_head) * self.attention_temperature
        scores = torch.matmul(q, k.transpose(-2, -1)) / scale
        attn = F.softmax(scores, dim=-1)
        enhancement = self.quantum_enhancement_scale.view(1, self.n_heads, 1, 1)
        attn = attn * enhancement
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).contiguous().view(B, S, -1)
        return self.w_o(out)


class _TrainedPosEncoding(nn.Module):
    def __init__(self, d_model, max_len=128):
        super().__init__()
        self.quantum_phase = nn.Parameter(torch.zeros(d_model))
        pe = torch.zeros(1, max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[0, :, 0::2] = torch.sin(pos * div)
        pe[0, :, 1::2] = torch.cos(pos * div[:d_model // 2])
        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:, :x.size(1), :] * torch.cos(self.quantum_phase)


class _TrainedTransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, qkv_dim, d_ff, dropout=0.2):
        super().__init__()
        self.attention = _TrainedAttention(d_model, n_heads, qkv_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d_model), nn.Dropout(dropout))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.skip_enhancement = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        x = self.norm1(x + self.skip_enhancement * self.attention(x))
        x = self.norm2(x + self.feed_forward(x))
        return x

print('Attention & positional encoding defined')

Attention & positional encoding defined
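
The buffer built in `_TrainedPosEncoding` is the standard sinusoidal table: even columns hold `sin(pos * div)` and odd columns hold `cos(pos * div)`, with `div = exp(-ln(10000) * i / d_model)` for even column index `i`. Since `quantum_phase` starts at zero and `cos(0) = 1`, the module initially adds this table unchanged. A minimal pure-Python sketch of the same table follows; `sinusoidal_pe` is a hypothetical name used for illustration.

```python
import math


def sinusoidal_pe(max_len, d_model):
    """Build a max_len x d_model sinusoidal position table as nested lists."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency term as div in the notebook, for even index i
            angle = pos * math.exp(-math.log(10000.0) * i / d_model)
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


table = sinusoidal_pe(max_len=4, d_model=6)
print(table[0][:2])  # position 0: sin(0) = 0.0, cos(0) = 1.0
```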


In [12]:
# ================================================================
# ALL 7 MODEL ARCHITECTURES
# ================================================================

class TrainedQuantumTransformer(nn.Module):
    def __init__(self, input_dim=INPUT_DIM):
        super().__init__()
        d_model, n_heads, qkv_dim, d_ff, n_blocks = 288, 8, 328, 1315, 4
        self.input_projection = nn.Linear(input_dim, d_model)
        self.pos_encoding = _TrainedPosEncoding(d_model)
        self.transformer_blocks = nn.ModuleList([
            _TrainedTransformerBlock(d_model, n_heads, qkv_dim, d_ff) for _ in range(n_blocks)])
        self.output_head = nn.Sequential(
            nn.BatchNorm1d(d_model), nn.GELU(),
            nn.Linear(d_model, 144), nn.GELU(), nn.Dropout(0.2),
            nn.Linear(144, 72), nn.GELU(), nn.Linear(72, 1), nn.Tanh())
        self.uncertainty_head = nn.Sequential(
            nn.Linear(d_model, 72), nn.ReLU(), nn.Linear(72, 1), nn.Softplus())

    def forward(self, x):
        x = self.input_projection(x)
        x = self.pos_encoding(x)
        for block in self.transformer_blocks:
            x = block(x)
        pooled = x.mean(dim=1)
        return self.output_head(pooled), self.uncertainty_head(pooled)


class _TrainedLSTMCore(nn.Module):
    def __init__(self, input_size=INPUT_DIM, hidden_size=292, num_layers=2):
        super().__init__()
        self.hidden_size = hidden_size
        bidir_dim = hidden_size * 2
        self.lstm_layers = nn.ModuleList()
        self.skip_projections = nn.ModuleList()
        self.consciousness_gates = nn.ModuleList()
        for i in range(num_layers):
            in_dim = input_size if i == 0 else bidir_dim
            self.lstm_layers.append(nn.LSTM(
                input_size=in_dim, hidden_size=hidden_size,
                num_layers=1, batch_first=True, bidirectional=True))
            self.skip_projections.append(nn.Linear(in_dim, bidir_dim))
            self.consciousness_gates.append(nn.Sequential(
                nn.Linear(bidir_dim, bidir_dim), nn.Sigmoid()))

    def forward(self, x):
        for i, lstm_layer in enumerate(self.lstm_layers):
            skip = self.skip_projections[i](x)
            out, _ = lstm_layer(x)
            gate = self.consciousness_gates[i](out)
            x = gate * out + (1 - gate) * skip
        return x


class TrainedBidirectionalLSTM(nn.Module):
    def __init__(self, input_dim=INPUT_DIM):
        super().__init__()
        bidir_dim = 584
        self.lstm = _TrainedLSTMCore(input_size=input_dim, hidden_size=292, num_layers=2)
        self.prediction_head = nn.Sequential(
            nn.BatchNorm1d(bidir_dim), nn.GELU(),
            nn.Linear(bidir_dim, 292), nn.GELU(),
            nn.BatchNorm1d(292), nn.GELU(),
            nn.Linear(292, 146), nn.GELU(), nn.Linear(146, 1), nn.Tanh())
        self.confidence_head = nn.Sequential(
            nn.Linear(bidir_dim, 73), nn.ReLU(), nn.Linear(73, 1), nn.Sigmoid())

    def forward(self, x):
        pooled = self.lstm(x).mean(dim=1)
        return self.prediction_head(pooled), self.confidence_head(pooled)


class _TrainedDilatedBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.add_module('0', nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation, padding=dilation))
        self.add_module('1', nn.BatchNorm1d(channels))
        self.add_module('4', nn.Conv1d(channels, channels, kernel_size=1))
        self.add_module('5', nn.BatchNorm1d(channels))

    def forward(self, x):
        h = F.relu(getattr(self, '1')(getattr(self, '0')(x)))
        h = F.dropout(h, p=0.2, training=self.training)
        h = getattr(self, '5')(getattr(self, '4')(h))
        return F.relu(h + x)


class _TrainedDilatedCNNCore(nn.Module):
    def __init__(self, channels=INPUT_DIM, hidden=332, n_blocks=5):
        super().__init__()
        self.conv_blocks = nn.ModuleList([
            _TrainedDilatedBlock(channels, dilation=2**i) for i in range(n_blocks)])
        fusion_in = channels * n_blocks
        self.fusion = nn.Sequential(
            nn.Conv1d(fusion_in, hidden, kernel_size=1), nn.BatchNorm1d(hidden),
            nn.ReLU(), nn.Dropout(0.2),
            nn.Conv1d(hidden, hidden, kernel_size=1), nn.BatchNorm1d(hidden))
        self.attention_pool = nn.Sequential(
            nn.Softmax(dim=-1),
            nn.Conv1d(hidden, channels, kernel_size=1), nn.ReLU(),
            nn.Conv1d(channels, hidden, kernel_size=1))

    def forward(self, x):
        block_outputs = []
        h = x
        for block in self.conv_blocks:
            h = block(h)
            block_outputs.append(h)
        cat = torch.cat(block_outputs, dim=1)
        fused = F.relu(self.fusion(cat))
        attn = self.attention_pool(fused)
        return (fused * F.softmax(attn, dim=-1)).sum(dim=-1)


class TrainedDilatedCNN(nn.Module):
    def __init__(self, input_dim=INPUT_DIM):
        super().__init__()
        hidden = 332
        self.dilated_cnn = _TrainedDilatedCNNCore(channels=input_dim, hidden=hidden, n_blocks=5)
        self.classifier = nn.Sequential(
            nn.BatchNorm1d(hidden), nn.GELU(),
            nn.Linear(hidden, 166), nn.BatchNorm1d(166), nn.GELU(), nn.Dropout(0.2),
            nn.Linear(166, input_dim), nn.BatchNorm1d(input_dim), nn.GELU(),
            nn.Linear(input_dim, 1), nn.Tanh())
        self.pattern_strength = nn.Sequential(
            nn.Linear(hidden, 41), nn.ReLU(), nn.Linear(41, 1))

    def forward(self, x):
        x = x.transpose(1, 2)
        pooled = self.dilated_cnn(x)
        return self.classifier(pooled)


class TrainedCNN(nn.Module):
    def __init__(self, input_dim=INPUT_DIM):
        super().__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv1d(input_dim, 128, kernel_size=3, padding=1), nn.BatchNorm1d(128), nn.GELU(),
            nn.Conv1d(128, 256, kernel_size=5, padding=2), nn.BatchNorm1d(256), nn.GELU(),
            nn.Conv1d(256, 128, kernel_size=7, padding=3), nn.BatchNorm1d(128), nn.GELU(),
            nn.Conv1d(128, 64, kernel_size=3, padding=1), nn.BatchNorm1d(64), nn.GELU())
        self.classifier = nn.Sequential(
            nn.Linear(64, 32), nn.GELU(), nn.Dropout(0.2), nn.Linear(32, 1), nn.Tanh())

    def forward(self, x):
        x = x.transpose(1, 2)
        x = self.conv_layers(x)
        return self.classifier(x.mean(dim=-1))


class TrainedGRU(nn.Module):
    def __init__(self, input_dim=INPUT_DIM):
        super().__init__()
        self.gru = nn.GRU(input_size=input_dim, hidden_size=134,
                          num_layers=2, batch_first=True, bidirectional=True, dropout=0.2)
        bidir_dim = 268
        self.classifier = nn.Sequential(
            nn.Linear(bidir_dim, 134), nn.GELU(), nn.Dropout(0.2),
            nn.Linear(134, 64), nn.GELU(), nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):
        out, _ = self.gru(x)
        return self.classifier(out.mean(dim=1))


class TrainedMetaEnsemble(nn.Module):
    def __init__(self, input_dim=INPUT_DIM, n_models=N_BASE_MODELS):
        super().__init__()
        self._input_dim = input_dim
        self._n_models = n_models
        self.feature_extractor = nn.Sequential(
            nn.Linear(input_dim, 128), nn.BatchNorm1d(128), nn.GELU(), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.BatchNorm1d(64), nn.GELU())
        self.weight_generator = nn.Sequential(
            nn.Linear(64, 32), nn.GELU(), nn.Linear(32, n_models))
        self.final_predictor = nn.Sequential(
            nn.Linear(64 + n_models, 32), nn.GELU(), nn.Dropout(0.1), nn.Linear(32, 1), nn.Tanh())
        self.confidence_estimator = nn.Sequential(
            nn.Linear(64 + n_models, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

    def forward(self, x):
        features = x[:, :self._input_dim]
        model_preds = x[:, self._input_dim:]
        ctx = self.feature_extractor(features)
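        # NOTE: these softmax weights are computed but not consumed below;
        # final_predictor and confidence_estimator see the raw base-model predictions.
        weights = F.softmax(self.weight_generator(ctx), dim=-1)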
        combined = torch.cat([ctx, model_preds], dim=-1)
        return self.final_predictor(combined), self.confidence_estimator(combined)


class VariationalAutoEncoder(nn.Module):
    def __init__(self, input_dim, latent_dim=32, hidden_dims=None):
        super().__init__()
        self.input_dim = input_dim
        self.latent_dim = latent_dim
        if hidden_dims is None:
            hidden_dims = [256, 128, 64]
        encoder_layers = []
        prev_dim = input_dim
        for h in hidden_dims:
            encoder_layers.extend([nn.Linear(prev_dim, h), nn.BatchNorm1d(h), nn.LeakyReLU(0.2), nn.Dropout(0.2)])
            prev_dim = h
        self.encoder = nn.Sequential(*encoder_layers)
        self.fc_mu = nn.Linear(hidden_dims[-1], latent_dim)
        self.fc_logvar = nn.Linear(hidden_dims[-1], latent_dim)
        decoder_layers = []
        prev_dim = latent_dim
        for h in reversed(hidden_dims):
            decoder_layers.extend([nn.Linear(prev_dim, h), nn.BatchNorm1d(h), nn.LeakyReLU(0.2), nn.Dropout(0.2)])
            prev_dim = h
        decoder_layers.append(nn.Linear(hidden_dims[0], input_dim))
        self.decoder = nn.Sequential(*decoder_layers)

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        return recon, mu, logvar


# Verify all models
x_test = torch.randn(2, 30, INPUT_DIM)
for name, cls in [('QT', TrainedQuantumTransformer), ('BiLSTM', TrainedBidirectionalLSTM),
                   ('DilatedCNN', TrainedDilatedCNN), ('CNN', TrainedCNN), ('GRU', TrainedGRU)]:
    m = cls(input_dim=INPUT_DIM)
    out = m(x_test)
    shape = out[0].shape if isinstance(out, tuple) else out.shape
    print(f'{name}: output {shape}')

meta_test = torch.randn(2, INPUT_DIM + N_BASE_MODELS)
m = TrainedMetaEnsemble(input_dim=INPUT_DIM, n_models=N_BASE_MODELS)
print(f'MetaEnsemble: output {m(meta_test)[0].shape}')

vae_test = torch.randn(2, INPUT_DIM)
m = VariationalAutoEncoder(input_dim=INPUT_DIM)
print(f'VAE: output {m(vae_test)[0].shape}')

print('\nAll 7 model architectures verified!')

QT: output torch.Size([2, 1])
BiLSTM: output torch.Size([2, 1])
DilatedCNN: output torch.Size([2, 1])
CNN: output torch.Size([2, 1])
GRU: output torch.Size([2, 1])
MetaEnsemble: output torch.Size([2, 1])
VAE: output torch.Size([2, 98])

All 7 model architectures verified!


## 4. Training Utilities

In [13]:
class DirectionalLoss(nn.Module):
    """v7: Recalibrated for tanh-bounded model outputs in [-1, 1].

    v6 was calibrated for unbounded micro-predictions (~0.01-0.05). With models
    now outputting tanh-bounded [-1, 1] values, the loss needs recalibration:

    - logit_scale: 20.0 -> 3.0 (tanh outputs are already in [-1,1], 3x gives
      logits in [-3,3] = ~5%-95% probability range. 20x gave [-20,20] = always
      99.99% confident, destroying gradient.)
    - margin: 0.10 -> 0.25 (with tanh outputs spanning [-1,1], require 0.25
      separation — meaningful directional commitment, not micro-signal)
    - mag_floor: 0.01 -> 0.10 (push |pred| above 0.10 — prevents collapse to
      zero while leaving room for low-confidence predictions)
    - mag_weight: 5.0 -> 3.0 (softer penalty since tanh naturally bounds outputs)

    At collapse (all pred=0), v7 penalty:
      BCE(0*3, 0.5) = 0.693 (uninformative)
      + 10 * relu(0.25 - 0) = 2.5 (strong anti-collapse)
      + 3.0 * relu(0.10 - 0) = 0.30
      Total = 3.49 (forces model away from zero quickly)
    """
    def __init__(self, logit_scale=3.0, margin=0.25):
        super().__init__()
        self.logit_scale = logit_scale
        self.margin = margin

    def forward(self, pred, target):
        pred = pred.squeeze(-1) if pred.dim() > 1 else pred
        target = target.squeeze(-1) if target.dim() > 1 else target

        # 1. BCE direction: logit_scale=3.0 maps tanh output to reasonable probabilities
        #    pred=0.5 -> logit=1.5 -> 82% prob, pred=1.0 -> logit=3.0 -> 95% prob
        target_pos = (target > 0).float()
        bce = F.binary_cross_entropy_with_logits(
            pred * self.logit_scale, target_pos)

        # 2. Separation margin: require 0.25 gap between up/down predictions
        pos_mask = target > 0
        neg_mask = target <= 0
        if pos_mask.any() and neg_mask.any():
            separation = pred[pos_mask].mean() - pred[neg_mask].mean()
            sep_loss = F.relu(self.margin - separation)
        else:
            sep_loss = torch.tensor(0.0, device=pred.device)

        # 3. Magnitude floor: push |pred| above 0.10 (was 0.01)
        mag_loss = F.relu(0.10 - pred.abs()).mean()

        return bce + 10.0 * sep_loss + 3.0 * mag_loss

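
def directional_accuracy(predictions, targets):
    """Fraction of samples where sign(prediction) equals sign(target).

    A zero target only counts as correct when the prediction is exactly zero.
    """
    return float(np.mean(np.sign(predictions) == np.sign(targets)))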


def train_epoch(model, loader, optimizer, criterion, gradient_clip=1.0):
    model.train()
    total_loss, n = 0.0, 0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()
        output = model(X_batch)
        pred = output[0] if isinstance(output, tuple) else output
        loss = criterion(pred.squeeze(-1), y_batch)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), gradient_clip)
        optimizer.step()
        total_loss += loss.item()
        n += 1
    return total_loss / max(n, 1)


def validate_epoch(model, loader, criterion):
    model.eval()
    total_loss, n = 0.0, 0
    all_preds, all_targets = [], []
    with torch.no_grad():
        for X_batch, y_batch in loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            output = model(X_batch)
            pred = output[0] if isinstance(output, tuple) else output
            pred = pred.squeeze(-1)
            total_loss += criterion(pred, y_batch).item()
            n += 1
            all_preds.append(pred.cpu().numpy())
            all_targets.append(y_batch.cpu().numpy())
    preds = np.concatenate(all_preds) if all_preds else np.array([])
    targets = np.concatenate(all_targets) if all_targets else np.array([])
    acc = directional_accuracy(preds, targets) if len(preds) > 0 else 0.5
    pred_std = float(np.std(preds)) if len(preds) > 0 else 0.0
    return total_loss / max(n, 1), acc, pred_std

print('Training utilities defined (v7: BCE*3 + 10*sep_margin(0.25) + 3*mag_floor(0.10))')

Training utilities defined (v7: BCE*3 + 10*sep_margin(0.25) + 3*mag_floor(0.10))
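
The collapse arithmetic in the `DirectionalLoss` docstring can be checked by hand. The sketch below recomputes the penalty for the all-zero case with the standard library only. `collapse_penalty` and `sigmoid` are hypothetical helpers, not part of the notebook, and the sketch assumes the batch contains both up and down targets so the separation term is active.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def collapse_penalty(logit_scale=3.0, margin=0.25, mag_floor=0.10,
                     sep_weight=10.0, mag_weight=3.0):
    """Loss value when every prediction is exactly 0."""
    # BCE-with-logits at logit 0 is -log(sigmoid(0)) = ln 2 for either target
    bce = -math.log(sigmoid(0.0 * logit_scale))
    sep = max(0.0, margin - 0.0)     # mean(up preds) - mean(down preds) is 0
    mag = max(0.0, mag_floor - 0.0)  # |pred| is 0 everywhere
    return bce + sep_weight * sep + mag_weight * mag


print(round(collapse_penalty(), 3))  # 3.493

# Probability implied by a tanh-bounded prediction at logit_scale=3.0
for pred in (0.5, 1.0):
    print(pred, round(sigmoid(3.0 * pred), 3))  # 0.818 and 0.953
```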


## 5. Generate Training Data

In [14]:
def walk_forward_split(pair_dfs, train_frac=0.7, val_frac=0.13):
    train_dfs, val_dfs, test_dfs = {}, {}, {}
    for pair, df in pair_dfs.items():
        n = len(df)
        t_end = int(n * train_frac)
        v_end = int(n * (train_frac + val_frac))
        train_dfs[pair] = df.iloc[:t_end].copy().reset_index(drop=True)
        val_dfs[pair] = df.iloc[t_end:v_end].copy().reset_index(drop=True)
        test_dfs[pair] = df.iloc[v_end:].copy().reset_index(drop=True)
        print(f'  {pair}: train={len(train_dfs[pair]):,}, val={len(val_dfs[pair]):,}, test={len(test_dfs[pair]):,}')
    return train_dfs, val_dfs, test_dfs


def generate_sequences_fast(pair_dfs, seq_len=SEQ_LEN, stride=1, cross_asset=True,
                            derivatives_dfs=None, fear_greed_df=None):
    """Vectorized sequence generation with 6-bar soft labels.

    Labels: 30-min forward return x 100, clipped to [-1, 1].
    This gives the model continuous gradient signal proportional to
    move magnitude, instead of hard +/-1 that causes dead-zone collapse.

    Args:
        pair_dfs: Dict of pair -> DataFrame with OHLCV columns
        seq_len: Sequence length per sample
        stride: Step size between windows
        cross_asset: If True, include 15 cross-asset features
        derivatives_dfs: Optional dict of pair -> DataFrame with derivatives columns
        fear_greed_df: Optional DataFrame with [timestamp, fear_greed]
    """
    all_X, all_y = [], []
    warmup = 50

    for pair, df in pair_dfs.items():
        min_rows = seq_len + warmup + LABEL_HORIZON
        if len(df) < min_rows:
            print(f'  Skipping {pair}: only {len(df)} bars (need {min_rows})')
            continue

        cross_data = None
        if cross_asset and len(pair_dfs) > 1:
            cross_data = {p: odf for p, odf in pair_dfs.items() if p != pair}

        # Align derivatives data for this pair
        deriv_data = _align_derivatives(df, pair, derivatives_dfs, fear_greed_df)

        feat_matrix = build_full_feature_matrix(
            df, cross_data=cross_data, pair_name=pair,
            derivatives_data=deriv_data,
        )
        if feat_matrix is None:
            print(f'  Skipping {pair}: feature computation failed')
            continue

        close_vals = df['close'].values.astype(float)
        n_samples = 0

        for end_idx in range(warmup + seq_len, len(df) - LABEL_HORIZON + 1, stride):
            start_idx = end_idx - seq_len

            # Future price: LABEL_HORIZON bars after the window ends
            future_idx = end_idx + LABEL_HORIZON - 1
            if future_idx >= len(df):
                break

            window = feat_matrix[start_idx:end_idx]

            # Per-window standardization
            mean = window.mean(axis=0, keepdims=True)
            std = window.std(axis=0, keepdims=True) + 1e-8
            window = (window - mean) / std

            current_close = close_vals[end_idx - 1]
            future_close = close_vals[future_idx]
            if current_close <= 0:
                continue

            # Soft label: 6-bar forward return, scaled and clipped to [-1, 1]
            ret = future_close / current_close - 1.0
            label = float(np.clip(ret * LABEL_SCALE, -1.0, 1.0))

            all_X.append(window)
            all_y.append(label)
            n_samples += 1

        print(f'  {pair}: {n_samples:,} sequences')

    if not all_X:
        return np.array([]), np.array([])

    X = np.array(all_X, dtype=np.float32)
    y = np.array(all_y, dtype=np.float32)
    up = (y > 0).sum()
    down = (y < 0).sum()
    print(f'Total: {len(X):,} sequences (dim={X.shape[-1]}), '
          f'balance: {up:,} up / {down:,} down '
          f'({up/len(y)*100:.1f}% / {down/len(y)*100:.1f}%)')
    print(f'Label stats: mean={y.mean():.4f}, std={y.std():.4f}, '
          f'min={y.min():.4f}, max={y.max():.4f}')
    return X, y
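
The soft label used in `generate_sequences_fast` is the forward return multiplied by the label scale and clipped to [-1, 1]. A minimal sketch of that mapping is shown below. `soft_label` is a hypothetical helper, and `scale=100.0` assumes `LABEL_SCALE` is 100, as the label-scale line printed in section 6 reports.

```python
def soft_label(current_close, future_close, scale=100.0):
    """Forward return times `scale`, clipped to [-1, 1]."""
    ret = future_close / current_close - 1.0
    return max(-1.0, min(1.0, ret * scale))


print(soft_label(100.0, 100.25))  # a +0.25% move gives roughly 0.25
print(soft_label(100.0, 102.0))   # a +2% move saturates at 1.0
print(soft_label(100.0, 98.0))    # a -2% move saturates at -1.0
```

With this scale, any 30-minute move of 1% or more in either direction hits the clip boundary, so the label only distinguishes magnitudes below 1%.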

In [15]:
%%time
print('Splitting data (walk-forward 70/13/17)...')
train_dfs, val_dfs, test_dfs = walk_forward_split(pair_dfs)

# Auto-adjust stride to fit in Colab RAM (~12 GB)
total_train_bars = sum(len(df) for df in train_dfs.values())
bytes_per_seq = SEQ_LEN * INPUT_DIM * 4
max_ram_gb = 8.0
max_sequences = int(max_ram_gb * 1e9 / bytes_per_seq)
stride = max(1, total_train_bars // max_sequences)
stride = max(stride, 3)

print(f'\nTotal train bars: {total_train_bars:,}')
print(f'Auto stride={stride} (targets <{max_sequences:,} sequences to fit {max_ram_gb}GB RAM)')
has_deriv = bool(derivatives_dfs) or fear_greed_df is not None
print(f'Derivatives data: {"yes" if has_deriv else "no (7 features zero-padded)"}')

print('\nGenerating training sequences...')
X_train, y_train = generate_sequences_fast(
    train_dfs, stride=stride, cross_asset=True,
    derivatives_dfs=derivatives_dfs, fear_greed_df=fear_greed_df,
)

print('\nGenerating validation sequences...')
X_val, y_val = generate_sequences_fast(
    val_dfs, stride=stride, cross_asset=True,
    derivatives_dfs=derivatives_dfs, fear_greed_df=fear_greed_df,
)

print('\nGenerating test sequences...')
X_test, y_test = generate_sequences_fast(
    test_dfs, stride=stride, cross_asset=True,
    derivatives_dfs=derivatives_dfs, fear_greed_df=fear_greed_df,
)

print(f'\nDataset shapes: train={X_train.shape}, val={X_val.shape}, test={X_test.shape}')
assert X_train.shape[-1] == INPUT_DIM
mem_gb = (X_train.nbytes + X_val.nbytes + X_test.nbytes) / 1e9
print(f'Total feature array memory: {mem_gb:.2f} GB')

Splitting data (walk-forward 70/13/17)...
  BTC-USD: train=622,045, val=115,523, test=151,069
  ETH-USD: train=622,047, val=115,523, test=151,069
  SOL-USD: train=402,096, val=74,675, test=97,653
  DOGE-USD: train=487,244, val=90,488, test=118,331
  AVAX-USD: train=397,809, val=73,879, test=96,611
  LINK-USD: train=521,389, val=96,829, test=126,624

Total train bars: 3,052,630
Auto stride=4 (targets <680,272 sequences to fit 8.0GB RAM)
Derivatives data: yes

Generating training sequences...
  BTC-USD: 155,490 sequences
  ETH-USD: 155,491 sequences
  SOL-USD: 100,503 sequences
  DOGE-USD: 121,790 sequences
  AVAX-USD: 99,431 sequences
  LINK-USD: 130,326 sequences
Total: 763,031 sequences (dim=98), balance: 378,394 up / 374,133 down (49.6% / 49.0%)
Label stats: mean=0.0027, std=0.5224, min=-1.0000, max=1.0000

Generating validation sequences...
  BTC-USD: 28,860 sequences
  ETH-USD: 28,860 sequences
  SOL-USD: 18,648 sequences
  DOGE-USD: 22,601 sequences
  AVAX-USD: 18,449 sequences
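
The auto-stride arithmetic from the cell above can be reproduced with the numbers from this run. The sketch assumes `SEQ_LEN = 30` and `INPUT_DIM = 98`, which match the shapes printed earlier. It also compares the floor division used in the notebook with a ceiling division; the ceiling variant is a hypothetical alternative shown only for comparison, not something the notebook does.

```python
import math

SEQ_LEN, INPUT_DIM = 30, 98          # assumed from the shapes printed above
total_train_bars = 3_052_630         # "Total train bars" from the run above
bytes_per_seq = SEQ_LEN * INPUT_DIM * 4  # float32 bytes per window
max_sequences = int(8.0 * 1e9 / bytes_per_seq)

floor_stride = max(3, total_train_bars // max_sequences)           # notebook rule
ceil_stride = max(3, math.ceil(total_train_bars / max_sequences))  # hypothetical

for s in (floor_stride, ceil_stride):
    approx_windows = total_train_bars // s  # ignores warmup/horizon trimming
    gb = approx_windows * bytes_per_seq / 1e9
    print(s, approx_windows, round(gb, 2))
# 4 763157 8.97
# 5 610526 7.18
```

Floor division rounds the stride down, so stride 4 yields about 763K training windows, which is above the nominal 680,272 target and consistent with the 763,031 sequences reported in the output.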
  

## 6. Train Base Models (Phase 1)

In [16]:
EPOCHS = 100
BATCH_SIZE = 64
LR = 3e-4        # Reduced from 1e-3 — prevents overshoot on noisy data
WARMUP_EPOCHS = 3 # Linear warmup before cosine decay
PATIENCE = 15

train_ds = TensorDataset(torch.FloatTensor(X_train), torch.FloatTensor(y_train))
val_ds = TensorDataset(torch.FloatTensor(X_val), torch.FloatTensor(y_val))
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False)

print(f'Train batches: {len(train_loader)}, Val batches: {len(val_loader)}')
print(f'Label horizon: {LABEL_HORIZON} bars ({LABEL_HORIZON * 5} min)')
print(f'Label scale: {LABEL_SCALE} (1% return = label {LABEL_SCALE/100:.1f})')
print(f'LR: {LR}, Warmup: {WARMUP_EPOCHS} epochs, Epochs: {EPOCHS}, Patience: {PATIENCE}')
print('Loss: v7 (BCE(pred*3) + 10*sep_margin(0.25) + 3*mag_floor(0.10))')
print(f'QT optimizer: weight_decay=0, attention LR={LR*0.1:.1e} (0.1x), other LR={LR:.1e}')
print(f'Other models: weight_decay=1e-4, LR={LR:.1e}')
results = {}

Train batches: 11922, Val batches: 2213
Label horizon: 6 bars (30 min)
Label scale: 100 (1% return = label 1.0)
LR: 0.0003, Warmup: 3 epochs, Epochs: 100, Patience: 15
Loss: v7 (BCE(pred*3) + 10*sep_margin(0.25) + 3*mag_floor(0.10))
QT optimizer: weight_decay=0, attention LR=3.0e-05 (0.1x), other LR=3.0e-04
Other models: weight_decay=1e-4, LR=3.0e-04
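
The schedule configured by `WARMUP_EPOCHS` and `EPOCHS` is a linear warmup followed by cosine decay, applied as a multiplier on the base learning rate. The sketch below mirrors the `lr_lambda` closure in `train_base_model` as a standalone function; `lr_multiplier` is a hypothetical name, and the defaults assume the values set in the cell above.

```python
import math


def lr_multiplier(epoch, warmup_epochs=3, total_epochs=100):
    """Linear warmup to 1.0, then cosine decay towards 0."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1)
    return 0.5 * (1 + math.cos(math.pi * progress))


base_lr = 3e-4
for epoch in (0, 1, 2, 3, 51, 99):
    print(epoch, f'{base_lr * lr_multiplier(epoch):.2e}')
```

The multiplier reaches 1.0 at epoch 2 (the last warmup epoch), stays at 1.0 at epoch 3 where the cosine phase begins, and is close to zero by the final epoch.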


In [17]:
def train_base_model(name, model_cls):
    print(f'\n{"="*60}')
    print(f'Training {name}')
    print(f'{"="*60}')

    model = model_cls(input_dim=INPUT_DIM).to(device)
    n_params = sum(p.numel() for p in model.parameters())
    print(f'Parameters: {n_params:,}')

    # v7: QT needs special optimizer — weight_decay compounds 13.5x more on
    # full data (10K+ batches/epoch vs 780 local). Even 1e-5 per step = 10%
    # shrinkage/epoch, which degenerates attention (softmax → uniform → constant).
    #
    # Fix: wd=0 + differential LR (attention at 0.1x) + collapse recovery.
    if name == 'quantum_transformer':
        # Split params: attention-related params get 0.1x LR; both groups use zero weight decay
        attn_params = []
        other_params = []
        for pname, p in model.named_parameters():
            if any(k in pname for k in ['attention', 'pos_encoding', 'skip_enhancement']):
                attn_params.append(p)
            else:
                other_params.append(p)

        optimizer = torch.optim.AdamW([
            {'params': attn_params, 'lr': LR * 0.1, 'weight_decay': 0},
            {'params': other_params, 'lr': LR, 'weight_decay': 0},
        ])
        n_attn = sum(p.numel() for p in attn_params)
        n_other = sum(p.numel() for p in other_params)
        print(f'Param groups: attention={n_attn:,} (lr={LR*0.1:.1e}, wd=0), '
              f'other={n_other:,} (lr={LR:.1e}, wd=0)')
    else:
        wd = 1e-4
        optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=wd)
        print(f'Weight decay: {wd}')

    # Linear warmup + cosine decay
    def lr_lambda(epoch):
        if epoch < WARMUP_EPOCHS:
            return (epoch + 1) / WARMUP_EPOCHS
        progress = (epoch - WARMUP_EPOCHS) / max(EPOCHS - WARMUP_EPOCHS, 1)
        return 0.5 * (1 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

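    # v7 loss: BCE*3 + 10*separation_margin(0.25) + 3*magnitude_floor(0.10),
    # recalibrated for tanh-bounded model outputs in [-1, 1].
    # NOTE: the config banner printed by the previous cell still shows the v6
    # string (BCE*20, margin 0.10); the values passed here are what training uses.
    criterion = DirectionalLoss(logit_scale=3.0, margin=0.25)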

    best_val_loss = float('inf')
    best_acc = 0.0
    best_state = None
    patience_counter = 0
    collapse_recoveries = 0
    max_collapse_recoveries = 3
    t0 = time.time()

    for epoch in range(EPOCHS):
        train_loss = train_epoch(model, train_loader, optimizer, criterion)
        val_loss, val_acc, pred_std = validate_epoch(model, val_loader, criterion)
        scheduler.step()

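        collapsed = pred_std < 0.001

        # Do not promote a collapsed epoch to "best" once a usable checkpoint
        # exists; otherwise collapse recovery could reload a collapsed state.
        if val_loss < best_val_loss and (not collapsed or best_state is None):
            best_val_loss = val_loss
            best_acc = val_acc
            best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
            patience_counter = 0
        else:
            patience_counter += 1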

        if (epoch + 1) % 5 == 0 or epoch == 0 or patience_counter == 0 or collapsed:
            tag = ' *** COLLAPSED ***' if collapsed else ''
            print(f'  Epoch {epoch+1:3d}/{EPOCHS}: train={train_loss:.4f}, val={val_loss:.4f}, '
                  f'acc={val_acc:.3f}, pred_std={pred_std:.4f}, '
                  f'lr={optimizer.param_groups[0]["lr"]:.2e}{tag}')

        # Collapse recovery: reload best checkpoint and halve all LRs.
        # LambdaLR recomputes lr = base_lr * lr_lambda(epoch) on every step(),
        # so the scheduler's base LRs are halved as well; halving only
        # param_groups would be undone by the next scheduler.step().
        if collapsed and epoch >= WARMUP_EPOCHS and best_state is not None:
            if collapse_recoveries < max_collapse_recoveries:
                collapse_recoveries += 1
                model.load_state_dict({k: v.to(device) for k, v in best_state.items()})
                scheduler.base_lrs = [base_lr * 0.5 for base_lr in scheduler.base_lrs]
                for pg in optimizer.param_groups:
                    pg['lr'] *= 0.5
                print(f'  >>> Collapse recovery #{collapse_recoveries}: reloaded best, '
                      f'halved LR to {optimizer.param_groups[0]["lr"]:.2e}')
                patience_counter = 0  # Reset patience after recovery
                continue
            else:
                print(f'  >>> Max collapse recoveries ({max_collapse_recoveries}) reached, stopping')
                break

        if patience_counter >= PATIENCE:
            print(f'  Early stopping at epoch {epoch+1}')
            break

    model.load_state_dict(best_state)
    model.eval()

    test_ds = TensorDataset(torch.FloatTensor(X_test), torch.FloatTensor(y_test))
    test_loader = DataLoader(test_ds, batch_size=128, shuffle=False)
    test_loss, test_acc, test_std = validate_epoch(model, test_loader, criterion)

    elapsed = time.time() - t0
    print(f'\n  Test: loss={test_loss:.4f}, dir_acc={test_acc:.3f}, pred_std={test_std:.4f}')
    print(f'  Time: {elapsed/60:.1f} min, Epochs: {epoch+1}, Collapse recoveries: {collapse_recoveries}')

    save_path = f'models/trained/best_{name}_model.pth'
    torch.save(model.state_dict(), save_path)
    print(f'  Saved: {save_path}')

    results[name] = {
        'val_loss': best_val_loss, 'val_acc': best_acc,
        'test_loss': test_loss, 'test_acc': test_acc,
        'test_pred_std': test_std,
        'epochs': epoch + 1, 'time_min': elapsed / 60,
        'collapse_recoveries': collapse_recoveries,
    }
    return model
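
The warmup and cosine schedule used in `train_base_model` can be checked in isolation. The sketch below copies `lr_lambda` under the hypothetical name `lr_multiplier` and applies it to the attention group's base LR (`LR * 0.1`), which is the value the epoch log prints. It assumes no collapse recovery has changed the LR. Because the log line is printed after `scheduler.step()`, the LR shown for epoch N is the one computed with index N, that is, the rate the next epoch will train with.

```python
import math

EPOCHS = 100
WARMUP_EPOCHS = 3


def lr_multiplier(epoch):
    """Same formula as lr_lambda: linear warmup, then cosine decay to zero."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / max(EPOCHS - WARMUP_EPOCHS, 1)
    return 0.5 * (1 + math.cos(math.pi * progress))


# Base LR of the first param group for the QT model (attention params).
attention_base_lr = 3e-4 * 0.1

printed = {}
for printed_epoch in (1, 2, 10, 20):
    printed[printed_epoch] = f'{attention_base_lr * lr_multiplier(printed_epoch):.2e}'
    print(f'Epoch {printed_epoch:3d}: lr={printed[printed_epoch]}')
```

Under these assumptions the four values are 2.00e-05, 3.00e-05, 2.96e-05 and 2.78e-05, which agree with the `lr=` column of the quantum_transformer log for epochs 1, 2, 10 and 20.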

In [18]:
base_configs = [
    ('quantum_transformer', TrainedQuantumTransformer),
    ('bidirectional_lstm', TrainedBidirectionalLSTM),
    ('dilated_cnn', TrainedDilatedCNN),
    ('cnn', TrainedCNN),
    ('gru', TrainedGRU),
]

trained_base_models = {}
for name, cls in base_configs:
    try:
        model = train_base_model(name, cls)
        trained_base_models[name] = model
    except Exception as e:
        print(f'\nFAILED: {name}: {e}')
        results[name] = {'status': 'failed', 'error': str(e)}

print(f'\n{"="*60}')
print(f'Base models trained: {len(trained_base_models)}/{len(base_configs)}')
print(f'{"="*60}')


Training quantum_transformer
Parameters: 4,659,718
Param groups: attention=1,516,840 (lr=3.0e-05, wd=0), other=3,142,878 (lr=3.0e-04, wd=0)
  Epoch   1/100: train=2.6521, val=2.7593, acc=0.512, pred_std=0.5398, lr=2.00e-05
  Epoch   2/100: train=2.6079, val=2.6192, acc=0.517, pred_std=0.5653, lr=3.00e-05
  Epoch   5/100: train=2.5384, val=2.6722, acc=0.515, pred_std=0.4779, lr=3.00e-05
  Epoch  10/100: train=2.5192, val=2.6718, acc=0.514, pred_std=0.5204, lr=2.96e-05
  Epoch  11/100: train=2.4697, val=2.6071, acc=0.515, pred_std=0.5098, lr=2.95e-05
  Epoch  12/100: train=2.4550, val=2.5977, acc=0.519, pred_std=0.5512, lr=2.94e-05
  Epoch  15/100: train=2.4312, val=2.6471, acc=0.518, pred_std=0.5023, lr=2.89e-05
  Epoch  20/100: train=2.3645, val=2.6107, acc=0.515, pred_std=0.4991, lr=2.78e-05
  Epoch  21/100: train=2.3479, val=2.5767, acc=0.517, pred_std=0.5343, lr=2.75e-05
  Epoch  25/100: train=2.3007, val=2.6310, acc=0.516, pred_std=0.4620, lr=2.64e-05
  Epoch  26/100: train=2.3001

In [21]:
# ================================================================
# Section 6b: Train LightGBM (gradient-boosted trees)
# ================================================================
# LightGBM is structurally different from neural nets -- trees find
# interaction effects and threshold rules that DL models miss.
# This makes it an excellent diversifier in the meta-ensemble.
# ================================================================

import lightgbm as lgb
import pickle
import json

print(f'\n{"="*60}')
print('Training LightGBM (gradient-boosted trees)')
print(f'{"="*60}')

def _prepare_lgb_features(X_seq):
    # Flatten sequence data for LightGBM: [last, mean, std] -> (N, INPUT_DIM*3)
    last = X_seq[:, -1, :]           # (N, INPUT_DIM)
    mean = X_seq.mean(axis=1)        # (N, INPUT_DIM)
    std  = X_seq.std(axis=1)         # (N, INPUT_DIM)
    return np.concatenate([last, mean, std], axis=1)  # (N, INPUT_DIM*3)

lgb_X_train = _prepare_lgb_features(X_train)
lgb_X_val   = _prepare_lgb_features(X_val)
lgb_X_test  = _prepare_lgb_features(X_test)

# Binary classification: is forward return positive?
lgb_y_train = (y_train > 0).astype(int)
lgb_y_val   = (y_val > 0).astype(int)
lgb_y_test  = (y_test > 0).astype(int)

print(f'  LGB features: {lgb_X_train.shape} (last + mean + std of {INPUT_DIM}-dim sequence)')
print(f'  Label balance: train={lgb_y_train.mean():.3f}, val={lgb_y_val.mean():.3f}, test={lgb_y_test.mean():.3f}')

lgb_train = lgb.Dataset(lgb_X_train, label=lgb_y_train)
lgb_val   = lgb.Dataset(lgb_X_val, label=lgb_y_val, reference=lgb_train)

lgb_params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'learning_rate': 0.03,
    'num_leaves': 63,
    'max_depth': 7,
    'min_child_samples': 50,
    'subsample': 0.8,  # NOTE: LightGBM ignores this unless subsample_freq (bagging_freq) > 0
    'colsample_bytree': 0.6,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'verbose': -1,
    'seed': 42,
}

callbacks = [
    lgb.early_stopping(stopping_rounds=30),
    lgb.log_evaluation(period=50),
]

t0 = time.time()
lgb_model = lgb.train(
    lgb_params,
    lgb_train,
    valid_sets=[lgb_val],
    valid_names=['val'],
    num_boost_round=500,
    callbacks=callbacks,
)
lgb_elapsed = time.time() - t0

# Evaluate
lgb_val_prob = lgb_model.predict(lgb_X_val)
lgb_test_prob = lgb_model.predict(lgb_X_test)

# Convert probabilities to signed predictions: (prob - 0.5) * 2 -> [-1, 1]
lgb_val_pred  = np.clip((lgb_val_prob  - 0.5) * 2.0, -1.0, 1.0)
lgb_test_pred = np.clip((lgb_test_prob - 0.5) * 2.0, -1.0, 1.0)

lgb_val_acc  = directional_accuracy(lgb_val_pred, y_val)
lgb_test_acc = directional_accuracy(lgb_test_pred, y_test)
lgb_pred_std = float(np.std(lgb_test_pred))

print(f'\n  LightGBM results:')
print(f'    Val dir_acc:  {lgb_val_acc:.3f}')
print(f'    Test dir_acc: {lgb_test_acc:.3f}')
print(f'    Pred std:     {lgb_pred_std:.4f}')
print(f'    Best round:   {lgb_model.best_iteration}')
print(f'    Time:         {lgb_elapsed:.1f}s')

# Feature importance (top 20)
importance = lgb_model.feature_importance(importance_type='gain')
feat_names = [f'feat_{i}' for i in range(lgb_X_train.shape[1])]
# Label the three sections
for i in range(INPUT_DIM):
    feat_names[i] = f'last_{i}'
    feat_names[INPUT_DIM + i] = f'mean_{i}'
    feat_names[INPUT_DIM * 2 + i] = f'std_{i}'
top_idx = np.argsort(importance)[::-1][:20]
print(f'\n  Top 20 features by gain:')
for rank, idx in enumerate(top_idx):
    print(f'    {rank+1:2d}. {feat_names[idx]:>12s}: {importance[idx]:.0f}')

# Save LightGBM model
lgb_pkl_path = 'models/trained/best_lightgbm_model.pkl'
lgb_meta_path = 'models/trained/lightgbm_meta.json'

with open(lgb_pkl_path, 'wb') as f:
    pickle.dump(lgb_model, f)
print(f'\n  Saved: {lgb_pkl_path} ({os.path.getsize(lgb_pkl_path)/1e6:.1f} MB)')

# Save metadata for local bot
lgb_meta = {
    'input_dim': INPUT_DIM,
    'n_features': lgb_X_train.shape[1],
    'feature_prep': 'last_mean_std',
    'objective': 'binary',
    'best_iteration': lgb_model.best_iteration,
    'val_acc': float(lgb_val_acc),
    'test_acc': float(lgb_test_acc),
}
with open(lgb_meta_path, 'w') as f:
    json.dump(lgb_meta, f, indent=2)
print(f'  Saved: {lgb_meta_path}')

# Register in trained_base_models for meta-ensemble
trained_base_models['lightgbm'] = lgb_model

results['lightgbm'] = {
    'val_loss': float(lgb_model.best_score['val']['binary_logloss']),
    'val_acc': lgb_val_acc,
    'test_acc': lgb_test_acc,
    'test_pred_std': lgb_pred_std,
    'epochs': lgb_model.best_iteration,
    'time_min': lgb_elapsed / 60,
}

print(f'\n  LightGBM registered as base model #{len(trained_base_models)}')
print(f'  trained_base_models keys: {list(trained_base_models.keys())}')


Training LightGBM (gradient-boosted trees)
  LGB features: (763031, 294) (last + mean + std of 98-dim sequence)
  Label balance: train=0.496, val=0.497, test=0.491
Training until validation scores don't improve for 30 rounds
[50]	val's binary_logloss: 0.691812
[100]	val's binary_logloss: 0.691611
[150]	val's binary_logloss: 0.691563
Early stopping, best iteration is:
[136]	val's binary_logloss: 0.691545

  LightGBM results:
    Val dir_acc:  0.514
    Test dir_acc: 0.509
    Pred std:     0.0745
    Best round:   136
    Time:         21.1s

  Top 20 features by gain:
     1.      last_18: 25121
     2.        std_0: 8189
     3.       std_35: 6396
     4.      last_24: 5794
     5.        std_2: 4256
     6.      last_20: 4223
     7.      last_34: 4174
     8.      last_31: 3962
     9.       std_36: 3913
    10.      last_32: 3904
    11.       last_3: 3778
    12.      last_35: 3603
    13.      last_33: 3423
    14.      mean_33: 3211
    15.      last_36: 3193
    16.      last_
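
The LightGBM cell relies on two small transforms: collapsing each sequence into `[last, mean, std]` features, and mapping a predicted probability to a signed score in [-1, 1]. The sketch below reproduces both on toy data so the shapes are easy to verify. The names `flatten_sequences` and `prob_to_signed` are stand-ins for the notebook's own code, and the sequence length of 30 is an assumption made only for the example.

```python
import numpy as np

INPUT_DIM = 98
SEQ_LEN = 30  # assumed for illustration; the real value is set earlier in the notebook


def flatten_sequences(X_seq):
    """(N, T, D) -> (N, 3*D): last step, per-feature mean, per-feature std."""
    last = X_seq[:, -1, :]
    mean = X_seq.mean(axis=1)
    std = X_seq.std(axis=1)
    return np.concatenate([last, mean, std], axis=1)


def prob_to_signed(prob):
    """Map P(up) in [0, 1] to a signed score in [-1, 1]."""
    return np.clip((prob - 0.5) * 2.0, -1.0, 1.0)


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(4, SEQ_LEN, INPUT_DIM))
flat = flatten_sequences(X_toy)
print(flat.shape)  # (4, 294)

signed = prob_to_signed(np.array([0.45, 0.50, 0.537]))
print(signed)
```

A probability of 0.537 maps to a score of only 0.074, so a small `Pred std` such as the one reported above means the booster's probabilities stay close to 0.5.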

## 7. Train Meta-Ensemble (Phase 2)

In [None]:
assert len(trained_base_models) == N_BASE_MODELS, f'Need all {N_BASE_MODELS} base models, got {len(trained_base_models)}: {list(trained_base_models.keys())}'

def generate_meta_inputs(base_models, X, y):
    n = len(X)
    all_preds = {name: np.zeros(n) for name in BASE_MODEL_NAMES}
    for name in BASE_MODEL_NAMES:
        model = base_models[name]
        if name == 'lightgbm':
            # LightGBM uses flattened features, not sequence tensors
            lgb_feats = _prepare_lgb_features(X)
            probs = model.predict(lgb_feats)
            all_preds[name] = np.clip((probs - 0.5) * 2.0, -1.0, 1.0)
            acc = directional_accuracy(all_preds[name], y)
            print(f'  {name}: dir_acc={acc:.3f}')
            continue
        model_preds = []
        for i in range(0, n, 128):
            batch = torch.FloatTensor(X[i:i+128]).to(device)
            with torch.no_grad():
                output = model(batch)
                pred = output[0] if isinstance(output, tuple) else output
                model_preds.append(torch.tanh(pred.squeeze(-1)).cpu().numpy())
        all_preds[name] = np.concatenate(model_preds)
        acc = directional_accuracy(all_preds[name], y)
        print(f'  {name}: dir_acc={acc:.3f}')
    last_features = X[:, -1, :]
    pred_matrix = np.column_stack([all_preds[name] for name in BASE_MODEL_NAMES])
    meta_X = np.concatenate([last_features, pred_matrix], axis=1).astype(np.float32)
    print(f'Meta-inputs: {meta_X.shape} (features={INPUT_DIM} + {N_BASE_MODELS} model preds)')
    return meta_X, y

print('Generating meta-inputs from base model predictions...')
print('\nTraining set:')
meta_X_train, meta_y_train = generate_meta_inputs(trained_base_models, X_train, y_train)
print('\nValidation set:')
meta_X_val, meta_y_val = generate_meta_inputs(trained_base_models, X_val, y_val)
print('\nTest set:')
meta_X_test, meta_y_test = generate_meta_inputs(trained_base_models, X_test, y_test)
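
The layout of the meta-inputs built by `generate_meta_inputs` can be sketched on toy arrays: the last time step of each sequence, followed by one prediction column per base model. The model names, the random predictions and the sequence length below are placeholders, not values from the notebook.

```python
import numpy as np

INPUT_DIM = 98
SEQ_LEN = 30  # assumed for illustration only
model_names = ['model_a', 'model_b', 'model_c']  # placeholders for BASE_MODEL_NAMES

rng = np.random.default_rng(0)
X = rng.normal(size=(5, SEQ_LEN, INPUT_DIM)).astype(np.float32)
# Stand-in predictions in [-1, 1]; the notebook obtains these from the trained models.
preds = {name: rng.uniform(-1, 1, size=len(X)) for name in model_names}

last_features = X[:, -1, :]
pred_matrix = np.column_stack([preds[name] for name in model_names])
meta_X = np.concatenate([last_features, pred_matrix], axis=1).astype(np.float32)
print(meta_X.shape)  # (5, 101)

# The prediction columns can be recovered by slicing from INPUT_DIM onwards.
recovered = meta_X[:, INPUT_DIM:]
```

Keeping the predictions in the trailing columns is what allows a later cell to compute a simple-average baseline with `meta_X_test[:, INPUT_DIM:].mean(axis=1)`.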

In [None]:
print(f'\n{"="*60}')
print('Training Meta-Ensemble')
print(f'{"="*60}')

META_EPOCHS = 80
META_LR = 3e-4
META_PATIENCE = 12

meta_model = TrainedMetaEnsemble(input_dim=INPUT_DIM).to(device)
optimizer = torch.optim.AdamW(meta_model.parameters(), lr=META_LR, weight_decay=1e-4)

# Warmup + cosine decay
def meta_lr_lambda(epoch):
    if epoch < 3:
        return (epoch + 1) / 3
    progress = (epoch - 3) / max(META_EPOCHS - 3, 1)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, meta_lr_lambda)
criterion = DirectionalLoss(logit_scale=20.0, margin=0.10)

meta_train_ds = TensorDataset(torch.FloatTensor(meta_X_train), torch.FloatTensor(meta_y_train))
meta_val_ds = TensorDataset(torch.FloatTensor(meta_X_val), torch.FloatTensor(meta_y_val))
meta_train_loader = DataLoader(meta_train_ds, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
meta_val_loader = DataLoader(meta_val_ds, batch_size=BATCH_SIZE, shuffle=False)

best_val_loss = float('inf')
best_state = None
patience_counter = 0
t0 = time.time()

for epoch in range(META_EPOCHS):
    meta_model.train()
    total_loss, n_b = 0.0, 0
    for X_b, y_b in meta_train_loader:
        X_b, y_b = X_b.to(device), y_b.to(device)
        optimizer.zero_grad()
        pred, _ = meta_model(X_b)
        loss = criterion(pred.squeeze(-1), y_b)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(meta_model.parameters(), 1.0)
        optimizer.step()
        total_loss += loss.item(); n_b += 1
    train_loss = total_loss / max(n_b, 1)

    meta_model.eval()
    v_loss, v_n = 0.0, 0
    v_preds, v_tgts = [], []
    with torch.no_grad():
        for X_b, y_b in meta_val_loader:
            X_b, y_b = X_b.to(device), y_b.to(device)
            pred, _ = meta_model(X_b)
            v_loss += criterion(pred.squeeze(-1), y_b).item(); v_n += 1
            v_preds.append(pred.squeeze(-1).cpu().numpy())
            v_tgts.append(y_b.cpu().numpy())
    val_loss = v_loss / max(v_n, 1)
    val_acc = directional_accuracy(np.concatenate(v_preds), np.concatenate(v_tgts))
    pred_std = float(np.std(np.concatenate(v_preds)))
    scheduler.step()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = {k: v.cpu().clone() for k, v in meta_model.state_dict().items()}
        patience_counter = 0
    else:
        patience_counter += 1

    if (epoch+1) % 5 == 0 or epoch == 0 or patience_counter == 0:
        print(f'  Epoch {epoch+1:3d}/{META_EPOCHS}: train={train_loss:.4f}, val={val_loss:.4f}, '
              f'acc={val_acc:.3f}, pred_std={pred_std:.4f}')

    if patience_counter >= META_PATIENCE:
        print(f'  Early stopping at epoch {epoch+1}')
        break

meta_model.load_state_dict(best_state)
meta_model.eval()

# Test
meta_test_ds = TensorDataset(torch.FloatTensor(meta_X_test), torch.FloatTensor(meta_y_test))
meta_test_loader = DataLoader(meta_test_ds, batch_size=128, shuffle=False)
t_preds, t_tgts = [], []
with torch.no_grad():
    for X_b, y_b in meta_test_loader:
        pred, _ = meta_model(X_b.to(device))
        t_preds.append(pred.squeeze(-1).cpu().numpy())
        t_tgts.append(y_b.numpy())
test_acc = directional_accuracy(np.concatenate(t_preds), np.concatenate(t_tgts))
simple_avg = meta_X_test[:, INPUT_DIM:].mean(axis=1)
simple_acc = directional_accuracy(simple_avg, meta_y_test)

elapsed = time.time() - t0
print(f'\n  Test dir_acc: {test_acc:.3f}')
print(f'  Simple average baseline: {simple_acc:.3f}')
print(f'  Ensemble lift: {(test_acc - simple_acc)*100:+.1f} pp')
print(f'  Time: {elapsed/60:.1f} min')

torch.save(meta_model.state_dict(), 'models/trained/best_meta_ensemble_model.pth')
print('  Saved: models/trained/best_meta_ensemble_model.pth')

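# `val_acc` from the loop is the last epoch's value; recompute it on the restored
# best checkpoint so it matches `best_val_loss`.
with torch.no_grad():
    best_val_preds = np.concatenate([
        meta_model(X_b.to(device))[0].squeeze(-1).cpu().numpy()
        for X_b, _ in meta_val_loader
    ])
best_val_acc = directional_accuracy(best_val_preds, meta_y_val)

results['meta_ensemble'] = {
    'val_loss': best_val_loss, 'val_acc': best_val_acc,
    'test_acc': test_acc, 'simple_avg_acc': simple_acc,
    'epochs': epoch + 1, 'time_min': elapsed / 60,
}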
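
The "ensemble lift" printed above is the difference, in percentage points, between the meta-model's directional accuracy and that of a plain average of the base predictions. A toy calculation is sketched below. `sign_agreement` is a hypothetical stand-in for the notebook's `directional_accuracy`, using the same sign-match rule that appears in Section 9b, and all numbers are made up.

```python
def sign_agreement(preds, targets):
    """Fraction of samples where prediction and target have the same non-zero sign."""
    hits = sum(1 for p, t in zip(preds, targets)
               if (p > 0 and t > 0) or (p < 0 and t < 0))
    return hits / len(preds)


targets = [0.4, -0.2, 0.1, -0.6, 0.3]

# One row per base model, one column per sample (toy values).
base_preds = [
    [0.2, 0.1, -0.1, -0.3, 0.2],
    [-0.1, -0.2, 0.3, 0.1, 0.1],
    [0.3, 0.2, 0.2, -0.2, -0.4],
]
simple_avg = [sum(col) / len(col) for col in zip(*base_preds)]

# Toy stacked predictions standing in for the meta-ensemble output.
meta_preds = [0.5, -0.1, 0.2, -0.4, 0.1]

simple_acc = sign_agreement(simple_avg, targets)
meta_acc = sign_agreement(meta_preds, targets)
lift_pp = (meta_acc - simple_acc) * 100
print(f'simple={simple_acc:.3f}, meta={meta_acc:.3f}, lift={lift_pp:+.1f} pp')
```

A lift near zero would suggest the stacking layer adds little over averaging, which is why the notebook reports the baseline next to the ensemble accuracy.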

## 8. Train VAE (Phase 3)

In [None]:
print(f'\n{"="*60}')
print('Training VAE Anomaly Detector')
print(f'{"="*60}')

vae_samples = X_train[:, -1, :]
print(f'VAE training samples: {vae_samples.shape}')

VAE_EPOCHS = 100
VAE_LR = 1e-3
VAE_PATIENCE = 20

vae_model = VariationalAutoEncoder(input_dim=INPUT_DIM, latent_dim=32).to(device)
vae_optimizer = torch.optim.Adam(vae_model.parameters(), lr=VAE_LR)

n_vae = len(vae_samples)
n_vae_train = int(n_vae * 0.85)
vae_train_ds = TensorDataset(torch.FloatTensor(vae_samples[:n_vae_train]))
vae_val_ds = TensorDataset(torch.FloatTensor(vae_samples[n_vae_train:]))
vae_train_loader = DataLoader(vae_train_ds, batch_size=64, shuffle=True)
vae_val_loader = DataLoader(vae_val_ds, batch_size=64, shuffle=False)

def vae_loss_fn(recon, x, mu, logvar):
    recon_loss = F.mse_loss(recon, x, reduction='sum') / x.size(0)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon_loss + kl_loss

best_vae_loss = float('inf')
best_vae_state = None
vae_patience = 0
t0 = time.time()

for epoch in range(VAE_EPOCHS):
    vae_model.train()
    total_loss, n_b = 0.0, 0
    for (batch,) in vae_train_loader:
        batch = batch.to(device)
        vae_optimizer.zero_grad()
        recon, mu, logvar = vae_model(batch)
        loss = vae_loss_fn(recon, batch, mu, logvar)
        loss.backward()
        vae_optimizer.step()
        total_loss += loss.item(); n_b += 1
    train_loss = total_loss / max(n_b, 1)

    vae_model.eval()
    v_loss, v_n = 0.0, 0
    with torch.no_grad():
        for (batch,) in vae_val_loader:
            batch = batch.to(device)
            recon, mu, logvar = vae_model(batch)
            v_loss += vae_loss_fn(recon, batch, mu, logvar).item(); v_n += 1
    val_loss = v_loss / max(v_n, 1)

    if val_loss < best_vae_loss:
        best_vae_loss = val_loss
        best_vae_state = {k: v.cpu().clone() for k, v in vae_model.state_dict().items()}
        vae_patience = 0
    else:
        vae_patience += 1

    if (epoch+1) % 10 == 0 or epoch == 0 or vae_patience == 0:
        print(f'  Epoch {epoch+1:3d}/{VAE_EPOCHS}: train={train_loss:.4f}, val={val_loss:.4f}')

    if vae_patience >= VAE_PATIENCE:
        print(f'  Early stopping at epoch {epoch+1}')
        break

vae_model.load_state_dict(best_vae_state)
torch.save(vae_model.state_dict(), 'models/trained/vae_anomaly_detector.pth')

vae_model.eval()
errors = []
with torch.no_grad():
    for i in range(0, len(vae_samples), 128):
        batch = torch.FloatTensor(vae_samples[i:i+128]).to(device)
        recon, _, _ = vae_model(batch)
        err = ((recon - batch) ** 2).mean(dim=1).cpu().numpy()
        errors.extend(err)
errors = np.array(errors)
elapsed = time.time() - t0

print(f'\n  Reconstruction error: p50={np.percentile(errors, 50):.4f}, '
      f'p95={np.percentile(errors, 95):.4f}, p99={np.percentile(errors, 99):.4f}')
print(f'  Time: {elapsed/60:.1f} min')
print(f'  Saved: models/trained/vae_anomaly_detector.pth')

results['vae'] = {
    'val_loss': best_vae_loss,
    'p50': float(np.percentile(errors, 50)),
    'p95': float(np.percentile(errors, 95)),
    'p99': float(np.percentile(errors, 99)),
    'epochs': epoch + 1,
}
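
To make the two terms of `vae_loss_fn` concrete, the sketch below evaluates them for a single sample in plain Python. The helper name `vae_loss_single` and the numbers are illustrative only; the formulas mirror the summed squared error and the KL term used above.

```python
import math


def vae_loss_single(recon, x, mu, logvar):
    """Return (reconstruction term, KL term) for one sample.

    Reconstruction: sum of squared errors over features.
    KL: -0.5 * sum(1 + logvar - mu^2 - exp(logvar)) against a unit Gaussian prior.
    """
    recon_loss = sum((r - v) ** 2 for r, v in zip(recon, x))
    kl = -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, logvar))
    return recon_loss, kl


# Posterior equal to the prior (mu=0, logvar=0): the KL term vanishes.
recon_a, kl_a = vae_loss_single([1.5, 2.0], [1.0, 2.0], [0.0, 0.0], [0.0, 0.0])

# Shifting one latent mean to 1.0 costs 0.5 in KL.
recon_b, kl_b = vae_loss_single([1.5, 2.0], [1.0, 2.0], [1.0, 0.0], [0.0, 0.0])

print(recon_a, kl_a, kl_b)
```

Note that the training loss sums the squared error over features, while the reported reconstruction percentiles use `.mean(dim=1)`, so the two are on different scales (a factor of `INPUT_DIM` apart).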

## 9. Summary & Download

In [None]:
print(f'\n{"="*60}')
print('TRAINING SUMMARY')
print(f'{"="*60}')
print(f'{"Model":<25} {"Val Loss":>10} {"Val Acc":>10} {"Test Acc":>10} {"PredStd":>10} {"Epochs":>8} {"Recov":>6}')
print('-' * 85)
for name in [n for n, _ in base_configs] + ['lightgbm', 'meta_ensemble', 'vae']:
    r = results.get(name, {})
    vl = f"{r.get('val_loss', 0):.4f}" if 'val_loss' in r else 'N/A'
    va = f"{r.get('val_acc', 0):.3f}" if 'val_acc' in r else 'N/A'
    ta = f"{r.get('test_acc', 0):.3f}" if 'test_acc' in r else 'N/A'
    ps = f"{r.get('test_pred_std', 0):.4f}" if 'test_pred_std' in r else 'N/A'
    ep = str(r.get('epochs', 'N/A'))
    rec = str(r.get('collapse_recoveries', '-'))
    print(f'{name:<25} {vl:>10} {va:>10} {ta:>10} {ps:>10} {ep:>8} {rec:>6}')

# Check for collapsed models
collapsed = [n for n, r in results.items()
             if 'test_pred_std' in r and r['test_pred_std'] < 0.001]
if collapsed:
    print(f'\n*** WARNING: {len(collapsed)} model(s) collapsed to constant predictions: {collapsed}')
    print('*** These weights will produce ~0 predictions in production.')
else:
    print(f'\nAll models producing varied predictions (no collapse detected)')

has_deriv = bool(derivatives_dfs) or fear_greed_df is not None
print(f'\nInput dim: {INPUT_DIM} (46 single-pair + 15 cross-asset + 7 derivatives, padded to 98)')
print(f'Derivatives data: {"yes" if has_deriv else "no (7 features zero-padded)"}')
print(f'Label: {LABEL_HORIZON}-bar ({LABEL_HORIZON * 5}-min) forward return, scale={LABEL_SCALE}')
print(f'Loss: v6 (BCE*20 + 10*sep_margin(0.10) + 5*mag_floor)')
print(f'QT optimizer: wd=0, attention LR=0.1x (v7 — prevents attention weight collapse)')
print(f'Other models: wd=1e-4, LR={LR:.1e}')
print(f'Training data: {sum(len(df) for df in pair_dfs.values()):,} bars across {len(pair_dfs)} pairs')

In [None]:
# ================================================================
# Section 9b: Confidence-Stratified Accuracy Analysis
# ================================================================
# Does model confidence actually correlate with accuracy?
# If high-confidence predictions aren't more accurate, the bot's
# confidence filtering is useless. This analysis validates the
# signal before deploying.
# ================================================================

print(f'\n{"="*60}')
print('CONFIDENCE-STRATIFIED ACCURACY ANALYSIS')
print(f'{"="*60}')

def confidence_stratified_analysis(model_name, predictions, targets, n_quintiles=5):
    predictions = np.array(predictions).flatten()
    targets = np.array(targets).flatten()
    confidence = np.abs(predictions)
    correct = ((predictions > 0) & (targets > 0)) | ((predictions < 0) & (targets < 0))
    try:
        quintile_edges = np.percentile(confidence, np.linspace(0, 100, n_quintiles + 1))
        quintile_edges = np.unique(quintile_edges)
        if len(quintile_edges) < 3:
            print(f'  {model_name}: insufficient confidence spread for quintile analysis')
            return None
    except Exception as e:
        print(f'  {model_name}: quintile computation failed: {e}')
        return None
    rows = []
    for i in range(len(quintile_edges) - 1):
        lo, hi = quintile_edges[i], quintile_edges[i + 1]
        if i == len(quintile_edges) - 2:
            mask = (confidence >= lo) & (confidence <= hi)
        else:
            mask = (confidence >= lo) & (confidence < hi)
        n = mask.sum()
        if n < 10:
            continue
        acc = correct[mask].mean()
        avg_conf = confidence[mask].mean()
        rows.append({
            'quintile': i + 1,
            'conf_range': f'{lo:.3f}-{hi:.3f}',
            'n_samples': int(n),
            'accuracy': float(acc),
            'avg_confidence': float(avg_conf),
        })
    return rows

# Collect predictions from all models on test set
print('\nCollecting test-set predictions from all models...\n')

model_test_preds = {}

# DL models
for name in ['quantum_transformer', 'bidirectional_lstm', 'dilated_cnn', 'cnn', 'gru']:
    if name not in trained_base_models:
        continue
    model = trained_base_models[name]
    model.eval()
    preds_list = []
    for i in range(0, len(X_test), 128):
        batch = torch.FloatTensor(X_test[i:i+128]).to(device)
        with torch.no_grad():
            output = model(batch)
            pred = output[0] if isinstance(output, tuple) else output
            preds_list.append(torch.tanh(pred.squeeze(-1)).cpu().numpy())
    model_test_preds[name] = np.concatenate(preds_list)

# LightGBM
if 'lightgbm' in trained_base_models:
    lgb_feats_test = _prepare_lgb_features(X_test)
    lgb_probs = trained_base_models['lightgbm'].predict(lgb_feats_test)
    model_test_preds['lightgbm'] = np.clip((lgb_probs - 0.5) * 2.0, -1.0, 1.0)

# Meta-ensemble
try:
    meta_preds_list = []
    meta_test_loader_q = DataLoader(
        TensorDataset(torch.FloatTensor(meta_X_test), torch.FloatTensor(meta_y_test)),
        batch_size=128, shuffle=False)
    meta_model.eval()
    with torch.no_grad():
        for X_b, y_b in meta_test_loader_q:
            pred, conf = meta_model(X_b.to(device))
            meta_preds_list.append(torch.tanh(pred.squeeze(-1)).cpu().numpy())
    model_test_preds['meta_ensemble'] = np.concatenate(meta_preds_list)
except Exception as e:
    print(f'  Meta-ensemble predictions failed: {e}')

# Run analysis for each model
print(f'{"Model":<22s} {"Quintile":>8s} {"Conf Range":>14s} {"N":>6s} {"Accuracy":>10s} {"AvgConf":>8s}')
print('-' * 73)

all_strat_results = {}
for name, preds in model_test_preds.items():
    rows = confidence_stratified_analysis(name, preds, y_test)
    if rows is None:
        continue
    all_strat_results[name] = rows
    for r in rows:
        print(f'{name:<22s} {r["quintile"]:>8d} {r["conf_range"]:>14s} {r["n_samples"]:>6d} '
              f'{r["accuracy"]:>10.3f} {r["avg_confidence"]:>8.3f}')
    # Check monotonicity
    accs = [r['accuracy'] for r in rows]
    if len(accs) >= 3:
        increasing = sum(1 for i in range(1, len(accs)) if accs[i] > accs[i-1])
        total = len(accs) - 1
        mono_score = increasing / total if total > 0 else 0
        verdict = 'GOOD' if mono_score >= 0.6 else 'WEAK' if mono_score >= 0.3 else 'BAD'
        print(f'  -> Monotonicity: {increasing}/{total} ({mono_score:.0%}) -- {verdict}')
    print()

# Overall summary
print(f'\n{"="*60}')
print('CONFIDENCE FILTERING VERDICT')
print(f'{"="*60}')
good_models, weak_models, bad_models = [], [], []
for name, rows in all_strat_results.items():
    accs = [r['accuracy'] for r in rows]
    if len(accs) < 3:
        continue
    spread = accs[-1] - accs[0]
    if spread > 0.03:
        good_models.append((name, spread))
    elif spread > 0.01:
        weak_models.append((name, spread))
    else:
        bad_models.append((name, spread))

if good_models:
    print(f'\nModels where confidence filtering HELPS (top-bottom spread > 3pp):')
    for name, spread in good_models:
        print(f'  {name}: {spread*100:+.1f}pp')
if weak_models:
    print(f'\nModels with WEAK confidence signal (1-3pp spread):')
    for name, spread in weak_models:
        print(f'  {name}: {spread*100:+.1f}pp')
if bad_models:
    print(f'\nModels where confidence filtering is USELESS (<1pp spread):')
    for name, spread in bad_models:
        print(f'  {name}: {spread*100:+.1f}pp')

print(f'\nRecommendation: Set confidence threshold based on the meta-ensemble\'s')
print(f'top quintile accuracy. If meta-ensemble shows good monotonicity,')
print(f'filter predictions below the 40th percentile confidence.')
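
The stratification logic can be exercised on synthetic data before trusting it on real predictions. The sketch below is a condensed variant of `confidence_stratified_analysis` under the hypothetical name `accuracy_by_confidence`. The toy data is generated so that the chance of a correct sign grows with `|prediction|`, which is an assumption of the example, not a property of the trained models.

```python
import numpy as np


def accuracy_by_confidence(preds, targets, n_bins=5):
    """Bin samples by |prediction| percentile and report sign accuracy per bin."""
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    conf = np.abs(preds)
    correct = ((preds > 0) & (targets > 0)) | ((preds < 0) & (targets < 0))
    edges = np.unique(np.percentile(conf, np.linspace(0, 100, n_bins + 1)))
    rows = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        is_last = i == len(edges) - 2
        # The last bin is closed on the right so the maximum is not dropped.
        mask = (conf >= lo) & ((conf <= hi) if is_last else (conf < hi))
        if mask.any():
            rows.append((float(lo), float(hi), int(mask.sum()),
                         float(correct[mask].mean())))
    return rows


rng = np.random.default_rng(0)
toy_preds = rng.uniform(-1, 1, size=2000)
# Synthetic ground truth: P(correct sign) rises from 0.5 to 0.7 with confidence.
p_correct = 0.5 + 0.2 * np.abs(toy_preds)
agree = rng.random(2000) < p_correct
toy_targets = np.where(agree, np.sign(toy_preds), -np.sign(toy_preds))

rows = accuracy_by_confidence(toy_preds, toy_targets)
for lo, hi, n, acc in rows:
    print(f'{lo:.3f}-{hi:.3f}  n={n:4d}  acc={acc:.3f}')
```

If per-bin accuracy on real test predictions stays flat where this synthetic case rises, that would point to the confidence values carrying little information.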

In [None]:
# Zip and save to Google Drive + browser download
zip_path = 'trained_models_98dim.zip'
with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
    for f in os.listdir('models/trained'):
        if f.endswith(('.pth', '.pkl', '.json')):
            full = os.path.join('models/trained', f)
            zf.write(full, f)
            size_mb = os.path.getsize(full) / 1e6
            print(f'  Added: {f} ({size_mb:.1f} MB)')

zip_size = os.path.getsize(zip_path) / 1e6
print(f'\nZip: {zip_path} ({zip_size:.1f} MB)')

# Save to Google Drive
drive_out = f'/content/drive/My Drive/{DRIVE_FOLDER}/{zip_path}'
shutil.copy2(zip_path, drive_out)
print(f'Saved to Drive: {drive_out}')

# Also try browser download
try:
    from google.colab import files
    files.download(zip_path)
    print('Browser download started')
except Exception as e:
    print(f'Browser download failed ({e})')
    print(f'Download from Google Drive instead: {drive_out}')

## 10. Post-Download

After downloading `trained_models_98dim.zip` from Google Drive, unzip and copy:

```bash
cd ~/Downloads
unzip trained_models_98dim.zip -d trained_models_98dim
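# The zip also contains the LightGBM pickle and its JSON metadata, so copy all three types
cp trained_models_98dim/*.{pth,pkl,json} ~/Downloads/bitcoin-trading-bot-renaissance/models/trained/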
```

Then restart the bot — it will auto-detect the new 98-dim weights via `_detect_input_dim()`.
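
If `lightgbm_meta.json` is copied alongside the weights, a quick consistency check before restarting may catch a mismatched feature layout. The sketch below is a hypothetical helper, `check_lgb_meta`, not part of the bot. It only reads fields that Section 6b writes (`input_dim`, `n_features`, `feature_prep`), and the demo runs against a throwaway file rather than a real artifact.

```python
import json
import os
import tempfile


def check_lgb_meta(meta_path, expected_input_dim=98):
    """Return a list of problems found in the metadata; empty means it looks consistent."""
    with open(meta_path) as f:
        meta = json.load(f)
    problems = []
    if meta.get('input_dim') != expected_input_dim:
        problems.append(
            f"input_dim={meta.get('input_dim')} but expected {expected_input_dim}")
    if meta.get('feature_prep') == 'last_mean_std':
        # last + mean + std of each feature gives three blocks of input_dim columns.
        if meta.get('n_features') != 3 * expected_input_dim:
            problems.append(
                f"n_features={meta.get('n_features')} but expected {3 * expected_input_dim}")
    return problems


# Demo with a temporary file that mimics the fields written in Section 6b.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'lightgbm_meta.json')
    with open(demo_path, 'w') as f:
        json.dump({'input_dim': 98, 'n_features': 294,
                   'feature_prep': 'last_mean_std'}, f)
    demo_problems = check_lgb_meta(demo_path)

print(demo_problems)
```

For a real run, the path passed to `check_lgb_meta` would be the copied `models/trained/lightgbm_meta.json`.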