# Importing Libs

In [1]:
# ML
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
import lightgbm as lgb
from catboost import CatBoostClassifier, Pool

from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    RobustScaler,
    Normalizer,
    OneHotEncoder,
    LabelEncoder,
    OrdinalEncoder
)

from sklearn.pipeline import Pipeline
from sklearn.base import clone

from sklearn.metrics import (
    accuracy_score,
    mean_squared_error,
    r2_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
    balanced_accuracy_score,
    average_precision_score,
    roc_auc_score,
    brier_score_loss
)
from sklearn.model_selection import (
    TimeSeriesSplit
)

import optuna

import sklearn

# DL
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
import torch.nn.functional as F

# TA tools
import ta

# basic
import numpy as np
import pandas as pd

# visualize
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML, display

from tqdm.notebook import tqdm

import logging
import time

# options
pd.set_option('display.max_columns', None)

# versions used
print(f"pandas=={pd.__version__}")
print(f"numpy=={np.__version__}")
print(f"sklearn=={sklearn.__version__}")
print(f"optuna=={optuna.__version__}")

pandas==2.2.2
numpy==2.0.2
sklearn==1.6.1
optuna==4.4.0


# Data

We loaded the data in the `stock_loader_experiments.ipynb` notebook using our cctx library.

In [2]:
# load BTC/USD one-minute bars (since 2025-03-01)
df_base = pd.read_parquet('BTC_USD_since_2025_03_01.parquet')

# parse the timestamp column
df_base['ts'] = pd.to_datetime(df_base['ts'])

df_base.sample(1)

Unnamed: 0,ts,Open,High,Low,Close,Volume
99054,2025-05-08 18:54:00+00:00,101375.55,101380.96,101349.2,101349.21,2.79631


The data may be inconsistent (i.e. some minutes may be missing), so let's check it.

In [3]:
def check_minute_difference(datetime1, datetime2):
  difference = datetime1 - datetime2
  return difference.total_seconds() == 60.0

In [4]:
# print the position of every row whose successor is not exactly one minute later
for i in range(len(df_base) - 1):
    if not check_minute_difference(df_base['ts'].iloc[i + 1], df_base['ts'].iloc[i]):
        print(i)

In a real-world implementation we would want to build the training dataset ourselves from real-time data; for now we will treat each row as a single minute.
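
The same consistency check can also be written without a Python loop. The sketch below is only an illustration on made-up timestamps: the helper name `find_minute_gaps` is hypothetical, and unlike the loop above it reports the row *after* each gap rather than the row before it.

```python
import pandas as pd

def find_minute_gaps(ts: pd.Series) -> pd.DataFrame:
    """Return the rows whose step from the previous timestamp is not exactly 1 minute."""
    step = ts.diff()
    # the first row has no predecessor (NaT), so it is excluded explicitly
    bad = step.notna() & (step != pd.Timedelta(minutes=1))
    return pd.DataFrame({"ts": ts[bad], "step": step[bad]})

# Toy timestamps (hypothetical): the 00:03 bar is missing.
ts = pd.Series(pd.to_datetime([
    "2025-03-01 00:00", "2025-03-01 00:01", "2025-03-01 00:02", "2025-03-01 00:04",
], utc=True))
gaps = find_minute_gaps(ts)
print(gaps)
```

On the real data this would be called as `find_minute_gaps(df_base['ts'])`; an empty result means every consecutive pair of rows is one minute apart.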

# Classic ML experiments

Before using any DL approaches we want to build a classic ML baseline. It will serve as a reference point when we evaluate DL techniques, which may or may not help us.

In [5]:
time_periods = [5, 10, 15]
percentage_changes = [0.1, 0.15, 0.2]

## Features

Let's add indicators that several approaches can use to improve their results

### Classical indicators

In [6]:
def add_basic_feats(df):
    df = df.copy()
    df['HL_RANGE'] = (df['High'] - df['Low']) / df['Close']
    df['LOG_VOL'] = np.log1p(df['Volume'])
    lr = np.log(df['Close']).diff()
    for period in time_periods:
        df['RV_{}M'.format(period)] = np.sqrt(lr.pow(2).rolling(period, min_periods=period).sum())
    # Time of day
    # mins = df.index.hour * 60 + df.index.minute
    # df['tod_sin'] = np.sin(2*np.pi*mins/ (24*60))
    # df['tod_cos'] = np.cos(2*np.pi*mins/ (24*60))
    return df

df_base = add_basic_feats(df_base)

#### SMA & EMA

Moving Averages (SMA & EMA): The Simple Moving Average (SMA) and Exponential Moving Average (EMA) smooth out price data to reveal the underlying trend. They are fundamental trend-following indicators. Intraday traders often plot short-term moving averages (e.g. 5-minute or 15-minute SMA) to identify trend direction. A common strategy is the moving average crossover: for example, if a short-term EMA crosses above a longer-term EMA, it generates a buy signal (anticipating upward momentum). These crossovers have been used on MOEX stocks to catch emerging trends. In one study, various SMA lengths were tested in an intraday RSI strategy; shorter-period SMAs improved profitability during downtrends, indicating the importance of choosing an appropriate moving average length for the market condition. Overall, moving averages help traders filter out noise and decide if momentum favors long or short positions. They are also components of other indicators (for instance, the MACD uses EMA calculations).
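
To make the crossover idea concrete, here is a minimal sketch in plain pandas. It is not part of the notebook's pipeline: the function name `ema_crossover`, the 9/21 spans, and the synthetic price path are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def ema_crossover(close: pd.Series, fast: int = 9, slow: int = 21) -> pd.Series:
    """+1 on the bar where the fast EMA moves above the slow EMA, -1 on a cross below, else 0."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    above = (ema_fast > ema_slow).astype(int)
    # a change in the 0/1 state marks a crossover bar
    return above.diff().fillna(0).astype(int)

# Synthetic prices (not market data): a decline followed by a rally.
close = pd.Series(np.concatenate([np.linspace(110, 100, 30), np.linspace(100, 115, 30)]))
signal = ema_crossover(close)
print(signal[signal != 0])
```

On this toy path the only non-zero entry is a single +1 some bars after the rally starts, which illustrates the lag that any moving-average signal carries.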

In [7]:
df_base['SMA_9'] = ta.trend.sma_indicator(df_base['Close'], 9)
df_base['SMA_10'] = ta.trend.sma_indicator(df_base['Close'], 10)
df_base['SMA_21'] = ta.trend.sma_indicator(df_base['Close'], 21)

df_base['EMA_9'] = ta.trend.ema_indicator(df_base['Close'], 9)
df_base['EMA_10'] = ta.trend.ema_indicator(df_base['Close'], 10)
df_base['EMA_21'] = ta.trend.ema_indicator(df_base['Close'], 21)

#### RSI

Relative Strength Index (RSI): RSI is a popular momentum oscillator that measures the magnitude of recent price gains vs. losses on a 0–100 scale. It is used to identify overbought conditions (RSI above 70, indicating prices may have risen too fast) and oversold conditions (RSI below 30). RSI has shown its value in many markets; for example, studies on emerging markets found that RSI signals can generate accurate buy/sell prompts and even produce abnormal returns. Traders often use RSI on intraday MOEX charts to anticipate reversals: if a stock’s RSI dips below 30, it may be poised for a bounce (buy signal), whereas an RSI above 70 could warn of an upcoming pullback (sell signal). Importantly, combining RSI with other indicators strengthens its effectiveness. Research suggests the best results come from using RSI alongside complementary signals like moving averages, which confirm the trend context. For instance, if the RSI gives an oversold reading and at the same time the price is bouncing off a key moving average support, the confluence increases confidence in a buy trade. This indicator’s ubiquity among traders makes it a self-fulfilling tool at times: many MOEX trading algorithms monitor RSI levels, contributing to short-term support and resistance around those threshold values.
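
For readers who want to see what the oscillator computes, the sketch below implements RSI with Wilder-style exponential smoothing in plain pandas. It is an illustration under assumptions (the name `rsi_wilder`, the default window of 14, and the toy series are not from the notebook), and its warm-up values may differ from what `ta.momentum.rsi` returns.

```python
import numpy as np
import pandas as pd

def rsi_wilder(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI = 100 - 100 / (1 + avg_gain / avg_loss), with Wilder smoothing (alpha = 1 / window)."""
    delta = close.diff()
    gain = delta.clip(lower=0)        # positive moves, 0 otherwise
    loss = (-delta).clip(lower=0)     # size of negative moves, 0 otherwise
    avg_gain = gain.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    avg_loss = loss.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    rs = avg_gain / avg_loss          # inf when there were no losses -> RSI of 100
    return 100 - 100 / (1 + rs)

rising = pd.Series(np.arange(1.0, 41.0))
falling = rising[::-1].reset_index(drop=True)
rng = np.random.default_rng(0)
noisy = pd.Series(100 + rng.normal(size=200).cumsum())

rsi_up, rsi_down, rsi_noisy = rsi_wilder(rising), rsi_wilder(falling), rsi_wilder(noisy)
oversold = rsi_noisy < 30      # conventional oversold flag
overbought = rsi_noisy > 70    # conventional overbought flag
```

A series that only rises pins the RSI at 100 and one that only falls pins it at 0, which is why the 70/30 bands are read as "stretched" conditions rather than hard limits.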

In [8]:
df_base['RSI_9'] = ta.momentum.rsi(df_base['Close'], 9)
df_base['RSI_11'] = ta.momentum.rsi(df_base['Close'], 11)

#### MACD

Moving Average Convergence Divergence (MACD): MACD is another momentum/trend indicator that calculates the difference between two EMAs and a signal line (the EMA of that difference). It oscillates above and below zero, highlighting changes in trend momentum. A positive MACD indicates upward momentum, while a negative MACD indicates downward momentum; the crossing of MACD above its signal line is a classic bullish signal (and vice versa for bearish). MACD is widely favored for intraday trading because it combines aspects of trend following and momentum in one indicator. Traders on the MOEX use MACD histograms and crossovers to spot trend reversals or trend strength changes in stocks or the MOEX index itself. A strong use case is pairing MACD with RSI: one popular strategy requires MACD line crossing above the signal and RSI coming out of oversold territory to trigger a buy, capturing both trend and momentum confirmation. In fact, backtests of strategies combining MACD and RSI have shown high success rates: one such strategy yielded about a 73% win rate over hundreds of trades, with an average 0.88% gain per trade. This demonstrates how MACD, especially in combination with RSI, can be a powerful tool for timing intraday entries and exits.
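
The three MACD series described above can be sketched directly with pandas exponential moving averages. This is an illustrative reimplementation, not the `ta` source: the function name `macd_lines` is hypothetical, the 8/17/9 defaults simply mirror the windows used in this notebook, and early values are not masked during warm-up.

```python
import numpy as np
import pandas as pd

def macd_lines(close: pd.Series, fast: int = 8, slow: int = 17, signal: int = 9) -> pd.DataFrame:
    """MACD line, its signal line, and the histogram (MACD minus signal)."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow
    macd_signal = macd.ewm(span=signal, adjust=False).mean()
    return pd.DataFrame({
        "MACD": macd,
        "MACD_Signal": macd_signal,
        "MACD_Hist": macd - macd_signal,
    })

# Synthetic steady uptrend (not market data): MACD should end above zero.
close = pd.Series(np.linspace(100, 130, 60))
out = macd_lines(close)
print(out.tail(3))
```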

In [9]:

# the best parameters for intraday trading mentioned here -> https://market-bulls.com/macd-indicator-trading-strategies/
macd = ta.trend.MACD(
    close=df_base['Close'],
    window_slow=17,
    window_fast=8,
    window_sign=9
)

df_base['MACD'] = macd.macd()
df_base['MACD_Signal'] = macd.macd_signal()
df_base['MACD_Hist'] = macd.macd_diff()

#### Bollinger Bands

Bollinger Bands: Bollinger Bands are a volatility indicator consisting of a moving average (typically 20-period SMA) and upper/lower bands set a certain number of standard deviations away (often 2σ). The bands widen when volatility increases and contract when volatility drops. Intraday traders use Bollinger Bands to identify potential breakouts or mean-reversion opportunities. For example, when price repeatedly touches the upper band, the market may be overextended to the upside (potential reversal or short setup), whereas a sharp move outside the bands could signal an emerging breakout with increased volatility. In high-frequency trading contexts, Bollinger Bands help gauge if current price swings are outside normal volatility ranges. On the MOEX, which can experience sudden moves due to news or low liquidity in certain stocks, Bollinger Band signals are quite useful. A common strategy is to fade extreme moves: if a Russian stock’s price spikes well above the upper band on an intraday chart, traders might short expecting a pullback toward the mean. Conversely, touching the lower band after a steady decline could present a buy-the-dip opportunity. However, it’s important to confirm with other indicators (Bollinger Band breakouts combined with volume spikes or momentum divergences provide stronger evidence of a true breakout or reversal).
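
As a rough illustration of the band arithmetic, the sketch below computes the middle band, the two outer bands, the percent-band position and the band width in plain pandas. The function name `bollinger` and the use of the population standard deviation (`ddof=0`) are assumptions; the percent-band and width formulas were chosen to agree with the `BB_pband` and `BB_width` values shown in this notebook's `df_base` preview.

```python
import pandas as pd

def bollinger(close: pd.Series, window: int = 20, n_std: float = 2.0) -> pd.DataFrame:
    mavg = close.rolling(window).mean()
    std = close.rolling(window).std(ddof=0)   # population std (assumption)
    upper = mavg + n_std * std
    lower = mavg - n_std * std
    return pd.DataFrame({
        "mavg": mavg,
        "hband": upper,
        "lband": lower,
        "pband": (close - lower) / (upper - lower),   # 0 = on lower band, 1 = on upper band
        "width": (upper - lower) / mavg * 100,        # band width as % of the middle band
    })

# Toy series alternating between 100 and 102: rolling mean 101, rolling std 1.
close = pd.Series([100.0, 102.0] * 4)
bands = bollinger(close, window=4, n_std=2.0)
print(bands.iloc[-1])
```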

In [10]:
# the best parameters for intraday trading mentioned here -> https://www.stockdaddy.in/blog/bollinger-bands-strategy
bb = ta.volatility.BollingerBands(close=df_base['Close'], window=20, window_dev=2)

df_base['BB_mavg'] = bb.bollinger_mavg()
df_base['BB_hband'] = bb.bollinger_hband()
df_base['BB_lband'] = bb.bollinger_lband()
df_base['BB_width'] = bb.bollinger_wband()
df_base['BB_pband'] = bb.bollinger_pband()
df_base['BB_hband_ind'] = bb.bollinger_hband_indicator()
df_base['BB_lband_ind'] = bb.bollinger_lband_indicator()

#### OBV

On-Balance Volume (OBV) and Volume Indicators: Volume-based indicators play a crucial role in validating price movements. OBV is a simple yet powerful indicator that accumulates volume, adding volume on up days and subtracting on down days. It effectively measures buying and selling pressure. Rising OBV indicates that volume is flowing into an asset (buyers are dominant), often foreshadowing an upward breakout, while falling OBV signals distribution (selling pressure). Traders use OBV intraday to confirm trends: if price is climbing and OBV is also steadily rising, it suggests the uptrend is backed by strong volume (and likely to continue). If price makes new highs but OBV fails to reach a new high (a bearish divergence), it can warn that the rally is losing support and may reverse. On MOEX, where certain stocks can have erratic volume, OBV helps filter false price moves. For instance, a sudden price jump on low volume is treated suspiciously by algo-traders: if OBV doesn’t confirm the move, they may avoid the bait. Other volume indicators like the Volume Weighted Average Price (VWAP) are also popular intraday tools (VWAP is often used by institutional traders as a benchmark; prices above VWAP indicate an uptrend with strength, below VWAP indicates a downtrend). In general, volume indicators complement price-based indicators by adding the dimension of market participation, which is key in the relatively smaller and sometimes volatile Russian market. Combining volume signals with price signals is a widely recommended practice: for example, a breakout above a resistance level is far more convincing if accompanied by a surge in OBV or volume, confirming that big players are driving the move.
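
The accumulation rule is short enough to sketch by hand. The function name `on_balance_volume` is hypothetical, and treating the first bar and flat bars as "up" is an assumption; with that convention the sketch reproduces the first five `OBV` values in this notebook's `df_base` preview, whose closes and volumes are used as input here.

```python
import numpy as np
import pandas as pd

def on_balance_volume(close: pd.Series, volume: pd.Series) -> pd.Series:
    """Cumulative volume, subtracted on bars that close lower than the previous bar."""
    signed = np.where(close < close.shift(1), -volume, volume)
    return pd.Series(signed, index=close.index).cumsum()

# First five bars from the df_base preview (2025-03-01 00:00 to 00:04).
close = pd.Series([84338.54, 84274.88, 84266.02, 84299.99, 84295.75])
volume = pd.Series([14.42832, 9.99548, 5.58371, 4.66581, 5.14888])
obv = on_balance_volume(close, volume)
print(obv.round(5).tolist())
```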

In [11]:
obv = ta.volume.OnBalanceVolumeIndicator(
    close=df_base['Close'],
    volume=df_base['Volume']
)

df_base['OBV'] = obv.on_balance_volume()

#### Stochastic Oscillator

The Stochastic Oscillator (similar to RSI, used to indicate overbought/oversold levels based on recent closing prices relative to price range) is often used on short time frames for MOEX stocks to pick turning points.
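
"Closing price relative to the recent price range" translates into a one-line formula. The sketch below is a plain-pandas illustration; the function name `stochastic` is hypothetical and the 9/3 windows mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

def stochastic(high: pd.Series, low: pd.Series, close: pd.Series,
               window: int = 9, smooth_window: int = 3):
    """%K = position of the close inside the rolling high-low range (0..100); %D = SMA of %K."""
    lowest = low.rolling(window).min()
    highest = high.rolling(window).max()
    k = 100 * (close - lowest) / (highest - lowest)
    d = k.rolling(smooth_window).mean()
    return k, d

# Toy bars (not market data): every close is also the highest high of its window.
close = pd.Series(np.arange(20, dtype=float))
k, d = stochastic(high=close, low=close, close=close)
print(k.dropna().unique(), d.dropna().unique())
```

When each close sits at the top of its window, %K stays at 100; a close at the bottom of the window gives 0.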

In [12]:
stoch = ta.momentum.StochasticOscillator(
    high=df_base['High'],
    low=df_base['Low'],
    close=df_base['Close'],
    window=9,
    smooth_window=3
)

df_base['Stoch_K'] = stoch.stoch()
df_base['Stoch_D'] = stoch.stoch_signal()

#### Commodity Channel Index

The Commodity Channel Index (CCI) is another momentum indicator highlighted in studies for trend detection.

In [13]:
cci = ta.trend.CCIIndicator(
    high=df_base['High'],
    low=df_base['Low'],
    close=df_base['Close'],
    window=14,
    constant=0.010 # more sensitive
)

df_base['CCI'] = cci.cci()

#### Average True Range

Average True Range (ATR) is commonly monitored to gauge intraday volatility – for example, a widening ATR on a stock like Sberbank might imply the next price swing could be larger than usual, prompting traders to adjust stop-loss distances.

In [14]:
atr = ta.volatility.AverageTrueRange(
    high=df_base['High'],
    low=df_base['Low'],
    close=df_base['Close'],
    window=9
)

df_base['ATR_9'] = atr.average_true_range()

#### Ichimoku Cloud

Ichimoku Cloud (a comprehensive trend system) is sometimes applied to index futures or highly liquid Russian equities to map support/resistance and trend momentum at a glance.

In [15]:
ichimoku = ta.trend.IchimokuIndicator(
    high=df_base['High'],
    low=df_base['Low'],
    window1=9,
    window2=18,
    window3=34,
    visual=False
)

df_base['Ichimoku_A'] = ichimoku.ichimoku_a()
df_base['Ichimoku_B'] = ichimoku.ichimoku_b()
df_base['Ichimoku_Base'] = ichimoku.ichimoku_base_line()
df_base['Ichimoku_Conversion'] = ichimoku.ichimoku_conversion_line()

### Summing up

In [16]:
indicators = df_base.columns[df_base.columns.str.contains('SMA|EMA|RSI|MACD|BB|Stoch|CCI|ATR|Ichimoku|OBV|Volume|Open|Close|High|Low')]

In [17]:
df_base

Unnamed: 0,ts,Open,High,Low,Close,Volume,HL_RANGE,LOG_VOL,RV_5M,RV_10M,RV_15M,SMA_9,SMA_10,SMA_21,EMA_9,EMA_10,EMA_21,RSI_9,RSI_11,MACD,MACD_Signal,MACD_Hist,BB_mavg,BB_hband,BB_lband,BB_width,BB_pband,BB_hband_ind,BB_lband_ind,OBV,Stoch_K,Stoch_D,CCI,ATR_9,Ichimoku_A,Ichimoku_B,Ichimoku_Base,Ichimoku_Conversion
0,2025-03-01 00:00:00+00:00,84349.95,84390.05,84324.42,84338.54,14.42832,7.781733e-04,2.736205,,,,,,,,,,,,,,,,,,,,0.0,0.0,14.428320,,,,0.000000,,84357.235,,
1,2025-03-01 00:01:00+00:00,84337.70,84337.70,84269.50,84274.88,9.99548,8.092566e-04,2.397484,,,,,,,,,,,,,,,,,,,,0.0,0.0,4.432840,,,,0.000000,,84329.775,,
2,2025-03-01 00:02:00+00:00,84274.88,84282.60,84266.01,84266.02,5.58371,1.968765e-04,1.884598,,,,,,,,,,,,,,,,,,,,0.0,0.0,-1.150870,,,,0.000000,,84328.030,,
3,2025-03-01 00:03:00+00:00,84266.01,84300.00,84257.15,84299.99,4.66581,5.083037e-04,1.734450,,,,,,,,,,,,,,,,,,,,0.0,0.0,3.514940,,,,0.000000,,84323.600,,
4,2025-03-01 00:04:00+00:00,84299.99,84346.10,84295.74,84295.75,5.14888,5.974204e-04,1.816270,,,,,,,,,,,,,,,,,,,,0.0,0.0,-1.633940,,,,0.000000,,84323.600,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243048,2025-08-16 18:48:00+00:00,117699.99,117700.00,117699.99,117699.99,0.30569,8.496177e-08,0.266732,0.000343,0.000362,0.000380,117677.673333,117678.474,117686.643333,117683.140376,117683.146412,117686.587855,60.413835,57.810797,-1.811754,-5.102110,3.290357,117686.2010,117713.121243,117659.280757,0.045749,0.756108,0.0,0.0,36089.681254,99.985727,87.283138,130.162273,10.485465,117664.97,117693.500,117664.97,117664.97
243049,2025-08-16 18:49:00+00:00,117700.00,117700.00,117699.99,117699.99,0.71471,8.496177e-08,0.539244,0.000320,0.000348,0.000363,117679.894444,117679.905,117686.857619,117686.510301,117686.208883,117687.806232,60.413835,57.810797,0.243478,-4.032993,4.276471,117686.4260,117713.725582,117659.126418,0.046394,0.748429,0.0,0.0,36090.395964,99.985727,99.985727,130.166626,9.321524,117664.97,117693.500,117664.97,117664.97
243050,2025-08-16 18:50:00+00:00,117699.99,117700.00,117699.99,117699.99,0.60947,8.496177e-08,0.475905,0.000304,0.000345,0.000363,117682.115556,117681.904,117687.071905,117689.206241,117688.714541,117688.913847,60.413835,57.810797,1.658368,-2.894721,4.553088,117686.8255,117714.668117,117658.982883,0.047316,0.736409,0.0,0.0,36091.005434,99.985727,99.985727,127.976246,8.286911,117664.97,117693.430,117664.97,117664.97
243051,2025-08-16 18:51:00+00:00,117700.00,117700.00,117659.49,117659.50,3.46699,3.442986e-04,1.496715,0.000420,0.000487,0.000499,117679.837778,117679.854,117685.524286,117683.264993,117683.402806,117686.239861,35.173552,36.797245,-1.903274,-2.696431,0.793158,117685.4765,117715.762550,117655.190450,0.051469,0.071147,0.0,0.0,36087.538444,42.192407,80.721287,-73.363108,11.867254,117664.97,117693.430,117664.97,117664.97


## Building experiments

Let's now define several types of values to predict. We will try to solve two types of tasks, classification and regression, and in the final solution we will combine them.

- Classification: we want to understand whether the price will rise or fall by $x\%$ within a horizon of $y$ minutes, hours or days
- Regression: we want to understand what the price will be $y$ minutes, hours or days ahead, or we want to predict several prices for several corresponding timestamps
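
In symbols, and assuming one row per minute with $C_t$ the close at row $t$, a horizon of $y$ rows and a threshold of $x$ percent, the quantities built in the next cells can be written as:

$$
r_{t,y} = \frac{C_{t+y} - C_t}{C_t}, \qquad
T^{\text{simple}}_{t,y} = \mathbb{1}\left[\, C_{t+y} > C_t \,\right], \qquad
T^{(x)}_{t,y} = \mathbb{1}\left[\, C_{t+y} \ge C_t \left(1 + \tfrac{x}{100}\right) \right]
$$

Here $r_{t,y}$ corresponds to the `Ret_in_{y}m` columns, $T^{\text{simple}}$ to `Target_cls_simple_{y}_min`, and $T^{(x)}$ to `Target_cls_{x}%_{y}_min`. Only upward moves are labelled as positives.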

In [18]:
time_periods = [5, 10, 15]
percentage_changes = [0.1, 0.15, 0.2]

In [19]:
for lag in time_periods:
    df_base[f'Close_in_{lag}_min'] = df_base['Close'].shift(periods=-1 * lag)

for lag in time_periods:
    df_base[f'Ret_in_{lag}m'] = (df_base[f'Close_in_{lag}_min'] - df_base['Close']) / df_base['Close']

df_base.dropna(inplace=True)

for lag in time_periods:
    df_base[f'Target_cls_simple_{lag}_min'] = df_base[f'Close_in_{lag}_min'] > df_base['Close']
    df_base[f'Target_cls_simple_{lag}_min'] = df_base[f'Target_cls_simple_{lag}_min'].astype(int)
    for change in percentage_changes:
        df_base[f'Target_cls_{change}%_{lag}_min'] = df_base[f'Close_in_{lag}_min'] >= df_base['Close'] * (1 + change / 100)
        df_base[f'Target_cls_{change}%_{lag}_min'] = df_base[f'Target_cls_{change}%_{lag}_min'].astype(int)

In [20]:
def lag_stack(df, cols, L: int):
    # Lags [t, t-1, ..., t-(L-1)]
    parts = []
    for i in range(L):
        parts.append(df[cols].shift(i).add_suffix(f'_t-{i}'))
    return pd.concat(parts, axis=1)
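
A tiny worked example may help to see what `lag_stack` produces. The toy frame below is made up; the function body is repeated, in condensed form, only so that the snippet runs on its own.

```python
import pandas as pd

def lag_stack(df, cols, L: int):
    # Lags [t, t-1, ..., t-(L-1)], one suffixed copy of `cols` per lag
    parts = [df[cols].shift(i).add_suffix(f'_t-{i}') for i in range(L)]
    return pd.concat(parts, axis=1)

toy = pd.DataFrame({
    'Close': [10.0, 11.0, 12.0, 13.0],
    'RSI_9': [40.0, 45.0, 50.0, 55.0],
})
out = lag_stack(toy, ['Close', 'RSI_9'], L=3)
print(out)
```

Each row ends up holding the current value and the two previous values of every column; the first `L - 1` rows contain NaNs because their history is incomplete, which is why such rows are masked out before training.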

### Classification experiments

We will try feeding different models several lagged variables

In [21]:
base_cols = [
    'Close', 'Open', 'HL_RANGE', 'LOG_VOL', 'RV_15M',
    'SMA_9', 'EMA_9', 'RSI_9', 'MACD', 'MACD_Signal',
    'BB_pband', 'BB_width', 'ATR_9', 'CCI', 'Stoch_K', 'Stoch_D'
    # 'Ichimoku_A', 'Ichimoku_B'
]
L = 8  # number of lagged bars per feature: t, t-1, ..., t-7
lagged = lag_stack(df_base, base_cols, L).astype('float32')

In [22]:
# purged walk-forward split: drop the training rows closest to each test fold (row-based embargo)
def walk_forward_splits_row_embargo(index, n_splits=5, embargo_minutes=60, bars_per_minute=1):
    tss = TimeSeriesSplit(n_splits=n_splits)
    embargo_rows = int(embargo_minutes * bars_per_minute)
    for tr, te in tss.split(range(len(index))):
        tr, te = np.asarray(tr), np.asarray(te)
        cutoff_pos = max(0, te[0] - embargo_rows)
        keep = tr <= cutoff_pos
        yield tr[keep], te
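
For comparison, scikit-learn's `TimeSeriesSplit` also has a built-in `gap` argument that leaves out a fixed number of rows between each training window and its test fold. The sketch below uses a toy index of 100 rows and a gap of 15; these sizes are assumptions for illustration, and whether `gap` should replace the helper above is left open.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows, horizon = 100, 15          # toy sizes; horizon = longest label look-ahead in rows
tss = TimeSeriesSplit(n_splits=3, gap=horizon)
splits = list(tss.split(np.arange(n_rows)))

for k, (tr, te) in enumerate(splits, 1):
    # rows strictly between the last training row and the first test row
    skipped = te[0] - tr[-1] - 1
    print(f"fold {k}: train=[{tr[0]}, {tr[-1]}] test=[{te[0]}, {te[-1]}] skipped={skipped}")
```

Leaving out at least as many rows as the longest prediction horizon keeps any training label from being computed from a close that falls inside the test fold.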


In [23]:
# HELPERS

def fold_metrics(y_true, proba, thr=0.5):
    y_pred = (proba >= thr).astype(int)
    return dict(
        thr=thr,
        acc=accuracy_score(y_true, y_pred),
        bacc=balanced_accuracy_score(y_true, y_pred),
        f1=f1_score(y_true, y_pred, zero_division=0),
        prec1=precision_score(y_true, y_pred, pos_label=1, zero_division=0),
        rec1=recall_score(y_true, y_pred, pos_label=1, zero_division=0),
        roc_auc=roc_auc_score(y_true, proba),
        pr_auc=average_precision_score(y_true, proba),
        brier=brier_score_loss(y_true, proba),
    )

def evaluate_thresholds(y_true, proba, thresholds=np.round(np.linspace(0.3, 0.8, 11), 3)):
    rows = []
    for t in thresholds:
        m = fold_metrics(y_true, proba, thr=t)
        rows.append(m)
    df = pd.DataFrame(rows)
    best = df.sort_values('f1', ascending=False).iloc[0]
    return df, float(best['thr'])

### Logistic regression (calibrated)

In [None]:
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger("logreg_exp")

def run_logreg_experiment(
    X, y, splits, *,
    progress=True,
    calibration="isotonic",
    verbose_lr=0
):
    fold_rows = []
    it = enumerate(splits, 1)
    if progress:
        it = tqdm(it, total=len(splits), desc="LogReg folds", leave=False)

    for k, (tr, te) in it:
        t0 = time.time()
        Xtr, Xte = X.iloc[tr], X.iloc[te]
        ytr, yte = y.iloc[tr], y.iloc[te]
        pos_rate = float(ytr.mean())

        # pipeline
        pipe = Pipeline([
            ('scaler', StandardScaler()),
            ('clf', LogisticRegression(
                solver='lbfgs',
                max_iter=2000,
                class_weight='balanced',
                verbose=verbose_lr,
                n_jobs=-1
            ))
        ])

        # calibrate probs on train only
        cal = CalibratedClassifierCV(pipe, method=calibration, cv=3)
        logger.info(f"[Fold {k}] train={len(tr):,} test={len(te):,} pos_rate={pos_rate:.3f} | calibrating={calibration}")
        cal.fit(Xtr, ytr)

        proba = cal.predict_proba(Xte)[:, 1]
        m = fold_metrics(yte, proba)
        m['fold'] = k
        # threshold sweep (optional); the F1-best threshold is picked on the test fold, so it is optimistic
        _, best_thr = evaluate_thresholds(yte, proba)
        m['best_thr_f1'] = best_thr
        m['sec'] = time.time() - t0
        fold_rows.append(m)

        logger.info(f"[Fold {k}] AUC={m['roc_auc']:.3f} PR={m['pr_auc']:.3f} "
                    f"F1@0.5={m['f1']:.3f} Brier={m['brier']:.4f} "
                    f"best_thr(F1)={m['best_thr_f1']:.2f} time={m['sec']:.1f}s")

    return pd.DataFrame(fold_rows)

- `calibration`: `"isotonic"` (better probability quality, slower) or `"sigmoid"` (faster)
- `verbose_lr=0`: pass 1 for solver logs if you switch to `solver='saga'` or `solver='liblinear'`

#### Minutes Ahead = **5**


In [None]:
H = 5  # minutes ahead
target_col_basic = f'Target_cls_simple_{H}_min' # simple sign

X = lagged.loc[df_base[target_col_basic].index]
y = df_base[target_col_basic].astype(int).reindex(X.index)
mask = ~X.isna().any(axis=1) & y.notna()
X, y = X[mask], y[mask]

embargo = max(time_periods)  # embargo in minutes = longest horizon we ever predict (15)
splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
display(HTML(run_logreg_experiment(X, y, splits).to_html()))

2025-08-16 23:42:22,659 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.504 | calibrating=isotonic
2025-08-16 23:42:26,462 INFO: [Fold 1] AUC=0.510 PR=0.504 F1@0.5=0.475 Brier=0.2501 best_thr(F1)=0.30 time=3.8s
2025-08-16 23:42:26,475 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.499 | calibrating=isotonic
2025-08-16 23:42:31,428 INFO: [Fold 2] AUC=0.508 PR=0.507 F1@0.5=0.193 Brier=0.2503 best_thr(F1)=0.30 time=5.0s
2025-08-16 23:42:31,445 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.499 | calibrating=isotonic
2025-08-16 23:42:38,418 INFO: [Fold 3] AUC=0.515 PR=0.511 F1@0.5=0.347 Brier=0.2498 best_thr(F1)=0.30 time=7.0s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,fold,best_thr_f1,sec
0,0.5,0.506468,0.505884,0.475393,0.501157,0.452148,0.510019,0.504179,0.250125,1,0.3,3.813623
1,0.5,0.509051,0.50707,0.19262,0.529412,0.117726,0.50835,0.506959,0.250279,2,0.3,4.964749
2,0.5,0.511607,0.509996,0.347253,0.516636,0.261514,0.514968,0.510718,0.24984,3,0.3,6.989069


In [None]:
for P in percentage_changes:
    H = 5  # minutes ahead
    target_col_percent = f'Target_cls_{P}%_{H}_min'  # close rises by >= P% within H minutes

    X = lagged.loc[df_base[target_col_percent].index]
    y = df_base[target_col_percent].astype(int).reindex(X.index)
    mask = ~X.isna().any(axis=1) & y.notna()
    X, y = X[mask], y[mask]

    embargo = max(time_periods)  # embargo in minutes = longest horizon we ever predict (15)
    splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
    print(f"Minutes ahead: {H}, Percentage change: {P}")
    display(HTML(run_logreg_experiment(X, y, splits).to_html()))

Minutes ahead: 5, Percentage change: 0.1


2025-08-16 23:42:38,492 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.143 | calibrating=isotonic
2025-08-16 23:42:40,919 INFO: [Fold 1] AUC=0.688 PR=0.231 F1@0.5=0.018 Brier=0.0995 best_thr(F1)=0.30 time=2.4s
2025-08-16 23:42:40,931 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.131 | calibrating=isotonic
2025-08-16 23:42:46,555 INFO: [Fold 2] AUC=0.726 PR=0.204 F1@0.5=0.002 Brier=0.0826 best_thr(F1)=0.30 time=5.6s
2025-08-16 23:42:46,570 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.119 | calibrating=isotonic
2025-08-16 23:42:53,547 INFO: [Fold 3] AUC=0.696 PR=0.207 F1@0.5=0.003 Brier=0.0929 best_thr(F1)=0.30 time=7.0s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,fold,best_thr_f1,sec
0,0.5,0.881168,0.503743,0.017926,0.424242,0.009156,0.6883,0.23053,0.099473,1,0.3,2.435608
1,0.5,0.903039,0.500328,0.001595,0.375,0.000799,0.72596,0.203616,0.082637,2,0.3,5.636157
2,0.5,0.890464,0.500606,0.002821,0.461538,0.001415,0.695739,0.206998,0.092892,3,0.3,6.990667


Minutes ahead: 5, Percentage change: 0.15


2025-08-16 23:42:53,594 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.074 | calibrating=isotonic
2025-08-16 23:42:56,807 INFO: [Fold 1] AUC=0.747 PR=0.174 F1@0.5=0.000 Brier=0.0519 best_thr(F1)=0.30 time=3.2s
2025-08-16 23:42:56,823 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.066 | calibrating=isotonic
2025-08-16 23:43:02,851 INFO: [Fold 2] AUC=0.770 PR=0.138 F1@0.5=0.002 Brier=0.0408 best_thr(F1)=0.30 time=6.0s
2025-08-16 23:43:02,866 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.059 | calibrating=isotonic
2025-08-16 23:43:10,258 INFO: [Fold 3] AUC=0.748 PR=0.142 F1@0.5=0.003 Brier=0.0462 best_thr(F1)=0.30 time=7.4s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,fold,best_thr_f1,sec
0,0.5,0.941436,0.5,0.0,0.0,0.0,0.747489,0.173976,0.051948,1,0.3,3.219823
1,0.5,0.954993,0.500533,0.00229,0.4,0.001148,0.770201,0.137727,0.040822,2,0.3,6.043598
2,0.5,0.949002,0.500758,0.003029,1.0,0.001517,0.747705,0.142088,0.04624,3,0.3,7.406144


Minutes ahead: 5, Percentage change: 0.2


2025-08-16 23:43:10,310 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.040 | calibrating=isotonic
2025-08-16 23:43:13,413 INFO: [Fold 1] AUC=0.799 PR=0.127 F1@0.5=0.000 Brier=0.0265 best_thr(F1)=0.45 time=3.1s
2025-08-16 23:43:13,424 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.034 | calibrating=isotonic
2025-08-16 23:43:19,633 INFO: [Fold 2] AUC=0.804 PR=0.102 F1@0.5=0.005 Brier=0.0208 best_thr(F1)=0.30 time=6.2s
2025-08-16 23:43:19,649 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.030 | calibrating=isotonic
2025-08-16 23:43:28,306 INFO: [Fold 3] AUC=0.793 PR=0.105 F1@0.5=0.004 Brier=0.0227 best_thr(F1)=0.30 time=8.7s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,fold,best_thr_f1,sec
0,0.5,0.971235,0.499973,0.0,0.0,0.0,0.799016,0.126572,0.026476,1,0.45,3.110289
1,0.5,0.977819,0.501138,0.004635,0.5,0.002328,0.803924,0.101595,0.020784,2,0.3,6.22007
2,0.5,0.975753,0.501063,0.004242,1.0,0.002125,0.793341,0.105383,0.022713,3,0.3,8.671814


#### Minutes Ahead = **10**


In [None]:
H = 10  # minutes ahead
target_col_basic = f'Target_cls_simple_{H}_min' # simple sign

X = lagged.loc[df_base[target_col_basic].index]
y = df_base[target_col_basic].astype(int).reindex(X.index)
mask = ~X.isna().any(axis=1) & y.notna()
X, y = X[mask], y[mask]

embargo = max(time_periods)  # embargo in minutes = longest horizon we ever predict (15)
splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
display(HTML(run_logreg_experiment(X, y, splits).to_html()))

2025-08-16 23:43:28,363 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.505 | calibrating=isotonic
2025-08-16 23:43:30,283 INFO: [Fold 1] AUC=0.515 PR=0.505 F1@0.5=0.520 Brier=0.2499 best_thr(F1)=0.35 time=1.9s
2025-08-16 23:43:30,298 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.499 | calibrating=isotonic
2025-08-16 23:43:34,650 INFO: [Fold 2] AUC=0.507 PR=0.510 F1@0.5=0.178 Brier=0.2523 best_thr(F1)=0.30 time=4.4s
2025-08-16 23:43:34,665 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.500 | calibrating=isotonic
2025-08-16 23:43:40,779 INFO: [Fold 3] AUC=0.520 PR=0.516 F1@0.5=0.403 Brier=0.2498 best_thr(F1)=0.35 time=6.1s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,fold,best_thr_f1,sec
0,0.5,0.509593,0.509984,0.519676,0.502494,0.538075,0.514763,0.50542,0.249915,1,0.35,1.928987
1,0.5,0.503137,0.504131,0.178114,0.52125,0.107408,0.507365,0.510229,0.252281,2,0.3,4.365978
2,0.5,0.519198,0.517695,0.402746,0.524697,0.326792,0.520051,0.516432,0.24983,3,0.35,6.129076


In [None]:
for P in percentage_changes:
    H = 10  # minutes ahead
    target_col_percent = f'Target_cls_{P}%_{H}_min'  # close rises by >= P% within H minutes

    X = lagged.loc[df_base[target_col_percent].index]
    y = df_base[target_col_percent].astype(int).reindex(X.index)
    mask = ~X.isna().any(axis=1) & y.notna()
    X, y = X[mask], y[mask]

    embargo = max(time_periods)  # embargo in minutes = longest horizon we ever predict (15)
    splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
    print(f"Minutes ahead: {H}, Percentage change: {P}")
    display(HTML(run_logreg_experiment(X, y, splits).to_html()))

Minutes ahead: 10, Percentage change: 0.1


2025-08-16 23:43:40,847 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.218 | calibrating=isotonic
2025-08-16 23:43:43,532 INFO: [Fold 1] AUC=0.636 PR=0.285 F1@0.5=0.013 Brier=0.1459 best_thr(F1)=0.30 time=2.7s
2025-08-16 23:43:43,547 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.201 | calibrating=isotonic
2025-08-16 23:43:48,617 INFO: [Fold 2] AUC=0.666 PR=0.256 F1@0.5=0.001 Brier=0.1293 best_thr(F1)=0.30 time=5.1s
2025-08-16 23:43:48,632 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.188 | calibrating=isotonic
2025-08-16 23:43:55,924 INFO: [Fold 3] AUC=0.644 PR=0.262 F1@0.5=0.001 Brier=0.1401 best_thr(F1)=0.30 time=7.3s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,fold,best_thr_f1,sec
0,0.5,0.815064,0.502544,0.012683,0.522727,0.006419,0.635914,0.28528,0.145921,1,0.3,2.691344
1,0.5,0.839079,0.500225,0.000962,0.75,0.000481,0.666148,0.255572,0.129308,2,0.3,5.084261
2,0.5,0.823327,0.500162,0.001459,0.277778,0.000732,0.644068,0.261619,0.140128,3,0.3,7.306718


Minutes ahead: 10, Percentage change: 0.15


2025-08-16 23:43:55,973 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.133 | calibrating=isotonic
2025-08-16 23:43:58,568 INFO: [Fold 1] AUC=0.685 PR=0.220 F1@0.5=0.000 Brier=0.0921 best_thr(F1)=0.30 time=2.6s
2025-08-16 23:43:58,581 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.120 | calibrating=isotonic
2025-08-16 23:44:04,007 INFO: [Fold 2] AUC=0.714 PR=0.184 F1@0.5=0.000 Brier=0.0787 best_thr(F1)=0.30 time=5.4s
2025-08-16 23:44:04,022 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.111 | calibrating=isotonic
2025-08-16 23:44:12,786 INFO: [Fold 3] AUC=0.692 PR=0.190 F1@0.5=0.001 Brier=0.0878 best_thr(F1)=0.30 time=8.8s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,fold,best_thr_f1,sec
0,0.5,0.891833,0.5,0.0,0.0,0.0,0.68537,0.22022,0.092125,1,0.3,2.603239
1,0.5,0.909082,0.5,0.0,0.0,0.0,0.714015,0.184484,0.078713,2,0.3,5.437672
2,0.5,0.897746,0.499997,0.000505,0.1,0.000253,0.69186,0.190448,0.087754,3,0.3,8.778599


Minutes ahead: 10, Percentage change: 0.2


2025-08-16 23:44:12,837 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.083 | calibrating=isotonic
2025-08-16 23:44:15,396 INFO: [Fold 1] AUC=0.736 PR=0.184 F1@0.5=0.035 Brier=0.0567 best_thr(F1)=0.30 time=2.6s
2025-08-16 23:44:15,409 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.073 | calibrating=isotonic
2025-08-16 23:44:20,819 INFO: [Fold 2] AUC=0.752 PR=0.135 F1@0.5=0.000 Brier=0.0482 best_thr(F1)=0.30 time=5.4s
2025-08-16 23:44:20,834 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.067 | calibrating=isotonic
2025-08-16 23:44:28,954 INFO: [Fold 3] AUC=0.738 PR=0.139 F1@0.5=0.002 Brier=0.0517 best_thr(F1)=0.30 time=8.1s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,fold,best_thr_f1,sec
0,0.5,0.935678,0.50833,0.03487,0.459184,0.018123,0.735753,0.183609,0.056736,1,0.3,2.565972
1,0.5,0.946755,0.5,0.0,0.0,0.0,0.752151,0.135255,0.048176,2,0.3,5.422934
2,0.5,0.942753,0.500424,0.001801,0.5,0.000902,0.738034,0.139199,0.051737,3,0.3,8.133592


### LightGBM

In [None]:
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger("gbdt_exp")

def run_gbdt_experiment(
    X, y, splits, *,
    progress=True,
    early_stopping_rounds=100,
    calibrate=False,
    calibration="sigmoid"
):
    fold_rows = []

    base_params = dict(
        n_estimators=2000,
        learning_rate=0.03,
        max_depth=-1,
        num_leaves=63,
        subsample=0.8,  # ignored by LightGBM unless subsample_freq > 0 (default is 0)
        colsample_bytree=0.8,
        objective='binary',
        n_jobs=-1,
        verbosity=-1
    )

    it = enumerate(splits, 1)
    if progress:
        it = tqdm(it, total=len(splits), desc="LightGBM folds", leave=False)

    for k, (tr, te) in it:
        t0 = time.time()
        Xtr, Xte = X.iloc[tr], X.iloc[te]
        ytr, yte = y.iloc[tr], y.iloc[te]
        pos_rate = float(ytr.mean())
        # NOTE: the configured method is logged even when calibrate=False (no calibration is applied then)
        logger.info(f"[Fold {k}] train={len(tr):,} test={len(te):,} pos_rate={pos_rate:.3f} | calibrating={calibration}")

        # imbalance
        pos = int(ytr.sum()); neg = len(ytr) - pos
        params = base_params.copy()
        params['scale_pos_weight'] = (neg / max(pos, 1))

        clf = lgb.LGBMClassifier(**params)

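        # early stopping via callbacks (works across LightGBM versions)
        # NOTE: the test fold doubles as the early-stopping set, so the metrics below can be optimistic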
        callbacks = [
            lgb.early_stopping(early_stopping_rounds, verbose=False),
            lgb.log_evaluation(period=0)  # 0 -> no periodic prints
        ]
        clf.fit(
            Xtr, ytr,
            eval_set=[(Xte, yte)],
            eval_metric="binary_logloss",
            callbacks=callbacks
        )

        proba = clf.predict_proba(Xte)[:, 1]

        # optional probability calibration (train-only CV)
        if calibrate:
            # freeze best iteration if available to speed up/calibrate fairly
            best_iter = getattr(clf, "best_iteration_", None)
            cal_est = lgb.LGBMClassifier(**params)
            if best_iter is not None:
                cal_est.set_params(n_estimators=best_iter)
            cal = CalibratedClassifierCV(cal_est, method=calibration, cv=3)
            logger.info(f"[LightGBM Fold {k}] calibrating={calibration}")
            cal.fit(Xtr, ytr)
            proba = cal.predict_proba(Xte)[:, 1]

        # metrics (includes class-1 precision/recall)
        m50 = fold_metrics(yte, proba, thr=0.5)
        thr_tbl, best_thr = evaluate_thresholds(yte, proba)
        m = {**m50, 'best_thr_f1': float(best_thr), 'fold': k, 'sec': time.time() - t0}
        fold_rows.append(m)

        logger.info(
            f"[LightGBM Fold {k}] AUC={m['roc_auc']:.3f} PR={m['pr_auc']:.3f} "
            f"F1@0.5={m['f1']:.3f} P1@0.5={m['prec1']:.3f} R1@0.5={m['rec1']:.3f} "
            f"best_thr={best_thr:.2f} time={m['sec']:.1f}s"
        )

    return pd.DataFrame(fold_rows)

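- `calibrate=False`: set to `True` to calibrate the predicted probabilities with `CalibratedClassifierCV` (3-fold CV on the training fold only).
- `calibration="sigmoid"`: the calibration method, either `"sigmoid"` (Platt scaling) or `"isotonic"`. It is used only when `calibrate=True`, although the fold log prints it either way.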
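
The two calibration methods can be compared on a toy problem. The block below is a minimal sketch and not part of the experiments: the synthetic data from `make_classification`, the `LogisticRegression` base model and the names `X_demo`, `y_demo` and `brier` are assumptions made for illustration. Note that `cv=3` with an integer uses stratified folds, which are not time-ordered.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

# Synthetic, imbalanced data standing in for the real features (assumption).
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Simple holdout: first 75% of rows for training, last 25% for testing.
cut = int(len(X_demo) * 0.75)
X_tr, X_te = X_demo[:cut], X_demo[cut:]
y_tr, y_te = y_demo[:cut], y_demo[cut:]

brier = {}
for method in ("sigmoid", "isotonic"):
    # class_weight="balanced" distorts raw probabilities, which is the
    # situation calibration is meant to repair.
    base = LogisticRegression(class_weight="balanced", max_iter=1000)
    cal = CalibratedClassifierCV(base, method=method, cv=3)
    cal.fit(X_tr, y_tr)  # calibration folds are drawn from the training part only
    proba = cal.predict_proba(X_te)[:, 1]
    brier[method] = brier_score_loss(y_te, proba)

print(brier)  # lower Brier score means better calibrated probabilities
```

Isotonic regression is more flexible but needs more positive examples to avoid overfitting, so with rare targets it may be worth checking both methods per fold.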

#### Minutes Ahead = **5**


In [None]:
H = 5  # minutes ahead
target_col_basic = f'Target_cls_simple_{H}_min' # simple sign

X = lagged.loc[df_base[target_col_basic].index]
y = df_base[target_col_basic].astype(int).reindex(X.index)
mask = ~X.isna().any(axis=1) & y.notna()
X, y = X[mask], y[mask]

embargo = max(time_periods)  # 5/15/30/60/120/180 -> pick max you ever predict
splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
display(HTML(run_gbdt_experiment(X, y, splits).to_html()))


2025-08-16 23:44:29,012 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.504 | calibrating=sigmoid
2025-08-16 23:44:30,796 INFO: [LightGBM Fold 1] AUC=0.511 PR=0.505 F1@0.5=0.553 P1@0.5=0.501 R1@0.5=0.617 best_thr=0.30 time=1.8s
2025-08-16 23:44:30,809 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.499 | calibrating=sigmoid
2025-08-16 23:44:32,922 INFO: [LightGBM Fold 2] AUC=0.515 PR=0.508 F1@0.5=0.359 P1@0.5=0.517 R1@0.5=0.275 best_thr=0.30 time=2.1s
2025-08-16 23:44:32,938 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.499 | calibrating=sigmoid
2025-08-16 23:44:35,448 INFO: [LightGBM Fold 3] AUC=0.519 PR=0.513 F1@0.5=0.455 P1@0.5=0.513 R1@0.5=0.410 best_thr=0.40 time=2.5s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.506546,0.50773,0.552794,0.500912,0.616666,0.511289,0.50486,0.249934,0.3,1,1.790694
1,0.5,0.511374,0.510177,0.358695,0.516696,0.274695,0.514998,0.507947,0.249888,0.3,2,2.126179
2,0.5,0.513285,0.512619,0.455499,0.512647,0.409814,0.518858,0.513054,0.249708,0.4,3,2.525238


In [None]:
for P in percentage_changes:
    H = 5  # minutes ahead
    target_col_percent = f'Target_cls_{P}%_{H}_min'  # up move of >= P% within H minutes

    X = lagged.loc[df_base[target_col_percent].index]
    y = df_base[target_col_percent].astype(int).reindex(X.index)
    mask = ~X.isna().any(axis=1) & y.notna()
    X, y = X[mask], y[mask]

    embargo = max(time_periods)  # 5/15/30/60/120/180 -> pick max you ever predict
    splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
    print(f"Minutes ahead: {H}, Percentage change: {P}")
    display(HTML(run_gbdt_experiment(X, y, splits).to_html()))

Minutes ahead: 5, Percentage change: 0.1



2025-08-16 23:44:35,504 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.143 | calibrating=sigmoid
2025-08-16 23:44:37,235 INFO: [LightGBM Fold 1] AUC=0.663 PR=0.200 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=1.7s
2025-08-16 23:44:37,252 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.131 | calibrating=sigmoid
2025-08-16 23:44:39,268 INFO: [LightGBM Fold 2] AUC=0.663 PR=0.175 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.0s
2025-08-16 23:44:39,286 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.119 | calibrating=sigmoid
2025-08-16 23:44:41,587 INFO: [LightGBM Fold 3] AUC=0.671 PR=0.182 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.3s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.881556,0.5,0.0,0.0,0.0,0.663107,0.199904,0.104756,0.3,1,1.739693
1,0.5,0.903091,0.5,0.0,0.0,0.0,0.66257,0.174506,0.088611,0.3,2,2.032488
2,0.5,0.89049,0.5,0.0,0.0,0.0,0.671299,0.182252,0.097188,0.3,3,2.318489


Minutes ahead: 5, Percentage change: 0.15



2025-08-16 23:44:41,633 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.074 | calibrating=sigmoid
2025-08-16 23:44:43,445 INFO: [LightGBM Fold 1] AUC=0.680 PR=0.114 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=1.8s
2025-08-16 23:44:43,458 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.066 | calibrating=sigmoid
2025-08-16 23:44:45,456 INFO: [LightGBM Fold 2] AUC=0.741 PR=0.115 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.0s
2025-08-16 23:44:45,473 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.059 | calibrating=sigmoid
2025-08-16 23:44:47,953 INFO: [LightGBM Fold 3] AUC=0.728 PR=0.121 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.5s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.941436,0.5,0.0,0.0,0.0,0.679676,0.113767,0.055283,0.3,1,1.819134
1,0.5,0.955018,0.5,0.0,0.0,0.0,0.740526,0.115497,0.043319,0.3,2,2.010078
2,0.5,0.948925,0.5,0.0,0.0,0.0,0.727832,0.121097,0.04826,0.3,3,2.496623


Minutes ahead: 5, Percentage change: 0.2



2025-08-16 23:44:47,999 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.040 | calibrating=sigmoid
2025-08-16 23:44:49,813 INFO: [LightGBM Fold 1] AUC=0.739 PR=0.074 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=1.8s
2025-08-16 23:44:49,827 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.034 | calibrating=sigmoid
2025-08-16 23:44:52,161 INFO: [LightGBM Fold 2] AUC=0.747 PR=0.077 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.3s
2025-08-16 23:44:52,178 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.030 | calibrating=sigmoid
2025-08-16 23:44:54,505 INFO: [LightGBM Fold 3] AUC=0.727 PR=0.068 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.3s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.971286,0.5,0.0,0.0,0.0,0.738578,0.073544,0.027977,0.3,1,1.8223
1,0.5,0.977819,0.5,0.0,0.0,0.0,0.747091,0.077444,0.021811,0.3,2,2.347252
2,0.5,0.975702,0.5,0.0,0.0,0.0,0.726696,0.068217,0.023616,0.3,3,2.343561


#### Minutes Ahead = **10**


In [None]:
H = 10  # minutes ahead
target_col_basic = f'Target_cls_simple_{H}_min' # simple sign

X = lagged.loc[df_base[target_col_basic].index]
y = df_base[target_col_basic].astype(int).reindex(X.index)
mask = ~X.isna().any(axis=1) & y.notna()
X, y = X[mask], y[mask]

embargo = max(time_periods)  # 5/15/30/60/120/180 -> pick max you ever predict
splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
display(HTML(run_gbdt_experiment(X, y, splits).to_html()))


2025-08-16 23:44:54,564 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.505 | calibrating=sigmoid
2025-08-16 23:44:56,290 INFO: [LightGBM Fold 1] AUC=0.501 PR=0.491 F1@0.5=0.611 P1@0.5=0.496 R1@0.5=0.795 best_thr=0.30 time=1.7s
2025-08-16 23:44:56,301 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.499 | calibrating=sigmoid
2025-08-16 23:44:58,306 INFO: [LightGBM Fold 2] AUC=0.512 PR=0.512 F1@0.5=0.402 P1@0.5=0.512 R1@0.5=0.330 best_thr=0.30 time=2.0s
2025-08-16 23:44:58,327 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.500 | calibrating=sigmoid
2025-08-16 23:45:00,777 INFO: [LightGBM Fold 3] AUC=0.515 PR=0.513 F1@0.5=0.521 P1@0.5=0.504 R1@0.5=0.539 best_thr=0.30 time=2.5s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.500839,0.504881,0.611054,0.496128,0.795276,0.501149,0.490831,0.250132,0.3,1,1.732159
1,0.5,0.506391,0.506833,0.401578,0.51181,0.330414,0.511727,0.512141,0.249891,0.3,2,2.015504
2,0.5,0.508663,0.508901,0.521212,0.504457,0.539118,0.515172,0.513118,0.249816,0.3,3,2.470771


In [None]:
for P in percentage_changes:
    H = 10  # minutes ahead
    target_col_percent = f'Target_cls_{P}%_{H}_min'  # up move of >= P% within H minutes

    X = lagged.loc[df_base[target_col_percent].index]
    y = df_base[target_col_percent].astype(int).reindex(X.index)
    mask = ~X.isna().any(axis=1) & y.notna()
    X, y = X[mask], y[mask]

    embargo = max(time_periods)  # 5/15/30/60/120/180 -> pick max you ever predict
    splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
    print(f"Minutes ahead: {H}, Percentage change: {P}")
    display(HTML(run_gbdt_experiment(X, y, splits).to_html()))

Minutes ahead: 10, Percentage change: 0.1



2025-08-16 23:45:00,832 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.218 | calibrating=sigmoid
2025-08-16 23:45:02,589 INFO: [LightGBM Fold 1] AUC=0.620 PR=0.263 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=1.8s
2025-08-16 23:45:02,605 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.201 | calibrating=sigmoid
2025-08-16 23:45:04,668 INFO: [LightGBM Fold 2] AUC=0.653 PR=0.246 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.1s
2025-08-16 23:45:04,686 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.188 | calibrating=sigmoid
2025-08-16 23:45:07,191 INFO: [LightGBM Fold 3] AUC=0.618 PR=0.240 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.5s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.814961,0.5,0.0,0.0,0.0,0.619941,0.262634,0.151797,0.3,1,1.765182
1,0.5,0.839027,0.5,0.0,0.0,0.0,0.652854,0.246299,0.136361,0.3,2,2.078479
2,0.5,0.823534,0.5,0.0,0.0,0.0,0.618264,0.239667,0.144987,0.3,3,2.52268


Minutes ahead: 10, Percentage change: 0.15



2025-08-16 23:45:07,244 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.133 | calibrating=sigmoid
2025-08-16 23:45:08,970 INFO: [LightGBM Fold 1] AUC=0.651 PR=0.189 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=1.7s
2025-08-16 23:45:08,982 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.120 | calibrating=sigmoid
2025-08-16 23:45:11,012 INFO: [LightGBM Fold 2] AUC=0.526 PR=0.120 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.0s
2025-08-16 23:45:11,035 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.111 | calibrating=sigmoid
2025-08-16 23:45:13,375 INFO: [LightGBM Fold 3] AUC=0.676 PR=0.175 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.4s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.891833,0.5,0.0,0.0,0.0,0.651206,0.189377,0.09696,0.3,1,1.732882
1,0.5,0.909082,0.5,0.0,0.0,0.0,0.526154,0.119767,0.083868,0.3,2,2.041661
2,0.5,0.897952,0.5,0.0,0.0,0.0,0.675503,0.174697,0.091187,0.3,3,2.362444


Minutes ahead: 10, Percentage change: 0.2



2025-08-16 23:45:13,428 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.083 | calibrating=sigmoid
2025-08-16 23:45:15,174 INFO: [LightGBM Fold 1] AUC=0.638 PR=0.131 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=1.8s
2025-08-16 23:45:15,190 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.073 | calibrating=sigmoid
2025-08-16 23:45:17,276 INFO: [LightGBM Fold 2] AUC=0.626 PR=0.083 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.1s
2025-08-16 23:45:17,294 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.067 | calibrating=sigmoid
2025-08-16 23:45:19,669 INFO: [LightGBM Fold 3] AUC=0.693 PR=0.103 F1@0.5=0.000 P1@0.5=0.000 R1@0.5=0.000 best_thr=0.30 time=2.4s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.935885,0.5,0.0,0.0,0.0,0.637602,0.13099,0.060523,0.3,1,1.75328
1,0.5,0.946755,0.5,0.0,0.0,0.0,0.62608,0.083372,0.050928,0.3,2,2.101254
2,0.5,0.942753,0.5,0.0,0.0,0.0,0.692608,0.103246,0.053871,0.3,3,2.392755


### CatBoost

In [None]:
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger("catboost_exp")


def run_catboost_experiment(
    X, y, splits, *,
    progress=True,
    iterations=3000, learning_rate=0.03, depth=6,
    calibrate=False, calibration="sigmoid"
):
    fold_rows = []
    it = enumerate(splits, 1)
    if progress:
        it = tqdm(it, total=len(splits), desc="CatBoost folds", leave=False)

    for k, (tr, te) in it:
        t0 = time.time()
        Xtr, Xte = X.iloc[tr], X.iloc[te]
        ytr, yte = y.iloc[tr], y.iloc[te]
        pos_rate = float(ytr.mean())
        # NOTE: the configured method is logged even when calibrate=False (no calibration is applied then)
        logger.info(f"[Fold {k}] train={len(tr):,} test={len(te):,} pos_rate={pos_rate:.3f} | calibrating={calibration}")

        pos = ytr.sum()
        neg = len(ytr) - pos
        cw1 = max(1.0, (neg / max(pos, 1)))   # weight for class 1
        params = dict(
            iterations=iterations, learning_rate=learning_rate, depth=depth,
            loss_function='Logloss', eval_metric='AUC',
            od_type='Iter', od_wait=150, random_seed=42, verbose=False,
            class_weights=[1.0, cw1]
        )

        model = CatBoostClassifier(**params)
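        # NOTE: the test fold is used as eval_set, so overfitting detection and best-iteration
        # selection see the data that is scored below; the metrics can be optimistic
        model.fit(Pool(Xtr, ytr), eval_set=Pool(Xte, yte), verbose=False)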

        proba = model.predict_proba(Xte)[:, 1]

        if calibrate:
            cal = CalibratedClassifierCV(clone(model), method=calibration, cv=3)
            logger.info(f"[CatBoost Fold {k}] calibrating={calibration}")
            cal.fit(Xtr, ytr)
            proba = cal.predict_proba(Xte)[:, 1]

        m50 = fold_metrics(yte, proba, thr=0.5)
        thr_tbl, best_thr = evaluate_thresholds(yte, proba)
        m = {**m50, 'best_thr_f1': best_thr, 'fold': k, 'sec': time.time() - t0}
        fold_rows.append(m)

        logger.info(f"[CatBoost Fold {k}] AUC={m['roc_auc']:.3f} PR={m['pr_auc']:.3f} "
                    f"F1@0.5={m['f1']:.3f} P1@0.5={m['prec1']:.3f} R1@0.5={m['rec1']:.3f} "
                    f"best_thr={best_thr:.2f} time={m['sec']:.1f}s")

    return pd.DataFrame(fold_rows)
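
Both experiment functions above pass the test fold as `eval_set`, so early stopping sees the data that is scored afterwards. One possible remedy, sketched here under assumptions, is to carve a validation tail out of each training fold and stop on that instead. The helper `split_train_validation_tail` and its parameters `val_fraction` and `gap` are hypothetical names introduced for this illustration; they are not part of the notebook.

```python
import numpy as np

def split_train_validation_tail(train_idx, val_fraction=0.2, gap=0):
    """Split time-ordered training positions into (fit, validation) parts.

    The last `val_fraction` of the positions becomes the validation tail.
    `gap` positions immediately before the tail are dropped, playing the
    same role as the embargo between train and test folds.
    """
    train_idx = np.asarray(train_idx)
    n_val = max(1, int(len(train_idx) * val_fraction))
    fit_end = len(train_idx) - n_val - gap
    if fit_end <= 0:
        raise ValueError("training fold too short for this val_fraction/gap")
    return train_idx[:fit_end], train_idx[-n_val:]

# Toy example: 100 ordered positions, 20% validation tail, 5-row gap.
fit_idx, val_idx = split_train_validation_tail(np.arange(100), val_fraction=0.2, gap=5)
print(len(fit_idx), len(val_idx), fit_idx[-1], val_idx[0])  # 75 20 74 80
```

Inside the fold loop, `tr` would be split this way, the model fitted on `X.iloc[fit_idx]` with `X.iloc[val_idx]` as the early-stopping set, and the test fold `te` kept for scoring only.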

#### Minutes Ahead = **5**


In [None]:
H = 5  # minutes ahead
target_col_basic = f'Target_cls_simple_{H}_min' # simple sign

X = lagged.loc[df_base[target_col_basic].index]
y = df_base[target_col_basic].astype(int).reindex(X.index)
mask = ~X.isna().any(axis=1) & y.notna()
X, y = X[mask], y[mask]

embargo = max(time_periods)  # 5/15/30/60/120/180 -> pick max you ever predict
splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
display(HTML(run_catboost_experiment(X, y, splits).to_html()))


2025-08-16 23:45:19,736 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.504 | calibrating=sigmoid
2025-08-16 23:45:21,234 INFO: [CatBoost Fold 1] AUC=0.511 PR=0.502 F1@0.5=0.506 P1@0.5=0.502 R1@0.5=0.509 best_thr=0.30 time=1.5s
2025-08-16 23:45:21,248 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.499 | calibrating=sigmoid
2025-08-16 23:45:22,774 INFO: [CatBoost Fold 2] AUC=0.512 PR=0.509 F1@0.5=0.293 P1@0.5=0.511 R1@0.5=0.205 best_thr=0.30 time=1.5s
2025-08-16 23:45:22,789 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.499 | calibrating=sigmoid
2025-08-16 23:45:26,497 INFO: [CatBoost Fold 3] AUC=0.516 PR=0.515 F1@0.5=0.515 P1@0.5=0.505 R1@0.5=0.525 best_thr=0.30 time=3.7s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.50794,0.507956,0.505937,0.502498,0.509424,0.51057,0.501544,0.25004,0.3,1,1.504713
1,0.5,0.507062,0.505534,0.292858,0.511318,0.205191,0.512172,0.508636,0.249913,0.3,2,1.539524
2,0.5,0.508818,0.508923,0.515067,0.505403,0.525107,0.515722,0.51505,0.250004,0.3,3,3.722083


In [None]:
for P in percentage_changes:
    H = 5  # minutes ahead
    target_col_percent = f'Target_cls_{P}%_{H}_min'  # up move of >= P% within H minutes

    X = lagged.loc[df_base[target_col_percent].index]
    y = df_base[target_col_percent].astype(int).reindex(X.index)
    mask = ~X.isna().any(axis=1) & y.notna()
    X, y = X[mask], y[mask]

    embargo = max(time_periods)  # 5/15/30/60/120/180 -> pick max you ever predict
    splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
    print(f"Minutes ahead: {H}, Percentage change: {P}")
    display(HTML(run_catboost_experiment(X, y, splits).to_html()))

Minutes ahead: 5, Percentage change: 0.1



2025-08-16 23:45:26,577 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.143 | calibrating=sigmoid
2025-08-16 23:45:28,099 INFO: [CatBoost Fold 1] AUC=0.691 PR=0.225 F1@0.5=0.306 P1@0.5=0.204 R1@0.5=0.613 best_thr=0.55 time=1.5s
2025-08-16 23:45:28,111 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.131 | calibrating=sigmoid
2025-08-16 23:45:30,246 INFO: [CatBoost Fold 2] AUC=0.732 PR=0.208 F1@0.5=0.281 P1@0.5=0.189 R1@0.5=0.553 best_thr=0.50 time=2.1s
2025-08-16 23:45:30,265 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.119 | calibrating=sigmoid
2025-08-16 23:45:33,092 INFO: [CatBoost Fold 3] AUC=0.702 PR=0.209 F1@0.5=0.282 P1@0.5=0.177 R1@0.5=0.685 best_thr=0.55 time=2.8s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.67116,0.645909,0.306259,0.204139,0.612819,0.690997,0.224947,0.209647,0.55,1,1.530285
1,0.5,0.726392,0.648832,0.281335,0.1887,0.552625,0.73152,0.207965,0.175127,0.5,2,2.146609
2,0.5,0.617321,0.64699,0.281629,0.177253,0.68498,0.70194,0.209345,0.211327,0.55,3,2.845576


Minutes ahead: 5, Percentage change: 0.15



2025-08-16 23:45:33,143 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.074 | calibrating=sigmoid
2025-08-16 23:45:34,973 INFO: [CatBoost Fold 1] AUC=0.747 PR=0.165 F1@0.5=0.215 P1@0.5=0.129 R1@0.5=0.643 best_thr=0.60 time=1.8s
2025-08-16 23:45:34,987 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.066 | calibrating=sigmoid
2025-08-16 23:45:37,115 INFO: [CatBoost Fold 2] AUC=0.772 PR=0.133 F1@0.5=0.190 P1@0.5=0.115 R1@0.5=0.552 best_thr=0.60 time=2.1s
2025-08-16 23:45:37,130 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.059 | calibrating=sigmoid
2025-08-16 23:45:40,391 INFO: [CatBoost Fold 3] AUC=0.754 PR=0.145 F1@0.5=0.184 P1@0.5=0.107 R1@0.5=0.674 best_thr=0.65 time=3.3s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.724378,0.68636,0.214685,0.128841,0.643298,0.7468,0.164517,0.181888,0.6,1,1.837931
1,0.5,0.788468,0.675643,0.190034,0.114787,0.551665,0.772291,0.133087,0.144881,0.6,2,2.140817
2,0.5,0.694942,0.685232,0.184229,0.106686,0.674419,0.754209,0.144717,0.185617,0.65,3,3.275415


Minutes ahead: 5, Percentage change: 0.2



2025-08-16 23:45:40,442 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.040 | calibrating=sigmoid
2025-08-16 23:45:41,978 INFO: [CatBoost Fold 1] AUC=0.800 PR=0.120 F1@0.5=0.152 P1@0.5=0.085 R1@0.5=0.702 best_thr=0.70 time=1.5s
2025-08-16 23:45:41,994 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.034 | calibrating=sigmoid
2025-08-16 23:45:44,054 INFO: [CatBoost Fold 2] AUC=0.808 PR=0.094 F1@0.5=0.131 P1@0.5=0.074 R1@0.5=0.551 best_thr=0.70 time=2.1s
2025-08-16 23:45:44,070 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.030 | calibrating=sigmoid
2025-08-16 23:45:47,017 INFO: [CatBoost Fold 3] AUC=0.803 PR=0.106 F1@0.5=0.119 P1@0.5=0.065 R1@0.5=0.724 best_thr=0.65 time=3.0s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.774343,0.739405,0.151636,0.084993,0.702338,0.800328,0.119612,0.17372,0.7,1,1.542694
1,0.5,0.837736,0.697444,0.130844,0.074243,0.55064,0.807952,0.093968,0.119945,0.7,2,2.075279
2,0.5,0.739768,0.731933,0.119056,0.064863,0.723698,0.802962,0.105608,0.168036,0.65,3,2.962637


#### Minutes Ahead = **10**


In [None]:
H = 10  # minutes ahead
target_col_basic = f'Target_cls_simple_{H}_min' # simple sign

X = lagged.loc[df_base[target_col_basic].index]
y = df_base[target_col_basic].astype(int).reindex(X.index)
mask = ~X.isna().any(axis=1) & y.notna()
X, y = X[mask], y[mask]

embargo = max(time_periods)  # 5/15/30/60/120/180 -> pick max you ever predict
splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
display(HTML(run_catboost_experiment(X, y, splits).to_html()))


2025-08-16 23:45:47,080 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.505 | calibrating=sigmoid
2025-08-16 23:45:48,883 INFO: [CatBoost Fold 1] AUC=0.514 PR=0.505 F1@0.5=0.500 P1@0.5=0.502 R1@0.5=0.499 best_thr=0.40 time=1.8s
2025-08-16 23:45:48,895 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.499 | calibrating=sigmoid
2025-08-16 23:45:50,528 INFO: [CatBoost Fold 2] AUC=0.524 PR=0.521 F1@0.5=0.347 P1@0.5=0.529 R1@0.5=0.258 best_thr=0.30 time=1.6s
2025-08-16 23:45:50,544 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.500 | calibrating=sigmoid
2025-08-16 23:45:54,634 INFO: [CatBoost Fold 3] AUC=0.522 PR=0.515 F1@0.5=0.558 P1@0.5=0.509 R1@0.5=0.617 best_thr=0.30 time=4.1s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.509025,0.508881,0.500342,0.50211,0.498586,0.514104,0.504733,0.250119,0.4,1,1.809795
1,0.5,0.512924,0.513564,0.346781,0.529002,0.257933,0.524139,0.521116,0.249895,0.3,2,1.645319
2,0.5,0.515222,0.516018,0.558113,0.509388,0.617146,0.522469,0.515489,0.250682,0.3,3,4.104945


In [None]:
for P in percentage_changes:
    H = 10  # minutes ahead
    target_col_percent = f'Target_cls_{P}%_{H}_min'  # up move of >= P% within H minutes

    X = lagged.loc[df_base[target_col_percent].index]
    y = df_base[target_col_percent].astype(int).reindex(X.index)
    mask = ~X.isna().any(axis=1) & y.notna()
    X, y = X[mask], y[mask]

    embargo = max(time_periods)  # 5/15/30/60/120/180 -> pick max you ever predict
    splits = list(walk_forward_splits_row_embargo(X.index, n_splits=3, embargo_minutes=embargo))
    print(f"Minutes ahead: {H}, Percentage change: {P}")
    display(HTML(run_catboost_experiment(X, y, splits).to_html()))

Minutes ahead: 10, Percentage change: 0.1



2025-08-16 23:45:54,690 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.218 | calibrating=sigmoid
2025-08-16 23:45:56,316 INFO: [CatBoost Fold 1] AUC=0.642 PR=0.284 F1@0.5=0.357 P1@0.5=0.273 R1@0.5=0.515 best_thr=0.50 time=1.6s
2025-08-16 23:45:56,328 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.201 | calibrating=sigmoid
2025-08-16 23:45:57,770 INFO: [CatBoost Fold 2] AUC=0.680 PR=0.269 F1@0.5=0.344 P1@0.5=0.260 R1@0.5=0.510 best_thr=0.45 time=1.5s
2025-08-16 23:45:57,787 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.188 | calibrating=sigmoid
2025-08-16 23:46:00,095 INFO: [CatBoost Fold 3] AUC=0.651 PR=0.263 F1@0.5=0.357 P1@0.5=0.240 R1@0.5=0.692 best_thr=0.50 time=2.3s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.655718,0.601524,0.356546,0.272519,0.51549,0.641947,0.283674,0.221769,0.5,1,1.633792
1,0.5,0.687143,0.615672,0.344301,0.259801,0.510266,0.679725,0.26895,0.220852,0.45,2,1.453517
2,0.5,0.559739,0.611634,0.356749,0.240342,0.691835,0.650727,0.262569,0.230985,0.5,3,2.324028


Minutes ahead: 10, Percentage change: 0.15



2025-08-16 23:46:00,145 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.133 | calibrating=sigmoid
2025-08-16 23:46:01,210 INFO: [CatBoost Fold 1] AUC=0.687 PR=0.217 F1@0.5=0.283 P1@0.5=0.187 R1@0.5=0.581 best_thr=0.55 time=1.1s
2025-08-16 23:46:01,222 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.120 | calibrating=sigmoid
2025-08-16 23:46:02,913 INFO: [CatBoost Fold 2] AUC=0.725 PR=0.194 F1@0.5=0.262 P1@0.5=0.178 R1@0.5=0.502 best_thr=0.45 time=1.7s
2025-08-16 23:46:02,928 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.111 | calibrating=sigmoid
2025-08-16 23:46:05,465 INFO: [CatBoost Fold 3] AUC=0.698 PR=0.194 F1@0.5=0.267 P1@0.5=0.164 R1@0.5=0.704 best_thr=0.55 time=2.6s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.681179,0.637289,0.28286,0.186905,0.581284,0.687201,0.21667,0.225128,0.55,1,1.071374
1,0.5,0.743435,0.634849,0.26247,0.177671,0.50213,0.725219,0.193795,0.184948,0.45,2,1.70301
2,0.5,0.604617,0.64875,0.266596,0.164422,0.7042,0.698411,0.19438,0.21079,0.55,3,2.551821


Minutes ahead: 10, Percentage change: 0.2



2025-08-16 23:46:05,515 INFO: [Fold 1] train=38,721 test=38,727 pos_rate=0.083 | calibrating=sigmoid
2025-08-16 23:46:07,020 INFO: [CatBoost Fold 1] AUC=0.734 PR=0.178 F1@0.5=0.221 P1@0.5=0.134 R1@0.5=0.639 best_thr=0.60 time=1.5s
2025-08-16 23:46:07,032 INFO: [Fold 2] train=77,448 test=38,727 pos_rate=0.073 | calibrating=sigmoid
2025-08-16 23:46:08,620 INFO: [CatBoost Fold 2] AUC=0.756 PR=0.143 F1@0.5=0.196 P1@0.5=0.119 R1@0.5=0.564 best_thr=0.60 time=1.6s
2025-08-16 23:46:08,637 INFO: [Fold 3] train=116,175 test=38,727 pos_rate=0.067 | calibrating=sigmoid
2025-08-16 23:46:11,015 INFO: [CatBoost Fold 3] AUC=0.744 PR=0.146 F1@0.5=0.189 P1@0.5=0.108 R1@0.5=0.734 best_thr=0.60 time=2.4s


Unnamed: 0,thr,acc,bacc,f1,prec1,rec1,roc_auc,pr_auc,brier,best_thr_f1,fold,sec
0,0.5,0.711984,0.677872,0.221416,0.133919,0.638743,0.733809,0.178081,0.213377,0.6,1,1.511983
1,0.5,0.754306,0.664283,0.1963,0.11885,0.563531,0.756256,0.142952,0.18356,0.6,2,1.599169
2,0.5,0.639244,0.683686,0.188911,0.108409,0.733875,0.743881,0.146298,0.201634,0.6,3,2.394572
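
The per-fold tables are easier to compare once they are collapsed into one row per model. The sketch below does this for the `roc_auc` and `pr_auc` values copied from the LightGBM and CatBoost tables for the 5-minute, 0.1% target; the names `rows`, `folds` and `summary` are assumptions for illustration, and in the notebook the frames returned by the experiment functions could be concatenated instead.

```python
import pandas as pd

# (model, fold, roc_auc, pr_auc) copied from the 5-minute / 0.1% tables above.
rows = [
    ("LightGBM", 1, 0.663107, 0.199904),
    ("LightGBM", 2, 0.662570, 0.174506),
    ("LightGBM", 3, 0.671299, 0.182252),
    ("CatBoost", 1, 0.690997, 0.224947),
    ("CatBoost", 2, 0.731520, 0.207965),
    ("CatBoost", 3, 0.701940, 0.209345),
]
folds = pd.DataFrame(rows, columns=["model", "fold", "roc_auc", "pr_auc"])

# Mean and standard deviation across folds, one row per model.
summary = folds.groupby("model")[["roc_auc", "pr_auc"]].agg(["mean", "std"])
print(summary.round(4))
```

With only three folds the standard deviation is a rough indicator at best, so small differences between models should be read with caution.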
