<p align="center">
    <img src="https://upload.wikimedia.org/wikipedia/commons/7/74/Logo_%C3%89cole_normale_sup%C3%A9rieure_-_PSL_%28ENS-PSL%29.svg"
             alt="ENS-PSL"
             width="500"
             style="margin-right: 30px; display: inline-block; vertical-align: middle;"/>
    <img src="https://upload.wikimedia.org/wikipedia/en/3/3f/Qube_Research_%26_Technologies_Logo.svg"
             alt="QRT"
             width="200"
             style="display: inline-block; vertical-align: middle;"/>
</p>


# QRT - Predicting Directional Performance of Asset Allocations 
**Meta-Decision Learning on Systematic Portfolio Signals**

## Data Challenge  
**Powered by ENS**

<h3><span style="color:#800000;"><strong>Authored by:</strong> <em>Alexandre Mathias DONNAT, Sr</em></span></h3>

**Currently ranked 79/355** on https://challengedata.ens.fr/challenges/167

This notebook tackles a meta-decision problem in systematic trading:

Given the recent behaviour of a portfolio allocation, should we trust it (follow its weights), or fade it (take the opposite position)?

Each row of the dataset represents a systematic allocation of assets, defined by:

- 20 past daily returns
- 20 past signed liquidity measures
- its median daily turnover
- its group identifier
- and its future return (target)

The goal is not to predict the magnitude of the return, but only its sign.

Formally, we solve a binary directional classification problem under a time-series constraint, optimising the official metric:

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\text{sign}(\hat{r}_i) = \text{sign}(r_i)\right]$$

This is a decision problem, not a regression problem.
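The metric above can be sketched in a few lines of plain Python. This is an illustration only: the helper names and the example returns are hypothetical, and treating an exact zero as its own sign class is an assumption, since the text does not say how the official scorer handles zeros.

```python
def sign(x: float) -> int:
    """Return +1 for positive values, -1 for negative values, 0 for zero."""
    return (x > 0) - (x < 0)


def directional_accuracy(predicted, realised):
    """Fraction of rows where predicted and realised returns share a sign."""
    if len(predicted) != len(realised):
        raise ValueError("inputs must have the same length")
    hits = sum(sign(p) == sign(r) for p, r in zip(predicted, realised))
    return hits / len(realised)


# Hypothetical returns: the two prediction sets differ only in magnitude.
realised = [0.012, -0.004, 0.0001, -0.030]
small_preds = [0.0001, -0.0001, -0.0001, 0.0001]
large_preds = [0.10, -0.10, -0.10, 0.10]

acc_small = directional_accuracy(small_preds, realised)
acc_large = directional_accuracy(large_preds, realised)
print(acc_small, acc_large)
```

Both prediction sets score the same, because scaling a prediction never changes its sign.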


# 1. Mathematical Formulation

## 1.1 Allocation structure

For an allocation $S$ at date $t$, with $M$ assets:

$$r_{S,t+1} = \sum_{i=1}^{M} w_{S,t,i} \cdot r_{i,t+1}$$

where:

- $w_{S,t,i}$ are portfolio weights satisfying: $\sum_{i=1}^{M} |w_{S,t,i}| = 1$
- $r_{i,t+1}$ is the return of asset $i$ on the next session.
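As a worked example of the definition above, the sketch below builds weights with unit gross exposure and computes the next-session allocation return, then the return of the faded (opposite) position. The weights and asset returns are made-up numbers, and the helper names are hypothetical.

```python
def normalise_weights(raw_weights):
    """Rescale weights so that the absolute values sum to 1."""
    gross = sum(abs(w) for w in raw_weights)
    if gross == 0:
        raise ValueError("at least one weight must be non-zero")
    return [w / gross for w in raw_weights]


def allocation_return(weights, asset_returns):
    """Weighted sum of next-session asset returns."""
    return sum(w * r for w, r in zip(weights, asset_returns))


# Hypothetical long/short allocation over M = 3 assets.
weights = normalise_weights([2.0, -1.0, 1.0])      # 0.5, -0.25, 0.25
asset_returns = [0.01, 0.02, -0.004]

r_follow = allocation_return(weights, asset_returns)
r_fade = allocation_return([-w for w in weights], asset_returns)
print(r_follow, r_fade)
```

Fading flips every weight, so the faded return is the negative of the followed one. That is why only the sign of the allocation return matters for the trust/fade decision.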

Each row in the dataset gives:

$$\{r_{S,t}, r_{S,t-1}, \ldots, r_{S,t-19}\}$$

plus liquidity proxies and turnover summaries.

Our objective is to learn a function:

$$f: \mathbb{R}^d \to \{0, 1\}$$

such that:

- $1$ → trust allocation
- $0$ → fade allocation

## 1.2 Nature of the metric

The challenge metric only evaluates directional correctness.

- Predicting $+0.0001$ or $+0.10$ is identical.
- Magnitude is irrelevant.
- Only the sign matters.

This has important consequences:

- Optimising MSE is misaligned with the metric: a model can lower its squared error while getting more signs wrong.
- Calibration matters less than directional stability.
- Fitting noise in the magnitude does not improve the score.
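A small constructed example makes the gap between squared error and directional accuracy concrete. All numbers below are hypothetical and chosen so that one predictor wins on MSE while the other wins on sign accuracy.

```python
def mse(predicted, realised):
    """Mean squared error between predictions and realised returns."""
    return sum((p - r) ** 2 for p, r in zip(predicted, realised)) / len(realised)


def sign_accuracy(predicted, realised):
    """Share of rows where prediction and realisation are on the same side of zero."""
    hits = sum((p > 0) == (r > 0) for p, r in zip(predicted, realised))
    return hits / len(realised)


realised = [0.10, -0.01, 0.01, -0.01]

# Predictor A nails the one large move but misses the sign of the small ones.
pred_a = [0.10, 0.001, -0.001, 0.001]
# Predictor B is badly scaled but always on the right side of zero.
pred_b = [0.01, -0.10, 0.10, -0.10]

mse_a, mse_b = mse(pred_a, realised), mse(pred_b, realised)
acc_a, acc_b = sign_accuracy(pred_a, realised), sign_accuracy(pred_b, realised)
print(f"A: mse={mse_a:.6f} acc={acc_a:.2f} | B: mse={mse_b:.6f} acc={acc_b:.2f}")
```

Predictor A has the lower MSE but is right on only one row in four, whereas predictor B is right on every row. Under the challenge metric, B is the better model.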

The problem becomes a binary classification under heavy noise, with:

- weak signal-to-noise ratio,
- temporal regime shifts,
- heterogeneity across allocation groups.

# 2. Data Description

## 2.1 Training set

The dataset consists of:

- 527,073 observations in training,
- 31,870 observations in test.

Each row corresponds to a pair (`TS`, `ALLOCATION`).

Columns include:

- `RET_1` to `RET_20` — past daily allocation returns,
- `SIGNED_VOLUME_1` to `SIGNED_VOLUME_20`,
- `MEDIAN_DAILY_TURNOVER`,
- `GROUP`,
- `TARGET` (future allocation return; stored as the `target` column in `y_train.csv`).

The target is transformed into:

$$y = \mathbb{1}[TARGET > 0]$$
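The label construction can be sketched as follows. The target values are hypothetical; the point is the strict inequality, which sends an exactly flat return to class 0.

```python
def to_label(target_return: float) -> int:
    """1 if the future allocation return is strictly positive, else 0."""
    return int(target_return > 0)


# Hypothetical future returns, including an exactly flat one.
targets = [0.003, -0.001, 0.0, 0.0004, -0.02]
labels = [to_label(t) for t in targets]

# The "always-long" baseline predicts 1 everywhere, so its accuracy is the share of 1s.
always_long_accuracy = sum(labels) / len(labels)
print(labels, always_long_accuracy)
```

The share of positive labels doubles as the accuracy of an always-trust strategy, which is the baseline any model has to beat.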

## 2.2 Time-series structure

The dataset is indexed by:

- anonymised timestamps (`TS`)
- allocation identifiers

Important:

- `TS` labels are ordered but not guaranteed continuous.
- No shuffling is allowed.
- Validation must respect temporal ordering.

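All splits are therefore performed on distinct `TS` values:

- train = first 80% of `TS` values
- validation = last 20% of `TS` values

Every validation timestamp comes after every training timestamp, so the fit never sees rows from the validation period.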
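A minimal sketch of such a split, written on plain lists rather than the notebook's dataframe. The function name and the toy timestamps are hypothetical. The cut is taken over distinct timestamp labels, so the share of rows on each side need not be exactly 80/20.

```python
def time_split(timestamps, train_fraction=0.8):
    """Split row indices so that all validation timestamps follow all training ones."""
    unique_ts = sorted(set(timestamps))
    cut = int(train_fraction * len(unique_ts))
    train_ts = set(unique_ts[:cut])
    train_idx = [i for i, ts in enumerate(timestamps) if ts in train_ts]
    val_idx = [i for i, ts in enumerate(timestamps) if ts not in train_ts]
    return train_idx, val_idx


# Hypothetical rows: several allocations share a timestamp and labels have gaps.
row_ts = [3, 1, 1, 7, 3, 9, 12, 7, 12, 1]
train_idx, val_idx = time_split(row_ts)

latest_train = max(row_ts[i] for i in train_idx)
earliest_val = min(row_ts[i] for i in val_idx)
print(train_idx, val_idx, latest_train, earliest_val)
```

Checking that the latest training timestamp precedes the earliest validation timestamp is a cheap guard against accidental shuffling.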

# 3. Feature Engineering Philosophy

The approach used in this notebook is deliberately controlled.

We avoid:

- extreme feature explosion,
- over-engineered statistical transformations,
- cross-sectional leakage tricks.

Instead, we construct:

## 3.1 Rolling return descriptors

For windows $k \in \{3, 5, 10, 20\}$:

- mean return
- volatility
- min / max
- sign fraction
- entropy of sign distribution
- linear slope (trend proxy)

These capture:

- short-term momentum,
- stability,
- directional persistence,
- trend strength.
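For a single row, three of these descriptors can be sketched as below. The window values are hypothetical, and the reading of the slope sign assumes that `RET_1` is the most recent day, which the text does not state explicitly.

```python
import math


def sign_fraction(returns):
    """Share of strictly positive returns in the window."""
    return sum(r > 0 for r in returns) / len(returns)


def sign_entropy(p, eps=1e-9):
    """Binary entropy of the sign fraction; eps guards against log(0)."""
    return -(p * math.log(p + eps) + (1 - p) * math.log(1 - p + eps))


def lag_slope(returns):
    """Closed-form OLS slope of the returns against the lag index 0..k-1."""
    k = len(returns)
    t_mean = (k - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(k))
    return sum((t - t_mean) * r for t, r in enumerate(returns)) / denom


# Hypothetical window, ordered RET_1 .. RET_5.
window = [0.004, 0.002, 0.000, -0.002, -0.004]

frac = sign_fraction(window)
entropy = sign_entropy(frac)
slope = lag_slope(window)
print(frac, entropy, slope)
```

The entropy peaks near $\ln 2$ when signs are evenly mixed and falls to about zero when they all agree. Because the slope is taken against the lag index, a negative value here means returns were higher at small lags than at large ones.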

## 3.2 Scale-free normalisations

We introduce:

$$\text{ret\_norm}_1 = \frac{RET_1}{\sigma_5}$$

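Here $\sigma_5$ is the standard deviation of `RET_1` to `RET_5`; the code adds a small constant to it to avoid division by zero. Dividing by $\sigma_5$ removes the volatility scale and makes direction comparable across allocations. The same ratio is also computed with 3-day and 10-day volatility.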
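The effect of this normalisation can be illustrated on two hypothetical allocations whose returns differ only by a factor of ten. The sketch assumes a sample standard deviation (the pandas default) and a small stabilising constant in the denominator; the function name is illustrative.

```python
import statistics


def vol_normalised_return(returns, window=5, eps=1e-6):
    """Latest return divided by the sample volatility of the first `window` returns."""
    sigma = statistics.stdev(returns[:window]) + eps
    return returns[0] / sigma


# Hypothetical allocations: B is A with ten times the volatility.
alloc_a = [0.002, -0.001, 0.001, -0.002, 0.000]
alloc_b = [r * 10 for r in alloc_a]

norm_a = vol_normalised_return(alloc_a)
norm_b = vol_normalised_return(alloc_b)
print(norm_a, norm_b)
```

The raw latest returns differ tenfold, while the normalised values nearly coincide. The small remaining difference comes from the stabilising constant.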

## 3.3 Liquidity behaviour

Analogous rolling statistics are computed on signed volumes:

- mean
- volatility
- extrema

Liquidity interacts with return stability and turnover.

## 3.4 Limited interactions

We include only a few economically motivated interactions:

- momentum × turnover
- normalised return × entropy

We deliberately avoid high-order combinatorics.

# 4. Model

We use LightGBM with:

- binary objective,
- early stopping,
- native categorical handling for `GROUP`, `ALLOCATION` and `TS`.

Key hyperparameters:

- `learning_rate = 0.03`
- `num_leaves = 128`
- `min_data_in_leaf = 200`
- `feature_fraction = 0.8`
- `bagging_fraction = 0.8`, `bagging_freq = 1`
- `lambda_l2 = 5.0` (L2 regularisation)

The model is trained on the first 80% of timestamps and validated on the last 20%.

The final model is refit on the full training data using the selected number of boosting rounds.

# 5. Results

Best public score achieved: $0.5154$

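Validation accuracy: $\approx 0.526$, about $+0.0097$ above the always-long baseline on the same validation window.

This indicates:

- a small but real predictive signal,
- a generalisation gap of roughly one accuracy point between validation and the public leaderboard,
- no sign of extreme overfitting.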

Extensive attempts at feature inflation (tail risk, cross-sectional ranks, allocation priors, regime calibration) did not improve public performance, suggesting that the current representation is close to the signal frontier under this modelling class.

# 6. Interpretation

This challenge illustrates a key principle:

A good data challenge is rarely won by a more complex model.
It is won by a better formulation.

The core signal here lies in:

- recent directional stability,
- short-term volatility scaling,
- regime heterogeneity encoded by `TS` and `GROUP`.

Beyond a certain complexity threshold, additional features introduce variance rather than signal.

# 7. Code Pipeline

The following sections implement:

- Data loading and preprocessing
- Feature engineering
- Time-based split
- LightGBM training with early stopping
- Final refit and submission generation


In [None]:
import numpy as np
import pandas as pd

import lightgbm as lgb
from sklearn.metrics import accuracy_score

X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv")
X_test  = pd.read_csv("X_test.csv")

df = X_train.merge(y_train, on="ROW_ID")
df["y"] = (df["target"] > 0).astype(int)

df = df.sort_values(["TS", "ALLOCATION"]).reset_index(drop=True)

ret_cols = [f"RET_{i}" for i in range(1, 21)]
vol_cols = [f"SIGNED_VOLUME_{i}" for i in range(1, 21)]

print(df.shape, X_test.shape)
print("Baseline always-long:", df["y"].mean())

(527073, 47) (31870, 45)
Baseline always-long: 0.5071840143585423


In [None]:
def sign_entropy(p: pd.Series) -> pd.Series:
    eps = 1e-9
    return -(p*np.log(p+eps) + (1-p)*np.log(1-p+eps))

def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    f = frame.copy()

    for k in [3, 5, 10, 20]:
        cols = [f"RET_{i}" for i in range(1, k+1)]
        f[f"ret_mean_{k}"] = f[cols].mean(axis=1)
        f[f"ret_std_{k}"]  = f[cols].std(axis=1)
        f[f"ret_min_{k}"]  = f[cols].min(axis=1)
        f[f"ret_max_{k}"]  = f[cols].max(axis=1)
        f[f"sign_frac_{k}"] = (f[cols] > 0).mean(axis=1)
        f[f"sign_entropy_{k}"] = sign_entropy(f[f"sign_frac_{k}"])

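        # Slope proxy: closed-form OLS slope cov(t, x) / var(t) over the lag index.
        # t = 0 corresponds to RET_1, so if RET_1 is the most recent day a negative
        # slope means returns were rising toward the present.
        t = np.arange(k)
        centered_t = t - t.mean()
        denom = (centered_t ** 2).sum()
        Xk = f[cols].to_numpy()
        f[f"ret_slope_{k}"] = (Xk @ centered_t) / denom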

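    # Scale-free versions of the latest return.
    # Note: vol_5 is the 5-day return volatility; the vol_* columns further down are signed-volume stats.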
    f["vol_5"] = f[[f"RET_{i}" for i in range(1, 6)]].std(axis=1) + 1e-6
    f["ret_norm_1"] = f["RET_1"] / f["vol_5"]
    f["ret_norm_3"] = f["RET_1"] / (f[[f"RET_{i}" for i in range(1, 4)]].std(axis=1) + 1e-6)
    f["ret_norm_10"] = f["RET_1"] / (f[[f"RET_{i}" for i in range(1, 11)]].std(axis=1) + 1e-6)

    # Signed volumes: same family of stats
    for k in [3, 5, 10, 20]:
        cols = [f"SIGNED_VOLUME_{i}" for i in range(1, k+1)]
        f[f"vol_mean_{k}"] = f[cols].mean(axis=1)
        f[f"vol_std_{k}"]  = f[cols].std(axis=1)
        f[f"vol_min_{k}"]  = f[cols].min(axis=1)
        f[f"vol_max_{k}"]  = f[cols].max(axis=1)

    # Interactions (few, not insane)
    f["mom5_x_turn"] = f["ret_mean_5"] * f["MEDIAN_DAILY_TURNOVER"].astype(float)
    f["norm1_x_entropy5"] = f["ret_norm_1"] * f["sign_entropy_5"]

    # Categorical encodings (simple)
    f["GROUP"] = f["GROUP"].astype("category")
    f["ALLOCATION"] = f["ALLOCATION"].astype("category")
    f["TS"] = f["TS"].astype("category")

    return f

df_feat = add_features(df)
test_feat = add_features(X_test)

print("Features added. Train columns:", df_feat.shape[1], "Test columns:", test_feat.shape[1])

Features added. Train columns: 97 Test columns: 95


In [None]:
# Time-based split by TS
ts_unique = df_feat["TS"].cat.categories.tolist() if str(df_feat["TS"].dtype) == "category" else sorted(df_feat["TS"].unique())
cut = int(0.8 * len(ts_unique))
ts_train = set(ts_unique[:cut])
ts_val   = set(ts_unique[cut:])

train_mask = df_feat["TS"].isin(ts_train)
val_mask   = df_feat["TS"].isin(ts_val)

# Feature set: drop the row id and the target columns. TS, ALLOCATION and GROUP
# stay in as categorical features, alongside the engineered features and the
# raw lags (LightGBM may exploit non-linear patterns in the lags).
drop_cols = {"ROW_ID", "target", "y"}
feature_cols = [c for c in df_feat.columns if c not in drop_cols]

X_tr = df_feat.loc[train_mask, feature_cols]
y_tr = df_feat.loc[train_mask, "y"]
X_va = df_feat.loc[val_mask, feature_cols]
y_va = df_feat.loc[val_mask, "y"]

# LightGBM datasets
dtrain = lgb.Dataset(X_tr, label=y_tr, categorical_feature=["GROUP", "ALLOCATION", "TS"], free_raw_data=False)
dvalid = lgb.Dataset(X_va, label=y_va, categorical_feature=["GROUP", "ALLOCATION", "TS"], free_raw_data=False)

params = dict(
    objective="binary",
    metric="binary_logloss",
    learning_rate=0.03,
    num_leaves=128,
    max_depth=-1,
    min_data_in_leaf=200,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=1,
    lambda_l2=5.0,
    lambda_l1=0.0,
    min_gain_to_split=0.0,
    verbose=-1,
)

model = lgb.train(
    params,
    dtrain,
    num_boost_round=5000,
    valid_sets=[dtrain, dvalid],
    valid_names=["train", "valid"],
    callbacks=[
        lgb.early_stopping(stopping_rounds=200),
        lgb.log_evaluation(period=200),
    ],
)

# Evaluate
va_proba = model.predict(X_va, num_iteration=model.best_iteration)
va_pred = (va_proba > 0.5).astype(int)
acc_val = accuracy_score(y_va, va_pred)
baseline_val = y_va.mean()  # accuracy of always predicting 1 ("always-long")

print(f"VAL accuracy: {acc_val:.6f} | delta vs always-long: {acc_val - baseline_val:+.6f} | best_iter={model.best_iteration}")

Training until validation scores don't improve for 200 rounds
[200]	train's binary_logloss: 0.576813	valid's binary_logloss: 0.691339
Early stopping, best iteration is:
[46]	train's binary_logloss: 0.651171	valid's binary_logloss: 0.690901
VAL accuracy: 0.526022 | delta vs always-long: +0.009715 | best_iter=46


In [None]:
# Refit on full data using best_iteration
X_full = df_feat[feature_cols]
y_full = df_feat["y"]

dfull = lgb.Dataset(X_full, label=y_full, categorical_feature=["GROUP", "ALLOCATION", "TS"], free_raw_data=False)

final_model = lgb.train(
    params,
    dfull,
    num_boost_round=model.best_iteration,  # lock the complexity
    valid_sets=[dfull],
    valid_names=["train"],
    callbacks=[lgb.log_evaluation(period=300)],
)

X_te = test_feat[feature_cols]
te_proba = final_model.predict(X_te)
te_pred = (te_proba > 0.5).astype(int)

submission = pd.DataFrame({
    "ROW_ID": X_test["ROW_ID"],
    "prediction": te_pred
})
submission.to_csv("submission_lgbm.csv", index=False)

print(submission.head())
print("Pred mean:", te_pred.mean())

   ROW_ID  prediction
0  527073           1
1  527074           1
2  527075           0
3  527076           0
4  527077           0
Pred mean: 0.613680577345466
