<p align="center">
    <img src="https://upload.wikimedia.org/wikipedia/commons/7/74/Logo_%C3%89cole_normale_sup%C3%A9rieure_-_PSL_%28ENS-PSL%29.svg"
             alt="ENS-PSL"
             width="475"
             style="margin-right: 30px; display: inline-block; vertical-align: middle;"/>
    <img src="https://challengedata.ens.fr/logo/public/CFM_CoRGB_300dpi_Tight_box_Er2kNvB.png"
             alt="Capital Fund Management"
             width="260"
             style="display: inline-block; vertical-align: middle;"/>
</p>

# Capital Fund Management - High Frequency Market Data Microstructure Classification
**Deep Learning for Stock Identity Recognition from Order Book Sequences**

## Data Challenge 
**Powered by ENS** 

<h3><span style="color:#800000;"><strong>Authored by:</strong> <em>Alexandre Mathias DONNAT, Sr</em></span></h3>

**Currently ranked 28/253** on *https://challengedata.ens.fr/challenges/146*

This notebook presents a deep-learning framework for identifying anonymous equities from short sequences of order-book events.
The objective is to classify each 100-event sequence into one of 24 equity classes, using both numerical microstructure variables and categorical market-event descriptors. The training set contains 160,800 such sequences, each generated by an undisclosed stock.

Each training sample consists of:

- **100 consecutive order-book events** for a single stock,

- **Microstructural features** describing bid/ask dynamics, trade activity, order flow direction, and venue information,

- An **anonymous label** `eqt_code_cat` ∈ {0,…,23} identifying the stock.

The challenge is therefore to learn the microstructure signature of each equity.

## Understanding the modeling problem

Each `obs_id` corresponds to:

- a contiguous sequence of **100 market events**,
- describing **one stock**,
- but the identity of that stock is **anonymized**.

Our task is:

> **Given 100 high-frequency events, predict which of the 24 hidden stocks generated them.**

Microstructure patterns differ across equities due to:

- **volatility regimes**
- **liquidity and spread behavior**
- **order imbalances**
- **typical trade sizes and venues**
- **market activity cycles**

These differences create **statistical fingerprints** that deep learning can extract.


## Description of the data

#### 1) `x_train.csv` - Order-book event features

Contains 16,080,000 events, forming 160,800 sequences of 100 events each.

In [1]:
import pandas as pd
df = pd.read_csv("x_train.csv")
df.head()

Unnamed: 0,obs_id,venue,order_id,action,side,price,bid,ask,bid_size,ask_size,trade,flux
0,0,4,0,A,A,0.3,0.0,0.01,100,1,False,100
1,0,4,1,A,B,-0.17,0.0,0.01,100,1,False,100
2,0,4,2,D,A,0.28,0.0,0.01,100,1,False,-100
3,0,4,3,A,A,0.3,0.0,0.01,100,1,False,100
4,0,4,4,D,A,0.37,0.0,0.01,100,1,False,-100


#### Variables include:

**Numerical features:**
- **prices**: `price`, `bid`, `ask`
- **volumes**: `bid_size`, `ask_size`
- **flow**: `flux` (signed order flow)

**Categorical features:**
- `venue` – trading venue identifier
- `side` – order side (A/B, representing bid/ask or buy/sell)
- `action` – order action type (A=Add, D=Delete, etc.)
- `trade` – boolean flag indicating if event is a trade

These are anonymized but carry meaningful **microstructural patterns** that distinguish equities.


#### 2) `y_train.csv` - Labels

In [2]:
y = pd.read_csv("y_train.csv")
y.head()

Unnamed: 0,obs_id,eqt_code_cat
0,0,10
1,1,15
2,2,0
3,3,13
4,4,0


For each sequence ID (`obs_id`), we have the true class:

- **`eqt_code_cat`** – integer from 0 to 23 representing the anonymous stock identity

#### 3) `x_test.csv` - Sequences to classify

Same format as `x_train.csv`, without labels.

Our final output has the same layout as `y_train.csv`: it contains `obs_id` and `eqt_code_cat`, where `eqt_code_cat` is predicted from `x_test.csv`.


## How the scoring works

Evaluation uses accuracy:

$$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\hat{y}_i = y_i\}$$

Because `x_test.csv` comes without labels, all model selection must rely on internal validation on a held-out part of the training set.
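As a minimal sketch of the metric above, accuracy is the mean of the element-wise equality indicator. The label arrays below are made-up illustrative values, not challenge data, and the helper name `accuracy` is hypothetical.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted class equals the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels for five sequences (illustrative only)
y_true_demo = np.array([10, 15, 0, 13, 0])
y_pred_demo = np.array([10, 3, 0, 13, 7])

acc_demo = accuracy(y_true_demo, y_pred_demo)
print(acc_demo)  # 3 correct out of 5
```

For reference, a classifier guessing uniformly at random over 24 classes would score about 1/24 ≈ 4.2% on balanced data.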

## The problem to solve

We face a multiclass sequence classification task:

- **24 classes**
- **160,800 sequences** (from 16,080,000 rows ÷ 100 time steps)
- Each sequence: **100 time steps** × (numerical + categorical features)

Microstructure data are non-stationary, noisy, and order-dependent, which justifies the use of recurrent deep models.

## Modeling pipeline

Below is the full pipeline including preprocessing, sequence reconstruction, model architecture, mathematical foundations, and parameter choices.

### I - Preprocessing

#### I.1 Reconstructing sequences

We reshape the long table (16,080,000 rows) into sequences:

$$X \in \mathbb{R}^{N_{\text{seq}} \times 100 \times d}$$

with:
- $N_{\text{seq}} = 160{,}800$
- $100$ events per sequence
- $d =$ number of features

This restores temporal order.
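The reshape can be illustrated on a toy table. This is only a sketch under stated assumptions: the sequence length is shortened to 4, and the column `step` is a hypothetical within-sequence event index introduced for the example (the notebook itself sorts on the dataset's own columns).

```python
import numpy as np
import pandas as pd

SEQ_LEN_DEMO = 4  # hypothetical short length; the challenge uses 100

# Toy long table: 3 sequences x 4 events, with rows deliberately shuffled
toy = pd.DataFrame({
    "obs_id": np.repeat([0, 1, 2], SEQ_LEN_DEMO),
    "step": np.tile(np.arange(SEQ_LEN_DEMO), 3),  # hypothetical event index
    "price": np.arange(12, dtype="float32"),
    "bid": np.arange(12, dtype="float32") * 10,
}).sample(frac=1.0, random_state=0)

# Sort so that each sequence occupies a contiguous, ordered block of rows
toy_sorted = toy.sort_values(["obs_id", "step"]).reset_index(drop=True)

n_seq = toy_sorted["obs_id"].nunique()
X_demo = toy_sorted[["price", "bid"]].to_numpy().reshape(n_seq, SEQ_LEN_DEMO, -1)
print(X_demo.shape)  # (n_seq, seq_len, d)
```

The reshape is only valid when every `obs_id` has exactly `seq_len` rows, which is why an assertion on the row count is worth keeping before reshaping.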

#### I.2 Feature structuring

**Numerical features**

We keep them as real-valued vectors, optionally scaled for stability.

**Categorical features**

We encode:
- `venue`
- `action` (action_type)
- `side`
- `trade` (trade_type)

as embeddings, i.e., learnable vectors:

$$\text{Embed}(c) = W_c \in \mathbb{R}^k$$

**Why embeddings?**
- reduce sparsity
- capture similarity relations
- improve convergence
- standard in sequence models (analogy with NLP)

Embedding dimensions used in this notebook: $k = 8$ for `venue` and `action`, $k = 4$ for `side` and `trade`.

#### I.3 Train-validation split

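We use a stratified 80/20 split at the sequence level, so that all 100 events of an `obs_id` fall on the same side:
- keeps class proportions identical in the training and validation subsets
- avoids leakage across time steps of the same sequence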

### II - Model architecture

The architecture follows a **Bi-GRU sequence encoder** with heavy regularization.

#### II.1 Embedding layers

Each categorical channel becomes a dense representation:

$$e_t = [e_t^{(\text{venue})}, e_t^{(\text{action})}, e_t^{(\text{side})}, e_t^{(\text{trade})}]$$

These embeddings are concatenated with numerical features.
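An embedding lookup is just row indexing into a table, and the concatenation fixes the per-event input width. The sketch below uses plain NumPy with randomly initialised tables (in the real model they are learned); the vocabulary sizes and embedding widths are taken from the notebook's own cells, while the integer codes and numerical features are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 100  # events per sequence

# Vocabulary sizes and embedding widths as reported in the notebook's cells
vocab = {"venue": 6, "action": 3, "side": 2, "trade": 2}
emb_dim = {"venue": 8, "action": 8, "side": 4, "trade": 4}

# One table per categorical channel (random here, learned during training)
tables = {
    name: rng.normal(size=(vocab[name], emb_dim[name])).astype("float32")
    for name in vocab
}

# Hypothetical integer-coded events and numerical features for one sequence
codes = {name: rng.integers(0, vocab[name], size=T) for name in vocab}
num_feats = rng.normal(size=(T, 6)).astype("float32")

# Embed(c) is row c of the table; fancy indexing embeds all T events at once
embedded = [tables[name][codes[name]] for name in vocab]
x_demo = np.concatenate([num_feats] + embedded, axis=-1)
print(x_demo.shape)  # (100, 6 + 8 + 8 + 4 + 4)
```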

#### II.2 Bidirectional GRU encoder

A Gated Recurrent Unit (GRU) at each time step computes:

$$h_t = \text{GRU}(x_t, h_{t-1})$$

In bidirectional mode:

$$h_t^{\text{bi}} = [h_t^{\rightarrow}, h_t^{\leftarrow}]$$

**Why GRU?**
- fewer parameters than LSTM
- good for noisy HF data
- stable hidden dynamics
- works very well with ~100 time steps

**Parameter choice:**
- hidden dimension = 64 per direction (128 after concatenation)
- recurrent_dropout = 0.10
- Bi-directional for richer temporal context

#### II.3 Mathematical rationale for the GRU

GRU applies gates:

$$\begin{align}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1})) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{align}$$

These gates allow the model to:
- retain long-range dependencies,
- suppress noise,
- adapt to regime changes in microstructure events.
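The four gate equations can be written out directly in NumPy. This is an illustrative sketch of the equations as stated (biases omitted, random untrained weights, hypothetical sizes), not a re-implementation of the Keras `GRU` layer, whose internal parameterization may differ in details.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update following the equations above (no bias terms)."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)              # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

def run_gru(x_seq, p, reverse=False):
    """Run the cell over a sequence and return the final hidden state."""
    h = np.zeros(p["Uz"].shape[0])
    steps = x_seq[::-1] if reverse else x_seq
    for x_t in steps:
        h = gru_step(x_t, h, p)
    return h

rng = np.random.default_rng(0)
d_in, d_hid, T = 30, 64, 100  # hypothetical input width, hidden size, length

def init_params():
    """Random weights for the three gates z, r, h (illustration only)."""
    p = {}
    for g in "zrh":
        p["W" + g] = rng.normal(scale=0.1, size=(d_hid, d_in))
        p["U" + g] = rng.normal(scale=0.1, size=(d_hid, d_hid))
    return p

x_seq_demo = rng.normal(size=(T, d_in))
h_fwd = run_gru(x_seq_demo, init_params())                # forward pass
h_bwd = run_gru(x_seq_demo, init_params(), reverse=True)  # backward pass
h_bi = np.concatenate([h_fwd, h_bwd])                     # bidirectional summary
print(h_bi.shape)
```

Because $h_t$ is a convex combination of $h_{t-1}$ and a $\tanh$ output, every coordinate of the hidden state stays within $[-1, 1]$, which is one reason the hidden dynamics remain stable over 100 steps.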

#### II.4 Regularisation strategy

We employ:
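- **Dropout** (0.2 on the GRU inputs, 0.4 before the output layer)
- **Recurrent dropout** (0.1)
- **Layer Normalization** on the concatenated inputs and **Batch Normalization** after the Bi-GRU
- **L2 weight penalties** ($\lambda = 10^{-4}$ on the hidden dense layer)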

**Purpose:**
- avoid overfitting on only ~160k sequences
- enhance generalization
- stabilize training

Mathematically, L2 adds:

$$L_{\text{total}} = L_{\text{CE}} + \lambda \|W\|_2^2$$
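As a small worked example of the penalized loss, the sketch below adds $\lambda \|W\|_2^2$ to a given cross-entropy value. The weight matrix and the cross-entropy value are hypothetical; $\lambda = 10^{-4}$ matches the `L2_REG` constant set in the model cell.

```python
import numpy as np

def l2_penalized_loss(ce_loss, weights, lam):
    """Cross-entropy plus lam times the sum of squared entries of each weight matrix."""
    penalty = sum(float(np.sum(W ** 2)) for W in weights)
    return ce_loss + lam * penalty

# Hypothetical 128 x 64 dense kernel with every entry equal to 0.05
W_demo = np.full((128, 64), 0.05)

# Penalty term: 1e-4 * (128 * 64 * 0.05**2) = 1e-4 * 20.48
total_demo = l2_penalized_loss(ce_loss=1.20, weights=[W_demo], lam=1e-4)
print(total_demo)
```

With weights of this magnitude the penalty is small compared with the cross-entropy term, so it mostly acts by discouraging individual weights from growing large.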

#### II.5 Softmax classifier + Label Smoothing

The final dense layer outputs:

$$\hat{p}(y = k \mid x) = \text{softmax}(Wh + b)_k$$

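Label smoothing modifies the target distribution:

$$y_k^{\text{LS}} = (1 - \epsilon) \mathbb{1}_{k=y} + \frac{\epsilon}{24}$$

with $\epsilon = 0.1$.

This reduces over-confidence and can improve generalization. Note that the loss compiled in Section 5, `SparseCategoricalCrossentropy`, takes integer labels and is configured without smoothing; training on the smoothed targets typically requires one-hot labels with a categorical cross-entropy loss.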
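The smoothed targets can be built explicitly. The following NumPy sketch constructs $y^{\text{LS}}$ for a few hypothetical labels and evaluates the cross-entropy against a uniform prediction; the function names are illustrative, not part of the notebook.

```python
import numpy as np

def smooth_labels(y, n_classes=24, eps=0.1):
    """Return (1 - eps) * one_hot(y) + eps / n_classes for integer labels y."""
    one_hot = np.eye(n_classes)[y]
    return (1.0 - eps) * one_hot + eps / n_classes

def cross_entropy(targets, probs):
    """Mean cross-entropy between target distributions and predicted probabilities."""
    return float(-np.mean(np.sum(targets * np.log(probs), axis=1)))

y_demo = np.array([10, 15, 0])          # hypothetical integer labels
targets_demo = smooth_labels(y_demo)

# Each row still sums to 1; the true class gets 0.9 + 0.1/24, the others 0.1/24
print(targets_demo[0, 10], targets_demo[0, 0])

uniform = np.full((3, 24), 1.0 / 24)
print(cross_entropy(targets_demo, uniform))  # equals log(24) for any target distribution
```

In Keras, the usual route is `keras.losses.CategoricalCrossentropy(label_smoothing=0.1)` applied to one-hot labels.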

### III - Training strategy

#### III.1 Early stopping

Training stops when validation accuracy stops improving.

**Rule:**
- Stop if `val_accuracy` has not improved for 6 consecutive epochs (`patience=6`), then restore the weights of the best epoch
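The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration of the logic (no minimum-improvement threshold), with a hypothetical validation curve and a hypothetical helper name; it is not the Keras callback itself.

```python
def early_stopping_epoch(val_acc, patience):
    """Return the 1-based epoch at which training stops, or None if it runs to the end."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_acc, start=1):
        if acc > best:
            best, wait = acc, 0      # new best: reset the counter
        else:
            wait += 1                # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies: the best value is reached at epoch 3
val_acc_demo = [0.28, 0.38, 0.42, 0.41, 0.40, 0.415, 0.419]

stop_demo = early_stopping_epoch(val_acc_demo, patience=3)
print(stop_demo)  # stops three epochs after the best one
```

A larger patience tolerates noisy validation curves at the cost of a few extra epochs of training.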

#### III.2 Learning-rate scheduling

We use:
- `ReduceLROnPlateau`

which updates:

$$\eta \leftarrow \max(\eta / 2,\ 10^{-5})$$

if validation loss has not improved for 3 consecutive epochs.
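The schedule can be simulated in plain Python. This is a simplified sketch that ignores the callback's `min_delta` and `cooldown` options; the factor, patience and floor mirror the `ReduceLROnPlateau` arguments in the training cell, and the validation losses are hypothetical.

```python
def reduce_on_plateau(val_loss, lr, factor, patience, min_lr):
    """Return the learning rate in effect during each epoch."""
    best = float("inf")
    wait = 0
    used = []
    for loss in val_loss:
        used.append(lr)              # rate used for this epoch
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:     # plateau detected: shrink the rate
                lr = max(lr * factor, min_lr)
                wait = 0
    return used

# Hypothetical validation losses: no improvement during epochs 3 to 5
val_loss_demo = [1.20, 1.11, 1.13, 1.12, 1.14, 1.09]

lrs_demo = reduce_on_plateau(val_loss_demo, lr=3e-3, factor=0.5, patience=3, min_lr=1e-5)
print(lrs_demo)  # the rate is halved starting from epoch 6
```

The same pattern is visible in the training log: after three epochs without a better `val_loss`, the reported learning rate drops from 0.0030 to 0.0015.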

#### III.3 Model Checkpoint

We save the best performing epoch, not the last one.

This prevents late-epoch degradation due to overfitting.

### IV - Re-training and Prediction

After training, we:
1. Reload the best GRU model (`best_model_cfm_gru.h5`)
2. Apply identical preprocessing to `x_test`
3. Predict class probabilities
4. Take argmax across the 24 classes
5. Produce `y_prediction.csv` accordingly
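Steps 3 to 5 can be sketched as follows. The probabilities and `obs_id` values are random placeholders standing in for real model output, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical softmax outputs for 5 sequences: each row sums to 1 over 24 classes
proba_demo = rng.dirichlet(np.ones(24), size=5)
obs_ids_demo = np.arange(5)  # hypothetical sequence identifiers

# Argmax over the class axis gives one predicted label per sequence
pred_demo = proba_demo.argmax(axis=1).astype("int32")

submission_demo = pd.DataFrame({"obs_id": obs_ids_demo, "eqt_code_cat": pred_demo})

# Basic sanity checks before writing the file
assert submission_demo["obs_id"].is_unique
assert submission_demo["eqt_code_cat"].between(0, 23).all()
print(submission_demo)
```

Keeping the `obs_id` array returned by the sequence-building step next to the predictions is what guarantees that each label is written against the right sequence.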

## Idea behind sequence classification

The model attempts to learn a mapping:

$$f : \mathbb{R}^{100 \times d} \longrightarrow \{0, \ldots, 23\}$$

This is equivalent to learning the probability distribution:

$$p(y \mid x_{1:100})$$

Microstructure features embed subtle statistical signals:
- spread regimes
- volatility bursts
- trade aggressor patterns
- venue frequency distributions
- bid/ask oscillation symmetry

The Bi-GRU captures these patterns in its hidden state trajectory.

## Possible further improvements

#### 1. Larger or deeper sequence models
Transformers (Performer, TFT, Informer)

#### 2. Ensembles
GRU + LSTM + CNN + transformer hybrid

#### 3. Advanced feature engineering
Event clustering, microstructure volatility metrics, OFI measures

#### 4. Data augmentation
Randomized rescaling, local shuffling, spread perturbation


# 0. Modules & configuration

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from sklearn.model_selection import train_test_split
import gc

pd.set_option("display.max_columns", 50)

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
tf.random.set_seed(RANDOM_STATE)

SEQ_LEN   = 100   # sequence length
N_CLASSES = 24    # eqt_code_cat ∈ {0,...,23}

# 1. Data Loading

In [None]:
X_train = pd.read_csv("x_train.csv")
X_test  = pd.read_csv("x_test.csv")
y_train = pd.read_csv("y_train.csv")  # obs_id + eqt_code_cat

print("X_train :", X_train.shape)
print("X_test  :", X_test.shape)
print("y_train :", y_train.shape)

X_train.head()

X_train : (16080000, 12)
X_test  : (8160000, 12)
y_train : (160800, 2)


Unnamed: 0,obs_id,venue,order_id,action,side,price,bid,ask,bid_size,ask_size,trade,flux
0,0,4,0,A,A,0.3,0.0,0.01,100,1,False,100
1,0,4,1,A,B,-0.17,0.0,0.01,100,1,False,100
2,0,4,2,D,A,0.28,0.0,0.01,100,1,False,-100
3,0,4,3,A,A,0.3,0.0,0.01,100,1,False,100
4,0,4,4,D,A,0.37,0.0,0.01,100,1,False,-100


# 2. Features preparation
## 2.1 Sequential Numerical Features

In [None]:
for df in (X_train, X_test):
    df["log_bid_size"] = np.log1p(df["bid_size"].clip(lower=0))
    df["log_ask_size"] = np.log1p(df["ask_size"].clip(lower=0))
    df["log_abs_flux"] = np.log1p(df["flux"].abs())

num_cols_seq = ["price", "bid", "ask", "log_bid_size", "log_ask_size", "log_abs_flux"]
len(num_cols_seq), num_cols_seq

(6, ['price', 'bid', 'ask', 'log_bid_size', 'log_ask_size', 'log_abs_flux'])

## 2.2 Categorical variables encoding (embeddings)

In [None]:
# Concat for common mapping train+test
all_venue  = pd.concat([X_train["venue"],  X_test["venue"]], axis=0)
all_action = pd.concat([X_train["action"], X_test["action"]], axis=0)
all_side   = pd.concat([X_train["side"],   X_test["side"]], axis=0)
all_trade  = pd.concat([X_train["trade"],  X_test["trade"]], axis=0)

venue2idx  = {v: i for i, v in enumerate(sorted(all_venue.unique()))}
action2idx = {v: i for i, v in enumerate(sorted(all_action.unique()))}
side2idx   = {v: i for i, v in enumerate(sorted(all_side.unique()))}
trade2idx  = {v: i for i, v in enumerate(sorted(all_trade.unique()))}

for df in (X_train, X_test):
    df["venue_idx"]  = df["venue"].map(venue2idx).astype("int32")
    df["action_idx"] = df["action"].map(action2idx).astype("int32")
    df["side_idx"]   = df["side"].map(side2idx).astype("int32")
    df["trade_idx"]  = df["trade"].map(trade2idx).astype("int32")

VENUE_VOCAB  = len(venue2idx)
ACTION_VOCAB = len(action2idx)
SIDE_VOCAB   = len(side2idx)
TRADE_VOCAB  = len(trade2idx)

VENUE_VOCAB, ACTION_VOCAB, SIDE_VOCAB, TRADE_VOCAB

(6, 3, 2, 2)

# 3. Sequential Tensors Building
## 3.1 Utility function

In [None]:
def build_sequences(df, seq_len=SEQ_LEN):
    """
    Build sequential tensors from a DataFrame X_train or X_test.
    - Sort by (obs_id, order_id)
    - Reshape in (n_obs, seq_len, features)
    Return : num_seq, venue_seq, action_seq, side_seq, trade_seq, obs_ids_sorted
    """
    df_sorted = df.sort_values(["obs_id", "order_id"]).reset_index(drop=True)
    
    obs_ids = df_sorted["obs_id"].to_numpy()
    # obs_ids is repeated seq_len times per sequence, we retrieve the unique in order
    obs_ids_unique, idx_first = np.unique(obs_ids, return_index=True)
    obs_ids_sorted = obs_ids_unique  # already sorted by construction
    
    n_obs = len(obs_ids_sorted)
    assert len(df_sorted) == n_obs * seq_len, \
        f"len(df_sorted)={len(df_sorted)} != n_obs*seq_len={n_obs}*{seq_len}"
    
    # Select columns
    num_array   = df_sorted[num_cols_seq].to_numpy().astype("float32")
    venue_array = df_sorted["venue_idx"].to_numpy().astype("int32")
    action_array= df_sorted["action_idx"].to_numpy().astype("int32")
    side_array  = df_sorted["side_idx"].to_numpy().astype("int32")
    trade_array = df_sorted["trade_idx"].to_numpy().astype("int32")
    
    num_seq    = num_array.reshape(n_obs, seq_len, -1)
    venue_seq  = venue_array.reshape(n_obs, seq_len)
    action_seq = action_array.reshape(n_obs, seq_len)
    side_seq   = side_array.reshape(n_obs, seq_len)
    trade_seq  = trade_array.reshape(n_obs, seq_len)
    
    return num_seq, venue_seq, action_seq, side_seq, trade_seq, obs_ids_sorted

## 3.2 Sequential train

In [None]:
train_num_seq, train_venue_seq, train_action_seq, train_side_seq, train_trade_seq, obs_ids_train_sorted = build_sequences(X_train)

train_num_seq.shape, train_venue_seq.shape, len(obs_ids_train_sorted)

((160800, 100, 6), (160800, 100), 160800)

## 3.3 Label alignment (memory-safe)

Here we avoid `.set_index().loc[...]`, which can raise a `MemoryError`.
Instead, we:

- sort `y_train` by `obs_id`,
- verify that the sorted `obs_id` values of `y_train` equal `obs_ids_train_sorted`,
- extract `eqt_code_cat` in that order.

In [None]:
# We first ensure that there is exactly one line per obs_id
counts = y_train["obs_id"].value_counts()
assert counts.max() == 1, "More than one line per obs_id in y_train?"

y_train_sorted = y_train.sort_values("obs_id").reset_index(drop=True)

# Check alignment
obs_y = y_train_sorted["obs_id"].to_numpy()
assert np.array_equal(obs_y, obs_ids_train_sorted), "Mismatch obs_id between X_train and y_train after sorting."

y_seq = y_train_sorted["eqt_code_cat"].to_numpy().astype("int32")
y_seq.shape

(160800,)

## 3.4 Split train / validation (by sequence)

In [None]:
n_obs_train = train_num_seq.shape[0]
idx_all = np.arange(n_obs_train, dtype="int32")

train_idx, val_idx = train_test_split(
    idx_all,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y_seq
)

train_num_seq_ = train_num_seq[train_idx]
val_num_seq_   = train_num_seq[val_idx]

train_venue_seq_  = train_venue_seq[train_idx]
val_venue_seq_    = train_venue_seq[val_idx]

train_action_seq_ = train_action_seq[train_idx]
val_action_seq_   = train_action_seq[val_idx]

train_side_seq_   = train_side_seq[train_idx]
val_side_seq_     = train_side_seq[val_idx]

train_trade_seq_  = train_trade_seq[train_idx]
val_trade_seq_    = train_trade_seq[val_idx]

y_train_seq = y_seq[train_idx]
y_val_seq   = y_seq[val_idx]

train_num_seq_.shape, val_num_seq_.shape

((128640, 100, 6), (32160, 100, 6))

## 3.5. Sequential test

In [9]:
test_num_seq, test_venue_seq, test_action_seq, test_side_seq, test_trade_seq, obs_ids_test_sorted = build_sequences(X_test)

test_num_seq.shape, len(obs_ids_test_sorted)


((81600, 100, 6), 81600)

## 3.6 Freeing DataFrame memory

In [None]:
del X_train, X_test
gc.collect()

0

# 4. Regularized bidirectional GRU model

In [None]:
N_NUM = train_num_seq_.shape[-1]

EMB_DIM_VENUE  = 8
EMB_DIM_ACTION = 8
EMB_DIM_SIDE   = 4
EMB_DIM_TRADE  = 4

GRU_UNITS      = 64
DROPOUT_SEQ    = 0.20
REC_DROPOUT    = 0.10
DROPOUT_DENSE  = 0.40
L2_REG         = 1e-4

# Inputs
num_input    = keras.Input(shape=(SEQ_LEN, N_NUM),      name="num_seq")
venue_input  = keras.Input(shape=(SEQ_LEN,), dtype="int32", name="venue_seq")
action_input = keras.Input(shape=(SEQ_LEN,), dtype="int32", name="action_seq")
side_input   = keras.Input(shape=(SEQ_LEN,), dtype="int32", name="side_seq")
trade_input  = keras.Input(shape=(SEQ_LEN,), dtype="int32", name="trade_seq")

# Embeddings
venue_emb  = layers.Embedding(VENUE_VOCAB,  EMB_DIM_VENUE,  mask_zero=False)(venue_input)
action_emb = layers.Embedding(ACTION_VOCAB, EMB_DIM_ACTION, mask_zero=False)(action_input)
side_emb   = layers.Embedding(SIDE_VOCAB,   EMB_DIM_SIDE,   mask_zero=False)(side_input)
trade_emb  = layers.Embedding(TRADE_VOCAB,  EMB_DIM_TRADE,  mask_zero=False)(trade_input)

# Concatenate
x_seq = layers.Concatenate(axis=-1)([num_input, venue_emb, action_emb, side_emb, trade_emb])
x_seq = layers.LayerNormalization()(x_seq)

# GRU forward / backward
gru_fwd = layers.GRU(
    GRU_UNITS,
    return_sequences=False,
    dropout=DROPOUT_SEQ,
    recurrent_dropout=REC_DROPOUT,
    name="gru_forward"
)
gru_bwd = layers.GRU(
    GRU_UNITS,
    return_sequences=False,
    go_backwards=True,
    dropout=DROPOUT_SEQ,
    recurrent_dropout=REC_DROPOUT,
    name="gru_backward"
)

x_fwd = gru_fwd(x_seq)
x_bwd = gru_bwd(x_seq)

x = layers.Concatenate(name="bi_concat")([x_fwd, x_bwd])  # 128 dim
x = layers.BatchNormalization()(x)

x = layers.Dense(
    64,
    activation="selu",
    kernel_regularizer=keras.regularizers.l2(L2_REG),
    name="dense_hidden"
)(x)

x = layers.Dropout(DROPOUT_DENSE, name="dropout_hidden")(x)

outputs = layers.Dense(N_CLASSES, activation="softmax", name="logits")(x)

model = keras.Model(
    inputs={
        "num_seq":    num_input,
        "venue_seq":  venue_input,
        "action_seq": action_input,
        "side_seq":   side_input,
        "trade_seq":  trade_input,
    },
    outputs=outputs,
    name="cfm_gru_bidir_regularized"
)

model.summary()

# 5. Model Compilation

In [None]:
loss_fn = keras.losses.SparseCategoricalCrossentropy(
    from_logits=False
)

optimizer = keras.optimizers.Adam(learning_rate=3e-3)

model.compile(
    optimizer=optimizer,
    loss=loss_fn,
    metrics=["accuracy"],
)


# 6. Training with early stopping + ReduceLROnPlateau
Note: this cell takes more than one hour to run.

In [None]:
EPOCHS     = 35
BATCH_SIZE = 128

checkpoint = keras.callbacks.ModelCheckpoint(
    "best_model_cfm_gru.h5",
    monitor="val_accuracy",
    mode="max",
    save_best_only=True,
    verbose=1
)

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_accuracy",
    mode="max",
    patience=6,
    restore_best_weights=True,
    verbose=1
)

reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.5,
    patience=3,
    min_lr=1e-5,
    verbose=1
)

history = model.fit(
    {
        "num_seq":    train_num_seq_,
        "venue_seq":  train_venue_seq_,
        "action_seq": train_action_seq_,
        "side_seq":   train_side_seq_,
        "trade_seq":  train_trade_seq_,
    },
    y_train_seq,
    validation_data=(
        {
            "num_seq":    val_num_seq_,
            "venue_seq":  val_venue_seq_,
            "action_seq": val_action_seq_,
            "side_seq":   val_side_seq_,
            "trade_seq":  val_trade_seq_,
        },
        y_val_seq,
    ),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=[checkpoint, early_stop, reduce_lr],
    verbose=2
)

Epoch 1/35

Epoch 1: val_accuracy improved from None to 0.28265, saving model to best_model_cfm_gru.h5




1005/1005 - 155s - 154ms/step - accuracy: 0.1699 - loss: 2.6204 - val_accuracy: 0.2826 - val_loss: 2.1475 - learning_rate: 0.0030
Epoch 2/35

Epoch 2: val_accuracy improved from 0.28265 to 0.37711, saving model to best_model_cfm_gru.h5




1005/1005 - 156s - 155ms/step - accuracy: 0.2887 - loss: 2.1261 - val_accuracy: 0.3771 - val_loss: 1.8240 - learning_rate: 0.0030
Epoch 3/35

Epoch 3: val_accuracy improved from 0.37711 to 0.41835, saving model to best_model_cfm_gru.h5




1005/1005 - 157s - 156ms/step - accuracy: 0.3493 - loss: 1.9252 - val_accuracy: 0.4183 - val_loss: 1.6872 - learning_rate: 0.0030
Epoch 4/35

Epoch 4: val_accuracy improved from 0.41835 to 0.46029, saving model to best_model_cfm_gru.h5




1005/1005 - 152s - 151ms/step - accuracy: 0.3893 - loss: 1.8010 - val_accuracy: 0.4603 - val_loss: 1.5505 - learning_rate: 0.0030
Epoch 5/35

Epoch 5: val_accuracy improved from 0.46029 to 0.49263, saving model to best_model_cfm_gru.h5




1005/1005 - 148s - 147ms/step - accuracy: 0.4168 - loss: 1.7187 - val_accuracy: 0.4926 - val_loss: 1.4629 - learning_rate: 0.0030
Epoch 6/35

Epoch 6: val_accuracy improved from 0.49263 to 0.52043, saving model to best_model_cfm_gru.h5




1005/1005 - 216s - 215ms/step - accuracy: 0.4389 - loss: 1.6528 - val_accuracy: 0.5204 - val_loss: 1.4008 - learning_rate: 0.0030
Epoch 7/35

Epoch 7: val_accuracy improved from 0.52043 to 0.52534, saving model to best_model_cfm_gru.h5




1005/1005 - 161s - 160ms/step - accuracy: 0.4576 - loss: 1.6035 - val_accuracy: 0.5253 - val_loss: 1.3754 - learning_rate: 0.0030
Epoch 8/35

Epoch 8: val_accuracy improved from 0.52534 to 0.54956, saving model to best_model_cfm_gru.h5




1005/1005 - 154s - 153ms/step - accuracy: 0.4691 - loss: 1.5676 - val_accuracy: 0.5496 - val_loss: 1.3104 - learning_rate: 0.0030
Epoch 9/35

Epoch 9: val_accuracy improved from 0.54956 to 0.55124, saving model to best_model_cfm_gru.h5




1005/1005 - 175s - 174ms/step - accuracy: 0.4808 - loss: 1.5342 - val_accuracy: 0.5512 - val_loss: 1.2984 - learning_rate: 0.0030
Epoch 10/35

Epoch 10: val_accuracy did not improve from 0.55124
1005/1005 - 141s - 141ms/step - accuracy: 0.4922 - loss: 1.5076 - val_accuracy: 0.5506 - val_loss: 1.2943 - learning_rate: 0.0030
Epoch 11/35

Epoch 11: val_accuracy improved from 0.55124 to 0.56794, saving model to best_model_cfm_gru.h5




1005/1005 - 124s - 123ms/step - accuracy: 0.5004 - loss: 1.4814 - val_accuracy: 0.5679 - val_loss: 1.2543 - learning_rate: 0.0030
Epoch 12/35

Epoch 12: val_accuracy improved from 0.56794 to 0.56990, saving model to best_model_cfm_gru.h5




1005/1005 - 123s - 122ms/step - accuracy: 0.5064 - loss: 1.4636 - val_accuracy: 0.5699 - val_loss: 1.2471 - learning_rate: 0.0030
Epoch 13/35

Epoch 13: val_accuracy improved from 0.56990 to 0.58047, saving model to best_model_cfm_gru.h5




1005/1005 - 131s - 131ms/step - accuracy: 0.5135 - loss: 1.4445 - val_accuracy: 0.5805 - val_loss: 1.2091 - learning_rate: 0.0030
Epoch 14/35

Epoch 14: val_accuracy improved from 0.58047 to 0.58604, saving model to best_model_cfm_gru.h5




1005/1005 - 126s - 126ms/step - accuracy: 0.5191 - loss: 1.4263 - val_accuracy: 0.5860 - val_loss: 1.1876 - learning_rate: 0.0030
Epoch 15/35

Epoch 15: val_accuracy did not improve from 0.58604
1005/1005 - 122s - 121ms/step - accuracy: 0.5230 - loss: 1.4154 - val_accuracy: 0.5729 - val_loss: 1.2287 - learning_rate: 0.0030
Epoch 16/35

Epoch 16: val_accuracy improved from 0.58604 to 0.59350, saving model to best_model_cfm_gru.h5




1005/1005 - 138s - 138ms/step - accuracy: 0.5282 - loss: 1.4002 - val_accuracy: 0.5935 - val_loss: 1.1726 - learning_rate: 0.0030
Epoch 17/35

Epoch 17: val_accuracy improved from 0.59350 to 0.59916, saving model to best_model_cfm_gru.h5




1005/1005 - 137s - 136ms/step - accuracy: 0.5333 - loss: 1.3907 - val_accuracy: 0.5992 - val_loss: 1.1736 - learning_rate: 0.0030
Epoch 18/35

Epoch 18: val_accuracy improved from 0.59916 to 0.60351, saving model to best_model_cfm_gru.h5




1005/1005 - 127s - 126ms/step - accuracy: 0.5379 - loss: 1.3762 - val_accuracy: 0.6035 - val_loss: 1.1516 - learning_rate: 0.0030
Epoch 19/35

Epoch 19: val_accuracy did not improve from 0.60351
1005/1005 - 162s - 161ms/step - accuracy: 0.5427 - loss: 1.3639 - val_accuracy: 0.6006 - val_loss: 1.1665 - learning_rate: 0.0030
Epoch 20/35

Epoch 20: val_accuracy did not improve from 0.60351
1005/1005 - 188s - 187ms/step - accuracy: 0.5474 - loss: 1.3550 - val_accuracy: 0.5960 - val_loss: 1.1720 - learning_rate: 0.0030
Epoch 21/35

Epoch 21: val_accuracy improved from 0.60351 to 0.60435, saving model to best_model_cfm_gru.h5




1005/1005 - 113s - 112ms/step - accuracy: 0.5486 - loss: 1.3490 - val_accuracy: 0.6044 - val_loss: 1.1437 - learning_rate: 0.0030
Epoch 22/35

Epoch 22: val_accuracy improved from 0.60435 to 0.62006, saving model to best_model_cfm_gru.h5




1005/1005 - 122s - 121ms/step - accuracy: 0.5520 - loss: 1.3407 - val_accuracy: 0.6201 - val_loss: 1.1103 - learning_rate: 0.0030
Epoch 23/35

Epoch 23: val_accuracy did not improve from 0.62006
1005/1005 - 137s - 136ms/step - accuracy: 0.5547 - loss: 1.3341 - val_accuracy: 0.6101 - val_loss: 1.1346 - learning_rate: 0.0030
Epoch 24/35

Epoch 24: val_accuracy did not improve from 0.62006
1005/1005 - 124s - 123ms/step - accuracy: 0.5578 - loss: 1.3273 - val_accuracy: 0.6122 - val_loss: 1.1247 - learning_rate: 0.0030
Epoch 25/35

Epoch 25: val_accuracy did not improve from 0.62006

Epoch 25: ReduceLROnPlateau reducing learning rate to 0.001500000013038516.
1005/1005 - 112s - 112ms/step - accuracy: 0.5459 - loss: 1.3624 - val_accuracy: 0.6101 - val_loss: 1.1417 - learning_rate: 0.0030
Epoch 26/35

Epoch 26: val_accuracy did not improve from 0.62006
1005/1005 - 113s - 113ms/step - accuracy: 0.5448 - loss: 1.3700 - val_accuracy: 0.6154 - val_loss: 1.1223 - learning_rate: 0.0015
Epoch 27/35

Epoch 27: val_accuracy improved from 0.62006 to 0.62114, saving model to best_model_cfm_gru.h5



1005/1005 - 113s - 112ms/step - accuracy: 0.5560 - loss: 1.3300 - val_accuracy: 0.6211 - val_loss: 1.0915 - learning_rate: 0.0015
Epoch 28/35

Epoch 28: val_accuracy improved from 0.62114 to 0.62677, saving model to best_model_cfm_gru.h5




1005/1005 - 115s - 114ms/step - accuracy: 0.5635 - loss: 1.3095 - val_accuracy: 0.6268 - val_loss: 1.0880 - learning_rate: 0.0015
Epoch 29/35

Epoch 29: val_accuracy improved from 0.62677 to 0.63088, saving model to best_model_cfm_gru.h5




1005/1005 - 134s - 133ms/step - accuracy: 0.5684 - loss: 1.2924 - val_accuracy: 0.6309 - val_loss: 1.0733 - learning_rate: 0.0015
Epoch 30/35

Epoch 30: val_accuracy improved from 0.63088 to 0.63165, saving model to best_model_cfm_gru.h5




1005/1005 - 135s - 135ms/step - accuracy: 0.5731 - loss: 1.2784 - val_accuracy: 0.6317 - val_loss: 1.0661 - learning_rate: 0.0015
Epoch 31/35

Epoch 31: val_accuracy did not improve from 0.63165
1005/1005 - 142s - 141ms/step - accuracy: 0.5772 - loss: 1.2667 - val_accuracy: 0.6295 - val_loss: 1.0715 - learning_rate: 0.0015
Epoch 32/35

Epoch 32: val_accuracy improved from 0.63165 to 0.63209, saving model to best_model_cfm_gru.h5




1005/1005 - 127s - 126ms/step - accuracy: 0.5758 - loss: 1.2633 - val_accuracy: 0.6321 - val_loss: 1.0608 - learning_rate: 0.0015
Epoch 33/35

Epoch 33: val_accuracy improved from 0.63209 to 0.63355, saving model to best_model_cfm_gru.h5




1005/1005 - 131s - 131ms/step - accuracy: 0.5815 - loss: 1.2524 - val_accuracy: 0.6336 - val_loss: 1.0571 - learning_rate: 0.0015
Epoch 34/35

Epoch 34: val_accuracy did not improve from 0.63355
1005/1005 - 125s - 125ms/step - accuracy: 0.5829 - loss: 1.2467 - val_accuracy: 0.6336 - val_loss: 1.0596 - learning_rate: 0.0015
Epoch 35/35

Epoch 35: val_accuracy did not improve from 0.63355
1005/1005 - 142s - 141ms/step - accuracy: 0.5725 - loss: 1.2783 - val_accuracy: 0.6310 - val_loss: 1.0635 - learning_rate: 0.0015
Restoring model weights from the end of the best epoch: 33.


# 7. Best model reload and validation check

In [None]:
best_model = keras.models.load_model("best_model_cfm_gru.h5")

val_metrics = best_model.evaluate(
    {
        "num_seq":    val_num_seq_,
        "venue_seq":  val_venue_seq_,
        "action_seq": val_action_seq_,
        "side_seq":   val_side_seq_,
        "trade_seq":  val_trade_seq_,
    },
    y_val_seq,
    batch_size=BATCH_SIZE,
    verbose=1
)

print("Validation loss, accuracy :", val_metrics)



252/252 ━━━━━━━━━━━━━━━━━━━━ 11s 35ms/step - accuracy: 0.6336 - loss: 1.0571
Validation loss, accuracy : [1.0571269989013672, 0.6335510015487671]


# 7.2. Submission DataFrame & sanity checks

In [None]:
# Generate predictions on test set
y_pred_proba = best_model.predict(
    {
        "num_seq":    test_num_seq,
        "venue_seq":  test_venue_seq,
        "action_seq": test_action_seq,
        "side_seq":   test_side_seq,
        "trade_seq":  test_trade_seq,
    },
    batch_size=BATCH_SIZE,
    verbose=1
)

# Get the class with highest probability
y_pred_test = y_pred_proba.argmax(axis=1).astype("int32")

submission = pd.DataFrame({
    "obs_id": obs_ids_test_sorted,
    "eqt_code_cat": y_pred_test
})

print(submission.head())
print(submission["eqt_code_cat"].value_counts().sort_index())

print("NaN in submission:")
print(submission.isna().sum())

[1m638/638[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m29s[0m 43ms/step
   obs_id  eqt_code_cat
0       0            11
1       1             3
2       2            19
3       3             6
4       4            22
eqt_code_cat
0     3268
1     3252
2     2856
3     2642
4     5045
5     4490
6     4246
7     3415
8     2643
9     4309
10    3385
11    2038
12    7137
13    3957
14    1954
15    2440
16    2003
17    3913
18    2309
19    5508
20    1833
21    2244
22    3570
23    3143
Name: count, dtype: int64
NaN in submission:
obs_id          0
eqt_code_cat    0
dtype: int64


# 7.3. Export CSV

In [None]:
submission.to_csv("y_prediction.csv", index=False)
print(">> y_prediction.csv created")

>> y_prediction.csv created
