# Data Analysis using the FI-2010 dataset

* FI-2010 is a [publicly available dataset](https://arxiv.org/abs/1705.03233) for mid-price forecasting from limit order book data. 

* It covers five stocks from the NASDAQ Nordic stock market over 10 consecutive trading days, about 4,000,000 samples in total. 


## The data: 
```text
.
└── BenchmarkDatasets
    └── BenchmarkDatasets
        ├── Auction
        │   ├── 1.Auction_Zscore
        │   │   ├── Auction_Zscore_Testing
        │   │   └── Auction_Zscore_Training
        │   ├── 2.Auction_MinMax
        │   │   ├── Auction_MinMax_Testing
        │   │   └── Auction_MinMax_Training
        │   └── 3.Auction_DecPre
        │       ├── Auction_DecPre_Testing
        │       └── Auction_DecPre_Training
        └── NoAuction
            ├── 1.NoAuction_Zscore
            │   ├── NoAuction_Zscore_Testing
            │   └── NoAuction_Zscore_Training
            ├── 2.NoAuction_MinMax
            │   ├── NoAuction_MinMax_Testing
            │   └── NoAuction_MinMax_Training
            └── 3.NoAuction_DecPre
                ├── NoAuction_DecPre_Testing
                └── NoAuction_DecPre_Training
```

The FI-2010 benchmark release bundles several preprocessed versions of the limit order book (LOB). The folder names encode two choices: whether auction periods are included, and which normalization was applied.


### Has Auction?
* **Auction:** Includes samples from the opening and closing auctions of the day. These behave differently from continuous trading and can contain large batch orders or jumps in price. 

* **NoAuction:** Auction periods are removed, leaving only continuous trading hours. This gives a *cleaner* dataset. 

### Zscore, MinMax, DecPre
These are the preprocessing and normalization schemes, that is, how the limit order book was normalized before it was written out. 

* **Zscore:** 
    * Each feature is standardized: $$x' = \frac{x-\mu}{\sigma}$$
    * The result has zero mean and unit variance.
    * This usually helps gradient-based training converge.

* **MinMax:**
    * Each feature is scaled to [0,1].
    * Preserves the shape of the distribution, but is sensitive to outliers.

* **DecPre:**
    * Prices are divided by a fixed constant to reduce decimal precision.
    * Values end up in a small numeric range without relying on per-feature statistics.
    * Sometimes used in microstructure studies.
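
The three schemes can be sketched in a few lines of NumPy. This is a minimal illustration on toy data, not the code used to build the benchmark files. The function names are hypothetical, and reading DecPre as decimal scaling (dividing each feature by a power of ten so that its largest absolute value drops below one) is an assumption.

```python
import numpy as np


def zscore(x: np.ndarray) -> np.ndarray:
    """Standardize each column to zero mean and unit variance."""
    return (x - x.mean(axis=0, keepdims=True)) / x.std(axis=0, keepdims=True)


def minmax(x: np.ndarray) -> np.ndarray:
    """Scale each column linearly to the range [0, 1]."""
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    return (x - lo) / (hi - lo)


def decimal_scale(x: np.ndarray) -> np.ndarray:
    """Divide each column by 10**k, with k chosen so that max |x'| < 1 (assumed DecPre)."""
    max_abs = np.abs(x).max(axis=0, keepdims=True)
    k = np.floor(np.log10(max_abs)) + 1
    return x / 10.0 ** k


# Toy "price" and "volume" columns on very different scales.
toy = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
z, m, d = zscore(toy), minmax(toy), decimal_scale(toy)
print(z.round(3), m, d, sep="\n")
```

Only the first two depend on per-feature statistics of the data (mean and standard deviation, or minimum and maximum), which is why an outlier can squash a MinMax-scaled feature.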


### For the XGBoost -> Transformer pipeline
Start with the **NoAuction** data and **Zscore** normalization: 

```text
.
├── NoAuction_Zscore_Testing
│   ├── Test_Dst_NoAuction_ZScore_CF_1.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_2.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_3.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_4.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_5.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_6.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_7.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_8.txt
│   └── Test_Dst_NoAuction_ZScore_CF_9.txt
└── NoAuction_Zscore_Training
    ├── Train_Dst_NoAuction_ZScore_CF_1.txt
    ├── Train_Dst_NoAuction_ZScore_CF_2.txt
    ├── Train_Dst_NoAuction_ZScore_CF_3.txt
    ├── Train_Dst_NoAuction_ZScore_CF_4.txt
    ├── Train_Dst_NoAuction_ZScore_CF_5.txt
    ├── Train_Dst_NoAuction_ZScore_CF_6.txt
    ├── Train_Dst_NoAuction_ZScore_CF_7.txt
    ├── Train_Dst_NoAuction_ZScore_CF_8.txt
    └── Train_Dst_NoAuction_ZScore_CF_9.txt
```

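In the raw `.txt` files, each column is one sample and each row is one feature or label: 144 feature rows followed by 5 label rows, one per prediction horizon. The labels are coded 1/2/3. 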

## Loader and Parser

Skipped for now, because I found a cleaned version of the data in .csv format.

In [1]:
import glob
import os

import numpy as np
import pandas as pd
from tqdm import tqdm

N_FEATURES = 144  # assumed raw layout: 144 feature rows, then 5 label rows

def load_FI2010_split(train_dir, test_dir):
    def load_folder(folder):
        print(f"Loading folder {folder}")
        # NOTE: the CF_1..CF_9 files look like cumulative cross-validation folds,
        # so concatenating all of them may duplicate samples.
        all_files = glob.glob(os.path.join(folder, "*.txt"))
        
        data_list = []
        for file in tqdm(all_files):
            print(f"Loading {file}")
            # `delim_whitespace=True` is deprecated in recent pandas; `sep=r"\s+"` is equivalent.
            df = pd.read_csv(file, header=None, sep=r"\s+")
            # The raw files store one sample per column, so transpose to one sample per row.
            data_list.append(df.T)
        big = pd.concat(data_list, axis=0, ignore_index=True)
        return big
    
    train_df = load_folder(train_dir)
    test_df = load_folder(test_dir)

    # Features come first; the last column is taken as the label.
    X_train = train_df.iloc[:, :N_FEATURES].astype(np.float32).values
    y_train = train_df.iloc[:, -1].astype(np.int64).values

    X_test = test_df.iloc[:, :N_FEATURES].astype(np.float32).values
    y_test = test_df.iloc[:, -1].astype(np.int64).values

    return (X_train, y_train), (X_test, y_test)

train_dir = "published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training"
test_dir = "published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing"

In [2]:
(X_train, y_train), (X_test, y_test) = load_FI2010_split(train_dir, test_dir)

Loading folder published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training/Train_Dst_NoAuction_ZScore_CF_8.txt

  0%|          | 0/9 [00:03<?, ?it/s]

KeyboardInterrupt: 

**TODO:** finish the parser

## Reading preprocessed CSVs instead? 

I am assuming that these CSVs hold the Zscore-normalized data without auction periods. 

In [3]:
train = pd.read_csv('FI2010_train.csv', header=0, index_col=0)
test = pd.read_csv('FI2010_test.csv', header=0, index_col=0)

X_train = train.iloc[:, :-1].astype(np.float32)
y_train = train.iloc[:, -1].astype(np.int64)

X_test = test.iloc[:, :-1].astype(np.float32)
y_test = test.iloc[:, -1].astype(np.int64)

### Features

#### Confirm this is the ZScore normalized one: 

In [4]:
import seaborn as sns

feature_0 = X_train.loc[:, '0']
feature_0.describe() # <---- ZScore standardized

count    3.624000e+05
mean     2.155767e-08
std      1.000000e+00
min     -1.065622e+00
25%     -9.833034e-01
50%     -5.266306e-01
75%      1.156003e+00
max      1.355919e+00
Name: 0, dtype: float64

#### Are there any NaNs? 

In [5]:
# Checks for NaNs
print(f"X_train nans: {X_train.isna().any().any()}")
print(f"X_test nans: {X_test.isna().any().any()}")
print(f"y_train nans: {y_train.isna().any()}")  # y is a Series, so one .any() is enough
print(f"y_test nans: {y_test.isna().any()}")

X_train nans: False
X_test nans: False
y_train nans: False
y_test nans: False


### Target vars
The original datasets use: 
* 1 -> Down
* 2 -> Stationary
* 3 -> Up

Remap with `y = y - 1` so the classes are 0, 1, 2, which is what XGBoost and PyTorch expect.

In [6]:
# Remap labels from 1,2,3 to 0,1,2 (run this cell only once)
y_test = y_test - 1 
y_train = y_train - 1

In [7]:
# -> target vars:
y_test.unique(), y_train.unique()

(array([1, 2, 0]), array([1, 0, 2]))

#### Sanity check

FI-2010 rows are LOB snapshots. The first two columns are typically the best ask price and ask volume, and the next two are the best bid price and bid volume.

In [8]:
def compute_mid(df):
    best_ask = df.iloc[:, 0]
    best_bid = df.iloc[:, 2]
    mid = (best_ask + best_bid) / 2
    return mid

The label is the direction of the mid-price $k$ steps ahead. 

**No nice signal.** Chad says: 

That's actually a super interesting finding 👍: it means the labels in your preprocessed CSV are not computed on the fly from best bid/ask, but were baked in during preprocessing. That's exactly how the FI-2010 BenchmarkDatasets were released:

* They took the raw ITCH order book feed.
* They defined the label as the mid-price movement after 10 events (not 10 rows in the normalized file, but 10 events in the original event stream).
* Then they normalized everything (Zscore, MinMax, DecPre) and saved features and labels together.

👉 So if you try to recompute labels just from your normalized CSV, they won't line up, because you've lost the raw event stream ordering. That's why you don't see agreement even with longer lags. 

And that's awesome. 

In [9]:
def labels_from_mid(mid: pd.Series, horizon: int = 10, eps: float = 1e-5) -> pd.Series:
    future_mid = mid.shift(-horizon)
    diff = future_mid - mid
    # Default to 1 (stationary). The last `horizon` rows have no future value and keep this default.
    labels = pd.Series(1, index=mid.index)
    labels[diff > eps] = 0  # Up
    labels[diff < -eps] = 2  # Down
    return labels
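
One way to quantify "no nice signal" is an agreement rate between recomputed and stored labels. The sketch below is self-contained and runs on toy data. `direction_labels` mirrors the encoding of `labels_from_mid` (0 up, 1 stationary, 2 down), while `agreement_rate` and the toy series are hypothetical and only illustrate the comparison.

```python
import pandas as pd


def direction_labels(mid: pd.Series, horizon: int, eps: float = 1e-5) -> pd.Series:
    """Label each row by the sign of the mid-price change `horizon` rows ahead."""
    diff = mid.shift(-horizon) - mid
    labels = pd.Series(1, index=mid.index)  # 1 = stationary
    labels[diff > eps] = 0                  # 0 = up
    labels[diff < -eps] = 2                 # 2 = down
    return labels


def agreement_rate(recomputed: pd.Series, stored: pd.Series, horizon: int) -> float:
    """Share of matching labels, ignoring the last `horizon` rows that have no future value."""
    n_valid = len(stored) - horizon
    matches = recomputed.to_numpy()[:n_valid] == stored.to_numpy()[:n_valid]
    return float(matches.mean())


# Toy mid-price path and a made-up "stored" label column.
mid = pd.Series([1.0, 1.0, 2.0, 2.0, 1.0, 1.0])
stored = pd.Series([0, 0, 2, 1, 1, 1])
recomputed = direction_labels(mid, horizon=2)
rate = agreement_rate(recomputed, stored, horizon=2)
print(recomputed.tolist(), rate)
```

With three classes, a rate close to the share of the largest class would mean the recomputed labels carry essentially no information about the stored ones.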

### Baseline

The baseline is an XGBoost classifier trained on the flat per-row feature vectors.

In [10]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import classification_report, confusion_matrix

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    'objective': 'multi:softprob',
    'num_class': 3,
    'eval_metric': 'mlogloss', # <--- use r2 instead for regression? 
    'eta': 0.1,
    'max_depth': 6,
    'seed': 42,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'tree_method': 'hist'
}


In [11]:
bst = xgb.train(params, dtrain, num_boost_round=200,
                evals=[(dtrain, 'train'), (dtest, 'test')],
                verbose_eval=50)

[0]	train-mlogloss:1.05562	test-mlogloss:1.05988
[50]	train-mlogloss:0.74344	test-mlogloss:0.81278
[100]	train-mlogloss:0.71353	test-mlogloss:0.80772
[150]	train-mlogloss:0.69418	test-mlogloss:0.80565
[199]	train-mlogloss:0.67858	test-mlogloss:0.80677


In [21]:
y_pred = bst.predict(dtest).argmax(axis=1)
print(classification_report(y_test, y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.6662    0.6118    0.6378     12237
           1     0.6307    0.8050    0.7073      8748
           2     0.6510    0.5667    0.6059     10952

    accuracy                         0.6492     31937
   macro avg     0.6493    0.6612    0.6503     31937
weighted avg     0.6513    0.6492    0.6459     31937
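
To put the 0.6492 accuracy in context, it helps to compare it with a classifier that always predicts the most frequent class. The sketch below rebuilds a label vector from the support column of the report above; `majority_class_accuracy` is a hypothetical helper name.

```python
import numpy as np


def majority_class_accuracy(y_true: np.ndarray) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(y_true, return_counts=True)
    return float(counts.max() / counts.sum())


# Class supports copied from the classification report above.
support = {0: 12237, 1: 8748, 2: 10952}
y_true = np.repeat(list(support.keys()), list(support.values()))

baseline = majority_class_accuracy(y_true)
print(f"Majority-class accuracy: {baseline:.4f}")
```

The majority class covers about 38% of the test rows, so the XGBoost model is well above this trivial baseline.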



## Deep learning for sequential modelling

In [22]:
import torch 
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

In [12]:
X_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,138,139,140,141,142,143,144,145,146,147
0,0.318116,-0.564619,0.313539,-0.551889,0.319726,-0.731228,0.312891,-0.425448,0.319404,-0.844157,...,-0.139927,-0.816832,-0.825238,0.0,0.0,0.0,2.0,2.0,2.0,2.0
1,0.318116,-0.662079,0.313539,-0.551889,0.320706,-0.751891,0.312891,-0.425448,0.320383,-0.854876,...,-0.139927,0.464300,0.452887,0.0,0.0,0.0,2.0,2.0,2.0,2.0
2,0.317136,-0.723163,0.313539,-0.551889,0.316787,-0.731228,0.312891,-0.425448,0.317445,-0.762942,...,-0.139927,-0.798788,-0.807237,0.0,0.0,0.0,3.0,3.0,2.0,2.0
3,0.317136,-0.585895,0.313539,-0.551889,0.318747,-0.307628,0.312891,-0.425448,0.319404,-0.561348,...,-0.139927,0.465974,0.454558,0.0,0.0,0.0,2.0,2.0,3.0,2.0
4,0.317136,-0.585895,0.313539,-0.551889,0.318747,-0.307628,0.312891,-0.425448,0.319404,-0.561348,...,-0.139927,-0.410306,-0.419666,0.0,0.0,0.0,1.0,1.0,1.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
362395,1.328480,-0.608544,1.329517,-0.550603,1.327892,-0.795514,1.330085,-0.719126,1.328302,-0.870542,...,-0.139927,-0.697642,-0.706328,0.0,0.0,0.0,1.0,1.0,1.0,1.0
362396,1.328480,-0.608544,1.329517,-0.550603,1.327892,-0.795514,1.330085,-0.611748,1.328302,-0.870542,...,-0.139927,0.460395,0.448992,0.0,0.0,0.0,2.0,2.0,1.0,1.0
362397,1.328480,-0.608544,1.329517,-0.550603,1.327892,-0.795514,1.330085,-0.611748,1.328302,-0.870542,...,-0.139927,-0.827738,-0.836118,0.0,0.0,0.0,2.0,2.0,2.0,1.0
362398,1.328480,-0.668256,1.329517,-0.354503,1.327892,-0.795514,1.330085,-0.611748,1.328302,-0.870542,...,-0.139927,0.590491,0.578783,0.0,0.0,0.0,2.0,2.0,2.0,1.0


In [14]:
print(f"Shape of X_train: {X_train.shape, y_train.shape}. Classes: {np.unique(y_train, return_counts=True)}. ")

Shape of X_train: ((362400, 148), (362400,)). Classes: (array([0, 1, 2]), array([139206,  88749, 134445])). 
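
One thing worth checking in the `X_train` preview above: columns 144 to 147 only take values such as 1.0, 2.0 and 3.0, which look like labels for other prediction horizons rather than order book features. If that is the case, keeping them in `X_train` leaks label information into the model. The sketch below shows one way to separate the two groups. The 144 + 5 column layout, the horizon order, and the helper name are all assumptions to verify against the actual CSV.

```python
import numpy as np
import pandas as pd

N_FEATURES = 144                 # assumption: the first 144 columns are features
HORIZONS = (10, 20, 30, 50, 100)  # assumption: one label column per horizon, in this order


def split_features_labels(frame: pd.DataFrame, horizon: int = 100) -> tuple[pd.DataFrame, pd.Series]:
    """Return the feature columns and the zero-based label for one horizon."""
    features = frame.iloc[:, :N_FEATURES].astype(np.float32)
    label_col = N_FEATURES + HORIZONS.index(horizon)
    labels = frame.iloc[:, label_col].astype(np.int64) - 1  # 1,2,3 -> 0,1,2
    return features, labels


# Toy frame with the assumed layout: 144 feature columns and 5 label columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(np.hstack([rng.normal(size=(6, 144)), rng.integers(1, 4, size=(6, 5))]))

X_toy, y_toy = split_features_labels(toy, horizon=100)
print(X_toy.shape, sorted(y_toy.unique()))
```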


### Windowed Dataset

In [15]:
import torch
from torch.utils.data import DataLoader, Dataset


class WindowedDataset(Dataset):
    """A PyTorch Dataset that creates sliding windows over time series data.

    Each item is a window of `window_length` consecutive rows of `X`, paired
    with the label of the last row in that window.

    """

    def __init__(self,
                 X:pd.DataFrame,  # noqa: N803
                 y:pd.Series,
                 window_length:int=50,
                 stride:int=1) -> None:
        """Initialize the WindowedDataset.

        Args:
            X (pd.DataFrame): Feature dataframe.
            y (pd.Series): Target labels.
            window_length (int, optional): Length of each window. Defaults to 50.
            stride (int, optional): Stride between windows. Defaults to 1.

        Raises:
            ValueError: If X and y have different lengths or if window_length is too long.

        """
        if len(X) != len(y):
            msg = "X and y must have the same length."
            raise ValueError(msg)
        self.X = X.to_numpy(np.float32, copy=False)
        self.y = y.to_numpy(np.int64, copy=False)
        self.window_length = int(window_length)
        self.stride = int(stride)
        self.max_start = len(self.y) - self.window_length
        # max_start == 0 is still valid: the data holds exactly one window.
        if self.max_start < 0:
            msg = "window_length is longer than the dataset."
            raise ValueError(msg)
        self.starts = np.arange(0, self.max_start + 1, self.stride, dtype=np.int64)


    def __len__(self) -> int:
        """Get the number of samples in the dataset.

        Returns:
            int: Number of windows.

        """
        return len(self.starts)


    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        """Get a windowed sample from the dataset.

        Args:
            idx (int): Index of the sample to retrieve.

        Returns:
            tuple[torch.Tensor, torch.Tensor]: The window of shape (window_length, n_features) and the label of its last row.

        """
        start = int(self.starts[idx])
        end = start + self.window_length
        x = self.X[start:end] # (T, F)
        y = self.y[end - 1]  # Label is at the end of the sequence
        return torch.from_numpy(x), torch.tensor(y, dtype=torch.long)        

In [None]:
# --- PREPARE WINDOWED DATASETS ---
window_length, stride = 100, 1
train_dataset = WindowedDataset(X_train, y_train, window_length, stride)
test_dataset = WindowedDataset(X_test, y_test, window_length, stride)

train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True, num_workers=2, drop_last=True, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False, num_workers=2, pin_memory=True)

feat_dim = X_train.shape[1] # number of features
print(f"Windows: Train: {len(train_dataset)}, test: {len(test_dataset)}, feature dimension: {feat_dim}")  # noqa: T201

Windows: Train: 362301, test: 31838, feature dimension: 148
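
The window counts follow from `(n_rows - window_length) // stride + 1`. The sketch below checks that formula against the numbers printed above and builds the same windows with NumPy's `sliding_window_view` on a toy array. `count_windows` and `make_windows` are hypothetical helper names, not part of `WindowedDataset`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def count_windows(n_rows: int, window_length: int, stride: int = 1) -> int:
    """Number of full windows that fit into `n_rows` rows."""
    if n_rows < window_length:
        return 0
    return (n_rows - window_length) // stride + 1


def make_windows(features: np.ndarray, window_length: int, stride: int = 1) -> np.ndarray:
    """Return windows of shape (n_windows, window_length, n_features)."""
    # sliding_window_view puts the window axis last: (n_windows, n_features, window_length).
    windows = sliding_window_view(features, window_shape=window_length, axis=0)
    return windows[::stride].transpose(0, 2, 1)


toy = np.arange(20, dtype=np.float32).reshape(10, 2)  # 10 rows, 2 features
windows = make_windows(toy, window_length=4, stride=2)
print(windows.shape)

print(count_windows(362400, 100), count_windows(31937, 100))
```

Unlike the `Dataset` class, this version exposes all windows as a single array view, which is handy for quick checks but does not replace lazy per-item loading during training.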


### Positional Encoding + Transformer Encoder

In [None]:
import math
import torch.nn as nn

class PositionalEncoding(nn.Module):
    pass
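
The class above is still a stub. One common choice is the fixed sinusoidal encoding from the original Transformer paper, sketched below under a different name so it is not mistaken for the final implementation. It assumes batch-first inputs of shape `(batch, time, d_model)` and an even `d_model`; the 148 input features would first need a projection to `d_model`, which is not shown here.

```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position signals to a (batch, time, d_model) tensor."""

    def __init__(self, d_model: int, max_len: int = 512) -> None:
        super().__init__()
        if d_model % 2 != 0:
            msg = "d_model must be even for this sketch."
            raise ValueError(msg)
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
        # Frequencies decay geometrically from 1 down to roughly 1/10000.
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even channels
        pe[:, 1::2] = torch.cos(position * div_term)  # odd channels
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pe[:, : x.size(1)]


# A zero input makes the added encoding directly visible.
encoder_input = torch.zeros(2, 100, 64)
encoded = SinusoidalPositionalEncoding(d_model=64)(encoder_input)
print(encoded.shape)
```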