# Data Analysis using the FI-2010 dataset

* FI-2010 is a [publicly available dataset](https://arxiv.org/abs/1705.03233) for mid-price forecasting from limit order book data. 

* It covers five stocks from the NASDAQ Nordic stock market over 10 consecutive trading days, about 4,000,000 samples in total.


## The data: 
```text
.
└── BenchmarkDatasets
    └── BenchmarkDatasets
        ├── Auction
        │   ├── 1.Auction_Zscore
        │   │   ├── Auction_Zscore_Testing
        │   │   └── Auction_Zscore_Training
        │   ├── 2.Auction_MinMax
        │   │   ├── Auction_MinMax_Testing
        │   │   └── Auction_MinMax_Training
        │   └── 3.Auction_DecPre
        │       ├── Auction_DecPre_Testing
        │       └── Auction_DecPre_Training
        └── NoAuction
            ├── 1.NoAuction_Zscore
            │   ├── NoAuction_Zscore_Testing
            │   └── NoAuction_Zscore_Training
            ├── 2.NoAuction_MinMax
            │   ├── NoAuction_MinMax_Testing
            │   └── NoAuction_MinMax_Training
            └── 3.NoAuction_DecPre
                ├── NoAuction_DecPre_Testing
                └── NoAuction_DecPre_Training
```

The FI-2010 benchmark release bundles several preprocessed versions of the limit order book (LOB). The folder names encode two choices: whether auction periods are included, and which normalization was applied.


### Has Auction?
* **Auction:** includes samples from the opening and closing auctions of the day. These behave differently from continuous trading and can contain large batch orders or jumps in price.

* **NoAuction:** removes the auction periods, leaving only continuous trading hours. This is the *cleaner* dataset.

### Zscore, MinMax, DecPre
These are the preprocessing and normalization schemes: how the limit order book was normalized before being written out.

* **Zscore:**
    * Each feature is standardized: $$x' = \frac{x-\mu}{\sigma}$$, where $\mu$ and $\sigma$ are that feature's mean and standard deviation.
    * The result has zero mean and unit variance.
    * This helps gradient-based training converge.

* **MinMax:**
    * Each feature is scaled to [0, 1].
    * This preserves the shape of the distribution but is sensitive to outliers.

* **DecPre:**
    * Prices are divided by a fixed constant (decimal-precision scaling), which shifts the decimal point.
    * The inputs end up in a small numeric range without relying on per-feature mean and variance.
    * Sometimes used in microstructure studies.
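
To make the three schemes concrete, here is a minimal NumPy sketch applied to a made-up two-column array. The function names and the toy values are illustrative assumptions, and `decimal_precision` assumes the usual decimal-scaling rule (divide by the smallest power of ten that brings the largest absolute value below 1); the benchmark's exact constants are not reproduced here.

```python
import numpy as np

def zscore(x):
    """Standardize each column to zero mean and unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def minmax(x):
    """Rescale each column linearly to the interval [0, 1]."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)

def decimal_precision(x):
    """Divide each column by 10**k, with k the smallest integer giving max|x'| < 1."""
    k = np.floor(np.log10(np.abs(x).max(axis=0))) + 1
    return x / 10.0 ** k

# Toy "price, volume" columns; the values are made up for illustration.
toy = np.array([[100.0, 1.0], [200.0, 2.0], [300.0, 3.0]])
z = zscore(toy)
m = minmax(toy)
d = decimal_precision(toy)
```

Constant columns would divide by zero in `zscore` and `minmax`, so a real pipeline would need a guard for them.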


### For the XGBoost -> Transformer pipeline
Start with the **NoAuction** data and **Zscore** normalization.

```text
.
├── NoAuction_Zscore_Testing
│   ├── Test_Dst_NoAuction_ZScore_CF_1.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_2.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_3.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_4.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_5.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_6.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_7.txt
│   ├── Test_Dst_NoAuction_ZScore_CF_8.txt
│   └── Test_Dst_NoAuction_ZScore_CF_9.txt
└── NoAuction_Zscore_Training
    ├── Train_Dst_NoAuction_ZScore_CF_1.txt
    ├── Train_Dst_NoAuction_ZScore_CF_2.txt
    ├── Train_Dst_NoAuction_ZScore_CF_3.txt
    ├── Train_Dst_NoAuction_ZScore_CF_4.txt
    ├── Train_Dst_NoAuction_ZScore_CF_5.txt
    ├── Train_Dst_NoAuction_ZScore_CF_6.txt
    ├── Train_Dst_NoAuction_ZScore_CF_7.txt
    ├── Train_Dst_NoAuction_ZScore_CF_8.txt
    └── Train_Dst_NoAuction_ZScore_CF_9.txt
```

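In the raw `.txt` files the matrix is stored transposed: each column is one sample and each row is one feature, with the labels in the last five rows (one row per prediction horizon), encoded as 1/2/3. Transpose after loading to get one sample per row.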

## Loader and Parser

In [8]:
import os
import glob
from tqdm import tqdm
import numpy as np
import pandas as pd

def load_FI2010_split(train_dir, test_dir):
    def load_folder(folder):
        print(f"Loading folder {folder}")
        # glob returns files in arbitrary order; wrap in sorted() if the fold order matters.
        all_files = glob.glob(os.path.join(folder, "*.txt"))

        data_list = []
        for file in tqdm(all_files):
            print(f"Loading {file}")
            # sep=r"\s+" replaces the deprecated delim_whitespace=True.
            df = pd.read_csv(file, header=None, sep=r"\s+")
            data_list.append(df)
        # Stack all files row-wise into one frame.
        big = pd.concat(data_list, axis=0, ignore_index=True)
        return big
    
    train_df = load_folder(train_dir)
    test_df = load_folder(test_dir)

    X_train = train_df.iloc[:, :-1].astype(np.float32).values
    y_train = train_df.iloc[:, -1].astype(np.int64).values

    X_test = test_df.iloc[:, :-1].astype(np.float32).values
    y_test = test_df.iloc[:, -1].astype(np.int64).values

    return (X_train, y_train), (X_test, y_test)

train_dir = "published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training"
test_dir = "published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing"

In [9]:
(X_train, y_train), (X_test, y_test) = load_FI2010_split(train_dir, test_dir)

Loading folder published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training/Train_Dst_NoAuction_ZScore_CF_8.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training/Train_Dst_NoAuction_ZScore_CF_9.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training/Train_Dst_NoAuction_ZScore_CF_2.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training/Train_Dst_NoAuction_ZScore_CF_3.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training/Train_Dst_NoAuction_ZScore_CF_1.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training/Train_Dst_NoAuction_ZScore_CF_4.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training/Train_Dst_NoAuction_ZScore_CF_5.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training/Train_Dst_NoAuction_ZScore_CF_7.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Training/Train_Dst_NoAuction_ZScore_CF_6.txt
100%|██████████| 9/9 [09:15<00:00, 61.73s/it]

Loading folder published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing/Test_Dst_NoAuction_ZScore_CF_1.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing/Test_Dst_NoAuction_ZScore_CF_3.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing/Test_Dst_NoAuction_ZScore_CF_2.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing/Test_Dst_NoAuction_ZScore_CF_6.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing/Test_Dst_NoAuction_ZScore_CF_7.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing/Test_Dst_NoAuction_ZScore_CF_5.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing/Test_Dst_NoAuction_ZScore_CF_4.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing/Test_Dst_NoAuction_ZScore_CF_9.txt
Loading published/BenchmarkDatasets/BenchmarkDatasets/NoAuction/1.NoAuction_Zscore/NoAuction_Zscore_Testing/Test_Dst_NoAuction_ZScore_CF_8.txt
100%|██████████| 9/9 [00:19<00:00,  2.11s/it]


IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

**TODO:** finish the parser
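
One plausible cause of the `IntCastingNaNError` is the file layout: if each raw file stores features as rows and samples as columns, then files with different sample counts get padded with NaN when concatenated row-wise, and the "last column" is not a label at all. The sketch below is one way the parser could be finished under that assumption. `load_fi2010_txt`, `N_LABEL_ROWS` and `label_row` are hypothetical names, and the "last five rows are labels, encoded 1/2/3" layout should be checked against `raw.shape` before relying on it.

```python
import numpy as np

N_LABEL_ROWS = 5  # assumption: the last five rows of each file are labels

def load_fi2010_txt(path, label_row=-1):
    """Load one raw FI-2010 file into (X, y), assuming a features-by-samples layout."""
    raw = np.loadtxt(path)                           # shape: (n_rows, n_samples)
    X = raw[:-N_LABEL_ROWS, :].T.astype(np.float32)  # -> (n_samples, n_features)
    y = raw[label_row, :].astype(np.int64) - 1       # assumed 1/2/3 -> 0/1/2
    return X, y
```

A call such as `load_fi2010_txt(os.path.join(train_dir, "Train_Dst_NoAuction_ZScore_CF_7.txt"))` would then return one sample per row. The sketch loads a single file on purpose: the `CF_1` to `CF_9` files look like cross-validation folds rather than disjoint chunks, so concatenating all nine may duplicate samples.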

## Reading preprocessed CSVs instead? 

I am assuming that these CSVs hold the Zscore-normalized data without auction periods.

In [37]:
train = pd.read_csv('FI2010_train.csv', header=0, index_col=0)
test = pd.read_csv('FI2010_test.csv', header=0, index_col=0)

X_train = train.iloc[:, :-1].astype(np.float32)
y_train = train.iloc[:, -1].astype(np.int64)

X_test = test.iloc[:, :-1].astype(np.float32)
y_test = test.iloc[:, -1].astype(np.int64)

### Features

#### Confirm this is the ZScore normalized one: 

In [68]:
import seaborn as sns

feature_0 = X_train.loc[:, '0']
feature_0.describe() # mean ~ 0 and std = 1, consistent with Zscore standardization

count    3.624000e+05
mean     2.155767e-08
std      1.000000e+00
min     -1.065622e+00
25%     -9.833034e-01
50%     -5.266306e-01
75%      1.156003e+00
max      1.355919e+00
Name: 0, dtype: float64

#### Are there any NaNs?

In [None]:
# Checks for NaNs
print(f"X_train nans: {X_train.isna().any().any()}")
print(f"X_test nans: {X_test.isna().any().any()}")
# y_train and y_test are Series, so a single .any() is enough
print(f"y_train nans: {y_train.isna().any()}")
print(f"y_test nans: {y_test.isna().any()}")

X_train nans: False
X_test nans: False
y_train nans: False
y_test nans: False


### Target vars
The original datasets use: 
* 1 -> Down
* 2 -> Stationary
* 3 -> Up

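Remap with `y = y - 1` to get 0, 1, 2. The preprocessed CSVs loaded here already use 0, 1, 2 (see the `unique()` output below), so the remap stays commented out.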

In [83]:
# UNCOMMENT to remap y to 0,1,2
# y_test = y_test - 1 
# y_train = y_train - 1

In [84]:
# -> target vars:
y_test.unique(), y_train.unique()

(array([1, 2, 0]), array([1, 0, 2]))
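
Before comparing any labels it can help to know how the three classes are distributed, since the share of the most frequent class is the accuracy a constant predictor would reach. A minimal sketch, using a made-up `y_demo` series as a stand-in for `y_train` (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for y_train, using the 0/1/2 encoding shown above.
y_demo = pd.Series([0, 1, 1, 2, 1, 0, 1, 2])

# Fraction of samples per class, ordered by class code.
class_share = y_demo.value_counts(normalize=True).sort_index()

# Accuracy of always predicting the most frequent class.
majority_baseline = class_share.max()
```

Running the same two lines on `y_train` gives the baseline that any later accuracy figure should be read against.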

#### Sanity check

FI-2010 rows are LOB snapshots. The first two columns are typically the best ask price and ask volume; the next two are the best bid price and bid volume.

In [88]:
def compute_mid(df):
    best_ask = df.iloc[:, 0]
    best_bid = df.iloc[:, 2]
    mid = (best_ask + best_bid) / 2
    return mid

The target is the direction of the mid-price move $k$ steps ahead.

**No nice signal.** Chad says: 

That’s actually a super interesting finding 👍: it means the labels in your preprocessed CSV are not computed on the fly from best bid/ask, but were baked in during preprocessing. That’s exactly how the FI-2010 BenchmarkDatasets were released:

* They took the raw ITCH order book.
* They defined the label as the mid-price movement after 10 events (not 10 rows in the normalized file, but 10 events in the original event stream).
* Then they normalized everything (Z-score, MinMax, DecPre) and saved features and label together.

👉 So if you try to recompute labels just from your normalized CSV, they won’t line up, because you’ve lost the raw event stream ordering. That’s why you don’t see agreement even with longer lags.

And that's awesome.

In [144]:
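def labels_from_mid(mid: pd.Series, horizon=10, eps: float = 1e-5):
    # Mapping implemented here: 0 = up, 1 = stationary, 2 = down.
    # This is the reverse direction order of the list under "Target vars"; swap 0 and 2 to match it.
    future_mid = mid.shift(-horizon)
    diff = future_mid - mid  # NaN for the last `horizon` rows, which keep the default label
    labels = pd.Series(1, index=mid.index)  # Default to 1 (stationary)
    labels[diff > eps] = 0  # Up
    labels[diff < -eps] = 2  # Down
    return labels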

In [145]:
mid = compute_mid(X_train)

In [146]:
mid

for horizon in range(1, 21, 2):
    y_derived = labels_from_mid(mid, horizon=horizon, eps=1e-5)
    accuracy = (y_derived == y_train).mean()
    print(f"Horizon: {horizon}, Accuracy: {accuracy:.4f}")

Horizon: 1, Accuracy: 0.3568
Horizon: 3, Accuracy: 0.4252
Horizon: 5, Accuracy: 0.4579
Horizon: 7, Accuracy: 0.4735
Horizon: 9, Accuracy: 0.4801
Horizon: 11, Accuracy: 0.4833
Horizon: 13, Accuracy: 0.4837
Horizon: 15, Accuracy: 0.4826
Horizon: 17, Accuracy: 0.4800
Horizon: 19, Accuracy: 0.4770
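
A single accuracy number hides which classes disagree. A confusion table of given versus recomputed labels makes that visible, for example whether "up" and "down" are simply swapped or whether most mismatches involve the stationary class. The sketch below uses two short made-up series (`y_given` and `y_recomputed` are hypothetical stand-ins for `y_train` and `y_derived`):

```python
import pandas as pd

# Made-up labels for illustration only; not taken from the dataset.
y_given = pd.Series([0, 1, 1, 2, 1, 0], name="given")
y_recomputed = pd.Series([0, 1, 2, 2, 0, 1], name="recomputed")

# Rows: given label, columns: recomputed label, cells: number of samples.
confusion = pd.crosstab(y_given, y_recomputed)

# Overall agreement, the same quantity as the accuracy printed above.
agreement = (y_given == y_recomputed).mean()
```

In the notebook the equivalent call would be `pd.crosstab(y_train, y_derived)` for a chosen horizon.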

