This dataset was introduced in the research paper "Benchmark Dataset for Mid-Price Forecasting of Limit Order Book Data with Machine Learning Methods" (Ntakaris, A., Magris, M., Kanniainen, J., Gabbouj, M., & Iosifidis, A. (2018). Benchmark dataset for mid‐price forecasting of limit order book data with machine learning methods. Journal of Forecasting, 37(8), 852-866).

This was my first time working with such a large and unfamiliar dataset, and I think that was the biggest challenge in this project: figuring out what the data actually looks like and how to process it to make my life, and my model's life, easier.

The data consists of high-frequency limit order books from the NASDAQ Nordic stock market. I used the min-max normalized version of the time series, which covers five stocks over a trading period of ten consecutive days. This leads to a dataset of roughly 4,000,000 time series samples.
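
For reference, min-max normalization rescales each feature to the [0, 1] range using that feature's own minimum and maximum. The sketch below only illustrates the idea on made-up numbers. The published files are already normalized, the helper name `minmax_normalize` is my own, and I am not claiming this is exactly how the dataset authors computed it.

```python
import numpy as np

def minmax_normalize(x, axis=1):
    """Rescale each feature to [0, 1] along `axis` (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    # Constant features would give 0/0, so divide by 1 there instead
    span = np.where(hi > lo, hi - lo, 1.0)
    return (x - lo) / span

# Toy example: 2 features (rows) x 4 timestamps (columns),
# the same orientation as the dataset files
toy = np.array([[10.0, 12.0, 11.0, 14.0],
                [200.0, 100.0, 300.0, 100.0]])
toy_scaled = minmax_normalize(toy)
print(toy_scaled)
```

After this, every row of `toy_scaled` has a minimum of 0 and a maximum of 1, which is why the prices and volumes further down all sit between 0 and 1.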


We initially have a file with the following dimensions: (149, 39512). This is just the first file; there are 9 files of training data and 9 files of test data.
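
Since there are several files, a quick way to get a feel for them is to load each one and compare shapes. This is a minimal sketch: `summarize_lob_files` is a hypothetical helper of mine, and it simply picks up every `.txt` file in a folder rather than assuming any particular file names. I deliberately do not concatenate the files here, because I have not checked whether they overlap in time.

```python
from pathlib import Path

import numpy as np

def summarize_lob_files(folder):
    """Return {file name: array shape} for every .txt file in `folder`."""
    shapes = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Each file is expected to be a 2D (features, timestamps) text matrix
        shapes[path.name] = np.loadtxt(path).shape
    return shapes

# Example call (the folder is the one used in this notebook):
# summarize_lob_files("/Users/main/Downloads/LOB DATASET/train")
```

If the layout is the same across files, the first number of every shape should be 149 and only the number of timestamps should differ.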

The rows correspond to features such as asset price and volume, while the columns correspond to timestamps.

Based on the dataset documentation, the first 40 rows carry 10 levels of
bid and ask prices from the order book, along with the volume at each price level.
There are 40 entries per timestamp because each side (buy and sell) contains 10
price levels, and each level has two values: price and volume. In other words, the
limit order book at a single timestamp is an array of 40 elements.

We then convert the raw data into pandas DataFrames, separating ask prices, ask volumes, bid prices, and bid volumes. Finally, we reshape it into a 3D structure suitable for machine learning models, especially time series models such as LSTMs.

In [1]:
#Project inspired by: https://arxiv.org/pdf/1705.03233
#Data Downloaded from: https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
df = np.loadtxt("/Users/main/Downloads/LOB DATASET/train/Train_Dst_NoAuction_MinMax_CF_1.txt")


In [2]:
#Original size
df.shape

(149, 39512)

In [3]:
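# DataFrame creation: keep the first 40 rows (the order book) and transpose
# so that each row is a timestamp and each column is a feature
df2 = pd.DataFrame(df[:40, :].T)
dfAskPrices = df2.iloc[:, 0::4]  # Ask prices: every 4th column starting from 0
dfAskVolumes = df2.iloc[:, 1::4]  # Ask volumes: every 4th column starting from 1
dfBidPrices = df2.iloc[:, 2::4]  # Bid prices: every 4th column starting from 2
dfBidVolumes = df2.iloc[:, 3::4]  # Bid volumes: every 4th column starting from 3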

# Print shapes to verify
print("dfAskPrices shape:", dfAskPrices.shape)
print("dfAskVolumes shape:", dfAskVolumes.shape)
print("dfBidPrices shape:", dfBidPrices.shape)
print("dfBidVolumes shape:", dfBidVolumes.shape)

dfAskPrices shape: (39512, 10)
dfAskVolumes shape: (39512, 10)
dfBidPrices shape: (39512, 10)
dfBidVolumes shape: (39512, 10)


The data is now split into four DataFrames of shape (timestamps, levels), which makes it easier to prepare for training.
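
To give an idea of what a 3D structure for an LSTM can look like, here is a minimal sketch that cuts a 2D (timestamps, features) array into overlapping windows of shape (samples, timesteps, features). The window length of 100, the helper name `make_windows`, and the random stand-in data are all assumptions for illustration, not the settings used in this project.

```python
import numpy as np

def make_windows(features, window=100):
    """(timestamps, n_features) -> (samples, window, n_features)."""
    features = np.asarray(features)
    if features.shape[0] < window:
        raise ValueError("need at least `window` timestamps")
    # sliding_window_view puts the window axis last, so move it to the middle
    view = np.lib.stride_tricks.sliding_window_view(features, window, axis=0)
    return view.transpose(0, 2, 1)

# Random data standing in for the 40 order book features
rng = np.random.default_rng(0)
fake_book = rng.random((500, 40))
X = make_windows(fake_book, window=100)
print(X.shape)  # (samples, timesteps, features)
```

Each sample `X[i]` holds timestamps `i` to `i + 99`, so consecutive samples overlap heavily. The result is a read-only view of the original array (this needs NumPy 1.20 or newer); call `.copy()` on it if a writable array is required.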

We can also print the data the way a typical exchange order book would display it.

In [5]:
def print_order_book_snapshot(timestamp=0):
    print("Order Book Snapshot at timestamp", timestamp)
    print("\nAsk Side (Sell Orders):")
    print("Level | Price | Volume")
    for level in range(10):
        print(f"{level+1:5d} | {dfAskPrices.iloc[timestamp, level]:6.4f} | {dfAskVolumes.iloc[timestamp, level]:6.4f}")
    
    print("\nBid Side (Buy Orders):")
    print("Level | Price | Volume")
    for level in range(10):
        print(f"{level+1:5d} | {dfBidPrices.iloc[timestamp, level]:6.4f} | {dfBidVolumes.iloc[timestamp, level]:6.4f}")
    
    print("\nBest Ask:", dfAskPrices.iloc[timestamp, 0])
    print("Best Bid:", dfBidPrices.iloc[timestamp, 0])
    print("Spread:", dfAskPrices.iloc[timestamp, 0] - dfBidPrices.iloc[timestamp, 0])

# Run the test for the first timestamp
print_order_book_snapshot(0)

Order Book Snapshot at timestamp 0

Ask Side (Sell Orders):
Level | Price | Volume
    1 | 0.6015 | 0.0086
    2 | 0.6028 | 0.0111
    3 | 0.6026 | 0.0077
    4 | 0.6020 | 0.0290
    5 | 0.6031 | 0.0042
    6 | 0.6007 | 0.0221
    7 | 0.6013 | 0.0039
    8 | 0.6026 | 0.0073
    9 | 0.6029 | 0.0037
   10 | 0.6059 | 0.0033

Bid Side (Buy Orders):
Level | Price | Volume
    1 | 0.6014 | 0.0054
    2 | 0.6010 | 0.0114
    3 | 0.6005 | 0.0131
    4 | 0.6005 | 0.0183
    5 | 0.6005 | 0.0165
    6 | 0.6010 | 0.0051
    7 | 0.5996 | 0.0032
    8 | 0.5995 | 0.0035
    9 | 0.5990 | 0.0033
   10 | 0.5983 | 0.0028

Best Ask: 0.60150044
Best Bid: 0.60142033
Spread: 8.011000000007762e-05
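
The best ask, best bid, and spread above are for a single timestamp. As a rough sketch, the same quantities, plus the mid-price (the average of best ask and best bid), can be computed for every timestamp at once. `top_of_book` is a hypothetical helper, and the toy frames below only mimic the column layout of `dfAskPrices` and `dfBidPrices`; their first row reuses values from the snapshot, the second row is made up.

```python
import numpy as np
import pandas as pd

def top_of_book(ask_prices, bid_prices):
    """Best ask, best bid, mid-price and spread for every timestamp (row)."""
    best_ask = ask_prices.iloc[:, 0].to_numpy()  # level 1 is the first column
    best_bid = bid_prices.iloc[:, 0].to_numpy()
    return pd.DataFrame({
        "best_ask": best_ask,
        "best_bid": best_bid,
        "mid_price": (best_ask + best_bid) / 2,
        "spread": best_ask - best_bid,
    })

# Toy stand-ins for dfAskPrices / dfBidPrices (two levels, two timestamps)
toy_asks = pd.DataFrame({0: [0.60150044, 0.6020], 4: [0.6028, 0.6030]})
toy_bids = pd.DataFrame({2: [0.60142033, 0.6010], 6: [0.6010, 0.6000]})
summary = top_of_book(toy_asks, toy_bids)
print(summary)
```

With the real data this would be called as `top_of_book(dfAskPrices, dfBidPrices)`. Because the inputs are min-max normalized, the resulting mid-price and spread are in normalized units rather than actual currency.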
