# Initial EDA

This notebook outlines the initial exploratory data analysis (EDA) carried out on the limit order book (LOB) data. The code was first written against a small sample of the full LOB dataset. To ensure no trends or outliers are missed, this EDA will need to be rerun against the full dataset.

In [1]:
import pandas as pd


In [2]:
# Read in sample data
sample_csv = 'data/output/EDA_lob_output_data_sample.csv'  # path to the sample data

lob_sample = pd.read_csv(sample_csv)

In [3]:
# Reorder columns so the timestamp and date are easier to read
desired_column_order = ['Timestamp', 'Date', 'Exchange', 'Bid', 'Ask', 'Mid_Price']
lob_sample = lob_sample[desired_column_order]
lob_sample.head()

Unnamed: 0,Timestamp,Date,Exchange,Bid,Ask,Mid_Price
0,0.0,2025-01-02,Exch0,[],[],
1,0.279,2025-01-02,Exch0,"[[1, 6]]",[],
2,1.333,2025-01-02,Exch0,"[[1, 6]]","[[800, 1]]",400.5
3,1.581,2025-01-02,Exch0,"[[1, 6]]","[[799, 1]]",400.0
4,1.643,2025-01-02,Exch0,"[[1, 6]]","[[798, 1]]",399.5
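
The `Bid` and `Ask` columns come back from `pd.read_csv` as strings that look like Python lists (for example `"[[1, 6]]"`), so they need parsing before any per-level analysis. The sketch below is one possible way to do that and to cross-check `Mid_Price`. It assumes each entry is a `[price, quantity]` pair, which is inferred from the sample rows above rather than confirmed by the data source. The helper names `parse_levels` and `mid_price` are hypothetical.

```python
import ast

def parse_levels(raw):
    """Parse a stringified book side such as '[[1, 6]]' into a list of pairs."""
    return ast.literal_eval(raw)

def mid_price(bid_raw, ask_raw):
    """Return the mid price, or None when either side of the book is empty."""
    bids = parse_levels(bid_raw)
    asks = parse_levels(ask_raw)
    if not bids or not asks:
        return None
    # Assumption: the first element of each pair is the price
    best_bid = max(price for price, _qty in bids)
    best_ask = min(price for price, _qty in asks)
    return (best_bid + best_ask) / 2

# Rows 1 and 2 of the sample output above
print(mid_price("[[1, 6]]", "[]"))          # None (ask side is empty)
print(mid_price("[[1, 6]]", "[[800, 1]]"))  # 400.5
```

On the DataFrame, the same check could be run with `lob_sample.apply(lambda r: mid_price(r['Bid'], r['Ask']), axis=1)` and compared against the existing `Mid_Price` column.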


## Tick Time

In [4]:
# Convert Date to datetime data type
lob_sample['Date'] = pd.to_datetime(lob_sample['Date'])

# Sort DataFrame by date, then by timestamp within each date
lob_sample = lob_sample.sort_values(['Date','Timestamp'])

# Calculate the difference between consecutive Timestamps (Tick Time) within each day
lob_sample['Tick_Time'] = lob_sample.groupby('Date')['Timestamp'].diff()

lob_sample.head()

Unnamed: 0,Timestamp,Date,Exchange,Bid,Ask,Mid_Price,Tick_Time
0,0.0,2025-01-02,Exch0,[],[],,
1,0.279,2025-01-02,Exch0,"[[1, 6]]",[],,0.279
2,1.333,2025-01-02,Exch0,"[[1, 6]]","[[800, 1]]",400.5,1.054
3,1.581,2025-01-02,Exch0,"[[1, 6]]","[[799, 1]]",400.0,0.248
4,1.643,2025-01-02,Exch0,"[[1, 6]]","[[798, 1]]",399.5,0.062
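
The reason for grouping by `Date` before calling `diff()` can be seen on a small toy frame. This is an illustrative sketch only: the values are made up, and it assumes `Timestamp` restarts from zero each day, which the sample suggests (the first row is `0.0`) but does not prove.

```python
import pandas as pd

# Hypothetical two-day toy data, not taken from the LOB dataset
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2025-01-02', '2025-01-02', '2025-01-03', '2025-01-03']),
    'Timestamp': [0.0, 0.5, 0.0, 0.25],
})
toy = toy.sort_values(['Date', 'Timestamp'])

# A plain diff crosses the day boundary and yields a negative tick time
toy['Naive_Diff'] = toy['Timestamp'].diff()

# A grouped diff restarts for each date, so the first tick of a day is NaN
toy['Tick_Time'] = toy.groupby('Date')['Timestamp'].diff()

print(toy)
```

In the grouped version the first row of each day is `NaN`, and `mean()` skips `NaN` by default, so those rows do not distort later averages.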


In [5]:
# Group by date and calculate the average tick time per day
average_ticktime_by_date = lob_sample.groupby('Date')['Tick_Time'].mean()
# TODO: show this as a distribution to see if there are any outliers

# Calculate the average Tick Time across the dataset
average_ticktime = lob_sample['Tick_Time'].mean()

# Count the distinct number of dates
date_count = lob_sample['Date'].nunique()

print(f'The average tick time across the {date_count} days is {average_ticktime:.4f} seconds')

The average tick time across the 3 days is 0.0884 seconds
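
A single mean can hide a long tail, so a distribution view of `Tick_Time` is a natural next step. The sketch below shows one possible approach on made-up data: a per-day summary with `describe()`, followed by a simple 1.5 × IQR upper fence to flag unusually long gaps. The toy values and the choice of the IQR rule are assumptions for illustration, not results from the LOB dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for lob_sample
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2025-01-02'] * 5 + ['2025-01-03'] * 5),
    'Tick_Time': [0.1, 0.1, 0.2, 0.1, 5.0, 0.1, 0.2, 0.1, 0.2, 0.1],
})

# Per-day summary statistics (count, mean, std, min, percentiles, max)
summary = toy.groupby('Date')['Tick_Time'].describe(percentiles=[0.5, 0.99])
print(summary)

# Flag ticks above the upper fence of the 1.5 * IQR rule
q1, q3 = toy['Tick_Time'].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = toy[toy['Tick_Time'] > upper_fence]
print(outliers)
```

Here only the 5.0 second gap is flagged. On the real data, `lob_sample['Tick_Time'].plot.hist(bins=50)` would give a quick visual check, provided matplotlib is installed.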
