## LOB EDA

__Initial Plan__
- Load data & initial look at structure/ size ✅
- Assess data types ✅
- Handle nulls ✅
- Descriptive stats 🟡
  - univariate/multivariate, graphical/non-graphical? Decide what is useful to explore
- Outliers
- Financial technical indicators
  - spread
  - trend
  - momentum
  - volatility
  - volume
  - ratios
  - depth

_N.B. Normalisation will be performed after feature creation to support downstream generalisation._

__Load Data__

In [2]:
#import required libraries
import pandas as pd

In [3]:
#load sample csv
lob = pd.read_csv('EDA_lob_output_data_sample.csv')

In [6]:
#dataset dimensions
lob.shape

(1037934, 6)

In [7]:
#dataset info
lob.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1037934 entries, 0 to 1037933
Data columns (total 6 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   Timestamp  1037934 non-null  float64
 1   Exchange   1037934 non-null  object 
 2   Bid        1037934 non-null  object 
 3   Ask        1037934 non-null  object 
 4   Date       1037934 non-null  object 
 5   Mid_Price  1037853 non-null  float64
dtypes: float64(2), object(4)
memory usage: 47.5+ MB


In [9]:
#visual look at first 5 rows
lob.head()

   Timestamp Exchange       Bid         Ask        Date  Mid_Price
0      0.000    Exch0        []          []  2025-01-02        NaN
1      0.279    Exch0  [[1, 6]]          []  2025-01-02        NaN
2      1.333    Exch0  [[1, 6]]  [[800, 1]]  2025-01-02      400.5
3      1.581    Exch0  [[1, 6]]  [[799, 1]]  2025-01-02      400.0
4      1.643    Exch0  [[1, 6]]  [[798, 1]]  2025-01-02      399.5


In [11]:
#reorder cols (note: 'Exchange' is not in the list, so it is dropped here)
order = ['Timestamp', 'Date', 'Bid', 'Ask', 'Mid_Price']
lob = lob[order]

__Assess data types__

In [13]:
#convert 'Date' to datetime
lob['Date'] = pd.to_datetime(lob['Date'])

#all other types fine; 'Exchange' was dropped in the reorder above, but if we keep it later a category dtype should help with memory/speed
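
A quick way to check the memory question from the comment above. This is a minimal sketch on a toy stand-in column (`exchange` is a hypothetical Series, not the real data), assuming a small number of repeated labels as seen in the sample:

```python
import pandas as pd

# Toy stand-in for an 'Exchange' column: a single repeated label, stored as object
exchange = pd.Series(['Exch0'] * 100_000, name='Exchange', dtype=object)

# deep=True counts the Python string objects, not just the pointers to them
as_object = exchange.memory_usage(deep=True)
as_category = exchange.astype('category').memory_usage(deep=True)

print(f"object: {as_object:,} bytes, category: {as_category:,} bytes")
```

The saving comes from storing small integer codes plus one copy of each label, so it shrinks as the number of distinct labels grows.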

In [37]:
#check if Bid/ Ask are actual lists
print(lob['Bid'].apply(type).unique())
print(lob['Ask'].apply(type).unique())

[<class 'str'>]
[<class 'str'>]


In [39]:
#convert to lists
#import required libraries
import ast 

#convert
lob['Bid'] = lob['Bid'].apply(ast.literal_eval)
lob['Ask'] = lob['Ask'].apply(ast.literal_eval)

In [42]:
#check if Bid/ Ask are actual lists
print(lob['Bid'].apply(type).unique())
print(lob['Ask'].apply(type).unique())

[<class 'list'>]
[<class 'list'>]
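
`ast.literal_eval` can be slow over ~1M rows. Since the strings in the sample look like plain nested lists of integers, they are also valid JSON, so `json.loads` is a possible drop-in. A minimal sketch on toy strings (the `raw` Series is made up for illustration, and it assumes every row follows this format):

```python
import ast
import json

import pandas as pd

# Toy strings in the same shape as the raw 'Bid'/'Ask' columns
raw = pd.Series(['[]', '[[1, 6]]', '[[800, 1], [801, 3]]'])

parsed_ast = raw.apply(ast.literal_eval)
parsed_json = raw.apply(json.loads)

# Both parsers should give identical Python lists for this format
same = parsed_ast.tolist() == parsed_json.tolist()
print(same)
```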


__Handle nulls__

In [23]:
#how many NaN/null in mid price?
missing_mid_price_count = lob['Mid_Price'].isnull().sum()

print(f'Missing "Mid_Price" values: {missing_mid_price_count} ({missing_mid_price_count/len(lob):.4%} of the sample)')

Missing "Mid_Price" values: 81 (0.0078% of the sample)


In [24]:
#drop rows with missing Mid_Price as they represent a small % of the sample
lob = lob.dropna(subset=['Mid_Price'])

In [None]:
#if missing rows were a large portion of the whole dataset we should consider interpolation instead
#lob['Mid_Price'] = lob['Mid_Price'].interpolate(method='linear')
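
For reference, a minimal sketch of what linear interpolation would do, on a toy Series (values invented for illustration). One thing to watch: with the default settings a leading NaN is left unfilled, and in the `head()` output above the missing mid prices sit at the very start, where one side of the book is still empty.

```python
import numpy as np
import pandas as pd

# Toy mid prices: one leading gap and one interior gap
toy_mid = pd.Series([np.nan, 400.0, np.nan, 399.0], name='Mid_Price')

# Linear interpolation fills interior gaps from the neighbouring values,
# treating the rows as equally spaced
filled = toy_mid.interpolate(method='linear')

print(filled.tolist())
```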

In [46]:
#check for rows where both 'Bid' and 'Ask' are empty lists
empty_bid_ask = lob.apply(lambda row: (not row['Bid']) and (not row['Ask']), axis=1)
empty_bid_ask_count = empty_bid_ask.sum()

print(f'Missing "Bid/Ask" values: {empty_bid_ask_count} ({empty_bid_ask_count/len(lob):.4%} of the sample)')

Missing "Bid/Ask" values: 0 (0.0000% of the sample)


In [47]:
#not sure why the check above finds nothing: I can see an empty list in the top row of head(), so retry with an explicit comparison to []
#Ah, so an empty container doesn't necessarily register as null
empty_bid_ask = lob.apply(lambda row: row['Bid'] == [] and row['Ask'] == [], axis=1)
empty_bid_ask_count = empty_bid_ask.sum()

print(f'Missing "Bid/Ask" values: {empty_bid_ask_count} ({empty_bid_ask_count/len(lob):.4%} of the sample)')

Missing "Bid/Ask" values: 0 (0.0000% of the sample)


OK, I think these rows were already removed when I dropped the NaN Mid_Price rows: with an empty Bid or Ask there is no mid price, so rows where Ask OR Bid is [] went at the same time. Panic over!
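
To confirm this rather than assume it, the check could be run before the `dropna` step. A minimal sketch on a toy frame that mirrors the first three rows of `head()` (it assumes `Bid`/`Ask` have already been converted to lists):

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the first rows of the sample, before dropping NaNs
toy = pd.DataFrame({
    'Bid': [[], [[1, 6]], [[1, 6]]],
    'Ask': [[], [], [[800, 1]]],
    'Mid_Price': [np.nan, np.nan, 400.5],
})

# .str.len() also works on list elements, so this avoids a row-wise apply
one_side_empty = (toy['Bid'].str.len() == 0) | (toy['Ask'].str.len() == 0)

# True if "one side empty" and "Mid_Price is NaN" flag exactly the same rows
matches = bool((one_side_empty == toy['Mid_Price'].isna()).all())
print(matches)
```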

In [31]:
#no missing Timestamp, Exchange or Date

In [49]:
lob.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1037853 entries, 2 to 1037933
Data columns (total 5 columns):
 #   Column     Non-Null Count    Dtype         
---  ------     --------------    -----         
 0   Timestamp  1037853 non-null  float64       
 1   Date       1037853 non-null  datetime64[ns]
 2   Bid        1037853 non-null  object        
 3   Ask        1037853 non-null  object        
 4   Mid_Price  1037853 non-null  float64       
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 47.5+ MB


__Descriptive stats__

_Date_

In [58]:
#range
date_min = lob['Date'].min()
date_max = lob['Date'].max()

print(f"Date range- {date_min.date()} to {date_max.date()}")

Date range- 2025-01-02 to 2025-01-06


In [61]:
#how many unique dates
unique_dates = lob['Date'].nunique()

print(f"Unique dates- {unique_dates}") #will be small as sample

Unique dates- 3


In [60]:
#freq of dates 
date_counts = lob['Date'].value_counts()
most_common_date = date_counts.idxmax()
frequency_most_common_date = date_counts.max()

print(f"Most common date- {most_common_date.date()} (Frequency- {frequency_most_common_date})")

Most common date- 2025-01-02 (Frequency- 352960)
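
A possible next step for the non-graphical univariate stats is a per-date summary of `Mid_Price`. A minimal sketch on a toy frame (values invented; the column names match `lob`):

```python
import pandas as pd

# Toy frame with the same column names as lob
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2025-01-02', '2025-01-02', '2025-01-03']),
    'Mid_Price': [400.5, 400.0, 399.5],
})

# One row per date with basic descriptive statistics
per_date = toy.groupby('Date')['Mid_Price'].agg(['count', 'mean', 'std', 'min', 'max'])
print(per_date)
```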


__Outliers__
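
Nothing decided here yet. One common starting point is the IQR rule. A minimal sketch on toy values (`iqr_outlier_mask` is a hypothetical helper and `k=1.5` is the conventional default, not a choice made in this notebook):

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the interquartile range."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy mid prices with one obviously odd value
toy_mid = pd.Series([400.5, 400.0, 399.5, 400.0, 250.0])
mask = iqr_outlier_mask(toy_mid)
print(toy_mid[mask].tolist())
```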

__Financial technical indicators__
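
A first sketch for the spread and depth items in the plan. It assumes each book level is `[price, quantity]`, which agrees with the sample row where Bid `[[1, 6]]` and Ask `[[800, 1]]` give Mid_Price 400.5. `book_features` is a hypothetical helper, and the imbalance ratio is one possible choice for "ratios", not something fixed in the plan.

```python
from typing import Dict, List


def book_features(bid: List[List[int]], ask: List[List[int]]) -> Dict[str, float]:
    """Top-of-book features for one snapshot.

    Assumes each level is [price, quantity] and that both sides are non-empty.
    """
    best_bid = max(price for price, _ in bid)
    best_ask = min(price for price, _ in ask)
    bid_depth = sum(qty for _, qty in bid)
    ask_depth = sum(qty for _, qty in ask)
    return {
        'spread': best_ask - best_bid,
        'mid': (best_bid + best_ask) / 2,
        'bid_depth': bid_depth,
        'ask_depth': ask_depth,
        # Ranges from -1 (all volume on the ask side) to +1 (all on the bid side)
        'imbalance': (bid_depth - ask_depth) / (bid_depth + ask_depth),
    }


# Row 2 of the sample: Bid [[1, 6]], Ask [[800, 1]]
features = book_features([[1, 6]], [[800, 1]])
print(features)
```

Across the full frame this could be applied row-wise with `lob.apply(..., axis=1, result_type='expand')`, though a vectorised version may be needed at ~1M rows.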