## Feature Engineering: Price Discovery Features

## Features

- Weighted Mid-Price by Order Imbalance
- Volume Imbalance
- Bid Ask Spread
- Normalised Bid Ask Spread

__Load data__

__Data Source:__ lob_sample_data.parquet

In [1]:
import pandas as pd

df = pd.read_parquet('lob_sample_data.parquet', engine='pyarrow')

In [2]:
df.head()

Unnamed: 0,Timestamp,Exchange,Bid,Ask,Date,Mid_Price
0,0.0,Exch0,[],[],2025-01-02,
1,0.279,Exch0,"[[1, 6]]",[],2025-01-02,
2,1.333,Exch0,"[[1, 6]]","[[800, 1]]",2025-01-02,400.5
3,1.581,Exch0,"[[1, 6]]","[[799, 1]]",2025-01-02,400.0
4,1.643,Exch0,"[[1, 6]]","[[798, 1]]",2025-01-02,399.5


In [3]:
import ast

#Bid and Ask are stored as strings such as "[[1, 6]]"; parse them into lists of [price, volume]
df['Bid'] = df['Bid'].apply(ast.literal_eval)
df['Ask'] = df['Ask'].apply(ast.literal_eval)

__Weighted Mid-Price by Order Imbalance__

Since the **Mid Price** ($M$) is already in the data, we calculate the **Order Imbalance** ($I$) and the **Weighted Mid Price** ($M_W$) as follows:

**Order Imbalance**
With $V_B$ the total bid volume and $V_A$ the total ask volume, each summed over all levels in the book, the order imbalance ($I$) is:

$$ I = \frac{V_B - V_A}{V_B + V_A} $$

**Weighted Mid Price by Order Imbalance**
We adjust the mid price based on the order imbalance using the formula:

$$ M_W = M \times (1 + k \times I) $$

Here, $k$ is a scaling factor that controls how strongly the order imbalance shifts the mid price. We estimate it empirically by regressing the next-step change in mid price on the order imbalance.

With $k > 0$ the adjustment raises the mid price when bid volume dominates ($I > 0$) and lowers it when ask volume dominates ($I < 0$); a negative $k$ reverses the direction. In this way $M_W$ reflects the volume resting on each side of the book as well as the quoted prices.
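
As a quick worked example of the two formulas above, the sketch below computes $I$ and $M_W$ for a single book snapshot. The helper names `order_imbalance` and `weighted_mid_price`, the volumes, and the value `k = 0.001` are illustrative assumptions, not values taken from the dataset.

```python
def order_imbalance(volume_bid: float, volume_ask: float) -> float:
    """I = (V_B - V_A) / (V_B + V_A); NaN when both volumes are zero."""
    total = volume_bid + volume_ask
    if total == 0:
        return float("nan")
    return (volume_bid - volume_ask) / total


def weighted_mid_price(mid_price: float, imbalance: float, k: float) -> float:
    """M_W = M * (1 + k * I)."""
    return mid_price * (1 + k * imbalance)


# hypothetical snapshot: 6 units on the bid, 2 on the ask, mid price 400.0
imbalance = order_imbalance(6, 2)  # (6 - 2) / (6 + 2) = 0.5
# 400.0 * (1 + 0.001 * 0.5), i.e. roughly 400.2 (k is an assumed value)
adjusted = weighted_mid_price(400.0, imbalance, k=0.001)
print(imbalance, adjusted)
```

With more volume on the bid and a positive `k`, the adjusted price sits slightly above the plain mid price.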

In [5]:
#first, estimate k empirically from the data
import numpy as np

#next observation's mid price (the last row has no successor, so it becomes NaN)
df['Mid_Price_Future'] = df['Mid_Price'].shift(-1)

#change in mid price from this observation to the next
df['Price_Change'] = df['Mid_Price_Future'] - df['Mid_Price']

def calculate_order_imbalance(row):
    #imbalance is only defined when both sides of the book have orders
    if row['Bid'] and row['Ask']:
        #each level is [price, volume]; sum the volume over all levels
        volume_bid = sum(bid[1] for bid in row['Bid'])
        volume_ask = sum(ask[1] for ask in row['Ask'])
        return (volume_bid - volume_ask) / (volume_bid + volume_ask)
    return np.nan

#calc order imbalance for each row
df['Order_Imbalance'] = df.apply(calculate_order_imbalance, axis=1)

In [6]:
#estimate k by regressing the next-step price change on order imbalance

from sklearn.linear_model import LinearRegression

#drop rows where either variable is missing
filtered_df = df.dropna(subset=['Order_Imbalance', 'Price_Change'])

#double brackets keep X two-dimensional, as scikit-learn expects
X = filtered_df[['Order_Imbalance']].values
y = filtered_df['Price_Change'].values

#fit linear regression model (with intercept)
model = LinearRegression()
model.fit(X, y)

#slope on 'Order_Imbalance' is used as 'k'
#note: y is an absolute price change, so this k is in price units per unit of imbalance
k = model.coef_[0]

print(f"Derived k value: {k}")

Derived k value: -0.10710202645698139
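
Because $k$ multiplies the mid price in $M_W = M \times (1 + k \times I)$, it acts as a unitless factor, so one option is to regress the relative price change $(M_{t+1} - M_t)/M_t$ on $I$ rather than the absolute change. The sketch below illustrates that variant on synthetic data only; the arrays, the seed, and `true_k` are made up for illustration, and `np.polyfit` stands in for `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data (illustrative only): imbalance in [-1, 1], mid prices near 400
imbalance = rng.uniform(-1, 1, size=500)
mid_price = 400 + rng.normal(0, 1, size=500)

# assumed "true" sensitivity used to generate the synthetic relative changes
true_k = 0.002
relative_change = true_k * imbalance + rng.normal(0, 1e-4, size=500)

# ordinary least squares with an intercept; polyfit returns [slope, intercept]
k_est, intercept = np.polyfit(imbalance, relative_change, deg=1)

# the estimated k is unitless, so it can be applied multiplicatively
weighted_mid = mid_price * (1 + k_est * imbalance)
print(f"estimated k: {k_est:.5f}")
```

On the real data, the relative change could be built as `df['Price_Change'] / df['Mid_Price']` before fitting; whether that improves the feature is an empirical question.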


In [7]:
#calc weighted mid price col

def calculate_weighted_mid_price(row, k):
    #mid price is missing: keep it as NaN
    if pd.isna(row['Mid_Price']):
        return row['Mid_Price']

    #M_W = M * (1 + k * I)
    return row['Mid_Price'] * (1 + k * row['Order_Imbalance'])

#apply row by row, passing k through to the function
df['Weighted_Mid_Price'] = df.apply(calculate_weighted_mid_price, k=k, axis=1)

__Volume Imbalance__

Volume Imbalance is a metric used to assess the balance between supply and demand in a market at any given time, based on order book data. It compares the total volume of buy orders (bids) to the total volume of sell orders (asks), providing insight into potential price movements.

**Formula**

The Volume Imbalance ($VI$) is calculated using the formula:

$$ VI = \frac{V_{\text{bids}} - V_{\text{asks}}}{V_{\text{bids}} + V_{\text{asks}}} $$

where:
- $V_{\text{bids}}$ is the total volume of all bid orders.
- $V_{\text{asks}}$ is the total volume of all ask orders.

**Interpretation**

- $VI$ ranges from -1 to 1.
- A $VI$ closer to 1 indicates a higher volume of bids relative to asks, suggesting upward pressure on price.
- A $VI$ closer to -1 indicates a higher volume of asks relative to bids, suggesting downward pressure on price.
- A $VI$ around 0 indicates that bid and ask volumes are roughly balanced, suggesting no clear directional pressure on price.
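
To make the interpretation above concrete, the sketch below evaluates $VI$ for three small hypothetical order books, each side given as a list of `[price, volume]` levels in the same layout as the `Bid` and `Ask` columns. The function name and the book contents are illustrative.

```python
def volume_imbalance(bids, asks):
    """VI = (V_bids - V_asks) / (V_bids + V_asks) over all [price, volume] levels."""
    v_bids = sum(volume for _, volume in bids)
    v_asks = sum(volume for _, volume in asks)
    total = v_bids + v_asks
    if total == 0:
        return 0.0  # empty book: treated as balanced here (a modelling choice)
    return (v_bids - v_asks) / total


bid_heavy = volume_imbalance([[99, 6], [98, 3]], [[101, 1]])   # (9 - 1) / 10
ask_heavy = volume_imbalance([[99, 1]], [[101, 4], [102, 5]])  # (1 - 9) / 10
balanced = volume_imbalance([[99, 5]], [[101, 5]])             # (5 - 5) / 10
print(bid_heavy, ask_heavy, balanced)
```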

In [10]:
def calculate_volume_imbalance(row):
    #calculate total bid volume
    total_bid_volume = sum([volume for _, volume in row['Bid']])
    
    #calculate total ask volume
    total_ask_volume = sum([volume for _, volume in row['Ask']])
    
    #guard against division by zero
    total_volume = total_bid_volume + total_ask_volume
    if total_volume == 0:
        return 0  #treat an empty book (no bids and no asks) as zero imbalance
    
    #calculate volume imbalance
    volume_imbalance = (total_bid_volume - total_ask_volume) / total_volume
    return volume_imbalance

#apply
df['Volume_Imbalance'] = df.apply(calculate_volume_imbalance, axis=1)

__Bid Ask Spread__

The bid-ask spread is the difference between the highest price buyers are willing to pay (**bid price**) and the lowest price sellers are willing to accept (**ask price**). We use it as an indicator of market liquidity and efficiency.

We calculate it by:

$$ \text{Bid-Ask Spread} = \text{Ask Price} - \text{Bid Price} $$

where:
- $\text{Ask Price}$ is the lowest price a seller is willing to accept.
- $\text{Bid Price}$ is the highest price a buyer is willing to pay.


A narrower (lower) spread often indicates a more liquid market, where transactions can occur more easily at prices close to the market's consensus value. A wider (higher) spread suggests lower liquidity, higher transaction costs, and potentially more volatility.
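
A small sketch of the spread calculation follows. It takes the best bid as the maximum bid price and the best ask as the minimum ask price, so it does not rely on the levels being sorted; the function name and the sample levels are hypothetical.

```python
import math


def bid_ask_spread(bids, asks):
    """Spread = lowest ask price - highest bid price; NaN if either side is empty."""
    if not bids or not asks:
        return math.nan
    best_bid = max(price for price, _ in bids)
    best_ask = min(price for price, _ in asks)
    return best_ask - best_bid


# hypothetical book with unsorted levels: best bid 100, best ask 102
spread = bid_ask_spread([[99, 4], [100, 2]], [[103, 1], [102, 5]])
print(spread)
```

If the levels are known to be sorted best-first, indexing the first entry gives the same result with less work.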

In [11]:
def calculate_bid_ask_spread(row):
    if not row['Bid'] or not row['Ask']:  #check if either list is empty
        return None 
    
    #extract highest bid price and lowest ask price
    highest_bid_price = row['Bid'][0][0]  #assuming first entry is the highest bid
    lowest_ask_price = row['Ask'][0][0]  #assuming first entry is the lowest ask
    
    #calculate the spread
    bid_ask_spread = lowest_ask_price - highest_bid_price
    return bid_ask_spread

#apply
df['Bid_Ask_Spread'] = df.apply(calculate_bid_ask_spread, axis=1)

__Normalised Bid Ask Spread__

In [13]:
#outliers should be removed first: min-max scaling is very sensitive to extreme values
#normalise: scale the spread to the range [0, 1]
from sklearn.preprocessing import MinMaxScaler

#initialise scaler
scaler = MinMaxScaler()

#NaN spreads are ignored when fitting and stay NaN after scaling
df['Normalised_Bid_Ask_Spread'] = scaler.fit_transform(df[['Bid_Ask_Spread']])
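
Min-max scaling maps each spread $s$ to $(s - s_{\min}) / (s_{\max} - s_{\min})$, so a single extreme spread compresses every other value towards 0. One way to limit that, sketched below on made-up numbers, is to clip the spread to a quantile range before scaling. The helper name `clipped_min_max` and the quantile bounds are assumptions, not part of the original analysis.

```python
import numpy as np


def clipped_min_max(values, lower_q=0.01, upper_q=0.99):
    """Clip to the [lower_q, upper_q] quantiles (ignoring NaN), then scale to [0, 1]."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanquantile(values, [lower_q, upper_q])
    clipped = np.clip(values, lo, hi)  # NaN entries stay NaN
    if hi == lo:
        return np.zeros_like(clipped)  # constant input: nothing to scale
    return (clipped - lo) / (hi - lo)


# hypothetical spreads: mostly 1-3 ticks, one extreme outlier, one missing value
spreads = np.array([1.0, 2.0, 3.0, 2.0] * 5 + [799.0, np.nan])

# a wide upper trim is used only because this toy sample is tiny
scaled = clipped_min_max(spreads, lower_q=0.0, upper_q=0.9)
print(scaled)
```

After clipping, the outlier maps to the top of the range instead of stretching it, so the ordinary spreads keep their resolution.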