# Feature Engineering

Combine all of the feature calculations into a single notebook and output the full feature set as a DataFrame.

### Import Libraries

In [1]:
#import required libraries
from utils import aws # used to create aws session and load parquet 
import pandas as pd
import numpy as np
import ast 
import dask.dataframe as dd

### Load LOB Data

This notebook currently runs on the sample dataset; going forward it will need to be run on the full LOB dataset.

In [2]:
#load sample lob from s3 to a dask dataframe
samp_lob_ddf = aws.load_s3_file_as_ddf("s3://dsmp-ol2/processed-data/lob_sample_data.parquet")

In [3]:
#compute the dask dataframe to a pandas dataframe
df = samp_lob_ddf.compute()

In [4]:
df.head()

Unnamed: 0,Timestamp,Exchange,Bid,Ask,Date,Mid_Price
0,0.0,Exch0,[],[],2025-01-02,
1,0.279,Exch0,"[[1, 6]]",[],2025-01-02,
2,1.333,Exch0,"[[1, 6]]","[[800, 1]]",2025-01-02,400.5
3,1.581,Exch0,"[[1, 6]]","[[799, 1]]",2025-01-02,400.0
4,1.643,Exch0,"[[1, 6]]","[[798, 1]]",2025-01-02,399.5


### Preprocessing

In [5]:
#convert string to lists
df['Bid'] = df['Bid'].apply(ast.literal_eval)
df['Ask'] = df['Ask'].apply(ast.literal_eval)
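
`ast.literal_eval` raises a `ValueError` when it is given an object that is already a list, so the conversion cell fails if it is run twice on the same dataframe. The sketch below is one possible way to make the step safe to re-run. `parse_levels` and the `toy` frame are hypothetical names used only for illustration; they are not part of the notebook or of `utils`.

```python
import ast
import pandas as pd

def parse_levels(value):
    """Return a list of [price, qty] levels from a string or an already-parsed list."""
    if isinstance(value, str):
        # Strings such as "[[1, 6]]" are parsed into nested Python lists.
        return ast.literal_eval(value)
    # Already parsed (for example when the cell is re-run): return a list copy.
    return list(value)

# Toy rows in the same shape as the sample LOB data shown above.
toy = pd.DataFrame({"Bid": ["[]", "[[1, 6]]"], "Ask": ["[]", "[[800, 1]]"]})
toy["Bid"] = toy["Bid"].apply(parse_levels)
toy["Ask"] = toy["Ask"].apply(parse_levels)

# Applying the helper a second time leaves the parsed lists unchanged.
toy["Bid"] = toy["Bid"].apply(parse_levels)
```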

In [6]:
#drop missing rows in mid price
df = df.dropna(subset=['Mid_Price']) # This assumes the number of missing mid prices is small
# TODO following EDA determine if this is the correct decision
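
As a starting point for the TODO above, the share of missing mid prices and where they occur can be checked before dropping them. This is a minimal sketch on a toy frame that copies the five sample rows shown earlier; `toy`, `missing_share` and `missing_after_open` are illustrative names, and the real check would be run on `df` before the `dropna` call.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the Timestamp and Mid_Price columns of the sample output.
toy = pd.DataFrame({
    "Timestamp": [0.0, 0.279, 1.333, 1.581, 1.643],
    "Mid_Price": [np.nan, np.nan, 400.5, 400.0, 399.5],
})

# Fraction of rows with no mid price.
missing_share = toy["Mid_Price"].isna().mean()

# Count missing values that occur after the first valid mid price.
# A count of zero means the gaps only appear while the book is still filling.
first_valid = toy["Mid_Price"].first_valid_index()
missing_after_open = toy.loc[first_valid:, "Mid_Price"].isna().sum()

print(f"missing share: {missing_share:.1%}, missing after first quote: {missing_after_open}")
```

If the share is small and the gaps are confined to the start of the session, dropping the rows is probably harmless; gaps in the middle of the session would deserve a closer look.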

## Feature Engineering

### Volume and OBV

- On-balance volume (OBV) is used to confirm trends by analysing volume changes. A persistently rising OBV suggests accumulation (buying pressure) and is associated with upward price movement; a persistently falling OBV suggests distribution (selling pressure). The theory is that changes in volume precede price movements: an increase in volume often precedes a change in price direction. OBV can be used to confirm trends, reversals and breakouts.


- If today's close is higher than yesterday's close, then OBV = Previous OBV + today's volume.
- If today's close is lower than yesterday's close, then OBV = Previous OBV - today's volume.
- If today's close is equal to yesterday's close, then OBV = Previous OBV.

In [7]:
#calculate order volume at each timestep
#note: this is the total quantity resting on both sides of the book, not traded volume
df['Volume'] = df['Bid'].apply(lambda levels: sum(qty for _, qty in levels)) + \
               df['Ask'].apply(lambda levels: sum(qty for _, qty in levels))

#calc obv
#obv is normally computed from close prices; mid price is used as an approximation, which may be less reliable
#direction is +1 when mid price rises, -1 when it falls, 0 otherwise (including the first row)
direction = np.sign(df['Mid_Price'].diff()).fillna(0).astype(int)
df['OBV'] = (direction * df['Volume']).cumsum()
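
To make the three update rules concrete, here is a small worked example on made-up numbers. It applies the rules per timestep with the mid price standing in for the close, as in the cell above. `np.sign` is used here to get the direction, and the `toy` values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up mid prices and volumes for four consecutive timesteps.
toy = pd.DataFrame({
    "Mid_Price": [400.5, 400.0, 400.0, 401.0],
    "Volume": [7, 7, 8, 9],
})

# +1 if the price rose, -1 if it fell, 0 if unchanged.
# The first row has no previous price, so its direction is set to 0.
direction = np.sign(toy["Mid_Price"].diff()).fillna(0)

# Signed volume per step is 0, -7, 0, +9; the running total is the OBV.
toy["OBV"] = (direction * toy["Volume"]).cumsum()

print(toy)
```

The price falls at the second step (OBV drops by 7), is unchanged at the third (OBV stays put) and rises at the fourth (OBV gains 9), giving an OBV series of 0, -7, -7, 2.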

### Order Book Imbalance

Order book imbalance reflects the proportion of buy to sell orders and can indicate potential price movements based on supply and demand dynamics.

It is calculated as:

$$
\text{Imbalance} = \frac{Q_{\text{bid}} - Q_{\text{ask}}}{Q_{\text{bid}} + Q_{\text{ask}}}
$$

where:
- $Q_{\text{bid}}$ is the total quantity of buy orders,
- $Q_{\text{ask}}$ is the total quantity of sell orders.

Positive imbalance suggests a predominance of buy orders, which could indicate upward pressure on prices, while a negative imbalance suggests the opposite.

In [8]:
#calc total quantity
def total_quantity(price_qty_list):
    if not price_qty_list:
        return 0
    total_qty = sum(qty for _, qty in price_qty_list)
    return total_qty

#calc total bid ask quantities
df['Total_Bid_Qty'] = df['Bid'].apply(total_quantity)
df['Total_Ask_Qty'] = df['Ask'].apply(total_quantity)

#calc order book imbalance
df['Order_Book_Imbalance'] = (df['Total_Bid_Qty'] - df['Total_Ask_Qty']) / (df['Total_Bid_Qty'] + df['Total_Ask_Qty'])
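
The imbalance is bounded between -1 (only ask quantity) and +1 (only bid quantity), and it is undefined when both sides of the book are empty. The sketch below works through those cases on made-up quantities and masks the zero-denominator case explicitly so that an empty book yields `NaN`. `order_book_imbalance` and `toy` are hypothetical names for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def order_book_imbalance(bid_qty, ask_qty):
    """Return (bid - ask) / (bid + ask), with NaN where both sides are empty."""
    total = bid_qty + ask_qty
    # Series.where keeps values where the condition holds and puts NaN elsewhere,
    # so rows with a zero denominator produce NaN.
    return (bid_qty - ask_qty) / total.where(total != 0)

# Made-up totals: bid-heavy, bid only, empty book, ask-heavy.
toy = pd.DataFrame({
    "Total_Bid_Qty": [6, 6, 0, 2],
    "Total_Ask_Qty": [1, 0, 0, 6],
})
toy["Order_Book_Imbalance"] = order_book_imbalance(
    toy["Total_Bid_Qty"], toy["Total_Ask_Qty"]
)

print(toy)
```

The four rows give 5/7 (about 0.714), 1.0, `NaN` and -0.5 respectively.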