## Construction of Order Flow Imbalances (OFI) Features

### Objective
Evaluate the candidate’s understanding of price impact models, Order Flow Imbalance (OFI) construction, and the ability to extend OFI concepts across different order book granularities and cross-asset relationships. This skill is foundational to our research efforts.

### Resources Provided
- Dataset: `first_25000_rows.csv`
- Research Paper: [Cross-impact of order flow imbalance in equity markets](https://www.tandfonline.com/doi/full/10.1080/14697688.2023.2236159#d1e1), by R. Cont, M. Cucuringu and C. Zhang, *Quantitative Finance* **23** (2023), 1373–1393

### Task Instructions
You must construct the following OFI features (your output should produce one value per chosen timestamp):
- Best-Level OFI
- Multi-Level OFI
- Integrated OFI
- Cross-Asset OFI


In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

In [2]:
df = pd.read_csv("first_25000_rows.csv")
df["ts_recv"] = pd.to_datetime(df["ts_recv"])
df["ts_event"] = pd.to_datetime(df["ts_event"])
df = df.sort_values(by="ts_event")
df

Unnamed: 0,ts_recv,ts_event,rtype,publisher_id,instrument_id,action,side,depth,price,size,...,ask_sz_08,bid_ct_08,ask_ct_08,bid_px_09,ask_px_09,bid_sz_09,ask_sz_09,bid_ct_09,ask_ct_09,symbol
0,2024-10-21 11:54:29.221230963+00:00,2024-10-21 11:54:29.221064336+00:00,10,2,38,C,B,1,233.62,2,...,155,1,7,233.25,234.13,55,400,2,1,AAPL
1,2024-10-21 11:54:29.223936626+00:00,2024-10-21 11:54:29.223769812+00:00,10,2,38,A,B,0,233.67,2,...,155,1,7,233.25,234.13,55,400,2,1,AAPL
2,2024-10-21 11:54:29.225196809+00:00,2024-10-21 11:54:29.225030400+00:00,10,2,38,A,B,0,233.67,3,...,155,1,7,233.25,234.13,55,400,2,1,AAPL
3,2024-10-21 11:54:29.712600612+00:00,2024-10-21 11:54:29.712434212+00:00,10,2,38,A,B,2,233.52,200,...,155,1,7,233.25,234.13,55,400,2,1,AAPL
4,2024-10-21 11:54:29.764839221+00:00,2024-10-21 11:54:29.764673165+00:00,10,2,38,C,B,2,233.52,200,...,155,1,7,233.25,234.13,55,400,2,1,AAPL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,2024-10-21 13:04:16.583694069+00:00,2024-10-21 13:04:16.583527688+00:00,10,2,38,A,B,2,233.46,200,...,105,1,2,233.25,234.50,55,63,2,4,AAPL
4996,2024-10-21 13:04:17.976627074+00:00,2024-10-21 13:04:17.976461017+00:00,10,2,38,A,A,1,233.69,200,...,105,1,2,233.25,234.50,55,63,2,4,AAPL
4997,2024-10-21 13:04:20.085804687+00:00,2024-10-21 13:04:20.085638629+00:00,10,2,38,C,B,2,233.46,200,...,105,2,2,233.24,234.50,1,63,1,4,AAPL
4998,2024-10-21 13:04:20.085817362+00:00,2024-10-21 13:04:20.085651109+00:00,10,2,38,A,B,3,233.44,200,...,105,1,2,233.25,234.50,55,63,2,4,AAPL


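Compute the **bid order flows** $(\mathrm{OF}_{i, n}^{m, b})$ and **ask order flows** $(\mathrm{OF}_{i, n}^{m, a})$ at each level $m$ by comparing the price and size after every order book update $n$ with those after the previous update. The dataset provides 10 order book levels (columns suffixed `00` to `09`).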
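
For reference, the piecewise rule that the next cell encodes with `np.select` can be written out as below. This is transcribed from the code in this notebook rather than quoted from the paper, so the paper remains the authority on the exact definition. Here $P_{i, n}^{m, b}$ and $q_{i, n}^{m, b}$ denote the level-$m$ bid price and size after event $n$, and the ask side is analogous.

$$\mathrm{OF}_{i, n}^{m, b}=\begin{cases} q_{i, n}^{m, b}, & P_{i, n}^{m, b}>P_{i, n-1}^{m, b}, \\ q_{i, n}^{m, b}-q_{i, n-1}^{m, b}, & P_{i, n}^{m, b}=P_{i, n-1}^{m, b}, \\ -q_{i, n}^{m, b}, & P_{i, n}^{m, b}<P_{i, n-1}^{m, b}, \end{cases}\qquad \mathrm{OF}_{i, n}^{m, a}=\begin{cases} -q_{i, n}^{m, a}, & P_{i, n}^{m, a}>P_{i, n-1}^{m, a}, \\ q_{i, n}^{m, a}-q_{i, n-1}^{m, a}, & P_{i, n}^{m, a}=P_{i, n-1}^{m, a}, \\ q_{i, n}^{m, a}, & P_{i, n}^{m, a}<P_{i, n-1}^{m, a}. \end{cases}$$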

In [3]:
M = 10  # number of order book levels in the dataset
for i in range(M):
    idx = f"{i:02d}"
    bid_price = df[f"bid_px_{idx}"]
    ask_price = df[f"ask_px_{idx}"]
    bid_size = df[f"bid_sz_{idx}"]
    ask_size = df[f"ask_sz_{idx}"]

    # State of the same level after the previous order book event
    bid_price_prev = bid_price.shift(1)
    ask_price_prev = ask_price.shift(1)
    bid_size_prev = bid_size.shift(1)
    ask_size_prev = ask_size.shift(1)

    # Three cases per side: price moved up, price unchanged, price moved down.
    # On the first row every comparison with NaN is False, so np.select
    # falls back to its default of 0.
    bid_of = np.select([bid_price > bid_price_prev, bid_price == bid_price_prev, bid_price < bid_price_prev],
                       [bid_size, bid_size - bid_size_prev, -bid_size])
    ask_of = np.select([ask_price > ask_price_prev, ask_price == ask_price_prev, ask_price < ask_price_prev],
                       [-ask_size, ask_size - ask_size_prev, ask_size])
    df[f"bid_of_{idx}"] = bid_of
    df[f"ask_of_{idx}"] = ask_of
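
As a quick sanity check of the three cases, the same `np.select` logic can be run on a hand-made four-event book. This is a minimal sketch: the helper name `order_flow` and all prices and sizes are made up for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd


def order_flow(price: pd.Series, size: pd.Series, side: str) -> np.ndarray:
    """Per-event order flow for one book level; side is "bid" or "ask"."""
    price_prev, size_prev = price.shift(1), size.shift(1)
    conditions = [price > price_prev, price == price_prev, price < price_prev]
    if side == "bid":
        choices = [size, size - size_prev, -size]
    else:
        choices = [-size, size - size_prev, size]
    # The first row has no predecessor: comparisons with NaN are all False,
    # so np.select returns its default of 0 there.
    return np.select(conditions, choices, default=0)


# Hypothetical toy data: four consecutive updates of one level
bid_px = pd.Series([100.0, 100.0, 100.1, 100.0])
bid_sz = pd.Series([10, 15, 5, 20])
ask_px = pd.Series([100.2, 100.2, 100.3, 100.1])
ask_sz = pd.Series([8, 6, 4, 9])

bid_of = order_flow(bid_px, bid_sz, "bid")  # [0, 5, 5, -20]
ask_of = order_flow(ask_px, ask_sz, "ask")  # [0, -2, -4, 9]
ofi_per_event = bid_of - ask_of             # [0, 7, 9, -29]
```

Reading the bid side row by row: the size grows by 5 at an unchanged price, then the price improves with 5 shares posted, then the price drops back to a level holding 20 shares.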

Recall that the **best-level OFI** is defined as $$\mathrm{OFI}_{i, t}^{1, h}\coloneqq\sum_{n=N(t-h)+1}^{N(t)} \left[\mathrm{OF}_{i, n}^{1, b}-\mathrm{OF}_{i, n}^{1, a}\right],$$
where $N(t-h)+1$ and $N(t)$ are the indices of the first and last order book events in the interval $(t-h, t]$. Level 1 corresponds to the columns suffixed `00`, and the cell below uses $h = 1$ minute.

In [4]:
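# Index by event time so that events can be binned into fixed time windows
df_ofi = df.copy()
df_ofi.set_index("ts_event", inplace=True)
df_ofi.index.name = None
# Best-level OFI: per-event (bid flow - ask flow) at level 00, summed per minute
best_level_ofi = df_ofi["bid_of_00"].sub(df_ofi["ask_of_00"]).resample("1min").sum()
best_level_ofi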

2024-10-21 11:54:00+00:00    -531.0
2024-10-21 11:55:00+00:00   -1217.0
2024-10-21 11:56:00+00:00     334.0
2024-10-21 11:57:00+00:00     960.0
2024-10-21 11:58:00+00:00     422.0
                              ...  
2024-10-21 13:00:00+00:00   -1007.0
2024-10-21 13:01:00+00:00   -1135.0
2024-10-21 13:02:00+00:00   -5575.0
2024-10-21 13:03:00+00:00     -29.0
2024-10-21 13:04:00+00:00    -196.0
Freq: min, Length: 71, dtype: float64
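
One detail worth checking is the binning convention. By default, `resample("1min")` uses left-closed bins $[t, t+h)$ labelled by their left edge, whereas the definition sums over $(t-h, t]$. The sketch below, on three made-up events, shows how `closed="right", label="right"` would line the bins up with the formula. Whether that convention is preferable for this exercise is a judgment call, and it would shift the labels in the output above by one minute.

```python
import pandas as pd

# Hypothetical per-event OFI contributions (values are made up)
events = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime([
        "2024-10-21 11:54:30",
        "2024-10-21 11:55:00",  # falls exactly on a bin edge
        "2024-10-21 11:55:10",
    ]),
)

# Default convention: bins [t, t + 1min), labelled by the left edge
left_closed = events.resample("1min").sum()

# Convention of the formula: bins (t - 1min, t], labelled by the right edge t
right_closed = events.resample("1min", closed="right", label="right").sum()
```

In this toy case the event stamped exactly 11:55:00 is counted in the 11:55 bin under both conventions, but that bin covers $[11{:}55, 11{:}56)$ by default and $(11{:}54, 11{:}55]$ with the right-closed option.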

The **multi-level OFI** is given by $$\textbf{ofi}_{i, t}^{(h)}=\left(\mathrm{ofi}_{i, t}^{1, h}, \ldots, \mathrm{ofi}_{i, t}^{10, h}\right)^\intercal,\quad \mathrm{ofi}_{i, t}^{m, h}=\frac{\mathrm{OFI}_{i, t}^{m, h}}{Q_{i, t}^{M, h}},\quad m=1,2,\ldots,10,$$ where the OFI at level $m$ is $$\mathrm{OFI}_{i, t}^{m, h}\coloneqq\sum_{n=N(t-h)+1}^{N(t)} \left[\mathrm{OF}_{i, n}^{m, b}-\mathrm{OF}_{i, n}^{m, a}\right]$$ and the average order book depth over the first $M = 10$ levels is $$Q_{i, t}^{M, h}=\frac{1}{M} \sum_{m=1}^M \frac{1}{2 \Delta N(t)} \sum_{n=N(t-h)+1}^{N(t)}\left[q_{i, n}^{m, b}+q_{i, n}^{m, a}\right],\quad \Delta N(t) = N(t) - N(t-h).$$

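In [5]:
# Average depth Q: mean over the events in each minute, then mean over the
# 20 size columns (10 levels x bid/ask), which already carries the 1/(2M) factor
sq = df_ofi.loc[:, df_ofi.columns.str.contains("sz")].resample("1min").mean().mean(axis=1)

df_result = pd.DataFrame(index=best_level_ofi.index)
multi_level_ofi = pd.DataFrame(index=best_level_ofi.index)
for i in range(M):
    idx = f"{i:02d}"
    # Raw OFI at this level: per-event (bid flow - ask flow), summed per minute
    df_result[f"ofi_{idx}"] = df_ofi[f"bid_of_{idx}"].sub(df_ofi[f"ask_of_{idx}"]).resample("1min").sum()
    # Depth-normalized OFI. Since sq already equals Q, the leading 2 rescales
    # every level by the same constant relative to OFI / Q.
    multi_level_ofi[f"{idx}"] = 2 * df_result[f"ofi_{idx}"] / sq

multi_level_ofi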

Unnamed: 0,00,01,02,03,04,05,06,07,08,09
2024-10-21 11:54:00+00:00,-9.841602,6.412795,10.045477,-2.891318,-1.130580,10.749773,0.815500,-4.577920,12.547579,10.935114
2024-10-21 11:55:00+00:00,-22.409154,32.536545,33.530871,-4.345571,-6.997106,13.589117,-14.528203,2.964564,25.557852,-10.679794
2024-10-21 11:56:00+00:00,6.095972,-10.458060,34.622931,-10.677077,11.899921,-4.964384,-1.168090,15.513701,-2.628204,4.617608
2024-10-21 11:57:00+00:00,17.021995,11.578503,5.939967,3.634905,-31.295647,17.483008,-8.936547,-6.720142,29.557985,-4.166843
2024-10-21 11:58:00+00:00,7.353625,2.788104,16.066450,16.815753,-7.214219,-17.599907,18.035548,-0.017426,14.167054,35.827138
...,...,...,...,...,...,...,...,...,...,...
2024-10-21 13:00:00+00:00,-10.904460,8.608785,-3.714230,-28.132858,25.934640,-19.578217,25.555637,9.973196,-11.684124,6.811227
2024-10-21 13:01:00+00:00,-12.070833,-39.328583,-30.607804,15.973913,3.413866,0.000000,13.081167,9.582221,-9.858733,23.652452
2024-10-21 13:02:00+00:00,-83.037360,-72.611144,-75.530484,-2.055454,-3.783227,-17.962880,-21.418426,-27.584787,48.660638,-30.802020
2024-10-21 13:03:00+00:00,-0.515086,-21.225081,28.436280,24.706350,20.283718,28.986889,36.659889,-13.623127,10.426044,37.920961
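
To see how the depth normalizer relates to the formula, the sketch below evaluates $Q_{i, t}^{M, h}$ term by term on a made-up two-level book and compares it with the shortcut of averaging all size columns within the minute. The sizes, the name `M_toy`, and the choice of two levels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical book with M = 2 levels and two events in the same minute
idx = pd.to_datetime(["2024-10-21 12:00:10", "2024-10-21 12:00:40"])
toy = pd.DataFrame(
    {
        "bid_sz_00": [10, 20], "ask_sz_00": [30, 40],
        "bid_sz_01": [50, 60], "ask_sz_01": [70, 80],
    },
    index=idx,
)
M_toy = 2
n_events = len(toy)  # Delta N(t) for this single window

# Term-by-term evaluation of Q: (1/M) * sum_m (1/(2*dN)) * sum_n [q_b + q_a]
q_direct = sum(
    (toy[f"bid_sz_{m:02d}"] + toy[f"ask_sz_{m:02d}"]).sum() / (2 * n_events)
    for m in range(M_toy)
) / M_toy

# Shortcut: per-column mean within the minute, then mean across all columns
q_shortcut = toy.resample("1min").mean().mean(axis=1).iloc[0]
```

Both routes give 45 here, so in this toy case the column-mean shortcut reproduces $Q$ exactly, and dividing the raw OFI by it yields $\mathrm{OFI}/Q$ with no further factor. Any additional constant multiplier would rescale all levels uniformly.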


The **integrated OFI** is defined by $$\text{ofi}_{i, t}^{I, h}=\frac{\boldsymbol{w}_1^\intercal \textbf{ofi}_{i, t}^{(h)}}{\left\|\boldsymbol{w}_1\right\|_1},$$ where $\boldsymbol{w}_1$ is the first principal vector computed from historical data.

In [6]:
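# First principal component of the multi-level OFI vectors (fitted in-sample)
pca = PCA(n_components=1)
pca.fit(multi_level_ofi.values)
w1 = pca.components_[0]
# Normalize the weights by their L1 norm, as in the definition above
w1_hat = w1 / np.sum(np.abs(w1))
# Project each minute's multi-level OFI vector onto the normalized weights
integrated_ofi = pd.Series(np.dot(multi_level_ofi.values, w1_hat), index=best_level_ofi.index)
integrated_ofi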

2024-10-21 11:54:00+00:00    -1.694526
2024-10-21 11:55:00+00:00    -3.342134
2024-10-21 11:56:00+00:00   -13.562487
2024-10-21 11:57:00+00:00    10.773180
2024-10-21 11:58:00+00:00    -0.160959
                               ...    
2024-10-21 13:00:00+00:00   -18.202884
2024-10-21 13:01:00+00:00     9.089961
2024-10-21 13:02:00+00:00    15.342357
2024-10-21 13:03:00+00:00    -2.699041
2024-10-21 13:04:00+00:00     4.127753
Freq: min, Length: 71, dtype: float64
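
A caveat on interpretation: a principal vector is only defined up to sign. In the outputs above, the 13:02 bin has strongly negative OFI at the first three levels but a positive integrated OFI, which suggests that $\boldsymbol{w}_1$ came out with predominantly negative weights. The sketch below shows one possible convention, flipping the vector so that its weights sum to a positive number, on synthetic data. The data, the loadings, and the convention itself are assumptions, and the first right-singular vector of the centered matrix stands in for `pca.components_[0]`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the multi-level OFI matrix: one common factor
# loading positively on 10 levels, plus small noise (all values made up)
factor = rng.normal(size=(200, 1))
loadings = np.linspace(1.0, 0.5, 10)
X = factor * loadings + 0.1 * rng.normal(size=(200, 10))

# First principal vector via SVD of the centered data (equal to the first
# PCA component up to sign)
_, _, vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
w1 = vt[0]

# Assumed sign convention: make the weights sum to a positive number
if w1.sum() < 0:
    w1 = -w1

w1_hat = w1 / np.abs(w1).sum()  # L1 normalization, as in the definition
integrated = X @ w1_hat         # positive when buy pressure dominates
```

With this convention a positive integrated OFI corresponds to net buying pressure across levels, which matches the sign of the per-level OFIs.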

**Cross-asset OFI** cannot be constructed from this dataset, because it contains only one asset.
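
If data for several symbols were available, one plausible way to organize the inputs would be a time-by-symbol matrix of integrated OFIs, from which each asset's own OFI and the OFIs of the other assets can be read off as features. The sketch below is purely illustrative: the symbols other than AAPL, the column names, and the random values are hypothetical, and it does not reproduce the estimation procedure of the paper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
times = pd.date_range("2024-10-21 12:00", periods=5, freq="1min")
symbols = ["AAPL", "MSFT", "NVDA"]  # hypothetical universe

# Long table: one integrated OFI value per (minute, symbol); values are random
long = pd.DataFrame(
    {
        "ts": times.repeat(len(symbols)),
        "symbol": np.tile(symbols, len(times)),
        "integrated_ofi": rng.normal(size=len(times) * len(symbols)),
    }
)

# Wide table: rows are minutes, columns are symbols
wide = long.pivot(index="ts", columns="symbol", values="integrated_ofi")

target = "AAPL"
self_ofi = wide[target]                  # the asset's own integrated OFI
cross_ofi = wide.drop(columns=target)    # OFIs of all other assets
```

The columns of `cross_ofi` would then serve as the cross-asset features for the target asset at each timestamp.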