# Market Depth View, Market/Limit Orders

This notebook builds a market depth view of the order book and approximates market, limit and cancelled order messages from it. It is data preparation, not feature engineering.

- Raw order book data holds 15 levels (indexed 0 to 14) of ask/bid sizes and rates under fixed column names.

- The market depth view transforms the data so that the columns form a fixed axis of price levels. Axes: time (rows) by price (columns); values: order size.
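
As a rough illustration of the transformation described above, the sketch below places one toy tick on a fixed price ladder. The prices, sizes and the names `ladder`, `ask_levels` and `bid_levels` are made up for the example; only the 0.5 price step is taken from this notebook.

```python
import numpy as np

tick_size = 0.5   # assumed price increment, as in the 0.5 step used for rate_range
ladder = np.arange(99.0, 101.5 + tick_size, tick_size)   # toy price axis: 99.0 ... 101.5

# One toy tick of raw book data: (price, size) per level
ask_levels = [(100.5, 3.0), (101.0, 7.0)]
bid_levels = [(100.0, 5.0), (99.5, 2.0)]

# Depth view: one slot per price on the ladder, zero where no order rests
depth = np.zeros(len(ladder))
for price, size in ask_levels + bid_levels:
    col = int(round((price - ladder[0]) / tick_size))   # price -> column position
    depth[col] = size

print(dict(zip(ladder.tolist(), depth.tolist())))
```

Each tick becomes one row of this kind, so a given price always sits in the same column regardless of which book level it occupied.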

From the market depth view we can estimate the sizes of market orders, limit orders and cancelled orders.


### Table of Contents

1. [Load Raw Training Data](#1)

2. [Create Market Depth View of Limit Order Book](#2)
  
3. [Approximate Market, Limit and Cancelled Orders](#3)

4. [Results Summary](#4)

In [2]:
import numpy as np
import pandas as pd
import feather
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style = 'whitegrid')

%matplotlib inline

<a id="1" ></a>
# Load Raw Training Data

In [3]:
t_raw = feather.read_dataframe('../data-training.feather')

In [4]:
t_raw.head()

Unnamed: 0,askRate0,askRate1,askRate2,askRate3,askRate4,askRate5,askRate6,askRate7,askRate8,askRate9,...,bidSize6,bidSize7,bidSize8,bidSize9,bidSize10,bidSize11,bidSize12,bidSize13,bidSize14,y
0,1619.5,1620.0,1621.0,,,,,,,,...,20.0,27.0,11.0,14.0,35.0,10.0,1.0,10.0,13.0,-0.5
1,1619.5,1620.0,1621.0,1621.5,,,,,,,...,20.0,27.0,11.0,14.0,35.0,10.0,1.0,10.0,13.0,-0.5
2,1619.5,1620.0,1621.0,1621.5,1622.0,,,,,,...,20.0,27.0,11.0,14.0,35.0,10.0,1.0,10.0,13.0,-0.5
3,1619.5,1620.0,1621.0,1621.5,1622.0,,,,,,...,20.0,27.0,11.0,14.0,35.0,10.0,1.0,10.0,13.0,-0.5
4,1619.5,1620.0,1621.0,1621.5,1622.0,,,,,,...,20.0,27.0,11.0,14.0,35.0,10.0,1.0,10.0,13.0,-0.5


In [5]:
t_book = t_raw.drop(columns='y')
t_target = t_raw['y']

<a id="2" ></a>
# Create Market Depth View of Limit Order Book
This shows how limit order positions change along the price ladder.

In [6]:
# column names
rate_col_names = [c for c in t_book.columns if 'askRate' in c][::-1]\
               + [c for c in t_book.columns if 'bidRate' in c]
size_col_names = [c for c in t_book.columns if 'askSize' in c][::-1]\
               + [c for c in t_book.columns if 'bidSize' in c]

# rate and size columns index
rate_iloc = [i for i, c in enumerate(t_book.columns) if c in rate_col_names]
size_iloc = [i for i, c in enumerate(t_book.columns) if c in size_col_names]

# rate range for the given book data
min_rate = t_book.loc[:,rate_col_names].min().min()
max_rate = t_book.loc[:,rate_col_names].max().max()
rate_range = np.arange(min_rate,max_rate+0.5,0.5)

# number of data given
n_tick = len(t_book)

Save the parameters for future use.

In [8]:
import pickle

params_raw_data = {'rate_col_names':rate_col_names, 'size_col_names':size_col_names,
                   'rate_iloc':rate_iloc, 'size_iloc':size_iloc,
                   'rate_range':rate_range, 'n_tick':n_tick}

# use a context manager so the file is closed after writing
with open('params_raw_data.pkl','wb') as file:
    pickle.dump(params_raw_data, file)

<a id="2.1" ></a>

A function converts one tick of data; the whole book is then processed in parallel.

In [6]:
import multiprocessing as mp

# create market depth view of one tick of order info
def depth_of_market_onetick(tick):
    rate = tick[rate_iloc]
    size = tick[size_iloc]
    rate = [r for r in rate if not np.isnan(r)]
    size = [s for s in size if not np.isnan(s)]
    
    t  = np.zeros(len(rate_range))
    t[[int((r-rate_range[0])*2) for r in rate]] = size
    
    return t

# create market depth view for entire order book
def depth_of_market_book(book):
    pool = mp.Pool(mp.cpu_count())
    depth_of_market = pool.map(depth_of_market_onetick, [t for t in np.array(book)])
    pool.close()

    return depth_of_market
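
As an alternative to mapping a per-tick function over a process pool, the same depth view could be filled in one pass with NumPy fancy indexing. This is only a sketch under assumptions: `depth_view_vectorized` is a hypothetical helper, the rate and size arrays are assumed to be aligned column by column, and the price step is assumed to be 0.5 as in `rate_range`.

```python
import numpy as np

def depth_view_vectorized(rates, sizes, min_rate, n_levels, tick=0.5):
    """rates, sizes: 2-D float arrays (ticks x book levels), NaN where a level is empty."""
    depth = np.zeros((rates.shape[0], n_levels))
    valid = ~np.isnan(rates) & ~np.isnan(sizes)          # keep only populated levels
    rows, cols = np.nonzero(valid)
    # map each price to its column on the fixed price axis
    price_idx = np.rint((rates[rows, cols] - min_rate) / tick).astype(int)
    depth[rows, price_idx] = sizes[rows, cols]
    return depth

# toy input: two ticks, four book levels, price axis 100.0, 100.5, 101.0
toy_rates = np.array([[100.5, 101.0, np.nan, 100.0],
                      [101.0, np.nan, np.nan, 100.5]])
toy_sizes = np.array([[3.0, 7.0, np.nan, 5.0],
                      [4.0, np.nan, np.nan, 6.0]])
toy_depth = depth_view_vectorized(toy_rates, toy_sizes, min_rate=100.0, n_levels=3)
print(toy_depth)
```

Whether this is faster than the pool on the full book has not been measured here; it mainly avoids pickling every tick across processes.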

<a id="2.2" ></a>
Create depth view of subsets and merge

In [23]:
from tqdm import tqdm

# split the order book into 300 subsets
for i in tqdm(range(300)):
    bgn = int(i*1e4)      # cast to int: 1e4 is a float and the index labels are integers
    end = int((i+1)*1e4)
        
    pkl_filename = "depth_book_"+str(i)+".pkl"
    with open(pkl_filename, 'wb') as file:
        pickle.dump(depth_of_market_book(t_book.loc[bgn:end-1]), file)



In [28]:
# merge the results of 300 subsets
depth_book = []
for i in tqdm(range(300)):
    depth_book.extend(pickle.load(open( "depth_book_"+str(i)+".pkl", "rb" ) ))

pkl_filename = "depth_book.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(depth_book, file)

2999999


Assign index labels: columns are price levels, rows are ticks.

In [33]:
# convert depth book to dataframe and save to feather for faster reading
t_depth_book = pd.DataFrame(depth_book, columns=rate_range)

import feather
feather.write_dataframe(t_depth_book, 'depth_book.feather')

In [7]:
t_depth_book = feather.read_dataframe('depth_book.feather')
t_depth_book.head()

Unnamed: 0,1565.0,1565.5,1566.0,1566.5,1567.0,1567.5,1568.0,1568.5,1569.0,1569.5,...,1711.5,1712.0,1712.5,1713.0,1713.5,1714.0,1714.5,1715.0,1715.5,1716.0
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<a id="3" ></a>
<a id="3.1" ></a>
# Approximate Market, Limit and Cancelled Orders

Based on the position changes at each bid/ask level, we can estimate how many market and limit orders have been booked or cancelled since the last tick. In real-life HFT, order messages should be available directly.


The approximation makes two assumptions:
- all orders that push the top bid/ask prices further from the mid-price are market orders
- position changes beyond the new top bid/ask prices are caused by limit orders or cancellations

As a consequence:
- limit orders at the top ask/bid prices that fill the top-level ask/bid the way market orders do are ignored
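
To make the assumptions concrete, here is a small worked example on a toy ladder of eight price slots. `split_changes` is a hypothetical restatement written for this illustration, not the notebook's own function, and all sizes are invented. The best bid drops from slot 3 to slot 2, one unit disappears deeper in the bid side, and two units appear on the ask side.

```python
import numpy as np

def split_changes(chg, this_top, last_top):
    """Toy split of size changes; tops are (ask_idx, bid_idx) on the price axis."""
    chg = chg.astype(float).copy()
    bid_zone = slice(this_top[1], last_top[1] + 1)   # new best bid up to old best bid
    ask_zone = slice(last_top[0], this_top[0] + 1)   # old best ask up to new best ask
    hit_bids = np.clip(-chg[bid_zone], 0, None)      # size removed from bids, read as market orders
    hit_asks = np.clip(-chg[ask_zone], 0, None)      # size removed from asks, read as market orders
    chg[bid_zone] += hit_bids                        # what remains is limit or cancel activity
    chg[ask_zone] += hit_asks
    limit = np.clip(chg, 0, None)                    # remaining increases
    cancel = np.clip(-chg, 0, None)                  # remaining decreases
    return hit_bids.sum(), hit_asks.sum(), limit, cancel

last_depth = np.array([0, 3, 5, 4, 6, 3, 0, 0])   # best bid at slot 3, best ask at slot 4
this_depth = np.array([0, 2, 5, 0, 6, 5, 0, 0])   # best bid now at slot 2
m_bid_side, m_ask_side, limit, cancel = split_changes(
    this_depth - last_depth, this_top=(4, 2), last_top=(4, 3))
print(m_bid_side, m_ask_side, limit, cancel)
```

The four units that vanished at the old best bid are read as market orders, the drop of one unit at slot 1 as a cancellation, and the gain of two units at slot 5 as a limit order.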

In [74]:
# position change since last tick
t_order_chg = t_depth_book.diff()

# first ask and bid price index on market depth book column axis
t_top_level_iloc = t_book.loc[:,['askRate0','bidRate0']].add(-min_rate).mul(2).astype(int)

In [9]:
t_top_level_iloc.head()

Unnamed: 0,askRate0,bidRate0
0,109,100
1,109,100
2,109,100
3,109,100
4,109,100


approximate sizes of three types of order for one tick

In [29]:
# backsolve the order info for one tick
def approx_orders_onetick(order_chg, this_top_level, last_top_level):
    """
    inputs types are np.array, faster than pd.DataFrame indexing
    
    order_chg:      depth_book change from last tick
    this_top_level: first ask and bid price iloc of this tick, [ask,bid]
    last_top_level: first ask and bid price iloc of last tick, [ask,bid]
    """
    market_ask_range = order_chg[this_top_level[1]:last_top_level[1]+1]
    market_bid_range = order_chg[last_top_level[0]:this_top_level[0]+1]
    
    market_ask = -market_ask_range.copy()
    market_ask[market_ask<0]=0
    market_bid = -market_bid_range.copy()
    market_bid[market_bid<0]=0
    
    order_chg[this_top_level[1]:last_top_level[1]+1] += market_ask
    order_chg[last_top_level[0]:this_top_level[0]+1] += market_bid
    
    limit = order_chg.copy()
    limit[limit<0] = 0
    cancl = -order_chg.copy()
    cancl[cancl<0] = 0
        
    return [market_ask.sum(), market_bid.sum(), limit, cancl]

Test whether the function does the right thing.

In [11]:
def check_order_approx(i,lag=1,if_print=False):
    order_chg = t_depth_book.loc[i]-t_depth_book.loc[i-lag]
    market_ask,market_bid,limit,cancl = approx_orders_onetick(order_chg.values,
                                                              t_top_level_iloc.loc[i],
                                                              t_top_level_iloc.loc[i-lag])
    
    this_rates = t_book.loc[i,rate_col_names].dropna().values
    last_rates = t_book.loc[i-lag,rate_col_names].dropna().values
    
    rates = np.unique(np.concatenate((this_rates,last_rates)))
    rates_iloc = [int((r-min_rate)*2) for r in rates]
    
    res = pd.DataFrame(index=rates[::-1],columns=['last_tick','this_tick','chg','limit','cancl'])
    
    res.loc[last_rates,'last_tick'] = t_book.loc[i-lag,size_col_names].dropna().values
    res.loc[this_rates,'this_tick'] = t_book.loc[i,size_col_names].dropna().values
    res.loc[:,'limit'] = limit[rates_iloc][::-1]
    res.loc[:,'cancl'] = cancl[rates_iloc][::-1]
    res = res.fillna(0)
    res['chg'] = res['this_tick']-res['last_tick']
    
    if if_print:
        print('best_ask: %.1f\t best_bid: %.1f'%(t_book.loc[i,'askRate0'],t_book.loc[i,'bidRate0']))
        print('market_ask: %d\t market_bid: %d'%(market_ask,market_bid))
        print(res)
    
    test_result = res['this_tick'].add(-res['last_tick']).sum()-res['limit'].sum()+res['cancl'].sum()==-market_ask-market_bid
    
    return(test_result)
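
The comparison stored in `test_result` can be read as a conservation identity over the price levels $p$ that appear in either tick. The symbols are introduced here for illustration only: $\Delta_p$ is the size change (`chg`), $L_p$ and $C_p$ are the estimated limit and cancelled sizes, and $M_{\text{ask}}$, $M_{\text{bid}}$ are the two market order totals.

$$\sum_p \Delta_p \;-\; \sum_p L_p \;+\; \sum_p C_p \;=\; -\left(M_{\text{ask}} + M_{\text{bid}}\right)$$

In words: whatever net size change is not explained by limit orders and cancellations is attributed to market orders.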

Print a sample tick together with the estimated orders. The output below suggests the function works as intended.

In [12]:
check_order_approx(380,if_print=True)

best_ask: 1616.5	 best_bid: 1616.0
market_ask: 0	 market_bid: 0
        last_tick  this_tick   chg  limit  cancl
1624.0       24.0        0.0 -24.0    0.0   24.0
1623.5        4.0        4.0   0.0    0.0   -0.0
1623.0       10.0       10.0   0.0    0.0   -0.0
1622.5        4.0        4.0   0.0    0.0   -0.0
1622.0      152.0      152.0   0.0    0.0   -0.0
1621.5        8.0        8.0   0.0    0.0   -0.0
1621.0       24.0       24.0   0.0    0.0   -0.0
1620.5        5.0        5.0   0.0    0.0   -0.0
1620.0       13.0       13.0   0.0    0.0   -0.0
1619.5        4.0        4.0   0.0    0.0   -0.0
1619.0        3.0        3.0   0.0    0.0   -0.0
1618.5        2.0        2.0   0.0    0.0   -0.0
1618.0        4.0        4.0   0.0    0.0   -0.0
1617.5        1.0        1.0   0.0    0.0   -0.0
1617.0        1.0        1.0   0.0    0.0   -0.0
1616.5        0.0        1.0   1.0    1.0    0.0
1616.0        4.0        4.0   0.0    0.0   -0.0
1615.5        3.0        3.0   0.0    0.0   -0.0
1615.

True

Randomly pick 10 ticks with random lags and check whether the test passes. It passed for all samples.

In [13]:
print("%s\t%s\t%s"%('row_i','lag','pass?'))
# start at 2: with i == 1 the lag draw below would call randint(1, 1), which raises
for i in np.random.randint(2,len(t_book),10):
    lag = np.random.randint(1,min(i,100))
    print("%d\t%d\t%s"%(i,lag,check_order_approx(i,lag)))

row_i	lag	pass?
2191040	48	True
1506523	60	True
2492213	70	True
241581	10	True
171575	63	True
2957944	42	True
93514	92	True
2660046	55	True
369724	6	True
1506923	56	True


<a id="3.3" ></a>
### Apply the transformation to the whole book

Apply the single-tick function to multiple ticks (a subset of the book).

In [44]:
import pickle
from tqdm import tqdm
# approximate orders for entire order book
def approx_orders(chg,top_iloc):
    pool = mp.Pool(mp.cpu_count())
    # one argument tuple per tick: (size change, this tick's top levels, last tick's top levels)
    params = [(chg.iloc[i].values,top_iloc.iloc[i].values,top_iloc.iloc[i-1].values) for i in range(1,len(chg))]
    orders = pool.starmap(approx_orders_onetick, params)
    pool.close()

    return orders

The book is too large to process in one pass, so split it into 300 subsets, compute each one, dump the results and merge them.

In [72]:
# split the order book into 300 subsets
for i in tqdm(range(300)):
    bgn = int(i*1e4)
    end = int((i+1)*1e4)
        
    pkl_filename = "approx_order"+str(i)+".pkl"
    with open(pkl_filename, 'wb') as file:
        pickle.dump(approx_orders(t_order_chg.loc[bgn:end],t_top_level_iloc.loc[bgn:end]), file)

100%|██████████| 300/300 [38:47<00:00,  7.76s/it]
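
Each subset is sliced with an inclusive end label, so neighbouring subsets share one row; the shared row supplies the previous tick's top levels for the first tick of the next subset. A minimal sketch of that boundary logic follows, with `chunk_bounds` as a hypothetical helper and toy sizes chosen for readability.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield inclusive (bgn, end) label pairs; consecutive chunks share one row."""
    for bgn in range(0, n_rows - 1, chunk_size):
        yield bgn, min(bgn + chunk_size, n_rows - 1)

bounds = list(chunk_bounds(n_rows=25, chunk_size=10))
print(bounds)

# rows handled inside each chunk: every row except the chunk's first one
processed = [row for bgn, end in bounds for row in range(bgn + 1, end + 1)]
print(processed == list(range(1, 25)))
```

With this layout every tick from row 1 onward is processed exactly once, and row 0 is skipped because it has no predecessor.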


In [73]:
# merge the results of 300 subsets
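market_ask_bid = []   # per tick: [market_ask, market_bid]
limit_orders = []     # per tick: limit order size at each price level
cancl_orders = []     # per tick: cancelled size at each price level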
for i in tqdm(range(300)):
    temp = pickle.load(open( "approx_order"+str(i)+".pkl", "rb" ))
    market_ask_bid.extend([t[:2] for t in temp])
    limit_orders.extend([t[2] for t in temp])
    cancl_orders.extend([t[3] for t in temp])

pickle.dump(market_ask_bid, open("approx_market_ask_bid.pkl","wb"))
pickle.dump(limit_orders, open("approx_limit.pkl","wb"))
pickle.dump(cancl_orders, open("approx_cancl.pkl","wb"))

100%|██████████| 300/300 [00:37<00:00,  7.97it/s]


Assign index labels to the order DataFrames.
- market orders have the columns [ask, bid]
- limit/cancelled orders have price levels as columns
- rows are ticks

In [22]:
market_ask_bid = pickle.load(open("approx_market_ask_bid.pkl","rb"))
df_market = pd.DataFrame(market_ask_bid, columns=['market_ask','market_bid'])
df_market.index = range(2,3000000)
feather.write_dataframe(df_market,'df_market_order.feather')
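
The cell above builds only `df_market`. For the limit and cancelled orders, a frame with price levels as columns could be assembled along the following lines. This is a sketch on toy data: the values, the short `rate_range` and the starting label `first_tick` are assumptions for illustration, and the real index should match whatever labels `df_market` uses.

```python
import numpy as np
import pandas as pd

rate_range = np.arange(100.0, 101.5, 0.5)          # toy price axis: 100.0, 100.5, 101.0
limit_orders = [np.array([0.0, 2.0, 0.0]),         # one array per tick, toy values
                np.array([1.0, 0.0, 0.0])]
first_tick = 1                                     # assumed label of the first processed tick

# stack the per-tick arrays into rows and label the columns with price levels
df_limit = pd.DataFrame(np.vstack(limit_orders), columns=rate_range)
df_limit.index = range(first_tick, first_tick + len(df_limit))
print(df_limit)
```

The same pattern would apply to the cancelled-order arrays.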

<a id="4" ></a>
# Results Summary

Now we have three types of order info:
1. df_market: market ask and bid orders
2. df_limit: limit orders at different price levels
3. df_cancl: cancelled orders at different price levels

This notebook ends here. It serves as one-off preparation work, done once in the life of the project.