# Feature Engineering: Building the Limit Order Book

In this notebook, I take a raw sequence of limit order events and replay them one at a time, updating the bid or ask order book after each event.

The steps are:
* Read in the data.
* Add empty columns for new features.
* Create a class [OrderBook](#OrderBook) which will hold updated information for the order books.
* Create methods for OrderBook which will modify an order book. These methods contain the logic for how to modify an order book for a given event.
* Instantiate OrderBook objects for the bids and asks.
* Run through the data to update the order book objects, and fill in the new features for each row, given the state of the order books at that time.
* After running through the data, save the updated data as a CSV file for further processing in the next notebook.

In [78]:
import pandas as pd
import numpy as np
import datetime

In [79]:
soxx_1_06 = 'SOXX_book_events_sip-nasdaq_2017-01-06.csv'
#soxx_1_09 = "SOXX_book_events_sip-nasdaq_2017-01-09.csv"

data = pd.read_csv(soxx_1_06)
data = data.drop("Unnamed: 0", axis=1)
data = data[data.side != 'U']

In [80]:
print(data.shape)
data.head()

(188076, 7)


Unnamed: 0,timestamp,order_id,book_event_type,price,quantity,side,aux_quantity
0,1483706745349555268,2187624,A,10000000,100,B,0
1,1483706745349557645,2187628,A,199999990000000,100,A,0
2,1483707601849177770,2429788,A,124300000000,700,A,0
3,1483707602641135886,2429788,C,124300000000,0,A,700
4,1483707602641681537,2432900,A,124090000000,700,A,0


In [82]:
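# Convert nanosecond epoch timestamps to readable strings.
# Note: fromtimestamp() uses local time, and the float division drops sub-microsecond digits.
# '%m/%d/%y' is the portable spelling of '%D'.
data['timestamp'] = data['timestamp'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000000000).strftime('%m/%d/%y %H:%M:%S.%f'))
# Raw prices are integers scaled by 1e9; rescale them to price units
data['price'] = data['price']/1000000000
# Create columns for features we will build from the order book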
data['best_bid'] = np.nan
data['best_ask'] = np.nan
data['mid_price'] = np.nan
data['bid_depth_10'] = np.nan
data['ask_depth_10'] = np.nan

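# Column names for the ten best price levels on each side.
# Bids run bid10..bid1 so that bid1 (best bid) sits next to ask1 (best ask).
bid_columns = []
ask_columns = []

for i in range(10, 0, -1):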
    data['bid' + str(i)] = np.nan
    bid_columns.append('bid' + str(i))

for i in range(1,11):
    data['ask' + str(i)] = np.nan
    ask_columns.append('ask' + str(i))
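
The timestamp conversion above divides a 19-digit nanosecond value as a float, so the last few digits are lost, and `fromtimestamp` without a timezone reports local time. A minimal sketch of an integer-based alternative follows, using only the standard library. The helper names `split_nanos` and `format_nanos` are hypothetical, and the choice of UTC is an assumption; the sample values are taken from the `data.head()` output above.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 10**9
PRICE_SCALE = 10**9  # assumption: raw prices are integers scaled by 1e9


def split_nanos(ts_ns):
    """Split a nanosecond epoch into (UTC datetime in whole seconds, leftover nanoseconds)."""
    seconds, nanos = divmod(ts_ns, NANOS_PER_SECOND)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanos


def format_nanos(ts_ns):
    """Format a nanosecond epoch without losing any digits."""
    moment, nanos = split_nanos(ts_ns)
    return f"{moment:%m/%d/%y %H:%M:%S}.{nanos:09d}"


raw_timestamp = 1483706745349555268  # first row of the sample output
raw_price = 124300000000             # third row of the sample output

formatted = format_nanos(raw_timestamp)
scaled_price = raw_price / PRICE_SCALE

print(formatted)     # 01/06/17 12:45:45.349555268
print(scaled_price)  # 124.3
```

Integer `divmod` keeps all nine fractional digits, which may matter when several events share the same microsecond.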

<a id='OrderBook'></a>
### OrderBook Class
An OrderBook object stores the reconstructed state of one side of the limit order book, which can then be used to build features for predicting future price movements. Each row in the dataframe is an event (add, modify, cancel, or trade) that changes the book in some way. The methods of the OrderBook class apply the appropriate change for each event so that we can extract the data we are interested in.

Comments/details regarding the logic for each of the methods are here: [Link](https://github.com/bshpringer/neural-network-stock-prediction/blob/master/info/OrderBook_Methods.ipynb)
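
As a rough illustration of the idea, the sketch below replays a few events against one side of a book held in plain dictionaries. It is not the class used in this notebook: the names `levels`, `price_of`, `remove_order`, and `apply_event` are hypothetical, the order ids are made up, and trade events are left out because their handling depends on what `aux_quantity` means in the feed.

```python
from collections import defaultdict

# One side of the book: price -> {order_id: resting quantity}
levels = defaultdict(dict)
# Reverse index: order_id -> price the order currently rests at
price_of = {}


def remove_order(order_id):
    """Take an order out of its level and drop the level if it is now empty."""
    price = price_of.pop(order_id)
    del levels[price][order_id]
    if not levels[price]:
        del levels[price]


def apply_event(event_type, order_id, price, quantity):
    """Apply one add ('A'), modify ('M'), or cancel ('C') event."""
    if event_type == 'A':      # add a new resting order
        levels[price][order_id] = quantity
        price_of[order_id] = price
    elif event_type == 'M':    # modify: re-insert at the new price and quantity
        remove_order(order_id)
        levels[price][order_id] = quantity
        price_of[order_id] = price
    elif event_type == 'C':    # cancel: remove the order entirely
        remove_order(order_id)


# Hypothetical ask-side events (order ids 1 to 3 are made up for illustration)
events = [
    ('A', 1, 122.13, 600),
    ('A', 2, 122.13, 300),
    ('C', 2, 122.13, 0),
    ('A', 3, 122.15, 200),
    ('M', 3, 122.14, 200),
]
for event in events:
    apply_event(*event)

best_ask = min(levels)  # lowest ask price still in the book
print(dict(levels))     # {122.13: {1: 600}, 122.14: {3: 200}}
print(best_ask)         # 122.13
```

Keeping a reverse index from order id to price is what lets a modify or cancel find the right level without scanning the whole book.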

In [46]:
from collections import defaultdict

class OrderBook:
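    def __init__(self):
        # price -> {order_id: resting quantity}
        self.orders = defaultdict(lambda: defaultdict(float))
        # order_id -> price the order currently rests at (a plain dict, so an
        # unknown order_id raises KeyError instead of yielding an unhashable default)
        self.order_prices = {}
        # sorted price levels, best first; [0] is a placeholder for an empty book
        self.current_prices = [0]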
        
    def change_book(self, row):
        event = row.book_event_type
        
        if event == 'A':
            self.add_order(row)
        elif event == 'M':
            self.modify_order(row)
        elif event == 'C':
            self.cancel_order(row)
        elif event == 'T':
            self.trade_order(row)
            
        self.update_current_prices(row.side)
        
    def add_order(self, row):
        order_id = row.order_id
        self.orders[row.price][order_id] += row.quantity
        self.order_prices[order_id] = row.price
    
    def modify_order(self, row):
        # Move the order from its old price level to the price given in the event
        order_id = row.order_id
        old_price = self.order_prices[order_id]
        del self.orders[old_price][order_id]
        self.orders[row.price][order_id] += row.quantity
        self.order_prices[order_id] = row.price
        # Drop the old level if it is empty or holds no remaining quantity
        # (all() is True for an empty level, so one check covers both cases)
        if all(value <= 0 for value in self.orders[old_price].values()):
            self.orders.pop(old_price)
        
    def cancel_order(self, row):
        # Remove the order, then drop its price level if nothing is left there
        order_id = row.order_id
        self.orders[row.price].pop(order_id, None)
        del self.order_prices[order_id]
        # all() is True for an empty level, so one check covers both cases
        if all(value <= 0 for value in self.orders[row.price].values()):
            self.orders.pop(row.price)
    
    def trade_order(self, row):
        order_id = row.order_id
        old_price = self.order_prices[order_id]
        self.orders[old_price][order_id] = row.aux_quantity

        if all(value <= 0 for value in self.orders[old_price].values()):
            self.orders.pop(old_price)
    
    def update_current_prices(self, side):
        if side == 'A':
            self.current_prices = sorted(self.orders.keys())
        elif side == 'B':
            self.current_prices = sorted(self.orders.keys(), reverse = True)
        
    def append_nones(self, length, price_list):
        diff_len = length - len(price_list)
        return price_list + [None] * diff_len
    
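    def top_10_price_depth(self):
        # Depth per level = price * total quantity resting at that price.
        # The price != 0 filter skips the placeholder used for an empty book.
        top_10_prices = self.current_prices[0:10]
        return [sum(self.orders[price].values())*price for price in top_10_prices if price != 0]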

In [64]:
# Helper function: pad a list with None so that I always have 10 values to add for bids and asks
def append_nones(length, prices):
    diff_len = length - len(prices)
    return prices + [None] * diff_len
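
To see how the padding lines up with the column order, here is a small self-contained example. The three-level bid book is hypothetical (the depth values are borrowed from the sample output further down), and `row_values` is a name introduced only for this illustration.

```python
def append_nones(length, prices):
    """Pad prices with None up to the requested length."""
    return prices + [None] * (length - len(prices))


# Same column order as in the notebook: bid10..bid1 and ask1..ask10
bid_columns = ['bid' + str(i) for i in range(10, 0, -1)]
ask_columns = ['ask' + str(i) for i in range(1, 11)]

# Hypothetical book with only three bid levels, best bid first
bid_depths = [24410.0, 146424.0, 146412.0]

# Pad to ten values, then reverse so the best bid lands in the last column (bid1)
padded_bids = append_nones(10, bid_depths)[::-1]
row_values = dict(zip(bid_columns, padded_bids))

print(row_values['bid1'])   # 24410.0
print(row_values['bid3'])   # 146412.0
print(row_values['bid10'])  # None
```

Padding before reversing keeps the missing levels at the far end of the book (bid10, bid9, ...) rather than next to the best bid.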

### Iterate through data

* Instantiate ask_book and bid_book.
* Apply the change method for each row.
* Pull the figures we care about:
    * Current best bid/ask at the time
    * Sum of the top 10 price depths (price * total shares at that price)
    * Individual depths for the top 10 prices
* Update the dataframe with all the calculated figures.
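
The figures listed above can be sketched on plain dictionaries as follows. The helpers `best_price`, `top_depths`, and `mid_price` are hypothetical and are not part of the OrderBook class. The midpoint formula is an assumption based on the usual definition, since the `mid_price` column is created earlier but the notebook does not show how it is filled. The sample levels are a small subset of the books printed further down.

```python
def best_price(levels, side):
    """Highest bid or lowest ask, or None when the side is empty."""
    if not levels:
        return None
    return max(levels) if side == 'B' else min(levels)


def top_depths(levels, side, n=10):
    """Price * total resting quantity for the n best levels, best level first."""
    prices = sorted(levels, reverse=(side == 'B'))[:n]
    return [sum(levels[price].values()) * price for price in prices]


def mid_price(best_bid, best_ask):
    """Midpoint of the best bid and ask; None if either side is empty."""
    if best_bid is None or best_ask is None:
        return None
    return (best_bid + best_ask) / 2


# price -> {order_id: quantity}
bids = {122.05: {23988428: 200.0},
        122.02: {23940656: 100.0, 23947948: 600.0, 23948248: 200.0}}
asks = {122.13: {23990400: 600.0},
        122.14: {23409228: 200.0, 23782148: 300.0}}

best_bid = best_price(bids, 'B')
best_ask = best_price(asks, 'A')
mid = mid_price(best_bid, best_ask)
bid_depths = top_depths(bids, 'B')
bid_depth_10 = sum(bid_depths)

print(best_bid, best_ask, round(mid, 2))
print([round(depth, 2) for depth in bid_depths])
```

Returning None for an empty side avoids indexing into an empty price list, at the cost of having to handle None downstream.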

In [85]:
%%time

ask_book = OrderBook()
bid_book = OrderBook()

for index, row in data[0:11000].iterrows():
    side = row.side
    
    if side == 'A':
        ask_book.change_book(row)
        
    elif side == 'B':
        bid_book.change_book(row)
        
    best_ask = ask_book.current_prices[0]
    best_bid = bid_book.current_prices[0]
    
    # per-level depths for the top 10 prices, computed once per side
    top_10_bid_depths = bid_book.top_10_price_depth()
    top_10_ask_depths = ask_book.top_10_price_depth()
    
    current_bid_depth = sum(top_10_bid_depths)
    current_ask_depth = sum(top_10_ask_depths)
    
    # pad to 10 values; reverse the bids to match the bid10..bid1 column order
    appended_bids = append_nones(10, top_10_bid_depths)[::-1]
    appended_asks = append_nones(10, top_10_ask_depths)
    
    individual_columns = ['best_bid','best_ask', 'bid_depth_10','ask_depth_10']
    individual_figures = [best_bid, best_ask, current_bid_depth, current_ask_depth]
    
    update_columns = individual_columns + bid_columns + ask_columns
    update_stats = individual_figures + appended_bids + appended_asks
    
    # Writing the figures back is disabled in this timed run; uncomment to fill the dataframe
    # data.loc[row.name, update_columns] = update_stats
    
    if index % 10000 == 0:
        print(index)

0
10000
CPU times: user 3.81 s, sys: 205 ms, total: 4.01 s
Wall time: 4.41 s


In [76]:
data[10980:11000]

Unnamed: 0,timestamp,order_id,book_event_type,price,quantity,side,aux_quantity,best_bid,best_ask,mid_price,bid_depth_10,ask_depth_10,bid10,bid9,bid8,bid7,bid6,bid5,bid4,bid3,bid2,bid1,ask1,ask2,ask3,ask4,ask5,ask6,ask7,ask8,ask9,ask10
10990,1483713790548540607,23989852,A,122.59,200,A,0,122.05,122.13,,719767.0,305448.0,12193.0,48780.0,36588.0,85379.0,73188.0,85393.0,61000.0,146412.0,146424.0,24410.0,24426.0,61070.0,61075.0,36648.0,24434.0,24438.0,24448.0,12225.0,12226.0,24458.0
10991,1483713790592494077,23990392,A,122.13,300,A,0,122.05,122.13,,719767.0,342087.0,12193.0,48780.0,36588.0,85379.0,73188.0,85393.0,61000.0,146412.0,146424.0,24410.0,61065.0,61070.0,61075.0,36648.0,24434.0,24438.0,24448.0,12225.0,12226.0,24458.0
10992,1483713790592568255,23990400,A,122.13,600,A,0,122.05,122.13,,719767.0,415365.0,12193.0,48780.0,36588.0,85379.0,73188.0,85393.0,61000.0,146412.0,146424.0,24410.0,134343.0,61070.0,61075.0,36648.0,24434.0,24438.0,24448.0,12225.0,12226.0,24458.0
10993,1483713790594397611,23990392,C,122.13,0,A,300,122.05,122.13,,719767.0,378726.0,12193.0,48780.0,36588.0,85379.0,73188.0,85393.0,61000.0,146412.0,146424.0,24410.0,97704.0,61070.0,61075.0,36648.0,24434.0,24438.0,24448.0,12225.0,12226.0,24458.0
10994,1483713790596851371,23990524,A,122.13,300,A,0,122.05,122.13,,719767.0,415365.0,12193.0,48780.0,36588.0,85379.0,73188.0,85393.0,61000.0,146412.0,146424.0,24410.0,134343.0,61070.0,61075.0,36648.0,24434.0,24438.0,24448.0,12225.0,12226.0,24458.0
10995,1483713790623808048,23990524,C,122.13,0,A,300,122.05,122.13,,719767.0,378726.0,12193.0,48780.0,36588.0,85379.0,73188.0,85393.0,61000.0,146412.0,146424.0,24410.0,97704.0,61070.0,61075.0,36648.0,24434.0,24438.0,24448.0,12225.0,12226.0,24458.0
10996,1483713790672975927,23982024,C,122.02,0,B,300,122.05,122.13,,683161.0,378726.0,12193.0,48780.0,36588.0,85379.0,73188.0,85393.0,61000.0,146412.0,109818.0,24410.0,97704.0,61070.0,61075.0,36648.0,24434.0,24438.0,24448.0,12225.0,12226.0,24458.0
10997,1483713790676629975,23991216,A,122.13,300,A,0,122.05,122.13,,683161.0,415365.0,12193.0,48780.0,36588.0,85379.0,73188.0,85393.0,61000.0,146412.0,109818.0,24410.0,134343.0,61070.0,61075.0,36648.0,24434.0,24438.0,24448.0,12225.0,12226.0,24458.0
10998,1483713790777790829,23991216,C,122.13,0,A,300,122.05,122.13,,683161.0,378726.0,12193.0,48780.0,36588.0,85379.0,73188.0,85393.0,61000.0,146412.0,109818.0,24410.0,97704.0,61070.0,61075.0,36648.0,24434.0,24438.0,24448.0,12225.0,12226.0,24458.0
10999,1483713790812608025,23829448,C,122.01,0,B,200,122.05,122.13,,658759.0,378726.0,12193.0,48780.0,36588.0,85379.0,73188.0,85393.0,61000.0,122010.0,109818.0,24410.0,97704.0,61070.0,61075.0,36648.0,24434.0,24438.0,24448.0,12225.0,12226.0,24458.0


In [73]:
for key in sorted(bid_book.orders, reverse=True)[0:10]:
    print("%s: %s" % (key, bid_book.orders[key]))

122.05: defaultdict(<class 'float'>, {23988428: 200.0})
122.02: defaultdict(<class 'float'>, {23940656: 100.0, 23947948: 600.0, 23948248: 200.0})
122.01: defaultdict(<class 'float'>, {23940504: 300.0, 23928180: 200.0, 23948276: 200.0, 23988424: 300.0, 24003112: 200.0})
122.0: defaultdict(<class 'float'>, {23928136: 300.0, 23902724: 200.0})
121.99: defaultdict(<class 'float'>, {23738564: 200.0, 23927596: 300.0, 23928064: 200.0})
121.98: defaultdict(<class 'float'>, {23818708: 300.0, 23906000: 100.0, 23915460: 200.0})
121.97: defaultdict(<class 'float'>, {23785664: 200.0, 23851548: 300.0, 23915396: 200.0})
121.96: defaultdict(<class 'float'>, {23788368: 200.0})
121.95: defaultdict(<class 'float'>, {23818220: 200.0, 23906412: 200.0})
121.93: defaultdict(<class 'float'>, {9245036: 100.0})


In [74]:
for key in sorted(ask_book.orders)[0:10]:
    print("%s: %s" % (key, ask_book.orders[key]))

122.13: defaultdict(<class 'float'>, {23990400: 600.0})
122.14: defaultdict(<class 'float'>, {23409228: 200.0, 23782148: 300.0})
122.15: defaultdict(<class 'float'>, {23271408: 200.0, 23737348: 300.0})
122.16: defaultdict(<class 'float'>, {23731440: 100.0, 23928068: 200.0})
122.17: defaultdict(<class 'float'>, {23967340: 200.0})
122.19: defaultdict(<class 'float'>, {23649084: 200.0})
122.24: defaultdict(<class 'float'>, {23509920: 200.0})
122.25: defaultdict(<class 'float'>, {12193908: 100.0})
122.26: defaultdict(<class 'float'>, {12193860: 100.0})
122.29: defaultdict(<class 'float'>, {23706864: 200.0})


In [None]:
# save data to a CSV so it can be processed further
#data.to_csv('SOXX_01_06_with_bids_depths.csv')