# Processing Order Book

Additional features created based on work from the following papers:
* [Tsantekidis et al., “Using Deep Learning for price prediction by exploiting stationary limit order book features” (2018)](https://arxiv.org/pdf/1810.09965.pdf)
* [Zhang and Kercheval, “Modeling high-frequency limit order book dynamics with support vector machines” (2013)](https://www.math.fsu.edu/~aluffi/archive/paper462.pdf)

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [18]:
soxx_1_6 = 'SOXX_01_06_with_bids_depths.csv'
#soxx_1_9 = 'SOXX_01_09_with_bids_depths.csv'
data = pd.read_csv(soxx_1_6)

## Additional Features
#### Relative Depth

This feature compares the total depth across the top 10 bid prices with the total depth across the top 10 ask prices: it is the bid-to-ask depth ratio minus 1. In theory, a value greater than 0 (more bid depth than ask depth) indicates more buying pressure and a possible future upward movement in mid-price.

In [19]:
data["relative_depth"] = data.bid_depth_10 / data.ask_depth_10 - 1
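As a quick check of the sign convention, the sketch below applies the same formula to a small made-up frame. The depth values and the `toy` name are illustrative assumptions, not SOXX data.

```python
import pandas as pd

# Hypothetical depths for three order book snapshots (not real SOXX data)
toy = pd.DataFrame({
    "bid_depth_10": [1500, 1000, 800],
    "ask_depth_10": [1000, 1000, 1000],
})

# Same formula as above: bid depth relative to ask depth, centered at 0
toy["relative_depth"] = toy.bid_depth_10 / toy.ask_depth_10 - 1

# Positive when bids outweigh asks, zero when balanced, negative otherwise
print(toy)
```

Subtracting 1 centers the feature at 0, so its sign alone tells which side of the book is deeper.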

#### Mid-Price Change
This feature is the ratio of the current mid-price to the mid-price a fixed number of events earlier, minus 1. In this case, we use 200 events as our horizon. A value greater than 0 indicates an upward mid-price movement compared to 200 events earlier. Further below, in the *Labels* section, I define a function that makes the number of steps used for predictions a parameter.

In [20]:
data["mid_price_change"] = data.mid_price/data.shift(200).mid_price - 1
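To see what the shift does, here is a minimal sketch on a made-up price series with a horizon of 2 events instead of 200. The prices are illustrative assumptions. The first `k` rows have no earlier price to compare against, so they come out as NaN, and on data without gaps the result matches pandas' built-in `pct_change`.

```python
import pandas as pd

# Hypothetical mid prices (not real SOXX data)
prices = pd.Series([100.0, 101.0, 102.0, 100.0, 99.0], name="mid_price")
k = 2  # horizon in events; the notebook uses 200

# Same construction as above: current price over the price k events earlier, minus 1
change = prices / prices.shift(k) - 1

# Equivalent built-in for gap-free data
change_builtin = prices.pct_change(periods=k)

print(change)
print(change.isna().sum())  # number of leading rows without a k-step history
```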

## Standardizing Features

Standardize columns using the mean and standard deviation of all but the first and last 1,000 rows of the data. The first and last 1,000 rows are excluded because the open and close of the trading day produce atypical values that would greatly distort the statistics. We standardize with the idea that this framework could be applied across several securities, and standardization allows for better comparison regardless of a security's share price.

In [21]:
# Take means after the first 1000 rows and before last 1000
# A lot of weirdness in the beginning and end of the day
mean_bid_depth = data.bid_depth_10[1000:-1000].mean()
std_bid_depth = data.bid_depth_10[1000:-1000].std()

#get the mean and std of the ask depth
mean_ask_depth = data.ask_depth_10[1000:-1000].mean()
std_ask_depth = data.ask_depth_10[1000:-1000].std()

# standardize bid and ask depth
data["norm_bid_depth"] = (data.bid_depth_10 - mean_bid_depth) / std_bid_depth
data["norm_ask_depth"] = (data.ask_depth_10 - mean_ask_depth) / std_ask_depth

# get the mean and std of mid price
# compare each mid price to the mean for the day
mean_mid_price =  data.mid_price[1000:-1000].mean()
std_mid_price = data.mid_price[1000:-1000].std()
data["norm_mid_price"] = (data.mid_price - mean_mid_price) / std_mid_price

# standardize relative depth
mean_rel_depth =  data.relative_depth[1000:-1000].mean()
std_rel_depth = data.relative_depth[1000:-1000].std()
data["norm_rel_depth"] = (data.relative_depth - mean_rel_depth) / std_rel_depth

# Standardize mid-price change
mean_mid_price_change =  data.mid_price_change[1000:-1000].mean()
std_mid_price_change = data.mid_price_change[1000:-1000].std()
data["norm_mid_price_change"] = (data.mid_price_change - mean_mid_price_change)/std_mid_price_change
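The same pattern repeats for every column, so it could be wrapped in a small helper. The sketch below is one possible way to do that; the `standardize` name and its `trim` parameter are hypothetical, and the toy series is made up to show that outliers at the ends do not affect the statistics.

```python
import pandas as pd

def standardize(series, trim=1000):
    """Z-score a column using the mean and std of the rows between the trimmed ends."""
    # Slice with an explicit end index so that trim=0 keeps every row
    core = series.iloc[trim:len(series) - trim]
    return (series - core.mean()) / core.std()

# Toy series with extreme values at both ends (illustrative only)
toy = pd.Series([50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, -40.0])
z = standardize(toy, trim=2)

# The interior rows have mean ~0 and std ~1; the end rows are still scaled, just not used for the statistics
print(z)
```

With a helper like this, a call such as `standardize(data.bid_depth_10)` would stand in for the three-line mean/std/normalize pattern used above.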

In [22]:
# this is how much the price changes in 200 rows compared to the current row
# greater than 0 means the price increases in 200 rows compared to the current row
data["mid_price_change_forward"] = data.shift(-200).mid_price/data.mid_price - 1 

## Labels
The labels we want are for a price movement a certain number of events in the future. We want to use prior events to predict whether the price will move upwards, downwards, or stay the same <i>k</i> events in the future. The formulas for <i>mid_price_change</i>, <i>mid_price_change_forward</i> and <i>norm_mid_price_change</i> are the same as before, but the number of steps <i>k</i> is now a function argument instead of being fixed at 200.

In [23]:
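def mid(data, steps):
    """Return the backward change, forward change, and standardized backward change of the mid price."""
    # backward-looking change: current mid price vs. `steps` events earlier
    mid_price_change = data.mid_price / data.mid_price.shift(steps) - 1
    # forward-looking change: mid price `steps` events ahead vs. current
    mid_price_change_forward = data.mid_price.shift(-steps) / data.mid_price - 1

    # standardize using statistics from all but the first and last 1,000 rows
    mean_mid_price_change = mid_price_change[1000:-1000].mean()
    std_mid_price_change = mid_price_change[1000:-1000].std()
    norm_mid_price_change = (mid_price_change - mean_mid_price_change) / std_mid_price_change

    return mid_price_change, mid_price_change_forward, norm_mid_price_change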

Below I define a function that looks at the <i>mid_price_change_forward</i> column, which gives the fractional change in price, and decides whether it counts as an upward, downward, or stationary movement using the <i>change_amount</i> argument. We'd like to get approximately even classes, if possible.

If the price goes up by at least <i>change_amount</i>, the label is 1. If it goes down by at least <i>change_amount</i>, the label is -1. Otherwise it is 0.

In [24]:
# label for whether the price will be higher, lower, or the same in 200 rows
def label_price_changes(price_level_change, change_amount):
    # NaN never compares equal to anything, so test for missing values explicitly and first
    if pd.isna(price_level_change):
        return np.nan
    if price_level_change >= change_amount:
        return 1
    if price_level_change <= -change_amount:
        return -1
    return 0
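The same rule can also be applied to a whole column at once. The sketch below is a vectorized variant using `np.select`; the function name `label_price_changes_vectorized` and the sample values are hypothetical. Missing values are checked first with `isna()`, because NaN never compares equal to anything (including itself), so an `== np.nan` test can never match.

```python
import numpy as np
import pandas as pd

def label_price_changes_vectorized(changes, change_amount):
    """Label each change as 1 (up), -1 (down), 0 (stationary), or NaN (missing)."""
    # np.select takes the first matching condition, so the NaN check goes first
    conditions = [
        changes.isna(),
        changes >= change_amount,
        changes <= -change_amount,
    ]
    choices = [np.nan, 1.0, -1.0]
    labels = np.select(conditions, choices, default=0.0)
    return pd.Series(labels, index=changes.index)

# Made-up forward changes, including a missing value and both boundary cases
sample = pd.Series([0.0002, -0.0002, 0.00005, np.nan, 0.0001, -0.0001])
labels = label_price_changes_vectorized(sample, 0.0001)
print(labels)
```

Rows with no forward price (the last <i>k</i> rows after a forward shift) stay missing under this variant rather than being counted as stationary.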

In [229]:
# use 200 events as the change label
data['mid_price_change'], data['mid_price_change_forward'], data['norm_mid_price_change'] = mid(data, 200)

Below: use a change amount of 0.01% (0.0001 as a fraction). Because the mid price was consistently over $100 that day, this corresponds to a mid-price move of at least 1 cent (0.01% of $100); at a mid price near $150 it would be 1.5 cents.

In [28]:
data['change_label'] = data.mid_price_change_forward.apply(label_price_changes, change_amount = .0001)

Using the .01% threshold, we get somewhat close to balanced classes. 37.7% of labels are stationary, 32.9% are upward movements, and 29.4% are downward movements.

In [29]:
round(data.change_label.value_counts()/len(data),3)

 0    0.377
 1    0.329
-1    0.294
Name: change_label, dtype: float64
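If the classes come out less balanced on other data, one way to pick a starting threshold is to read it off the distribution of absolute changes. The sketch below is an assumption-laden illustration, not the method used in this notebook: it uses synthetic, normally distributed changes (not SOXX data) and sets the threshold at the one-third quantile of the absolute change, so that roughly a third of the rows are labeled stationary.

```python
import numpy as np
import pandas as pd

# Synthetic forward changes for illustration only (not real SOXX data)
rng = np.random.default_rng(0)
changes = pd.Series(rng.normal(0.0, 0.0003, size=10_000))

# Threshold below which about one third of absolute changes fall
threshold = changes.abs().quantile(1 / 3)

# Class shares under the same labeling rule as above
stationary_share = (changes.abs() < threshold).mean()
up_share = (changes >= threshold).mean()
down_share = (changes <= -threshold).mean()

print(threshold, stationary_share, up_share, down_share)
```

This only fixes the stationary share; how the remainder splits between upward and downward moves depends on how symmetric the changes are around zero.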

Now that we have processed features and labels, we can save the data to be used in a neural network.

In [231]:
#data.to_csv('SOXX_01_06_processed.csv')