<h2> Rules </h2>

<ul>
<li>The submission deadline is 22-March-2018 11:59 IST</li>
<li>Output should be in the format specified below</li>
</ul>


<h3>Background Summary</h3>

An order-driven market is a financial market where all buyers and sellers display the prices at which they wish to buy or sell a particular security, as well as the amounts of the security they wish to buy or sell. In these markets, participants may submit limit orders or market orders.

In a limit order, you specify how much of the asset you want to buy or sell, and the price you want. If there are matching orders on the book (e.g. someone who wants to sell at or below the price at which you want to buy), your order will be filled immediately. If not, your order will stay on the book until matching orders arrive (which could be never). It is also possible for a limit order to be only partially filled, if the counterparty wants to trade a smaller amount than you do. In that case the rest of the order remains on the book.

In a market order, you only specify how much of the asset you want to trade. Your order is then filled immediately at the best price currently available on the market. For instance, if you place a market buy order, you will be matched with the current lowest-priced sell order on the book. If that order is not large enough to completely fill yours, the next-lowest sell order will be used to fill some more of yours, and so on. (You are encouraged to go through the first suggested reading for a pictorial understanding of the order book and price dynamics.)
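
The matching of a market buy order described above can be sketched as follows. This is only an illustration: the function name `fill_market_buy` and the representation of the sell side as a list of `(price, size)` tuples are assumptions made for this example and are not part of the competition data or code.

```python
def fill_market_buy(asks, quantity):
    """Match a market buy order against resting sell orders.

    asks     : list of (price, size) tuples (hypothetical book layout)
    quantity : number of shares the market order wants to buy
    Returns the list of (price, traded_size) fills and the unfilled remainder.
    """
    fills = []
    remaining = quantity
    # Walk the sell side from the lowest price upwards
    for price, size in sorted(asks):
        if remaining == 0:
            break
        traded = min(size, remaining)
        fills.append((price, traded))
        remaining -= traded
    return fills, remaining


# Hypothetical sell side of the book
asks = [(5016.0, 45), (5015.0, 90), (5017.0, 200)]
fills, unfilled = fill_market_buy(asks, 120)
print(fills, unfilled)
```

Here the order for 120 shares first takes all 90 shares at 5015 and then 30 of the 45 shares at 5016, so the remaining 15 shares at 5016 stay on the book.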

In this competition, we use tick data. Tick data refers to any market data which shows the price and volume of every print (executed trade). Additionally, changes to the state of the order book occur in the form of trades and quotes. A quote event occurs whenever the best bid or best ask price is updated. A trade event takes place when shares are bought or sold.

The aim of this competition is to determine the relationship between recent order book events and the future stock price over a 30-second time horizon. A few factors explored in the literature for predicting price movements are:
<ul>
<li>Order arrival rate</li>
<li>Bid-ask spread</li>
<li>Order book imbalance</li>
<li>Trade volume @ Bid price vs Trade volume @ Ask price</li>
</ul>

Certain factors, such as current order book imbalance, tend to have good predictive power over very short time horizons (under 10-20 seconds), whereas other factors might matter more for time horizons longer than a minute.
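
Two of the factors listed above, the bid-ask spread and the order book imbalance, can be computed directly from the quote columns. The sketch below uses the same imbalance definition as the baseline model later in this notebook, (bsize - asize) / (bsize + asize); the two quote rows are taken from the example table in the submission instructions, and the column name `spread` is an assumption for illustration.

```python
import pandas as pd

# Two quote rows taken from the example table in this notebook
quotes = pd.DataFrame({
    "bid": [5008.0, 5008.0],
    "ask": [5011.0, 5015.0],
    "bsize": [4816, 3327],
    "asize": [569, 616],
})

# Bid-ask spread: distance between best ask and best bid
quotes["spread"] = quotes["ask"] - quotes["bid"]

# Order book imbalance in [-1, 1]: positive when more size rests on the bid
quotes["imbalance"] = (quotes["bsize"] - quotes["asize"]) / (
    quotes["bsize"] + quotes["asize"]
)

print(quotes[["spread", "imbalance"]])
```

An imbalance close to +1 means most of the displayed size is on the bid side, which the baseline model treats as a signal that the mid-price will tick up.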

Equity markets are very fast, and it is important to understand that multiple high-frequency events can occur within the same millisecond. Analysing and understanding the data is critical before applying machine learning models.

This problem is based on a real-life problem we work on. Another important point to note: trade and quote events will rarely share the same timestamp; usually the quote event timestamp precedes the trade event timestamp. Refer to the examples in the <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge_asof.html">pandas.merge_asof documentation</a> for ways to join this data.
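
As a minimal sketch of such a join, the example below attaches to each trade the most recent quote at or before the trade time using `pandas.merge_asof`. The timestamps and trade rows are made up for illustration; only the column names follow the data fields described in this notebook.

```python
import pandas as pd

# Hypothetical quotes (must be sorted by datetime for merge_asof)
quotes = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2018-01-02 08:00:28.913",
        "2018-01-02 08:00:28.939",
        "2018-01-02 08:00:29.052",
    ]),
    "bid": [5008.0, 5008.0, 5008.0],
    "ask": [5011.0, 5016.0, 5015.0],
})

# Hypothetical trades, each arriving shortly after a quote update
trades = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2018-01-02 08:00:28.920",
        "2018-01-02 08:00:29.060",
    ]),
    "price": [5011.0, 5015.0],
    "size": [100, 50],
})

# direction="backward" picks the last quote at or before each trade
joined = pd.merge_asof(trades, quotes, on="datetime", direction="backward")
print(joined)
```

With the prevailing bid and ask attached to every trade, it becomes possible to classify trades as occurring at the bid or at the ask, which is one of the factors listed earlier.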


<h2> Suggested Reading </h2>

Basic introduction to the limit order book
<ul>
<li><a href="https://journal.r-project.org/archive/2011/RJ-2011-010/RJ-2011-010.pdf">Analyzing an Electronic Limit Order Book</a></li>
<li><a href="https://www.amazon.in/Algorithmic-Trading-DMA-Introduction-Strategies/dp/0956399207">Algorithmic Trading and DMA</a></li>
<li><a href="https://www.quantstart.com/articles/high-frequency-trading-ii-limit-order-book">HFT - Limit Order Book</a></li>
</ul>

<br/>
Advanced topics
<ul>
<li><a href="http://www.personal.psu.edu/qxc2/research/jfuturesmarkets-2008.pdf">The Information Content of an Open Limit Order Book</a></li>
<li><a href="http://eprints.maths.ox.ac.uk/1895/1/Darryl%20Shen%20%28for%20archive%29.pdf">Order Imbalance Based Strategy in High Frequency Trading</a></li>
</ul>


<h2>Problem Statement</h2>

The aim of the problem is to develop a forecasting model to predict a stock's short-term price movement. Such prediction models are widely used in algorithmic trading. Algorithmic trading, sometimes referred to as high-frequency trading in specific circumstances, is the use of automated systems to identify true (money-making) signals among massive amounts of data that capture the underlying stock dynamics. These models can be leveraged to develop profitable trading strategies (akin to those of hedge funds) to help investors/traders achieve better returns. Contestants are expected and encouraged to think of empirical models/heuristics in order to better predict the price evolution of the hypothetical stock.

<br/><br/>

<h2>Submission Instructions</h2>
<br/>Develop the algorithm/model without changing the order of the steps in the submission. The steps below are listed in the order of code execution.
<ol>
  <li>Download files if not in local environment</li>
<li>Install all required libraries</li>
<li>Parameters such as file names for in-sample data & out-sample data</li>
<li>Algo-specific parameters</li>
<li>Functions to load data & evaluate performance</li>
<li>Train the model using in-sample data</li>
    <ul>
        <li>Trained model should be a <a href="https://docs.python.org/3/library/pickle.html">pickle</a> or a function with values</li>
    </ul>
    
<li>Predict with the trained model using out_sample data and evaluate the model performance</li>
</ol>
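
Saving and reloading a trained model with pickle, as required in step 6, can look like the sketch below. The dictionary used as the "model" is only a stand-in (an assumption for illustration); any picklable object, such as a fitted regression model, would be handled the same way. The file is written to a temporary directory here so that the example does not overwrite an existing Model.pickle.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model (hypothetical); any picklable object works
model = {"imbalanceThreshold": 0.7, "tickSize": 0.5}

# Write to a temporary directory to avoid clobbering a real submission file
model_path = os.path.join(tempfile.mkdtemp(), "Model.pickle")

# Save the trained model after training on in-sample data
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

# Load it again before predicting on out-sample data
with open(model_path, "rb") as fh:
    loaded = pickle.load(fh)

print(loaded)
```

Loading the model from the pickle at prediction time is what lets the submitted notebook run without the in-sample files.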

<b>Important Instructions</b>
<ol>
    <li>Cells that begin with "#[DONOTCHANGE]" shouldn't be changed</li>
    <li>Submission will not be considered if the program fails to run</li>
    <li>Tick frequency shouldn't be modified for out sample data</li>
    <li>Only first-prediction will be considered for a mid-price until it changes [More detail below] </li>
    <li>Model code should be commented</li>
    <li>The submitted notebook should ideally not require the trade_in.csv and quote_in.csv files (in-sample files)</li>
    <li>Include a brief description of the model (preferably less than one page)</li>
</ol>

<b>Required Output</b><br/>
<ol>
<li>Minimum #predictions > 10000</li>
  <li>Low RMS</li>
<li>2 or 3 files with the names below</li>
<ol>
  <li>Model.ipynb : Notebook file</li>
  <li>Model.pdf : Model documentation</li>
  <li>Model.pickle (optional) : Trained model</li>
</ol>
</ol>
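
The RMS criterion above is the root-mean-square error between the predicted mid-price (predMid) and the realised mid-price 30 seconds later (futMid), computed only over rows where both are available. The sketch below reproduces that calculation with NumPy on made-up values; the function name `rms_error` is an assumption for this example, and the official evaluation is the RMS function defined later in this notebook.

```python
import numpy as np
import pandas as pd


def rms_error(df):
    """Return (RMS error, number of predictions) over rows with both values."""
    valid = df.dropna(subset=["predMid", "futMid"])
    errors = valid["futMid"] - valid["predMid"]
    return float(np.sqrt((errors ** 2).mean())), len(valid)


# Made-up values: the third row has no prediction and is ignored
sample = pd.DataFrame({
    "futMid": [5010.0, 5012.0, 5011.0],
    "predMid": [5011.0, 5012.0, np.nan],
})

rms, count = rms_error(sample)
print("RMS = %.4f. #Predictions = %s" % (rms, count))
# prints: RMS = 0.7071. #Predictions = 2
```

Because rows without a prediction are dropped, predicting more often raises the prediction count but can also raise the error, so the two requirements pull against each other.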

<br/><br/>
<b> First-Prediction of every mid-price for evaluation </b>

predMid is the predicted mid-price 30 seconds ahead. NA in predMid implies that the model doesn't have a prediction. Only valid predictions will be considered for evaluation.

<br/>


date|sym|bsize|bid|ask|asize|mid|predMid|ValidPrediction
-----|-----|-----|-----|-----|-----|-----|-----|-----
2018.01.02D08:00:28.913000000|TEST.L|4816|5008|5011|569|5009.5|5020|Yes
2018.01.02D08:00:28.913000000|TEST.L|3327|5008|5011|569|5009.5|5020|<b>No</b>
2018.01.02D08:00:28.913000000|TEST.L|3327|5008|5015|616|5011.5|5018|Yes
2018.01.02D08:00:28.917000000|TEST.L|3363|5008|5015|616|5011.5|NA|-
2018.01.02D08:00:28.939000000|TEST.L|5045|5008|5015|616|5011.5|5018|<b>No</b>
2018.01.02D08:00:28.939000000|TEST.L|5045|5008|5016|45|5012|5005|Yes
2018.01.02D08:00:29.028000000|TEST.L|1718|5008|5016|45|5012|5005|<b>No</b>
2018.01.02D08:00:29.052000000|TEST.L|1718|5008|5015|90|5011.5|NA|-
2018.01.02D08:00:29.052000000|TEST.L|1718|5008|5015|256|5011.5|5020|Yes
2018.01.02D08:00:29.052000000|TEST.L|1718|5008|5015|278|5011.5|5020|<b>No</b>
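
The ValidPrediction column in the table above can be reproduced with a short sketch. Consecutive rows with an unchanged mid-price are grouped with the same `diff().ne(0).cumsum()` idiom that the loading code in this notebook uses for `midChangeGroup`, and the first non-missing prediction in each group is the one that counts. The `ValidPrediction` labels are built here purely for illustration.

```python
import numpy as np
import pandas as pd

# mid and predMid columns copied from the table above (NA -> NaN)
quotes = pd.DataFrame({
    "mid": [5009.5, 5009.5, 5011.5, 5011.5, 5011.5,
            5012.0, 5012.0, 5011.5, 5011.5, 5011.5],
    "predMid": [5020, 5020, 5018, np.nan, 5018,
                5005, 5005, np.nan, 5020, 5020],
})

# A new group starts every time the mid-price changes
quotes["midChangeGroup"] = quotes["mid"].diff().ne(0).cumsum()

# Index of the first row with a prediction inside each group
has_pred = quotes["predMid"].notna()
first_idx = quotes[has_pred].groupby("midChangeGroup").head(1).index

quotes["ValidPrediction"] = "No"
quotes.loc[~has_pred, "ValidPrediction"] = "-"
quotes.loc[first_idx, "ValidPrediction"] = "Yes"

# groupby(...).first() keeps the first non-missing value per column,
# which is the prediction that gets evaluated
evaluated = quotes.groupby("midChangeGroup")["predMid"].first()
print(quotes)
print(evaluated.tolist())
```

Note that the mid-price returning to an earlier level (5011.5 in the last three rows) starts a new group, so a new prediction is accepted there.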



<br/><br/><br/>

<h2>Data </h2>

In-sample data
<ul>
<li>trade_in.csv</li>
<li>quote_in.csv</li>
</ul>
Out-sample data
<ul>
<li>trade_out.csv</li>
<li>quote_out.csv</li>
</ul>

Your model will be evaluated on a different data set.

<b> Data Fields </b>
    
Variable Name|Description|Type|Example
-----|-----|-----|-----
datetime|Datetime of the event|Datetime in format yyyy.mm.ddDHH:MM:SS.fff|2018.02.10D10:20:20.100
ric|Stock ticker|String|TEST.L
price|Last trade price|Double|3.45
size|Last trade size|Integer|10000
bid|Current Bid Price|Double|3.45
ask|Current Ask Price|Double|3.5
bsize|Current Bid Size|Integer|4000
asize|Current Ask Size|Integer|5000
mid|0.5 * (bid + ask)|Double|3.475
predMid|Mid-price predicted by the model|Double|3.65
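
The datetime layout in the table above can be parsed with an explicit format string, as sketched below for the example value. This assumes the dot-separated date shown in the table; the ReadCSV helper later in this notebook uses a dash-separated format string instead, so the separators should be checked against the actual files.

```python
import pandas as pd

# Example value from the Data Fields table
raw = pd.Series(["2018.02.10D10:20:20.100"])

# 'D' is matched literally between the date and the time
parsed = pd.to_datetime(raw, format="%Y.%m.%dD%H:%M:%S.%f")
print(parsed.iloc[0])
```

Passing an explicit format avoids ambiguous day/month inference and fails loudly if a row does not match the expected layout.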

In [1]:
#[DONOTCHANGE]
#1st step - Download files from Google drive
#Manual download from https://drive.google.com/open?id=1lExwruiCpiOwgE_ijiZ1l6sIYbHnt5v_
import os
import zipfile
import urllib.request

def DownloadZipFiles():
  print('Download zip file to local drive')
  urllib.request.urlretrieve("https://drive.google.com/uc?authuser=0&id=1lExwruiCpiOwgE_ijiZ1l6sIYbHnt5v_&export=download", "intraday.zip")  
  zip_ref = zipfile.ZipFile('intraday.zip', 'r')
  zip_ref.extractall()
  zip_ref.close()

def FilesExists():
  if not os.path.exists('./intraday'):
    return False
  fileList = ['quote_in.csv', 'quote_out.csv', 'trade_in.csv', 'trade_out.csv']
  listDir = os.listdir('./intraday')
  listDir.sort()
  return listDir == fileList

try:    
  if not FilesExists():
    DownloadZipFiles()
  else:
    print('Files exist')    
except Exception as ex:
  print('File download from google drive failed')

Files exist


In [2]:
#2nd step - Install all required libraries
'''
!pip install pandas
!pip install pytz
!python -m pip install --upgrade pip
'''


'\n!pip install pandas\n!pip install pytz\n!python -m pip install --upgrade pip\n'

In [3]:
#[DONOTCHANGE]
#3rd step - all parameters
class Parameters(object):
    pass

param = Parameters()
param.tickSize = 0.5 #tick size is 0.5 GBp i.e. 0.005 GBP

param.fileDirectory = './intraday'

param.trade_InSampleFile = 'trade_in.csv'
param.quote_InSampleFile = 'quote_in.csv'

param.trade_OutSampleFile = 'trade_out.csv'
param.quote_OutSampleFile = 'quote_out.csv'

In [4]:
#4th step - Model specific parameters
param.imbalanceThreshold = 0.7
param.timeDuration = 30 #30 seconds

In [5]:
#[DONOTCHANGE]
#5th step - Functions to load data & evaluate performance

#Initialise libraries and functions
from sklearn.metrics import mean_squared_error
from math import sqrt

import os
import numpy as np
import pandas as pd

#Disable certain warnings
pd.options.mode.chained_assignment = None

#Identify future mid prices - 30 seconds duration
def IdentifyFutureMidPrices(df, predictionDuration = '30S'):
    futureData = df.resample(predictionDuration, on = 'datetime').first()
    futureData = futureData.shift(periods=-1)
    futureData.drop(columns = ['datetime', 'sym', 'bsize', 'bid', 'ask', 'asize'], inplace = True)
    futureData.rename(columns = {"mid":"futMid"}, inplace = True)
    futureData.reset_index(inplace = True)
    return pd.merge_asof(df, futureData[['datetime', 'futMid']], on='datetime')

def ReadCSV(file):
    print('Loading file - ' + file)
    df = pd.read_csv(file)
    df['datetime'] = pd.to_datetime(df['datetime'], format="%Y-%m-%dD%H:%M:%S.%f")
    return df

#Load data
def LoadData(path, tradeFile, quoteFile):
    tradeFile = os.path.join(path, tradeFile)
    quoteFile = os.path.join(path, quoteFile)

    trade_df = ReadCSV(tradeFile)
    quote_df = ReadCSV(quoteFile)
    
    quote_df['mid'] = 0.5*(quote_df['bid'].copy() + quote_df['ask'].copy())
    quote_df['midChangeGroup'] = quote_df['mid'].diff().ne(0).cumsum()
    quote_df = IdentifyFutureMidPrices(quote_df)
    print('Files loaded')
    return trade_df, quote_df

#Evaluation function
#df should contain columns - datetime, sym, bsize, bid, ask, asize, predMid (model predicted mid-price)
#Function to evaluate results
def RMS(df):
    df = df.groupby(['midChangeGroup']).first().reset_index()
    tmp = df.dropna(subset=['predMid', 'futMid'])
    rms = sqrt(mean_squared_error(tmp['futMid'], tmp['predMid']))
    predCount = len(tmp['predMid'])
    print('RMS = %.4f. #Predictions = %s' % (rms, predCount))

In [6]:
#6th Step
#Train the model using in-sample data

#Save the trained model to a pickle, or define a function that performs all required calculations
#For example, a trained regression/neural network model should be saved to a pickle and loaded in PredictionModel
#Alternatively, define a function that performs all calculations needed to predict:
#def Predict(quote_df, trade_df):
#    define all required calculations


#The model below was designed using the training data. You should aim to improve it
#When the mid-price changes and
# imbalance is >  0.7 (param.imbalanceThreshold), predict that the mid-price will tick up
# imbalance is < -0.7 (-param.imbalanceThreshold), predict that the mid-price will tick down
def InSamplePredictionModel(quote_df, trade_df):
    #Load pickle to predict or use a function Predict()
    print('Prediction model')    
    quote_df['tick'] = np.nan
    quote_df['predMid'] = np.nan
    
    quote_df['midChanged'] = quote_df['mid'].diff()
    quote_df['imbalance'] = (quote_df['bsize']-quote_df['asize'])/(quote_df['bsize']+quote_df['asize'])
    quote_df.loc[(quote_df['midChanged'] != 0) & (quote_df['imbalance'] > param.imbalanceThreshold), 'tick'] = 1
    quote_df.loc[(quote_df['midChanged'] != 0) & (quote_df['imbalance'] < -param.imbalanceThreshold), 'tick'] = -1

    quote_df['predMid'] = quote_df['mid'] + param.tickSize * quote_df['tick']
    return quote_df

#Load the csv if not in memory
dirContents = dir()
if not ('tradeIndf' in dirContents and 'quoteIndf' in dirContents):
    tradeIndf, quoteIndf = LoadData(param.fileDirectory, param.trade_InSampleFile, param.quote_InSampleFile)
    
#Training code
#def MachineLearningModel():
#    x = model.fit()

Loading file - ./intraday\trade_in.csv
Loading file - ./intraday\quote_in.csv
Files loaded


In [7]:
#7th step
#Predict with the trained model using out_sample data

#Load the out-sample csv if not in memory
#Do not change tick frequency for outsample dataframe
dirContents = dir()
if not ('tradeOutdf' in dirContents and 'quoteOutdf' in dirContents):
    tradeOutdf, quoteOutdf = LoadData(param.fileDirectory, param.trade_OutSampleFile, param.quote_OutSampleFile)

#Any required calculations
def OutSamplePrediction(quote_df, trade_df): 
    print('Out-sample prediction')    
    return InSamplePredictionModel(quote_df, trade_df)
    
    
res = OutSamplePrediction(quoteOutdf, tradeOutdf)
RMS(res)

Loading file - ./intraday\trade_out.csv
Loading file - ./intraday\quote_out.csv
Files loaded
Out-sample prediction
Prediction model
RMS = 2.4470. #Predictions = 9369
