##### my-zero-to-gbm-proj-assign

## Optiver Realized Volatility Prediction

This dataset contains stock market data relevant to the practical execution of trades in the financial markets. In particular, it includes order book snapshots and executed trades. With one second resolution, it provides a uniquely fine grained look at the micro-structure of modern financial markets.

This is a code competition where only the first few rows of the test set are available for download. The rows that are visible are intended to illustrate the hidden test set format and folder structure. The remainder will only be available to your notebook when it is submitted. The hidden test set contains data that can be used to construct features to predict roughly 150,000 target values. Loading the entire dataset will take slightly more than 3 GB of memory, by our estimation.

This is also a forecasting competition, where the final private leaderboard will be determined using data gathered after the training period closes, which means that the public and private leaderboards will have zero overlap. During the active training stage of the competition a large fraction of the test data will be filler, intended only to ensure the hidden dataset has approximately the same size as the actual test data. The filler data will be removed entirely during the forecasting phase of the competition and replaced with real

In [None]:
!pip install jovian --upgrade --quiet

In [None]:
import jovian

#### Data Description

book_[train/test].parquet A parquet file partitioned by stock_id. Provides order book data on the most competitive buy and sell orders entered into the market. The top two levels of the book are shared. The first level of the book will be more competitive in price terms, it will then receive execution priority over the second level.

stock_id - ID code for the stock. Not all stock IDs exist in every time bucket. Parquet coerces this column to the categorical data type when loaded; you may wish to convert it to int8.
time_id - ID code for the time bucket. Time IDs are not necessarily sequential but are consistent across all stocks.
seconds_in_bucket - Number of seconds from the start of the bucket, always starting from 0.
bid_price[1/2] - Normalized prices of the most/second most competitive buy level.
ask_price[1/2] - Normalized prices of the most/second most competitive sell level.
bid_size[1/2] - The number of shares on the most/second most competitive buy level.
ask_size[1/2] - The number of shares on the most/second most competitive sell level.
trade_[train/test].parquet A parquet file partitioned by stock_id. Contains data on trades that actually executed. Usually, in the market, there are more passive buy/sell intention updates (book updates) than actual trades, therefore one may expect this file to be more sparse than the order book.

stock_id - Same as above.
time_id - Same as above.
seconds_in_bucket - Same as above. Note that since trade and book data are taken from the same time window and trade data is more sparse in general, this field is not necessarily starting from 0.
price - The average price of executed transactions happening in one second. Prices have been normalized and the average has been weighted by the number of shares traded in each transaction.
size - The sum number of shares traded.
order_count - The number of unique trade orders taking place.
train.csv The ground truth values for the training set.

stock_id - Same as above, but since this is a csv the column will load as an integer instead of categorical.
time_id - Same as above.
target - The realized volatility computed over the 10 minute window following the feature data under the same stock/time_id. There is no overlap between feature and target data. You can find more info in our tutorial notebook.
test.csv Provides the mapping between the other data files and the submission file. As with other test files, most of the data is only available to your notebook upon submission with just the first few rows available for download.

stock_id - Same as above.
time_id - Same as above.
row_id - Unique identifier for the submission row. There is one row for each existing time ID/stock ID pair. Each time window is not necessarily containing every individual stock.
sample_submission.csv - A sample submission file in the correct format.

row_id - Same as in test.csv.
target - Same definition as in train.csv. The benchmark is using the median target value from train.csv.

In [None]:
# Execute this to save new versions of the notebook
#jovian.commit(project="my-zero-to-gbm-proj-assign")
jovian.commit(filename="my-zero-to-gbm-proj-assign.ipynb")

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
#train = pd.read_csv('../../../data/optiver-realized-volatility-prediction/train.csv')
train = pd.read_csv('d:\\optiver-realized-volatility-prediction\\train.csv')

In [2]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428932 entries, 0 to 428931
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   stock_id  428932 non-null  int64  
 1   time_id   428932 non-null  int64  
 2   target    428932 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 9.8 MB


In [2]:
import glob
list_order_book_file_train = glob.glob('d:\\optiver-realized-volatility-prediction\\book_train.parquet/*')
list_trade_book_file_train = glob.glob('d:\\optiver-realized-volatility-prediction\\trade_train.parquet/*')
#list_order_book_file_train

In [45]:
# def compute_wap(df_stock_book):
#        df_stock_book['bid_price'] = (df_stock_book['bid_price1'] * df_stock_book['bid_size1'] + df_stock_book['bid_price2'] * df_stock_book['bid_size2'])/(df_stock_book['bid_size1'] + df_stock_book['bid_size2'])
#        df_stock_book['bid_size'] = (df_stock_book['bid_size1'] + df_stock_book['bid_size2'])

#        df_stock_book['ask_price'] = (df_stock_book['ask_price1'] * df_stock_book['ask_size1'] + df_stock_book['ask_price2'] * df_stock_book['ask_size2'])/(df_stock_book['ask_size1'] + df_stock_book['ask_size2'])
#        df_stock_book['ask_size'] = (df_stock_book['ask_size1'] + df_stock_book['ask_size2'])

#        df_stock_book['wap'] = (df_stock_book['bid_price'] * df_stock_book['ask_size'] + df_stock_book['ask_price'] * df_stock_book['bid_size']) / (df_stock_book['bid_size'] +  df_stock_book['ask_size'])
#        return df_stock_book['wap']

In [3]:
df_order_book = pd.DataFrame()
for file in list_order_book_file_train:
     df_stock_book = pd.read_parquet(file)
     df_stock_book['stock_id'] = file.split('=')[1]
     df_order_book = pd.concat([df_order_book,df_stock_book])


df_trade_book = pd.DataFrame()
for file in list_trade_book_file_train:
     df_stock_book = pd.read_parquet(file)
     df_stock_book['stock_id'] = file.split('=')[1]
     df_trade_book = pd.concat([df_trade_book,df_stock_book])


In [4]:
# with open('book_index.csv','w+') as f:
#     for items in list(df_order_book_50.index):
#         f.write('%s\n' %items)
#new_index
# df_order_book_5.reset_index(inplace=True,drop=True)
# df_trade_book_5.reset_index(inplace=True,drop=True)

df_order_book.reset_index(inplace=True,drop=True)
df_trade_book.reset_index(inplace=True,drop=True)

#df_order_book_50.head(-1)
#df_order_book_50[df_order_book_50.index != df_order_book_50['index']]
#(72311913, 11)

In [5]:
df_order_book['stock_id'] = df_order_book['stock_id'].astype('int8')
df_trade_book['stock_id'] = df_trade_book['stock_id'].astype('int8')

In [43]:
# df_order_book['order_wap'] = (((df_order_book['bid_price1'] * df_order_book['bid_size1'] + df_order_book['bid_price2'] * df_order_book['bid_size2'])/(df_order_book['bid_size1'] + df_order_book['bid_size2']) * (df_order_book['ask_size1'] + df_order_book['ask_size2'])) + ((df_order_book['ask_price1'] * df_order_book['ask_size1'] + df_order_book['ask_price2'] * df_order_book['ask_size2'])/(df_order_book['ask_size1'] + df_order_book['ask_size2'])) * (df_order_book['bid_size1'] + df_order_book['bid_size2'])) / ((df_order_book['bid_size1'] + df_order_book['bid_size2']) +  (df_order_book['ask_size1'] + df_order_book['ask_size2']))

In [46]:
# df_order_book['order_wap'] = ((((df_order_book['bid_price1'] * df_order_book['bid_size1'] + df_order_book['bid_price2'] * 
#     df_order_book['bid_size2'])/(df_order_book['bid_size1'] + df_order_book['bid_size2'])) * (df_order_book['ask_size1'] + 
#     df_order_book['ask_size2'])) + (((df_order_book['ask_price1'] * df_order_book['ask_size1'] + df_order_book['ask_price2'] * 
#     df_order_book['ask_size2'])/(df_order_book['ask_size1'] + df_order_book['ask_size2'])) * (df_order_book['bid_size1'] + 
#     df_order_book['bid_size2'])) / ((df_order_book['bid_size1'] + df_order_book['bid_size2']) +  (df_order_book['ask_size1'] + 
#     df_order_book['ask_size2']))).astype('float32')

In [6]:
df_order_book['order_wap'] = ((((df_order_book['bid_price1'].values * df_order_book['bid_size1'].values + df_order_book['bid_price2'].values * 
    df_order_book['bid_size2'].values)/(df_order_book['bid_size1'].values + df_order_book['bid_size2'].values)) * (df_order_book['ask_size1'].values + 
    df_order_book['ask_size2'].values)) + (((df_order_book['ask_price1'].values * df_order_book['ask_size1'].values + df_order_book['ask_price2'].values * 
    df_order_book['ask_size2'].values)/(df_order_book['ask_size1'].values + df_order_book['ask_size2'].values)) * (df_order_book['bid_size1'].values + 
    df_order_book['bid_size2'].values)) / ((df_order_book['bid_size1'].values + df_order_book['bid_size2'].values) +  (df_order_book['ask_size1'].values + 
    df_order_book['ask_size2'].values))).astype('float32')

In [7]:
df_order_book.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167253289 entries, 0 to 167253288
Data columns (total 12 columns):
 #   Column             Dtype  
---  ------             -----  
 0   time_id            int16  
 1   seconds_in_bucket  int16  
 2   bid_price1         float32
 3   ask_price1         float32
 4   bid_price2         float32
 5   ask_price2         float32
 6   bid_size1          int32  
 7   ask_size1          int32  
 8   bid_size2          int32  
 9   ask_size2          int32  
 10  stock_id           int8   
 11  order_wap          float32
dtypes: float32(5), int16(2), int32(4), int8(1)
memory usage: 6.4 GB


In [7]:
# grouped = df_order_book.groupby(['time_id','stock_id'])['seconds_in_bucket','order_wap']

In [9]:
# s3 =pd.DataFrame(df_order_book[['time_id','stock_id','seconds_in_bucket','order_wap']])

In [7]:
new_index = pd.Index(np.arange(0,600), name="seconds_in_bucket")

In [9]:
# s3=grouped.apply(lambda x : x.set_index('seconds_in_bucket').reindex(new_index))


In [8]:
def log_return(list_stock_prices):
    return np.log(list_stock_prices).diff()

In [11]:
# s3.loc(axis=0)[:,:,0].isna().sum()


In [12]:
#Below operation is not required because there are no zero seconds records that have null values 
# bf = s2.groupby(['time_id','stock_id'])
# bf['price'].first()
# s2.loc[s2.index.get_level_values(2)==0,'price'] = bf['price'].first()

In [13]:
# s3.fillna(method='ffill',inplace=True)

In [14]:
# s3.isna().sum()

In [15]:
# s3['log_return'] = s3.groupby(['time_id','stock_id'])['order_wap'].apply(log_return)

In [16]:
# s3.loc(axis=0)[:,:,0].isna().sum()

In [9]:

def order_volatility(series_log_return):
    return np.sqrt(np.sum(series_log_return**2))

In [18]:
# df_realized_vol_per_stock =  pd.DataFrame(s3[~s3['log_return'].isnull()]['log_return'].groupby(['time_id','stock_id']).agg(order_volatility))

In [19]:
# df_realized_vol_per_stock.isna().sum()

In [10]:
df_trade_book['scc'] = (df_trade_book['size']/df_trade_book['order_count']).astype('float16')

In [11]:
df_trade_book['price'].corr(df_trade_book['scc'])

0.00265647386908627

In [22]:
df_trade_book.info(),df_order_book.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38382741 entries, 0 to 38382740
Data columns (total 7 columns):
 #   Column             Dtype  
---  ------             -----  
 0   time_id            int16  
 1   seconds_in_bucket  int16  
 2   price              float32
 3   size               int32  
 4   order_count        int16  
 5   stock_id           int8   
 6   scc                float16
dtypes: float16(1), float32(1), int16(3), int32(1), int8(1)
memory usage: 622.3 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167253289 entries, 0 to 167253288
Data columns (total 12 columns):
 #   Column             Dtype  
---  ------             -----  
 0   time_id            int16  
 1   seconds_in_bucket  int16  
 2   bid_price1         float32
 3   ask_price1         float32
 4   bid_price2         float32
 5   ask_price2         float32
 6   bid_size1          int32  
 7   ask_size1          int32  
 8   bid_size2          int32  
 9   ask_size2          int32  
 10  stock_id 

(None, None)

In [12]:
df_order_trade_merged = pd.merge(
    df_order_book[['time_id','stock_id','seconds_in_bucket','order_wap']],
    df_trade_book[['time_id','stock_id','seconds_in_bucket','price','scc']],
    how="outer",
    on=['time_id','stock_id','seconds_in_bucket'],
    sort=True,
    copy=False,
    indicator=False,
    validate='m:m'
    )

In [13]:
grouped = df_order_trade_merged.groupby(['time_id','stock_id'])['seconds_in_bucket','order_wap','price','scc']

  grouped = df_order_trade_merged.groupby(['time_id','stock_id'])['seconds_in_bucket','order_wap','price','scc']


In [14]:
s1 = grouped.apply(lambda x : x.set_index('seconds_in_bucket').reindex(new_index))

In [15]:
s1.shape

(257359200, 3)

In [25]:
# #grouped_trade = df_trade_book.groupby(['time_id','stock_id'])['seconds_in_bucket','price','scc']
# s4 = df_trade_book.groupby(['time_id','stock_id'])['seconds_in_bucket','price','scc'].apply(lambda x : x.set_index('seconds_in_bucket').reindex(new_index))

In [31]:
s1.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 257359200 entries, (5, 0, 0) to (32767, 126, 599)
Data columns (total 3 columns):
 #   Column     Dtype  
---  ------     -----  
 0   order_wap  float32
 1   price      float32
 2   scc        float16
dtypes: float16(1), float32(2)
memory usage: 3.6 GB


In [27]:
s1.isna().sum()

order_wap     90105911
price        218976459
scc          218976459
dtype: int64

In [28]:
s1.loc(axis=0)[:,:,0].isna().sum()

order_wap         0
price        334129
scc          334129
dtype: int64

In [16]:
bf = s1.groupby(['time_id','stock_id'])

In [30]:
bf.first().isna().sum()

order_wap     0
price        19
scc          19
dtype: int64

In [17]:
s1.loc[s1.index.get_level_values(2)==0,['order_wap','price','scc']] = bf.first()

In [36]:
s1.loc(axis=0)[:,:,0].isna().sum()

order_wap     0
price        19
scc          19
dtype: int64

In [18]:
s1.loc(axis=0)[:,:,0]['price'].isna().sum()

19

In [37]:
s1.loc(axis=0)[:,:,0][s1.loc(axis=0)[:,:,0]['price'].isna()]['price']

time_id  stock_id  seconds_in_bucket
62       37        0                   NaN
985      31        0                   NaN
3987     31        0                   NaN
5539     31        0                   NaN
5629     31        0                   NaN
6197     31        0                   NaN
8524     18        0                   NaN
8753     31        0                   NaN
8840     31        0                   NaN
9208     31        0                   NaN
9664     103       0                   NaN
12011    31        0                   NaN
13377    31        0                   NaN
13663    31        0                   NaN
15010    31        0                   NaN
20017    31        0                   NaN
22498    31        0                   NaN
28186    31        0                   NaN
32174    31        0                   NaN
Name: price, dtype: float32

In [32]:
s1.loc(axis=0)[62,37,:]['price'].isna().sum()

600

In [19]:
s1.loc(axis=0)[:,:,0][s1.loc(axis=0)[:,:,0]['price'].isna()]['price'].index.unique(level=1).values

array([ 37,  31,  18, 103], dtype=int64)

In [85]:
s1.loc(axis=0)[:,31,0]['price'].mean()

0.9998831152915955

In [91]:
# s1.loc(axis=0)[:,:,0][s1.loc(axis=0)[:,:,0]['price'].isna()]['price'].transform(lambda x: s1.loc(axis=0)[:,x.index.get_level_values(1),0]['price'].mean())

In [101]:
# s1.loc(axis=0)[:,:,0][s1.loc(axis=0)[:,:,0]['price'].isna()][['price','scc']].apply(lambda x: s1.loc(axis=0)[:,x.index.get_level_values(1),0]['price'].mean())

In [281]:
s1.loc(axis=0)[:,:,0][['price','scc']][s1.loc(axis=0)[:,:,0]['price'].isna()] = b.values

In [39]:
s1.loc(axis=0)[:,:,0][s1.loc(axis=0)[:,:,0]['price'].isna()][['price','scc']]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,price,scc
time_id,stock_id,seconds_in_bucket,Unnamed: 3_level_1,Unnamed: 4_level_1
62,37,0,,
985,31,0,,
3987,31,0,,
5539,31,0,,
5629,31,0,,
6197,31,0,,
8524,18,0,,
8753,31,0,,
8840,31,0,,
9208,31,0,,


In [20]:
a = s1.loc(axis=0)[:,:,0][s1.loc(axis=0)[:,:,0]['price'].isna()][['price','scc']]

In [21]:
def price_scc(col):
    #return  s1.loc(axis=0)[:,col.name[1],0][col.index[0]].mean(),s1.loc(axis=0)[:,col.name[1],:][col.index[1]].mean()
    #return s1.loc(axis=0)[:,col.name[1],0].groupby(level=[1,2])['price'].mean(), s1.loc(axis=0)[:,col.name[1],0].groupby(level=[1,2])['scc'].mean()
    return s1.loc(axis=0)[:,col.name[1],0]['price'].mean(), np.asscalar(s1.loc(axis=0)[:,col.name[1],0]['scc'].mean(skipna=True,level=[1,2]).values)
    #return col.name[1] , col.index[0]
    #row.scc = s1.loc(axis=0)[:,row.index.get_level_values(1),0]['scc'].mean()

In [22]:
b = a.transform(price_scc ,axis=1)

  return s1.loc(axis=0)[:,col.name[1],0]['price'].mean(), np.asscalar(s1.loc(axis=0)[:,col.name[1],0]['scc'].mean(skipna=True,level=[1,2]).values)
  return s1.loc(axis=0)[:,col.name[1],0]['price'].mean(), np.asscalar(s1.loc(axis=0)[:,col.name[1],0]['scc'].mean(skipna=True,level=[1,2]).values)


In [23]:
b

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,price,scc
time_id,stock_id,seconds_in_bucket,Unnamed: 3_level_1,Unnamed: 4_level_1
62,37,0,0.999967,12.804688
985,31,0,0.999883,581.5
3987,31,0,0.999883,581.5
5539,31,0,0.999883,581.5
5629,31,0,0.999883,581.5
6197,31,0,0.999883,581.5
8524,18,0,0.999964,18.078125
8753,31,0,0.999883,581.5
8840,31,0,0.999883,581.5
9208,31,0,0.999883,581.5


In [24]:
s1.fillna(b,inplace=True)

In [313]:
s1.isna().sum()

order_wap     90105911
price        218642349
scc          218642349
dtype: int64

In [314]:
s1.loc(axis=0)[:,:,0].isna().sum()

order_wap     0
price        19
scc          19
dtype: int64

In [20]:
df_order_book['stock_id'] = df_order_book['stock_id'].astype('int8')
df_order_book[['time_id','seconds_in_bucket']] = df_order_book[['time_id','seconds_in_bucket']].astype('int16')
df_order_book[['bid_size1','ask_size1','bid_size2','ask_size2']] = df_order_book[['bid_size1','ask_size1','bid_size2','ask_size2']].astype('int32')

In [22]:
#grouped = df_order_trade_merged.groupby(['time_id','stock_id'])
grouped = df_order_book.groupby(['time_id','stock_id'])

#df_order_book_50_idxed.index.levels[2]

In [24]:
import swifter

In [26]:
from distributed import Client
# create local cluster
client = Client(n_workers=8)


+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| pandas  | 1.3.1  | 1.3.1     | 1.3.0   |
+---------+--------+-----------+---------+


In [27]:
import dask.dataframe as dd

In [28]:
dd.from_pandas(grouped, npartitions=8).apply(lambda x : x.set_index('seconds_in_bucket').reindex(new_index))

TypeError: Input must be a pandas DataFrame or Series

In [30]:
# The Dask engine is currently considered experimental, so I use ray
%env MODIN_ENGINE=ray
import modin.pandas as mpd


env: MODIN_ENGINE=ray


In [39]:
df_trade_book.values

array([[5.00000000e+00, 2.10000000e+01, 1.00230134e+00, 3.26000000e+02,
        1.20000000e+01, 0.00000000e+00],
       [5.00000000e+00, 4.60000000e+01, 1.00277805e+00, 1.28000000e+02,
        4.00000000e+00, 0.00000000e+00],
       [5.00000000e+00, 5.00000000e+01, 1.00281847e+00, 5.50000000e+01,
        1.00000000e+00, 0.00000000e+00],
       ...,
       [3.27670000e+04, 5.79000000e+02, 9.99797046e-01, 3.00000000e+02,
        3.00000000e+00, 9.90000000e+01],
       [3.27670000e+04, 5.84000000e+02, 9.99853671e-01, 1.72000000e+02,
        2.00000000e+00, 9.90000000e+01],
       [3.27670000e+04, 5.92000000e+02, 9.99661803e-01, 4.00000000e+02,
        5.00000000e+00, 9.90000000e+01]])

In [35]:
gmpd = mpd.dataframe(df_order_book.values)
grouped = gmpd.groupby(['time_id','stock_id'])

MemoryError: Unable to allocate 13.7 GiB for an array with shape (11, 167253289) and data type float64

In [25]:
s2=grouped.swifter.apply(lambda x : x.set_index('seconds_in_bucket').reindex(new_index))


AttributeError: 'DataFrameGroupBy' object has no attribute 'swifter'

In [None]:
s2.drop(columns=['time_id','stock_id'],inplace=True)

In [63]:
s2[['size','order_count']] = s2[['size','order_count']].fillna(0) 


In [69]:
s2.loc(axis=0)[:,:,0].isna().sum()


bid_price1         0
ask_price1         0
bid_price2         0
ask_price2         0
bid_size1          0
ask_size1          0
bid_size2          0
ask_size2          0
price          14413
size               0
order_count        0
dtype: int64

In [70]:
s2.loc(axis=0)[:,:,0][s2.loc(axis=0)[:,:,0]['price'].isna()]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,bid_price1,ask_price1,bid_price2,ask_price2,bid_size1,ask_size1,bid_size2,ask_size2,price,size,order_count
time_id,stock_id,seconds_in_bucket,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
5,68,0,1.000354,1.000472,1.000236,1.000591,101.0,40.0,201.0,321.0,,0.0,0.0
5,100,0,0.999077,1.000103,0.999009,1.000239,100.0,7.0,64.0,100.0,,0.0,0.0
5,111,0,1.001884,1.002048,1.001720,1.002211,500.0,509.0,710.0,2091.0,,0.0,0.0
11,68,0,0.999941,1.000059,0.999824,1.000176,100.0,200.0,200.0,300.0,,0.0,0.0
11,81,0,0.996471,0.996785,0.996419,0.997569,286.0,201.0,20.0,100.0,,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
32763,68,0,0.998702,0.998975,0.998565,0.999112,123.0,201.0,396.0,100.0,,0.0,0.0
32763,100,0,1.004087,1.004697,1.003721,1.004941,100.0,9.0,200.0,200.0,,0.0,0.0
32767,2,0,0.999298,0.999454,0.999220,0.999532,408.0,300.0,600.0,500.0,,0.0,0.0
32767,68,0,1.000636,1.000764,1.000509,1.000891,300.0,211.0,400.0,8.0,,0.0,0.0


In [71]:
bf = s2.groupby(['time_id','stock_id'])

In [72]:
    bf['price'].first()

time_id  stock_id
5        2           1.000688
         68          1.000585
         81          1.003518
         100         1.000103
         111         1.002161
                       ...   
32767    2           0.999298
         68          1.000764
         81          0.999430
         100         0.999315
         111         1.000101
Name: price, Length: 19149, dtype: float32

In [73]:
#s2.loc[(:,:,0),'price'] = bf['price'].first()
s2.loc[s2.index.get_level_values(2)==0,'price'] = bf['price'].first()

In [74]:
s2.isna().sum()

bid_price1     3029793
ask_price1     3029793
bid_price2     3029793
ask_price2     3029793
bid_size1      3029793
ask_size1      3029793
bid_size2      3029793
ask_size2      3029793
price          9379063
size                 0
order_count          0
dtype: int64

In [76]:
#s2.loc[:,:,1:] = s2.loc[:,:,1:].fillna(method='ffill')
#s2.loc[:,:,1:].fillna(method='ffill',inplace=True)
s2.fillna(method='ffill',inplace=True)

In [77]:
s2.isna().sum()

bid_price1     0
ask_price1     0
bid_price2     0
ask_price2     0
bid_size1      0
ask_size1      0
bid_size2      0
ask_size2      0
price          0
size           0
order_count    0
dtype: int64

Looping through each individual stocks, we can get the past realized volatility as prediction for each individual stocks.

In [78]:
s2['order_count'] = s2['order_count'].astype('int16')
s2[['bid_size1','ask_size1','bid_size2','ask_size2','size']] = s2[['bid_size1','ask_size1','bid_size2','ask_size2','size']].astype('int32')


In [80]:
# s2['bid_size'] = (s2['bid_size1'] + s2['bid_size2'])
# s2['bid_price'] = (s2['bid_price1'] * s2['bid_size1'] + s2['bid_price2'] * s2['bid_size2'])/(s2['bid_size1'] + s2['bid_size2'])

# s2['ask_size'] = (s2['ask_size1'] + s2['ask_size2'])
# s2['ask_price'] = (s2['ask_price1'] * s2['ask_size1'] + s2['ask_price2'] * s2['ask_size2'])/(s2['ask_size1'] + s2['ask_size2'])

#s2['wap'] = (((s2['bid_price1'] * s2['bid_size1'] + s2['bid_price2'] * s2['bid_size2'])/(s2['bid_size1'] + s2['bid_size2']) * (s2['ask_size1'] + s2['ask_size2'])) + ((s2['ask_price1'] * s2['ask_size1'] + s2['ask_price2'] * s2['ask_size2'])/(s2['ask_size1'] + s2['ask_size2'])) * (s2['bid_size1'] + s2['bid_size2'])) / ((s2['bid_size1'] + s2['bid_size2']) +  (s2['ask_size1'] + s2['ask_size2']))

s2['order_wap'] = (((s2['bid_price1'] * s2['bid_size1'] + s2['bid_price2'] * s2['bid_size2'])/(s2['bid_size1'] + s2['bid_size2']) * (s2['ask_size1'] + s2['ask_size2'])) + ((s2['ask_price1'] * s2['ask_size1'] + s2['ask_price2'] * s2['ask_size2'])/(s2['ask_size1'] + s2['ask_size2'])) * (s2['bid_size1'] + s2['bid_size2'])) / ((s2['bid_size1'] + s2['bid_size2']) +  (s2['ask_size1'] + s2['ask_size2']))


In [81]:
s2['trade_wap'] = s2.groupby(['time_id','stock_id'])['price'].apply(lambda x: pow(x - x.mean(),2))
#s2['trade_wap_sq'] = s2.groupby(['time_id','stock_id'])['price'].apply(lambda x: x - x.mean())

In [82]:
pd.options.display.float_format = '{:,.16f}'.format

In [51]:
def trade_volatility(trade_wap_prices):
    return np.sqrt(np.sum(trade_wap_prices)/len(trade_wap_prices))


In [85]:
s2['log_return'] = s2.groupby(['time_id','stock_id'])['order_wap'].apply(log_return)

In [20]:
s2['log_return'].isna().sum()

NameError: name 's2' is not defined

In [87]:
#remove the rows with null values of log return
#s2[~s2['log_return'].isnull()]['log_return']
df_realized_vol_per_stock =  pd.DataFrame(s2[~s2['log_return'].isnull()]['log_return'].groupby(['time_id','stock_id']).agg(order_volatility))

In [88]:
df_realized_vol_per_stock.isna().sum()

log_return    0
dtype: int64

In [89]:
df_realized_vol_per_stock

Unnamed: 0_level_0,Unnamed: 1_level_0,log_return
time_id,stock_id,Unnamed: 2_level_1
5,2,0.0018733931133985
5,68,0.0023004176145590
5,81,0.0054617860000476
5,100,0.0046436459937157
5,111,0.0029989438960921
...,...,...
32767,2,0.0010246982159702
32767,68,0.0013834203207654
32767,81,0.0026151711543734
32767,100,0.0019775423480388


In [19]:
df_realized_vol_per_stock.index.get_level_values(0)

NameError: name 'df_realized_vol_per_stock' is not defined

In [91]:
#df_realized_vol_per_stock['row_id'] = df_realized_vol_per_stock.apply(lambda x: fff(x.index.get_level_values(1),x.index.get_level_values(0)), axis=1 )
#df_realized_vol_per_stock['row_id'] = df_realized_vol_per_stock.apply(lambda x: x.index.get_level_values(1)+x.index.get_level_values(0))
df_realized_vol_per_stock['row_id'] = df_realized_vol_per_stock.apply(lambda x: x.index.get_level_values(1).astype(str) + "-" + x.index.get_level_values(0).astype(str))

#f'{stock_id}-{x}'

In [92]:
df_realized_vol_per_stock['trade_return'] = pd.DataFrame(s2['trade_wap'].groupby(['time_id','stock_id']).agg(trade_volatility))


In [93]:
df_realized_vol_per_stock

Unnamed: 0_level_0,Unnamed: 1_level_0,log_return,row_id,trade_return
time_id,stock_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
5,2,0.0018733931133985,2-5,0.0005385612603277
5,68,0.0023004176145590,68-5,0.0005042065749876
5,81,0.0054617860000476,81-5,0.0012333415215835
5,100,0.0046436459937157,100-5,0.0010931357974187
5,111,0.0029989438960921,111-5,0.0005883899284527
...,...,...,...,...
32767,2,0.0010246982159702,2-32767,0.0003467606729828
32767,68,0.0013834203207654,68-32767,0.0002697620948311
32767,81,0.0026151711543734,81-32767,0.0005826270207763
32767,100,0.0019775423480388,100-32767,0.0007258245022967


In [94]:
# def past_realized_volatility_per_stock(list_file,prediction_column_name):
#     df_past_realized = pd.DataFrame()
#     for file in list_file:
#         df_past_realized = pd.concat([df_past_realized,
#                                      realized_volatility_per_time_id(file,prediction_column_name)])
#     return df_past_realized
# df_past_realized_train = past_realized_volatility_per_stock(list_file=list_order_book_file_train,
#                                                            prediction_column_name='pred')

Let's join the output dataframe with train.csv to see the performance of the naive prediction on training set.

In [95]:
train['row_id'] = train['stock_id'].astype(str) + '-' + train['time_id'].astype(str)
train = train[['row_id','target']]


We will evaluate the naive prediction result by two metrics: RMSPE and R squared.

In [96]:
df_joined = train.merge(df_realized_vol_per_stock, on = ['row_id'], how = 'inner')

In [16]:
df_joined.info()

NameError: name 'df_joined' is not defined

In [98]:
from sklearn.metrics import r2_score
def rmspe(y_true, y_pred):
    return  (np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))) * 100

In [None]:
R2 = round(r2_score(y_true = df_joined['target'], y_pred = df_joined['log_return']),3)
RMSPE = round(rmspe(y_true = df_joined['target'], y_pred = df_joined['log_return']),3)
print(f'Performance of the naive prediction: R2 score: {R2}, RMSPE: {RMSPE}%')

In [None]:
R2 = round(r2_score(y_true = df_joined['target'], y_pred = df_joined['log_return'] + np.sqrt(df_joined['trade_return'])),3)
RMSPE = round(rmspe(y_true = df_joined['target'], y_pred = df_joined['log_return'] + np.sqrt(df_joined['trade_return'])),3)
print(f'Performance of the naive prediction: R2 score: {R2}, RMSPE: {RMSPE}%')

In [None]:
jovian.commit(filename="my-zero-to-gbm-proj-assign.ipynb")

## Machine Learning

Now we will start applying ML techniques to predict the volataility of the next 10 minutes window for each time-id/stock-id based on the order book volatility and trade volatility

we will learn the hyper parameters givne using the training targets for the same

we will use 2 different models to do the same

## Graident Bossting

We're now ready to train our gradient boosting machine (GBM) model. Here's how a GBM model works:

In [None]:
from xgboost import XGBRegressor

## Intel Extension for Scikit-learn

Intel(R) Extension for Scikit-learn* dynamically patches scikit-learn estimators to use Intel(R) oneAPI Data Analytics Library as the underlying solver, while getting the same solution faster.

To install these Intel-optimized packages for scikit-learn on Windows, Mac, and Linux x86_64, simply:

conda install scikit-learn-intelex

Once installed, there are two ways in which you can enable the replacement patching functionality for scikit-learn. You can enable it when you run your application:

python -m sklearnex my_application.py

Or you can explicitly enable the patching in your code:

from sklearnex import patch_sklearn

patch_sklearn()

In [99]:
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [None]:
Lazy

### Cross Validation

 create a validation set before training our XGBoost model. We'll use a different validation strategy this time, called <b> ShuffleSplit </b> cross validation (source):

In [100]:
from sklearn.model_selection import ShuffleSplit

In [101]:
ss = ShuffleSplit(n_splits = 5, test_size = 0.25, random_state=42)

Let's define a helper function train_and_evaluate which trains a model the given parameters and returns the trained model, training error and validation error.

In [None]:
def train_and_evaluate(X_train, train_targets, X_val, val_targets, **params):
    model = XGBRegressor(random_state=42, n_jobs=-1, **params)
    model.fit(X_train, train_targets)
    train_R2 = round(r2_score(train_targets, model.predict(X_train)),3)
    train_RMSPE = round(rmspe(train_targets, model.predict(X_train)),3)
    val_R2 = round(r2_score(val_targets, model.predict(X_val)),3)
    val_RMSPE = round(rmspe(val_targets, model.predict(X_val),),3)

    # train_rmse = rmse(model.predict(X_train), train_targets)
    # val_rmse = rmse(model.predict(X_val), val_targets)
    return model, train_R2, train_RMSPE, val_R2, val_RMSPE
#n_jobs = -1 means that use all the available threads in that machine where the alogorithm is running 

Now we will train the model for each split data of the ShuffleSplit

In [None]:
df_joined.info()

In [102]:
inputs = df_joined[['log_return', 'trade_return']].copy()
targets = df_joined['target'].copy()

In [103]:
inputs
#targets

Unnamed: 0,log_return,trade_return
0,0.0018733931133985,0.0005385612603277
1,0.0008151048469293,0.0003309202729724
2,0.0012939132651704,0.0002343668165850
3,0.0015225304382781,0.0005046058795415
4,0.0012425119932203,0.0002491848717909
...,...,...
19144,0.0024886340803033,0.0005013425834477
19145,0.0009824073836777,0.0002220391470473
19146,0.0019381003855530,0.0010461938800290
19147,0.0025020021033653,0.0006868820055388


In [107]:
from lazypredict.Supervised import LazyRegressor




In [104]:
models = []

for train_idxs, val_idxs in ss.split(inputs):
    X_train, train_targets = inputs.iloc[train_idxs], targets.iloc[train_idxs]
    X_val, val_targets = inputs.iloc[val_idxs], targets.iloc[val_idxs]
    model, train_R2, train_RMSPE, val_R2, val_RMSPE = train_and_evaluate(X_train, 
                                                     train_targets, 
                                                     X_val, 
                                                     val_targets, 
                                                     max_depth=5, 
                                                     n_estimators=50)
    models.append(model)
    print('Train R2: {}, Train RMSPE: {}, Validation R2: {}, Validation RMSPE: {}'.format(train_R2, train_RMSPE, val_R2, val_RMSPE))

NameError: name 'train_and_evaluate' is not defined

Let's also define a function to average predictions from the 5 different models.

In [None]:
def predict_avg(models, inputs):
    return np.mean([model.predict(inputs) for model in models], axis=0)

In [None]:
preds = predict_avg(models, inputs)

In [None]:
preds

In [106]:
# explicitly require this experimental feature
from sklearn.experimental import enable_halving_search_cv
# now you can import normally from model_selection
from sklearn.model_selection import HalvingGridSearchCV

ImportError: cannot import name 'enable_halving_search_cv' from 'sklearn.experimental' (C:\ProgramData\Anaconda3\lib\site-packages\sklearn\experimental\__init__.py)

In [None]:
model = XGBRegressor(random_state=42, n_jobs=-1)
param_grid = {"max_depth": [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, None]],
              "n_estimators" : list(range(10,51,2)),
              "learning_rate": np.logspace(-3,3,100),
              # "booster": ['gbtree', 'gblinear','dart'],
              # "gamma": np.logspace(-1,1,100),
              "subsample": np.linspace(0.1,1.0,9, endpoint=False),
              # "num_parallel_tree": range(1,6,1),
              # "reg_alpha": np.logspace(-3,3,300),
              # "reg_lambda": np.logspace(-3,3,300)
            }


In [None]:
models = []
for train_idxs, val_idxs in ss.split(inputs):
    X_train, train_targets = inputs.iloc[train_idxs], targets.iloc[train_idxs]
    X_val, val_targets = inputs.iloc[val_idxs], targets.iloc[val_idxs]
    grid_search = HalvingGridSearchCV(model, param_grid, random_state=42).fit(X_train, train_targets)
    model, train_R2, train_RMSPE, val_R2, val_RMSPE = train_and_evaluate(X_train, 
                                                    train_targets, 
                                                    X_val, 
                                                    val_targets, 
                                                    **grid_search.best_params_)
    models.append(model)
    print('Train R2: {}, Train RMSPE: {}, Validation R2: {}, Validation RMSPE: {}'.format(train_R2, train_RMSPE, val_R2, val_RMSPE))


In [None]:
models = []
for train_idxs, val_idxs in ss.split(inputs):
    X_train, train_targets = inputs.iloc[train_idxs], targets.iloc[train_idxs]
    X_val, val_targets = inputs.iloc[val_idxs], targets.iloc[val_idxs]
    X_train = np.ascontiguousarray(X_train).reshape(-1,2)
    train_targets = np.ascontiguousarray(train_targets).reshape(-1,1)
    X_val = np.ascontiguousarray(X_val).reshape(-1,2)
    val_targets = np.ascontiguousarray(val_targets).reshape(-1,1)
    reg = LazyRegressor(ignore_warnings=False, random_state=11111, verbose=True)
    models, predictions = reg.fit(X_train, X_val, train_targets, val_targets)  # pass all sets
    print(models.head(100))


In [None]:
# Fit LazyRegressor




In [None]:
grid_search.best_params_

Discussion on cross validation designs

https://www.kaggle.com/vishnurapps/undersanding-kfold-stratifiedkfold-and-groupkfold