# Build daily trading volume profiles

This notebook provides a template for using DataRobot time series models to build a daily trading volume profile. Each ticker on an exchange has a different profile. To build one, you divide the trading day into intervals (starting at one minute; you can also work with, for example, 5-, 10-, or 30-minute intervals).
For each of these intervals, and for each ticker, you want to know what fraction of the daily volume will be traded during that interval.
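
As a minimal sketch of the quantity being predicted, the fraction of daily volume per interval can be computed with a grouped transform. The numbers are made up for illustration, and this is not necessarily how `helper.py` computes it.

```python
import pandas as pd

# Toy one-minute bars for two tickers on a single day (made-up numbers)
bars = pd.DataFrame({
    "date": ["2018-04-16"] * 6,
    "Symbol": ["AAPL", "AAPL", "AAPL", "DIS", "DIS", "DIS"],
    "minute": ["09:30", "09:31", "09:32"] * 2,
    "TradeVolume": [600, 300, 100, 50, 25, 25],
})

# Total volume per ticker and day, broadcast back to every row
daily_total = bars.groupby(["Symbol", "date"])["TradeVolume"].transform("sum")

# Fraction of that day's volume traded in each interval
bars["volume_fraction"] = bars["TradeVolume"] / daily_total
print(bars)
```

By construction, the fractions for each ticker and day sum to one.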

Each of these intervals has its own time series model that predicts the fraction of volume that will be traded in that interval on the next working day. In addition to historical data for the corresponding window, you will also leverage neighboring windows. That is, if you are predicting what will happen tomorrow between 9:35am and 9:36am, look at what happened during that minute in the past, but also at what happened between 9:30am and 9:35am and between 9:36am and 9:41am (since you are looking at past days, you can look ahead to the following minutes).
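
One way to picture the neighboring-window features is a shift within each ticker and day. The sketch below uses made-up data and borrows the `_bwd_` and `_fwd_` suffixes from the prepared data shown later in this notebook; the actual implementation in `helper.py` may differ.

```python
import pandas as pd

# Toy one-minute bars for one ticker over two days (made-up numbers)
bars = pd.DataFrame({
    "date": ["2018-04-16"] * 3 + ["2018-04-17"] * 3,
    "Symbol": ["AAPL"] * 6,
    "minute": ["09:30", "09:31", "09:32"] * 2,
    "TradeVolume": [600, 300, 100, 500, 400, 100],
})

radius = 1  # how many windows to look behind and ahead
bars = bars.sort_values(["Symbol", "date", "minute"]).reset_index(drop=True)

for k in range(1, radius + 1):
    by_day = bars.groupby(["Symbol", "date"])["TradeVolume"]
    # Value observed k windows earlier on the same day
    bars[f"TradeVolume_bwd_{k}"] = by_day.shift(k)
    # Value observed k windows later on the same day
    bars[f"TradeVolume_fwd_{k}"] = by_day.shift(-k)

print(bars)
```

Grouping by ticker and day keeps the first window of a day from borrowing values from the previous day's close, so the edges are left as missing values.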

This notebook (in conjunction with the helper file `helper.py`) provides the code necessary to prepare the data and run the modeling pipeline for each time window. 

Furthermore, to keep control of all the DataRobot projects required to build the volume profile, it creates a pandas dataframe that keeps track of the modeling choices, projects, and models that are used along the way.

## Setup

### Import libraries

In [1]:
import time
import datetime
from datetime import timedelta
from os.path import join
import datarobot as dr
import numpy as np
import pandas as pd

from importlib import reload
import helper
reload(helper)

<module 'helper' from '/Users/daniel.straulino/Documents/Coding/VolumePrediction/helper.py'>

### Connect to DataRobot

In [2]:
# Instantiate the DataRobot connection

DATAROBOT_API_TOKEN = "" # Get this from the Developer Tools page in the DataRobot UI
# Endpoint - This notebook uses the default endpoint for DataRobot Managed AI Cloud (US)
DATAROBOT_ENDPOINT = "https://app.datarobot.com/api/v2" # This should be the URL you use to access the DataRobot UI

client = dr.Client(
    token=DATAROBOT_API_TOKEN, 
    endpoint=DATAROBOT_ENDPOINT,
    user_agent_suffix='' # Optional but helps DataRobot improve this workflow
)

dr.client._global_client = client

### Load and inspect data

Examine the data you will be using. In this example, you have around six weeks of trading volume data for ten different symbols.

In [3]:
data = pd.read_csv('https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/trading_activity/stock_data.csv')

In [4]:
data.head()

Unnamed: 0,date,date_time,Symbol,minute,TradeVolume,TradeStopStockIndicator,TradeCorrectionIndicator,TradeThroughExemptIndicator,TradeId,SaleCondition,...,TradeId_med_minute,NumTrades_med_minute,TradeVolume_dev_minute,TradeId_dev_minute,NumTrades_dev_minute,Sector,Security_Type,Cap,Style,Exchange
0,2018-04-16,2018-04-16 09:30:00.000000,AAPL,09:30,1260427,0,0.0,0.490443,1181,17,...,819.5,1452.5,834052.9,636.93066,1079.936,Tech,Common Stock,Large,Blend,Nasdaq
1,2018-04-16,2018-04-16 09:31:00.000000,AAPL,09:31,143181,0,0.0,0.328965,670,4,...,721.5,965.0,157709.42,685.5965,918.561,Tech,Common Stock,Large,Blend,Nasdaq
2,2018-04-16,2018-04-16 09:32:00.000000,AAPL,09:32,88857,0,0.0,0.396825,530,4,...,838.0,946.5,101739.195,539.57434,605.1958,Tech,Common Stock,Large,Blend,Nasdaq
3,2018-04-16,2018-04-16 09:33:00.000000,AAPL,09:33,122435,0,0.0,0.263907,773,6,...,746.5,798.0,106579.58,561.4731,681.22577,Tech,Common Stock,Large,Blend,Nasdaq
4,2018-04-16,2018-04-16 09:34:00.000000,AAPL,09:34,93103,0,0.0,0.347287,645,5,...,748.0,843.5,118851.01,650.0484,755.0338,Tech,Common Stock,Large,Blend,Nasdaq


In [5]:
data = data[['date', 'date_time', 'Symbol', 'minute', 'TradeVolume','TradePrice',
       'NumTrades', 'Sector', 'Security_Type','Cap', 'Style', 'Exchange']].copy()

In [6]:
data.head(3)

Unnamed: 0,date,date_time,Symbol,minute,TradeVolume,TradePrice,NumTrades,Sector,Security_Type,Cap,Style,Exchange
0,2018-04-16,2018-04-16 09:30:00.000000,AAPL,09:30,1260427,175.0,2145,Tech,Common Stock,Large,Blend,Nasdaq
1,2018-04-16,2018-04-16 09:31:00.000000,AAPL,09:31,143181,175.31,763,Tech,Common Stock,Large,Blend,Nasdaq
2,2018-04-16,2018-04-16 09:32:00.000000,AAPL,09:32,88857,175.3,567,Tech,Common Stock,Large,Blend,Nasdaq


### Modeling configuration

As mentioned in the introduction, the volume profile is built on a time window basis. The first step is therefore to choose the granularity of the project.

Similarly, to model each time window you will also be looking at the neighboring time windows. Therefore, you need to choose how wide the radius of observation should be.

Finally, you will focus on predicting volume % in this notebook: the fraction of the total daily volume that will trade during the interval.

In [7]:
# Number of minutes in each window
window_length = 1
# How many windows to look ahead and behind
neighbours_radius = 1
# Use percentage of volume
percentage = True

modelling_choice = {'window_length': window_length,
                 'window_radius': neighbours_radius,
                 'percentage': percentage}

Choose which aggregations you want to apply (in case you aggregate into windows of more than 1 minute). If there is no aggregation, there is no need for this dictionary.

You only aggregate numeric features. All the categorical features remain unchanged as they are constant.

In [8]:
aggregation_dictionary = {'date':'first',
                          'minute':'min',
                          'TradeVolume':['sum', 'min', 'max','std'],
                          'TradePrice':['mean', 'min', 'max','std'],
                          'NumTrades':['sum', 'min', 'max','std'],
                          }
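
To illustrate how such a dictionary could be applied, the sketch below aggregates made-up one-minute bars into 5-minute windows and flattens the resulting column names into the `Feature_aggregation` form (for example `TradeVolume_sum`) that appears in the prepared data. It is an assumption about the approach, not the code in `helper.py`.

```python
import pandas as pd

# Ten made-up one-minute bars for a single ticker
bars = pd.DataFrame({
    "date_time": pd.date_range("2018-04-16 09:30", periods=10, freq="min"),
    "Symbol": ["AAPL"] * 10,
    "TradeVolume": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "NumTrades": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

window_length = 5  # minutes per window
agg = {"TradeVolume": ["sum", "min", "max"], "NumTrades": ["sum"]}

# Group by ticker and by fixed-width time window, then aggregate
windows = bars.groupby(
    ["Symbol", pd.Grouper(key="date_time", freq=f"{window_length}min")]
).agg(agg)

# Flatten the two-level column index into names such as TradeVolume_sum
windows.columns = ["_".join(col) for col in windows.columns]
windows = windows.reset_index()
print(windows)
```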

### Time series settings: daytime features

Time series problems require more input than typical machine learning problems. You need to decide how many days into the future you want to forecast, how far into the past you should go when deriving additional features, and more. You also need to determine which features are known in advance and what calendar should be used for modeling. The daytime features dictionary below reflects all the choices you need to make before running a time series project.

The default settings provided here are a good starting point, but you might decide to experiment with other setups.

It's always useful to add a calendar. This example uses a US-specific calendar generated by DataRobot, but ideally you will build a calendar for each exchange.

A multiseries calendar should have a column with dates, a second column with the name of the events, and a third (which needs to have the same name as the Series ID in the project, in this case `Symbol_`) that specifies to which series it corresponds. If the last column is empty, the event applies to all series. If an event affects several but not all series, you need to add a row per series, as shown in rows 4 and 5.
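
A calendar in this layout can be assembled with pandas. The sketch below is illustrative only: the `build_calendar` helper and the event list are hypothetical, and the column names follow the description above.

```python
import pandas as pd

def build_calendar(events, series_column="Symbol_"):
    """Expand (date, name, symbols) tuples into one calendar row per series.

    An empty symbol list means the event applies to all series, which is
    encoded as a single row with an empty series column.
    """
    rows = []
    for date, name, symbols in events:
        for symbol in (symbols or [""]):
            rows.append({"Date": date, "Name": name, series_column: symbol})
    return pd.DataFrame(rows)

# Hypothetical events for illustration
events = [
    ("2023-01-06", "Non-farm payrolls", []),                          # all series
    ("2023-01-09", "Earnings report", ["DIS"]),                       # one series
    ("2023-02-02", "Tech specific announcement", ["AAPL", "AMZN"]),   # two series
]

calendar = build_calendar(events)
print(calendar.to_csv(index=False))
```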

In [9]:
# Generate a calendar from the country code
# Uncomment the line below if you need to create a new one
# Choose country and start/end dates.

# basicCalendar = dr.CalendarFile().create_calendar_from_country_code('US', '2018-04-08 00:00:00', '2018-07-08 00:00:00')

In [10]:
basicCalendar = dr.CalendarFile.get('636b9fa1e0d3da945fc201a6')

In [11]:
pd.read_csv('data/calendar_example.csv').fillna('').head(7)

Unnamed: 0,Date,Name,Symbol_
0,2023-01-06,Non-farms payroll,
1,2023-01-09,Earnings report,DIS
2,2023-01-17,Mathin Luther King Jr day,
3,2023-02-15,Ex-dividend date,WMT
4,2023-02-02,Tech specific announcement,APPL
5,2023-02-02,Tech specific announcement,AMZN
6,…,…,…


Every time series project requires you to configure a collection of settings:

* Target: The name of the variable to predict (in this case % volume)
* Metric: The metric against which the model is optimized.
    * Regression: RMSE, MAE, MASE. 
    * Classification: LogLoss, AUC.
* Series ID: The column that corresponds to the series name (Symbol_)
* Forecasting Window (FW): Determine how many steps into the future to forecast
* Feature Derivation Window (FDW): Determine how many steps into the past you should use to derive features (seasonality and events are handled separately)
* Known in advance features (KA): Determine which columns are characteristics that you know ahead of time. In this case you would have columns such as Sector, Exchange, Security Type, etc.

Review the dictionary below that configures these settings. Target, SeriesID, and KA will depend on the dataset columns. The FDW and FW are good starting points, but you can iterate over other options as part of the experimentation process. In this case, a one-day FW makes sense since you are usually interested in the next day's profile.

In [12]:
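# Datetime partitioning column
date_time_part = 'date_time_'
# With percentage = True, TradeVolume_sum holds the fraction of daily volume
# (see the prepared data preview in the "Prepare data" section)
target = 'TradeVolume_sum'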
metric = 'MASE'
# Series ID
multiseries_ids = 'Symbol_'
# Features known in advance
kia_columns = ['Sector_', 'Security_Type_', 'Cap_', 'Style_', 'Exchange_']
# These settings are a good starting point and can be modified as part of the experimentation process

# Use a 14-day FDW
fdw_start = -14
fdw_end = 0
# For the FW you are only looking at the next day for now
fw_start = 1
fw_end = 1

# Use -1 to make use of all the available computing resources
number_of_workers = -1
# Backtests are similar to the folds in cross-validation
# The number will also be limited by the amount of data you have
number_of_backtests = 2

In [13]:
datetime_dict ={'partitioning':date_time_part,
                'calendar':basicCalendar,
                'target':target,
                'metric':metric,
                'seriesID':multiseries_ids,
                'KIA': kia_columns,
                'fdw_start':fdw_start,
                'fdw_end': fdw_end,
                'fw_start':fw_start,
                'fw_end':fw_end,
                'workers':number_of_workers,
                'backtests':number_of_backtests
                }
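
As a rough sketch of how a dictionary like this might be consumed, the hypothetical function below checks the window settings and translates them into keyword arguments. The keyword names are assumed to match the parameters of `dr.DatetimePartitioningSpecification` in the DataRobot Python client; verify them against your client version. The range checks are sanity checks for this sketch, not documented DataRobot rules, and the calendar and known-in-advance features are left out for brevity.

```python
def partitioning_kwargs(settings):
    """Validate window settings and map them to partitioning keyword arguments."""
    # The feature derivation window lies in the past and must have positive length
    if not settings["fdw_start"] < settings["fdw_end"] <= 0:
        raise ValueError("Expected fdw_start < fdw_end <= 0")
    # The forecast window must not end before it starts
    if settings["fw_start"] > settings["fw_end"]:
        raise ValueError("Expected fw_start <= fw_end")

    return {
        "datetime_partition_column": settings["partitioning"],
        "use_time_series": True,
        "multiseries_id_columns": [settings["seriesID"]],
        "feature_derivation_window_start": settings["fdw_start"],
        "feature_derivation_window_end": settings["fdw_end"],
        "forecast_window_start": settings["fw_start"],
        "forecast_window_end": settings["fw_end"],
        "number_of_backtests": settings["backtests"],
    }

# Same values as the notebook's dictionary, minus the calendar and KA columns
example_settings = {
    "partitioning": "date_time_",
    "seriesID": "Symbol_",
    "fdw_start": -14,
    "fdw_end": 0,
    "fw_start": 1,
    "fw_end": 1,
    "backtests": 2,
}

kwargs = partitioning_kwargs(example_settings)
print(kwargs)
```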

## Prepare data

You can use the helper function `prepare_data` to get data ready for modeling. This function aggregates the data (if necessary) and adds columns corresponding to the neighboring time intervals according to the modeling settings. As a reminder, you will build one project per minute (or per interval if you use a different level of granularity), and therefore you need to construct a dataset for each project. The modeling settings dictate the level of granularity as well as the breadth of features you want to add to each time slice (how many minutes before and after you should use to derive features). The aggregation dictionary defines which operations you will use to create these new features.

In [14]:
prepared_data = helper.prepare_data(data, modelling_choice, aggregation_dictionary = aggregation_dictionary)

The next cell is optional and should be deleted for a full run. It saves time by restricting the data to the last half hour of the trading day.

In [15]:
# This is just to reduce the scope to the last half an hour
# Can be skipped 
time_cut = timedelta(hours = 15, minutes = 30)
prepared_data = prepared_data[prepared_data.minute_min>time_cut]

In [16]:
prepared_data[['date_time_', 'Symbol_', 'Sector_', 'Security_Type_', 'Cap_', 'Style_',
       'Exchange_', 'date_first', 'minute_min', 'TradeVolume_sum',
       'NumTrades_sum', 'NumTrades_max', 'TradePrice_mean','NumTrades_sum_bwd_1', 'NumTrades_sum_fwd_1']].head()

Unnamed: 0,date_time_,Symbol_,Sector_,Security_Type_,Cap_,Style_,Exchange_,date_first,minute_min,TradeVolume_sum,NumTrades_sum,NumTrades_max,TradePrice_mean,NumTrades_sum_bwd_1,NumTrades_sum_fwd_1
361,2018-04-16 15:31:00,AAPL,Tech,Common Stock,Large,Blend,Nasdaq,2018-04-16,0 days 15:31:00,0.003934,415,415,175.7,331.0,290.0
362,2018-04-16 15:32:00,AAPL,Tech,Common Stock,Large,Blend,Nasdaq,2018-04-16,0 days 15:32:00,0.001732,290,290,175.73,415.0,415.0
363,2018-04-16 15:33:00,AAPL,Tech,Common Stock,Large,Blend,Nasdaq,2018-04-16,0 days 15:33:00,0.002628,415,415,175.75,290.0,508.0
364,2018-04-16 15:34:00,AAPL,Tech,Common Stock,Large,Blend,Nasdaq,2018-04-16,0 days 15:34:00,0.003762,508,508,175.75,415.0,267.0
365,2018-04-16 15:35:00,AAPL,Tech,Common Stock,Large,Blend,Nasdaq,2018-04-16,0 days 15:35:00,0.00197,267,267,175.79,508.0,358.0


Use the following cells to check the data after preparation and track time windows.

In [17]:
prepared_data.columns

Index(['date_time_', 'Symbol_', 'Sector_', 'Security_Type_', 'Cap_', 'Style_',
       'Exchange_', 'date_first', 'minute_min', 'TradeVolume_min',
       'TradeVolume_max', 'TradeVolume_std', 'TradePrice_mean',
       'TradePrice_min', 'TradePrice_max', 'TradePrice_std', 'NumTrades_sum',
       'NumTrades_min', 'NumTrades_max', 'NumTrades_std', 'TradeVolume_sum',
       'TradeVolume_min_fwd_1', 'TradeVolume_max_fwd_1',
       'TradeVolume_std_fwd_1', 'TradePrice_mean_fwd_1',
       'TradePrice_min_fwd_1', 'TradePrice_max_fwd_1', 'TradePrice_std_fwd_1',
       'NumTrades_sum_fwd_1', 'NumTrades_min_fwd_1', 'NumTrades_max_fwd_1',
       'NumTrades_std_fwd_1', 'TradeVolume_sum_fwd_1', 'TradeVolume_min_bwd_1',
       'TradeVolume_max_bwd_1', 'TradeVolume_std_bwd_1',
       'TradePrice_mean_bwd_1', 'TradePrice_min_bwd_1', 'TradePrice_max_bwd_1',
       'TradePrice_std_bwd_1', 'NumTrades_sum_bwd_1', 'NumTrades_min_bwd_1',
       'NumTrades_max_bwd_1', 'NumTrades_std_bwd_1', 'TradeVolume_sum_bw

In [18]:
prepared_data.Symbol_.unique()

array(['AAPL', 'AMZN', 'DIS', 'F', 'FB', 'NFLX', 'QQQ', 'SPY', 'VZ',
       'WMT'], dtype=object)

Keep track of the windows for which you will be building models by creating a dataframe that captures the hour and minute at which each window starts.

In [19]:
slices_df = pd.DataFrame(prepared_data.minute_min.unique()[:-1])

In [20]:
# Get the datetime object that identifies the start of each time slice
# Get the hour and minute that correspond to the datetime object for future use

slices_df = pd.concat([slices_df, slices_df[0].dt.components['hours'],slices_df[0].dt.components['minutes']], axis =1)

In [21]:
slices_df.head()

Unnamed: 0,0,hours,minutes
0,0 days 15:31:00,15,31
1,0 days 15:32:00,15,32
2,0 days 15:33:00,15,33
3,0 days 15:34:00,15,34
4,0 days 15:35:00,15,35
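
The `hours` and `minutes` columns come from the `.dt.components` accessor of a timedelta series. As a small, self-contained illustration with made-up values, the same components can be turned into `HH:MM` labels like the ones used to identify each slice:

```python
from datetime import timedelta

import pandas as pd

# Two made-up window start times, expressed as offsets from midnight
starts = pd.Series([timedelta(hours=15, minutes=31), timedelta(hours=15, minutes=32)])

# .dt.components splits each timedelta into days, hours, minutes, and so on
components = starts.dt.components

# Zero-pad and join hours and minutes into an HH:MM label
labels = (
    components["hours"].astype(str).str.zfill(2)
    + ":"
    + components["minutes"].astype(str).str.zfill(2)
)
print(labels.tolist())
```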


## Run the model factory

When you have finished data preparation, you can run the modeling jobs in DataRobot. The results are stored in a dataframe that allows you to quickly access the projects for each of the time windows.
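
The internals of `helper.run_all_projects` are not shown in this notebook. As a hedged sketch of the general idea, the code below splits made-up prepared data into one dataset per time slice and builds a name for each one. The `slice_project_name` function is hypothetical, and its naming pattern is inferred from the project names printed in the log output.

```python
import datetime

import pandas as pd

def slice_project_name(percentage, window_length, hours, minutes, run_date):
    """Build a per-slice project name following the pattern seen in the logs."""
    target_label = "percentage" if percentage else "sum"
    return (
        f"VolPred_{target_label}_each_{window_length}min_"
        f"{hours:02d}:{minutes:02d}_v_{run_date:%Y-%m-%d}"
    )

# Made-up prepared data: two slices, with two rows in the first one
prepared = pd.DataFrame({
    "minute_min": pd.to_timedelta(["15:31:00", "15:32:00", "15:31:00"]),
    "Symbol_": ["AAPL", "AAPL", "DIS"],
    "TradeVolume_sum": [0.004, 0.002, 0.003],
})

run_date = datetime.date(2023, 5, 29)
datasets = {}
for start, frame in prepared.groupby("minute_min"):
    parts = start.components  # named tuple with hours, minutes, ...
    name = slice_project_name(True, 1, parts.hours, parts.minutes, run_date)
    # In the real pipeline, each frame would be uploaded as its own project
    datasets[name] = frame

print(list(datasets))
```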

In [22]:
# Run a time series project for each "slice"
# Keep all the project IDs so you can later gather all the information you need

projects_df = helper.run_all_projects(prepared_data, slices_df, modelling_choice, datetime_dict)
    

VolPred_percentage_each_1min_15:31_v_2023-05-29
Project VolPred_percentage_each_1min_15:31_v_2023-05-29 creation started: 2023-05-29 15:37:38
Project creation finished. Elapsed time: 168.05912685394287
https://app.datarobot.com/projects/6474b8b355ef8574714197f4/eda
VolPred_percentage_each_1min_15:32_v_2023-05-29
Project VolPred_percentage_each_1min_15:32_v_2023-05-29 creation started: 2023-05-29 15:40:26
Project creation finished. Elapsed time: 184.54082369804382
https://app.datarobot.com/projects/6474b95bb2f47210965f1243/eda
VolPred_percentage_each_1min_15:33_v_2023-05-29
Project VolPred_percentage_each_1min_15:33_v_2023-05-29 creation started: 2023-05-29 15:43:31
Project creation finished. Elapsed time: 200.2231252193451
https://app.datarobot.com/projects/6474ba13b375dad5e35f11a8/eda
VolPred_percentage_each_1min_15:34_v_2023-05-29
Project VolPred_percentage_each_1min_15:34_v_2023-05-29 creation started: 2023-05-29 15:46:51
Project creation finished. Elapsed time: 173.36340594291687
h

In [23]:
projects_df

Unnamed: 0,project_id,slice,url,project
0,6474b8b355ef8574714197f4,15:31,https://app.datarobot.com/projects/6474b8b355e...,Project(VolPred_percentage_each_1min_15:31_v_2...
1,6474b95bb2f47210965f1243,15:32,https://app.datarobot.com/projects/6474b95bb2f...,Project(VolPred_percentage_each_1min_15:32_v_2...
2,6474ba13b375dad5e35f11a8,15:33,https://app.datarobot.com/projects/6474ba13b37...,Project(VolPred_percentage_each_1min_15:33_v_2...
3,6474badbb2f47210965f124b,15:34,https://app.datarobot.com/projects/6474badbb2f...,Project(VolPred_percentage_each_1min_15:34_v_2...
4,6474bb89b375dad5e35f11db,15:35,https://app.datarobot.com/projects/6474bb89b37...,Project(VolPred_percentage_each_1min_15:35_v_2...
5,6474bc32c173843ed041972b,15:36,https://app.datarobot.com/projects/6474bc32c17...,Project(VolPred_percentage_each_1min_15:36_v_2...
6,6474bcdfb2f47210965f1265,15:37,https://app.datarobot.com/projects/6474bcdfb2f...,Project(VolPred_percentage_each_1min_15:37_v_2...
7,6474bd92b375dad5e35f11f4,15:38,https://app.datarobot.com/projects/6474bd92b37...,Project(VolPred_percentage_each_1min_15:38_v_2...
8,6474be44b375dad5e35f1207,15:39,https://app.datarobot.com/projects/6474be44b37...,Project(VolPred_percentage_each_1min_15:39_v_2...
9,6474bf06b375dad5e35f1220,15:40,https://app.datarobot.com/projects/6474bf06b37...,Project(VolPred_percentage_each_1min_15:40_v_2...


Give a unique name to the dataframe that you have constructed and store it as a pickle file. 

In [24]:
now = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
# Label the run by target type: fraction of daily volume or raw volume
percentage_str = 'percentage' if modelling_choice['percentage'] else 'sum'

project_name = f"VolPred_{percentage_str}_each_{modelling_choice['window_length']}min_v_{now}"


In [25]:
project_name

'VolPred_percentage_each_1min_v_2023-05-29-17-09-17'

In [26]:
projects_df.to_pickle('results/'+project_name + '.pkl') 
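
In a later session, the stored dataframe can be reloaded with `pd.read_pickle`. The sketch below shows a round trip with a made-up dataframe and a temporary directory; the IDs and file name are placeholders. It also creates the output directory first, since `to_pickle` does not create missing directories.

```python
import os
import tempfile

import pandas as pd

# Placeholder tracking table (hypothetical project IDs)
projects = pd.DataFrame({
    "project_id": ["abc123", "def456"],
    "slice": ["15:31", "15:32"],
})

# Make sure the target directory exists before writing
results_dir = os.path.join(tempfile.mkdtemp(), "results")
os.makedirs(results_dir, exist_ok=True)

path = os.path.join(results_dir, "VolPred_example.pkl")
projects.to_pickle(path)

# Reload the table and confirm it matches what was saved
restored = pd.read_pickle(path)
print(restored.equals(projects))
```

If the dataframe contains DataRobot `Project` objects, as in this notebook, the `datarobot` package needs to be importable in the session that reloads the file.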

In [27]:
projects_df.head()

Unnamed: 0,project_id,slice,url,project
0,6474b8b355ef8574714197f4,15:31,https://app.datarobot.com/projects/6474b8b355e...,Project(VolPred_percentage_each_1min_15:31_v_2...
1,6474b95bb2f47210965f1243,15:32,https://app.datarobot.com/projects/6474b95bb2f...,Project(VolPred_percentage_each_1min_15:32_v_2...
2,6474ba13b375dad5e35f11a8,15:33,https://app.datarobot.com/projects/6474ba13b37...,Project(VolPred_percentage_each_1min_15:33_v_2...
3,6474badbb2f47210965f124b,15:34,https://app.datarobot.com/projects/6474badbb2f...,Project(VolPred_percentage_each_1min_15:34_v_2...
4,6474bb89b375dad5e35f11db,15:35,https://app.datarobot.com/projects/6474bb89b37...,Project(VolPred_percentage_each_1min_15:35_v_2...
