# Purpose of this Notebook 🎯

- Save newcomers time by explaining how the submission API works.

# Credits 💡

- [Optiver 2023 Basic Submission Demo](https://www.kaggle.com/code/sohier/optiver-2023-basic-submission-demo) provides a basic demo for submission.

- [Explain the Data📝 | LightGBM Baseline🚀](https://www.kaggle.com/code/a27182818/explain-the-data-lightgbm-baseline) is another notebook of mine. It explores the training data in depth and provides a simple LightGBM workflow.

# Remarks 🚥

- The competition is still in its early stage as of writing (2023-09-29).

- The contents that the API returns could change considerably as the competition progresses.

- _We will update this notebook continuously to catch any changes._

- If you find parts of this notebook wrong or just difficult to understand, please point them out in the comments. Much appreciated.

# TL;DR, here are some Takeaways🍎

- **The purpose of the API is to use hidden data to score notebooks**: when you submit your notebook, the data the API uses to score your submission is not the same as the data it serves when you run the notebook privately.

- When you run the API in a notebook, it provides the same data as in `/kaggle/input/optiver-trading-at-the-close/example_test_files`, which is currently a subset of our training data.

- The API may return different data as the competition progresses.

- To understand the API better, take a look at `public_timeseries_testing_util.py`. From it, you can even build a custom API of your own.
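As a rough illustration of the last point, here is a minimal sketch of a home-made test iterator. It is not the code in `public_timeseries_testing_util.py`; the names `make_toy_frame` and `iter_mock_test` are hypothetical, the data is synthetic, and `revealed_targets` is left out for brevity. It only mimics the idea of serving one batch per (`date_id`, `seconds_in_bucket`) pair.

```python
import numpy as np
import pandas as pd


def make_toy_frame(n_dates=2, n_buckets=3, n_stocks=4, seed=0):
    """Build a tiny synthetic frame with the same key columns as train.csv."""
    rng = np.random.default_rng(seed)
    rows = [
        (stock, date, bucket * 10)
        for date in range(n_dates)
        for bucket in range(n_buckets)
        for stock in range(n_stocks)
    ]
    df = pd.DataFrame(rows, columns=["stock_id", "date_id", "seconds_in_bucket"])
    df["wap"] = 1.0 + rng.normal(0, 1e-3, len(df))
    df["target"] = rng.normal(0, 5, len(df))
    # row_id follows the pattern seen in the data: date_seconds_stock
    df["row_id"] = (
        df["date_id"].astype(str) + "_"
        + df["seconds_in_bucket"].astype(str) + "_"
        + df["stock_id"].astype(str)
    )
    return df


def iter_mock_test(df):
    """Yield (test, sample_prediction) once per (date_id, seconds_in_bucket)."""
    for _, batch in df.groupby(["date_id", "seconds_in_bucket"], sort=True):
        # Hide the target, as the real API does for the test frame.
        test = batch.drop(columns=["target"]).reset_index(drop=True)
        sample_prediction = pd.DataFrame({"row_id": test["row_id"], "target": 0.0})
        yield test, sample_prediction


toy = make_toy_frame()
batches = list(iter_mock_test(toy))
print(len(batches))         # 2 dates x 3 buckets = 6 batches
print(batches[0][0].shape)  # (4, 5): 4 stocks, 5 columns without target
```

A local iterator like this lets you rehearse the prediction loop on any slice of the training data without calling `make_env()`.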

# Let's get started 🦾!

In [1]:
# First, we take a look at the training data
import pandas as pd

train = pd.read_csv("/kaggle/input/optiver-trading-at-the-close/train.csv")
train

Unnamed: 0,stock_id,date_id,seconds_in_bucket,imbalance_size,imbalance_buy_sell_flag,reference_price,matched_size,far_price,near_price,bid_price,bid_size,ask_price,ask_size,wap,target,time_id,row_id
0,0,0,0,3180602.69,1,0.999812,13380276.64,,,0.999812,60651.50,1.000026,8493.03,1.000000,-3.029704,0,0_0_0
1,1,0,0,166603.91,-1,0.999896,1642214.25,,,0.999896,3233.04,1.000660,20605.09,1.000000,-5.519986,0,0_0_1
2,2,0,0,302879.87,-1,0.999561,1819368.03,,,0.999403,37956.00,1.000298,18995.00,1.000000,-8.389950,0,0_0_2
3,3,0,0,11917682.27,-1,1.000171,18389745.62,,,0.999999,2324.90,1.000214,479032.40,1.000000,-4.010200,0,0_0_3
4,4,0,0,447549.96,-1,0.999532,17860614.95,,,0.999394,16485.54,1.000016,434.10,1.000000,-7.349849,0,0_0_4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5237975,195,480,540,2440722.89,-1,1.000317,28280361.74,0.999734,0.999734,1.000317,32257.04,1.000434,319862.40,1.000328,2.310276,26454,480_540_195
5237976,196,480,540,349510.47,-1,1.000643,9187699.11,1.000129,1.000386,1.000643,205108.40,1.000900,93393.07,1.000819,-8.220077,26454,480_540_196
5237977,197,480,540,0.00,0,0.995789,12725436.10,0.995789,0.995789,0.995789,16790.66,0.995883,180038.32,0.995797,1.169443,26454,480_540_197
5237978,198,480,540,1000898.84,1,0.999210,94773271.05,0.999210,0.999210,0.998970,125631.72,0.999210,669893.00,0.999008,-1.540184,26454,480_540_198


In [2]:
# Then, we focus on the API
import optiver2023

# You can only call `make_env()` once per session.
env = optiver2023.make_env()
iter_test = env.iter_test()

# Count how many times the for loop runs.
counter = 0

# Initialize three empty lists to collect what the API returns.
test_ls, revealed_targets_ls, sample_prediction_ls = [], [], []

for (test, revealed_targets, sample_prediction) in iter_test:
    # Append copies of the dataframes that the API returns to the lists.
    test_ls.append(test.copy())
    revealed_targets_ls.append(revealed_targets.copy())
    sample_prediction_ls.append(sample_prediction.copy())

    # Write our predictions (here, all predictions are 0).
    sample_prediction["target"] = 0

    # Submit our predictions.
    env.predict(sample_prediction)
    counter += 1

print('\n', '=' * 50, sep="")
print(f"counter: {counter}")

This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.

counter: 165


# test

In [3]:
test_ls[0]

Unnamed: 0,stock_id,date_id,seconds_in_bucket,imbalance_size,imbalance_buy_sell_flag,reference_price,matched_size,far_price,near_price,bid_price,bid_size,ask_price,ask_size,wap,row_id
0,0,478,0,3753451.43,-1,0.999875,11548975.43,,,0.999875,22940.00,1.000050,9177.60,1.0,478_0_0
1,1,478,0,985977.11,-1,1.000245,3850033.97,,,0.999940,1967.90,1.000601,19692.00,1.0,478_0_1
2,2,478,0,599128.74,1,1.000584,4359198.25,,,0.999918,4488.22,1.000636,34955.12,1.0,478_0_2
3,3,478,0,2872317.54,-1,0.999802,27129551.64,,,0.999705,16082.04,1.000189,10314.00,1.0,478_0_3
4,4,478,0,740059.14,-1,0.999886,8880890.78,,,0.999720,19012.35,1.000107,7245.60,1.0,478_0_4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,195,478,0,11075474.70,1,0.999672,11165016.52,,,0.999792,24960.00,1.000153,18310.60,1.0,478_0_195
196,196,478,0,1303523.55,1,0.999953,2280988.82,,,0.999953,4533.30,1.000206,19715.00,1.0,478_0_196
197,197,478,0,3920578.20,-1,1.000605,2969820.70,,,0.999650,5760.15,1.000318,5240.00,1.0,478_0_197
198,198,478,0,5074285.84,-1,1.000136,34865444.24,,,0.999894,74171.94,1.000136,95381.00,1.0,478_0_198


In [4]:
test_ls[1]

Unnamed: 0,stock_id,date_id,seconds_in_bucket,imbalance_size,imbalance_buy_sell_flag,reference_price,matched_size,far_price,near_price,bid_price,bid_size,ask_price,ask_size,wap,row_id
0,0,478,10,3771174.79,-1,1.000050,11550982.93,,,1.000050,17208.00,1.000224,21456.38,1.000127,478_10_0
1,1,478,10,967084.13,-1,0.999991,3868926.94,,,0.999940,1967.90,1.000346,20671.35,0.999975,478_10_1
2,2,478,10,712904.65,1,1.000482,4245422.33,,,1.000123,14833.68,1.000636,54092.56,1.000234,478_10_2
3,3,478,10,2871905.05,-1,0.999898,27131614.05,,,0.999802,15258.80,0.999995,8455.84,0.999926,478_10_3
4,4,478,10,740059.14,-1,0.999775,8880890.78,,,0.999775,36216.00,0.999941,25355.40,0.999873,478_10_4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,195,478,10,10904047.04,1,1.000633,11338441.39,,,1.000513,81844.58,1.000633,166.54,1.000633,478_10_195
196,196,478,10,1248372.35,1,1.000714,2332197.82,,,1.000460,9150.08,1.000714,4142.25,1.000635,478_10_196
197,197,478,10,3896586.63,-1,1.000318,2982392.70,,,1.000318,5240.00,1.001177,31886.56,1.000440,478_10_197
198,198,478,10,5094603.38,-1,1.000377,34855244.00,,,1.000377,64708.80,1.000618,84266.19,1.000482,478_10_198


In [5]:
test_ls[164]

Unnamed: 0,stock_id,date_id,seconds_in_bucket,imbalance_size,imbalance_buy_sell_flag,reference_price,matched_size,far_price,near_price,bid_price,bid_size,ask_price,ask_size,wap,row_id
0,0,480,540,475513.69,-1,0.999193,41686415.27,0.999017,0.999017,0.999193,110123.01,0.999368,283817.38,0.999242,480_540_0
1,1,480,540,43854.51,-1,0.996543,7680424.49,0.996121,0.996490,0.996543,5675.70,0.997122,167909.10,0.996562,480_540_1
2,2,480,540,184125.24,-1,0.998246,11081468.02,0.998040,0.998195,0.998143,5616.14,0.998555,102488.46,0.998165,480_540_2
3,3,480,540,3635484.09,-1,0.999113,73735938.19,0.998338,0.998386,0.999113,12796.18,0.999209,420250.76,0.999115,480_540_3
4,4,480,540,2750278.49,-1,0.998700,37418449.82,0.997493,0.998042,0.998645,102169.32,0.998809,583790.75,0.998669,480_540_4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,195,480,540,2440722.89,-1,1.000317,28280361.74,0.999734,0.999734,1.000317,32257.04,1.000434,319862.40,1.000328,480_540_195
196,196,480,540,349510.47,-1,1.000643,9187699.11,1.000129,1.000386,1.000643,205108.40,1.000900,93393.07,1.000819,480_540_196
197,197,480,540,0.00,0,0.995789,12725436.10,0.995789,0.995789,0.995789,16790.66,0.995883,180038.32,0.995797,480_540_197
198,198,480,540,1000898.84,1,0.999210,94773271.05,0.999210,0.999210,0.998970,125631.72,0.999210,669893.00,0.999008,480_540_198


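The API returns the last 33,000 rows of our training data without the `target` and `time_id` columns. In each iteration it returns 200 rows: 200 `stock_id` values for the same `date_id` and the same `seconds_in_bucket`. The loop ran 165 times (3 dates × 55 time buckets), which gives 165 × 200 = 33,000 rows.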

It serves as `X_test` (the features, not the labels): we will write code such as `model.predict(test)` to make predictions.
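A small sketch of what `model.predict(test)` might look like in practice. The `ZeroModel` class and the `feature_columns` helper are hypothetical stand-ins, and the choice of which columns to exclude is an assumption; the only facts taken from the outputs above are that `test` lacks the `target` and `time_id` columns that `train` has, and that `row_id` is a string identifier.

```python
import numpy as np
import pandas as pd

# Assumption: identifier and label columns are not used as features.
NON_FEATURES = {"row_id", "time_id", "target"}


def feature_columns(train_cols, test_cols):
    """Keep columns present in both frames, minus ids and the label."""
    test_cols = set(test_cols)
    return [c for c in train_cols if c in test_cols and c not in NON_FEATURES]


class ZeroModel:
    """Hypothetical stand-in for a trained model: predicts 0 for every row."""

    def predict(self, X):
        return np.zeros(len(X))


train_cols = ["stock_id", "date_id", "seconds_in_bucket", "wap",
              "target", "time_id", "row_id"]
test = pd.DataFrame({
    "stock_id": [0, 1],
    "date_id": [478, 478],
    "seconds_in_bucket": [0, 0],
    "wap": [1.0, 1.0],
    "row_id": ["478_0_0", "478_0_1"],
})

features = feature_columns(train_cols, test.columns)
preds = ZeroModel().predict(test[features])
print(features)
print(len(preds))
```

Selecting the features by name keeps the column order identical between training and inference, which matters for most tabular models.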

Note that although the API returns a subset of our training data here, it will provide out-of-sample data when we submit our notebook for scoring.

# revealed_targets

In [6]:
revealed_targets_ls[0]

Unnamed: 0,stock_id,date_id,seconds_in_bucket,revealed_target,revealed_date_id,revealed_time_id
0,0,478,0,-2.310276,477,26235
1,1,478,0,-12.850165,477,26235
2,2,478,0,-0.439882,477,26235
3,3,478,0,7.259846,477,26235
4,4,478,0,4.780293,477,26235
...,...,...,...,...,...,...
10995,195,478,540,-3.190041,477,26289
10996,196,478,540,-6.200075,477,26289
10997,197,478,540,0.0,477,26289
10998,198,478,540,1.300573,477,26289


In [7]:
revealed_targets_ls[1]

Unnamed: 0,stock_id,date_id,seconds_in_bucket,revealed_target,revealed_date_id,revealed_time_id
0,,478.0,10.0,,,


In [8]:
revealed_targets_ls[164]

Unnamed: 0,stock_id,date_id,seconds_in_bucket,revealed_target,revealed_date_id,revealed_time_id
0,,480.0,540.0,,,


In [9]:
(train
    .query('date_id == 477')
    .loc[:, ['stock_id', 'date_id', 'seconds_in_bucket', 'target', 'time_id']]
)

Unnamed: 0,stock_id,date_id,seconds_in_bucket,target,time_id
5193980,0,477,0,-2.310276,26235
5193981,1,477,0,-12.850165,26235
5193982,2,477,0,-0.439882,26235
5193983,3,477,0,7.259846,26235
5193984,4,477,0,4.780292,26235
...,...,...,...,...,...
5204975,195,477,540,-3.190041,26289
5204976,196,477,540,-6.200075,26289
5204977,197,477,540,0.000000,26289
5204978,198,477,540,1.300573,26289


According to the [Dataset Description](https://www.kaggle.com/competitions/optiver-trading-at-the-close/data),

> for `revealed_targets`, The first time_id for each date in this file provides the true target values for the entire previous date. All other rows contain mostly null values.

From the structure of `revealed_targets`, we guess that its purpose is to provide the true **target** values of out-of-sample trading days.

Currently, it only provides **target** values for a subset of the training dataset (date_id=477).
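The visual comparison between `revealed_targets_ls[0]` and `train` at `date_id == 477` can also be done programmatically. The sketch below uses a few hand-typed rows copied from the outputs above rather than the real frames, so it runs on its own; the join keys are our reading of the column names. A small tolerance is used because the printed values can differ in the last digit (4.780293 vs 4.780292).

```python
import pandas as pd

# A few rows copied from the outputs above (not the full frames).
train_477 = pd.DataFrame({
    "stock_id": [0, 1, 2, 3, 4],
    "date_id": [477] * 5,
    "seconds_in_bucket": [0] * 5,
    "target": [-2.310276, -12.850165, -0.439882, 7.259846, 4.780292],
})
revealed = pd.DataFrame({
    "stock_id": [0, 1, 2, 3, 4],
    "date_id": [478] * 5,
    "seconds_in_bucket": [0] * 5,
    "revealed_target": [-2.310276, -12.850165, -0.439882, 7.259846, 4.780293],
    "revealed_date_id": [477] * 5,
})

# Match each revealed row to the training row of the revealed date.
merged = revealed.merge(
    train_477.rename(columns={"date_id": "revealed_date_id"}),
    on=["stock_id", "revealed_date_id", "seconds_in_bucket"],
    how="left",
)
max_abs_diff = (merged["revealed_target"] - merged["target"]).abs().max()
print(max_abs_diff < 1e-5)  # True: the values agree up to rounding
```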

As the competition continues, will it provide **target** values beyond the current training data?

Could we combine `revealed_targets` with `test` to generate **new training data**?

And could we then continuously improve our model with this new training data over the API iterations?
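One possible way to explore these questions is sketched below. It assumes that the revealed targets always refer to test batches served earlier in the same run, which the current API output suggests but does not guarantee for the hidden data. The helper `build_new_training_rows` is hypothetical, and whether retraining inside the loop fits the runtime limits is a separate question.

```python
import pandas as pd

KEYS = ["stock_id", "date_id", "seconds_in_bucket"]


def build_new_training_rows(cached_tests, revealed_targets):
    """Join previously served test batches with their revealed targets."""
    # Most batches carry only null targets; keep the rows that have a value.
    labels = revealed_targets.dropna(subset=["revealed_target"])
    if labels.empty or not cached_tests:
        return pd.DataFrame()
    labels = labels[["stock_id", "revealed_date_id",
                     "seconds_in_bucket", "revealed_target"]]
    labels = labels.rename(columns={"revealed_date_id": "date_id",
                                    "revealed_target": "target"})
    # Keys can arrive as floats when the frame contained nulls.
    labels = labels.astype({k: "int64" for k in KEYS})
    history = pd.concat(cached_tests, ignore_index=True)
    return history.merge(labels, on=KEYS, how="inner")


# Toy data: two cached batches from date 477, targets revealed on date 478.
cached_tests = [
    pd.DataFrame({"stock_id": [0, 1], "date_id": [477, 477],
                  "seconds_in_bucket": [0, 0], "wap": [1.0, 1.0]}),
    pd.DataFrame({"stock_id": [0, 1], "date_id": [477, 477],
                  "seconds_in_bucket": [10, 10], "wap": [1.0001, 0.9999]}),
]
revealed_full = pd.DataFrame({
    "stock_id": [0, 1, 0, 1],
    "date_id": [478] * 4,
    "seconds_in_bucket": [0, 0, 10, 10],
    "revealed_target": [-2.3, -12.9, 0.4, 7.3],
    "revealed_date_id": [477] * 4,
})
nan = float("nan")
revealed_null = pd.DataFrame({
    "stock_id": [nan], "date_id": [478.0], "seconds_in_bucket": [10.0],
    "revealed_target": [nan], "revealed_date_id": [nan],
})

new_rows = build_new_training_rows(cached_tests, revealed_full)
no_rows = build_new_training_rows(cached_tests, revealed_null)
print(len(new_rows), len(no_rows))
```

In the prediction loop, each `test` batch would be appended to `cached_tests`, and the helper would be called whenever `revealed_targets` contains non-null values.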

# sample_prediction

In [10]:
sample_prediction_ls[0]

Unnamed: 0,row_id,target
0,478_0_0,1.0
1,478_0_1,1.0
2,478_0_2,1.0
3,478_0_3,1.0
4,478_0_4,1.0
...,...,...
195,478_0_195,1.0
196,478_0_196,1.0
197,478_0_197,1.0
198,478_0_198,1.0


In [11]:
sample_prediction_ls[1]

Unnamed: 0,row_id,target
0,478_10_0,1.0
1,478_10_1,1.0
2,478_10_2,1.0
3,478_10_3,1.0
4,478_10_4,1.0
...,...,...
195,478_10_195,1.0
196,478_10_196,1.0
197,478_10_197,1.0
198,478_10_198,1.0


In [12]:
sample_prediction_ls[164]

Unnamed: 0,row_id,target
0,480_540_0,1.0
1,480_540_1,1.0
2,480_540_2,1.0
3,480_540_3,1.0
4,480_540_4,1.0
...,...,...
195,480_540_195,1.0
196,480_540_196,1.0
197,480_540_197,1.0
198,480_540_198,1.0


It makes a default prediction of 1 for every `row_id`.

You can, of course, overwrite it with your model's predictions, using code like `sample_prediction['target'] = model.predict(test)`.
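Before handing the frame to `env.predict`, it can be worth checking the predictions. The sketch below is one cautious way to do this; `fill_predictions` is a hypothetical helper, and replacing non-finite values with 0.0 is our own choice, not something the API requires.

```python
import numpy as np
import pandas as pd


def fill_predictions(sample_prediction, raw_preds, fallback=0.0):
    """Return a copy of sample_prediction with checked predictions written in."""
    preds = np.asarray(raw_preds, dtype=float)
    if preds.shape != (len(sample_prediction),):
        raise ValueError(
            f"expected {len(sample_prediction)} predictions, got shape {preds.shape}"
        )
    # Replace NaN and +/-inf with the fallback value.
    preds = np.where(np.isfinite(preds), preds, fallback)
    out = sample_prediction.copy()
    out["target"] = preds
    return out


sample_prediction = pd.DataFrame({
    "row_id": ["478_0_0", "478_0_1", "478_0_2"],
    "target": [1.0, 1.0, 1.0],
})
checked = fill_predictions(sample_prediction, [0.5, float("nan"), float("inf")])
print(checked["target"].tolist())  # [0.5, 0.0, 0.0]
```

The length check relies on `model.predict(test)` returning predictions in the same row order as `test`, which in the outputs above matches the `row_id` order of `sample_prediction`.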