# Optiver Trading At The Close

[Competition](https://www.kaggle.com/competitions/optiver-trading-at-the-close)

[GitHub Repository](https://github.com/cvaisnor/DNN-Final-Project)

[Google Slides Presentation](https://docs.google.com/presentation/d/1Xc5F1_NveFi1il3GqHej2aqVmklR_jZU4kx6ZJhiDEM/edit?usp=sharing)

# General Imports

In [1]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

kaggle_environment = False # True if running on Kaggle, don't forget to add the dataset!

if kaggle_environment:
    data_path = '/kaggle/input/'
else:
    data_path = 'kaggle/input/'

# suppress all warnings (including PkgResourcesDeprecationWarning)
import warnings
warnings.filterwarnings('ignore')

# Setting File Structure

In [2]:
for dirname, _, filenames in os.walk(data_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

kaggle/input/optiver-trading-at-the-close/public_timeseries_testing_util.py
kaggle/input/optiver-trading-at-the-close/.gitkeep
kaggle/input/optiver-trading-at-the-close/train.csv
kaggle/input/optiver-trading-at-the-close/optiver2023/__init__.py
kaggle/input/optiver-trading-at-the-close/optiver2023/competition.cpython-310-x86_64-linux-gnu.so
kaggle/input/optiver-trading-at-the-close/optiver2023/__pycache__/__init__.cpython-310.pyc
kaggle/input/optiver-trading-at-the-close/example_test_files/test.csv
kaggle/input/optiver-trading-at-the-close/example_test_files/revealed_targets.csv
kaggle/input/optiver-trading-at-the-close/example_test_files/sample_submission.csv


# Loading Data

In [3]:
train = pd.read_csv(data_path + 'optiver-trading-at-the-close/train.csv')
revealed_targets = pd.read_csv(data_path + 'optiver-trading-at-the-close/example_test_files/revealed_targets.csv')
test = pd.read_csv(data_path + 'optiver-trading-at-the-close/example_test_files/test.csv')
sample_submission = pd.read_csv(data_path + 'optiver-trading-at-the-close/example_test_files/sample_submission.csv')

# Data Descriptions

- stock_id - A unique identifier for the stock. Not all stock IDs exist in every time bucket.
- date_id - A unique identifier for the date. Date IDs are sequential & consistent across all stocks.
- imbalance_size - The amount unmatched at the current reference price (in USD).
- imbalance_buy_sell_flag - An indicator reflecting the direction of auction imbalance.
    - buy-side imbalance; 1
    - sell-side imbalance; -1
    - no imbalance; 0

- reference_price - The price at which paired shares are maximized, the imbalance is minimized and the distance from the bid-ask midpoint is minimized, in that order. Can also be thought of as being equal to the near price bounded between the best bid and ask price.
- matched_size - The amount that can be matched at the current reference price (in USD).
- far_price - The crossing price that will maximize the number of shares matched based on auction interest only. This calculation excludes continuous market orders.
- near_price - The crossing price that will maximize the number of shares matched based on auction and continuous market orders.
- [bid/ask]_price - Price of the most competitive buy/sell level in the non-auction book.
- [bid/ask]_size - The dollar notional amount on the most competitive buy/sell level in the non-auction book.
- wap - The weighted average price in the non-auction book.
- seconds_in_bucket - The number of seconds elapsed since the beginning of the day's closing auction, always starting from 0.
- target - The 60 second future move in the wap of the stock, less the 60 second future move of the synthetic index. Only provided for the train set.
    - The synthetic index is a custom weighted index of Nasdaq-listed stocks constructed by Optiver for this competition.
    - The unit of the target is basis points, which is a common unit of measurement in financial markets. A 1 basis point price move is equivalent to a 0.01% price move.
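    - Where t is the time at the current observation, we can define the target:

      $$\text{Target} = \left(\frac{\text{StockWAP}_{t+60}}{\text{StockWAP}_{t}} - \frac{\text{IndexWAP}_{t+60}}{\text{IndexWAP}_{t}}\right) \times 10000$$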
---
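The target definition above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the stock and index WAP values at times t and t+60 are already available; the function name `target_bps` and the sample values are hypothetical and not part of the competition API.

```python
import numpy as np

def target_bps(stock_wap_t, stock_wap_t60, index_wap_t, index_wap_t60):
    """Relative 60-second stock move minus the index move, in basis points."""
    stock_move = np.asarray(stock_wap_t60, dtype=float) / np.asarray(stock_wap_t, dtype=float)
    index_move = np.asarray(index_wap_t60, dtype=float) / np.asarray(index_wap_t, dtype=float)
    # 1 basis point = 0.01% = 1e-4, so multiply the difference by 10,000
    return (stock_move - index_move) * 10_000

# Toy values: the stock rises 0.10% while the index rises 0.05%,
# so the stock outperforms the index by about 5 basis points.
example_target = target_bps(1.0, 1.001, 1.0, 1.0005)
print(example_target)
```

A positive value means the stock outperformed the synthetic index over the next 60 seconds; a negative value means it underperformed.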

All size related columns are in USD terms.

All price related columns are converted to a price move relative to the stock wap (weighted average price) at the beginning of the auction period.
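Two of the column definitions above can be made concrete with small helpers. This is a sketch under stated assumptions: the WAP formula weights each price by the size on the opposite side of the book, which is the usual definition but is not spelled out in the text above, and the reference-price helper implements only the "near price bounded between the best bid and ask" reading. The function names and sample values are hypothetical.

```python
import numpy as np

def weighted_average_price(bid_price, ask_price, bid_size, ask_size):
    """Size-weighted average of the best bid and ask (assumed formula)."""
    return (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)

def bounded_near_price(near_price, bid_price, ask_price):
    """Near price clipped to the interval [best bid, best ask]."""
    return np.clip(near_price, bid_price, ask_price)

# Toy book: more size on the bid pulls the WAP toward the ask.
wap_example = weighted_average_price(bid_price=0.99, ask_price=1.01,
                                     bid_size=300.0, ask_size=100.0)
# A near price above the best ask is bounded at the ask.
ref_example = bounded_near_price(near_price=1.05, bid_price=0.99, ask_price=1.01)
print(wap_example, ref_example)
```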

- sample_submission - A valid sample submission, delivered by the API. See this notebook for a very simple example of how to use the sample submission.
- revealed_targets - At the first time_id for each date (i.e. when seconds_in_bucket equals zero), the API serves a dataframe providing the true target values for the entire previous date. All other rows contain null values for the columns of interest.
- public_timeseries_testing_util.py - An optional file intended to make it easier to run custom offline API tests. See the script's docstring for details. You will need to edit this file before using it.
- example_test_files/ - Data intended to illustrate how the API functions. Includes the same files and columns delivered by the API. The first three date ids are repeats of the last three date ids in the train set, to enable an illustration of how the API functions.
- optiver2023/ - Files that enable the API. Expect the API to deliver all rows in under five minutes and to reserve less than 0.5 GB of memory. The first three date ids delivered by the API are repeats of the last three date ids in the train set, to better illustrate how the API functions. You must make predictions for those dates in order to advance the API, but those predictions are not scored.

# Feature Engineering Section

In [9]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class CustomIterativeImputer(BaseEstimator, TransformerMixin):
    def __init__(self, columns, max_iter=10, random_state=0):
        self.columns = columns
        self.max_iter = max_iter
        self.random_state = random_state
        self.imputer = None

    def fit(self, X, y=None):
        # Initialize the IterativeImputer
        self.imputer = IterativeImputer(max_iter=self.max_iter, random_state=self.random_state)
        self.imputer.fit(X[self.columns])
        return self

    def transform(self, X):
        # Check that fit has been called
        if self.imputer is None:
            raise RuntimeError("You must call fit before calling transform")

        # Transform a copy with the already-fitted imputer so the caller's frame is not mutated
        X = X.copy()
        X[self.columns] = self.imputer.transform(X[self.columns])
        return X

# Create the pipeline with the custom imputer
pipeline = Pipeline([
    ('iterative_imputer', CustomIterativeImputer(columns=['far_price', 'near_price'])),
    # Add other steps here if necessary
])

In [10]:
# Use the pipeline on the training data
df_processed = pipeline.fit_transform(train)
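To see what the imputation step does in isolation, here is a minimal sketch on a toy frame. It calls scikit-learn's `IterativeImputer` directly rather than the custom wrapper above, and the toy values are made up for illustration; they are not taken from train.csv.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy frame with one gap in each price column
toy = pd.DataFrame({
    "far_price": [1.00, 1.01, np.nan, 1.03, 1.04],
    "near_price": [0.99, np.nan, 1.01, 1.02, 1.03],
})

# Each column with missing values is modelled from the other column
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled.isna().sum())
```

Values that were present in the input are passed through unchanged; only the gaps are estimated.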

In [11]:
# drop all rows with NaNs (500-ish rows after imputation)
df_processed = df_processed.dropna()

In [12]:
def add_datetime_feature(df):
    """
    Add a synthetic datetime feature to a DataFrame.

    Parameters:
    df (pandas.DataFrame): DataFrame with 'date_id' and 'time_id' columns.

    Returns:
    pandas.DataFrame: Copy of df with a 'timestamp' column replacing 'date_id' and 'time_id'.
    """
    df = df.copy()
    start_time = pd.to_timedelta('9:30:00')
    # Map date_id to a calendar date, counting days from an arbitrary 2020-01-01 origin
    df['date'] = pd.to_datetime('2020-01-01') + pd.to_timedelta(df['date_id'], unit='D')
    # Interpret time_id as an offset in seconds
    df['time'] = pd.to_timedelta(df['time_id'], unit='s')
    # Shift the offset by the 9:30 start time
    df['time'] = df['time'] + start_time
    # Combine date and time into a single timestamp
    df['timestamp'] = df['date'] + df['time']
    # Drop the helper columns and the ids now encoded in the timestamp
    df = df.drop(columns=['date', 'time', 'time_id', 'date_id'])
    return df

In [13]:
df_with_datetime = add_datetime_feature(df_processed)
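The timestamp arithmetic used by add_datetime_feature can be checked on a two-row toy frame. This sketch repeats the same pandas operations inline so that it runs on its own; the date_id and time_id values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"date_id": [0, 1], "time_id": [0, 55]})

base_date = pd.to_datetime("2020-01-01")      # same arbitrary origin as above
session_start = pd.to_timedelta("9:30:00")    # same start-time offset as above

toy["timestamp"] = (
    base_date
    + pd.to_timedelta(toy["date_id"], unit="D")   # whole days since the origin
    + pd.to_timedelta(toy["time_id"], unit="s")   # time_id treated as seconds
    + session_start
)
print(toy)
```

The second row, for example, lands on 2020-01-02 at 09:30:55: one day after the origin, plus 55 seconds, plus the start-time offset.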

A single function that applies all feature engineering steps:

In [15]:
def generate_features(test):
    # test is a dataframe with the same columns as train with only 200 rows
    test = test.copy()

    # Add datetime features
    test = add_datetime_feature(test)

    return test

# Model Development

In [16]:
# imports
from autogluon.tabular import TabularDataset, TabularPredictor
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

2023-11-17 18:53:29.761894: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-17 18:53:29.781378: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX512F AVX512_VNNI AVX512_BF16, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [17]:
train_data = TimeSeriesDataFrame.from_data_frame(df_with_datetime, timestamp_column='timestamp', id_column='stock_id')

In [18]:
predictor = TimeSeriesPredictor(prediction_length=60, target='target', eval_metric='MASE', ignore_time_index=True)



In [19]:
predictor.fit(train_data=train_data, presets='fast_training', time_limit=600)

TimeSeriesPredictor.fit() called
Setting presets to: medium_quality
Fitting with arguments:
{'enable_ensemble': True,
 'evaluation_metric': 'MASE',
 'hyperparameter_tune_kwargs': None,
 'hyperparameters': 'medium_quality',
 'prediction_length': 60,
 'random_seed': None,
 'target': 'target',
 'time_limit': 600}
Provided training data set with 5237760 rows, 200 items (item = single time series). Average time series length is 26188.8.
Training artifacts will be saved to: /home/chris/dev/jhu/605.742_deep_neural_networks/final_project/autogluon-m1
AutoGluon will save models to autogluon-m1/
AutoGluon will gauge predictive performance using evaluation metric: 'MASE'
	This metric's sign has been flipped to adhere to being 'higher is better'. The reported score can be multiplied by -1 to get the metric value.

Provided dataset contains following columns:
	target:           'target'
	past covariates:  ['seconds_in_bucket', 'imbalance_size', 'imbalance_buy_sell_flag', 'reference_price', 'matched

<autogluon.timeseries.predictor.TimeSeriesPredictor at 0x7f8db2cf2d40>
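The log above reports MASE with its sign flipped. As a rough illustration of what that metric measures, the sketch below computes a mean absolute scaled error for a single series with NumPy. It assumes a naive one-step baseline (`season=1`) and is not AutoGluon's implementation, which may differ in details such as the seasonal period and how scores are aggregated across items. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    """Forecast MAE divided by the in-sample MAE of a naive lag-`season` forecast."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    forecast_mae = np.mean(np.abs(y_true - y_pred))
    # Error of predicting each training value with the value `season` steps earlier
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return forecast_mae / naive_mae

history = [1.0, 2.0, 4.0, 7.0]   # naive one-step errors: 1, 2, 3 -> MAE 2.0
actual = [8.0, 9.0]
forecast = [7.0, 11.0]           # forecast errors: 1, 2 -> MAE 1.5

score = mase(actual, forecast, history)
flipped = -score                 # "higher is better" convention from the log
print(score, flipped)
```

A value below 1 means the forecast beats the naive baseline on average; multiplying by -1 gives the sign convention mentioned in the log.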

# Generating Predictions

Only for Kaggle:
When uploading the notebook, make sure to add the Optiver data using the panel on the right side of the screen.

In [None]:
if kaggle_environment:
    import optiver2023
    env = optiver2023.make_env()
    iter_test = env.iter_test()

In [None]:
if kaggle_environment:
    # Count how many times the loop runs.
    counter = 0

    # Initialize three empty lists to keep a copy of every batch the API returns.
    test_ls, revealed_targets_ls, sample_prediction_ls = [], [], []

    for (test, revealed_targets, sample_prediction) in iter_test:
        # Append the dataframes that the API returns to the lists.
        test_ls.append(test.copy())
        revealed_targets_ls.append(revealed_targets.copy())
        sample_prediction_ls.append(sample_prediction.copy())

        # Generate features
        test = generate_features(test)

        # Write our predictions
        sample_prediction["target"] = predictor.predict(test)

        # Submit our predictions.
        env.predict(sample_prediction)
        counter += 1

    print('\n', '=' * 50, sep="")
    print(f"counter: {counter}")
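Because the loop above only runs when kaggle_environment is True, it can be handy to rehearse its shape locally. The sketch below is a hypothetical stand-in for the API iterator, not the optiver2023 module or public_timeseries_testing_util.py: it assumes one batch per (date_id, seconds_in_bucket) pair, yields an empty revealed_targets frame as a placeholder, and uses made-up rows and a constant prediction.

```python
import pandas as pd

def mock_iter_test(test_df, group_cols=("date_id", "seconds_in_bucket")):
    """Hypothetical stand-in for env.iter_test(): yields one batch per time bucket."""
    for _, batch in test_df.groupby(list(group_cols), sort=True):
        batch = batch.reset_index(drop=True)
        revealed_targets = pd.DataFrame()  # placeholder only, no real targets
        sample_prediction = pd.DataFrame({"row_id": batch["row_id"], "target": 0.0})
        yield batch, revealed_targets, sample_prediction

# Toy rows: two stocks observed in two consecutive time buckets
toy_test = pd.DataFrame({
    "stock_id": [0, 1, 0, 1],
    "date_id": [478, 478, 478, 478],
    "seconds_in_bucket": [0, 0, 10, 10],
    "row_id": ["478_0_0", "478_0_1", "478_10_0", "478_10_1"],
})

counter = 0
submitted = []
for test, revealed_targets, sample_prediction in mock_iter_test(toy_test):
    sample_prediction["target"] = 0.0    # constant baseline in place of a model
    submitted.append(sample_prediction)  # stands in for env.predict(...)
    counter += 1

print(f"counter: {counter}")
```

Swapping the constant for a real model call is then a local change inside the loop body.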