# XGBoost and LGBM Inference Notebook

This notebook shows how to load pretrained models from a training notebook and test the submission pipeline with the provided `public_timeseries_testing_util.py`. With that utility, the `make_env` function can be called multiple times without restarting the kernel. I use this notebook to submit an ensemble of my tuned models: one XGBoost model and up to three LightGBM models trained with different random seeds.

## Load the pretrained models

See [this notebook](https://www.kaggle.com/code/moritzm00/optiver-xgb-tuning-w-ray-tune) for training and hyperparameter tuning.

In [1]:
from xgboost import XGBRegressor
xgb_model = XGBRegressor()
xgb_model.load_model("/kaggle/input/optiver-xgb-tuning-w-ray-tune/model.json")
xgb_model

The LightGBM models come from [this training notebook](https://www.kaggle.com/code/moritzm00/optiver-lgbm-tuning-w-optuna).

In [2]:
import pickle
import numpy as np

lgbm_models = []
n_models = 1 # up to 3

for i in range(n_models):
    with open(f"/kaggle/input/optiver-lgbm-tuning-w-optuna/model-{i}.pkl", "rb") as f:
        lgbm_models.append(pickle.load(f))

In [3]:
def make_ensemble_pred(X):
    preds = [xgb_model.predict(X)]
    for lgbm_model in lgbm_models:
        # The scikit-learn wrapper's keyword is `num_iteration` (singular)
        preds.append(lgbm_model.predict(X, num_iteration=lgbm_model.best_iteration_))
    return np.mean(preds, axis=0)
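
The averaging step can be checked in isolation. The following is a minimal sketch, not the tuned models themselves: `ConstantModel` and `mean_ensemble` are hypothetical stand-ins that I introduce only to show how `np.mean(preds, axis=0)` combines one prediction vector per model into one value per row.

```python
import numpy as np


class ConstantModel:
    """Hypothetical stand-in for a fitted regressor; always predicts `value`."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value, dtype=float)


def mean_ensemble(models, X):
    # One prediction vector per model, averaged across models (axis 0)
    preds = [model.predict(X) for model in models]
    return np.mean(preds, axis=0)


X_demo = np.zeros((4, 2))  # 4 rows, 2 dummy features
demo_models = [ConstantModel(1.0), ConstantModel(2.0), ConstantModel(6.0)]
demo_pred = mean_ensemble(demo_models, X_demo)
print(demo_pred)  # one averaged value per row
```

Every model gets the same weight here. With one XGBoost model and three LightGBM models, the LightGBM models together would therefore contribute three quarters of the average.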

In [4]:
!cp /kaggle/input/optiver-lgbm-tuning-w-optuna/featurize.py .

In [5]:
from featurize import featurize

## Finish the MockApi class implementation

See this file: `/kaggle/input/optiver-trading-at-the-close/public_timeseries_testing_util.py`

We have to define the first three variables in `MockApi.__init__`, as explained in the docstring. With this, we can rerun the API multiple times without having to restart the kernel!

In [6]:
'''
An unlocked version of the timeseries API intended for testing alternate inputs.
Mirrors the production timeseries API in the crucial respects, but won't be as fast.

ONLY works after the first three variables in MockApi.__init__ are populated.
'''

from typing import Sequence, Tuple

import pandas as pd


class MockApi:
    def __init__(self):
        '''
        YOU MUST UPDATE THE FIRST THREE LINES of this method.
        They've been intentionally left in an invalid state.

        Variables to set:
            input_paths: a list of two or more paths to the csv files to be served
            group_id_column: the column that identifies which groups of rows the API should serve.
                A call to iter_test serves all rows of all dataframes with the current group ID value.
            export_group_id_column: if true, the dataframes iter_test serves will include the group_id_column values.
        '''
        self.input_paths: Sequence[str] = [
            "/kaggle/input/optiver-trading-at-the-close/example_test_files/test.csv",
            "/kaggle/input/optiver-trading-at-the-close/example_test_files/revealed_targets.csv",
            "/kaggle/input/optiver-trading-at-the-close/example_test_files/sample_submission.csv"
        ]
        self.group_id_column: str = "time_id"
        self.export_group_id_column: bool = False
        # iter_test is only designed to support at least two dataframes, such as test and sample_submission
        assert len(self.input_paths) >= 2

        self._status = 'initialized'
        self.predictions = []

    def iter_test(self) -> Tuple[pd.DataFrame]:
        '''
        Loads all of the dataframes specified in self.input_paths,
        then yields all rows in those dataframes that equal the current self.group_id_column value.
        '''
        if self._status != 'initialized':
            raise Exception('WARNING: the real API can only iterate over `iter_test()` once.')

        dataframes = []
        for pth in self.input_paths:
            dataframes.append(pd.read_csv(pth, low_memory=False))
        group_order = dataframes[0][self.group_id_column].drop_duplicates().tolist()
        dataframes = [df.set_index(self.group_id_column) for df in dataframes]

        for group_id in group_order:
            self._status = 'prediction_needed'
            current_data = []
            for df in dataframes:
                cur_df = df.loc[group_id].copy()
                # returning single line dataframes from df.loc requires special handling
                if not isinstance(cur_df, pd.DataFrame):
                    cur_df = pd.DataFrame({a: b for a, b in zip(cur_df.index.values, cur_df.values)}, index=[group_id])
                    cur_df.index.name = self.group_id_column
                cur_df = cur_df.reset_index(drop=not(self.export_group_id_column))
                current_data.append(cur_df)
            yield tuple(current_data)

            while self._status != 'prediction_received':
                print('You must call `predict()` successfully before you can continue with `iter_test()`', flush=True)
                yield None
                
        with open('submission_mocked.csv', 'w') as f_open:
            pd.concat(self.predictions).to_csv(f_open, index=False)
        self._status = 'finished'

    def predict(self, user_predictions: pd.DataFrame):
        '''
        Accepts and stores the user's predictions and unlocks iter_test once that is done
        '''
        if self._status == 'finished':
            raise Exception('You have already made predictions for the full test set.')
        if self._status != 'prediction_needed':
            raise Exception('You must get the next test sample from `iter_test()` first.')
        if not isinstance(user_predictions, pd.DataFrame):
            raise Exception('You must provide a DataFrame.')

        self.predictions.append(user_predictions)
        self._status = 'prediction_received'


def make_env():
    return MockApi()
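
The "special handling" comment in `iter_test` refers to a pandas behavior that is easy to miss: on a non-unique index, `df.loc[label]` returns a DataFrame when the label occurs several times but a Series when it occurs only once. The sketch below reproduces this on a tiny made-up frame. The data and the helper name `as_frame` are assumptions for illustration; the conversion itself mirrors what the class does.

```python
import pandas as pd

# Toy data: time_id 0 has two rows, time_id 1 has a single row
test_df = pd.DataFrame({
    "time_id": [0, 0, 1],
    "stock_id": [0, 1, 0],
    "bid_size": [10.0, 20.0, 30.0],
})
indexed = test_df.set_index("time_id")

two_rows = indexed.loc[0]  # label occurs twice: a DataFrame
one_row = indexed.loc[1]   # label occurs once: a Series


def as_frame(selection, group_id, index_name="time_id"):
    """Hypothetical helper: always return a DataFrame for one group."""
    if isinstance(selection, pd.DataFrame):
        return selection.copy()
    # A Series holds one row: its index contains the column names
    frame = pd.DataFrame(
        dict(zip(selection.index.values, selection.values)),
        index=[group_id],
    )
    frame.index.name = index_name
    return frame


single = as_frame(one_row, group_id=1)
print(type(two_rows).__name__, type(one_row).__name__, single.shape)
```

Note that the single row passes through a Series, so mixed integer and float columns come back with a common dtype.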

## Iterate over the test data

In [7]:
mock_api = False  # set to False to submit this notebook, True to test locally with MockApi

In [8]:
if mock_api:
    env = make_env()
    iter_test = env.iter_test()
else:
    import optiver2023
    env = optiver2023.make_env()
    iter_test = env.iter_test()

In [9]:
def zero_sum(prices, volumes):
    
    # credits: https://github.com/gotoConversion/goto_conversion/
    
    std_error = np.sqrt(volumes)
    step = np.sum(prices)/np.sum(std_error)
    out = prices-std_error*step
    
    return out
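
To see what `zero_sum` does, here is a small worked example with made-up numbers (the arrays are assumptions, not competition data). The function subtracts from each value a share of the total that is proportional to the square root of its volume, so the adjusted values sum to zero:

$$\sum_i \left(p_i - \sqrt{v_i}\,\frac{\sum_j p_j}{\sum_j \sqrt{v_j}}\right) = \sum_i p_i - \sum_j p_j = 0$$

```python
import numpy as np


def zero_sum(prices, volumes):
    # Same computation as the cell above, repeated so this block is self-contained
    std_error = np.sqrt(volumes)
    step = np.sum(prices) / np.sum(std_error)
    return prices - std_error * step


prices = np.array([1.0, -0.5, 2.5])   # sum is 3.0
volumes = np.array([4.0, 9.0, 1.0])   # square roots are 2, 3, 1 (sum 6)

adjusted = zero_sum(prices, volumes)  # step = 3.0 / 6 = 0.5
print(adjusted)        # values 0.0, -2.0, 2.0
print(adjusted.sum())  # 0.0 up to floating point error
```

In the prediction loop, the commented-out call would pass the predicted targets as `prices` and `bid_size + ask_size` as `volumes`.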

In [10]:
counter = 0
for (test, revealed_targets, sample_prediction) in iter_test:
    X_test = featurize(test)
    sample_prediction['target'] = make_ensemble_pred(X_test)
    
    #sample_prediction['target'] = zero_sum(
    #    sample_prediction['target'],
    #    test.loc[:,'bid_size'] + test.loc[:,'ask_size']
    #)
    env.predict(sample_prediction)
    counter += 1

This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.


In [11]:
submission = pd.read_csv(f"/kaggle/working/submission{'_mocked' if mock_api else ''}.csv")
submission.head()

|   | row_id | target |
|---|---------|-----------|
| 0 | 478_0_0 | -0.871197 |
| 1 | 478_0_1 | 2.336029 |
| 2 | 478_0_2 | 3.434151 |
| 3 | 478_0_3 | -0.874021 |
| 4 | 478_0_4 | -1.395351 |
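
Before submitting, a quick sanity check on the file can catch obvious problems. This is only a sketch: `check_submission` is a hypothetical helper, and the checks (both columns present, unique `row_id`, finite `target`) are my assumptions about what a sensible submission looks like, not official competition rules.

```python
import numpy as np
import pandas as pd


def check_submission(df):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    for col in ("row_id", "target"):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    if problems:
        return problems  # the remaining checks need both columns
    if df["row_id"].duplicated().any():
        problems.append("duplicate row_id values")
    if not np.isfinite(df["target"].to_numpy(dtype=float)).all():
        problems.append("non-finite target values")
    return problems


# Two rows copied from the output above, plus a deliberately broken variant
demo = pd.DataFrame({
    "row_id": ["478_0_0", "478_0_1"],
    "target": [-0.871197, 2.336029],
})
broken = demo.assign(target=[np.nan, 1.0])

print(check_submission(demo))
print(check_submission(broken))
```

In the notebook, the same helper could be called on the `submission` DataFrame loaded above.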


### If you found this notebook helpful or insightful, I would sincerely appreciate your support through an upvote.