### Memory Optimised Baseline Model (XGB) - Jane Street ###
In this notebook, we build an **XGBoost** model for forecasting on the Jane Street dataset while keeping our memory footprint as low as possible.  

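The optimisations used here (lazy Polars queries, selecting only the columns we need, explicit garbage collection, and chunked prediction) help us stay within Kaggle’s memory limits while still training on a large slice of the data.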


#### Baseline notebooks:
- Preprocessing: https://www.kaggle.com/code/motono0223/js24-preprocessing-create-lags
- Training (code only): this notebook, https://www.kaggle.com/code/motono0223/js24-train-gbdt-model-with-lags-singlemodel
- Trained model: https://www.kaggle.com/datasets/motono0223/js24-trained-gbdt-model
- Inference: https://www.kaggle.com/code/motono0223/js24-inference-gbdt-with-lags-singlemodel
- https://www.kaggle.com/code/zoutain/xgb-baseline-from-train-to-submit
- https://www.kaggle.com/code/regisvargas/jane-street-a-beginner-s-notebook

In [1]:
import pandas as pd
import polars as pl
import numpy as np
import os
from tqdm.auto import tqdm
import pickle
import gc 
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = None


In [2]:
class CONFIG:
    seed = 42
    target_col = "responder_6"
    feature_cols = ["symbol_id", "time_id"] \
        + [f"feature_{idx:02d}" for idx in range(79)] \
        + [f"responder_{idx}_lag_1" for idx in range(9)]
    categorical_cols = []
    lag_cols_original = ["date_id", "symbol_id"] + [f"responder_{idx}" for idx in range(9)]
    lag_cols_rename = { f"responder_{idx}" : f"responder_{idx}_lag_1" for idx in range(9)}
    valid_ratio = 0.05
    start_dt = 1100

## **Loading & Lazy Scanning**  
- **Polars** `scan_parquet` is used instead of `read_parquet` to enable lazy operations.  
- Only when `.collect()` is called does the data actually load into memory.  
- We also cast some columns to more memory-efficient types (like `Int32` in Polars).  

In [3]:
train = (
    pl.scan_parquet("/kaggle/input/jane-street-real-time-market-data-forecasting/train.parquet")
    .select(pl.int_range(pl.len(), dtype=pl.UInt32).alias("id"), pl.all())  # add a row id
    .with_columns((pl.col(CONFIG.target_col) * 2).cast(pl.Int32).alias("label"))
    .filter(pl.col("date_id").gt(CONFIG.start_dt))  # keep only dates after start_dt
)

In [4]:
lags = train.select(pl.col(CONFIG.lag_cols_original))
lags = lags.rename(CONFIG.lag_cols_rename)
lags = lags.with_columns(date_id=pl.col("date_id") + 1)  # shift by one day: day d values join onto day d + 1
lags = lags.group_by(["date_id", "symbol_id"], maintain_order=True).last()  # pick up last record of previous date

## **Join Lagged Data**  
We now join the lagged features back into our main training dataset. Since we’re using Polars lazy mode, these transformations won’t immediately materialize in memory.

In [5]:
train = train.join(lags, on=["date_id", "symbol_id"],  how="left")
train

## **Train/Validation Split**  
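- We compute the total number of rows and use `valid_ratio` to locate a cutoff row; the `date_id` at that row becomes the last training date, so the split never cuts through a day.  
- **Note**: Finding the cutoff requires a small `.collect()` of the `date_id` column only. The two filtered splits themselves stay lazy until we call `.collect()` on them.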

In [6]:
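# `train` is already a LazyFrame, so .lazy() is a no-op kept for clarity
train_lazy = train.lazy()

# Collect the date_id column once and reuse it for the row count and the cutoff date
date_ids = train_lazy.select(pl.col("date_id")).collect().get_column("date_id")
len_train = date_ids.len()
valid_records = int(len_train * CONFIG.valid_ratio)
len_ofl_mdl = len_train - valid_records

# date_id at the cutoff row; all rows up to and including this date go to training
last_tr_dt = date_ids[len_ofl_mdl]
del date_ids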

# Now filter with lazy queries, 
# and select only the columns we need for training and validation at once
training_data_lazy = (
    train_lazy
    .filter(pl.col("date_id") <= last_tr_dt)
    .select(CONFIG.feature_cols + [CONFIG.target_col, "weight"])
)

validation_data_lazy = (
    train_lazy
    .filter(pl.col("date_id") > last_tr_dt)
    .select(CONFIG.feature_cols + [CONFIG.target_col, "weight"])
)
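
The cutoff logic can be sketched with NumPy alone. The array below is a made-up, sorted `date_id` column (10 days with 4 rows each) and the ratio is chosen for illustration; it shows that the validation set can end up slightly smaller than `valid_ratio` suggests, because the whole cutoff day goes to training.

```python
import numpy as np

# Toy date_id column, sorted ascending (illustrative values).
date_ids = np.repeat(np.arange(1101, 1111), 4)  # 10 days, 4 rows each -> 40 rows
valid_ratio = 0.25

n_rows = date_ids.shape[0]
n_valid = int(n_rows * valid_ratio)     # rows requested for validation
cutoff_idx = n_rows - n_valid           # index of the cutoff row
last_train_date = date_ids[cutoff_idx]  # date_id found at that row

# Splitting on the date keeps each day entirely in one split.
train_mask = date_ids <= last_train_date
valid_mask = date_ids > last_train_date

print(last_train_date, train_mask.sum(), valid_mask.sum())
```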

Here, we trigger the actual load into memory with `.collect()`. In the cells that follow, we also drop references we no longer need and call the garbage collector to free memory.

In [7]:
# Collect once for each split
training_data = training_data_lazy.collect()
validation_data = validation_data_lazy.collect()

In [8]:
def get_model(seed):
    # XGBoost parameters
    XGB_Params = {
        'learning_rate': 0.1,
        'max_depth': 6,
        'n_estimators': 200,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'reg_alpha': 1,
        'reg_lambda': 5,
        'random_state': seed,
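        'tree_method': 'hist',  # with device='cuda', 'hist' runs on the GPU (XGBoost >= 2.0; 'gpu_hist' is deprecated)
        'device': 'cuda',
        # 'n_gpus' was dropped here: it is no longer an XGBoost parameter
        'verbosity': 1,  # constructor-level logging; 'verbose' is an argument of fit(), not of the constructor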
    }
    
    XGB_Model = XGBRegressor(**XGB_Params)
    return XGB_Model

In [9]:
del train
gc.collect()


0

## **Prepare Numpy Arrays for Model**  
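We convert the Polars dataframes to NumPy arrays, keeping only the feature, target, and weight columns that the model needs. Note that the arrays are generally copies of the data, so any memory saving only materializes once the source dataframes are no longer referenced.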

In [10]:
X_train = training_data.select(CONFIG.feature_cols).to_numpy()
y_train = training_data.select(CONFIG.target_col).to_numpy().ravel()
w_train = training_data.select("weight").to_numpy().ravel()

X_valid = validation_data.select(CONFIG.feature_cols).to_numpy()
y_valid = validation_data.select(CONFIG.target_col).to_numpy().ravel()
w_valid = validation_data.select("weight").to_numpy().ravel()
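
A quick way to check what the arrays actually cost is NumPy's `nbytes` attribute. The sketch below uses a zero-filled stand-in with 90 columns (the length of `CONFIG.feature_cols`); the row count and the `size_in_mb` helper are hypothetical. Whether the real `X_train` comes out as `float32` or `float64` depends on the column dtypes in the source data, so the comparison here is only illustrative.

```python
import numpy as np


def size_in_mb(array: np.ndarray) -> float:
    """Return the size of the array's data buffer in mebibytes."""
    return array.nbytes / 1024**2


# Stand-in for X_train: 1,000 rows x 90 features (shape chosen for illustration).
X_demo64 = np.zeros((1000, 90), dtype=np.float64)
X_demo32 = X_demo64.astype(np.float32)  # half the bytes per value

print(f"float64: {size_in_mb(X_demo64):.3f} MB, float32: {size_in_mb(X_demo32):.3f} MB")
```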

In [11]:
gc.collect()

0

In [12]:
%%time
model = get_model(CONFIG.seed)
model.fit( X_train, y_train, sample_weight=w_train)

CPU times: user 11min 20s, sys: 8.18 s, total: 11min 28s
Wall time: 4min 37s


## **Chunked Predictions**  
When working with large datasets, predicting all at once can cause memory spikes. To address this, we split the dataset into halves (or other manageable chunks) for inference.

In [13]:
# Assuming X_train, y_train, w_train are all NumPy arrays
half = X_train.shape[0] // 2

# Predict on the first half of X_train
y_pred_train1 = model.predict(X_train[:half])

# Predict on the second half of X_train
y_pred_train2 = model.predict(X_train[half:])

# Concatenate predictions
y_pred_train = np.concatenate([y_pred_train1, y_pred_train2], axis=0)

# Compute R² score with sample_weight
train_score = r2_score(y_train, y_pred_train, sample_weight=w_train)
train_score

0.05118821176711785
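
The two-halves approach generalizes to any chunk size. Below is a minimal sketch of such a helper; `predict_in_chunks` and `MeanModel` are hypothetical names, and the stand-in model only exists so the example runs without XGBoost. It assumes the input has at least one row.

```python
import numpy as np


def predict_in_chunks(model, X: np.ndarray, chunk_size: int) -> np.ndarray:
    """Predict on X in row slices of at most chunk_size and concatenate the results."""
    parts = [
        model.predict(X[start:start + chunk_size])
        for start in range(0, X.shape[0], chunk_size)
    ]
    return np.concatenate(parts, axis=0)


class MeanModel:
    """Stand-in model for the demo: predicts the mean of each row."""

    def predict(self, X: np.ndarray) -> np.ndarray:
        return X.mean(axis=1)


X_demo = np.arange(20, dtype=np.float32).reshape(10, 2)
chunked = predict_in_chunks(MeanModel(), X_demo, chunk_size=3)

# Chunked predictions should match predicting on the full array at once.
print(np.allclose(chunked, MeanModel().predict(X_demo)))
```

With the trained model, a call such as `predict_in_chunks(model, X_train, chunk_size=1_000_000)` would play the same role as the two manual slices; the chunk size is a tuning knob, not a value taken from this notebook.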

In [14]:
# R² on the validation set (unweighted here; pass sample_weight=w_valid to weight it like the training score)
y_pred_valid = model.predict(X_valid)
r2_valid = r2_score(y_valid, y_pred_valid)
print("R² score on the validation set:", r2_valid)

R² score on the validation set: 0.004936436333746519
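
For reference, the weighted R² that `r2_score` computes when given `sample_weight` can be written out with NumPy as below. This is a sketch of that standard definition with made-up numbers; `weighted_r2` is a hypothetical helper, and it is not claimed to be the competition's official metric.

```python
import numpy as np


def weighted_r2(y_true, y_pred, weights) -> float:
    """Weighted R²: 1 - sum(w * residual²) / sum(w * (y - weighted mean of y)²)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    y_bar = np.average(y_true, weights=weights)
    ss_res = np.sum(weights * (y_true - y_pred) ** 2)
    ss_tot = np.sum(weights * (y_true - y_bar) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Illustrative values only.
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
p_demo = np.array([1.1, 1.9, 3.2, 3.8])
w_demo = np.array([1.0, 1.0, 2.0, 2.0])

print(weighted_r2(y_demo, p_demo, w_demo))
```

A perfect prediction scores 1, and predicting the weighted mean of the target for every row scores 0, which is a useful sanity check when comparing weighted and unweighted scores.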


In [15]:
!mkdir -p /kaggle/working/jane-street/

## **Exporting the Model**  
Finally, we save the model as a pickle file for later inference.

In [16]:
import pickle

pkl_path = '/kaggle/working/jane-street/xgb_boost.pkl'
 
with open(pkl_path, 'wb') as file:  
    pickle.dump(model, file)
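
Loading the file back in an inference notebook mirrors the save step. The sketch below round-trips a plain dict through a temporary file so that it runs without XGBoost; `model_stub` and the file name are placeholders, not artifacts of this notebook.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the trained model so the sketch runs without XGBoost.
model_stub = {"name": "xgb_stub", "n_estimators": 200}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "model_stub.pkl")

    with open(demo_path, "wb") as file:
        pickle.dump(model_stub, file)

    # In an inference notebook, the same pattern would load the real model file.
    with open(demo_path, "rb") as file:
        restored = pickle.load(file)

print(restored == model_stub)
```

Only unpickle files you trust, and keep the library versions in the inference environment compatible with those used for training, since a pickled estimator depends on the code that created it.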
