# Optiver 2023 Trading Competition Notebook 📈📊

## Introduction 🌟
Welcome to this Jupyter notebook developed for the Optiver 2023 Trading Competition! This notebook is designed to help you participate in the competition and make predictions on the provided financial dataset.

### Inspiration and Credits 🙌
This notebook is inspired by the work of Yuanzhe Zhou, available at [this Kaggle project](https://www.kaggle.com/code/yuanzhezhou/baseline-lgb-xgb-and-catboost/). We extend our gratitude to Yuanzhe Zhou for sharing their insights and code.

🌟 Explore my profile and other public projects, and don't forget to share your feedback! 
👉 [Visit my Profile](https://www.kaggle.com/zulqarnainali) 👈

🙏 Thank you for taking the time to review my work, and please give it a thumbs-up if you found it valuable! 👍

## Purpose 🎯
The primary purpose of this notebook is to:
- Load and preprocess the competition data 📁
- Engineer relevant features for model training 🏋️‍♂️
- Train predictive models to make target variable predictions 🧠
- Submit predictions to the competition environment 📤

## Notebook Structure 📚
This notebook is structured as follows:
1. **Data Preparation**: In this section, we load and preprocess the competition data.
2. **Feature Engineering**: We generate and select relevant features for model training.
3. **Model Training**: We train machine learning models on the prepared data.
4. **Prediction and Submission**: We make predictions on the test data and submit them for evaluation.
5. **Conclusion**: We summarize the key findings and results.

## How to Use 🛠️
To use this notebook effectively, please follow these steps:
1. Ensure you have the competition data and environment set up.
2. Execute each cell sequentially to perform data preparation, feature engineering, model training, and prediction submission.
3. Customize and adapt the code as needed to improve model performance or experiment with different approaches.

**Note**: Make sure to replace any placeholder paths or configurations with your specific information.

## Acknowledgments 🙏
We acknowledge the Optiver 2023 Trading Competition organizers for providing the dataset and the competition platform.

Let's get started! Feel free to reach out if you have any questions or need assistance along the way.

## 📚 Import the necessary libraries


In [1]:
# 📦 Import necessary libraries
import pandas as pd         # 🐼 Pandas for data manipulation
import lightgbm as lgb      # 🚥 LightGBM for gradient boosting
import xgboost as xgb       # 🌲 XGBoost for gradient boosting
import catboost as cbt      # 🐱 CatBoost for gradient boosting
import numpy as np          # 🔢 NumPy for numerical operations
import joblib               # 📦 Joblib for model serialization
import os                   # 📂 OS module for file operations
import optiver2023          # 🏁 Competition-provided module for the test-time API


## 📋 Function to Calculate Imbalance Features



1. `def calculate_imbalance_features(df):`:
   - This line defines a Python function called `calculate_imbalance_features` that takes a DataFrame `df` as an input.

2. `df['imb_s1'] = df.eval('(bid_size - ask_size) / (bid_size + ask_size)')`:
   - This line creates a new column in the DataFrame `df` called `'imb_s1'`.
   - The values in this column are calculated using the `eval` method, which evaluates the mathematical expression `(bid_size - ask_size) / (bid_size + ask_size)` for each row in the DataFrame.
   - This expression calculates the bid-ask size imbalance: the difference between the bid (buy) and ask (sell) sizes, divided by their sum. For non-negative sizes the result lies between -1 (only ask size) and 1 (only bid size).


Moving on to the next line:

```python
    # Calculate and add imbalance feature 2 (imb_s2)
    df['imb_s2'] = df.eval('(imbalance_size - matched_size) / (matched_size + imbalance_size)')
```

3. `df['imb_s2'] = df.eval('(imbalance_size - matched_size) / (matched_size + imbalance_size)')`:
   - Similar to the previous line, this line creates a new column in the DataFrame `df` called `'imb_s2'`.
   - It calculates the values for this column using the `eval` method, which evaluates the expression `(imbalance_size - matched_size) / (matched_size + imbalance_size)` for each row in the DataFrame.
   - This expression compares the `imbalance_size` column with the `matched_size` column, again normalized by their sum, so the result is negative whenever `matched_size` is the larger of the two.


The `calculate_imbalance_features` function takes a DataFrame as input, adds two normalized ratios, `imb_s1` (bid size versus ask size) and `imb_s2` (imbalance size versus matched size), and returns the same DataFrame.
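
To make the two ratios concrete, here is a minimal sketch on a toy DataFrame. The numbers are made up for illustration and are not taken from the competition data; only the column names and the two expressions come from the function above.

```python
import numpy as np
import pandas as pd

# Toy order-book snapshot; the values are invented for illustration only.
toy = pd.DataFrame({
    'bid_size': [300.0, 100.0, 50.0],
    'ask_size': [100.0, 100.0, 150.0],
    'imbalance_size': [20.0, 0.0, 80.0],
    'matched_size': [80.0, 50.0, 20.0],
})

# Same expressions as in calculate_imbalance_features
toy['imb_s1'] = toy.eval('(bid_size - ask_size) / (bid_size + ask_size)')
toy['imb_s2'] = toy.eval('(imbalance_size - matched_size) / (matched_size + imbalance_size)')

print(toy[['imb_s1', 'imb_s2']])
```

Row by row, `imb_s1` is 0.5, 0.0 and -0.5, and `imb_s2` is -0.6, -1.0 and 0.6. If both sizes in a row are zero, the denominator is zero and the result is `NaN`.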

In [2]:
def calculate_imbalance_features(df):
    # 📈 Calculate and add imbalance feature 1 (imb_s1)
    df['imb_s1'] = df.eval('(bid_size - ask_size) / (bid_size + ask_size)')  

    # 🔃 Calculate and add imbalance feature 2 (imb_s2)
    df['imb_s2'] = df.eval('(imbalance_size - matched_size) / (matched_size + imbalance_size)') 

    return df


## Function to Calculate Price-Based Features

Explanation:

1. `def calculate_price_features(df, features):`:
   - This line defines a Python function called `calculate_price_features` that takes two parameters: a DataFrame `df` and a list `features`.

2. `prices = ['reference_price', 'far_price', 'near_price', 'ask_price', 'bid_price', 'wap']`:
   - This line creates a list called `prices` containing the names of various price-related columns.

3. `for i, a in enumerate(prices):`:
   - This line starts a loop that iterates over the elements in the `prices` list, and it uses `i` to keep track of the index and `a` to store the name of the first price.

4. `for j, b in enumerate(prices):`:
   - Inside the previous loop, this line starts another loop that iterates over the same `prices` list. It uses `j` to keep track of the index and `b` to store the name of the second price.

5. `if i > j:`:
   - This line checks if the index of the first price (a) is greater than the index of the second price (b). This condition ensures that we only calculate features where the first price comes after the second price in the list.

6. `df[f'{a}_{b}_imb'] = df.eval(f'({a} - {b}) / ({a} + {b})')`:
   - If the condition in the previous line is met, this line calculates a new feature by subtracting the second price (b) from the first price (a) and then dividing the result by the sum of the two prices.
   - The calculated feature is added as a new column in the DataFrame with a name in the format `{a}_{b}_imb`, where `a` and `b` are the names of the prices being compared.

7. `features.append(f'{a}_{b}_imb')`:
   - This line adds the name of the newly created feature to the `features` list, keeping track of all the features that have been calculated.

8. `return df, features`:
   - Finally, the function returns the modified DataFrame (`df`) with the added features and the updated list of features (`features`).

This code is a function that calculates and adds price-related features to a DataFrame based on the differences and ratios between various price columns. It iterates through pairs of price columns and calculates features for pairs where the first price comes after the second price in the list. The function then returns the modified DataFrame and the list of calculated features.
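
The `i > j` condition visits each unordered pair of price columns exactly once. The sketch below reproduces only the feature names (no DataFrame needed) and checks them against `itertools.combinations`, which is an alternative way to enumerate the same pairs, not what the notebook itself uses.

```python
from itertools import combinations

prices = ['reference_price', 'far_price', 'near_price', 'ask_price', 'bid_price', 'wap']

# Reproduce the notebook's naming: a sits later in the list than b.
nested = [f'{a}_{b}_imb'
          for i, a in enumerate(prices)
          for j, b in enumerate(prices)
          if i > j]

# Same set of pairs via itertools.combinations (order of generation differs).
via_combinations = [f'{a}_{b}_imb' for b, a in combinations(prices, 2)]

print(len(nested))   # 15
print(nested[0])     # far_price_reference_price_imb
```

With 6 price columns this yields 6 × 5 / 2 = 15 pairwise features.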

In [3]:
def calculate_price_features(df, features):
    # Define a list of price-related columns
    prices = ['reference_price', 'far_price', 'near_price', 'ask_price', 'bid_price', 'wap']

    # Loop through the price columns to create new features
    for i, a in enumerate(prices):
        for j, b in enumerate(prices):
            # Check if the first price (a) comes after the second price (b) in the list
            if i > j:
                # Calculate and add a new feature to the DataFrame
                df[f'{a}_{b}_imb'] = df.eval(f'({a} - {b}) / ({a} + {b})')
                # Add the new feature name to the list of features
                features.append(f'{a}_{b}_imb')

    # Return the modified DataFrame and the updated list of features
    return df, features


## Function to Calculate Additional Price-Based Features

Explanation:

1. `def calculate_additional_price_features(df, features):`:
   - This line defines a Python function called `calculate_additional_price_features` that takes two parameters: a DataFrame `df` and a list `features`.

2. `prices = ['reference_price', 'far_price', 'near_price', 'ask_price', 'bid_price', 'wap']`:
   - This line creates a list called `prices` containing the names of various price-related columns.

3. `for i, a in enumerate(prices):`:
   - This line starts a loop that iterates over the elements in the `prices` list. It uses `i` to keep track of the index and `a` to store the name of the first price.

4. `for j, b in enumerate(prices):`:
   - Inside the previous loop, this line starts another loop that iterates over the same `prices` list. It uses `j` to keep track of the index and `b` to store the name of the second price.

5. `for k, c in enumerate(prices):`:
   - Inside the second loop, this line starts yet another loop that iterates over the `prices` list. It uses `k` to keep track of the index and `c` to store the name of the third price.

6. `if i > j and j > k:`:
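   - This line checks that the list positions satisfy `i > j > k`, so each unordered combination of three distinct price columns is processed exactly once. The condition compares positions in the `prices` list, not the price values themselves.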

7. `max_ = df[[a, b, c]].max(axis=1)`:
   - This line calculates the maximum value among the prices a, b, and c for each row in the DataFrame `df`.

8. `min_ = df[[a, b, c]].min(axis=1)`:
   - This line calculates the minimum value among the prices a, b, and c for each row in the DataFrame `df`.

9. `mid_ = df[[a, b, c]].sum(axis=1) - min_ - max_`:
   - This line calculates the middle value among the prices a, b, and c for each row in the DataFrame `df` by subtracting the minimum and maximum values from the sum.

10. `df[f'{a}_{b}_{c}_imb2'] = (max_ - mid_) / (mid_ - min_)`:
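    - If the condition in step 6 is met, this line calculates a new feature from the max, mid, and min values of a, b, and c: the gap between the largest and middle price divided by the gap between the middle and smallest price. When the middle and smallest values are equal, the denominator is zero and the result is `inf` (or `NaN` if all three prices are equal).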
    - The calculated feature is added as a new column in the DataFrame with a name in the format `{a}_{b}_{c}_imb2`.

11. `features.append(f'{a}_{b}_{c}_imb2')`:
    - This line adds the name of the newly created feature to the `features` list, keeping track of all the features that have been calculated.

12. `return df, features`:
    - Finally, the function returns the modified DataFrame (`df`) with the added features and the updated list of features (`features`).

This function uses three nested loops to add price-related features based on combinations of three price columns. It visits each combination once (list positions in descending order), and for each combination it calculates the ratio of the upper gap (max minus mid) to the lower gap (mid minus min) among the selected prices. The function then returns the modified DataFrame and the list of calculated features.
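
The sum-minus-min-minus-max trick and the zero-denominator cases can be seen on a small example. This is a sketch with invented prices (chosen to be exactly representable as floats); the final `replace` step is an optional cleanup that the notebook does not perform.

```python
import numpy as np
import pandas as pd

# Invented prices: a normal row, a row where mid == min, and a row where all are equal.
toy = pd.DataFrame({
    'ask_price': [1.5, 1.5, 1.0],
    'bid_price': [1.0, 1.0, 1.0],
    'wap':       [1.25, 1.0, 1.0],
})
cols = ['ask_price', 'bid_price', 'wap']

max_ = toy[cols].max(axis=1)
min_ = toy[cols].min(axis=1)
mid_ = toy[cols].sum(axis=1) - min_ - max_   # whatever is left is the middle value

ratio = (max_ - mid_) / (mid_ - min_)
print(ratio.tolist())

# Optional cleanup (an assumption, not part of the notebook): treat inf as missing.
cleaned = ratio.replace([np.inf, -np.inf], np.nan)
```

The first row gives 1.0, the second gives `inf` (0.5 / 0), and the third gives `NaN` (0 / 0). Tree libraries differ in how they treat infinite inputs, so it is worth checking how many such values the real features contain.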

In [4]:
def calculate_additional_price_features(df, features):
    # Define a list of price-related columns
    prices = ['reference_price', 'far_price', 'near_price', 'ask_price', 'bid_price', 'wap']

    # Loop through the price columns to create new features
    for i, a in enumerate(prices):
        for j, b in enumerate(prices):
            for k, c in enumerate(prices):
                # Check if the order of prices a, b, and c is descending
                if i > j and j > k:
                    # Calculate the maximum, minimum, and mid values among a, b, and c
                    max_ = df[[a, b, c]].max(axis=1)
                    min_ = df[[a, b, c]].min(axis=1)
                    mid_ = df[[a, b, c]].sum(axis=1) - min_ - max_

                    # Calculate and add a new feature to the DataFrame
                    df[f'{a}_{b}_{c}_imb2'] = (max_ - mid_) / (mid_ - min_)
                    # Add the new feature name to the list of features
                    features.append(f'{a}_{b}_{c}_imb2')

    # Return the modified DataFrame and the updated list of features
    return df, features


## Function to Generate Features from DataFrame

Explanation:

1. `def generate_features(df):`:
   - This line defines a Python function called `generate_features` that takes a DataFrame `df` as input.

2. `features = ['seconds_in_bucket', 'imbalance_buy_sell_flag', ...]`:
   - This line creates a list called `features` containing the names of selected feature columns that will be used to generate the final feature set.

3. `df = calculate_imbalance_features(df)`:
   - This line calls the `calculate_imbalance_features` function to calculate imbalance-related features based on the input DataFrame `df`. It updates the DataFrame `df` with these new features.
   
4. `df, features = calculate_price_features(df, features)`:
   - This line calls the `calculate_price_features` function to calculate price-related features based on price differences and adds them to the input DataFrame `df`. It also updates the `features` list to include the names of these newly calculated features.
   
5. `df, features = calculate_additional_price_features(df, features)`:
   - This line calls the `calculate_additional_price_features` function to calculate additional price-related features based on combinations of price differences. It adds these features to the input DataFrame `df` and updates the `features` list with their names.

6. `return df[features]`:
   - Finally, the function returns a DataFrame that includes only the selected features listed in the `features` list. This DataFrame represents the final set of features that will be used for further analysis or modeling.

This function takes an initial DataFrame as input, calculates various features related to imbalance and price differences, and returns a DataFrame containing the selected set of features for further analysis or modeling.
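
As a quick sanity check on the size of the final feature set, the counts can be reproduced without any data. The sketch below only counts names; it assumes the column lists shown in `generate_features` and the pair and triplet rules described earlier.

```python
from itertools import combinations

base = ['seconds_in_bucket', 'imbalance_buy_sell_flag',
        'imbalance_size', 'matched_size', 'bid_size', 'ask_size',
        'reference_price', 'far_price', 'near_price', 'ask_price', 'bid_price', 'wap',
        'imb_s1', 'imb_s2']
prices = ['reference_price', 'far_price', 'near_price', 'ask_price', 'bid_price', 'wap']

n_pairs = len(list(combinations(prices, 2)))      # pairwise *_imb features
n_triplets = len(list(combinations(prices, 3)))   # three-way *_imb2 features
total = len(base) + n_pairs + n_triplets

print(len(base), n_pairs, n_triplets, total)  # 14 15 20 49
```

The total of 49 agrees with the "number of used features: 49" line that LightGBM prints in the training log.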

In [5]:
def generate_features(df):
    # Define the list of feature column names
    features = ['seconds_in_bucket', 'imbalance_buy_sell_flag',
                'imbalance_size', 'matched_size', 'bid_size', 'ask_size',
                'reference_price', 'far_price', 'near_price', 'ask_price', 'bid_price', 'wap',
                'imb_s1', 'imb_s2'
               ]
    
    # Calculate imbalance features
    df = calculate_imbalance_features(df)  # 📊 Calculate imbalance features
    
    # Calculate features based on price differences
    df, features = calculate_price_features(df, features)  # 💰 Calculate price-related features
    
    # Calculate additional features based on price differences
    df, features = calculate_additional_price_features(df, features)  # 🔄 Calculate additional price features
    
    # Return the DataFrame with selected features
    return df[features]


## Training

Explanation:

1. `TRAINING = True`:
   - This line defines a boolean variable `TRAINING` and sets it to `True`. When it is `True`, the notebook reads the training data and fits the models; when it is `False`, the later `train` function loads pre-trained models from disk instead.

2. `if TRAINING:`:
   - This line starts an `if` statement that checks if the `TRAINING` variable is `True`. If it is `True`, it proceeds to the next block of code.

3. `df_train = pd.read_csv('/kaggle/input/optiver-trading-at-the-close/train.csv')`:
   - Inside the `if` block, this line reads the training data from a CSV file located at the specified path (`'/kaggle/input/optiver-trading-at-the-close/train.csv'`) using the Pandas library. It loads the data into a DataFrame called `df_train`.

4. `df_ = generate_features(df_train)`:
   - After reading the training data, this line calls the `generate_features` function to generate features based on the training data. It passes the `df_train` DataFrame as an argument.
   - The resulting DataFrame with selected features is assigned to the variable `df_`.

This code block is meant for training a machine learning model or performing data analysis. It first checks if the `TRAINING` variable is `True`, indicating that it's in a training mode. If so, it reads the training data from a CSV file, generates features based on that data, and stores the resulting DataFrame in the variable `df_`.

In [6]:
TRAINING = True

if TRAINING:
    # Read the training data from a CSV file
    df_train = pd.read_csv('/kaggle/input/optiver-trading-at-the-close/train.csv')  # 📁 Read training data
    
    # Generate features for the training data
    df_ = generate_features(df_train)  # 🏋️‍♂️ Generate features for training data


## Model Training and Pre-Trained Model Loading

Explanation:

1. `os.system('mkdir models')`:
   - This line creates a directory named 'models' using the `os.system` command. It's used for storing trained models.

2. `model_path = '/kaggle/input/optiverbaselinezyz'`:
   - This line sets the path to the directory from which pre-trained models are loaded when `TRAINING` is `False`. Models trained in this notebook are saved to the local `models` directory, not to this path.

3. `N_fold = 5`:
   - This line defines the number of folds for cross-validation.

4. `if TRAINING:`:
   - This block of code is executed only if the `TRAINING` flag is `True`. It's used for training models.

5. Data preparation:
   - This section prepares the input features `X` and the target variable `Y` for model training. It also filters out rows with non-finite target values and creates an index array for data splitting.

6. `models = []`:
   - This line initializes an empty list to store trained models.

7. `def train(model_dict, modelname='lgb'):`:
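   - This defines a function `train` for training a model. It takes a model dictionary and a model name as parameters. The fold index `i` is not a parameter; the function reads it from the global variable set by the cross-validation loop.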

8. Model training:
   - This section fits the selected model on the training data and appends the trained model to the `models` list. It also saves the trained model to a file.

9. Model dictionary:
   - This section defines a dictionary `model_dict` that maps model names to their respective model objects.

10. Cross-validation loop:
    - This loop iterates `N_fold` times and trains models using different folds of the data. It calls the `train` function for each model and fold.

This code block is for training machine learning models using cross-validation. It also allows for the loading of pre-trained models if not in training mode.
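
The fold assignment in `train` is done with `index % N_fold`, which interleaves rows across folds rather than cutting the data into contiguous blocks. A minimal sketch on 12 rows is shown below; the helper name `modulo_fold_split` is hypothetical and only wraps the two masks used in the notebook.

```python
import numpy as np

def modulo_fold_split(n_rows, n_folds, fold):
    """Return (train_idx, valid_idx) using the notebook's index % n_folds rule."""
    index = np.arange(n_rows)
    valid_mask = index % n_folds == fold
    return index[~valid_mask], index[valid_mask]

N_fold = 5
for i in range(N_fold):
    train_idx, valid_idx = modulo_fold_split(12, N_fold, i)
    print(i, valid_idx.tolist())
```

For fold 0 the validation rows are 0, 5 and 10. Because neighbouring rows land in different folds, this split does not separate training and validation data in time if the rows are in chronological order, so the validation scores may be optimistic compared with a time-based split.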

In [7]:
# Create a directory named 'models'
os.system('mkdir models')  # 📁 Create a directory for storing models

# Set the path for model loading
model_path = '/kaggle/input/optiverbaselinezyz'

# Define the number of folds for cross-validation
N_fold = 5

if TRAINING:
    # Prepare the input features and target variable
    X = df_.values
    Y = df_train['target'].values

    # Filter out rows with non-finite target values
    X = X[np.isfinite(Y)]
    Y = Y[np.isfinite(Y)]

    # Create an index array for data splitting
    index = np.arange(len(X))

# Initialize a list to store trained models
models = []

# Define a function for training a model
# Note: the fold index `i` is read from the global variable set by the training loop
def train(model_dict, modelname='lgb'):
    if TRAINING:
        # Get the model from the model dictionary
        model = model_dict[modelname]
        
        # Fit the model on the training data
        model.fit(X[index % N_fold != i], Y[index % N_fold != i], 
                  eval_set=[(X[index % N_fold == i], Y[index % N_fold == i])], 
                  verbose=10, 
                  early_stopping_rounds=100
                 )
        
        # Append the trained model to the models list
        models.append(model)
        
        # Save the model to a file
        joblib.dump(model, f'./models/{modelname}_{i}.model')
    else:
        # Load a pre-trained model if not in training mode
        models.append(joblib.load(f'{model_path}/{modelname}_{i}.model'))
    return 

# Define a dictionary of models to train
model_dict = {
    'lgb': lgb.LGBMRegressor(objective='regression_l1', n_estimators=500),
    'xgb': xgb.XGBRegressor(tree_method='hist', objective='reg:absoluteerror', n_estimators=500),
    'cbt': cbt.CatBoostRegressor(objective='MAE', iterations=3000),
}

# Loop for training models using cross-validation
# (disabled here; the loop runs in the next cell, after the hyperparameters are updated)
# for i in range(N_fold):
#     train(model_dict, 'lgb')  # Train LightGBM model
#     train(model_dict, 'xgb')  # Train XGBoost model
#     train(model_dict, 'cbt')  # Train CatBoost model


In [8]:
def update_model_dict(model_dict):
    # Define the updated hyperparameters and replace the models in model_dict
    
    # LightGBM
    model_dict['lgb'] = lgb.LGBMRegressor(
        objective='regression',
        n_estimators=1000,       # Higher may capture more complex patterns but may overfit
        learning_rate=0.05,      # Lower for stability, but may require more estimators
        max_depth=6,             # Higher may capture more complex patterns but may overfit
        num_leaves=64,           # Higher may capture more complex patterns but may overfit
        min_child_samples=20,    # Higher for more conservative tree growth
        subsample=0.8,           # Fraction of rows sampled per tree; lower adds more randomness
        colsample_bytree=0.8,    # Fraction of features sampled per tree; lower adds more randomness
        reg_alpha=0.1,           # Higher for more L1 regularization
        reg_lambda=0.1,          # Higher for more L2 regularization
        verbose=10,
        early_stopping_rounds=100
    )
    
    # XGBoost
    model_dict['xgb'] = xgb.XGBRegressor(
        tree_method='hist',
        objective='reg:squarederror',
        n_estimators=1000,       # Higher may capture more complex patterns but may overfit
        learning_rate=0.05,      # Lower for stability, but may require more estimators
        max_depth=6,             # Higher may capture more complex patterns but may overfit
        min_child_weight=1,      # Adjust as needed, higher for more conservative tree growth
        gamma=0,                 # Adjust as needed, higher for more regularization
        subsample=0.8,           # Fraction of rows sampled per tree; lower adds more randomness
        colsample_bytree=0.8,    # Fraction of features sampled per tree; lower adds more randomness
        reg_alpha=0.1,           # Higher for more L1 regularization
        reg_lambda=0.1,          # Higher for more L2 regularization
        verbose=10               # Reported by XGBoost as unused ("Parameters: { "verbose" } are not used.")
    )
    
    # CatBoost
    model_dict['cbt'] = cbt.CatBoostRegressor(
        objective='MAE',
        iterations=1000,         # Higher may capture more complex patterns but may overfit
        learning_rate=0.05,      # Lower for stability, but may require more iterations
        depth=6,                 # Higher may capture more complex patterns but may overfit
        l2_leaf_reg=1,           # Adjust as needed, higher for more regularization
        verbose=10
    )

# Call the update_model_dict function to update the hyperparameters
update_model_dict(model_dict)

for i in range(N_fold):
    train(model_dict, 'lgb')  # Train LightGBM model
    train(model_dict, 'xgb')  # Train XGBoost model
    train(model_dict, 'cbt')  # Train CatBoost model




[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.053126
[LightGBM] [Debug] init for col-wise cost 0.001383 seconds, init for row-wise cost 1.229954 seconds
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12043
[LightGBM] [Info] Number of data points in the train set: 4190313, number of used features: 49
[LightGBM] [Info] Start training from score -0.049791
[LightGBM] [Debug] Trained a tree with leaves = 62 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 62 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth 



Parameters: { "verbose" } are not used.

[0]	validation_0-rmse:9.71372
[10]	validation_0-rmse:9.61814
[20]	validation_0-rmse:9.57847
[30]	validation_0-rmse:9.55996
[40]	validation_0-rmse:9.54958
[50]	validation_0-rmse:9.54347
[60]	validation_0-rmse:9.53879
[70]	validation_0-rmse:9.53517
[80]	validation_0-rmse:9.53278
[90]	validation_0-rmse:9.53171
[100]	validation_0-rmse:9.53076
[110]	validation_0-rmse:9.53052
[120]	validation_0-rmse:9.52893
[130]	validation_0-rmse:9.52719
[140]	validation_0-rmse:9.52648
[150]	validation_0-rmse:9.54811
[160]	validation_0-rmse:9.53934
[170]	validation_0-rmse:9.53437
[180]	validation_0-rmse:9.52997
[190]	validation_0-rmse:9.53651
[200]	validation_0-rmse:9.53408
[210]	validation_0-rmse:9.53414
[220]	validation_0-rmse:9.54187
[230]	validation_0-rmse:9.54155
[240]	validation_0-rmse:9.53991
[247]	validation_0-rmse:9.53935
0:	learn: 6.3735716	test: 6.5057188	best: 6.5057188 (0)	total: 1.51s	remaining: 25m 4s
10:	learn: 6.3248470	test: 6.4518097	best: 6.451809



[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.052688
[LightGBM] [Debug] init for col-wise cost 0.000143 seconds, init for row-wise cost 1.235493 seconds
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12043
[LightGBM] [Info] Number of data points in the train set: 4190313, number of used features: 49
[LightGBM] [Info] Start training from score -0.053562
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth 



Parameters: { "verbose" } are not used.

[0]	validation_0-rmse:9.32685
[10]	validation_0-rmse:9.23967
[20]	validation_0-rmse:9.20272
[30]	validation_0-rmse:9.18546
[40]	validation_0-rmse:9.17654
[50]	validation_0-rmse:9.17033
[60]	validation_0-rmse:9.16525
[70]	validation_0-rmse:9.16124
[80]	validation_0-rmse:9.15847
[90]	validation_0-rmse:9.15590
[100]	validation_0-rmse:9.15357
[110]	validation_0-rmse:9.15132
[120]	validation_0-rmse:9.14855
[130]	validation_0-rmse:9.14694
[140]	validation_0-rmse:9.14580
[150]	validation_0-rmse:9.14440
[160]	validation_0-rmse:9.14247
[170]	validation_0-rmse:10.96680
[180]	validation_0-rmse:19.66531
[190]	validation_0-rmse:14.33714
[200]	validation_0-rmse:11.45405
[210]	validation_0-rmse:10.17435
[220]	validation_0-rmse:9.58976
[230]	validation_0-rmse:9.36243
[240]	validation_0-rmse:9.27245
[250]	validation_0-rmse:9.22636
[260]	validation_0-rmse:9.20046
[263]	validation_0-rmse:9.19536
0:	learn: 6.4120984	test: 6.3514494	best: 6.3514494 (0)	total: 1.36s	



[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.052473
[LightGBM] [Debug] init for col-wise cost 0.000151 seconds, init for row-wise cost 1.316330 seconds
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12043
[LightGBM] [Info] Number of data points in the train set: 4190314, number of used features: 49
[LightGBM] [Info] Start training from score -0.041514
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth 



Parameters: { "verbose" } are not used.

[0]	validation_0-rmse:9.19767
[10]	validation_0-rmse:9.11244
[20]	validation_0-rmse:9.07676
[30]	validation_0-rmse:9.05965
[40]	validation_0-rmse:9.05005
[50]	validation_0-rmse:9.04409
[60]	validation_0-rmse:9.03878
[70]	validation_0-rmse:9.03453
[80]	validation_0-rmse:9.03234
[90]	validation_0-rmse:9.02976
[100]	validation_0-rmse:9.02709
[110]	validation_0-rmse:9.02468
[120]	validation_0-rmse:9.02218
[130]	validation_0-rmse:9.02013
[140]	validation_0-rmse:9.01853
[150]	validation_0-rmse:9.01717
[160]	validation_0-rmse:9.01459
[170]	validation_0-rmse:9.01327
[180]	validation_0-rmse:9.01219
[190]	validation_0-rmse:15.60236
[200]	validation_0-rmse:11.99753
[210]	validation_0-rmse:10.45002
[220]	validation_0-rmse:9.60691
[230]	validation_0-rmse:15.27086
[240]	validation_0-rmse:11.97297
[250]	validation_0-rmse:10.20105
[260]	validation_0-rmse:9.57786
[270]	validation_0-rmse:9.24690
[280]	validation_0-rmse:9.11099
[288]	validation_0-rmse:9.06176
0:	l



[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.052212
[LightGBM] [Debug] init for col-wise cost 0.000117 seconds, init for row-wise cost 1.331231 seconds
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12043
[LightGBM] [Info] Number of data points in the train set: 4190314, number of used features: 49
[LightGBM] [Info] Start training from score -0.047265
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth 



Parameters: { "verbose" } are not used.

[0]	validation_0-rmse:9.46513
[10]	validation_0-rmse:9.38409
[20]	validation_0-rmse:9.35023
[30]	validation_0-rmse:9.33430
[40]	validation_0-rmse:9.32624
[50]	validation_0-rmse:9.32039
[60]	validation_0-rmse:9.31494
[70]	validation_0-rmse:9.31026
[80]	validation_0-rmse:9.30800
[90]	validation_0-rmse:9.30515
[100]	validation_0-rmse:9.30299
[110]	validation_0-rmse:9.29894
[120]	validation_0-rmse:9.29706
[130]	validation_0-rmse:9.29465
[140]	validation_0-rmse:9.29359
[150]	validation_0-rmse:9.29199
[160]	validation_0-rmse:9.29121
[170]	validation_0-rmse:9.28962
[180]	validation_0-rmse:9.28685
[190]	validation_0-rmse:9.28472
[200]	validation_0-rmse:9.28450
[210]	validation_0-rmse:9.28323
[220]	validation_0-rmse:9.28151
[230]	validation_0-rmse:9.28060
[240]	validation_0-rmse:9.28031
[250]	validation_0-rmse:9.28086
[260]	validation_0-rmse:9.28069
[270]	validation_0-rmse:9.27963
[280]	validation_0-rmse:9.27851
[290]	validation_0-rmse:9.27759
[300]	vali



[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.052516
[LightGBM] [Debug] init for col-wise cost 0.000149 seconds, init for row-wise cost 1.320086 seconds
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12043
[LightGBM] [Info] Number of data points in the train set: 4190314, number of used features: 49
[LightGBM] [Info] Start training from score -0.045673
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 64 and depth 



Parameters: { "verbose" } are not used.

[0]	validation_0-rmse:9.56218
[10]	validation_0-rmse:9.47631
[20]	validation_0-rmse:9.44004
[30]	validation_0-rmse:9.42324
[40]	validation_0-rmse:9.41412
[50]	validation_0-rmse:9.40858
[60]	validation_0-rmse:9.40438
[70]	validation_0-rmse:9.40110
[80]	validation_0-rmse:9.39897
[90]	validation_0-rmse:9.39777
[100]	validation_0-rmse:9.39529
[110]	validation_0-rmse:9.39249
[120]	validation_0-rmse:9.39040
[130]	validation_0-rmse:9.38977
[140]	validation_0-rmse:9.38783
[150]	validation_0-rmse:9.38675
[160]	validation_0-rmse:9.38579
[170]	validation_0-rmse:9.38409
[180]	validation_0-rmse:9.38280
[190]	validation_0-rmse:9.38137
[200]	validation_0-rmse:9.39555
[210]	validation_0-rmse:9.43422
[220]	validation_0-rmse:9.49014
[230]	validation_0-rmse:9.46416
[240]	validation_0-rmse:9.45024
[250]	validation_0-rmse:9.44176
[260]	validation_0-rmse:9.44171
[270]	validation_0-rmse:9.53236
[280]	validation_0-rmse:9.62049
[290]	validation_0-rmse:9.51808
[293]	vali

## Setting Up Optiver 2023 Environment and Initializing Test Data Iterator

Explanation:

1. `env = optiver2023.make_env()`:
   - This line creates the competition environment. `optiver2023` is the Python module provided for this competition; it serves the test data through an API instead of exposing it as a single static file.

2. `iter_test = env.iter_test()`:
   - This line returns an iterator over the test data. Each iteration yields one batch of test rows, so predictions are made batch by batch, in order.



In [9]:
# 🏭 Create the Optiver 2023 competition environment
env = optiver2023.make_env()

# 🔄 Initialize an iterator over the test data
iter_test = env.iter_test()
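
To experiment with this pattern outside Kaggle, where the `optiver2023` module is not available, a small stand-in can mimic the iterate-then-predict protocol. The sketch below is only an illustration: `MockEnv`, its constructor argument, and the toy columns are hypothetical and are not part of the real competition API. It assumes the real environment expects `predict()` to be called once per batch before the next batch is requested.

```python
import pandas as pd


class MockEnv:
    """Hypothetical stand-in for the competition environment (NOT the real optiver2023 API)."""

    def __init__(self, batches):
        self._batches = batches            # list of test DataFrames, one per batch
        self._awaiting_prediction = False  # True between a yield and the matching predict()
        self.submissions = []              # collected predictions, for inspection

    def iter_test(self):
        for test in self._batches:
            revealed_targets = pd.DataFrame()  # placeholder; real contents are not modeled here
            sample_prediction = pd.DataFrame({"row_id": test["row_id"], "target": 0.0})
            self._awaiting_prediction = True
            yield test, revealed_targets, sample_prediction
            # Assumed rule: each batch must be answered before the next one is served
            if self._awaiting_prediction:
                raise RuntimeError("call predict() before requesting the next batch")

    def predict(self, sample_prediction):
        self.submissions.append(sample_prediction.copy())
        self._awaiting_prediction = False


# Two toy batches with made-up values
batches = [
    pd.DataFrame({"row_id": ["0_0_0", "0_0_1"], "bid_price": [1.000, 0.999]}),
    pd.DataFrame({"row_id": ["0_10_0", "0_10_1"], "bid_price": [1.001, 0.998]}),
]

mock_env = MockEnv(batches)
n_batches = 0
for test, revealed_targets, sample_prediction in mock_env.iter_test():
    sample_prediction["target"] = 0.0   # trivial prediction, just to exercise the loop
    mock_env.predict(sample_prediction)
    n_batches += 1

print(n_batches, len(mock_env.submissions))
```

A stand-in like this makes it possible to check that the feature and prediction code runs end to end on a few rows before committing to a full submission run.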


## Creating the Submission File

Explanation:

1. `counter = 0`:
   - This line initializes a counter that tracks how many test batches have been processed.

2. `for (test, revealed_targets, sample_prediction) in iter_test:`:
   - This line starts a `for` loop over the test data served by the Optiver 2023 environment. Each iteration unpacks a tuple of `test` (the rows to predict), `revealed_targets`, and `sample_prediction` (the DataFrame to fill in). `revealed_targets` is not used in this loop.

3. Generating Features:
   - Inside the loop, `feat = generate_features(test)` calls the `generate_features` function to build the model inputs for the current batch.

4. Making Predictions:
   - Each model in `models` predicts on `feat`, and `np.mean` averages those predictions across models (axis 0), giving one value per row. The result is written to the `target` column of the `sample_prediction` DataFrame.

5. Submitting Predictions:
   - The `env.predict(sample_prediction)` line submits the predictions for the current batch to the Optiver 2023 environment for evaluation and scoring.

6. `counter += 1`:
   - After each batch is processed, this line increments the counter.

This code block is used to iterate through the test data, generate features, make predictions using trained models, submit the predictions to the competition environment, and keep track of the iteration count.
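
The averaging step can be checked in isolation. The sketch below uses a hypothetical `ConstantModel` class as a stand-in for the trained models (the real `models` list holds the fitted regressors from the training section) and shows that averaging over axis 0 collapses the model dimension while keeping one prediction per row.

```python
import numpy as np


class ConstantModel:
    """Hypothetical stand-in for a fitted regressor; always predicts the same value."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        # One prediction per input row
        return np.full(len(X), self.value, dtype=float)


# Three toy "models" and a feature matrix with 4 rows and 2 columns
toy_models = [ConstantModel(1.0), ConstantModel(2.0), ConstantModel(6.0)]
toy_feat = np.zeros((4, 2))

# Stack predictions into shape (n_models, n_rows), then average over the model axis
stacked = np.array([m.predict(toy_feat) for m in toy_models])
blended = stacked.mean(axis=0)

print(stacked.shape)  # (3, 4)
print(blended)        # [3. 3. 3. 3.]
```

An unweighted mean gives every model the same influence; whether that is the best blend for this competition is not something the sketch addresses.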

In [10]:
# 🔢 Count the test batches processed
counter = 0

# 🔄 Iterate through the test batches served by the Optiver 2023 environment
for (test, revealed_targets, sample_prediction) in iter_test:

    # 🏋️‍♂️ Generate features for the current batch
    feat = generate_features(test)

    # Average the predictions of all trained models, row by row
    sample_prediction['target'] = np.mean([model.predict(feat) for model in models], axis=0)

    # 📤 Submit the predictions to the environment
    env.predict(sample_prediction)

    # 🔢 Increment the counter
    counter += 1


This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.


## Feedback and Gratitude 🙏
I value your feedback! Your insights and suggestions help me improve this work. If you have any comments, questions, or ideas to share, please don't hesitate to reach out.

📬 Contact me via email: [zulqar445ali@gmail.com](mailto:zulqar445ali@gmail.com)

I would like to express my heartfelt gratitude for your time and engagement. Your support motivates me to create more valuable content.

Happy coding and best of luck in your data science endeavors! 🚀
