This notebook builds an ensemble of LightGBM (Light Gradient Boosting Machine) models. LightGBM is a gradient boosting framework known for its speed and efficiency, and it is widely used in the machine learning community, especially on large datasets.

The approach here is an ensemble of LightGBM models. Ensemble learning combines the predictions of multiple models, with the aim of being more accurate than any single one. It tends to help on complex datasets because the individual models' errors partly cancel when their predictions are combined.
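
As a minimal illustration of the averaging idea, the sketch below builds several noisy predictors of the same synthetic target and compares the mean absolute error (MAE) of their average with the mean of the individual MAEs. It uses made-up data, not the competition data, and the names `n_models` and `noise_scale` are assumptions introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values only
n_samples, n_models, noise_scale = 1_000, 5, 1.0
y_true = rng.normal(size=n_samples)

# Each "model" is the truth plus its own independent noise
preds = y_true + noise_scale * rng.normal(size=(n_models, n_samples))

individual_mae = np.abs(preds - y_true).mean(axis=1)  # one MAE per model
ensemble_pred = preds.mean(axis=0)                     # equal-weight average
ensemble_mae = np.abs(ensemble_pred - y_true).mean()

print("mean individual MAE:", individual_mae.mean())
print("ensemble MAE:       ", ensemble_mae)
```

By the triangle inequality, the MAE of the averaged prediction can never exceed the mean of the individual MAEs. How much it improves depends on how correlated the models' errors are.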

The goal is to show how an ensemble of LightGBM models can improve predictive performance, particularly when GPU resources are unavailable or not preferred. LightGBM handles large datasets well even on CPU, which makes it a good fit for hardware-constrained environments.

You can of course run it on a GPU if you wish.
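
If you retrain the models yourself, the device is selected through LightGBM's `device_type` parameter. The dictionaries below are a sketch with placeholder hyperparameters, not the settings used to train the shared models, and the GPU option only works with a GPU-enabled LightGBM build.

```python
# Placeholder hyperparameters for illustration only;
# these are NOT the values used to train the shared models.
base_params = {
    "objective": "mae",       # L1 loss
    "n_estimators": 500,
    "learning_rate": 0.05,
}

cpu_params = {**base_params, "device_type": "cpu"}
gpu_params = {**base_params, "device_type": "gpu"}  # needs a GPU-enabled LightGBM build

print(cpu_params)
print(gpu_params)
```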

I'm also sharing the dataset in case you want to ensemble these models with your own: https://www.kaggle.com/datasets/verracodeguacas/ensemble-of-models

In [1]:
import gc  # Garbage collection for memory management
import os  # Operating system-related functions
import time  # Time-related functions
import warnings  # Handling warnings
from itertools import combinations  # For creating combinations of elements
from warnings import simplefilter  # Simplifying warning handling

# Importing machine learning libraries
import joblib  # For saving and loading models
import numpy as np  # Numerical operations
import pandas as pd  # Data manipulation and analysis
from sklearn.metrics import mean_absolute_error  # Metric for evaluation
from sklearn.model_selection import KFold, TimeSeriesSplit  # Cross-validation techniques

# Disable warnings to keep the output clean
warnings.filterwarnings("ignore")
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

# Define flags and variables
is_offline = False  # Flag for online/offline mode
is_train = True  # Flag for training mode
is_infer = True  # Flag for inference mode
max_lookback = np.nan  # Maximum lookback (not specified)
split_day = 435  # Split day for time series data


In [2]:
# Read the dataset from a CSV file using Pandas
df = pd.read_csv("/kaggle/input/optiver-trading-at-the-close/train.csv")

# Remove rows with missing values in the "target" column
df = df.dropna(subset=["target"])

# Reset the index of the DataFrame and apply the changes in place
df.reset_index(drop=True, inplace=True)

# Get the shape of the DataFrame (number of rows and columns)
df_shape = df.shape


In [3]:
# Import Numba for just-in-time (JIT) compilation and parallel processing
from numba import njit, prange

# Function to compute triplet imbalance in parallel using Numba
@njit(parallel=True)
def compute_triplet_imbalance(df_values, comb_indices):
    num_rows = df_values.shape[0]
    num_combinations = len(comb_indices)
    imbalance_features = np.empty((num_rows, num_combinations))

    # Loop through all combinations of triplets
    for i in prange(num_combinations):
        a, b, c = comb_indices[i]

        # Loop through rows of the DataFrame
        for j in range(num_rows):
            max_val = max(df_values[j, a], df_values[j, b], df_values[j, c])
            min_val = min(df_values[j, a], df_values[j, b], df_values[j, c])
            mid_val = df_values[j, a] + df_values[j, b] + df_values[j, c] - min_val - max_val

            # Prevent division by zero
            if mid_val == min_val:
                imbalance_features[j, i] = np.nan
            else:
                imbalance_features[j, i] = (max_val - mid_val) / (mid_val - min_val)

    return imbalance_features

# Function to calculate triplet imbalance for given price data and a DataFrame
def calculate_triplet_imbalance_numba(price, df):
    # Convert DataFrame to numpy array for Numba compatibility
    df_values = df[price].values
    comb_indices = [(price.index(a), price.index(b), price.index(c)) for a, b, c in combinations(price, 3)]

    # Calculate the triplet imbalance using the Numba-optimized function
    features_array = compute_triplet_imbalance(df_values, comb_indices)

    # Create a DataFrame from the results
    columns = [f"{a}_{b}_{c}_imb2" for a, b, c in combinations(price, 3)]
    features = pd.DataFrame(features_array, columns=columns)

    return features

@njit(fastmath=True)
def rolling_average(arr, window):
    """
    Calculate the rolling average for a 1D numpy array.
    
    Parameters:
    arr (numpy.ndarray): Input array to calculate the rolling average.
    window (int): The number of elements to consider for the moving average.
    
    Returns:
    numpy.ndarray: Array containing the rolling average values.
    """
    n = len(arr)
    result = np.full(n, np.nan)  # NaN where the window is not yet full
    cumsum = np.cumsum(arr)

    # The first full window ends at index window - 1
    if window >= 1 and window <= n:
        result[window - 1] = cumsum[window - 1] / window

    for i in range(window, n):
        result[i] = (cumsum[i] - cumsum[i - window]) / window

    return result

@njit(parallel=True)
def compute_rolling_averages(df_values, window_sizes):
    """
    Calculate the rolling averages for multiple window sizes in parallel.
    
    Parameters:
    df_values (numpy.ndarray): 2D array of values to calculate the rolling averages.
    window_sizes (List[int]): List of window sizes for the rolling averages.
    
    Returns:
    numpy.ndarray: A 3D array containing the rolling averages for each window size.
    """
    num_rows, num_features = df_values.shape
    num_windows = len(window_sizes)
    rolling_features = np.empty((num_rows, num_features, num_windows))

    for feature_idx in prange(num_features):
        for window_idx, window in enumerate(window_sizes):
            rolling_features[:, feature_idx, window_idx] = rolling_average(df_values[:, feature_idx], window)

    return rolling_features

## Feature Generation Functions

**Explanation**

1. `imbalance_features_lgbm(df)`:
   - Takes a DataFrame `df` as input.
   - Computes basic price and size features (volume, mid price, liquidity imbalance, matched imbalance, size imbalance) with Pandas' `eval`, adding one column per feature.
   - Creates pairwise imbalance features, `(a - b) / (a + b)`, for every pair of price columns.
   - Calculates triplet imbalance features, `(max - mid) / (mid - min)`, using the Numba-optimized function `calculate_triplet_imbalance_numba`.
   - Calculates additional features: momentum, spread, intensity, pressure, market urgency, and depth pressure.
   - Calculates row-wise statistical aggregations (mean, standard deviation, skewness, kurtosis) over the price columns and over the size columns.
   - Generates shifted, return, and diff features per `stock_id` for specific columns.
   - Replaces infinite values in the DataFrame with 0.
   - The rolling-average helpers defined earlier (`rolling_average`, `compute_rolling_averages`) are not called in the version shown here.

2. `other_features_lgbm(df)`:
   - Adds time-related and stock-related features to the DataFrame.
   - Derives a day-of-week proxy (`date_id % 5`) from "date_id", and seconds and minutes from "seconds_in_bucket".
   - Maps the global per-stock features from the `global_stock_id_feats` dictionary onto the DataFrame by "stock_id".

3. `generate_all_features_lgbm(df)`:
   - Combines the features generated by `imbalance_features_lgbm` and `other_features_lgbm`.
   - Drops "row_id", "time_id", and "target" from the input, applies `imbalance_features_lgbm`, adds the time and stock-related features with `other_features_lgbm`, and then runs garbage collection to free memory.
   - Returns a DataFrame of the generated features, excluding "row_id", "target", "time_id", and "date_id".
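
To make the imbalance definitions concrete, here is a small worked example on made-up numbers. It mirrors the formulas described above with plain pandas and NumPy rather than calling the notebook's functions. The toy DataFrame and its values are assumptions chosen so the arithmetic is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Toy data: two stocks, two time steps each (values are invented)
toy = pd.DataFrame({
    "stock_id":  [0, 0, 1, 1],
    "ask_price": [4.0, 5.0, 8.0, 8.0],
    "bid_price": [1.0, 1.0, 2.0, 4.0],
    "wap":       [2.0, 3.0, 4.0, 6.0],
})

# Pairwise imbalance: (a - b) / (a + b)
toy["ask_bid_imb"] = (toy["ask_price"] - toy["bid_price"]) / (toy["ask_price"] + toy["bid_price"])

# Triplet imbalance: (max - mid) / (mid - min), NaN when mid == min
vals = np.sort(toy[["ask_price", "bid_price", "wap"]].to_numpy(), axis=1)
low, mid, high = vals[:, 0], vals[:, 1], vals[:, 2]
denom = np.where(mid == low, np.nan, mid - low)
toy["triplet_imb"] = (high - mid) / denom

# Per-stock lag feature: the diff never crosses from one stock to the next
toy["ask_price_diff_1"] = toy.groupby("stock_id")["ask_price"].diff(1)

print(toy)
```

For the first row the sorted prices are 1, 2, 4, so the triplet imbalance is (4 - 2) / (2 - 1) = 2. The first row of each stock has no previous value, so its diff is NaN.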


In [4]:
# Check if the code is running in offline or online mode
if is_offline:
    # In offline mode, split the data into training and validation sets based on the split_day
    df_train = df[df["date_id"] <= split_day]
    df_valid = df[df["date_id"] > split_day]
    
    # Display a message indicating offline mode and the shapes of the training and validation sets
    print("Offline mode")
    print(f"train : {df_train.shape}, valid : {df_valid.shape}")
else:
    # In online mode, use the entire dataset for training
    df_train = df
    
    # Display a message indicating online mode
    print("Online mode")


Online mode


In [5]:
if is_train:
    global_stock_id_feats = {
        "median_size": df_train.groupby("stock_id")["bid_size"].median() + df_train.groupby("stock_id")["ask_size"].median(),
        "std_size": df_train.groupby("stock_id")["bid_size"].std() + df_train.groupby("stock_id")["ask_size"].std(),
        "ptp_size": df_train.groupby("stock_id")["bid_size"].max() - df_train.groupby("stock_id")["bid_size"].min(),
        "median_price": df_train.groupby("stock_id")["bid_price"].median() + df_train.groupby("stock_id")["ask_price"].median(),
        "std_price": df_train.groupby("stock_id")["bid_price"].std() + df_train.groupby("stock_id")["ask_price"].std(),
        "ptp_price": df_train.groupby("stock_id")["bid_price"].max() - df_train.groupby("stock_id")["ask_price"].min(),
    }

# Bringing the LightGBM models into the mix

In [6]:
import os
import lightgbm as lgb

def imbalance_features_lgbm(df):
    # Define lists of price and size-related column names
    prices = ["reference_price", "far_price", "near_price", "ask_price", "bid_price", "wap"]
    sizes = ["matched_size", "bid_size", "ask_size", "imbalance_size"]
    df["volume"] = df.eval("ask_size + bid_size")
    df["mid_price"] = df.eval("(ask_price + bid_price) / 2")
    df["liquidity_imbalance"] = df.eval("(bid_size-ask_size)/(bid_size+ask_size)")
    df["matched_imbalance"] = df.eval("(imbalance_size-matched_size)/(matched_size+imbalance_size)")
    df["size_imbalance"] = df.eval("bid_size / ask_size")

    for c in combinations(prices, 2):
        df[f"{c[0]}_{c[1]}_imb"] = df.eval(f"({c[0]} - {c[1]})/({c[0]} + {c[1]})")

    for c in [['ask_price', 'bid_price', 'wap', 'reference_price'], sizes]:
        triplet_feature = calculate_triplet_imbalance_numba(c, df)
        df[triplet_feature.columns] = triplet_feature.values
   
    df["imbalance_momentum"] = df.groupby(['stock_id'])['imbalance_size'].diff(periods=1) / df['matched_size']
    df["price_spread"] = df["ask_price"] - df["bid_price"]
    df["spread_intensity"] = df.groupby(['stock_id'])['price_spread'].diff()
    df['price_pressure'] = df['imbalance_size'] * (df['ask_price'] - df['bid_price'])
    df['market_urgency'] = df['price_spread'] * df['liquidity_imbalance']
    df['depth_pressure'] = (df['ask_size'] - df['bid_size']) * (df['far_price'] - df['near_price'])
    
    # Calculate various statistical aggregation features
    for func in ["mean", "std", "skew", "kurt"]:
        df[f"all_prices_{func}"] = df[prices].agg(func, axis=1)
        df[f"all_sizes_{func}"] = df[sizes].agg(func, axis=1)
        

    for col in ['matched_size', 'imbalance_size', 'reference_price', 'imbalance_buy_sell_flag']:
        for window in [1, 2, 3, 10]:
            df[f"{col}_shift_{window}"] = df.groupby('stock_id')[col].shift(window)
            df[f"{col}_ret_{window}"] = df.groupby('stock_id')[col].pct_change(window)
    
    # Calculate diff features for specific columns
    for col in ['ask_price', 'bid_price', 'ask_size', 'bid_size', 'market_urgency', 'imbalance_momentum', 'size_imbalance']:
        for window in [1, 2, 3, 10]:
            df[f"{col}_diff_{window}"] = df.groupby("stock_id")[col].diff(window)
    return df.replace([np.inf, -np.inf], 0)

def other_features_lgbm(df):
    df["dow"] = df["date_id"] % 5  # Day of the week
    df["seconds"] = df["seconds_in_bucket"] % 60  
    df["minute"] = df["seconds_in_bucket"] // 60  
    for key, value in global_stock_id_feats.items():
        df[f"global_{key}"] = df["stock_id"].map(value.to_dict())

    return df

def generate_all_features_lgbm(df):
    # Select relevant columns for feature generation
    cols = [c for c in df.columns if c not in ["row_id", "time_id", "target"]]
    df = df[cols]
    # Generate imbalance features
    df = imbalance_features_lgbm(df)
    df = other_features_lgbm(df)
    gc.collect()  
    feature_name = [i for i in df.columns if i not in ["row_id", "target", "time_id", "date_id"]]
    return df[feature_name]

# model_save_path = '/kaggle/input/lightgbm-models/modelitos_para_despues'
# num_folds = 5  # The number of folds you used during training

# loaded_models = []

# # Load each model
# for i in range(1, num_folds + 1):
#     model_filename = os.path.join(model_save_path, f'doblez_{i}.txt')
#     if os.path.exists(model_filename):
#         loaded_model = lgb.Booster(model_file=model_filename)
#         loaded_models.append(loaded_model)
#         print(f"Model for fold {i} loaded from {model_filename}")
#     else:
#         print(f"Model file {model_filename} not found.")

# # Load the final model
# final_model_filename = os.path.join(model_save_path, 'doblez-conjunto.txt')
# if os.path.exists(final_model_filename):
#     final_model = lgb.Booster(model_file=final_model_filename)
#     loaded_models.append(final_model)
#     print(f"Final model loaded from {final_model_filename}")
# else:
#     print(f"Final model file {final_model_filename} not found.")

# Now 'loaded_models' contains the models loaded from the files


I think some of the models below are better than others. You can choose which ones to load by editing the `folders` list.
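
Besides dropping folders, another option is to replace the equal weights with hand-picked ones. The sketch below shows only the weighted-average step on stand-in arrays. `model_preds` and `raw_weights` are hypothetical names and values; in the notebook, each row would come from one model's `predict` output.

```python
import numpy as np

# Stand-in predictions: one row per model, one column per sample (invented numbers)
model_preds = np.array([
    [1.0, -2.0, 0.5],
    [3.0,  0.0, 1.5],
    [2.0, -1.0, 1.0],
])

# Hypothetical preferences: trust the first model twice as much, drop the third
raw_weights = np.array([2.0, 1.0, 0.0])
weights = raw_weights / raw_weights.sum()  # normalize so the weights sum to 1

blended = weights @ model_preds            # weighted average across models
print(blended)
```

Setting a weight to zero has the same effect as leaving that model out of the ensemble.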

In [7]:
import os
import lightgbm as lgb

def load_models_from_folder(model_save_path, num_folds=5):
    loaded_models = []

    # Load each fold model
    for i in range(1, num_folds + 1):
        model_filename = os.path.join(model_save_path, f'doblez_{i}.txt')
        if os.path.exists(model_filename):
            loaded_model = lgb.Booster(model_file=model_filename)
            loaded_models.append(loaded_model)
            print(f"Model for fold {i} loaded from {model_filename}")
        else:
            print(f"Model file {model_filename} not found.")

    # Load the final model
    final_model_filename = os.path.join(model_save_path, 'doblez-conjunto.txt')
    if os.path.exists(final_model_filename):
        final_model = lgb.Booster(model_file=final_model_filename)
        loaded_models.append(final_model)
        print(f"Final model loaded from {final_model_filename}")
    else:
        print(f"Final model file {final_model_filename} not found.")
    
    return loaded_models

# Assuming you have a list of folders from which to load the models
folders = [
    '/kaggle/input/lightgbm-models/modelitos_para_despues',
    '/kaggle/input/ensemble-of-models/results/modelitos_para_despues',
    '/kaggle/input/ensemble-of-models/results (1)/modelitos_para_despues',
    '/kaggle/input/ensemble-of-models/results (2)/modelitos_para_despues',
    '/kaggle/input/ensemble-of-models/results (3)/modelitos_para_despues',
    '/kaggle/input/ensemble-of-models/results (4)/modelitos_para_despues',
    '/kaggle/input/ensemble-of-models/results (5)/modelitos_para_despues',
    '/kaggle/input/ensemble-of-models/results (6)/modelitos_para_despues',
    '/kaggle/input/ensemble-of-models/results (7)/modelitos_para_despues',
]
num_folds = 5  # Number of fold models saved in each folder
all_loaded_models = []
for folder in folders:
    all_loaded_models.extend(load_models_from_folder(folder, num_folds))


Model for fold 1 loaded from /kaggle/input/lightgbm-models/modelitos_para_despues/doblez_1.txt
Model for fold 2 loaded from /kaggle/input/lightgbm-models/modelitos_para_despues/doblez_2.txt
Model for fold 3 loaded from /kaggle/input/lightgbm-models/modelitos_para_despues/doblez_3.txt
Model for fold 4 loaded from /kaggle/input/lightgbm-models/modelitos_para_despues/doblez_4.txt
Model for fold 5 loaded from /kaggle/input/lightgbm-models/modelitos_para_despues/doblez_5.txt
Final model loaded from /kaggle/input/lightgbm-models/modelitos_para_despues/doblez-conjunto.txt
Model for fold 1 loaded from /kaggle/input/ensemble-of-models/results/modelitos_para_despues/doblez_1.txt
Model for fold 2 loaded from /kaggle/input/ensemble-of-models/results/modelitos_para_despues/doblez_2.txt
Model for fold 3 loaded from /kaggle/input/ensemble-of-models/results/modelitos_para_despues/doblez_3.txt
Model for fold 4 loaded from /kaggle/input/ensemble-of-models/results/modelitos_para_despues/doblez_4.txt
Mode

In [8]:
from sklearn.metrics import mean_absolute_error
import numpy as np
import pandas as pd
import time

def zero_sum(prices, volumes):
    """Shift predictions so they sum to zero, weighting each row's adjustment by sqrt(volume)."""
    std_error = np.sqrt(volumes)
    step = np.sum(prices) / np.sum(std_error)
    out = prices - std_error * step
    return out

# Inference
if is_infer:
    import optiver2023
    env = optiver2023.make_env()  # Setting up the environment for the competition
    iter_test = env.iter_test()   # Getting the iterator for the test set
    counter = 0                   # Initializing a counter
    y_min, y_max = -64, 64        # Setting prediction boundaries
    qps = []                      # Elapsed seconds per iteration (despite the name)
    cache = pd.DataFrame()        # Initializing a cache to store test data
    
    model_weights = [1/len(all_loaded_models)] * len(all_loaded_models)
    
    for (test_df, revealed_targets, sample_prediction_df) in iter_test:
        test_df = test_df.drop('currently_scored', axis=1)
        now_time = time.time()    # Current time for performance measurement
        print('counter:', counter)
        # Concatenating new test data with the cache, keeping only the last 21 observations per stock_id
        cache = pd.concat([cache, test_df], ignore_index=True, axis=0)
        if counter > 0:
            cache = cache.groupby('stock_id').tail(21).reset_index(drop=True)
        
        feat = generate_all_features_lgbm(cache)[-len(test_df):]
        pred = model_weights[0] * all_loaded_models[0].predict(feat)
        # Generate predictions for each model and calculate the weighted average
        for model, weight in zip(all_loaded_models[1:], model_weights[1:]):
            pred += weight * model.predict(feat)
        
        # Apply your zero-sum and clipping operations
        pred = zero_sum(pred, test_df['bid_size'] + test_df['ask_size'])
        clipped_predictions = np.clip(pred, y_min, y_max)
        
        # Set the predictions in the sample_prediction_df
        sample_prediction_df['target'] = clipped_predictions
        
        # Use the environment to make predictions
        env.predict(sample_prediction_df)
        
        counter += 1
        qps.append(time.time() - now_time)
        
        if counter % 10 == 0:
            print(f"{counter} queries per second: {np.mean(qps)}")

    time_cost = 1.146 * np.mean(qps)
    print(f"The code will take approximately {np.round(time_cost, 2)} hours to run")


This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.
counter: 0
counter: 1
counter: 2
counter: 3
counter: 4
counter: 5
counter: 6
counter: 7
counter: 8
counter: 9
10 queries per second: 6.037146282196045
counter: 10
counter: 11
counter: 12
counter: 13
counter: 14
counter: 15
counter: 16
counter: 17
counter: 18
counter: 19
20 queries per second: 5.742031657695771
counter: 20
counter: 21
counter: 22
counter: 23
counter: 24
counter: 25
counter: 26
counter: 27
counter: 28
counter: 29
30 queries per second: 5.637731218338013
counter: 30
counter: 31
counter: 32
counter: 33
counter: 34
counter: 35
counter: 36
counter: 37
counter: 38
counter: 39
40 queries per second: 5.810316443443298
counter: 40
counter: 41
counter: 42
counter: 43
counter: 44
counter: 45
counter: 46
counter: 47
counter: 48
counter: 49
50 queries per second: 5.800113725662231
counter: 50
counter: 51
counter: 52
counter: 53
counter: 54
counter: 55
counter: