We used the Darts NBEATSModel for multi-step time series forecasting, restricting the experiment to two contracts (Fri01Q1 and Mon01Q1) rather than the entire dataset to reduce computational load and complexity. Training data come from the 2021-2023 file and test data from the 2024 file, and each contract yields separate high, low, close, and volume time series. We also built a contract-ID covariate series extended over the forecast horizon, although the fit and predict calls below use the target series only. N-BEATS was selected because its generic configuration learns patterns directly from the raw series, without explicit seasonal or trend components, and because a single model can be trained across the different feature series.

In [4]:
pip install darts

Collecting darts
  Downloading darts-0.31.0-py3-none-any.whl.metadata (52 kB)
Collecting nfoursid>=1.0.0 (from darts)
  Downloading nfoursid-1.0.1-py3-none-any.whl.metadata (1.9 kB)
Collecting pmdarima>=1.8.0 (from darts)
  Downloading pmdarima-2.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl.metadata (7.8 kB)
Collecting pyod>=0.9.5 (from darts)
  Downloading pyod-2.0.2.tar.gz (165 kB)
  Preparing metadata (setup.py) ... done
Collecting statsforecast>=1.4 (from darts)
  Downloading statsforecast-1.7.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (28 kB)
Coll

In [5]:
pip install dask[dataframe]

Collecting dask-expr<1.2,>=1.1 (from dask[dataframe])
  Downloading dask_expr-1.1.16-py3-none-any.whl.metadata (2.5 kB)
INFO: pip is looking at multiple versions of dask-expr to determine which version is compatible with other requirements. This could take a while.
  Downloading dask_expr-1.1.15-py3-none-any.whl.metadata (2.5 kB)
  Downloading dask_expr-1.1.14-py3-none-any.whl.metadata (2.5 kB)
  Downloading dask_expr-1.1.13-py3-none-any.whl.metadata (2.5 kB)
Downloading dask_expr-1.1.13-py3-none-any.whl (242 kB)
Installing collected packages: dask-expr
Successfully installed dask-expr-1.1.13


In [6]:
import numpy as np
import pandas as pd
from darts import TimeSeries
from darts.models import NBEATSModel
from darts.dataprocessing.transformers import Scaler
from sklearn.preprocessing import LabelEncoder
import torch
from scipy.stats import norm
from darts import concatenate
# FP16 Precision Setup
torch.set_default_dtype(torch.float16)

In [7]:
from google.colab import drive

# Mount your Google Drive
drive.mount('/content/drive/')
file_path_train = '/content/drive/MyDrive/TRAIN_Reco_2021_2022_2023.parquet.gzip'
file_path_test = '/content/drive/MyDrive/TEST_Reco_2024.parquet.gzip'
df = pd.read_parquet(file_path_train, engine='pyarrow')
ds = pd.read_parquet(file_path_test, engine='pyarrow')

hlcv_columns = ['high', 'low', 'close', 'volume']
df[hlcv_columns] = df[hlcv_columns].astype('float16')

Mounted at /content/drive/


In [8]:
# Display the first few lines of the DataFrame
print(df.head())


                                ID  high  low  close  volume
ExecutionTime                                               
2021-01-06 21:45:00+01:00  Fri00Q1   0.0  0.0    0.0     0.0
2021-01-06 22:00:00+01:00  Fri00Q1   0.0  0.0    0.0     0.0
2021-01-06 22:15:00+01:00  Fri00Q1   0.0  0.0    0.0     0.0
2021-01-06 22:30:00+01:00  Fri00Q1   0.0  0.0    0.0     0.0
2021-01-06 22:45:00+01:00  Fri00Q1   0.0  0.0    0.0     0.0


In [9]:
# Keep only the two contracts of interest; .copy() avoids SettingWithCopyWarning on the assignments below
df = df[df['ID'].isin(['Fri01Q1', 'Mon01Q1'])].copy()
ds = ds[ds['ID'].isin(['Fri01Q1', 'Mon01Q1'])].copy()

df.reset_index(inplace=True)
df['ExecutionTime'] = df['ExecutionTime'].dt.tz_convert('UTC')  # Convert to UTC
df['ExecutionTime'] = df['ExecutionTime'].dt.tz_localize(None)  # Drop the timezone, keeping UTC wall time
df = df.sort_values(by=['ID', 'ExecutionTime'])

ds.reset_index(inplace=True)
ds['ExecutionTime'] = ds['ExecutionTime'].dt.tz_convert('UTC')  # Convert to UTC
ds['ExecutionTime'] = ds['ExecutionTime'].dt.tz_localize(None)  # Drop the timezone, keeping UTC wall time
ds = ds.sort_values(by=['ID', 'ExecutionTime'])

# Encode contract IDs as integers; the encoder is fitted on the training IDs only
le = LabelEncoder()
df['ID'] = le.fit_transform(df['ID'])
ds['ID'] = le.transform(ds['ID'])
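
A note for reading the per-contract scores printed later: scikit-learn's LabelEncoder assigns integer codes in sorted order of the distinct labels, so with the two IDs kept here Fri01Q1 becomes 0 and Mon01Q1 becomes 1. The sketch below mimics that mapping in plain Python; `encode_labels` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each distinct label to its position in the sorted list of labels."""
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[label] for label in labels], classes

codes, classes = encode_labels(['Mon01Q1', 'Fri01Q1', 'Mon01Q1'])
print(classes)  # ['Fri01Q1', 'Mon01Q1']
print(codes)    # [1, 0, 1]
```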


In [10]:
# Rank-based Gaussian transform for non-zero values; zeros are replaced with uniform noise in [0, 1)
def transform_column(column, use_train_params=False, train_non_zeros=None, train_non_zero_ranks=None):
    non_zero_values = column[column != 0]
    zero_values = column[column == 0]
    non_zero_ranks = None  # Stays None if the column has no non-zero values

    if use_train_params:
        # Note: ranks are recomputed from the column itself; train_non_zeros only gates
        # whether the transform runs, and train_non_zero_ranks is passed through unchanged
        if len(train_non_zeros) > 0:
            test_ranks = (non_zero_values.rank() - 0.5) / len(non_zero_values)
            non_zero_transformed = pd.Series(norm.ppf(test_ranks), index=non_zero_values.index)
        else:
            non_zero_transformed = pd.Series(index=non_zero_values.index, dtype='float64')
    else:
        if len(non_zero_values) > 0:
            non_zero_ranks = (non_zero_values.rank() - 0.5) / len(non_zero_values)
            non_zero_transformed = pd.Series(norm.ppf(non_zero_ranks), index=non_zero_values.index)
        else:
            non_zero_transformed = pd.Series(index=non_zero_values.index, dtype='float64')

    # Each zero is replaced by an unseeded uniform draw, so results differ between runs
    zero_transformed = pd.Series(np.random.uniform(low=0.0, high=1.0, size=len(zero_values)), index=zero_values.index)
    final_transformed_column = pd.concat([non_zero_transformed, zero_transformed]).sort_index()

    ranks_out = train_non_zero_ranks if use_train_params else non_zero_ranks
    return final_transformed_column, non_zero_values, ranks_out

# Apply transformation to the training data
def process_training_data(dataframe):
    transformed_columns = {}
    training_parameters = {}

    for feature in ['high', 'low', 'close', 'volume']:
        transformed_column, train_non_zeros, train_non_zero_ranks = transform_column(dataframe[feature])
        transformed_columns[feature] = transformed_column
        training_parameters[feature] = (train_non_zeros, train_non_zero_ranks)

    return pd.DataFrame(transformed_columns), training_parameters

# Apply transformation to the test data using training parameters
def process_test_data(test_dataframe, training_parameters):
    transformed_test_columns = {}

    for feature in ['high', 'low', 'close', 'volume']:
        train_non_zeros, train_non_zero_ranks = training_parameters[feature]
        transformed_column, _, _ = transform_column(test_dataframe[feature], use_train_params=True, train_non_zeros=train_non_zeros, train_non_zero_ranks=train_non_zero_ranks)
        transformed_test_columns[feature] = transformed_column

    return pd.DataFrame(transformed_test_columns)

# Process training and test datasets
train_transformed, transformation_params = process_training_data(df)
df[hlcv_columns] = train_transformed[hlcv_columns]
df.set_index('ExecutionTime', inplace=True)

test_transformed = process_test_data(ds, transformation_params)
ds[hlcv_columns] = test_transformed[hlcv_columns]
ds.set_index('ExecutionTime', inplace=True)

df_train = df
df_test = ds
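
The transform applied to the non-zero values above is a rank-based Gaussianization: a value with rank r among n values is mapped to the quantile (r - 0.5) / n and then through the inverse standard-normal CDF. The following is a minimal standard-library sketch of that idea, assuming no tied values (pandas' `rank()` would average ties instead); `rank_gaussianize` is a hypothetical name used only for illustration, and `NormalDist().inv_cdf` stands in for `norm.ppf`.

```python
from statistics import NormalDist

def rank_gaussianize(values):
    """Map values to standard-normal scores via (rank - 0.5) / n; assumes no ties."""
    n = len(values)
    # Indices of the values in ascending order
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for position, index in enumerate(order, start=1):
        ranks[index] = position  # 1-based rank of each original value
    std_normal = NormalDist()
    return [std_normal.inv_cdf((rank - 0.5) / n) for rank in ranks]

scores = rank_gaussianize([3.2, 0.7, 15.0])
print([round(s, 3) for s in scores])
```

The middle value (3.2) lands on the median and maps to 0, while the smallest and largest values map to symmetric negative and positive scores. Only the ordering of the inputs matters, not their magnitudes.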


In [11]:
# Function to generate TimeSeries for each contract and contract ID
def generate_time_series_by_contract(dataframe):
    series_by_contract = {}
    id_series_by_contract = {}

    # Iterate through each unique contract ID
    for unique_id in dataframe['ID'].unique():
        contract_specific_data = dataframe[dataframe['ID'] == unique_id]

        # Create TimeSeries objects for 'high', 'low', 'close', and 'volume'
        high_series = TimeSeries.from_dataframe(contract_specific_data, value_cols='high')
        low_series = TimeSeries.from_dataframe(contract_specific_data, value_cols='low')
        close_series = TimeSeries.from_dataframe(contract_specific_data, value_cols='close')
        volume_series = TimeSeries.from_dataframe(contract_specific_data, value_cols='volume')

        # Create a TimeSeries for the contract ID (used as a covariate)
        id_series = TimeSeries.from_dataframe(contract_specific_data, value_cols='ID')

        series_by_contract[unique_id] = [high_series, low_series, close_series, volume_series]
        id_series_by_contract[unique_id] = id_series

    return series_by_contract, id_series_by_contract

# Generate the time series for contracts and IDs
series_by_contract, id_series_by_contract = generate_time_series_by_contract(df)


In [12]:
# Initialize dictionaries to hold train and test data for each contract
train_series_by_contract = {}
test_series_by_contract = {}
train_id_series_by_contract = {}
test_id_series_by_contract = {}

# Iterate over each contract ID and its corresponding series
for unique_contract_id, series_collection in series_by_contract.items():
    # Filter the training and testing data based on contract ID
    train_data_for_contract = df_train[df_train['ID'] == unique_contract_id]
    test_data_for_contract = df_test[df_test['ID'] == unique_contract_id]

    # Create training series for HLCV using df_train
    high_train_series = TimeSeries.from_dataframe(train_data_for_contract, value_cols='high')
    low_train_series = TimeSeries.from_dataframe(train_data_for_contract, value_cols='low')
    close_train_series = TimeSeries.from_dataframe(train_data_for_contract, value_cols='close')
    volume_train_series = TimeSeries.from_dataframe(train_data_for_contract, value_cols='volume')

    train_series_by_contract[unique_contract_id] = [high_train_series, low_train_series, close_train_series, volume_train_series]

    # Create test series for HLCV using df_test
    high_test_series = TimeSeries.from_dataframe(test_data_for_contract, value_cols='high')
    low_test_series = TimeSeries.from_dataframe(test_data_for_contract, value_cols='low')
    close_test_series = TimeSeries.from_dataframe(test_data_for_contract, value_cols='close')
    volume_test_series = TimeSeries.from_dataframe(test_data_for_contract, value_cols='volume')

    test_series_by_contract[unique_contract_id] = [high_test_series, low_test_series, close_test_series, volume_test_series]

    # Create ID series for training and testing
    train_id_series = TimeSeries.from_dataframe(train_data_for_contract, value_cols='ID')
    test_id_series = TimeSeries.from_dataframe(test_data_for_contract, value_cols='ID')

    train_id_series_by_contract[unique_contract_id] = train_id_series
    test_id_series_by_contract[unique_contract_id] = test_id_series


In [13]:
# Function to extend future covariates for a contract
def extend_contract_covariates(id_series_train, steps, time_freq='15min'):
    """
    Extends the future covariates by appending `steps` into the future.
    Ensures new timestamps align with the existing series.
    """
    # Get the last timestamp from the existing series
    last_timestamp = id_series_train.end_time()

    # Generate future timestamps starting from the next time step
    future_timestamps = pd.date_range(start=last_timestamp + pd.Timedelta(time_freq), periods=steps, freq=time_freq)

    # Repeat the last contract ID for the future covariate values
    repeated_covariate_values = np.repeat(id_series_train.values()[-1], steps)

    # Create a TimeSeries for the future covariates with generated timestamps
    future_covariate_series = TimeSeries.from_times_and_values(future_timestamps, repeated_covariate_values,
                                                               freq=id_series_train.freq_str)

    # Append the future covariate series to the original one
    complete_covariate_series = id_series_train.append(future_covariate_series)

    return complete_covariate_series
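
To make the timestamp logic in `extend_contract_covariates` concrete, here is a small standard-library sketch of the same idea: start one step after the last known timestamp and repeat the last covariate value over the new range. The end date and the helper name `future_timestamps` are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

def future_timestamps(last_timestamp, steps, step=timedelta(minutes=15)):
    """Return `steps` timestamps that follow `last_timestamp` at a fixed spacing."""
    return [last_timestamp + step * k for k in range(1, steps + 1)]

last_seen = datetime(2023, 12, 31, 23, 45)   # hypothetical end of a training series
extension = future_timestamps(last_seen, steps=4)
covariate_values = [1] * len(extension)      # repeat the last known contract ID

print(extension[0])   # 2024-01-01 00:00:00
print(extension[-1])  # 2024-01-01 00:45:00
```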


In [21]:
# Optimized N-BEATS model for multi-step time series forecasting
nbeats_optimized = NBEATSModel(
    input_chunk_length=10,  # Input sequence length for the model
    output_chunk_length=2400,  # Prediction length in each forward pass
    generic_architecture=True,  # Use the general architecture of N-BEATS
    num_blocks=2,  # Number of building blocks in the model
    num_layers=2,  # Layers in each block
    layer_widths=128,  # Layer width inside blocks
    n_epochs=1,  # Train for only 1 epoch
    batch_size=16,  # Number of samples processed at once
    dropout=0.1,  # Dropout rate for regularization
    random_state=45,  # Set seed for reproducibility
    optimizer_kwargs={'lr': 1e-2},  # Optimizer learning rate
    force_reset=True  # Force reset to clear previous model states
)
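
It may help to translate the two chunk lengths into wall-clock time. Assuming the 15-minute spacing seen in the data preview and used as the default in `extend_contract_covariates`, the arithmetic is:

```python
from datetime import timedelta

bar = timedelta(minutes=15)   # assumed spacing between consecutive observations
input_chunk_length = 10
output_chunk_length = 2400

lookback = input_chunk_length * bar
horizon = output_chunk_length * bar

print(lookback)  # 2:30:00
print(horizon)   # 25 days, 0:00:00
```

Under that assumption, each training sample asks the model to map 2.5 hours of history to 25 days of future values in a single forward pass.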


In [22]:
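def calculate_smape(true_values, predicted_values):
    # Symmetric MAPE in percent (0 to 200); points where both values are 0 count as zero error
    numerator = 2 * np.abs(predicted_values - true_values)
    denominator = np.abs(predicted_values) + np.abs(true_values)
    ratio = np.divide(numerator, denominator, out=np.zeros_like(numerator, dtype=float), where=denominator != 0)
    return 100 * np.mean(ratio)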

steps = 2400

# Training and prediction for optimized N-BEATS model
for contract_id in series_by_contract.keys():
    high_train_series, low_train_series, close_train_series, volume_train_series = train_series_by_contract[contract_id]
    contract_id_train_series = train_id_series_by_contract[contract_id]  # Contract ID covariate

    # Extend the covariates for future steps (built here, but not passed to fit() or predict() below)
    extended_contract_id_series = extend_contract_covariates(contract_id_train_series, steps=steps)

    # Fit the optimized N-BEATS model with the input series
    nbeats_optimized.fit([high_train_series, low_train_series, close_train_series, volume_train_series])

    # Predict using the N-BEATS model for the extended period
    forecast_high = nbeats_optimized.predict(n=steps, series=high_train_series)
    forecast_low = nbeats_optimized.predict(n=steps, series=low_train_series)
    forecast_close = nbeats_optimized.predict(n=steps, series=close_train_series)
    forecast_volume = nbeats_optimized.predict(n=steps, series=volume_train_series)

    # Extract individual test series from the dictionary
    test_high_series, test_low_series, test_close_series, test_volume_series = test_series_by_contract[contract_id]

    # Evaluate SMAPE for each series separately
    for predicted_series, actual_series, feature_name in zip(
        [forecast_high, forecast_low, forecast_close, forecast_volume],
        [test_high_series, test_low_series, test_close_series, test_volume_series],
        ['high', 'low', 'close', 'volume']):

        future_test_series = actual_series.slice_intersect(predicted_series)

        # Calculate SMAPE error
        smape_error = calculate_smape(future_test_series.values(), predicted_series.slice_intersect(future_test_series).values())
        print(f"sMAPE for {feature_name} in contract {contract_id}: {smape_error:.2f}%")


INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name            | Type             | Params | Mode 
-------------------------------------------------------------
0 | criterion       | MSELoss          | 0      | train
1 | train_criterion | MSELoss          | 0      | train
2 | val_criterion   | MSELoss          | 0      | train
3 | train_metrics   | MetricCollection | 0      | train
4 | val_metrics     | MetricCollection | 0      | train
5 | stacks          | ModuleList       | 2.0 M  | train
-------------------------------------------------------------
2.0 M     Trainable params
705       Non-trainable params
2.0 M     Total params
8.081     Total estimated m

Training: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=1` reached.

sMAPE for high in contract 0: 200.00%
sMAPE for low in contract 0: 200.00%
sMAPE for close in contract 0: 200.00%
sMAPE for volume in contract 0: 200.00%


sMAPE for high in contract 1: 199.12%
sMAPE for low in contract 1: 199.35%
sMAPE for close in contract 1: 199.15%
sMAPE for volume in contract 1: 198.66%


Score Explanation:
The sMAPE scores all fall between 198.66% and 200.00%, which indicates poor predictive performance: the formula used here cannot exceed 200%, and it returns exactly 200% for every point where the forecast is zero or has the opposite sign to the true value. This high error likely stems from inadequate model training (only 1 epoch), leading to predictions that are far off from the true values.
However, one epoch here produces almost the same result as 10 epochs in the LSMT_with_active model.
Most likely, with more computational power and more epochs, we could get much better results.
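
To illustrate how a score of 200% arises, the sketch below applies the same sMAPE formula to a few made-up values in plain Python. The numbers are hypothetical and are not taken from the experiment; the sketch assumes no point has both values equal to zero.

```python
def smape(true_values, predicted_values):
    """Symmetric MAPE in percent; assumes no point has both values equal to zero."""
    terms = [2 * abs(p - t) / (abs(p) + abs(t))
             for t, p in zip(true_values, predicted_values)]
    return 100 * sum(terms) / len(terms)

actual = [0.8, -1.2, 0.3]                       # made-up values for illustration

all_zero = smape(actual, [0.0, 0.0, 0.0])       # forecast is zero everywhere
wrong_sign = smape(actual, [-0.5, 2.0, -0.1])   # every forecast has the wrong sign
perfect = smape(actual, [0.8, -1.2, 0.3])       # forecast equals the truth

print(all_zero, wrong_sign, perfect)
```

Both the all-zero and the wrong-sign forecasts reach the 200% ceiling regardless of magnitude, so a score at that level says little about how far off the values are, only that almost no point has the correct sign.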