In LSTM(with nulll) model, we observed a very high sMAPE value indicating that the model was not performing well. 

In this approach, we preprocess the data for each asset individually by extracting the 'high,' 'low,' 'close,' and 'volume' columns and scaling them using MinMaxScaler. We then create sliding windows over the scaled data for each asset, where n_timesteps of past data are used as inputs (X), and the following 10 timesteps are predicted as outputs (y). This ensures asset-specific feature scaling and consistent input-output preparation for forecasting.

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_percentage_error
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

2024-10-21 15:03:02.762783: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [3]:
# Load the training data
train_df = pd.read_csv('TRAIN_Reco_2021_2022_2023.csv')

# Display the first few rows
train_df.head()

Unnamed: 0,ExecutionTime,ID,high,low,close,volume
0,2021-01-06 21:45:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0
1,2021-01-06 22:00:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0
2,2021-01-06 22:15:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0
3,2021-01-06 22:30:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0
4,2021-01-06 22:45:00+01:00,Fri00Q1,0.0,0.0,0.0,0.0


In [4]:
# Get the number of unique items in the 'ID' column
unique_items_count = train_df['ID'].nunique()

unique_items_count


672

In [8]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Define the forecasting horizon and look-back window
n_timesteps = 10  # Look-back window
forecast_horizon = 10  # Number of steps to predict

# Initialize the scaler
scaler = MinMaxScaler()

# Function to perform systematic sampling
def systematic_sampling(data, reduction_ratio=0.2):
    """
    Perform systematic sampling to reduce the number of samples.
    
    Args:
    data: The input data (NumPy array or DataFrame values).
    reduction_ratio: The fraction of data to retain (e.g., 0.2 for 20% of the original data).
    
    Returns:
    data_sampled: Systematically sampled input data.
    """
    k = int(1 / reduction_ratio)
    indices = np.arange(0, len(data), k)  # Systematically select every k-th sample
    return data[indices]

# Apply systematic sampling to the entire dataset before further processing
train_df_sampled = systematic_sampling(train_df.values, reduction_ratio=0.2)  # Use `.values` to convert DataFrame to NumPy array

# Check the size of the dataset after sampling
print(f"Original dataset size: {len(train_df)} rows")
print(f"Sampled dataset size: {len(train_df_sampled)} rows")
print(f"Percentage of data retained after sampling: {(len(train_df_sampled) / len(train_df)) * 100:.2f}%")


Original dataset size: 69805344 rows
Sampled dataset size: 13961069 rows
Percentage of data retained after sampling: 20.00%


In [11]:
# Convert train_df_sampled back to DataFrame if needed for further processing
train_df_sampled = pd.DataFrame(train_df_sampled, columns=train_df.columns)

# Now you can proceed with the loop code using the sampled data
asset_ids = train_df_sampled['ID'].unique()  # Unique asset IDs from sampled data

# Initialize empty lists to store inputs and outputs
X, y = [], []

# Loop through each asset and scale the features, then prepare sliding windows
for asset in asset_ids:
    # Extract the data for this asset
    asset_data = train_df_sampled[train_df_sampled['ID'] == asset][['high', 'low', 'close', 'volume']].values
    
    # Scale the data using MinMaxScaler
    asset_data_scaled = scaler.fit_transform(asset_data)  # Scaling for each asset separately

    # Create sliding windows for the asset
    for i in range(len(asset_data_scaled) - n_timesteps - forecast_horizon):
        X.append(asset_data_scaled[i:i+n_timesteps])  # Past `n_timesteps` for input
        y.append(asset_data_scaled[i+n_timesteps:i+n_timesteps+forecast_horizon])  # Next 10 timesteps for output

# Convert to numpy arrays
X_sampled = np.array(X)
y_sampled = np.array(y)

# Check the shapes of the processed data
print(f"Shape of X: {X_sampled.shape}")
print(f"Shape of y: {y_sampled.shape}")

Shape of X: (13947629, 10, 4)
Shape of y: (13947629, 10, 4)


In [16]:
import numpy as np

# Define the reduction ratio (keeping 60% of the data)
reduction_ratio = 0.6

# Function to perform systematic sampling
def systematic_sampling(data, reduction_ratio=0.6):
    """
    Perform systematic sampling to reduce the number of samples.
    
    Args:
    data: The input data (e.g., X_sampled or y_sampled).
    reduction_ratio: The fraction of data to retain (e.g., 0.6 for 60% of the original data).
    
    Returns:
    data_sampled: Systematically sampled input data.
    """
    num_samples = int(len(data) * reduction_ratio)  # Calculate the exact number of samples to keep
    indices = np.random.choice(len(data), size=num_samples, replace=False)  # Randomly select indices
    return data[indices]

# Apply systematic sampling to reduce the dataset by 40% (keeping 60%)
X_sampled_new = systematic_sampling(X_sampled, reduction_ratio=0.6)
y_sampled_new = systematic_sampling(y_sampled, reduction_ratio=0.6)

# Check the shapes of the reduced dataset
print(f"Reduced Shape of X_sampled: {X_sampled_new.shape}")
print(f"Reduced Shape of y_sampled: {y_sampled_new.shape}")


Reduced Shape of X_sampled: (8368577, 10, 4)
Reduced Shape of y_sampled: (8368577, 10, 4)


**LSTM - without active state as a feature**

Step 1: Data Preparation

1.1 Prepare Data for Scenario 1

Step 2: Creating Sliding Windows for All Assets

Now, we will reshape the dataset to handle multiple assets together while preparing for LSTM input.

Sliding Window: The sliding window approach means that instead of processing the entire dataset in one go, the model is trained on sequential windows of data. This reduces the load on memory and allows you to efficiently train on a smaller portion of the dataset.

In [18]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, TimeDistributed, Input

# Define the LSTM model
model = Sequential()

# Add an explicit Input layer
model.add(Input(shape=(n_timesteps, X_sampled_new.shape[2])))  # Input shape: (n_timesteps, number of features)

# First LSTM layer with 128 units and return sequences
model.add(LSTM(128, return_sequences=True))  # Returning the full sequence to the next LSTM layer
model.add(Dropout(0.2))  # Dropout to prevent overfitting

# Second LSTM layer with 64 units and return sequences (to match the time steps for forecasting)
model.add(LSTM(64, return_sequences=True))  # Return full sequence (for the next 10 timesteps)

# TimeDistributed Dense layer to output predictions for all features (HLCV) for all assets
model.add(TimeDistributed(Dense(y_sampled_new.shape[2])))  # Predict the next 10 timesteps, with 4 features (HLCV)

# Compile the model
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# Train the model
history = model.fit(X_sampled_new, y_sampled_new, epochs=10, batch_size=32,validation_split=0.2)

# Print model summary to see the architecture
model.summary()


Epoch 1/10
[1m209215/209215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2027s[0m 10ms/step - loss: 0.0149 - mae: 0.0714 - val_loss: 0.0148 - val_mae: 0.0720
Epoch 2/10
[1m209215/209215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2020s[0m 10ms/step - loss: 0.0149 - mae: 0.0713 - val_loss: 0.0149 - val_mae: 0.0697
Epoch 3/10
[1m209215/209215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1898s[0m 9ms/step - loss: 0.0148 - mae: 0.0713 - val_loss: 0.0149 - val_mae: 0.0692
Epoch 4/10
[1m209215/209215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1981s[0m 9ms/step - loss: 0.0148 - mae: 0.0713 - val_loss: 0.0148 - val_mae: 0.0707
Epoch 5/10
[1m209215/209215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1978s[0m 9ms/step - loss: 0.0148 - mae: 0.0713 - val_loss: 0.0148 - val_mae: 0.0717
Epoch 6/10
[1m209215/209215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1941s[0m 9ms/step - loss: 0.0148 - mae: 0.0713 - val_loss: 0.0148 - val_mae: 0.0724
Epoch 7/10
[1m209215/209215[0m

In [19]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Load the test dataset
test_df = pd.read_csv('TEST_Reco_2024.csv')

# Initialize the same scaler that was used for training
scaler = MinMaxScaler()

# Initialize empty lists for test data inputs and outputs
X_test, y_test_true = [], []

# Define the forecasting horizon and look-back window (same as training)
n_timesteps = 10  # Look-back window
forecast_horizon = 10  # Number of steps to predict

# Get the list of unique asset IDs from the test data
asset_ids_test = test_df['ID'].unique()


In [20]:

# Loop through each asset and scale the features, then prepare sliding windows
for asset in asset_ids_test:
    # Extract the data for this asset
    asset_data = test_df[test_df['ID'] == asset][['high', 'low', 'close', 'volume']].values
    
    # Scale the data using the same MinMaxScaler used for training
    asset_data_scaled = scaler.fit_transform(asset_data)

    # Create sliding windows for the asset (same process as training)
    for i in range(len(asset_data_scaled) - n_timesteps - forecast_horizon):
        X_test.append(asset_data_scaled[i:i+n_timesteps])  # Past 10 timesteps for input
        y_test_true.append(asset_data_scaled[i+n_timesteps:i+n_timesteps+forecast_horizon])  # Next 10 timesteps for output


In [21]:
# Convert to numpy arrays
X_test = np.array(X_test)
y_test_true = np.array(y_test_true)

# Make predictions using the trained LSTM model
y_test_pred = model.predict(X_test)


[1m186463/457338[0m [32m━━━━━━━━[0m[37m━━━━━━━━━━━━[0m [1m15:13[0m 3ms/step

IOStream.flush timed out


[1m457338/457338[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1648s[0m 4ms/step


In [22]:

# Reshape y_test_pred to match the shape of y_test_true
# Note: y_test_pred may need reshaping because it's often output in a flat format
y_test_pred = y_test_pred.reshape(y_test_true.shape[0], forecast_horizon, 4)  # Reshape to (samples, 10 timesteps, 4 features)

# Check shapes
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_test_true: {y_test_true.shape}")
print(f"Shape of y_test_pred: {y_test_pred.shape}")

# Ensure the shapes are identical for comparison
if y_test_true.shape == y_test_pred.shape:
    print("Shapes match, ready for evaluation!")
else:
    print(f"Shapes do not match. y_test_true: {y_test_true.shape}, y_test_pred: {y_test_pred.shape}")


Shape of X_test: (14634816, 10, 4)
Shape of y_test_true: (14634816, 10, 4)
Shape of y_test_pred: (14634816, 10, 4)
Shapes match, ready for evaluation!


In [23]:
# Function to calculate Symmetric Mean Absolute Percentage Error (sMAPE)
def smape(y_true, y_pred, epsilon=1e-10):
    numerator = np.abs(y_true - y_pred)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    smape_val = np.mean(2 * numerator / (denominator + epsilon)) * 100  # Adding epsilon to avoid division by zero
    return smape_val

# Calculate sMAPE between true test values and predicted test values
smape_score = smape(y_test_true, y_test_pred)

# Print the sMAPE score
print(f"sMAPE score on the test set: {smape_score}%")

sMAPE score on the test set: 281.95349117883126%


We still have a high sMAPE score but better than before. it indicates that the different approach to pre process the data was a good idea. 
The high score is likely due to the small scale of sub sample we took to train the data - 12%
Even with such a small sample it took around 400 minutes to run the code on CPU. 
It shows the power of computational power and how because of it we are able to revolutionse the technology with AI. 
In the next model, we will remove the null values which is approximately 80% of the data and see if we can have a better score with the training on all the assets and a much larger and better distribution of the data 