# Energy Market Code Prediction 
**Team Name : 404 Team Not Found**

**Members : Armand Hubler, Anthony-Lee Sanchez, Aswin Subramanian Maheswaran, Daniel Rosel and Marco Celati**

In this notebook, we develop a sequence-to-sequence model that transforms hourly energy consumption patterns into corresponding market codes. Our approach leverages an LSTM-based encoder-decoder architecture to handle variable-length sequences efficiently. This model can help in identifying key energy consumption trends and automatically generating market codes, paving the way for innovative applications in energy analytics and automated market insights.


> To avoid lengthy retraining (all training was done on an NVIDIA RTX 4090), our best model is available [here](https://drive.google.com/file/d/1EW5SHKrBPHstoQxKnck7VUUtZsfZ-Z46/view?usp=sharing).

## Setup and Dependencies

First, we install required packages and import necessary libraries:


In [33]:
!pip install matplotlib torchviz torch


In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.utils.rnn as rnn_utils


## Environment Configuration

For local execution we are using the following file structure:
- AIHackathon
    - notebooks (THIS IS THE `WORKDIR`)
    - data (this is where we load our data from)

Configure working directory and handle Google Colab integration if needed:

In [35]:
USE_COLAB = False
WORKDIR = "/home/velocitatem/Documents/Projects/JupyterNotebooks/University/Third Year/AIHackathon/notebooks/"
if USE_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    WORKDIR = "/content/drive/MyDrive/NTT Hackathon/data/"

## Data Loading and Preprocessing

Load the datasets and perform initial preprocessing:
- Load labeled OMIE data
- Load filtered categories
- Load unit list
- Load blind dataset
- Add temporal features
- Encode categorical variables

In [36]:

df_omie_labelled = pd.read_csv(WORKDIR+'../data/df_omie_labelled.csv')
df_filtered_cat = pd.read_csv(WORKDIR+'../data/filtered_categories.csv')
df_unit_list = pd.read_csv(WORKDIR+'../data/unit_list.csv')
df_blind = pd.read_csv(WORKDIR+'../data/df_omie_blind.csv')

df_omie_labelled.columns = ['code','description', 'datetime', 'price', 'energy']
df_blind.columns = ['datetime', 'price', 'energy']
df_blind['datetime'] = pd.to_datetime(df_blind['datetime'])
df_blind['date'] = df_blind['datetime'].dt.date
df_blind['time'] = df_blind['datetime'].dt.time
df_blind['day_of_week'] = df_blind['datetime'].dt.dayofweek
df_blind['hour'] = df_blind['datetime'].dt.hour
df_blind['is_weekend'] = (df_blind['day_of_week'] > 4).astype(int)
df_blind['hour_sin'] = np.sin(2 * np.pi * df_blind['hour'] / 24)
df_blind['hour_cos'] = np.cos(2 * np.pi * df_blind['hour'] / 24)
df_blind['dow_sin'] = np.sin(2 * np.pi * df_blind['day_of_week'] / 7)
df_blind['dow_cos'] = np.cos(2 * np.pi * df_blind['day_of_week'] / 7)
df_blind['energy_percentile'] = df_blind.groupby('hour')['energy'].rank(pct=True)
df_blind['price_percentile'] = df_blind.groupby('hour')['price'].rank(pct=True)
df_blind['month'] = df_blind['datetime'].dt.month
df_blind['day_of_month'] = df_blind['datetime'].dt.day
# remove datetime, date
blind_datetime = df_blind['datetime']
df_blind = df_blind.drop(columns=['datetime','date','time'])
df_omie_labelled['datetime'] = pd.to_datetime(df_omie_labelled['datetime'])
df_omie_labelled['date'] = df_omie_labelled['datetime'].dt.date
df_omie_labelled['time'] = df_omie_labelled['datetime'].dt.time
df_omie_labelled['day_of_week'] = df_omie_labelled['datetime'].dt.dayofweek
df_omie_labelled['hour'] = df_omie_labelled['datetime'].dt.hour
df_omie_labelled['is_weekend'] = (df_omie_labelled['day_of_week'] > 4).astype(int)
df_omie_labelled['hour_sin'] = np.sin(2 * np.pi * df_omie_labelled['hour'] / 24)
df_omie_labelled['hour_cos'] = np.cos(2 * np.pi * df_omie_labelled['hour'] / 24)
df_omie_labelled['dow_sin'] = np.sin(2 * np.pi * df_omie_labelled['day_of_week'] / 7)
df_omie_labelled['dow_cos'] = np.cos(2 * np.pi * df_omie_labelled['day_of_week'] / 7)
# for all data within the same hour, what percentile is a given record in for (energy)
df_omie_labelled['energy_percentile'] = df_omie_labelled.groupby('hour')['energy'].rank(pct=True)
df_omie_labelled['price_percentile'] = df_omie_labelled.groupby('hour')['price'].rank(pct=True)
df_omie_labelled['month'] = df_omie_labelled['datetime'].dt.month
df_omie_labelled['day_of_month'] = df_omie_labelled['datetime'].dt.day
# Encode the market code.
# text_encode=False: label-encode the whole code string.
# text_encode=True: split the code into a numeric part (code_num) and a text part (code_text).
text_encode = False
from sklearn.preprocessing import LabelEncoder
codele = LabelEncoder()
if text_encode:
    # extract the numerical value from code into code_num (must run on the raw string codes)
    df_omie_labelled['code_num'] = df_omie_labelled['code'].str.extract(r'(\d+)', expand=False).astype(int)
    # extract the text value from code into code_text
    df_omie_labelled['code_text'] = df_omie_labelled['code'].str.extract(r'([a-zA-Z]+)', expand=False)
    # label encode code_text
    df_omie_labelled['code_text'] = codele.fit_transform(df_omie_labelled['code_text'])
else:
    # label encode code
    df_omie_labelled['code'] = codele.fit_transform(df_omie_labelled['code'])
# labeled remove description
coltodrop = ['description','datetime','date','time']
if text_encode:
    coltodrop.append('code')
df_omie_labelled = df_omie_labelled.drop(columns=coltodrop)
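
The cell above applies the same calendar and cyclical features to both the labelled and the blind frame. A minimal sketch of how that shared logic could be factored into one helper is shown below. The function name `add_temporal_features` and the toy timestamps are assumptions for illustration, and the per-hour percentile features are left out for brevity.

```python
import numpy as np
import pandas as pd

def add_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with calendar and cyclical (sin/cos) features added."""
    out = df.copy()
    dt = pd.to_datetime(out["datetime"])
    out["day_of_week"] = dt.dt.dayofweek          # Monday=0 ... Sunday=6
    out["hour"] = dt.dt.hour
    out["is_weekend"] = (out["day_of_week"] > 4).astype(int)
    # Map hour and weekday onto a circle so that 23:00 and 00:00 end up close together.
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    out["dow_sin"] = np.sin(2 * np.pi * out["day_of_week"] / 7)
    out["dow_cos"] = np.cos(2 * np.pi * out["day_of_week"] / 7)
    out["month"] = dt.dt.month
    out["day_of_month"] = dt.dt.day
    return out

# Toy timestamps (illustrative only): midnight, 23:00 and noon.
toy = pd.DataFrame({"datetime": ["2024-02-29 00:00", "2024-02-29 23:00", "2024-06-01 12:00"]})
feats = add_temporal_features(toy)

# Distances in (hour_sin, hour_cos) space: midnight is near 23:00 and far from noon.
hour_xy = feats[["hour_sin", "hour_cos"]].to_numpy()
d_midnight_23 = np.linalg.norm(hour_xy[0] - hour_xy[1])
d_midnight_noon = np.linalg.norm(hour_xy[0] - hour_xy[2])
print(d_midnight_23 < d_midnight_noon)
```

The point of the sin/cos pair is that a raw `hour` column treats 23 and 0 as 23 units apart, while the encoded pair places them next to each other on the circle.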

## Market Class Definition (deprecated)

Define a Market class to handle energy market data and calculations:
- Track prices and energies
- Calculate averages and totals
- Store historical market data

In [37]:
class Market:
    def __init__(self):
        self.prices = []
        self.energies = []
        self.market_data = {}  # Store price and energy data with timestamps

    def add_record(self, timestamp, price, energy):
        """Adds a price and energy record to the market data."""
        self.prices.append(price)
        self.energies.append(energy)
        self.market_data[timestamp] = {'price': price, 'energy': energy}

    def calculate_average_price(self):
        """Calculates the average price of all records."""
        if not self.prices:
            return 0
        return sum(self.prices) / len(self.prices)

    def calculate_total_energy(self):
        """Calculates the total energy of all records."""
        return sum(self.energies)

    def get_market_summary(self):
        """Returns a summary of the market data."""
        avg_price = self.calculate_average_price()
        total_energy = self.calculate_total_energy()
        return {
            'average_price': avg_price,
            'total_energy': total_energy,
            'number_of_records': len(self.prices)
        }


In [38]:
def augment_data(df):
    """Augments the data by adding a new column with the average price of the market at that time."""
    market = Market()
    for index, row in df.iterrows():
        price = row['price']
        energy = row['energy']
        market.add_record(index, price, energy)
        market_summary = market.get_market_summary()
        df.at[index, 'average_price'] = market_summary['average_price']
        df.at[index, 'total_energy'] = market_summary['total_energy']
    return df

#augment_data(df_omie_labelled)

In [39]:
df_omie_labelled

Unnamed: 0,code,price,energy,day_of_week,hour,is_weekend,hour_sin,hour_cos,dow_sin,dow_cos,energy_percentile,price_percentile,month,day_of_month
0,0,0.00,0.0,3.0,0.0,0,0.0,1.000000,0.433884,-0.900969,0.227769,0.576258,2.0,29.0
1,1,0.00,0.0,3.0,0.0,0,0.0,1.000000,0.433884,-0.900969,0.227769,0.576258,2.0,29.0
2,2,0.00,46.2,3.0,0.0,0,0.0,1.000000,0.433884,-0.900969,0.889167,0.576258,2.0,29.0
3,3,0.00,0.0,3.0,0.0,0,0.0,1.000000,0.433884,-0.900969,0.227769,0.576258,2.0,29.0
4,4,-0.01,37.0,3.0,0.0,0,0.0,1.000000,0.433884,-0.900969,0.857781,0.244323,2.0,29.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
712259,9323,-1.00,19.7,5.0,22.0,1,-0.5,0.866025,-0.974928,-0.222521,0.730616,0.133371,6.0,1.0
712260,9324,-1.00,5.5,5.0,22.0,1,-0.5,0.866025,-0.974928,-0.222521,0.554658,0.133371,6.0,1.0
712261,9325,-4.00,0.1,5.0,22.0,1,-0.5,0.866025,-0.974928,-0.222521,0.444660,0.098097,6.0,1.0
712262,9326,-10.00,32.1,5.0,22.0,1,-0.5,0.866025,-0.974928,-0.222521,0.835635,0.070496,6.0,1.0


In [40]:
df_blind.head()

Unnamed: 0,price,energy,day_of_week,hour,is_weekend,hour_sin,hour_cos,dow_sin,dow_cos,energy_percentile,price_percentile,month,day_of_month
0,0.0,0.0,5,23,1,-0.258819,0.965926,-0.974928,-0.222521,0.220046,0.586534,6,1
1,0.0,0.0,5,23,1,-0.258819,0.965926,-0.974928,-0.222521,0.220046,0.586534,6,1
2,0.0,46.2,5,23,1,-0.258819,0.965926,-0.974928,-0.222521,0.906264,0.586534,6,1
3,0.0,0.0,5,23,1,-0.258819,0.965926,-0.974928,-0.222521,0.220046,0.586534,6,1
4,-0.25,40.0,5,23,1,-0.258819,0.965926,-0.974928,-0.222521,0.887822,0.265386,6,1


In [41]:
# remove where price is 0 or energy is 0
def remove_zeros(df):
    return df[(df['price'] != 0) & (df['energy'] != 0)]
df_blind = remove_zeros(df_blind)
df_omie_labelled = remove_zeros(df_omie_labelled)

## Building Series
From our dataset we create one chunk per hour. Each chunk gives us two aligned series within a reasonable length range, (energy, codes), so we can then match every energy point to its code.

In [42]:
def chunk_dataset(df):
    chunks = []
    # create chunks of hourly data: 0-1, 1-2, 2-3, ...
    # the first N records might fall into the same hour; the next chunk holds all records of the following hour
    # each chunk contains the records that share the same month, day of month and hour
    unique_hours = df['hour'].unique()
    unique_months = df['month'].unique()
    unique_days = df['day_of_month'].unique()
    for month in unique_months:
        for day in unique_days:
            for hour in unique_hours:
                chunk = df[(df['hour'] == hour) & (df['month'] == month) & (df['day_of_month'] == day)]
                if not chunk.empty:
                    chunks.append(chunk)
    return chunks
chunks = chunk_dataset(df_omie_labelled)
chunks[1]

Unnamed: 0,code,price,energy,day_of_week,hour,is_weekend,hour_sin,hour_cos,dow_sin,dow_cos,energy_percentile,price_percentile,month,day_of_month
320,4,-0.01,36.4,3.0,1.0,0,0.258819,0.965926,0.433884,-0.900969,0.855786,0.243624,2.0,29.0
323,7,-0.01,0.9,3.0,1.0,0,0.258819,0.965926,0.433884,-0.900969,0.491442,0.243624,2.0,29.0
324,8,-0.01,23.3,3.0,1.0,0,0.258819,0.965926,0.433884,-0.900969,0.779886,0.243624,2.0,29.0
325,9,-0.01,23.0,3.0,1.0,0,0.258819,0.965926,0.433884,-0.900969,0.777550,0.243624,2.0,29.0
326,10,0.20,20.4,3.0,1.0,0,0.258819,0.965926,0.433884,-0.900969,0.756717,0.893208,2.0,29.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
624,9320,-2.50,54.6,3.0,1.0,0,0.258819,0.965926,0.433884,-0.900969,0.920946,0.098950,2.0,29.0
625,9321,2.33,553.5,3.0,1.0,0,0.258819,0.965926,0.433884,-0.900969,0.993590,0.929777,2.0,29.0
627,9323,0.19,8.6,3.0,1.0,0,0.258819,0.965926,0.433884,-0.900969,0.625699,0.891571,2.0,29.0
628,9324,0.19,44.3,3.0,1.0,0,0.258819,0.965926,0.433884,-0.900969,0.883286,0.891571,2.0,29.0


In [43]:
dataset = []
for chunk in chunks:
    energy_array = chunk['energy'].values
    code_array = chunk['code'].values
    dataset.append({'energy_array': energy_array, 'code_array': code_array})
dataset = pd.DataFrame(dataset)
dataset

Unnamed: 0,energy_array,code_array
0,"[37.0, 1.7, 23.3, 22.3, 20.6, 30.6, 29.7, 43.7...","[4, 7, 8, 9, 10, 14, 16, 17, 18, 19, 22, 24, 2..."
1,"[36.4, 0.9, 23.3, 23.0, 20.4, 31.1, 29.5, 43.2...","[4, 7, 8, 9, 10, 14, 16, 17, 18, 19, 22, 24, 2..."
2,"[35.3, 0.9, 23.3, 23.8, 20.3, 30.9, 29.0, 43.0...","[4, 7, 8, 9, 10, 14, 16, 17, 18, 19, 22, 24, 2..."
3,"[32.7, 1.9, 23.3, 24.2, 20.1, 30.9, 28.0, 43.5...","[4, 7, 8, 9, 10, 14, 16, 17, 18, 19, 22, 24, 2..."
4,"[27.9, 3.6, 23.2, 24.5, 19.7, 31.3, 25.4, 44.0...","[4, 7, 8, 9, 10, 14, 16, 17, 18, 19, 22, 24, 2..."
...,...,...
2249,"[98.7, 18.4, 23.3, 40.0, 44.4, 41.5, 27.0, 3.3...","[0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14..."
2250,"[66.9, 12.0, 14.4, 40.0, 45.0, 39.7, 27.0, 5.0...","[0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14..."
2251,"[24.8, 4.0, 4.5, 40.0, 29.8, 37.4, 27.0, 7.8, ...","[0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14..."
2252,"[4.1, 0.1, 40.0, 37.3, 22.9, 27.0, 11.1, 20.3,...","[0, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1..."


In [44]:
# save dataset
dataset.to_csv(WORKDIR+'../data/dataset.csv', index=False)

In [45]:
dataset['energy_length'] = dataset['energy_array'].apply(lambda x: len(x))
dataset['code_length'] = dataset['code_array'].apply(lambda x: len(x))
dataset.describe()

Unnamed: 0,energy_length,code_length
count,2254.0,2254.0
mean,175.364685,175.364685
std,52.10266,52.10266
min,90.0,90.0
25%,123.0,123.0
50%,168.0,168.0
75%,228.0,228.0
max,265.0,265.0


# Model Architecture and Training

This model is a **sequence-to-sequence (Seq2Seq) model with an LSTM-based encoder-decoder architecture**, used for processing sequences of energy values (`energy_array`) and generating sequences of tokenized code (`code_array`). It includes **padding, teacher forcing, and evaluation using loss and accuracy metrics**.

## **1. Data Preparation and Chronological Splitting**
### **Why a chronological split?**
Since sequence-to-sequence tasks often involve temporal or ordered data, the dataset is **split chronologically** to ensure that the model does not see future information during training. The split is:
- **Train Set (about 67.5%)**: Used for learning.
- **Validation Set (12.5%)**: Used for hyperparameter tuning.
- **Test Set (20%)**: Used for final evaluation.
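
As a rough sketch of the arithmetic behind a positional (chronological) split, the helper below computes the three subset sizes. The function name `chronological_split_sizes` is hypothetical. The default fractions mirror what the code cell further down computes: 20% of all samples for testing and 12.5% of all samples for validation.

```python
def chronological_split_sizes(n_samples, val_frac=0.125, test_frac=0.2):
    """Sizes of train/val/test when the last rows are held out, in that order."""
    test_size = int(test_frac * n_samples)   # last rows: test
    val_size = int(val_frac * n_samples)     # rows just before test: validation
    train_size = n_samples - val_size - test_size
    return train_size, val_size, test_size

print(chronological_split_sizes(2254))                  # (1523, 281, 450)
print(chronological_split_sizes(2254, val_frac=0.10))   # (1579, 225, 450)
```

With 2254 hourly chunks, the first call reproduces the sample counts printed by the training cell. Passing `val_frac=0.10` gives a split closer to 70/10/20.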

### **Handling Variable-Length Sequences**
- Sequences differ in length (90 to 265 items per hourly chunk).
- The dataset is **padded** so that sequences within a batch have the same length.
- The **collate function** pads each batch and returns the true source lengths, which the encoder uses to pack the sequences.

---

## **2. Defining the Dataset and DataLoader**
The dataset consists of two sequence arrays:
- **Source (`energy_array`)**: Sequence of energy values (continuous numbers).
- **Target (`code_array`)**: Sequence of token indices (discrete integers representing a vocabulary).

Each sample contains:
1. **`src_tensor`** – Float tensor of shape `(seq_len, 1)`, where `seq_len` is the sequence length.
2. **`trg_tensor`** – Long tensor of shape `(seq_len,)`, representing the tokenized target sequence.
3. **`src_length`** – The length of the input sequence (used to pack the padded batch in the encoder).

### **Collate Function for Efficient Batching**
- **Padding:** Ensures that all sequences within a batch have the same length.
- **`pack_padded_sequence` in PyTorch:** Used to efficiently handle variable-length sequences inside the LSTM.
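
The padding and packing steps can be illustrated on two toy sequences. This is a minimal sketch with made-up values and a tiny LSTM (`hidden_size=4` is arbitrary), not the notebook's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.rnn as rnn_utils

# Two source sequences of different lengths, each of shape (seq_len, 1).
seqs = [torch.tensor([[37.0], [1.7], [23.3]]), torch.tensor([[36.4], [0.9]])]
lengths = torch.tensor([len(s) for s in seqs])  # true lengths, kept on the CPU

# Pad to the longest sequence in the batch: shape (batch, max_len, 1).
padded = rnn_utils.pad_sequence(seqs, batch_first=True, padding_value=0.0)

# Pack so the LSTM skips the padded positions; enforce_sorted=False lets
# the batch stay in its original order.
packed = rnn_utils.pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(input_size=1, hidden_size=4, batch_first=True)
_, (hidden, cell) = lstm(packed)

print(padded.shape)       # (batch, max_len, 1)
print(packed.data.shape)  # only the real time steps are stored
print(hidden.shape)       # (num_layers, batch, hidden_size)
```

Because the sequences are packed, the final hidden state of the shorter sequence corresponds to its last real element rather than to a padded position.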

---

## **3. Model Architecture**
The model follows the classic **Encoder-Decoder framework** for sequence generation, using **LSTMs (Long Short-Term Memory networks)**.

### **Encoder (Processes `energy_array`)**
- **Input**: `energy_array` (Float Tensor of shape `(batch_size, seq_len, 1)`)
- **Layers**:
  - **LSTM Layer**: Encodes the sequence into a hidden state and a cell state.
  - **`pack_padded_sequence`**: Used for efficiency in handling variable-length sequences.
- **Output**:
  - Final hidden state **(hidden, cell)** → Passed to the decoder.

#### **Why LSTMs?**
- They help capture long-term dependencies in sequential data and mitigate the vanishing-gradient problem of plain RNNs.

---

### **Decoder (Generates `code_array`)**
- **Input**: A single token (integer index).
- **Layers**:
  - **Embedding Layer**: Converts token indices into dense vector representations.
  - **LSTM Layer**: Takes the previous hidden state and generates a new state.
  - **Fully Connected Layer**: Maps LSTM output to the vocabulary size.
- **Output**:
  - Predicted token probability distribution → Used to select the next token.

#### **Teacher Forcing**
- **During training**:
  - With **probability 𝛾** (`teacher_forcing_ratio` in the code), the decoder is forced to use the true target token as input for the next step.
  - Otherwise, it uses its own prediction.
- **Purpose**: Speeds up convergence but may lead to over-reliance on ground truth.
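
The teacher-forcing decision itself is a single biased coin flip. The sketch below isolates it in a hypothetical helper, `next_decoder_input`, using Python's `random` module. In the notebook's model the draw is made once per time step and applies to the whole batch.

```python
import random

def next_decoder_input(true_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Pick the decoder's next input: ground truth with probability teacher_forcing_ratio."""
    use_truth = rng.random() < teacher_forcing_ratio
    return true_token if use_truth else predicted_token

# A ratio of 1.0 always feeds the ground truth; 0.0 always feeds the model's own prediction.
print(next_decoder_input("truth", "prediction", 1.0))  # truth
print(next_decoder_input("truth", "prediction", 0.0))  # prediction

# With a ratio of 0.9, roughly 90% of the steps use the ground truth.
rng = random.Random(0)
picks = [next_decoder_input(1, 0, 0.9, rng) for _ in range(10_000)]
fraction_truth = sum(picks) / len(picks)
print(round(fraction_truth, 2))
```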

---

### **Full Sequence-to-Sequence Model**
1. **Encoder processes `energy_array`** → Generates hidden states.
2. **Decoder generates `code_array`**:
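   - Starts from the first target token, which stands in for a `<sos>` (start) token; no dedicated start token is added to the vocabulary.
   - Iterates **one step at a time**, feeding the previous token back in as input.
   - Teacher forcing is used to stabilize training.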

---

## **4. Training Process**
### **Loss Function**
- Uses **CrossEntropyLoss** with `ignore_index=0` to ignore padding tokens. Because the label-encoded codes also start at 0, positions whose true code is 0 are ignored as well.
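
One common way to keep padding separate from real codes is to reserve the lowest token ids for special tokens and shift every label-encoded code upward. The sketch below shows this alternative. It is not what the notebook does, and the names `PAD_IDX`, `SOS_IDX`, `encode_target` and `decode_target` are assumptions for illustration.

```python
PAD_IDX = 0       # reserved for padding, safe to pass as ignore_index
SOS_IDX = 1       # reserved start-of-sequence token
NUM_SPECIAL = 2   # number of reserved ids

def encode_target(codes):
    """Shift label-encoded codes past the reserved ids and prepend the start token."""
    return [SOS_IDX] + [c + NUM_SPECIAL for c in codes]

def decode_target(tokens):
    """Drop special tokens and undo the shift."""
    return [t - NUM_SPECIAL for t in tokens if t >= NUM_SPECIAL]

encoded = encode_target([0, 1, 4])
print(encoded)                  # [1, 2, 3, 6]
print(decode_target(encoded))   # [0, 1, 4]
```

With this scheme the vocabulary size grows by `NUM_SPECIAL`, and code 0 contributes to the loss and the accuracy like any other code.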
  
### **Gradient Clipping**
- Prevents exploding gradients with `torch.nn.utils.clip_grad_norm_`.
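
A small, self-contained sketch of what `clip_grad_norm_` does is given below: it rescales all gradients in place so that their combined L2 norm does not exceed `max_norm`, and returns the norm measured before clipping. The toy layer and the inflated loss are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)

# An artificially large loss so that the gradient norm clearly exceeds 1.
loss = (layer(torch.randn(8, 4)) * 100).pow(2).mean()
loss.backward()

# Returns the total gradient norm before clipping and rescales the gradients in place.
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(float(norm_before) > 1.0)
print(float(norm_after) <= 1.0 + 1e-4)
```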

### **Evaluation Metric**
- **Token-level accuracy**:
  - Compares predicted token indices with ground truth.
  - Ignores padding tokens.
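
The training loop averages the per-batch accuracies, which gives every batch equal weight regardless of how many real tokens it contains. A possible variant, sketched below, pools correct and total token counts across batches instead. The helper `masked_counts` and the toy batches are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def masked_counts(logits, targets, pad_idx=0):
    """Return (correct, total) over the non-padding positions."""
    predicted = logits.argmax(dim=1)
    mask = targets != pad_idx
    correct = (predicted[mask] == targets[mask]).sum().item()
    total = mask.sum().item()
    return correct, total

# Two toy batches; the logits are one-hot rows built from the intended predictions.
batches = [
    (torch.tensor([1, 2, 4, 3]), torch.tensor([1, 2, 3, 4])),  # 2 of 4 correct
    (torch.tensor([2, 1, 1, 1]), torch.tensor([2, 0, 0, 0])),  # 1 of 1 real token correct
]
counts = [masked_counts(F.one_hot(pred, num_classes=5).float(), trg) for pred, trg in batches]

pooled_acc = sum(c for c, _ in counts) / sum(t for _, t in counts)
mean_of_batch_acc = sum(c / t for c, t in counts) / len(counts)
print(pooled_acc, mean_of_batch_acc)  # 0.6 0.75
```

The two numbers differ whenever batches contain different amounts of padding, which is the usual case with variable-length sequences.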

---

## **5. Testing & Final Evaluation**
- Model is evaluated on the **test set** (unseen data).
- Computes **loss & accuracy** to measure performance.
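
Note that `Seq2Seq.forward`, as written below, appears to apply teacher forcing regardless of `model.eval()`, so the reported validation and test metrics are computed with partial access to the ground truth. For predictions on unlabelled data, a greedy decoding loop without teacher forcing is one option. The sketch uses small stand-in modules (`TinyEncoder`, `TinyDecoder`) with the same call signatures as the notebook's `Encoder` and `Decoder`. The stand-ins, `greedy_decode`, and all sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.rnn as rnn_utils

class TinyEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, x, lengths):
        packed = rnn_utils.pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (hidden, cell) = self.lstm(packed)
        return hidden, cell

class TinyDecoder(nn.Module):
    def __init__(self, output_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(output_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, input_token, hidden, cell):
        embedded = self.embedding(input_token.unsqueeze(1))        # (batch, 1, hidden)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        return self.fc(output.squeeze(1)), hidden, cell            # logits: (batch, vocab)

@torch.no_grad()
def greedy_decode(encoder, decoder, src, src_lengths, first_token, max_len):
    """Generate max_len tokens per sequence, always feeding back the model's own argmax."""
    encoder.eval()
    decoder.eval()
    hidden, cell = encoder(src, src_lengths)
    input_token = first_token
    generated = [first_token]
    for _ in range(max_len - 1):
        logits, hidden, cell = decoder(input_token, hidden, cell)
        input_token = logits.argmax(dim=1)   # no ground truth is used here
        generated.append(input_token)
    return torch.stack(generated, dim=1)     # (batch, max_len)

torch.manual_seed(0)
VOCAB, HIDDEN = 10, 8
enc, dec = TinyEncoder(1, HIDDEN), TinyDecoder(VOCAB, HIDDEN)
src = torch.randn(2, 5, 1)                   # two padded energy sequences
src_lengths = torch.tensor([5, 3])
first_token = torch.tensor([4, 4])           # seed token, as in the notebook's trg[:, 0]
decoded = greedy_decode(enc, dec, src, src_lengths, first_token, max_len=6)
print(decoded.shape)
```

Since the source and target sequences have the same length in this dataset, `max_len` can be set to the length of the energy sequence being decoded.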


In [None]:
# =============================================================================
# 1. Data Preparation and Chronological Split
# =============================================================================
# Assume 'dataset' is your DataFrame loaded with two columns: 'energy_array' and 'code_array'.
# The dataset is in chronological order (earliest at the top).
dataset = dataset.reset_index(drop=True)
total_samples = len(dataset)

# Split chronologically:
# Last 20% for testing.
test_size = int(0.2 * total_samples)
train_val = dataset.iloc[:-test_size]   # first 80% for training+validation
test_df = dataset.iloc[-test_size:]       # last 20% for testing

# From the training+validation portion, hold out the last 12.5% of ALL samples as validation.
train_val_samples = len(train_val)
val_size = int(0.125 * total_samples)  # 12.5% of the total; 0.125 * train_val_samples would give ~10% overall
train_df = train_val.iloc[:-val_size]
val_df = train_val.iloc[-val_size:]

print(f"Train samples: {len(train_df)}, Validation samples: {len(val_df)}, Test samples: {len(test_df)}")

# Compute vocabulary size from the target token lists (assuming tokens are integers)
all_tokens = [token for tokens in dataset['code_array'] for token in tokens]
vocab_size = max(all_tokens) + 1  # Assumes tokens start at 0

# =============================================================================
# 2. Define Dataset and DataLoader with Padding
# =============================================================================
class Seq2SeqDataset(Dataset):
    def __init__(self, df):
        self.df = df.reset_index(drop=True)
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, index):
        row = self.df.iloc[index]
        # Convert energy_array (source) to tensor of shape (seq_len, 1)
        energy_seq = row['energy_array']
        src_tensor = torch.tensor(energy_seq, dtype=torch.float).unsqueeze(1)
        # Convert code_array (target) to tensor of token indices (seq_len,)
        code_seq = row['code_array']
        trg_tensor = torch.tensor(code_seq, dtype=torch.long)
        src_length = len(energy_seq)
        return src_tensor, trg_tensor, src_length

def collate_fn(batch):
    src_tensors, trg_tensors, src_lengths = zip(*batch)
    src_padded = rnn_utils.pad_sequence(src_tensors, batch_first=True, padding_value=0.0)
    trg_padded = rnn_utils.pad_sequence(trg_tensors, batch_first=True, padding_value=0)
    src_lengths = torch.tensor(src_lengths, dtype=torch.long)
    return src_padded, trg_padded, src_lengths

# Create dataset objects
train_dataset = Seq2SeqDataset(train_df)
val_dataset = Seq2SeqDataset(val_df)
test_dataset = Seq2SeqDataset(test_df)

# Create DataLoaders
BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

# =============================================================================
# 3. Define the Model Architecture
# =============================================================================
# Encoder: Processes the energy sequence.
class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers=1):
        super(Encoder, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
    
    def forward(self, x, lengths):
        # Pack the padded sequence for efficient processing.
        packed = rnn_utils.pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (hidden, cell) = self.lstm(packed)
        return hidden, cell

# Decoder: Generates the code sequence.
class Decoder(nn.Module):
    def __init__(self, output_dim, hidden_dim, num_layers=1):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, input_token, hidden, cell):
        # input_token: (batch_size,)
        input_token = input_token.unsqueeze(1)  # Shape: (batch_size, 1)
        embedded = self.embedding(input_token)   # Shape: (batch_size, 1, hidden_dim)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc(output.squeeze(1))  # Shape: (batch_size, output_dim)
        return prediction, hidden, cell

# Seq2Seq Model with Teacher Forcing
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device, teacher_forcing_ratio=0.5):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        self.teacher_forcing_ratio = teacher_forcing_ratio

    def forward(self, src, src_lengths, trg):
        """
        src: Tensor of shape (batch_size, src_seq_len, input_dim)
        trg: Tensor of shape (batch_size, trg_seq_len)
        """
        batch_size = src.size(0)
        trg_seq_len = trg.size(1)
        trg_vocab_size = self.decoder.fc.out_features
        
        outputs = torch.zeros(batch_size, trg_seq_len, trg_vocab_size).to(self.device)
        hidden, cell = self.encoder(src, src_lengths)
        
        # The first target token stands in for <sos>; no dedicated start token is added to the data
        input_token = trg[:, 0]
        
        for t in range(1, trg_seq_len):
            output, hidden, cell = self.decoder(input_token, hidden, cell)
            outputs[:, t] = output
            teacher_force = np.random.rand() < self.teacher_forcing_ratio
            top1 = output.argmax(1)
            input_token = trg[:, t] if teacher_force else top1
        
        return outputs

# =============================================================================
# 4. Training Setup
# =============================================================================
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
INPUT_DIM = 1            # Each energy value is a float (with an added feature dimension)
HIDDEN_DIM = 128
NUM_LAYERS = 1
OUTPUT_DIM = vocab_size  # Vocabulary size for code tokens

encoder = Encoder(INPUT_DIM, HIDDEN_DIM, NUM_LAYERS).to(device)
decoder = Decoder(OUTPUT_DIM, HIDDEN_DIM, NUM_LAYERS).to(device)
model = Seq2Seq(encoder, decoder, device, teacher_forcing_ratio=0.5).to(device)

optimizer = optim.Adam(model.parameters(), lr=0.001)
# Use CrossEntropyLoss and ignore padding tokens (assumed to be index 0)
criterion = nn.CrossEntropyLoss(ignore_index=0)

# =============================================================================
# 5. Accuracy Function
# =============================================================================
def compute_accuracy(predictions, targets, pad_idx=0):
    """
    Computes token-level accuracy, ignoring padding tokens.
    predictions: Tensor of shape (N, vocab_size)
    targets: Tensor of shape (N,)
    """
    predicted_tokens = predictions.argmax(dim=1)
    mask = targets != pad_idx
    correct = (predicted_tokens[mask] == targets[mask]).float().sum()
    total = mask.float().sum()
    return correct / total if total > 0 else torch.tensor(0.0)


# Set to True to train the model
TRAIN = True
if TRAIN:
    # =============================================================================
    # 6. Training Loop with Validation (Loss & Accuracy)
    # =============================================================================
    NUM_EPOCHS = 10

    for epoch in range(NUM_EPOCHS):
        model.train()
        train_loss = 0
        train_acc = 0
        for src, trg, src_lengths in train_loader:
            src = src.to(device)          # (batch_size, src_seq_len, 1)
            trg = trg.to(device)          # (batch_size, trg_seq_len)
            src_lengths = src_lengths.to(device)
            
            optimizer.zero_grad()
            output = model(src, src_lengths, trg)
            
            # For loss/accuracy computation, ignore the first time step (start token)
            output_dim = output.shape[-1]
            output_reshaped = output[:, 1:].reshape(-1, output_dim)
            trg_reshaped = trg[:, 1:].reshape(-1)
            
            loss = criterion(output_reshaped, trg_reshaped)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)
            optimizer.step()
            
            train_loss += loss.item()
            train_acc += compute_accuracy(output_reshaped, trg_reshaped, pad_idx=0).item()
        
        avg_train_loss = train_loss / len(train_loader)
        avg_train_acc = train_acc / len(train_loader)
        
        # Validation loop
        model.eval()
        val_loss = 0
        val_acc = 0
        with torch.no_grad():
            for src, trg, src_lengths in val_loader:
                src = src.to(device)
                trg = trg.to(device)
                src_lengths = src_lengths.to(device)
                
                output = model(src, src_lengths, trg)
                output_dim = output.shape[-1]
                output_reshaped = output[:, 1:].reshape(-1, output_dim)
                trg_reshaped = trg[:, 1:].reshape(-1)
                
                loss = criterion(output_reshaped, trg_reshaped)
                val_loss += loss.item()
                val_acc += compute_accuracy(output_reshaped, trg_reshaped, pad_idx=0).item()
        
        avg_val_loss = val_loss / len(val_loader)
        avg_val_acc = val_acc / len(val_loader)
        print(f"Epoch {epoch+1}/{NUM_EPOCHS} - Train Loss: {avg_train_loss:.4f}, Train Acc: {avg_train_acc:.4f} | "
            f"Val Loss: {avg_val_loss:.4f}, Val Acc: {avg_val_acc:.4f}")

    print("Training complete.")

    # =============================================================================
    # 7. Evaluation on the Test Set (Loss & Accuracy)
    # =============================================================================
    model.eval()
    test_loss = 0
    test_acc = 0
    with torch.no_grad():
        for src, trg, src_lengths in test_loader:
            src = src.to(device)
            trg = trg.to(device)
            src_lengths = src_lengths.to(device)
            
            output = model(src, src_lengths, trg)
            output_dim = output.shape[-1]
            output_reshaped = output[:, 1:].reshape(-1, output_dim)
            trg_reshaped = trg[:, 1:].reshape(-1)
            
            loss = criterion(output_reshaped, trg_reshaped)
            test_loss += loss.item()
            test_acc += compute_accuracy(output_reshaped, trg_reshaped, pad_idx=0).item()

    avg_test_loss = test_loss / len(test_loader)
    avg_test_acc = test_acc / len(test_loader)
    print(f"Test Loss: {avg_test_loss:.4f}, Test Accuracy: {avg_test_acc:.4f}")
    torch.save(model.state_dict(), WORKDIR+'seq2seq_model.pth')

Train samples: 1523, Validation samples: 281, Test samples: 450


### Hyperparameter Tuning

In [47]:
RUN_TUNE = False
if RUN_TUNE:
    import itertools

    # Define hyperparameter grid for tuning
    hidden_dims = [128, 256]
    teacher_forcing_ratios = [0.5, 0.7, 0.9]
    learning_rates = [0.001, 0.0005]
    num_epochs_tuning = 5  # fewer epochs for quick tuning

    best_val_acc = 0.0
    best_config = None
    best_model_state = None

    for hidden_dim, teacher_forcing_ratio, lr in itertools.product(hidden_dims, teacher_forcing_ratios, learning_rates):
        print(f"\nTraining with hidden_dim={hidden_dim}, teacher_forcing_ratio={teacher_forcing_ratio}, lr={lr}")
        
        # Initialize new model with the given hyperparameters
        encoder_tuned = Encoder(INPUT_DIM, hidden_dim, NUM_LAYERS).to(device)
        decoder_tuned = Decoder(OUTPUT_DIM, hidden_dim, NUM_LAYERS).to(device)
        model_tuned = Seq2Seq(encoder_tuned, decoder_tuned, device, teacher_forcing_ratio=teacher_forcing_ratio).to(device)
        optimizer_tuned = optim.Adam(model_tuned.parameters(), lr=lr)
        criterion_tuned = nn.CrossEntropyLoss(ignore_index=0)
        
        # Training for a few epochs to evaluate on validation set
        for epoch in range(num_epochs_tuning):
            model_tuned.train()
            train_loss = 0
            train_steps = 0
            for src, trg, src_lengths in train_loader:
                src = src.to(device)
                trg = trg.to(device)
                src_lengths = src_lengths.to(device)
                
                optimizer_tuned.zero_grad()
                output = model_tuned(src, src_lengths, trg)
                
                output_dim = output.shape[-1]
                output_reshaped = output[:, 1:].reshape(-1, output_dim)
                trg_reshaped = trg[:, 1:].reshape(-1)
                
                loss = criterion_tuned(output_reshaped, trg_reshaped)
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model_tuned.parameters(), max_norm=1)
                optimizer_tuned.step()
                
                train_loss += loss.item()
                train_steps += 1
            
            avg_train_loss = train_loss / train_steps
            
            # Validation
            model_tuned.eval()
            val_loss = 0
            val_steps = 0
            val_acc = 0
            with torch.no_grad():
                for src, trg, src_lengths in val_loader:
                    src = src.to(device)
                    trg = trg.to(device)
                    src_lengths = src_lengths.to(device)
                    output = model_tuned(src, src_lengths, trg)
                    
                    output_dim = output.shape[-1]
                    output_reshaped = output[:, 1:].reshape(-1, output_dim)
                    trg_reshaped = trg[:, 1:].reshape(-1)
                    
                    loss = criterion_tuned(output_reshaped, trg_reshaped)
                    val_loss += loss.item()
                    acc = compute_accuracy(output_reshaped, trg_reshaped, pad_idx=0)
                    val_acc += acc.item()
                    val_steps += 1
            
            avg_val_loss = val_loss / val_steps
            avg_val_acc = val_acc / val_steps
            
            print(f"Epoch {epoch+1}/{num_epochs_tuning} - Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}, Val Acc: {avg_val_acc:.4f}")
        
        # Save the model if this configuration is better than before
        if avg_val_acc > best_val_acc:
            best_val_acc = avg_val_acc
            best_config = (hidden_dim, teacher_forcing_ratio, lr)
            best_model_state = model_tuned.state_dict()
            print(f"--> New best config: hidden_dim={hidden_dim}, teacher_forcing_ratio={teacher_forcing_ratio}, lr={lr} with Val Acc {best_val_acc:.4f}")

    print(f"\nBest Configuration: hidden_dim={best_config[0]}, teacher_forcing_ratio={best_config[1]}, lr={best_config[2]} with Val Acc: {best_val_acc:.4f}")

    # Optionally, save the best model state to a file
    torch.save(best_model_state, WORKDIR+'best_seq2seq_model.pth')
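
A note on how the best state is stored: `state_dict()` returns tensors that share storage with the live parameters. If a model object recorded as "best" were trained further afterwards, the stored state would change with it. Whether that can happen here depends on how `model_tuned` is created for each configuration. The sketch below uses a hypothetical `nn.Linear` stand-in (not our Seq2Seq model) to show the difference between a live reference and a `copy.deepcopy` snapshot.

```python
import copy
import torch
import torch.nn as nn

# Hypothetical stand-in for model_tuned, used only to illustrate the behaviour
model = nn.Linear(2, 1)

live_state = model.state_dict()                # shares storage with the parameters
snapshot = copy.deepcopy(model.state_dict())   # independent copy of the tensors

# Simulate further training by changing the parameters in place
with torch.no_grad():
    for p in model.parameters():
        p.add_(1.0)

# The live reference follows the update, the snapshot does not
live_follows_model = torch.equal(live_state["weight"], model.weight.detach())
snapshot_unchanged = not torch.equal(snapshot["weight"], model.weight.detach())
print(live_follows_model, snapshot_unchanged)
```

Taking a deep copy at the moment a configuration becomes the best one is a cheap way to make the saved file independent of anything that happens to the model object later.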

### Best Hyper-Parameters

In [None]:
REBUILD_MODEL = True
# Best parameters
hidden_dim = 256
teacher_forcing_ratio = 0.9
lr = 0.001
NUM_EPOCHS = 15

# Initialize the model with the best parameters
encoder_best = Encoder(INPUT_DIM, hidden_dim, NUM_LAYERS).to(device)
decoder_best = Decoder(OUTPUT_DIM, hidden_dim, NUM_LAYERS).to(device)
model_best = Seq2Seq(encoder_best, decoder_best, device, teacher_forcing_ratio=teacher_forcing_ratio).to(device)

optimizer_best = optim.Adam(model_best.parameters(), lr=lr)
criterion_best = nn.CrossEntropyLoss(ignore_index=0)
if REBUILD_MODEL:

    # Training loop with full iterations
    for epoch in range(NUM_EPOCHS):
        model_best.train()
        train_loss = 0
        train_acc = 0
        for src, trg, src_lengths in train_loader:
            src = src.to(device)
            trg = trg.to(device)
            src_lengths = src_lengths.to(device)
            
            optimizer_best.zero_grad()
            output = model_best(src, src_lengths, trg)
            
            output_dim = output.shape[-1]
            output_reshaped = output[:, 1:].reshape(-1, output_dim)
            trg_reshaped = trg[:, 1:].reshape(-1)
            
            loss = criterion_best(output_reshaped, trg_reshaped)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model_best.parameters(), max_norm=1)
            optimizer_best.step()
            
            train_loss += loss.item()
            train_acc += compute_accuracy(output_reshaped, trg_reshaped, pad_idx=0).item()
        
        avg_train_loss = train_loss / len(train_loader)
        avg_train_acc = train_acc / len(train_loader)
        
        # Validation loop
        model_best.eval()
        val_loss = 0
        val_acc = 0
        with torch.no_grad():
            for src, trg, src_lengths in val_loader:
                src = src.to(device)
                trg = trg.to(device)
                src_lengths = src_lengths.to(device)
                
                output = model_best(src, src_lengths, trg)
                output_dim = output.shape[-1]
                output_reshaped = output[:, 1:].reshape(-1, output_dim)
                trg_reshaped = trg[:, 1:].reshape(-1)
                
                loss = criterion_best(output_reshaped, trg_reshaped)
                val_loss += loss.item()
                val_acc += compute_accuracy(output_reshaped, trg_reshaped, pad_idx=0).item()
        
        avg_val_loss = val_loss / len(val_loader)
        avg_val_acc = val_acc / len(val_loader)
        print(f"Epoch {epoch+1}/{NUM_EPOCHS} - Train Loss: {avg_train_loss:.4f}, Train Acc: {avg_train_acc:.4f} | "
            f"Val Loss: {avg_val_loss:.4f}, Val Acc: {avg_val_acc:.4f}")

    print("Training complete.")
    # save model
    torch.save(model_best.state_dict(), WORKDIR+'seq2seq_model_best_full.pth')

In [62]:
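# Rebuild the wrapper and load the saved weights.
# map_location lets a checkpoint trained on a GPU load on a CPU-only machine.
model_best = Seq2Seq(encoder_best, decoder_best, device, teacher_forcing_ratio=teacher_forcing_ratio).to(device)
model_best.load_state_dict(torch.load(WORKDIR+'seq2seq_model_best_full.pth', map_location=device))
model_best.eval()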

Seq2Seq(
  (encoder): Encoder(
    (lstm): LSTM(1, 256, batch_first=True)
  )
  (decoder): Decoder(
    (embedding): Embedding(9328, 256)
    (lstm): LSTM(256, 256, batch_first=True)
    (fc): Linear(in_features=256, out_features=9328, bias=True)
  )
)

In [63]:
# now test the model
model_best.eval()
test_loss = 0
test_acc = 0
with torch.no_grad():
    for src, trg, src_lengths in test_loader:
        src = src.to(device)
        trg = trg.to(device)
        src_lengths = src_lengths.to(device)
        
        output = model_best(src, src_lengths, trg)
        output_dim = output.shape[-1]
        output_reshaped = output[:, 1:].reshape(-1, output_dim)
        trg_reshaped = trg[:, 1:].reshape(-1)
        
        loss = criterion_best(output_reshaped, trg_reshaped)
        test_loss += loss.item()
        test_acc += compute_accuracy(output_reshaped, trg_reshaped, pad_idx=0).item()
test_loss /= len(test_loader)
test_acc /= len(test_loader)
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_acc:.4f}")

Test Loss: 0.4465, Test Accuracy: 0.8991
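
The accuracy reported above ignores padded positions (`pad_idx=0`), in the same way that `CrossEntropyLoss(ignore_index=0)` ignores them in the loss. As a minimal sketch of what such a padding-aware accuracy can look like, the function `masked_accuracy` below is a hypothetical illustration and not necessarily identical to our `compute_accuracy`. It expects flattened logits of shape `(N, vocab)` and targets of shape `(N,)`, as produced by the `reshape` calls in the loops above.

```python
import torch

def masked_accuracy(logits, targets, pad_idx=0):
    """Fraction of non-padding positions where argmax(logits) equals the target."""
    preds = logits.argmax(dim=1)
    mask = targets != pad_idx                 # True where the target is a real token
    correct = (preds == targets) & mask
    # clamp avoids a division by zero if a batch contains only padding
    return correct.sum().float() / mask.sum().clamp(min=1).float()

logits = torch.tensor([
    [0.1, 2.0, 0.3],   # predicts 1, target 1: correct
    [0.1, 0.2, 3.0],   # predicts 2, target 1: wrong
    [5.0, 0.1, 0.2],   # target is padding: ignored
    [0.0, 0.1, 4.0],   # predicts 2, target 2: correct
])
targets = torch.tensor([1, 1, 0, 2])
acc = masked_accuracy(logits, targets)
print(f"{acc.item():.4f}")  # 2 correct out of 3 non-padding positions
```

Note that averaging per-batch accuracies over `len(test_loader)` weights every batch equally, so a smaller final batch counts as much as a full one.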


## **Summary of Key Features**
✔ **Handles variable-length sequences efficiently**  
✔ **Uses an LSTM-based Encoder-Decoder architecture**  
✔ **Implements Teacher Forcing for stable training**  
✔ **Uses gradient clipping to prevent instability**  
✔ **Employs padding-aware training via `pack_padded_sequence`**  

This model is well-suited for tasks involving sequential energy-to-code transformation, making it useful for applications like **symbolic regression, program synthesis, or energy pattern recognition.**
Our best model, trained for 15 epochs, reaches a test accuracy of 0.8991 on held-out data, which gives us reasonable confidence in its predictions.
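
To illustrate the padding-aware point, here is a minimal sketch of how `pack_padded_sequence` lets an LSTM skip padded time steps. The tiny LSTM (hidden size 4) and the two example sequences are assumptions chosen for illustration, not our trained encoder. The check at the end compares the final hidden state of the shorter sequence inside the padded batch with the one obtained by running that sequence alone.

```python
import torch
import torch.nn as nn
import torch.nn.utils.rnn as rnn_utils

torch.manual_seed(0)
# Illustrative LSTM with one input feature (energy), not the trained encoder
lstm = nn.LSTM(input_size=1, hidden_size=4, batch_first=True)

# Two sequences of lengths 3 and 5, zero-padded to length 5
batch = torch.zeros(2, 5, 1)
batch[0, :3, 0] = torch.tensor([40.0, 39.4, 20.7])
batch[1, :, 0] = torch.tensor([28.9, 39.4, 15.2, 26.8, 5.6])
lengths = torch.tensor([3, 5])  # lengths must be on the CPU

# enforce_sorted=False allows sequences in any length order
packed = rnn_utils.pack_padded_sequence(
    batch, lengths, batch_first=True, enforce_sorted=False
)
_, (hidden, cell) = lstm(packed)

# Reference: the short sequence on its own, without any padding
_, (hidden_alone, _) = lstm(batch[0:1, :3])

same = torch.allclose(hidden[0, 0], hidden_alone[0, 0], atol=1e-4)
print(tuple(hidden.shape), same)
```

Because the packed input stops each sequence at its true length, the padded zeros never reach the recurrent state that is handed to the decoder.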

In [64]:
# use the best model for inference
model = model_best

## Blind Dataset Prediction

Process the blind dataset:
- Generate predictions
- Format output
- Export results

In [65]:
# blind chunk
blind_chunks = chunk_dataset(df_blind)
blind_dataset = []
for chunk in blind_chunks:
    energy_array = chunk['energy'].values
    blind_dataset.append({'energy_array': energy_array})
blind_dataset = pd.DataFrame(blind_dataset)
blind_dataset.head()

Unnamed: 0,energy_array
0,"[40.0, 39.4, 20.7, 27.0, 10.7, 22.2, 19.5, 18...."
1,"[28.9, 39.4, 15.2, 26.8, 5.6, 11.7, 10.7, 13.0..."
2,"[40.0, 39.0, 15.4, 27.0, 9.5, 22.1, 15.0, 14.2..."
3,"[39.6, 38.9, 15.3, 27.0, 7.6, 21.4, 15.4, 14.1..."
4,"[35.8, 38.8, 15.2, 27.0, 5.5, 19.0, 15.1, 14.2..."


In [66]:
# Define inference parameters
SOS_TOKEN = 1 
EOS_TOKEN = 2 
MAX_LEN = 50   # Maximum length of the predicted sequence

def predict_sequence(model, energy_seq, device, max_len=MAX_LEN, sos_token=SOS_TOKEN, eos_token=EOS_TOKEN):
    """
    Given an energy sequence (list of floats), predicts a sequence of code tokens.
    """
    model.eval()
    # Convert energy_seq into a tensor with shape (seq_len, 1)
    src_tensor = torch.tensor(energy_seq, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
    src_length = src_tensor.size(0)
    
    # Add a batch dimension: (1, seq_len, 1)
    src_tensor = src_tensor.unsqueeze(0).to(device)
    src_length_tensor = torch.tensor([src_length], dtype=torch.long).to(device)
    
    with torch.no_grad():
        hidden, cell = model.encoder(src_tensor, src_length_tensor)
    
    # Start with the SOS token
    input_token = torch.tensor([sos_token], dtype=torch.long).to(device)
    predicted_tokens = []
    
    for _ in range(max_len):
        with torch.no_grad():
            output, hidden, cell = model.decoder(input_token, hidden, cell)
        predicted_token = output.argmax(1).item()
        if predicted_token == eos_token:
            break
        predicted_tokens.append(predicted_token)
        input_token = torch.tensor([predicted_token], dtype=torch.long).to(device)
    
    return predicted_tokens
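
The decoding loop in `predict_sequence` is greedy: at each step it takes the highest-scoring token and feeds it back as the next input, stopping at the EOS token or after `max_len` steps. The sketch below reproduces that loop with a hypothetical `toy_decoder` that follows a fixed transition table, so the two stopping conditions can be seen without a trained model. The table, vocabulary size and function names are assumptions for illustration only.

```python
import torch

SOS, EOS, VOCAB = 1, 2, 6
NEXT = {1: 3, 3: 5, 5: 2}  # hypothetical transitions: SOS -> 3 -> 5 -> EOS

def toy_decoder(input_token, hidden, cell):
    """Stand-in with the decoder's call signature; returns (1, VOCAB) logits."""
    logits = torch.zeros(1, VOCAB)
    logits[0, NEXT[input_token.item()]] = 1.0
    return logits, hidden, cell

def greedy_decode(decoder, hidden, cell, max_len):
    """Pick the argmax token at each step until EOS or max_len is reached."""
    token = torch.tensor([SOS], dtype=torch.long)
    out = []
    for _ in range(max_len):
        logits, hidden, cell = decoder(token, hidden, cell)
        nxt = logits.argmax(1).item()
        if nxt == EOS:
            break
        out.append(nxt)
        token = torch.tensor([nxt], dtype=torch.long)
    return out

full = greedy_decode(toy_decoder, None, None, max_len=10)   # ends at EOS
capped = greedy_decode(toy_decoder, None, None, max_len=1)  # ends at max_len
print(full, capped)  # [3, 5] [3]
```

The returned list can therefore be shorter than `max_len` whenever the EOS token is predicted early.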


In [67]:
blind_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 673 entries, 0 to 672
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   energy_array  673 non-null    object
dtypes: object(1)
memory usage: 5.4+ KB


In [68]:
# suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Now produce the blind predictions:
# for each chunk in the blind dataset, add the predicted code array
for chunk in blind_chunks:
    energy_seq = chunk['energy'].values
    maxlen = len(energy_seq)
    predicted = predict_sequence(model, energy_seq, device, max_len=maxlen,
                                 sos_token=SOS_TOKEN, eos_token=EOS_TOKEN)
    chunk['predicted_code_array'] = list(predicted)
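
One caveat with the assignment above: pandas requires a list assigned as a column to have the same length as the frame, while `predict_sequence` can return fewer than `max_len` tokens if it predicts EOS early. A minimal sketch of a guard is shown below. The helper `fit_to_length` and its `fill_token` argument are hypothetical, and which fill value is appropriate (for example, repeating the last predicted token) depends on the label encoding.

```python
def fit_to_length(tokens, length, fill_token):
    """Truncate or right-pad `tokens` so that it has exactly `length` items."""
    tokens = list(tokens)[:length]
    return tokens + [fill_token] * (length - len(tokens))

# A prediction that stopped early is padded up to the chunk length
padded = fit_to_length([3, 5], 4, fill_token=0)
# A prediction that is too long is cut down
truncated = fit_to_length([3, 5, 7], 2, fill_token=0)
print(padded, truncated)  # [3, 5, 0, 0] [3, 5]
```

In the loop above this would be applied as `fit_to_length(predicted, len(chunk), fill_token)` before the column assignment.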

In [69]:
# put the chunks back together
blind_df = pd.concat(blind_chunks)
# invert the label encoding
blind_df['Codigo'] = codele.inverse_transform(blind_df['predicted_code_array'])
blind_df = blind_df.drop(columns=['predicted_code_array', 'day_of_week', 'hour', 'is_weekend', 'hour_sin', 'hour_cos', 'dow_sin', 'dow_cos', 'energy_percentile', 'price_percentile', 'month', 'day_of_month'])
blind_df['datetime'] = blind_datetime
# fechaHora,PrecEuro,Energia
colmap = {"datetime": "fechaHora", "price": "PrecEuro", "energy": "Energia", 'Codigo': 'Codigo'}
blind_df = blind_df.rename(columns=colmap)
blind_df = blind_df[['Codigo', 'fechaHora', 'PrecEuro', 'Energia']]
blind_df.head()

Unnamed: 0,Codigo,fechaHora,PrecEuro,Energia
4,CEVD003,2024-06-01 23:00:00,-0.25,40.0
5,CEVD209,2024-06-01 23:00:00,-15.0,39.4
6,CEVD234,2024-06-01 23:00:00,-3.5,20.7
7,CEVD279,2024-06-01 23:00:00,-0.61,27.0
8,CEVD284,2024-06-01 23:00:00,-0.01,10.7


In [71]:
# save blind dataset
blind_df.to_csv(WORKDIR+'../data/Challenge2_404TeamNotFound.csv', index=False)

In summary, we have built a model that takes the energy values recorded for any given hour and produces an array of market codes, one for each energy point.