# Power Transformer Oil Temperature Prediction using Informer

## Introduction

This notebook implements the **Informer model** for predicting the oil temperature (OT) of power transformers.

**Informer**: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
- **Paper**: Zhou et al., AAAI 2021 **Best Paper Award** 🏆
- **Key Innovation**: Designed specifically for long sequence time-series forecasting (LSTF)

### Why Informer for Oil Temperature Prediction?

1. **Long-term Dependencies**: Oil temperature responds to load changes with thermal inertia spanning hours
2. **Efficiency**: ProbSparse attention reduces complexity from O(L²) to O(L log L)
3. **Long-horizon Forecasting**: Predicts multiple time steps ahead in a single forward pass
4. **Strong Benchmark Results**: The original paper reports the best results among its compared baselines on most ETT settings

### Three Key Innovations

#### 1. ProbSparse Self-Attention
- Traditional self-attention: O(L²) complexity
- Informer: O(L log L) by selecting only "active" queries
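- **Idea**: Not all queries contribute equally; compute full attention only for queries whose attention distribution is far from uniform (a high max-minus-mean score) and approximate the rest with the mean of the values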
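
To make the selection rule concrete, here is a minimal sketch of the max-minus-mean sparsity measurement on random data. It is an illustration under simplifying assumptions rather than the notebook's `ProbAttention`: it scores each query against every key (the full implementation samples keys to avoid the quadratic cost), and the helper names `sparsity_scores` and `select_active_queries` are hypothetical.

```python
import math
import torch

def sparsity_scores(Q, K):
    """Max-minus-mean sparsity measurement for each query.

    Q: [L_Q, D], K: [L_K, D]. For clarity this uses all keys; the
    notebook's ProbAttention samples a subset of keys instead.
    """
    scores = Q @ K.T / math.sqrt(Q.shape[-1])   # [L_Q, L_K]
    return scores.max(dim=-1).values - scores.mean(dim=-1)

def select_active_queries(Q, K, factor=5):
    """Return indices of the top-u queries, u = factor * ceil(ln L_Q)."""
    L_Q = Q.shape[0]
    u = min(factor * math.ceil(math.log(L_Q)), L_Q)
    M = sparsity_scores(Q, K)
    return M.topk(u).indices

torch.manual_seed(0)
Q = torch.randn(96, 64)
K = torch.randn(96, 64)
active = select_active_queries(Q, K)
print(active.shape)  # torch.Size([25]) since 5 * ceil(ln 96) = 25
```

With `L = 96` and `factor = 5`, only 25 of the 96 queries receive a full attention computation.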

#### 2. Self-Attention Distilling
- Progressively reduces sequence length layer by layer
- Highlights dominant features while reducing computational burden
- Uses max pooling with stride 2 to halve the sequence length between encoder layers
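
As a quick check of the halving claim, the sketch below applies a max-pooling layer with kernel size 3, stride 2, and padding 1 (the settings used by this notebook's `ConvLayer`) to a dummy tensor. It deliberately omits the convolution, normalization, and activation that precede pooling in the full layer, and the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Dummy encoder activations: [batch, d_model, seq_len] (channels-first for pooling)
x = torch.randn(2, 8, 96)

pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

lengths = [x.shape[-1]]
for _ in range(3):          # three successive distilling steps
    x = pool(x)
    lengths.append(x.shape[-1])

print(lengths)  # [96, 48, 24, 12]
```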

#### 3. Generative Style Decoder
- Predicts entire output sequence in one forward pass
- Avoids error accumulation from step-by-step prediction
- Speeds up inference compared with step-by-step (autoregressive) decoding
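
The decoder input that makes one-pass prediction possible can be sketched as follows: a slice of known history (the start token) is concatenated with zero placeholders for the steps to be forecast. The lengths match the configuration used later in this notebook; the random history values are purely illustrative.

```python
import numpy as np

label_len, pred_len, n_features = 48, 24, 6

# Hypothetical history: the last `label_len` observed steps (the "start token")
start_token = np.random.default_rng(0).normal(size=(label_len, n_features))

# Zero placeholders mark the positions the decoder must fill in
placeholder = np.zeros((pred_len, n_features))

decoder_input = np.concatenate([start_token, placeholder], axis=0)
print(decoder_input.shape)                      # (72, 6)
print(np.all(decoder_input[-pred_len:] == 0))   # True
```

The model then reads all 72 positions at once and only the last `pred_len` outputs are kept as the forecast.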

### Dataset: ETT (Electricity Transformer Temperature)
- **Source**: https://github.com/zhouhaoyi/ETDataset
- **Sampling**: 15-minute intervals (as in the ETTm1/ETTm2 files; the ETTh1/ETTh2 files are hourly)
- **Features**: 6 load features (HUFL, HULL, MUFL, MULL, LUFL, LULL)
- **Target**: OT (Oil Temperature)

## Step 1: Import Required Libraries

In [None]:
# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Deep Learning
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Preprocessing and Metrics
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {device}")
print(f"CUDA available: {torch.cuda.is_available()}")

## Step 2: Load and Explore Data

In [None]:
def load_data(filepath):
    """
    Load the ETT dataset
    """
    df = pd.read_csv(filepath)
    df['date'] = pd.to_datetime(df['date'])
    return df

# Load pre-split training and test data
train_filepath = '../dataset/processed_data/train.csv'
test_filepath = '../dataset/processed_data/test.csv'

df_train = load_data(train_filepath)
df_test = load_data(test_filepath)

# For initial exploration, use training data
df = df_train

print("Dataset loaded from pre-split files:")
print(f"  Training data: {train_filepath}")
print(f"  Test data: {test_filepath}")
print(f"\nTraining set shape: {df_train.shape}")
print(f"Test set shape: {df_test.shape}")
print(f"\nFirst few rows of training data:")
print(df.head())
print("\nDataset Info:")
print(df.info())
print("\nBasic Statistics:")
print(df.describe())

## Step 3: Data Visualization

In [None]:
# Plot Oil Temperature (target variable)
fig, axes = plt.subplots(2, 1, figsize=(15, 8))

# Time series (first 2000 points)
axes[0].plot(df['date'][:2000], df['OT'][:2000], linewidth=0.8)
axes[0].set_title('Oil Temperature Over Time (First 2000 points)', fontsize=14)
axes[0].set_xlabel('Date')
axes[0].set_ylabel('Oil Temperature (°C)')
axes[0].grid(True, alpha=0.3)

# Distribution
axes[1].hist(df['OT'], bins=50, edgecolor='black', alpha=0.7)
axes[1].set_title('Oil Temperature Distribution', fontsize=14)
axes[1].set_xlabel('Oil Temperature (°C)')
axes[1].set_ylabel('Frequency')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
features = ['HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL', 'OT']
sns.heatmap(df[features].corr(), annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=1, fmt='.2f')
plt.title('Feature Correlation Heatmap', fontsize=14)
plt.tight_layout()
plt.show()

## Step 4: Extract Time Features

Informer adds timestamp-derived features to its input embedding so the model can pick up daily, weekly, and seasonal patterns.
We extract four of them: hour, day of week, day of month, and month.
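
As a worked example of these four features, the sketch below computes them for one arbitrary timestamp and scales each to [0, 1] using the same divisors as this notebook's `extract_time_features` function. It uses the standard-library `datetime` class rather than pandas to stay dependency-free; `weekday()` follows the same Monday=0 convention as pandas' `dayofweek`.

```python
from datetime import datetime

ts = datetime(2016, 7, 1, 13, 15)   # an arbitrary example timestamp (a Friday)

features = {
    "hour_norm": ts.hour / 23.0,
    "day_of_week_norm": ts.weekday() / 6.0,    # Monday=0, Sunday=6
    "day_of_month_norm": (ts.day - 1) / 30.0,
    "month_norm": (ts.month - 1) / 11.0,
}
for name, value in features.items():
    print(f"{name}: {value:.3f}")
# hour_norm: 0.565
# day_of_week_norm: 0.667
# day_of_month_norm: 0.000
# month_norm: 0.545
```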

In [None]:
def extract_time_features(df):
    """
    Extract time features from datetime column
    
    Returns:
    --------
    df : pd.DataFrame with added time features
    """
    df = df.copy()
    
    # Extract time components
    df['hour'] = df['date'].dt.hour
    df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
    df['day_of_month'] = df['date'].dt.day
    df['month'] = df['date'].dt.month
    
    # Normalize to [0, 1] range
    df['hour_norm'] = df['hour'] / 23.0
    df['day_of_week_norm'] = df['day_of_week'] / 6.0
    df['day_of_month_norm'] = (df['day_of_month'] - 1) / 30.0
    df['month_norm'] = (df['month'] - 1) / 11.0
    
    return df

# Extract time features for both datasets
df_train = extract_time_features(df_train)
df_test = extract_time_features(df_test)
df = df_train  # For exploration

print("Time features extracted for both train and test sets!")
print("\nColumns:", df.columns.tolist())
print("\nSample time features:")
print(df[['date', 'hour', 'day_of_week', 'day_of_month', 'month']].head(10))

## Step 5: Data Preprocessing and Dataset Creation

### Key Differences from RNN:
- **Input**: `seq_len` historical time steps
- **Decoder window**: `label_len` + `pred_len` steps (a known start token followed by the forecast horizon)
- **Time Stamps**: Time features for both encoder and decoder
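
The window arithmetic behind these three lengths can be sketched in plain Python. The helpers `window_bounds` and `num_windows` are hypothetical names used only for illustration; the index logic mirrors what the `InformerDataset` class in this step computes.

```python
def window_bounds(index, seq_len=96, label_len=48, pred_len=24):
    """Return (encoder, decoder) row ranges for one sample."""
    s_begin = index
    s_end = s_begin + seq_len
    r_begin = s_end - label_len          # decoder starts inside the encoder window
    r_end = r_begin + label_len + pred_len
    return (s_begin, s_end), (r_begin, r_end)

def num_windows(n_rows, seq_len=96, pred_len=24):
    """Number of complete samples that fit in `n_rows` rows."""
    return n_rows - seq_len - pred_len + 1

enc, dec = window_bounds(0)
print(enc, dec)            # (0, 96) (48, 120)
print(num_windows(1000))   # 881
```

The decoder range overlaps the last `label_len` rows of the encoder range, and only its final `pred_len` rows lie in the future.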

In [None]:
class InformerDataset(Dataset):
    """
    Dataset for Informer model
    
    Parameters:
    -----------
    data : pd.DataFrame
        Input dataframe with features and target
    seq_len : int
        Encoder input length (look-back window)
    label_len : int
        Decoder input length (start token)
    pred_len : int
        Prediction length (forecast horizon)
    feature_cols : list
        List of feature column names
    target_col : str
        Target column name
    time_cols : list
        List of time feature column names
    scaler : StandardScaler or None
        Fitted scaler (for test set) or None (for train set)
    """
    def __init__(self, data, seq_len, label_len, pred_len, 
                 feature_cols, target_col, time_cols, scaler=None):
        self.seq_len = seq_len
        self.label_len = label_len
        self.pred_len = pred_len
        
        # Normalize features
        if scaler is None:
            self.scaler_X = StandardScaler()
            self.scaler_y = StandardScaler()
            
            self.data_X = self.scaler_X.fit_transform(data[feature_cols].values)
            self.data_y = self.scaler_y.fit_transform(data[[target_col]].values)
        else:
            self.scaler_X, self.scaler_y = scaler
            self.data_X = self.scaler_X.transform(data[feature_cols].values)
            self.data_y = self.scaler_y.transform(data[[target_col]].values)
        
        # Time features (already normalized)
        self.data_stamp = data[time_cols].values
        
    def __len__(self):
        return len(self.data_X) - self.seq_len - self.pred_len + 1
    
    def __getitem__(self, index):
        # Encoder input
        s_begin = index
        s_end = s_begin + self.seq_len
        
        # Decoder input (overlaps with encoder)
        r_begin = s_end - self.label_len
        r_end = r_begin + self.label_len + self.pred_len
        
        # Input features for encoder
        seq_x = self.data_X[s_begin:s_end]
        seq_x_mark = self.data_stamp[s_begin:s_end]
        
        # Input for decoder (features + zero padding for prediction)
        seq_y = np.concatenate([
            self.data_X[r_begin:r_begin+self.label_len],
            np.zeros((self.pred_len, self.data_X.shape[1]))
        ], axis=0)
        seq_y_mark = self.data_stamp[r_begin:r_end]
        
        # Target (actual values to predict)
        target = self.data_y[r_begin:r_end]
        
        return (
            torch.FloatTensor(seq_x),
            torch.FloatTensor(seq_x_mark),
            torch.FloatTensor(seq_y),
            torch.FloatTensor(seq_y_mark),
            torch.FloatTensor(target)
        )

print("InformerDataset class defined!")

In [None]:
# Configuration
seq_len = 96      # 24 hours (96 * 15min)
label_len = 48    # 12 hours overlap
pred_len = 24     # 6 hours prediction

feature_cols = ['HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL']
target_col = 'OT'
time_cols = ['hour_norm', 'day_of_week_norm', 'day_of_month_norm', 'month_norm']

# Use pre-split data
print(f"Using pre-split data:")
print(f"  Train set: {len(df_train)} samples")
print(f"  Test set: {len(df_test)} samples")

# Create datasets
train_dataset = InformerDataset(
    df_train, seq_len, label_len, pred_len,
    feature_cols, target_col, time_cols
)

test_dataset = InformerDataset(
    df_test, seq_len, label_len, pred_len,
    feature_cols, target_col, time_cols,
    scaler=(train_dataset.scaler_X, train_dataset.scaler_y)
)

# Create data loaders
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"\nTrain batches: {len(train_loader)}")
print(f"Test batches: {len(test_loader)}")

# Test data loading
sample_batch = next(iter(train_loader))
print(f"\nSample batch shapes:")
print(f"  Encoder input (seq_x): {sample_batch[0].shape}")
print(f"  Encoder time (seq_x_mark): {sample_batch[1].shape}")
print(f"  Decoder input (seq_y): {sample_batch[2].shape}")
print(f"  Decoder time (seq_y_mark): {sample_batch[3].shape}")
print(f"  Target: {sample_batch[4].shape}")
print(f"\nNote: Using scalers fitted on training data for test set")

## Step 6: Build Informer Model

### Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│                     INFORMER ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Encoder Input                    Decoder Input            │
│      ↓                                 ↓                   │
│  [Embedding]                      [Embedding]              │
│      ↓                                 ↓                   │
│  ┌──────────┐                     ┌──────────┐            │
│  │ProbSparse│ ←────────────────── │ Standard │            │
│  │Attention │      Cross-Attn     │ Attention│            │
│  │+ Distill │                     │          │            │
│  └──────────┘                     └──────────┘            │
│      ↓                                 ↓                   │
│  [Feed Forward]                   [Feed Forward]          │
│      ↓                                 ↓                   │
│  ───────────────────────────────────────→                 │
│                                         ↓                  │
│                                    [Projection]            │
│                                         ↓                  │
│                                      Output                │
└─────────────────────────────────────────────────────────────┘
```
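
To see how sequence lengths flow through this architecture, the sketch below traces them for the configuration used in this notebook (`seq_len=96`, `label_len=48`, `pred_len=24`, two encoder layers with distilling in between). The helper names are hypothetical, and the pooling formula assumes the kernel 3, stride 2, padding 1 settings of this notebook's `ConvLayer`.

```python
import math

def active_queries(length, factor=5):
    """u = factor * ceil(ln(length)), capped at the sequence length."""
    return min(factor * math.ceil(math.log(length)), length)

def pooled_length(length, kernel=3, stride=2, padding=1):
    """Output length of a 1D max-pool with the given settings."""
    return (length + 2 * padding - kernel) // stride + 1

seq_len, label_len, pred_len, e_layers = 96, 48, 24, 2

length = seq_len
for layer in range(e_layers):
    print(f"encoder layer {layer}: length={length}, "
          f"active queries={active_queries(length)}")
    if layer < e_layers - 1:          # distilling sits between encoder layers
        length = pooled_length(length)

dec_len = label_len + pred_len
print(f"decoder: length={dec_len}, cross-attends to {length} encoder steps")
```

Under these assumptions the encoder attends over 96 and then 48 steps, with 25 and 20 active queries respectively, and the 72-step decoder cross-attends to the 48-step encoder output.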

In [None]:
class PositionalEmbedding(nn.Module):
    """Positional encoding for sequence data"""
    def __init__(self, d_model, max_len=5000):
        super(PositionalEmbedding, self).__init__()
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                             (-np.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0)  # [1, max_len, d_model]
        self.register_buffer('pe', pe)
    
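    def forward(self, x):
        # x: [batch, seq_len, *]; only the sequence length is used.
        # Return the encoding itself so that DataEmbedding can add it to the
        # value embedding (x may still have c_in channels here, not d_model).
        return self.pe[:, :x.size(1), :]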


class TokenEmbedding(nn.Module):
    """Project input features to model dimension"""
    def __init__(self, c_in, d_model):
        super(TokenEmbedding, self).__init__()
        self.tokenConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=d_model,
            kernel_size=3,
            padding=1,
            padding_mode='circular'
        )
        for m in self.modules():
            if isinstance(m, nn.Conv1d):
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='leaky_relu')
    
    def forward(self, x):
        # x: [batch, seq_len, c_in]
        x = x.permute(0, 2, 1)  # [batch, c_in, seq_len]
        x = self.tokenConv(x)
        x = x.permute(0, 2, 1)  # [batch, seq_len, d_model]
        return x


class TimeFeatureEmbedding(nn.Module):
    """Embed time features"""
    def __init__(self, d_model, embed_dim=4):
        super(TimeFeatureEmbedding, self).__init__()
        self.embed = nn.Linear(embed_dim, d_model)
    
    def forward(self, x):
        # x: [batch, seq_len, 4] (hour, day_of_week, day, month)
        return self.embed(x)


class DataEmbedding(nn.Module):
    """Complete embedding: token + positional + temporal"""
    def __init__(self, c_in, d_model, dropout=0.1):
        super(DataEmbedding, self).__init__()
        self.value_embedding = TokenEmbedding(c_in, d_model)
        self.position_embedding = PositionalEmbedding(d_model)
        self.temporal_embedding = TimeFeatureEmbedding(d_model)
        self.dropout = nn.Dropout(p=dropout)
    
    def forward(self, x, x_mark):
        # x: [batch, seq_len, c_in]
        # x_mark: [batch, seq_len, 4]
        x = self.value_embedding(x) + self.position_embedding(x) + self.temporal_embedding(x_mark)
        return self.dropout(x)

print("Embedding modules defined!")

In [None]:
class ProbAttention(nn.Module):
    """
    ProbSparse Self-Attention Mechanism
    
    Key Innovation:
    - Select top-u queries based on sparsity measurement
    - Only compute attention for selected queries
    - Reduces complexity from O(L²) to O(L log L)
    """
    def __init__(self, mask_flag=True, factor=5, scale=None, 
                 attention_dropout=0.1, output_attention=False):
        super(ProbAttention, self).__init__()
        self.factor = factor
        self.scale = scale
        self.mask_flag = mask_flag
        self.output_attention = output_attention
        self.dropout = nn.Dropout(attention_dropout)
    
    def _prob_QK(self, Q, K, sample_k, n_top):
        """
        Calculate query sparsity measurement
        
        Returns:
        --------
        Q_K : tensor
            Sampled Q-K scores
        M_top : tensor
            Top-u query indices based on sparsity
        """
        # Q: [B, H, L, D]
        B, H, L_Q, D = Q.shape
        _, _, L_K, _ = K.shape
        
        # Calculate sampled Q-K scores
        K_expand = K.unsqueeze(-3).expand(B, H, L_Q, L_K, D)
        index_sample = torch.randint(L_K, (L_Q, sample_k))  # Random sampling
        K_sample = K_expand[:, :, torch.arange(L_Q).unsqueeze(1), index_sample, :]
        Q_K_sample = torch.matmul(Q.unsqueeze(-2), K_sample.transpose(-2, -1)).squeeze(-2)
        
        # Sparsity measurement: max - mean
        M = Q_K_sample.max(-1)[0] - torch.div(Q_K_sample.sum(-1), L_K)
        M_top = M.topk(n_top, sorted=False)[1]
        
        # Calculate full Q-K for top queries
        Q_reduce = Q[torch.arange(B)[:, None, None],
                     torch.arange(H)[None, :, None],
                     M_top, :]  # [B, H, n_top, D]
        Q_K = torch.matmul(Q_reduce, K.transpose(-2, -1))  # [B, H, n_top, L_K]
        
        return Q_K, M_top
    
    def _get_initial_context(self, V, L_Q):
        """
        Initialize context with mean of V
        """
        B, H, L_V, D = V.shape
        if not self.mask_flag:
            # Mean pooling
            V_sum = V.mean(dim=-2)
            context = V_sum.unsqueeze(-2).expand(B, H, L_Q, V_sum.shape[-1]).clone()
        else:
            # Cumulative sum (not a mean) for masked self-attention;
            # this path is only valid when L_Q == L_V
            assert L_Q == L_V
            context = V.cumsum(dim=-2)
        return context
    
    def _update_context(self, context_in, V, scores, index, L_Q):
        """
        Update context with selected queries
        """
        B, H, L_V, D = V.shape
        
        if self.mask_flag:
            attn_mask = ProbMask(B, H, L_Q, index, scores, device=V.device)
            scores.masked_fill_(attn_mask.mask, -np.inf)
        
        attn = torch.softmax(scores, dim=-1)
        context_in[torch.arange(B)[:, None, None],
                   torch.arange(H)[None, :, None],
                   index, :] = torch.matmul(attn, V).type_as(context_in)
        
        if self.output_attention:
            attns = (torch.ones([B, H, L_V, L_V]) / L_V).type_as(attn).to(attn.device)
            attns[torch.arange(B)[:, None, None], torch.arange(H)[None, :, None], index, :] = attn
            return context_in, attns
        else:
            return context_in, None
    
    def forward(self, queries, keys, values, attn_mask=None):
        """
        Forward pass of ProbSparse Attention
        
        Parameters:
        -----------
        queries : [B, L_Q, H, D]
        keys : [B, L_K, H, D]
        values : [B, L_V, H, D]
        """
        B, L_Q, H, D = queries.shape
        _, L_K, _, _ = keys.shape
        
        queries = queries.transpose(2, 1)  # [B, H, L_Q, D]
        keys = keys.transpose(2, 1)        # [B, H, L_K, D]
        values = values.transpose(2, 1)    # [B, H, L_V, D]
        
        # Calculate number of queries to select
        U_part = self.factor * np.ceil(np.log(L_K)).astype('int').item()
        u = self.factor * np.ceil(np.log(L_Q)).astype('int').item()
        
        U_part = U_part if U_part < L_K else L_K
        u = u if u < L_Q else L_Q
        
        # Calculate sparse attention
        scores_top, index = self._prob_QK(queries, keys, sample_k=U_part, n_top=u)
        
        # Scaling (defaults to 1/sqrt(D) when no scale is given)
        scale = self.scale or 1.0 / np.sqrt(D)
        scores_top = scores_top * scale
        
        # Get context
        context = self._get_initial_context(values, L_Q)
        context, attn = self._update_context(context, values, scores_top, index, L_Q)
        
        return context.transpose(2, 1).contiguous(), attn


class ProbMask:
    """Mask for ProbSparse Attention"""
    def __init__(self, B, H, L, index, scores, device="cpu"):
        _mask = torch.ones(L, scores.shape[-1], dtype=torch.bool).to(device).triu(1)
        _mask_ex = _mask[None, None, :].expand(B, H, L, scores.shape[-1])
        indicator = _mask_ex[torch.arange(B)[:, None, None],
                             torch.arange(H)[None, :, None],
                             index, :].to(device)
        self.mask = indicator.view(scores.shape).to(device)

print("ProbSparse Attention defined!")

In [None]:
class AttentionLayer(nn.Module):
    """Multi-head attention wrapper"""
    def __init__(self, attention, d_model, n_heads, d_keys=None, d_values=None):
        super(AttentionLayer, self).__init__()
        
        d_keys = d_keys or (d_model // n_heads)
        d_values = d_values or (d_model // n_heads)
        
        self.inner_attention = attention
        self.query_projection = nn.Linear(d_model, d_keys * n_heads)
        self.key_projection = nn.Linear(d_model, d_keys * n_heads)
        self.value_projection = nn.Linear(d_model, d_values * n_heads)
        self.out_projection = nn.Linear(d_values * n_heads, d_model)
        self.n_heads = n_heads
    
    def forward(self, queries, keys, values, attn_mask=None):
        B, L, _ = queries.shape
        _, S, _ = keys.shape
        H = self.n_heads
        
        queries = self.query_projection(queries).view(B, L, H, -1)
        keys = self.key_projection(keys).view(B, S, H, -1)
        values = self.value_projection(values).view(B, S, H, -1)
        
        out, attn = self.inner_attention(queries, keys, values, attn_mask)
        out = out.view(B, L, -1)
        
        return self.out_projection(out), attn


class EncoderLayer(nn.Module):
    """Informer Encoder Layer with distilling"""
    def __init__(self, attention, d_model, d_ff=None, dropout=0.1, activation="relu"):
        super(EncoderLayer, self).__init__()
        d_ff = d_ff or 4 * d_model
        self.attention = attention
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu
    
    def forward(self, x, attn_mask=None):
        # Multi-head attention
        new_x, attn = self.attention(x, x, x, attn_mask=attn_mask)
        x = x + self.dropout(new_x)
        x = self.norm1(x)
        
        # Feed forward
        y = x
        y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))
        
        return self.norm2(x + y), attn


class ConvLayer(nn.Module):
    """Distilling layer: progressively reduce sequence length"""
    def __init__(self, c_in):
        super(ConvLayer, self).__init__()
        self.downConv = nn.Conv1d(
            in_channels=c_in,
            out_channels=c_in,
            kernel_size=3,
            padding=1,
            padding_mode='circular'
        )
        self.norm = nn.BatchNorm1d(c_in)
        self.activation = nn.ELU()
        self.maxPool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
    
    def forward(self, x):
        x = self.downConv(x.permute(0, 2, 1))
        x = self.norm(x)
        x = self.activation(x)
        x = self.maxPool(x)
        x = x.transpose(1, 2)
        return x


class Encoder(nn.Module):
    """Informer Encoder with distilling"""
    def __init__(self, attn_layers, conv_layers=None, norm_layer=None):
        super(Encoder, self).__init__()
        self.attn_layers = nn.ModuleList(attn_layers)
        self.conv_layers = nn.ModuleList(conv_layers) if conv_layers is not None else None
        self.norm = norm_layer
    
    def forward(self, x, attn_mask=None):
        attns = []
        if self.conv_layers is not None:
            for attn_layer, conv_layer in zip(self.attn_layers, self.conv_layers):
                x, attn = attn_layer(x, attn_mask=attn_mask)
                x = conv_layer(x)
                attns.append(attn)
            x, attn = self.attn_layers[-1](x)
            attns.append(attn)
        else:
            for attn_layer in self.attn_layers:
                x, attn = attn_layer(x, attn_mask=attn_mask)
                attns.append(attn)
        
        if self.norm is not None:
            x = self.norm(x)
        
        return x, attns

print("Encoder modules defined!")

In [None]:
class DecoderLayer(nn.Module):
    """Informer Decoder Layer"""
    def __init__(self, self_attention, cross_attention, d_model, d_ff=None,
                 dropout=0.1, activation="relu"):
        super(DecoderLayer, self).__init__()
        d_ff = d_ff or 4 * d_model
        self.self_attention = self_attention
        self.cross_attention = cross_attention
        self.conv1 = nn.Conv1d(in_channels=d_model, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=d_model, kernel_size=1)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu
    
    def forward(self, x, cross, x_mask=None, cross_mask=None):
        # Self attention
        x = x + self.dropout(self.self_attention(x, x, x, attn_mask=x_mask)[0])
        x = self.norm1(x)
        
        # Cross attention
        x = x + self.dropout(self.cross_attention(x, cross, cross, attn_mask=cross_mask)[0])
        x = self.norm2(x)
        
        # Feed forward
        y = x
        y = self.dropout(self.activation(self.conv1(y.transpose(-1, 1))))
        y = self.dropout(self.conv2(y).transpose(-1, 1))
        
        return self.norm3(x + y)


class Decoder(nn.Module):
    """Informer Decoder"""
    def __init__(self, layers, norm_layer=None):
        super(Decoder, self).__init__()
        self.layers = nn.ModuleList(layers)
        self.norm = norm_layer
    
    def forward(self, x, cross, x_mask=None, cross_mask=None):
        for layer in self.layers:
            x = layer(x, cross, x_mask=x_mask, cross_mask=cross_mask)
        
        if self.norm is not None:
            x = self.norm(x)
        
        return x

print("Decoder modules defined!")

In [None]:
class Informer(nn.Module):
    """
    Complete Informer Model
    
    Parameters:
    -----------
    enc_in : int
        Number of encoder input features
    dec_in : int
        Number of decoder input features
    c_out : int
        Number of output features
    seq_len : int
        Input sequence length
    label_len : int
        Start token length for decoder
    out_len : int
        Output sequence length
    factor : int
        ProbSparse attention factor
    d_model : int
        Model dimension
    n_heads : int
        Number of attention heads
    e_layers : int
        Number of encoder layers
    d_layers : int
        Number of decoder layers
    d_ff : int
        Feed-forward dimension
    dropout : float
        Dropout rate
    attn : str
        Attention type ('prob' for ProbSparse)
    activation : str
        Activation function
    output_attention : bool
        Whether to output attention weights
    distil : bool
        Whether to use distilling
    """
    def __init__(self, enc_in, dec_in, c_out, seq_len, label_len, out_len,
                 factor=5, d_model=512, n_heads=8, e_layers=3, d_layers=2,
                 d_ff=512, dropout=0.0, attn='prob', activation='gelu',
                 output_attention=False, distil=True, device=torch.device('cuda:0')):
        super(Informer, self).__init__()
        self.pred_len = out_len
        self.attn = attn
        self.output_attention = output_attention
        
        # Encoding
        self.enc_embedding = DataEmbedding(enc_in, d_model, dropout)
        self.dec_embedding = DataEmbedding(dec_in, d_model, dropout)
        
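        # Attention: this implementation always uses ProbAttention, including
        # for the decoder's cross-attention; `attn` is stored but does not
        # switch between attention types
        Attn = ProbAttention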
        
        # Encoder
        self.encoder = Encoder(
            [
                EncoderLayer(
                    AttentionLayer(
                        Attn(False, factor, attention_dropout=dropout, output_attention=output_attention),
                        d_model, n_heads),
                    d_model,
                    d_ff,
                    dropout=dropout,
                    activation=activation
                ) for _ in range(e_layers)
            ],
            [
                ConvLayer(d_model) for _ in range(e_layers - 1)
            ] if distil else None,
            norm_layer=torch.nn.LayerNorm(d_model)
        )
        
        # Decoder
        self.decoder = Decoder(
            [
                DecoderLayer(
                    AttentionLayer(
                        Attn(True, factor, attention_dropout=dropout, output_attention=False),
                        d_model, n_heads),
                    AttentionLayer(
                        Attn(False, factor, attention_dropout=dropout, output_attention=False),
                        d_model, n_heads),
                    d_model,
                    d_ff,
                    dropout=dropout,
                    activation=activation,
                ) for _ in range(d_layers)
            ],
            norm_layer=torch.nn.LayerNorm(d_model)
        )
        
        # Projection
        self.projection = nn.Linear(d_model, c_out, bias=True)
    
    def forward(self, x_enc, x_mark_enc, x_dec, x_mark_dec,
                enc_self_mask=None, dec_self_mask=None, dec_enc_mask=None):
        """
        Forward pass
        
        Parameters:
        -----------
        x_enc : [batch, seq_len, enc_in]
        x_mark_enc : [batch, seq_len, 4]
        x_dec : [batch, label_len + pred_len, dec_in]
        x_mark_dec : [batch, label_len + pred_len, 4]
        
        Returns:
        --------
        output : [batch, pred_len, c_out]
        """
        # Encoder
        enc_out = self.enc_embedding(x_enc, x_mark_enc)
        enc_out, attns = self.encoder(enc_out, attn_mask=enc_self_mask)
        
        # Decoder
        dec_out = self.dec_embedding(x_dec, x_mark_dec)
        dec_out = self.decoder(dec_out, enc_out, x_mask=dec_self_mask, cross_mask=dec_enc_mask)
        dec_out = self.projection(dec_out)
        
        if self.output_attention:
            return dec_out[:, -self.pred_len:, :], attns
        else:
            return dec_out[:, -self.pred_len:, :]  # [B, pred_len, c_out]

print("Complete Informer model defined!")

## Step 7: Initialize Model

In [None]:
# Model hyperparameters
model_config = {
    'enc_in': len(feature_cols),      # 6 features
    'dec_in': len(feature_cols),      # 6 features
    'c_out': 1,                       # 1 target (OT)
    'seq_len': seq_len,               # 96
    'label_len': label_len,           # 48
    'out_len': pred_len,              # 24
    'factor': 5,                      # ProbSparse factor
    'd_model': 512,                   # Model dimension
    'n_heads': 8,                     # Attention heads
    'e_layers': 2,                    # Encoder layers
    'd_layers': 1,                    # Decoder layers
    'd_ff': 2048,                     # Feed-forward dimension
    'dropout': 0.05,                  # Dropout
    'attn': 'prob',                   # ProbSparse attention
    'activation': 'gelu',             # Activation
    'output_attention': False,        # Don't output attention
    'distil': True,                   # Use distilling
    'device': device
}

# Initialize model
model = Informer(**model_config).to(device)

print("Model initialized!")
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

## Step 8: Training Configuration and Loop

In [None]:
# Training configuration
num_epochs = 20
learning_rate = 1e-4
patience = 5

# Optimizer and loss
optimizer = Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()
# `verbose` is omitted because it is deprecated in recent PyTorch releases
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3)

# Training history
history = {
    'train_loss': [],
    'test_loss': []
}

best_test_loss = float('inf')
patience_counter = 0

print("Training configuration:")
print(f"  Epochs: {num_epochs}")
print(f"  Learning rate: {learning_rate}")
print(f"  Batch size: {batch_size}")
print(f"  Device: {device}")
print(f"\nStarting training...\n")

In [None]:
# Training loop
for epoch in range(num_epochs):
    # Training
    model.train()
    train_loss = 0.0
    
    for batch_idx, (seq_x, seq_x_mark, seq_y, seq_y_mark, target) in enumerate(train_loader):
        # Move to device
        seq_x = seq_x.to(device)
        seq_x_mark = seq_x_mark.to(device)
        seq_y = seq_y.to(device)
        seq_y_mark = seq_y_mark.to(device)
        target = target.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        output = model(seq_x, seq_x_mark, seq_y, seq_y_mark)
        
        # Calculate loss (only on prediction part)
        loss = criterion(output, target[:, -pred_len:, :])
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item()
    
    avg_train_loss = train_loss / len(train_loader)
    history['train_loss'].append(avg_train_loss)
    
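    # Evaluation on the test split. Note: this split also drives the scheduler,
    # early stopping and checkpoint selection, so the final test metrics are
    # optimistically biased.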
    model.eval()
    test_loss = 0.0
    
    with torch.no_grad():
        for seq_x, seq_x_mark, seq_y, seq_y_mark, target in test_loader:
            seq_x = seq_x.to(device)
            seq_x_mark = seq_x_mark.to(device)
            seq_y = seq_y.to(device)
            seq_y_mark = seq_y_mark.to(device)
            target = target.to(device)
            
            output = model(seq_x, seq_x_mark, seq_y, seq_y_mark)
            loss = criterion(output, target[:, -pred_len:, :])
            test_loss += loss.item()
    
    avg_test_loss = test_loss / len(test_loader)
    history['test_loss'].append(avg_test_loss)
    
    # Learning rate scheduling
    scheduler.step(avg_test_loss)
    
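    # Print progress, including the current learning rate
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch [{epoch+1}/{num_epochs}] - "
          f"Train Loss: {avg_train_loss:.6f}, "
          f"Test Loss: {avg_test_loss:.6f}, "
          f"LR: {current_lr:.2e}")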
    
    # Early stopping
    if avg_test_loss < best_test_loss:
        best_test_loss = avg_test_loss
        patience_counter = 0
        # Save best model
        torch.save(model.state_dict(), 'best_informer_model.pth')
        print(f"  → New best model saved (Test Loss: {best_test_loss:.6f})")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"\nEarly stopping triggered after {epoch+1} epochs")
            break

print("\nTraining completed!")
print(f"Best test loss: {best_test_loss:.6f}")

# Restore the best checkpoint (map_location keeps it loadable on CPU-only machines)
model.load_state_dict(torch.load('best_informer_model.pth', map_location=device))
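
The loop above uses the test loader both to pick the best checkpoint and to report the final score. One common alternative is to hold out a separate validation range in time order and reserve the test range for the final evaluation only. The sketch below shows one way to compute such ranges; the function name, the 70/10/20 proportions and the row count are assumptions, not values taken from this notebook.

```python
def chronological_split(n_rows, seq_len, train_frac=0.7, val_frac=0.1):
    """Return (start, end) row ranges for train/val/test in time order.

    The val and test ranges start seq_len rows early so that their first
    window has a full input history.
    """
    train_end = round(n_rows * train_frac)
    val_end = train_end + round(n_rows * val_frac)
    return {
        "train": (0, train_end),
        "val": (max(train_end - seq_len, 0), val_end),
        "test": (max(val_end - seq_len, 0), n_rows),
    }

splits = chronological_split(n_rows=1000, seq_len=96)
print(splits)
```

With ranges like these, the validation loss would feed `scheduler.step(...)` and the early-stopping check, and the test loader would be touched once, after the best checkpoint is restored.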

## Step 9: Training History Visualization

In [None]:
plt.figure(figsize=(12, 5))
plt.plot(history['train_loss'], label='Training Loss', linewidth=2)
plt.plot(history['test_loss'], label='Test Loss', linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Informer Model Training History', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Step 10: Model Evaluation and Predictions

In [None]:
def make_predictions(model, data_loader, scaler_y, device):
    """
    Make predictions and inverse transform to original scale
    """
    model.eval()
    predictions = []
    actuals = []
    
    with torch.no_grad():
        for seq_x, seq_x_mark, seq_y, seq_y_mark, target in data_loader:
            seq_x = seq_x.to(device)
            seq_x_mark = seq_x_mark.to(device)
            seq_y = seq_y.to(device)
            seq_y_mark = seq_y_mark.to(device)
            
            output = model(seq_x, seq_x_mark, seq_y, seq_y_mark)
            
            # Move to CPU and convert to numpy
            pred = output.cpu().numpy()
            actual = target[:, -pred_len:, :].cpu().numpy()
            
            predictions.append(pred)
            actuals.append(actual)
    
    # Concatenate all batches
    predictions = np.concatenate(predictions, axis=0)  # [N, pred_len, 1]
    actuals = np.concatenate(actuals, axis=0)          # [N, pred_len, 1]
    
    # Reshape for inverse transform
    pred_shape = predictions.shape
    predictions = predictions.reshape(-1, 1)
    actuals = actuals.reshape(-1, 1)
    
    # Inverse transform
    predictions = scaler_y.inverse_transform(predictions)
    actuals = scaler_y.inverse_transform(actuals)
    
    # Reshape back
    predictions = predictions.reshape(pred_shape)
    actuals = actuals.reshape(pred_shape)
    
    return predictions, actuals

# Make predictions
print("Making predictions...")
train_pred, train_actual = make_predictions(model, train_loader, train_dataset.scaler_y, device)
test_pred, test_actual = make_predictions(model, test_loader, test_dataset.scaler_y, device)

print(f"\nPrediction shapes:")
print(f"  Train: {train_pred.shape}")
print(f"  Test: {test_pred.shape}")

# For metric calculation, keep only the last step of each predicted window,
# so the scores describe the pred_len-step-ahead forecast rather than the
# average over the whole prediction horizon
train_pred_flat = train_pred[:, -1, 0]  # Last prediction of each sequence
train_actual_flat = train_actual[:, -1, 0]
test_pred_flat = test_pred[:, -1, 0]
test_actual_flat = test_actual[:, -1, 0]

print(f"\nFlattened for metrics:")
print(f"  Train: {train_pred_flat.shape}")
print(f"  Test: {test_pred_flat.shape}")
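
Scoring only the last step hides how the error grows along the horizon. A per-step breakdown is one way to expose that. The sketch below is written in plain Python on toy values so it runs without the notebook's state; the function name is hypothetical. For arrays shaped `[N, pred_len, 1]` such as `test_pred` and `test_actual`, the NumPy equivalent would be `np.abs(test_pred - test_actual).mean(axis=(0, 2))`.

```python
def mae_per_horizon_step(predictions, actuals):
    """Mean absolute error at each forecast step.

    predictions, actuals: nested lists shaped [n_windows][pred_len].
    Returns a list with one MAE per step of the horizon.
    """
    n_windows = len(predictions)
    horizon = len(predictions[0])
    return [
        sum(abs(predictions[i][t] - actuals[i][t]) for i in range(n_windows)) / n_windows
        for t in range(horizon)
    ]

# Toy data: two windows, three-step horizon, error growing with the step
toy_pred = [[10.0, 11.0, 12.0], [20.0, 21.0, 23.0]]
toy_actual = [[10.0, 10.0, 10.0], [20.0, 20.0, 20.0]]
print(mae_per_horizon_step(toy_pred, toy_actual))  # [0.0, 1.0, 2.5]
```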

In [None]:
def calculate_metrics(y_true, y_pred, set_name='Test'):
    """
    Calculate and display regression metrics
    """
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
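    # MAPE is undefined where the true value is 0, so exclude those targets.
    # It also inflates for targets close to 0°C; read it alongside MAE.
    nonzero = np.abs(y_true) > 1e-8
    mape = np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])) * 100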
    
    print(f"\n{set_name} Set Performance Metrics:")
    print("=" * 50)
    print(f"MSE (Mean Squared Error):        {mse:.4f}")
    print(f"RMSE (Root Mean Squared Error):  {rmse:.4f}°C")
    print(f"MAE (Mean Absolute Error):       {mae:.4f}°C")
    print(f"R² Score:                        {r2:.4f}")
    print(f"MAPE (Mean Absolute % Error):    {mape:.2f}%")
    print("=" * 50)
    
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R2': r2, 'MAPE': mape}

# Calculate metrics
train_metrics = calculate_metrics(train_actual_flat, train_pred_flat, 'Training')
test_metrics = calculate_metrics(test_actual_flat, test_pred_flat, 'Test')
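
MAPE divides each error by the true value, so a single target near 0°C can dominate the average. The toy example below illustrates the effect and one possible mitigation, which is to skip targets whose magnitude falls below a threshold. The function, the threshold of 1.0 and the sample values are assumptions chosen for illustration, not results from this notebook.

```python
def mape(y_true, y_pred, eps=None):
    """Mean absolute percentage error in percent.

    If eps is given, targets with |y_true| < eps are skipped.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if eps is None or abs(t) >= eps]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)

# Every prediction is off by 1°C, but one target sits close to zero
actual = [30.0, 25.0, 0.1]
predicted = [29.0, 26.0, 1.1]

print(f"Plain MAPE:  {mape(actual, predicted):.2f}%")
print(f"Masked MAPE: {mape(actual, predicted, eps=1.0):.2f}%")
```

The absolute error is identical for all three points, yet the plain MAPE is in the hundreds of percent while the masked value stays in single digits.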

## Step 11: Results Visualization

In [None]:
# Plot predictions vs actual
fig, axes = plt.subplots(2, 1, figsize=(15, 10))

# Test set - full view
axes[0].plot(test_actual_flat, label='Actual', alpha=0.7, linewidth=1.5)
axes[0].plot(test_pred_flat, label='Predicted', alpha=0.7, linewidth=1.5)
axes[0].set_title('Informer: Oil Temperature Prediction - Test Set (Full View)', fontsize=14)
axes[0].set_xlabel('Sample')
axes[0].set_ylabel('Oil Temperature (°C)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Test set - zoomed view
zoom_range = 500
axes[1].plot(test_actual_flat[:zoom_range], label='Actual', alpha=0.7, linewidth=1.5)
axes[1].plot(test_pred_flat[:zoom_range], label='Predicted', alpha=0.7, linewidth=1.5)
axes[1].set_title(f'Informer: Oil Temperature Prediction - Test Set (First {zoom_range} Samples)', fontsize=14)
axes[1].set_xlabel('Sample')
axes[1].set_ylabel('Oil Temperature (°C)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(test_actual_flat, test_pred_flat, alpha=0.5, s=20)
plt.plot([test_actual_flat.min(), test_actual_flat.max()],
         [test_actual_flat.min(), test_actual_flat.max()],
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Oil Temperature (°C)', fontsize=12)
plt.ylabel('Predicted Oil Temperature (°C)', fontsize=12)
plt.title('Informer: Predicted vs Actual Oil Temperature (Test Set)', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Residual analysis
residuals = test_actual_flat - test_pred_flat

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Residual plot
axes[0].scatter(test_pred_flat, residuals, alpha=0.5, s=20)
axes[0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0].set_xlabel('Predicted Oil Temperature (°C)')
axes[0].set_ylabel('Residuals (°C)')
axes[0].set_title('Residual Plot')
axes[0].grid(True, alpha=0.3)

# Residual distribution
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residuals (°C)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Residuals')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Residual Statistics:")
print(f"Mean: {residuals.mean():.4f}°C")
print(f"Std: {residuals.std():.4f}°C")
print(f"Min: {residuals.min():.4f}°C")
print(f"Max: {residuals.max():.4f}°C")

### Multi-step Prediction Visualization

Visualize how well the model predicts across the entire prediction horizon.

In [None]:
# Visualize multi-step predictions for a few samples
n_samples = 5
sample_indices = np.random.choice(len(test_pred), n_samples, replace=False)

fig, axes = plt.subplots(n_samples, 1, figsize=(12, 3*n_samples))

for idx, sample_idx in enumerate(sample_indices):
    ax = axes[idx] if n_samples > 1 else axes
    
    time_steps = np.arange(pred_len)
    actual = test_actual[sample_idx, :, 0]
    predicted = test_pred[sample_idx, :, 0]
    
    ax.plot(time_steps, actual, 'o-', label='Actual', linewidth=2, markersize=6)
    ax.plot(time_steps, predicted, 's-', label='Predicted', linewidth=2, markersize=6)
    ax.set_xlabel('Prediction Step')
    ax.set_ylabel('Oil Temperature (°C)')
    ax.set_title(f'Sample {sample_idx}: {pred_len}-step Ahead Prediction')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 12: Conclusion and Model Comparison

In [None]:
# Create summary comparison
summary_df = pd.DataFrame({
    'Metric': ['MSE', 'RMSE (°C)', 'MAE (°C)', 'R²', 'MAPE (%)'],
    'Training': [
        train_metrics['MSE'],
        train_metrics['RMSE'],
        train_metrics['MAE'],
        train_metrics['R2'],
        train_metrics['MAPE']
    ],
    'Test': [
        test_metrics['MSE'],
        test_metrics['RMSE'],
        test_metrics['MAE'],
        test_metrics['R2'],
        test_metrics['MAPE']
    ]
})

print("\n" + "=" * 70)
print("INFORMER MODEL PERFORMANCE SUMMARY")
print("=" * 70)
print(summary_df.to_string(index=False))
print("=" * 70)

print("\n### Key Observations:")
print(f"1. The Informer achieves an R² score of {test_metrics['R2']:.4f} on the test set")
print(f"2. Average prediction error (MAE): {test_metrics['MAE']:.4f}°C")
print(f"3. MAPE: {test_metrics['MAPE']:.2f}%")

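# Overfitting shows up as training R² exceeding test R²; a higher test R² is not overfitting
r2_gap = train_metrics['R2'] - test_metrics['R2']
if r2_gap < 0.05:
    print("4. Model shows good generalization (train/test R² gap below 0.05)")
else:
    print("4. Some overfitting detected - consider more regularization")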

print("\n### Model Architecture:")
print(f"- Input Sequence Length: {seq_len} steps ({seq_len * 15 / 60:g} hours at 15-minute sampling)")
print(f"- Prediction Horizon: {pred_len} steps ({pred_len * 15 / 60:g} hours at 15-minute sampling)")
print(f"- Model Dimension: {model_config['d_model']}")
print(f"- Attention Heads: {model_config['n_heads']}")
print(f"- Encoder Layers: {model_config['e_layers']}")
print(f"- Decoder Layers: {model_config['d_layers']}")
print(f"- Total Parameters: {sum(p.numel() for p in model.parameters()):,}")

print("\n### Informer Advantages:")
print("1. ProbSparse Attention: O(L log L) complexity vs O(L²) in vanilla Transformer")
print("2. Self-Attention Distilling: Progressive sequence-length reduction between encoder layers")
print("3. Generative Decoder: One-shot long sequence prediction")
print("4. Explicitly models temporal patterns with time features")
print("5. Best reported results on the ETT dataset at publication (AAAI 2021)")

print("\n### Comparison with Other Models:")
print("Expected ranking on the ETT dataset (a prior expectation, not measured in this notebook):")
print("1. Informer - designed for long sequences")
print("2. LSTM/GRU - good sequential modeling")
print("3. MLP - fast but limited temporal modeling")
print("4. Random Forest - handles nonlinear patterns, no native sequence modeling")
print("5. Linear Regression - baseline")

print("\n### Why Informer Can Outperform RNNs:")
print("1. RNNs can suffer from vanishing gradients on long sequences")
print("2. Attention links distant time steps directly, which helps with long-range dependencies")
print("3. Parallel computation vs sequential RNN processing")
print("4. Explicit temporal encoding enhances pattern recognition")
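
Any ranking against other models is easier to trust when it is anchored to a trivial reference point. A persistence forecast, which repeats the last observed value across the horizon, is a common sanity check for slowly varying signals such as oil temperature. The sketch below uses toy values and hypothetical helper names; it is not a result from this notebook.

```python
def persistence_forecast(history, horizon):
    """Repeat the last observed value for every future step."""
    return [history[-1]] * horizon

def mean_absolute_error_1d(y_true, y_pred):
    """Plain-Python MAE for two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

toy_history = [30.0, 30.5, 31.0]       # observed oil temperatures (°C)
toy_future = [31.5, 32.0, 32.5, 33.0]  # values the forecast should match

baseline = persistence_forecast(toy_history, horizon=len(toy_future))
print(mean_absolute_error_1d(toy_future, baseline))  # 1.25
```

A model's test MAE at a given horizon is informative mainly to the extent that it beats this kind of baseline at the same horizon.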

## Optional: Save Model and Predictions

In [None]:
# Save predictions for further analysis
# results = {
#     'test_actual': test_actual_flat,
#     'test_pred': test_pred_flat,
#     'test_metrics': test_metrics
# }
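# np.save('informer_results.npy', results)  # a dict is stored as a pickled object array
# print("Results saved to informer_results.npy")
# Reload with: np.load('informer_results.npy', allow_pickle=True).item()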

# Model is already saved as 'best_informer_model.pth'