# Looking at the Informer model 

In what follows we look at the Informer model, its parts, and the general idea.<br>

The Informer model was designed for efficient long-sequence time-series forecasting.<br>
It improves on the standard Transformer architecture so that long sequences can be handled more efficiently.<br>

The model identifies deviations in patterns over time by forecasting expected values and comparing them to actual observations.<br>


The model consists of:

 -   **Encoder**: Processes the input time series data to capture temporal dependencies.
 -   **Decoder**: Generates future time steps based on the encoded information.
 -   **ProbSparse Self-Attention Mechanism**: Handles long sequences efficiently by computing attention only for the most informative queries.
 -   **Embedding Layers**: Convert raw input data and temporal features into a format suitable for the model.
 
 
Anomalies are detected when the actual observations deviate significantly from the forecast values.<br>
Embedding temporal features helps the model learn and anticipate regular patterns, which makes irregularities easier to spot.
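
As a rough illustration of the comparison step, the sketch below flags points whose absolute forecast error exceeds a threshold calibrated on a window that is assumed to be free of anomalies. The function names and the mean-plus-k-sigma rule are assumptions made for this example; they are not part of the Informer code.

```python
from statistics import mean, stdev

def residual_threshold(calib_actual, calib_forecast, k=3.0):
    """Threshold = mean + k * std of absolute errors on a calibration window.

    The k-sigma rule is an illustrative assumption, not the project's setting.
    """
    errors = [abs(a - f) for a, f in zip(calib_actual, calib_forecast)]
    return mean(errors) + k * stdev(errors)

def flag_anomalies(actual, forecast, threshold):
    """Return the indices whose absolute forecast error exceeds the threshold."""
    return [i for i, (a, f) in enumerate(zip(actual, forecast))
            if abs(a - f) > threshold]

# Calibration window: the forecast tracks the series closely.
threshold = residual_threshold(
    calib_actual=[10.1, 10.4, 11.2, 10.7, 10.3],
    calib_forecast=[10.0, 10.5, 11.0, 10.8, 10.2],
)

# New window: the second observation jumps far away from its forecast.
anomalies = flag_anomalies(
    actual=[10.1, 18.0, 10.8],
    forecast=[10.0, 10.4, 10.9],
    threshold=threshold,
)
print(anomalies)
```

With these toy values only the second point of the new window (index 1) is flagged.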

## 1. Informer Class

Check **model.py**

This is the main class that brings together the encoder, decoder, and embedding layers.

Components:

 -    Embedding: Prepares the input data for the encoder and the decoder.
 -    Encoder: Processes input sequences.
 -    Decoder: Generates output sequences.
 -    Projection Layer: Maps the decoder output to the output dimensions.

```python
class Informer(nn.Module):
    def __init__(self, enc_in, dec_in, c_out, seq_len, label_len, out_len,
                 factor=5, d_model=512, n_heads=8, e_layers=3, d_layers=2, d_ff=512,
                 dropout=0.0, attn='prob', embed='fixed', freq='h', activation='gelu',
                 output_attention=False, distil=True, mix=True,
                 device=torch.device('cuda:0')):
        ...

    def forward(self, x_enc, x_mark_enc, x_dec, x_mark_dec, ...):
        ...
```

Like all the classes that follow, it inherits from **nn.Module**, which makes it a PyTorch neural network module.

**Parameters**

- enc_in: Number of input features for the encoder.
- dec_in: Number of input features for the decoder.
- c_out: Number of output features (usually equal to dec_in).
- seq_len: Length of the input sequence to the encoder.
- label_len: Length of the known output sequence (used as input to the decoder).
- out_len: Length of the output sequence to predict.

**Hyperparameters**

- factor: Controls the sparsity in ProbSparse attention.
- d_model: Dimension of the model (embedding size).
- n_heads: Number of attention heads.
- e_layers: Number of encoder layers.
- d_layers: Number of decoder layers.
- d_ff: Dimension of the feed-forward networks (conventionally 4 times d_model in Transformers; the default here is 512, equal to d_model).
- dropout: Dropout rate for regularization.
- attn: Type of attention mechanism ('prob' for ProbSparse, 'full' for full attention).
- embed: Type of embedding ('fixed' or 'timeF').
- freq: Granularity of time features ('h' for hourly, etc.).
- activation: Activation function used in feed-forward networks.
- output_attention: Whether to output attention weights (useful for visualization).
- distil: Whether to use distilling in the encoder (using convolutional layers to reduce sequence length).
- mix: Whether to use mix attention in the decoder.
- device: The device on which to run the model (CPU or GPU).

## Components Initialization

### a. Embedding Layers

```python
self.enc_embedding = DataEmbedding(enc_in, d_model, embed, freq, dropout)
self.dec_embedding = DataEmbedding(dec_in, d_model, embed, freq, dropout)
```

Convert raw input data (x_enc, x_dec) and their corresponding time features (x_mark_enc, x_mark_dec) into embeddings.

- Value Embedding: Embeds the raw time series values.
- Position Embedding: Adds positional information to the embeddings.
- Temporal Embedding: Adds time-related features to the embeddings.

### b. Attention Mechanism Selection

```python
Attn = ProbAttention if attn == 'prob' else FullAttention
```

Chooses the attention mechanism based on the attn parameter.
- ProbAttention: Efficient attention mechanism for long sequences.
- FullAttention: Standard attention mechanism.

### c. Encoder Initialization

```python
self.encoder = Encoder(
    [
        EncoderLayer(
            AttentionLayer(
                Attn(False, factor, attention_dropout=dropout, output_attention=output_attention),
                d_model, n_heads, mix=False
            ),
            d_model,
            d_ff,
            dropout=dropout,
            activation=activation
        ) for l in range(e_layers)
    ],
    [
        ConvLayer(d_model) for l in range(e_layers - 1)
    ] if distil else None,
    norm_layer=torch.nn.LayerNorm(d_model)
)
```

- Encoder Layers: A list of EncoderLayer instances.
- Convolutional Layers (ConvLayer): Optional layers for downsampling (used if distil is True).
- Normalization Layer: Applies layer normalization after the encoder stack.


Details:
- AttentionLayer: Applies self-attention.
  - Attn: The chosen attention mechanism (ProbAttention or FullAttention).
  - Parameters:
    - mask_flag: Indicates whether to apply attention masking.
    - factor: Controls sparsity in ProbAttention.
    - attention_dropout: Dropout rate in the attention mechanism.
    - output_attention: Whether to output attention weights.
    
- Feed-Forward Network (FFN):
  - Conv1d Layers: Two convolutional layers acting as position-wise FFN.
  - Activation Function: Specified by *activation* parameter (e.g., ReLU, GELU).
  
- Layer Normalization and Dropout: Applied after attention and FFN.



### d. Decoder Initialization

```python
self.decoder = Decoder(
    [
        DecoderLayer(
            AttentionLayer(
                Attn(True, factor, attention_dropout=dropout, output_attention=False),
                d_model, n_heads, mix=mix
            ),
            AttentionLayer(
                FullAttention(False, factor, attention_dropout=dropout, output_attention=False),
                d_model, n_heads, mix=False
            ),
            d_model,
            d_ff,
            dropout=dropout,
            activation=activation,
        )
        for l in range(d_layers)
    ],
    norm_layer=torch.nn.LayerNorm(d_model)
)
```

- Decoder Layers: A list of DecoderLayer instances.
- Normalization Layer: Applies layer normalization after the decoder stack.


Details:
- Self-Attention Layer: Processes decoder inputs.
    - Attn: The chosen attention mechanism, with mask_flag=True to prevent attending to future positions.

- Cross-Attention Layer: Attends over the encoder outputs.
    - FullAttention: Always uses full attention for cross-attention.
- Feed-Forward Network (FFN): Similar to the encoder's FFN.
- Layer Normalization and Dropout: Applied after each sub-layer.



### e. Projection Layer

```python
self.projection = nn.Linear(d_model, c_out, bias=True)
```

Maps the output of the decoder to the desired output dimensions.

## Forward Method

```python
def forward(self, x_enc, x_mark_enc, x_dec, x_mark_dec, ...):
    ...
```

This method defines how data flows through the model.

**Inputs**
 -   x_enc: Input sequence to the encoder (shape: [batch_size, seq_len, enc_in]).
 -   x_mark_enc: Time features corresponding to x_enc (shape: [batch_size, seq_len, num_time_features]).
 -   x_dec: Input sequence to the decoder (shape: [batch_size, label_len + out_len, dec_in]).
 -   x_mark_dec: Time features corresponding to x_dec (shape: [batch_size, label_len + out_len, num_time_features]).
 -   enc_self_mask, dec_self_mask, dec_enc_mask: Optional attention masks.
 -   viz_data: Optional data for visualization (not used now).
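
The following sketch shows one way the encoder and decoder inputs could be cut from a series. It assumes the convention of the original Informer implementation, where the decoder input is the last label_len observed steps followed by out_len zero placeholders; `make_windows` is a hypothetical helper and the series is univariate to keep the example short.

```python
def make_windows(series, start, seq_len, label_len, out_len):
    """Slice one sample (x_enc, x_dec, y_true) from a univariate series."""
    enc_end = start + seq_len
    x_enc = series[start:enc_end]                   # encoder input: seq_len steps
    known = series[enc_end - label_len:enc_end]     # last label_len observed steps
    x_dec = known + [0.0] * out_len                 # zero placeholders for the horizon
    y_true = series[enc_end:enc_end + out_len]      # values the model should predict
    return x_enc, x_dec, y_true

series = [float(t) for t in range(20)]
x_enc, x_dec, y_true = make_windows(series, start=0, seq_len=8, label_len=4, out_len=3)
print(len(x_enc), len(x_dec), y_true)
```

Under this assumption the decoder sees label_len + out_len time steps, of which only the last out_len are unknown.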

**Steps**

#### a. Embedding

```python
enc_out = self.enc_embedding(x_enc, x_mark_enc)
dec_out = self.dec_embedding(x_dec, x_mark_dec)
```

Convert raw inputs and time features into embeddings.
 - Value Embedding: Embed the input values.
 - Position Embedding: Add positional information.
 - Temporal Embedding: Embed time features.
 - Dropout: Apply dropout to prevent overfitting.
 
 
 
#### b. Encoding

```python
enc_out, attns = self.encoder(enc_out, attn_mask=enc_self_mask)
```

Process the encoder embeddings to capture temporal dependencies.

 - Encoder Layers: Each layer applies self-attention and feed-forward network.
 - Optional Convolutional Layers: Reduce sequence length and capture local dependencies (if distil is True).
 - Attention Masks: Can be applied to prevent attention to certain positions.
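
To see how distilling shortens the sequence, the sketch below lists the sequence length seen by each encoder layer. It assumes that every ConvLayer halves the length (rounding up); the exact value depends on the padding and pooling settings in **encoder.py**, so treat the numbers as indicative. `encoder_lengths` is a hypothetical helper.

```python
import math

def encoder_lengths(seq_len, e_layers, distil=True):
    """Sequence length entering each encoder layer.

    Assumption: each of the e_layers - 1 ConvLayers halves the length (rounded up).
    """
    lengths = [seq_len]
    for _ in range(e_layers - 1):
        previous = lengths[-1]
        lengths.append(math.ceil(previous / 2) if distil else previous)
    return lengths

lengths = encoder_lengths(seq_len=96, e_layers=3)
print(lengths)
```

With the default e_layers=3 there are two ConvLayers, so under this assumption a 96-step input reaches the last encoder layer with about a quarter of its original length.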
 
 
#### c. Decoding

```python
dec_out = self.decoder(dec_out, enc_out, x_mask=dec_self_mask, cross_mask=dec_enc_mask)
```

Generate predictions by attending over encoder outputs and decoder inputs.

 - Self-Attention in Decoder: Processes the decoder inputs, using causal masking to prevent access to future positions.
 - Cross-Attention: Allows the decoder to attend to encoder outputs.
 - Feed-Forward Network: Further processes the combined information.
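
The causal masking used in the decoder's self-attention can be illustrated without PyTorch. This is a minimal sketch for a single head, assuming the mask is applied by setting the scores of future positions to negative infinity before the softmax; `causal_softmax` is a hypothetical name.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
weights = causal_softmax(scores)
for row in weights:
    print(row)
```

Each row still sums to one, but the entries above the diagonal are exactly zero, so a time step never receives information from later steps.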
 
 
#### d. Projection

```python
dec_out = self.projection(dec_out)
```

Map the decoder output to the desired output dimension (*c_out*).


 
#### e. Output

```python
if self.output_attention:
    return dec_out[:, -self.pred_len:, :], attns
else:
    return dec_out[:, -self.pred_len:, :]
```

Return the final predictions (and optionally attention weights).

- dec_out[:, -self.pred_len:, :]: Extracts the last pred_len time steps from the decoder output (pred_len corresponds to the out_len constructor argument).
- attns: If output_attention is True, returns attention weights for analysis or visualization.
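
The slicing step can be mimicked with nested lists. The sketch assumes a single sample shaped [time, features] whose decoder output covers label_len + pred_len steps; `last_steps` is a hypothetical helper.

```python
def last_steps(dec_out, pred_len):
    """Keep the last pred_len time steps of one sample shaped [time, features]."""
    return dec_out[-pred_len:]

label_len, pred_len = 4, 3
# One sample with label_len + pred_len time steps and a single output feature.
dec_out = [[float(t)] for t in range(label_len + pred_len)]
forecast = last_steps(dec_out, pred_len)
print(forecast)
```

The first label_len steps correspond to already known values, which is why only the tail is returned as the forecast.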


## 2. Attention is all you need

https://arxiv.org/pdf/1706.03762

Classes defined in **attn.py**

### Full Attention


Implements the standard full attention mechanism.<br>
It is used when the attention mechanism (`attn`) is set to *'full'*.


 - Computes attention scores for all pairs of input positions.
 - Can be computationally intensive for long sequences.
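
For reference, a single-head version of scaled dot-product attention is sketched below in plain Python. It follows the formulation of the linked paper and is not the code in **attn.py**; masking and dropout are left out, and the function names are chosen for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of floats."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def full_attention(queries, keys, values):
    """Scaled dot-product attention for one head; inputs are lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    n_feat = len(values[0])
    output = []
    for q in queries:
        # One score per key, so the total work grows with len(queries) * len(keys).
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        output.append([sum(w * v[c] for w, v in zip(weights, values))
                       for c in range(n_feat)])
    return output

out = full_attention(
    queries=[[0.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
print(out)
```

A zero query scores every key equally, so the output is simply the average of the value vectors.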
 
```python
class FullAttention(nn.Module):
    def __init__(self, mask_flag=True, factor=5, scale=None, attention_dropout=0.1, output_attention=False):
        ...
    def forward(self, queries, keys, values, attn_mask):
        ...
```

### ProbAttention


Implements the ProbSparse self-attention mechanism.<br>
This is the default attention mechanism in the Informer model.

- Uses probability sampling to select the most relevant queries and keys.
- Reduces computational complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the sequence length.
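
The selection rule can be sketched in plain Python. The example follows the sparsity measurement described in the Informer paper (maximum score minus mean score per query) and assumes that u = factor * ceil(ln L_Q) queries are kept, while the remaining queries receive the mean of the values (the unmasked case). It is deliberately simplified: scores are computed against all keys instead of a sampled subset, so it shows which queries are selected rather than the speed-up itself.

```python
import math

def prob_sparse_attention(queries, keys, values, factor=5):
    """Single-head ProbSparse-style attention on lists of vectors (simplified)."""
    n_queries = len(queries)
    scale = 1.0 / math.sqrt(len(queries[0]))
    scores = [[scale * sum(a * b for a, b in zip(q, k)) for k in keys]
              for q in queries]

    # Sparsity measurement per query: max score minus mean score.
    sparsity = [max(row) - sum(row) / len(row) for row in scores]

    # Keep the u most "active" queries (assumed rule: u = factor * ceil(ln L_Q)).
    u = min(n_queries, factor * math.ceil(math.log(n_queries)))
    ranked = sorted(range(n_queries), key=lambda i: sparsity[i], reverse=True)
    active = set(ranked[:u])

    n_feat = len(values[0])
    mean_value = [sum(v[c] for v in values) / len(values) for c in range(n_feat)]

    output = []
    for i, row in enumerate(scores):
        if i in active:
            top = max(row)
            exps = [math.exp(s - top) for s in row]
            total = sum(exps)
            output.append([sum(e / total * v[c] for e, v in zip(exps, values))
                           for c in range(n_feat)])
        else:
            output.append(list(mean_value))  # "lazy" query: mean of the values
    return output, sorted(active)

queries = [[4.0, 0.0], [0.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
output, active = prob_sparse_attention(queries, keys, values, factor=1)
print(active, output)
```

The two zero queries have a flat score distribution, so they are treated as lazy and get the plain average of the values; the two peaked queries are the ones that receive a real attention computation.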
 
```python
class ProbAttention(nn.Module):
    def __init__(self, mask_flag=True, factor=5, scale=None, attention_dropout=0.1, output_attention=False):
        ...
    def forward(self, queries, keys, values, attn_mask):
        ...
```



## 3. AttentionLayer

Wraps around the attention mechanisms.

- Projects input embeddings into queries, keys, and values.
- Applies the attention mechanism.
- Mixes the output if specified.


```python
class AttentionLayer(nn.Module):
    def __init__(self, attention, d_model, n_heads, d_keys=None, d_values=None, mix=False):
        ...
    def forward(self, queries, keys, values, attn_mask):
        ...
```

## 4. Encoder and EncoderLayer

Check **encoder.py**

### EncoderLayer

A single layer within the encoder stack.

- **Self-Attention**: Captures dependencies within the input sequence.
- **Position-wise Feed-Forward Network (FFN)**: Processes the output of the attention layer.
- **Layer Normalization and Dropout**: Stabilizes training and prevents overfitting.

```python
class EncoderLayer(nn.Module):
    def __init__(self, attention, d_model, d_ff=None, dropout=0.1, activation="relu"):
        ...
    def forward(self, x, attn_mask=None):
        ...
```

### Encoder

Stacks multiple EncoderLayer instances.

- Can include convolutional layers (ConvLayer) for downsampling and capturing local dependencies.
- Applies normalization at the end.

```python
class Encoder(nn.Module):
    def __init__(self, attn_layers, conv_layers=None, norm_layer=None):
        ...
    def forward(self, x, attn_mask=None):
        ...
```


### ConvLayer

Applies convolutional operations to capture local temporal features.

- Downsamples the sequence length.
- Uses circular padding to handle sequence borders.
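
Both ideas can be shown on a single channel in plain Python. The sketch assumes a kernel of size 3 with wrap-around (circular) padding followed by a max-pool with kernel 3, stride 2 and padding 1; these sizes are assumptions for illustration, and any normalization or activation in the real layer is omitted.

```python
import math

def circular_conv1d(x, kernel):
    """Same-length 1D convolution where the borders wrap around the sequence."""
    n, half = len(x), len(kernel) // 2
    return [sum(w * x[(t + j - half) % n] for j, w in enumerate(kernel))
            for t in range(n)]

def max_pool1d(x, kernel=3, stride=2, pad=1):
    """Max-pooling that roughly halves the length when stride is 2."""
    padded = [-math.inf] * pad + list(x) + [-math.inf] * pad
    return [max(padded[s:s + kernel])
            for s in range(0, len(padded) - kernel + 1, stride)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
smoothed = circular_conv1d(x, kernel=[0.25, 0.5, 0.25])
pooled = max_pool1d(smoothed)
print(smoothed)
print(pooled)
```

Because of the wrap-around, the first output mixes in the last input value and vice versa, and the pooled sequence has half as many steps as the input.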

```python
class ConvLayer(nn.Module):
    def __init__(self, c_in):
        ...
    def forward(self, x):
        ...
```


## 5. Decoder and DecoderLayer

Check **decoder.py**

### DecoderLayer

A single layer within the decoder stack.

- **Self-Attention**: Processes the decoder input.
- **Cross-Attention**: Attends over the encoder output.
- **Feed-Forward Network**: Further processes the combined information.
- **Layer Normalization and Dropout**.

```python
class DecoderLayer(nn.Module):
    def __init__(self, self_attention, cross_attention, d_model, d_ff=None, dropout=0.1, activation="relu"):
        ...
    def forward(self, x, cross, x_mask=None, cross_mask=None):
        ...
```

### Decoder

Stacks multiple *DecoderLayer* instances.

- Processes the decoder inputs and the encoder outputs to generate predictions.
- Applies normalization at the end.

```python
class Decoder(nn.Module):
    def __init__(self, layers, norm_layer=None):
        ...
    def forward(self, x, cross, x_mask=None, cross_mask=None):
        ...
```



## 6. Embedding Layers

Check **embed.py**

### DataEmbedding

Converts raw input data into embeddings suitable for the model.

- **Value Embedding (TokenEmbedding)**: Embeds the input time series values.
- **Position Embedding (PositionalEmbedding)**: Adds positional information to the embeddings.
- **Temporal Embedding (TemporalEmbedding or TimeFeatureEmbedding)**: Encodes temporal features like time of day, day of the week, etc.
- **Dropout**: Prevents overfitting.

```python
class DataEmbedding(nn.Module):
    def __init__(self, c_in, d_model, embed_type='fixed', freq='h', dropout=0.1):
        ...
    def forward(self, x, x_mark):
        ...
```

### TokenEmbedding

Applies a convolution to embed input values.

- Uses a 1D convolution to capture local patterns in the input data.

```python
class TokenEmbedding(nn.Module):
    def __init__(self, c_in, d_model):
        ...
    def forward(self, x):
        ...
```

### PositionalEmbedding

Adds positional information to the embeddings.

- Uses sine and cosine functions at different frequencies.
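
A plain-Python sketch of such a table is given below. It assumes the standard sinusoidal formulation from the Transformer paper (sine on even columns, cosine on odd columns, base 10000); the class in **embed.py** may differ in details such as precomputing the table as a buffer.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position codes."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)          # even column
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)  # odd column
    return table

pe = positional_encoding(max_len=4, d_model=6)
for row in pe:
    print([round(v, 3) for v in row])
```

Low columns oscillate quickly with position and high columns slowly, which gives every time step a distinct code that does not need to be learned.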

```python
class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        ...
    def forward(self, x):
        ...
```


### TemporalEmbedding

Encodes temporal features extracted from timestamps.

- Embeds features like hour of the day, day of the week, month, etc.
- Helps the model to capture seasonal patterns.
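
The kind of integer features such an embedding looks up can be derived with the standard library. The choice and order of fields below (month, day, weekday, hour) is an assumption that mirrors hourly data in the original Informer data loader; check the project's data pipeline for the actual layout.

```python
from datetime import datetime

def calendar_features(ts):
    """Integer calendar features for one timestamp: [month, day, weekday, hour]."""
    return [ts.month, ts.day, ts.weekday(), ts.hour]

# Three consecutive hourly timestamps starting on Monday 2024-01-01.
x_mark = [calendar_features(datetime(2024, 1, 1, hour)) for hour in (0, 1, 2)]
print(x_mark)
```

Each integer would then index its own embedding table, and the resulting vectors are summed into one temporal embedding per time step.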


```python
class TemporalEmbedding(nn.Module):
    def __init__(self, d_model, embed_type='fixed', freq='h'):
        ...
    def forward(self, x):
        ...
```


### TimeFeatureEmbedding

An alternative to *TemporalEmbedding* that uses a linear layer to embed time features.

- Used when **embed_type** is set to 'timeF'.
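
With this option the time features are real-valued rather than integer indices. The sketch below assumes, as in the original Informer time-feature utilities, that each calendar field is scaled to roughly [-0.5, 0.5] before a linear map; both helper functions are hypothetical and the bias term is omitted for brevity.

```python
from datetime import datetime

def time_features(ts):
    """Scale calendar fields of one timestamp to roughly [-0.5, 0.5]."""
    return [
        ts.hour / 23.0 - 0.5,                          # hour of day
        ts.weekday() / 6.0 - 0.5,                      # day of week
        (ts.day - 1) / 30.0 - 0.5,                     # day of month
        (ts.timetuple().tm_yday - 1) / 365.0 - 0.5,    # day of year
    ]

def linear_embed(features, weight):
    """Linear map without bias: weight has d_model rows, one column per feature."""
    return [sum(w * f for w, f in zip(row, features)) for row in weight]

first = time_features(datetime(2024, 1, 1, 0))
embedded = linear_embed([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(first, embedded)
```

In the model the weight matrix is learned, so the network decides how much each calendar field contributes to every embedding dimension.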


```python
class TimeFeatureEmbedding(nn.Module):
    def __init__(self, d_model, embed_type='timeF', freq='h'):
        ...
    def forward(self, x):
        ...
```



## 7. Projection Layer

Maps the decoder output to the output dimensions (e.g., the number of features to predict).

 - Applies a linear transformation to produce the final output.

```python
self.projection = nn.Linear(d_model, c_out, bias=True)
```

# Data Flow through the model


1. Input Preparation:
   - x_enc: Encoder input sequence (the dcm rate data).
   - x_mark_enc: Temporal features for the encoder input.
   - x_dec: Decoder input sequence (e.g., placeholder for future values).
   - x_mark_dec: Temporal features for the decoder input.

1. Embedding:
   - The inputs are passed through the DataEmbedding layer to generate embeddings.

1. Encoding:
   - The encoder processes the embeddings to capture temporal dependencies.

1. Decoding:
   - The decoder uses the encoder output and its own inputs to generate predictions.

1. Projection:
   - The decoder output is passed through the projection layer to obtain the final output.

