# AML Homework 3: Theory


## Table of contents:
* [The Transformer Architecture](#transformer) - **11 Points**
    * Understanding the Attention Mechanism
    * Scaled Dot Product Attention - **1 Point**
    * Multi-Head Attention - **2 Points**
    * The Encoder - Decoder Block - **3 Points**
    * Positional Encoding - **1 Point**
    * Transformer Network - **2 Points**
    * Optimizer - **1 Point**
    * Question - **1 Point**
* [Graph Metanetworks](#gmn) - **6 Points**
    * A glimpse on the neural network to parameter graph conversion
    * Loading some parameter graphs
    * MPNN implementation
        * Edge Model - **1 Point**
        * Node Model - **2 Points**
        * Global Model - **1 Point**
        * MPNN - **2 Points**

## Initial Setup
Run the following cells to sync with Google Drive if you run from Google Colab, and to install the required torch_scatter library.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

**Replace the path in the cd command with yours**

In [None]:
%cd /content/drive/MyDrive/HW03_AML2425/Theory

In [None]:
!pip install torch-scatter -f https://data.pyg.org/whl/torch-2.4.0%2Bcu121.html

## The Transformer Architecture <a class="anchor" id="transformer"></a>

This notebook is a guide to the fundamental components of the Transformer model, a highly influential architecture in deep learning. Since the publication of the seminal paper by Vaswani et al., [Attention Is All You Need](https://arxiv.org/abs/1706.03762), in 2017, Transformer-based models have repeatedly set new state-of-the-art results, particularly in natural language processing. Transformers with a very large number of parameters can generate long, convincing text, opening up new frontiers in AI applications.
It is therefore important to understand the inner workings of the Transformer architecture and to be able to implement it independently, which is what we will do in this notebook.

In [1]:
import torch
from torch.nn.functional import softmax
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F

import copy
from copy import deepcopy
import numpy as np
import math
import scipy.io
import os
import random


## Imports for plotting
import matplotlib.pyplot as plt
plt.set_cmap('cividis')
%matplotlib inline
from IPython.display import set_matplotlib_formats
from matplotlib.colors import to_rgb
import matplotlib
matplotlib.rcParams['lines.linewidth'] = 2.0
import seaborn as sns
sns.reset_orig()

torch.manual_seed(0)
torch.cuda.manual_seed(0)
np.random.seed(0)
random.seed(0)

torch.backends.cudnn.deterministic=True
torch.backends.cudnn.benchmark = False

### Understanding the Attention Mechanism <a class="anchor" id="att_mechanism"></a>

In recent years, particularly in sequence-related tasks, the attention mechanism has emerged as a crucial component within neural networks. This mechanism comprises a set of layers that have gained substantial attention due to their effectiveness. The primary purpose of the attention mechanism is to compute a weighted average of elements within a sequence. These weights are dynamically determined based on an input query and the keys associated with the elements. But what does this exactly entail?

Essentially, the goal is to compute an average in which each element is weighted by how relevant it is to the query, rather than assigning equal weight to all elements. To achieve this, an attention mechanism typically consists of four key components:

* **Query:** The query is a feature vector that helps identify specific elements within the sequence that require attention or consideration.

* **Keys**: Each input element is associated with a key, which is also a feature vector. These keys provide insights into what each element contributes or when it becomes relevant. They are designed to enable the identification of elements that deserve attention based on the query.

* **Values**: For each input element, there is a corresponding value vector. The aim is to compute an average using these value vectors.

* **Score Function:** To determine which elements deserve attention, a scoring function denoted as $f_{attn}$ must be defined. This function takes the query and a key as inputs and yields the attention score (logit) for the query-key pair; the attention weights are then obtained by normalizing these scores. Typically, simple similarity measures such as a dot product or a small multi-layer perceptron (MLP) are employed for this purpose.

***How are key (K), query (Q) and value (V) computed?***
In these formulas, we'll denote the original representations as $(x_i)$ for each element in the sequence.

1. **Key (K) Computation**:
   - The key vector for each element $i$ is computed by multiplying the original representation $x_i$ by a learned key weight matrix $W^K$.
   - Mathematically, the key vector $k_i$ is obtained as follows:

     $k_i = x_i \cdot W^K$

2. **Query (Q) Computation**:
   - Similarly, the query vector for each element $i$ is computed by multiplying the original representation $x_i$ by a learned query weight matrix $W^Q$.
   - Mathematically, the query vector $q_i$ is obtained as follows:

     $q_i = x_i \cdot W^Q$

3. **Value (V) Computation**:
   - The value vector for each element $i$ is computed by multiplying the original representation $x_i$ by a learned value weight matrix $W^V$.
   - Mathematically, the value vector $v_i$ is obtained as follows:

     $v_i = x_i \cdot W^V$

Here's a bit more explanation:

- $x_i$ represents the original representation (e.g., word embedding or feature vector) of the $i$-th element in the sequence.

- $W^K$, $W^Q$, and $W^V$ are learnable weight matrices specific to the key, query, and value computations, respectively. These weight matrices are shared across all elements in the sequence but may have different dimensions based on the desired dimensionality of the key, query, and value spaces.

- After computing the key, query, and value vectors for each element in the sequence, these vectors are used in the self-attention mechanism to calculate attention scores, which determine how much each element attends to others in the sequence. This process is typically followed by a weighted sum of the value vectors to obtain the final output for each element.
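
The three projections above can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: the sizes `T`, `d_model`, `d_k`, `d_v` are made up, and the weight matrices are random here whereas in a real model they are learned.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the notebook)
T, d_model, d_k, d_v = 4, 8, 6, 5

x = torch.randn(T, d_model)        # one row x_i per sequence element

# Learned in practice; random here purely for illustration
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_v)

Q = x @ W_Q   # row i is q_i = x_i W^Q
K = x @ W_K   # row i is k_i = x_i W^K
V = x @ W_V   # row i is v_i = x_i W^V

print(Q.shape, K.shape, V.shape)
```

Because the same matrices are applied to every row of `x`, the projection of the whole sequence is a single matrix product.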

To obtain the weights for averaging, a softmax function is applied to the scores produced by the scoring function across all elements. Consequently, value vectors associated with keys most similar to the query receive higher weights in the averaging process.

$$
\alpha_i = \frac{\exp\left(f_{attn}\left(\text{K}_i, \text{Q}\right)\right)}{\sum_j \exp\left(f_{attn}\left(\text{K}_j, \text{Q}\right)\right)}, \hspace{5mm} \text{out} = \sum_i \alpha_i \cdot \text{V}_i
$$
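
As a small sketch of this formula for a single query, assuming a plain dot product as the score function $f_{attn}$ (one common choice) and random tensors as stand-ins for real keys and values:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, d = 5, 4                      # illustrative sizes
keys = torch.randn(T, d)         # K_i, one row per element
values = torch.randn(T, d)       # V_i, one row per element
query = torch.randn(d)           # a single query Q

scores = keys @ query            # f_attn(K_i, Q) as a dot product, shape (T,)
alpha = F.softmax(scores, dim=0) # attention weights alpha_i, sum to 1
out = alpha @ values             # weighted average of the values, shape (d,)

print(alpha)
print(out)
```

The element whose key has the largest score with the query also receives the largest weight in the average.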

Here is an example of attention over a sequence:

<center width="100%" style="padding:25px"><img src="https://github.com/phlippe/uvadlc_notebooks/blob/master/docs/tutorial_notebooks/tutorial6/attention_example.svg?raw=1" width="750px"></center>

In this scenario, each word in the sequence has an associated key and value. The scoring function evaluates the similarity between the query and all the keys to determine the weights. These attention weights are then used to compute the weighted average of the word values.

It's important to note that self-attention is the variant of attention used within the Transformer architecture. In self-attention, each element in the sequence provides a query, a key, and a value. For each element, the attention layer compares its query with the keys of all sequence elements and returns a differently weighted average of the value vectors.

### Scaled Dot Product Attention  (**1 Point**) <a class="anchor" id="scaled_dot_product"></a>

The core concept behind self-attention is the scaled dot product attention, which aims to create an efficient attention mechanism that enables each element within a sequence to attend to every other element. This mechanism is designed to strike a balance between computational efficiency and expressive power.

The inputs to the dot product attention consist of queries ($Q\in\mathbb{R}^{T\times d_k}$), keys ($K\in\mathbb{R}^{T\times d_k}$), and values ($V\in\mathbb{R}^{T\times d_v}$). Here, $T$ represents the sequence length, $d_k$ is the hidden dimension of the queries and keys, and $d_v$ is the hidden dimension of the values.

The dot product attention is computed as follows:

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The matrix multiplication $QK^T$ computes the dot product for every pair of queries and keys, producing a matrix of shape $T\times T$. Each row contains the attention logits of one element $i$ with respect to all elements in the sequence. We apply a softmax to each row and multiply by the value vectors to obtain a weighted mean (the weights being determined by the attention). The computation graph below provides another viewpoint on this attention technique.

<center width="100%"><img src="https://github.com/phlippe/uvadlc_notebooks/blob/master/docs/tutorial_notebooks/tutorial6/scaled_dot_product_attn.svg?raw=1" width="210px"></center>

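The scaling factor $1/\sqrt{d_k}$ is crucial for keeping the variance of the attention logits at an appropriate level after initialization. Since layers are initialized so that activations have roughly equal variance throughout the model, the entries of $Q$ and $K$ may also have a variance close to $1$; without scaling, their dot products would then have a variance of about $d_k$, which pushes the softmax towards saturation.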


*Note: Keep in mind that we initialize our layers with the purpose of having equal variance across the model. Dot products over two vectors with variances $\sigma^2$, however, produce scalars with $d_k$-times larger variance:*

$$q_i \sim \mathcal{N} (0,\sigma^2), k_i \sim \mathcal{N}(0,\sigma^2) \to \text{Var}\left(\sum_{i=1}^{d_k} q_i\cdot k_i\right) = \sigma^4\cdot d_k$$
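
The variance claim can be checked empirically. The sketch below draws many random query/key pairs (the sample count `n` and the value of `d_k` are arbitrary choices for illustration) and compares the variance of the raw and scaled dot products:

```python
import torch

torch.manual_seed(0)

d_k, sigma, n = 64, 1.0, 20000   # illustrative values

q = sigma * torch.randn(n, d_k)  # n independent query vectors
k = sigma * torch.randn(n, d_k)  # n independent key vectors

dots = (q * k).sum(dim=-1)       # one dot product per pair

var_raw = dots.var().item()                   # expected near sigma^4 * d_k
var_scaled = (dots / d_k ** 0.5).var().item() # expected near sigma^4

print(f"raw: {var_raw:.2f}, scaled: {var_scaled:.2f}")
```

With $\sigma = 1$, dividing by $\sqrt{d_k}$ brings the variance of the logits back to roughly $1$.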


*The optional masking of particular entries in the attention matrix is shown by the block labeled "Mask (opt.)" in the diagram above. When calculating the attention values, we pad the sentences to the same length and mask out the padding tokens.*
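
A minimal sketch of padding masking, assuming the convention that a mask value of `0` marks a position to ignore. The mask and sizes below are hypothetical; here the last of four positions is treated as padding.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, d_k = 4, 8
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)

# Hypothetical padding mask: 1 = real token, 0 = padding (last position)
mask = torch.tensor([[1, 1, 1, 0]])            # shape (1, T), broadcast over query rows

logits = (q @ k.transpose(-2, -1)) / d_k ** 0.5
logits = logits.masked_fill(mask == 0, -1e9)   # masked keys get a very negative logit
attn = F.softmax(logits, dim=-1)               # masked keys receive ~0 weight

print(attn)
```

Filling with a large negative number before the softmax, rather than zeroing weights afterwards, keeps each row normalized over the unmasked positions.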

---

After the discussion regarding the scaled dot-product attention mechanism, please proceed to finalize the code for the `Attention` class as illustrated below.

In [16]:
class Attention(nn.Module):
    ''' Scaled Dot-Product Attention '''

    def __init__(self, attn_dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(attn_dropout)

    def forward(self, query, key, value, mask=None):

        '''
        add here the code regarding the argument of the softmax function as defined above
        '''
        QK = query @ key.transpose(-2, -1)

        d_k = key.shape[-1]
        
        attn = QK / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))

        if mask is not None:
            attn = attn.masked_fill(mask == 0, -1e9)

        attn = self.dropout(F.softmax(attn, dim=-1))

        '''
        Computed attn, calculate the final output of the attention layer
        '''

        output = attn @ value

        return output, attn


**Do not modify the code below.**

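After implementing the scaled dot-product attention mechanism, let's test the `Attention` class defined above. For this initial test, we do not pass a mask; masks are introduced in a subsequent step when building the `MultiHeadAttention` class. Note that a freshly constructed module is in training mode, so dropout ($p=0.1$) is active and the surviving attention weights are rescaled by $1/0.9$; the printed attention rows therefore need not sum to $1$.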

Some random $Q$, $K$ and $V$ are generated to compute some attention outputs.

In [17]:
torch.manual_seed(0)

seq_len, d_k = 3, 2
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)
attention = Attention()
output, attn = attention(q, k, v)
print("Q\n", q)
print("K\n", k)
print("V\n", v)
print("Output\n", output)
print("Attention\n", attn)


Q
 tensor([[ 1.5410, -0.2934],
        [-2.1788,  0.5684],
        [-1.0845, -1.3986]])
K
 tensor([[ 0.4033,  0.8380],
        [-0.7193, -0.4033],
        [-0.5966,  0.1820]])
V
 tensor([[-0.8567,  1.1006],
        [-1.0712,  0.1227],
        [-0.5663,  0.3731]])
Output
 tensor([[-0.9328,  0.8123],
        [-0.9093,  0.3966],
        [-0.9970,  0.3056]])
Attention
 tensor([[0.6291, 0.2395, 0.2425],
        [0.1387, 0.4749, 0.4975],
        [0.0842, 0.6800, 0.3469]])


### Multi-Head Attention  (**2 Points**) <a class="anchor" id="multi_head"></a>


The scaled dot product attention mechanism lets a network attend over a sequence. However, a sequence element often needs to attend to several different aspects, and a single weighted average is not a good fit for that. To address this limitation, we extend the attention mechanism to multiple heads, i.e., multiple query-key-value triplets computed on the same input features. Specifically, given a query, key, and value matrix, we transform them into $h$ sub-queries, sub-keys, and sub-values, which we pass through the scaled dot product attention independently. The head outputs are then concatenated and combined with a final weight matrix.

$$
\begin{split}
    \text{Multihead}(Q,K,V) & = \text{Concat}(\text{head}_1,...,\text{head}_h)W^{O}\\
    \text{where } \text{head}_i & = \text{Attention}(QW_i^Q,KW_i^K, VW_i^V)
\end{split}
$$

We refer to this as Multi-Head Attention layer. We can visually see it here:

<center width="100%"><img src="https://github.com/phlippe/uvadlc_notebooks/blob/master/docs/tutorial_notebooks/tutorial6/multihead_attention.svg?raw=1" width="230px"></center>

Set the feature map $X\in\mathbb{R}^{B\times T\times d_{\text{model}}}$ as $Q$, $K$ and $V$ (with $B$ the batch size, $T$ the sequence length, and $d_{\text{model}}$ the hidden dimensionality of $X$). The weights $W^{Q}$, $W^{K}$, and $W^{V}$ transform $X$ into the corresponding queries, keys, and values. The final result is produced by multiplying the concatenated head outputs by the weight matrix $W^{O}$.
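
The splitting into heads and the later concatenation are pure reshaping operations. The sketch below traces the shapes on a dummy tensor (the sizes are illustrative assumptions) and checks that merging the heads recovers the original layout:

```python
import torch

B, T, d_model, h = 2, 7, 16, 4   # illustrative sizes
d_k = d_model // h               # per-head dimension

# Dummy features with distinct entries so the layout is easy to inspect
x = torch.arange(B * T * d_model, dtype=torch.float32).view(B, T, d_model)

# Split: (B, T, d_model) -> (B, T, h, d_k) -> (B, h, T, d_k)
heads = x.view(B, T, h, d_k).transpose(1, 2)

# Merge ("Concat"): (B, h, T, d_k) -> (B, T, h, d_k) -> (B, T, d_model)
merged = heads.transpose(1, 2).contiguous().view(B, T, d_model)

print(heads.shape, merged.shape)
```

Head `j` simply sees the slice `j*d_k:(j+1)*d_k` of each projected feature vector, which is why a single `d_model x d_model` linear layer can serve all heads at once.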


Complete the `MultiHeadAttention` class below.

In [18]:
class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, d_model, dropout=0.1):
        """
        Take in model size and number of heads.
        """
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0
        #  We assume d_v always equals d_k
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.query_ff = nn.Linear(d_model, d_model)
        self.key_ff = nn.Linear(d_model, d_model)
        self.value_ff = nn.Linear(d_model, d_model)
        self.attn_ff = nn.Linear(d_model, d_model)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
        self.attention = Attention(attn_dropout=dropout)

    def forward(self, query, key, value, mask=None, return_attention=False):

        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k.
        # The query is given as example, you should do the same for key and value
        query = self.query_ff(query).view(nbatches, -1, self.num_heads, self.d_k).transpose(1, 2)
        '''
        Add your code below
        '''
        key = self.key_ff(key).view(nbatches, -1, self.num_heads, self.d_k).transpose(1, 2)
        value = self.value_ff(value).view(nbatches, -1, self.num_heads, self.d_k).transpose(1, 2)

        # 2) Apply attention on all the projected vectors in batch.
        '''
        Add your code below
        '''
        x, self.attn = self.attention(query, key, value, mask)
        

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.num_heads * self.d_k)

        if return_attention:
            return self.attn_ff(x), self.attn

        return self.attn_ff(x)


**Do not change the following code.**

In [19]:
torch.manual_seed(0)
np.random.seed(0)

num_heads = 8
d_model = 512


self_attn = MultiHeadAttention(num_heads, d_model)


x = torch.tensor(np.random.rand(1, 7,512)).float()

attn_out = self_attn(x, x, x)
attn_out

tensor([[[ 0.2365, -0.1411,  0.0168,  ..., -0.2253,  0.1767,  0.2225],
         [ 0.1931, -0.0895,  0.0500,  ..., -0.2019,  0.1098,  0.2202],
         [ 0.2050, -0.1029, -0.0381,  ..., -0.1815,  0.1637,  0.2313],
         ...,
         [ 0.1796, -0.1051,  0.0153,  ..., -0.1890,  0.0949,  0.2301],
         [ 0.2132, -0.1249,  0.0418,  ..., -0.2321,  0.1163,  0.2321],
         [ 0.1697, -0.0129,  0.0232,  ..., -0.1104,  0.1187,  0.1810]]],
       grad_fn=<ViewBackward0>)

An essential property of the multi-head attention mechanism is its **permutation-equivariance** with respect to its inputs. In practical terms, if we swap the first and second items of the input sequence, the output is exactly the same except that the first and second output elements are swapped as well. This property means that multi-head attention treats the input not as a strict sequence but as a set of items. This characteristic is what makes the multi-head attention block and the Transformer architecture so powerful and widely applicable; it also means that order information must be supplied separately when it matters.
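
Permutation equivariance can be checked numerically: permuting the input elements permutes the output elements in the same way. The sketch below uses a hypothetical single-head helper `self_attention` with random weights rather than the notebook's `MultiHeadAttention` class; the property is the same for multiple heads as long as no mask or positional encoding is involved.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def self_attention(x, W_q, W_k, W_v):
    """Hypothetical single-head self-attention without mask or dropout."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    weights = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return weights @ v

T, d = 5, 8
x = torch.randn(T, d)
W_q, W_k, W_v = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

perm = torch.tensor([1, 0, 2, 3, 4])            # swap the first two elements

out = self_attention(x, W_q, W_k, W_v)
out_perm = self_attention(x[perm], W_q, W_k, W_v)

# Equivariance: attention(permuted x) equals permuted attention(x)
print(torch.allclose(out[perm], out_perm, atol=1e-5))
```

Note that the output is not invariant: `out` and `out_perm` differ in their first two rows, which is exactly the swap applied to the input.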

### The Encoder-Decoder Block (**3 Points**) <a class="anchor" id="encoder_decoder"></a>

The original Transformer model, as presented in the paper, was designed primarily for neural machine translation tasks, where it excels at translating sentences from one language to another, such as English to French. The key architectural concept used in the Transformer is the encoder-decoder architecture. In this setup, the encoder processes an input sentence, extracting meaningful features, which are then leveraged by the decoder to generate an output sentence, effectively performing translation.

The complete Transformer architecture is illustrated below (figure credit: [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)):

<center width="100%"><img src="https://github.com/phlippe/uvadlc_notebooks/blob/master/docs/tutorial_notebooks/tutorial6/transformer_architecture.svg?raw=1" width="400px"></center>

Let's examine the Encoder block more in depth. Understanding it will result in an easier comprehension of the Decoder block.


The encoder consists of $N$ identical blocks applied in sequence. Given an input $x$, a block first applies a Multi-Head Attention layer. The output is then added to the original input through a residual connection, and the sum is normalized with layer normalization. This process is formally represented as:

 $\text{LayerNorm}(x+\text{Multihead}(x,x,x))$ ($x$ being $Q$, $K$ and $V$ input to the attention layer).

Residual connections are instrumental in ensuring a smooth gradient flow throughout the model and preserving vital information about the original sequence.

Layer normalization serves multiple purposes—it accelerates training, provides a degree of regularization, and maintains consistent feature magnitudes across the sequence elements.

Additionally, a small fully connected feed-forward network (FFN) is incorporated into the model, applied uniformly to each position. The transformation, inclusive of the residual connection, can be summarized as:

$$
\begin{split}
    \text{FFN}(x) & = \max(0, xW_1+b_1)W_2 + b_2\\
    x & = \text{LayerNorm}(x + \text{FFN}(x))
\end{split}
$$

To further reduce overfitting, dropout is applied inside the MLP, on the MLP output, and on the output of the Multi-Head Attention.
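
The two residual-plus-normalization steps can be illustrated on a toy tensor. This is a sketch under stated assumptions, not the expected solution: it uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the notebook's `MultiHeadAttention` class, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, d_model, num_heads, d_ff = 2, 5, 16, 4, 32   # illustrative sizes

# Stand-in attention layer (PyTorch built-in, not the notebook's class)
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
dropout = nn.Dropout(0.1)

x = torch.randn(B, T, d_model)

# Sub-layer 1: LayerNorm(x + Multihead(x, x, x))
attn_out, _ = attn(x, x, x)
x = norm1(x + dropout(attn_out))

# Sub-layer 2: LayerNorm(x + FFN(x))
x = norm2(x + dropout(ffn(x)))

print(x.shape)
```

Both sub-layers preserve the shape `(B, T, d_model)`, which is what allows $N$ such blocks to be stacked.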

Add your solution to the `EncoderBlock` and `DecoderBlock` classes.

In [None]:
class EncoderBlock(nn.Module):

    def __init__(self, input_dim, num_heads, dim_feedforward, dropout=0.0):
        """
        Inputs:
            input_dim - Dimensionality of the input
            num_heads - Number of heads to use in the attention block
            dim_feedforward - Dimensionality of the hidden layer in the MLP
            dropout - Dropout probability to use in the dropout layers
        """
        super().__init__()

        # Attention layer
        self.self_attn = MultiHeadAttention(num_heads, input_dim)

        # Two-layer MLP
        self.linear_net = nn.Sequential(
            nn.Linear(input_dim, dim_feedforward),
            nn.Dropout(dropout),
            nn.ReLU(inplace=True),
            nn.Linear(dim_feedforward, input_dim)
        )

        # Layers to apply in between the main layers
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self_attention part (use self.norm1)
        '''
        Add your code below
        '''

        # MLP part (use self.norm2)
        '''
        Add your code below
        '''

        return x


In [None]:
class DecoderBlock(nn.Module):

    def __init__(self, input_dim, num_heads, dim_feedforward, dropout=0.0):
        """
        Inputs:
            input_dim - Dimensionality of the input
            num_heads - Number of heads to use in the attention block
            dim_feedforward - Dimensionality of the hidden layer in the MLP
            dropout - Dropout probability to use in the dropout layers
        """
        super().__init__()

        # Self Attention layer
        self.self_attn = MultiHeadAttention(num_heads, input_dim)
        # Attention Layer
        self.src_attn = MultiHeadAttention(num_heads, input_dim)

        # Two-layer MLP
        self.linear_net = nn.Sequential(
            nn.Linear(input_dim, dim_feedforward),
            nn.Dropout(dropout),
            nn.ReLU(inplace=True),
            nn.Linear(dim_feedforward, input_dim)
        )

        # Layers to apply in between the main layers
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.norm3 = nn.LayerNorm(input_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory, src_mask, tgt_mask):
        # Self-Attention part (use self.norm1)
        '''
        Add your code below
        '''

        # Attention part (use self.norm2)
        # Recall that memory is the output of the encoder and replaces x as
        # the key and value in the attention layer
        '''
        Add your code below
        '''

        # MLP part (use self.norm3)
        '''
        Add your code below
        '''

        return x


### Positional Encoding  (**1 Point**)

Positional information plays a vital role in tasks like language understanding, where the order of words in a sequence is crucial. To give the model this positional context, we can use a positional encoding. We could learn an embedding for every possible position, but this would not generalize to sequences of varying length. A more practical approach is therefore to use fixed feature patterns that the network can identify from the input features and potentially generalize to longer sequences.

Following the solution of Vaswani et al., the positional encoding is defined as:

$$
PE_{(pos,i)} = \begin{cases}
    \sin\left(\frac{pos}{10000^{i/d_{\text{model}}}}\right) & \text{if}\hspace{3mm} i \text{ mod } 2=0\\
    \cos\left(\frac{pos}{10000^{(i-1)/d_{\text{model}}}}\right) & \text{otherwise}\\
\end{cases}
$$

In this equation, $PE_{(pos, i)}$ is the positional encoding value at position $pos$ in the sequence and hidden dimension $i$. The values for all hidden dimensions together form the encoding vector of a position, which is added to the input features at that position. This allows the model to capture and use positional context.
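
One possible way to build the table of encodings is sketched below. The helper name `sinusoidal_encoding` is hypothetical, the function assumes an even `d_model`, and it computes the frequencies through `exp`/`log`, which is one of several equivalent formulations.

```python
import math
import torch

def sinusoidal_encoding(max_len, d_model):
    """Hypothetical helper: returns a (max_len, d_model) table; assumes even d_model."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    even_i = torch.arange(0, d_model, 2, dtype=torch.float32)           # i = 0, 2, 4, ...
    inv_freq = torch.exp(-math.log(10000.0) * even_i / d_model)         # 1 / 10000^(i / d_model)

    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * inv_freq)  # even dimensions i
    table[:, 1::2] = torch.cos(position * inv_freq)  # odd dimensions i reuse the exponent of i - 1
    return table

pe_table = sinusoidal_encoding(max_len=96, d_model=48)
print(pe_table.shape)
```

Each even/odd pair of dimensions shares one wavelength, and the wavelengths grow geometrically with the dimension index.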

In [None]:
class PositionalEncoding(nn.Module):
    """
    Implement the PE function.
    """

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        '''
        Add your code below
        '''
        pe = pe.unsqueeze(0) # the final dimension is (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]  # registered buffers do not require gradients
        return self.dropout(x)


To gain a deeper understanding of positional encoding, we can visualize it. We'll generate a sequence-based image that represents positional encoding across hidden dimensions. In this visualization, each pixel will signify the adjustment made to the input feature to encode a specific position.

**Do not change the following code.**

In [None]:
encod_block = PositionalEncoding(d_model=48, dropout=0.1, max_len=96)
pe = encod_block.pe.squeeze().T.cpu().numpy()

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8,3))
pos = ax.imshow(pe, cmap="RdGy", extent=(1,pe.shape[1]+1,pe.shape[0]+1,1))
fig.colorbar(pos, ax=ax)
ax.set_xlabel("Position in sequence")
ax.set_ylabel("Hidden dimension")
ax.set_title("Positional encoding over hidden dimensions")
ax.set_xticks([1]+[i*10 for i in range(1,1+pe.shape[1]//10)])
ax.set_yticks([1]+[i*10 for i in range(1,1+pe.shape[0]//10)])
plt.show()

The sine and cosine waves with different wavelengths that encode the position in the hidden dimensions are clearly visible. To better understand the pattern, we can examine the sine/cosine wave of each hidden dimension separately. The plots below show the positional encoding for the first four hidden dimensions.

In [None]:
sns.set_theme()
fig, ax = plt.subplots(2, 2, figsize=(12,4))
ax = [a for a_list in ax for a in a_list]
for i in range(len(ax)):
    ax[i].plot(np.arange(1,17), pe[i,:16], color=f'C{i}', marker="o", markersize=6, markeredgecolor="black")
    ax[i].set_title(f"Encoding in hidden dimension {i+1}")
    ax[i].set_xlabel("Position in sequence", fontsize=10)
    ax[i].set_ylabel("Positional encoding", fontsize=10)
    ax[i].set_xticks(np.arange(1,17))
    ax[i].tick_params(axis='both', which='major', labelsize=10)
    ax[i].tick_params(axis='both', which='minor', labelsize=8)
    ax[i].set_ylim(-1.2, 1.2)
fig.subplots_adjust(hspace=0.8)
sns.reset_orig()
plt.show()

### Transformer Network  (**2 Points**)
Everything we've talked about up to this point is summarized in the `Transformer` class. You will need all of the components (`EncoderBlock`, `DecoderBlock` and `PositionalEncoding`) previously seen to complete it.

In [None]:
class Transformer(nn.Module):
    def __init__(self, enc_inp_size, dec_inp_size, dec_out_size, N=6,
                   d_model=512, dim_feedforward=2048, num_heads=8, dropout=0.1,
                   mean=[0,0],std=[0,0]):
        super(Transformer, self).__init__()

        self.d_model = d_model
        self.num_heads = num_heads
        self.dim_feedforward = dim_feedforward
        self.dropout = dropout
        self.N = N
        self.mean = mean
        self.std = std
        self.enc_inp_size = enc_inp_size
        self.dec_inp_size = dec_inp_size
        self.dec_out_size = dec_out_size

        self.encoder = nn.ModuleList([deepcopy(
            EncoderBlock(d_model, num_heads, dim_feedforward, dropout)) for _ in range(N)])
        self.decoder = nn.ModuleList([deepcopy(
            DecoderBlock(d_model, num_heads, dim_feedforward, dropout)) for _ in range(N)])
        self.pos_enc = PositionalEncoding(d_model, dropout)
        self.pos_dec = PositionalEncoding(d_model, dropout)
        self.src_embed = nn.Linear(enc_inp_size, d_model)
        self.tgt_embed = nn.Linear(dec_inp_size, d_model)
        self.out = nn.Linear(d_model, dec_out_size)

        self.init_weights()


    def forward(self, src, trg, src_mask, trg_mask):

        # First part of the forward pass: embedding and positional encoding
        # both for the source and target
        '''
        Add your code below
        '''

        # Second part of the forward pass: the encoder and decoder layers.
        # Look at the arguments of the forward pass of the encoder and decoder
        # and recall that the encoder output is used as the memory in the decoder.
        '''
        Add your code below
        '''

        return output


    # Initialize parameters with Glorot / fan_avg.
    def init_weights(self):
        for p in self.encoder.parameters():
            if p.dim() > 1: nn.init.xavier_uniform_(p)
        for p in self.decoder.parameters():
            if p.dim() > 1: nn.init.xavier_uniform_(p)
        for p in self.pos_enc.parameters():
            if p.dim() > 1: nn.init.xavier_uniform_(p)
        for p in self.pos_dec.parameters():
            if p.dim() > 1: nn.init.xavier_uniform_(p)
        for p in self.src_embed.parameters():
            if p.dim() > 1: nn.init.xavier_uniform_(p)
        for p in self.tgt_embed.parameters():
            if p.dim() > 1: nn.init.xavier_uniform_(p)
        for p in self.out.parameters():
            if p.dim() > 1: nn.init.xavier_uniform_(p)



### Optimizer  (**1 Point**)

Here we use the same learning rate schedule proposed in the original Transformer paper.

It uses a number of initial warmup epochs, during which the learning rate increases linearly. Afterwards, it slowly decreases with the inverse square root of the epoch number, scaled by the chosen embedding size. The resulting formula is:

$\text{LR} = \frac{F}{\sqrt{D}} \min\left( \frac{1}{\sqrt{\text{epoch}}},\ \text{epoch} \cdot W^{-\frac{3}{2}}\right)$

where $F$ is a scaling factor, $D$ is the model embedding size, and $W$ is the number of warmup epochs.
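
The formula can be evaluated directly as a plain function. The helper `sqrt_lr_factor` below is a hypothetical standalone sketch (not the scheduler class itself); its default values are illustrative, and it assumes the epoch counter starts at 1, since the formula is undefined at 0.

```python
def sqrt_lr_factor(epoch, factor=1.0, model_size=512, warmup=10):
    """Hypothetical helper evaluating F / sqrt(D) * min(1 / sqrt(epoch), epoch * W^(-3/2))."""
    if epoch < 1:
        raise ValueError("epoch must be >= 1")
    return factor * model_size ** -0.5 * min(epoch ** -0.5, epoch * warmup ** -1.5)

# Evaluate the schedule for the first 100 epochs and locate its peak
factors = [sqrt_lr_factor(e) for e in range(1, 101)]
peak_epoch = 1 + factors.index(max(factors))

print(peak_epoch, max(factors))
```

The two arguments of the `min` coincide exactly when the epoch equals the number of warmup epochs, so the learning rate rises linearly up to that point and decays afterwards.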

In [None]:
class SqrtScheduler(torch.optim.lr_scheduler._LRScheduler):

    def __init__(self, optimizer, factor, model_size, warmup, max_iters):
        self.warmup = warmup
        self.max_num_iters = max_iters
        self.factor = factor
        self.model_size = model_size
        super().__init__(optimizer)

    def get_lr(self):
        lr_factor = self.get_lr_factor(epoch=self.last_epoch+1)
        return [base_lr + lr_factor for base_lr in self.base_lrs]

    def get_lr_factor(self, epoch):
        '''
        Add here your code
        '''
        return lr_factor

In [None]:
# Needed for initializing the lr scheduler
p = nn.Parameter(torch.empty(4,4))
optimizer = torch.optim.Adam([p], lr=0)

lr_scheduler = SqrtScheduler(optimizer=optimizer, factor = 1.0, model_size = 512, warmup=10, max_iters=2000)

# Plotting
epochs = list(range(2000))
sns.set_theme()
plt.figure(figsize=(8,3))
plt.plot(epochs, [lr_scheduler.get_lr_factor(e+1) for e in epochs])
plt.ylabel("Learning rate factor")
plt.xlabel("Iterations (in batches)")
plt.title("Square-Root Warm-up Learning Rate Scheduler")
plt.show()
sns.reset_orig()

**Do not change the following code. It is used as a sanity check to verify that your implementation is correct.**

In [None]:
# Select GPU device for the training if available
if not torch.cuda.is_available():
    device=torch.device("cpu")
    print("Current device:", device)
else:
    device=torch.device("cuda")
    print("Current device:", device, "- Type:", torch.cuda.get_device_name(0))


enc_input_size = 2
dec_input_size = 3
dec_output_size = 3


num_heads = 8
d_model = 512
dim_feedforward = 2048
dropout = 0.1
preds_num = 8

def subsequent_mask(size):
    """
    Mask out subsequent positions.
    """
    attn_shape = (1, size, size)
    mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    return torch.from_numpy(mask) == 0

torch.manual_seed(0)
tf = Transformer(enc_input_size, dec_input_size, dec_output_size, N=6,
            d_model=d_model, dim_feedforward=dim_feedforward,
            num_heads=num_heads, dropout=dropout).to(device)

In [None]:
np.random.seed(0)

batch = torch.tensor(np.random.rand(1, 8,4)).float().to(device)
inp = batch[:,1:,0:2].to(device)
target = batch[:,:-1,2:].to(device)

# We create a third mask channel to append to the 2 speeds.
# This helps the decoder differentiate between the start-of-sequence token (with mask token 1) and target speeds (with mask token 0)
# Summarizing: start_of_seq token is (0,0) and the mask token is 1 ---> [0, 0, 1]
#              target inputs are (u_i, v_i) and the mask token is 0 ---> [u_i, v_i, 0]
start_of_seq = torch.Tensor([0, 0, 1]).unsqueeze(0).unsqueeze(1).repeat(target.shape[0], 1, 1).to(device)
target_c = torch.zeros((target.shape[0], target.shape[1], 1)).to(device)
target = torch.cat((target, target_c), -1)
# Final decoder input is the concatenation of them along temporal dimension
dec_inp = torch.cat((start_of_seq, target), 1)

# Source attention is enabled between all the observed inputs (mask elements are set to 1)
src_att = torch.ones((inp.shape[0], 1, inp.shape[1])).to(device)
# For the target attention we mask future elements to prevent the model from cheating (the corresponding future mask elements are set to False)
# The mask is changed dynamically to use teacher forcing
trg_att = subsequent_mask(dec_inp.shape[1]).repeat(dec_inp.shape[0], 1, 1).to(device)

# Source, target and corresponding attention mask are passed to the model for the forward step
tf.eval()
pred = tf(inp.float(), dec_inp.float(), src_att, trg_att)
pred
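
To make the decoder input and the target mask more concrete, here is a small sketch on a toy sequence. The tensor values and sizes are made up for illustration, and `torch.tril` is used as a stand-in that should produce the same pattern as `subsequent_mask` above (True where attention is allowed).

```python
import torch

# Toy target sequence: 3 time steps, 2 features (u_i, v_i) per step.
target = torch.tensor([[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]])   # [1, 3, 2]

# Append the extra channel: 0 for real steps, 1 for the start-of-sequence token.
target = torch.cat((target, torch.zeros(1, 3, 1)), dim=-1)      # [1, 3, 3]
start_of_seq = torch.tensor([[[0.0, 0.0, 1.0]]])                # [1, 1, 3]
dec_inp = torch.cat((start_of_seq, target), dim=1)              # [1, 4, 3]

# Causal mask: position t may only attend to positions <= t.
size = dec_inp.shape[1]
causal_mask = torch.tril(torch.ones(1, size, size)).bool()

print(dec_inp.shape)
print(causal_mask[0].int())
```

Each row of the mask corresponds to one query position: the first row only sees the start-of-sequence token, the last row sees the whole sequence.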

### Question (**1 Point**)
Q: *Considering the Recurrent Neural Network (RNN) architecture as the previous state-of-the-art for sequence modeling, what are the main advantages of the Transformer architecture?*

A: ...

---
## Graph Metanetworks <a class="anchor" id="gmn"></a>

This section of the notebook covers the paper [Graph Metanetworks for Processing Diverse Neural Architectures](https://arxiv.org/abs/2312.04501v2) by Lim et al., which introduces a novel framework for representing and processing the parameters of neural networks as graph structures. Below you will find an explanation of the key sections of the paper, focusing on the foundational elements from sections 1, 2.1, and 2.3. You are then asked to implement the message-passing GNN that processes these graphs.

<center width="100%" style="padding:25px"><img src="https://drive.google.com/thumbnail?id=1f76OXJFxIr0KZwWAI7nEtSBGEZB9hWN_&sz=w1000"></center>

The paper introduces a new approach to metanetworks (also known as *hypernetworks*), which are neural networks that take the parameters of other neural networks as input. Traditional metanetwork approaches have two main limitations:
* they are often able to process only specific architectures, such as MLPs or CNNs, hence struggling with generalization;
* they consider parameters of neural networks as flattened 1D tensors, hence losing all the structural information of the original network.

In "Graph Metanetworks for Processing Diverse Neural Architectures" the authors propose a new method called Graph Metanetworks (GMNs), which encodes neural networks as *parameter graphs*. These are graphs where each parameter is represented as an edge between nodes, which in turn represent single neurons or groups of neurons. We call them parameter graphs because they are designed so that each parameter is associated with a single edge. These graphs are then fed into standard Graph Neural Networks (also known as Message Passing Neural Networks, MPNNs), which process them while respecting important structural properties of the original neural network. The GMN approach is conceived to generalize across various architectures and components, such as linear, convolutional, attention and normalization layers, ResNet blocks, and so on. The MPNN outputs a fixed-length representation for each node, for each edge, and for the global feature; which of these we use depends on the downstream task. For instance, a graph-level task is to predict the scalar accuracy of an input neural network on some dataset, while an edge-level task is to predict new weights for an input neural network in order to change its functionality. Imagination is the limit!

### A glimpse on the neural network to parameter graph conversion

Every non-recurrent neural network naturally defines a so-called "computation graph" in the form of a DAG, where nodes are neurons and edges hold the values of the network's weights. This already gives us a method to construct a weighted graph; however, the computation graph approach has some downsides. First, it may be expensive because of weight sharing: a shared weight appears on many edges. Second, we may want to add more features on top of this graph, such as layer numbers, to help performance and expressive power when processing it. The following image shows an example computation graph for a network with a single convolutional layer. The layer has a 2 x 2 filter kernel, a single input and output channel, and applies the filter with a stride of 2. Even in this small case of a 4 x 4 input image, the computation graph has 16 edges for only 4 parameters due to weight sharing (for visual clarity, bias terms are ignored here):

<center width="100%" style="padding:25px"><img src="https://drive.google.com/thumbnail?id=1-iFerr4suNhQ_Qza5IpDDCd8JJY__Qja&sz=w500"></center>
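
The edge count of this computation graph can be checked with a few lines of plain Python. This is only an illustrative sketch: the tuple-based edge representation is made up here and is not the data structure used by the paper's code.

```python
from itertools import product

image_size, kernel_size, stride = 4, 2, 2
out_size = (image_size - kernel_size) // stride + 1  # output positions per axis

# One edge per (output pixel, kernel offset): input pixel -> output pixel.
edges = []
for oy, ox, ky, kx in product(range(out_size), range(out_size),
                              range(kernel_size), range(kernel_size)):
    input_pixel = (oy * stride + ky, ox * stride + kx)
    output_pixel = (oy, ox)
    weight_id = (ky, kx)  # the same kernel weights are reused at every location
    edges.append((input_pixel, output_pixel, weight_id))

num_parameters = len({weight_id for _, _, weight_id in edges})
print(len(edges), num_parameters)
```

Every kernel weight shows up on one edge per output location, which is where the 16 edges for 4 parameters come from.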

Instead, the authors of the paper propose the "parameter graph" representation, in which each parameter of the original network is associated with a single edge of the graph (whereas in a computation graph a parameter may be associated with many edges). For a convolutional layer, the parameter graph construction allocates one node for each input channel and one for each output channel. Each input node is then connected to each output node by one parallel edge per spatial location in the filter kernel, which makes this a multigraph. One bias node is added, and the bias parameters are encoded as edges from that bias node to each output channel of the layer. The following figure depicts the parameter graph for a convolutional layer with a 2 x 2 filter, one input channel, and two output channels. Note that we have two output channels here, unlike the computation graph in the previous figure, where there is only one.

<center width="100%" style="padding:25px"><img src="https://drive.google.com/thumbnail?id=1odruV6T_bL6vyddbcopJLpi0QRJIDXWd&sz=w500"></center>
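
As a sanity check on this construction, the sketch below builds the node and edge lists for the layer in the figure and verifies that there is exactly one edge per parameter. The node names and the tuple-based edges are hypothetical; the actual conversion is done by the functions used later in this notebook.

```python
from itertools import product

in_channels, out_channels, kernel_size = 1, 2, 2

input_nodes = [f"in_{c}" for c in range(in_channels)]
output_nodes = [f"out_{c}" for c in range(out_channels)]
bias_node = "bias"

# One parallel edge per kernel position between every (input, output) channel pair.
weight_edges = [(src, dst, (ky, kx))
                for src, dst in product(input_nodes, output_nodes)
                for ky, kx in product(range(kernel_size), range(kernel_size))]
# One edge from the bias node to every output channel.
bias_edges = [(bias_node, dst, "bias") for dst in output_nodes]

num_nodes = len(input_nodes) + len(output_nodes) + 1
num_edges = len(weight_edges) + len(bias_edges)
num_parameters = in_channels * out_channels * kernel_size ** 2 + out_channels
print(num_nodes, num_edges, num_parameters)
```

Unlike the computation graph, the number of edges here does not depend on the size of the input image.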

The generated parameter graph has a feature vector $\mathbf{v_i}$ associated with each node, a feature vector $\mathbf{e_{(i,j)}}$ associated with each edge, and a global feature vector $\mathbf{u}$ associated with the entire graph. The feature vector of the $i^{th}$ node encodes its position inside the neural network and its type (linear, conv, normalization, ...). The edge feature vector $\mathbf{e_{(i,j)}}$ encodes the parameter value, the type, and the position inside the network of the edge starting from node $j$ and reaching node $i$. The global feature can represent anything from a graph-level label to an embedding of some information that we want to inject into the network.

### Loading some parameter graphs

Here we load 2 pretrained MLPs and build a batch containing 2 parameter graphs, to gain familiarity with the data representation being used and to test the subsequent code. We will use the PyTorch Geometric library, which lets us compose Graph Neural Networks in an easy and modular way, just like PyTorch does for standard neural networks. Feel free to look at the [documentation](https://pytorch-geometric.readthedocs.io/en/latest/).

The functions ```sequential_to_arch``` and ```arch_to_graph``` convert the neural network into the parameter graph.

In [2]:
# Create a batch of 2 graphs using torch_geometric's Data and Batch classes
import torch
from torch_geometric.data import Data, Batch
from gmn.graph_construct.model_arch_graph import sequential_to_arch, arch_to_graph

torch.manual_seed(0)


data = torch.load("data/mlp1.pt", map_location=torch.device("cpu"))['pdata'][-1]
arch = sequential_to_arch(data)
x, edge_index, edge_attr = arch_to_graph(arch)
g_data1 = Data(x=x, edge_index=edge_index, edge_attr=edge_attr, u=torch.randn(1, 8))

data = torch.load("data/mlp2.pt", map_location=torch.device("cpu"))['pdata'][-1]
arch = sequential_to_arch(data)
x, edge_index, edge_attr = arch_to_graph(arch)
g_data2 = Data(x=x, edge_index=edge_index, edge_attr=edge_attr, u=torch.randn(1, 8))

batch = Batch.from_data_list([g_data1, g_data2])


print(g_data1)
print(g_data2)
print(batch) # batch.batch has shape [n_nodes], it contains the index of the graph in the batch that each node belongs to

Data(x=[893, 3], edge_index=[2, 52650], edge_attr=[52650, 6], u=[1, 8])
Data(x=[893, 3], edge_index=[2, 52650], edge_attr=[52650, 6], u=[1, 8])
DataBatch(x=[1786, 3], edge_index=[2, 105300], edge_attr=[105300, 6], u=[2, 8], batch=[1786], ptr=[3])




In PyTorch Geometric, graph data is typically represented using the ```torch_geometric.data.Data``` class. Each graph is stored as a set of tensors that describe its nodes, edges, and potentially global properties:
* **x** contains the node features. Here, a graph has 893 nodes, and each node has 3 features;
* **edge_index** encodes the connections (directed edges) between nodes. The shape [2, 52650] means there are 52650 edges in the graph, and the 2 represents the source and target nodes for each edge. The first row of edge_index contains the source nodes of the edges, and the second row contains the target nodes;
* **edge_attr** represents the features of the edges. Each of the 52650 edges has 6 features. Note that edge_attr[i] is the feature vector associated with the edge in edge_index[:, i];
* **u** represents the global feature vector associated with the graph. The shape [1, 8] indicates there is 1 graph and it has a global feature vector of size 8.

Batching is done with the ```torch_geometric.data.Batch``` class, by concatenating the node and edge information of all graphs into a single tensor, as simple as that. In this example, we have a batch of two graphs. You can ignore the **ptr** attribute, while the new **batch** attribute is very useful to keep track of which nodes belong to which graph in the batch. It is a tensor containing for each node the index of the graph in the batch that the node belongs to. For example, the first 893 nodes here would have the value 0 in batch.batch (indicating they belong to the first graph), and the next 893 nodes would have the value 1 (indicating they belong to the second graph).
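
The following sketch reproduces the idea of batching in plain PyTorch on two toy graphs, without using the `Batch` class itself. The graphs and their sizes are made up for illustration; the point is that node indices of the second graph are shifted by the number of nodes of the first one, and that a `batch` vector records the graph each node belongs to.

```python
import torch

# Two toy graphs: 3 nodes / 2 edges and 2 nodes / 1 edge, with 1 node feature each.
x1, ei1 = torch.ones(3, 1), torch.tensor([[0, 1], [1, 2]])
x2, ei2 = torch.zeros(2, 1), torch.tensor([[0], [1]])

# Concatenate node features; shift the second graph's node indices by the number
# of nodes in the first graph so its edges still point at the right rows of x.
x = torch.cat([x1, x2], dim=0)
edge_index = torch.cat([ei1, ei2 + x1.size(0)], dim=1)

# Graph index of every node, analogous to the `batch` attribute.
batch = torch.cat([torch.zeros(3, dtype=torch.long), torch.ones(2, dtype=torch.long)])

print(x.shape, edge_index.tolist(), batch.tolist())
```

The batched graph is simply one large graph made of disconnected components, so message passing never mixes information between the two original graphs.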

### MPNN implementation

We will now implement the Message Passing Neural Network that acts as the backbone in GMN. We will proceed by implementing the single components that process edge features, node features and global features separately, and then merge them together in the final class. Please refer to section 2.3 of the [paper](https://arxiv.org/abs/2312.04501) for more details.

The network implemented here is a generalization of a GNN that updates node, edge, and global features all together. For a graph, let $v_i \in \mathbb{R}^{d_v}$ be the feature of node $i$, $e_{(i,j)} \in \mathbb{R}^{d_e}$ a feature of the directed edge $(i,j)$, $u \in \mathbb{R}^{d_u}$ be a global feature associated to the entire graph, and let $E$ be the set of edges in the graph. The directed edge $(i,j)$ represents an edge starting from $j$ and ending at $i$. We allow multigraphs, where there can be several edges (and hence several edge features) between a pair of nodes $(i,j)$; thus, we let $E_{(i,j)}$ denote the set of edge features associated with $(i,j)$.

### Edge Model (1 point)

In the edge model each edge updates its features based on the features of the nodes it connects and the global feature. The mathematical formulation of the edge model is:

$$e_{(i,j)} \leftarrow \text{MLP}^{e}(v_i, v_j, e_{(i,j)}, u)$$

**IMPORTANT: YOU ARE NOT ALLOWED TO USE FOR LOOPS!**

In [25]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import MetaLayer
from torch_scatter import scatter

class EdgeModel(nn.Module):
    def __init__(self, in_dim, out_dim, activation=True):
        super().__init__()
        if activation:
            self.edge_mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        else:
            self.edge_mlp = nn.Sequential(nn.Linear(in_dim, out_dim))

    def forward(self, src, dest, edge_attr, u, batch):
        # **IMPORTANT: YOU ARE NOT ALLOWED TO USE FOR LOOPS!**
        # src, dest: [E, F_x], where E is the number of edges. src is the source node features and dest is the destination node features of each edge.
        # edge_attr: [E, F_e]
        # u: [B, F_u], where B is the number of graphs.
        # batch: only here it will have shape [E] with max entry B - 1, because here it indicates the graph index for each edge.

        '''
        Add your code below
        '''
        u_batch = u[batch]
        edge_input = torch.cat([src, dest, edge_attr, u_batch], dim=1)
        updated_edge = self.edge_mlp(edge_input)

        return updated_edge
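
The shapes involved in the edge update can be traced on a toy batch. This is a sketch with made-up sizes; the names `node_batch`, `edge_batch` and `u_per_edge` are hypothetical and only illustrate how the per-graph global vector is expanded to one row per edge.

```python
import torch

# Toy batch: 2 graphs, 4 nodes in total, 3 edges (all sizes are illustrative).
x = torch.randn(4, 3)                              # node features, F_x = 3
edge_index = torch.tensor([[0, 1, 2], [1, 0, 3]])  # row 0: sources, row 1: targets
edge_attr = torch.randn(3, 6)                      # edge features, F_e = 6
u = torch.randn(2, 8)                              # one global vector per graph, F_u = 8
node_batch = torch.tensor([0, 0, 1, 1])            # graph index of each node

src, dest = x[edge_index[0]], x[edge_index[1]]     # [E, F_x] each
edge_batch = node_batch[edge_index[0]]             # graph index of each edge, [E]
u_per_edge = u[edge_batch]                         # [E, F_u]

edge_input = torch.cat([src, dest, edge_attr, u_per_edge], dim=1)
print(edge_input.shape)
```

The width of `edge_input` is `2 * F_x + F_e + F_u`, which is the input dimension the edge MLP must be built with.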

### Node Model (2 points)

In the node model, each node updates its features based on its current features, the aggregated messages coming from its neighboring nodes and edges, and the global feature. This is defined by:

$$v_i \leftarrow \text{MLP}_v^2 \left(v_i, \sum_{j,\, e_{(i,j)} \in E_{(i,j)}} \text{MLP}_v^1 (v_i, v_j, e_{(i,j)}, u), u \right)$$

$\text{MLP}_v^1$ and $\text{MLP}_v^2$ are Multi-Layer Perceptrons applied to node, edge and global features. The sum aggregates the messages from all neighboring nodes $v_j$ connected to $v_i$ by edges $e_{(i,j)}$. For the aggregation, please consider using the [scatter](https://pytorch-scatter.readthedocs.io/en/latest/functions/scatter.html#torch_scatter.scatter) function from pytorch_scatter; its documentation is clear.
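
To illustrate what a loop-free sum aggregation does, the sketch below uses PyTorch's built-in `index_add_` as a stand-in for `scatter` with `reduce='sum'`. The assumption is that both give the same result for sum reduction when the output size is fixed to the number of nodes; the toy values are made up.

```python
import torch

messages = torch.tensor([[1.0], [2.0], [4.0], [8.0]])  # one message per edge
target = torch.tensor([0, 0, 2, 2])                    # receiving node of each edge
num_nodes = 4                                          # nodes 1 and 3 receive nothing

# Sum all messages that share the same target index, without a Python loop.
aggregated = torch.zeros(num_nodes, messages.size(1)).index_add_(0, target, messages)
print(aggregated.squeeze(1).tolist())
```

Fixing the number of output rows matters: nodes without incoming edges must still get a (zero) row, otherwise the result cannot be concatenated with the node features.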

**IMPORTANT: YOU ARE NOT ALLOWED TO USE FOR LOOPS!**

In [30]:
class NodeModel(nn.Module):
    def __init__(self, in_dim_mlp1, in_dim_mlp2, out_dim, activation=True, reduce='sum'):
        super().__init__()
        self.reduce = reduce
        if activation:
            self.node_mlp_1 = nn.Sequential(nn.Linear(in_dim_mlp1, out_dim), nn.ReLU())
            self.node_mlp_2 = nn.Sequential(nn.Linear(in_dim_mlp2, out_dim), nn.ReLU())
        else:
            self.node_mlp_1 = nn.Sequential(nn.Linear(in_dim_mlp1, out_dim))
            self.node_mlp_2 = nn.Sequential(nn.Linear(in_dim_mlp2, out_dim))

    def forward(self, x, edge_index, edge_attr, u, batch):
        # **IMPORTANT: YOU ARE NOT ALLOWED TO USE FOR LOOPS!**
        # x: [N, F_x], where N is the number of nodes.
        # edge_index: [2, E] with max entry N - 1.
        # edge_attr: [E, F_e]
        # u: [B, F_u]
        # batch: [N] with max entry B - 1.

        '''
        Add your code below
        '''
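        u_batch = u[batch]  # [N, F_u]: global feature of the graph each node belongs to

        ### MLP1 section ###
        # PyG convention: edge_index[0] holds the source nodes, edge_index[1] the target nodes
        src, dest = edge_index
        # edge_attr was already updated by the edge model from (v_i, v_j, e_(i,j), u),
        # so each message is built from the sender features and the edge features only
        edge_info = torch.cat([x[src], edge_attr], dim=1)
        edge_messages = self.node_mlp_1(edge_info)

        # Aggregate the messages at their target node; dim_size keeps one row per node
        # even for nodes without incoming edges
        aggregated_messages = scatter(edge_messages, dest, dim=0, dim_size=x.size(0), reduce=self.reduce)

        ### MLP2 section ###
        node_info = torch.cat([x, aggregated_messages, u_batch], dim=1)
        updated_node_features = self.node_mlp_2(node_info)

        return updated_node_features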

### Global Model (1 point)

The global feature $u$ is updated based on aggregations of all node and edge features, and the global feature itself:

$$u \leftarrow \text{MLP}_u \left( \sum_i v_i, \sum_{e \in E} e, u \right)$$

**IMPORTANT: YOU ARE NOT ALLOWED TO USE FOR LOOPS!**

In [19]:
class GlobalModel(nn.Module):
    def __init__(self, in_dim, out_dim, activation=True, reduce='sum'):
        super().__init__()
        if activation:
            self.global_mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        else:
            self.global_mlp = nn.Sequential(nn.Linear(in_dim, out_dim))
        self.reduce = reduce

    def forward(self, x, edge_index, edge_attr, u, batch):
        #**IMPORTANT: YOU ARE NOT ALLOWED TO USE FOR LOOPS!**
        # x: [N, F_x], where N is the number of nodes.
        # edge_index: [2, E] with max entry N - 1.
        # edge_attr: [E, F_e]
        # u: [B, F_u]
        # batch: [N] with max entry B - 1.

        '''
        Add your code below
        '''
        # Aggregate node features per graph
        node_aggregated = scatter(x, batch, dim=0, dim_size=u.size(0), reduce=self.reduce)

        # Aggregate edge features per graph (an edge belongs to the graph of its source node)
        edge_batch = batch[edge_index[0]]
        edge_aggregated = scatter(edge_attr, edge_batch, dim=0, dim_size=u.size(0), reduce=self.reduce)

        # Concatenate the aggregated node features, the aggregated edge features and the
        # global vector u, which already has one row per graph ([B, F_u])
        global_info = torch.cat([node_aggregated, edge_aggregated, u], dim=1)

        # Pass to MLP
        updated_global = self.global_mlp(global_info)

        return updated_global

### MPNN (2 points)

Now you are asked to put the pieces together and build the complete MPNN. We will use the [torch_geometric.nn.MetaLayer](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.MetaLayer.html) class, which handles some of the scaffolding code for us. Note that inside the MetaLayer the edge model is applied first, then the node model, and finally the global model, each of them using the output of the previous one. If needed, have a look at the [source code](https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/nn/models/meta.html#MetaLayer) of the class to better understand the tensor shapes you should expect.
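
The update order described above can be restated as a small sketch. The function `meta_layer_step` is a hypothetical re-statement written for this notebook, not the library code, and the three lambdas are stand-in models that only change the feature width so that the data flow becomes visible; please check the linked source for the authoritative behavior.

```python
import torch

def meta_layer_step(edge_model, node_model, global_model, x, edge_index, edge_attr, u, batch):
    """Hypothetical re-statement of the update order: edges, then nodes, then global."""
    row, col = edge_index
    edge_attr = edge_model(x[row], x[col], edge_attr, u, batch[row])
    x = node_model(x, edge_index, edge_attr, u, batch)    # sees the updated edges
    u = global_model(x, edge_index, edge_attr, u, batch)  # sees updated nodes and edges
    return x, edge_attr, u

# Stand-in models: output widths depend on what each model receives.
edge_model = lambda src, dest, e, u, b: torch.zeros(e.size(0), 16)
node_model = lambda x, ei, e, u, b: torch.zeros(x.size(0), e.size(1))
global_model = lambda x, ei, e, u, b: torch.zeros(u.size(0), x.size(1) + e.size(1))

x, e, u = meta_layer_step(edge_model, node_model, global_model,
                          torch.randn(4, 3), torch.tensor([[0, 1, 2], [1, 0, 3]]),
                          torch.randn(3, 6), torch.randn(2, 8), torch.tensor([0, 0, 1, 1]))
print(x.shape, e.shape, u.shape)
```

Because each model receives the outputs of the previous one, the input dimension of the node model depends on the output width of the edge model, and the input dimension of the global model depends on the output widths of both.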

In [31]:
class MPNN(nn.Module):

    def __init__(self, node_in_dim, edge_in_dim, global_in_dim, hidden_dim, node_out_dim, edge_out_dim, global_out_dim, num_layers,
                use_bn=True, dropout=0.0, reduce='sum'):
        super().__init__()
        self.convs = nn.ModuleList()
        self.node_norms = nn.ModuleList()
        self.edge_norms = nn.ModuleList()
        self.global_norms = nn.ModuleList()
        self.use_bn = use_bn
        self.dropout = dropout
        self.reduce = reduce

        assert num_layers >= 2

        '''
        Instantiate the first layer models with correct parameters below
        '''

        # MetaLayer applies the edge model first, then the node model, then the global model,
        # so each model receives the already-updated outputs of the previous ones.
        edge_model = EdgeModel(in_dim=edge_in_dim + 2 * node_in_dim + global_in_dim, out_dim=hidden_dim)
        node_model = NodeModel(in_dim_mlp1=node_in_dim + hidden_dim, in_dim_mlp2=node_in_dim + hidden_dim + global_in_dim, out_dim=hidden_dim, reduce=reduce)
        global_model = GlobalModel(in_dim=2 * hidden_dim + global_in_dim, out_dim=hidden_dim, reduce=reduce)
        self.convs.append(MetaLayer(edge_model=edge_model, node_model=node_model, global_model=global_model))
        self.node_norms.append(nn.BatchNorm1d(hidden_dim))
        self.edge_norms.append(nn.BatchNorm1d(hidden_dim))
        self.global_norms.append(nn.BatchNorm1d(hidden_dim))

        for _ in range(num_layers-2):
            '''
            Add your code below
            '''
            # hidden layers: node, edge and global features all have size hidden_dim;
            # add batch norm after each MetaLayer
            edge_model = EdgeModel(in_dim=4 * hidden_dim, out_dim=hidden_dim)
            node_model = NodeModel(in_dim_mlp1=2 * hidden_dim, in_dim_mlp2=3 * hidden_dim, out_dim=hidden_dim, reduce=reduce)
            global_model = GlobalModel(in_dim=3 * hidden_dim, out_dim=hidden_dim, reduce=reduce)
            self.convs.append(MetaLayer(edge_model=edge_model, node_model=node_model, global_model=global_model))
            self.node_norms.append(nn.BatchNorm1d(hidden_dim))
            self.edge_norms.append(nn.BatchNorm1d(hidden_dim))
            self.global_norms.append(nn.BatchNorm1d(hidden_dim))


        '''
        Add your code below
        '''
        # last MetaLayer without batch norm and without activation functions;
        # it maps the hidden features to the requested output sizes
        edge_model = EdgeModel(in_dim=4 * hidden_dim, out_dim=edge_out_dim, activation=False)
        node_model = NodeModel(in_dim_mlp1=hidden_dim + edge_out_dim, in_dim_mlp2=2 * hidden_dim + node_out_dim, out_dim=node_out_dim, activation=False, reduce=reduce)
        global_model = GlobalModel(in_dim=node_out_dim + edge_out_dim + hidden_dim, out_dim=global_out_dim, activation=False, reduce=reduce)
        self.convs.append(MetaLayer(edge_model=edge_model, node_model=node_model, global_model=global_model))


    def forward(self, x, edge_index, edge_attr, u, batch, *args):

        for i, conv in enumerate(self.convs):
            '''
            Add your code below
            '''
            x, edge_attr, u = conv(x, edge_index, edge_attr, u, batch)

            if i != len(self.convs)-1 and self.use_bn:
                '''
                Add your code below this line, but before the dropout
                '''
                x = self.node_norms[i](x)
                edge_attr = self.edge_norms[i](edge_attr)
                u = self.global_norms[i](u)



            x = F.dropout(x, p=self.dropout, training=self.training)
            edge_attr = F.dropout(edge_attr, p=self.dropout, training=self.training)
            u = F.dropout(u, p=self.dropout, training=self.training)

        return x, edge_attr, u


**Do not change the following code. It is used as a sanity check to verify that your implementation is correct.**

In [32]:
torch.manual_seed(0)
torch.cuda.manual_seed(0)
np.random.seed(0)
random.seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# try the MPNN model
node_in_dim = 3
edge_in_dim = 6
global_in_dim = 8
hidden_dim = 16
node_out_dim = 8
edge_out_dim = 8
global_out_dim = 8
num_layers = 3
use_bn = True
dropout = 0.1
reduce = 'sum'

model = MPNN(node_in_dim, edge_in_dim, global_in_dim, hidden_dim, node_out_dim, edge_out_dim, global_out_dim, num_layers, use_bn, dropout, reduce)

# use the batch object created before
x, edge_attr, u = model(batch.x, batch.edge_index, batch.edge_attr, batch.u, batch.batch)
print(x.shape, edge_attr.shape, u.shape)
print()
print(x)
print()
print(edge_attr)
print()
print(u)
