In [None]:
# Please, make a copy of the notebook before we start.
import torch
import torch.nn as nn

# **Graph convolutions**

*   GCNs are similar to convolutions on images in the sense that the weights are typically shared over all locations in the graph.
*   GCNs rely on message passing: vertices exchange information with their neighbours.

### **Message passing**

<center width="100%" style="padding:10px"> <img src ="https://drive.google.com/uc?id=1615w8L1EE5Eu3PSlg8XYlNcyWwepr1p-" width="700px"></center>


*   Step 1: Each node creates a "*message*", i.e. a feature vector.
*   Step 2: The messages are sent to the neighbours, so a node receives one message from a neighbour if they are connected via an edge.
*   Step 3: An aggregation function is applied to the messages received by each node.
  - Typical aggregation functions: *sum*, *mean*.
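
The three steps above can be sketched directly on an edge list. This is only an illustration: the toy graph, the variable names and the use of `index_add_` are assumptions, not part of the notebook's `GCNLayer`, and no self-connections are included here.

```python
import torch

# Assumed toy graph with 4 nodes; every undirected edge is stored in both directions
# as a (source, target) pair.
edges = torch.tensor([[0, 1], [1, 0], [1, 2], [2, 1],
                      [1, 3], [3, 1], [2, 3], [3, 2]])
src, dst = edges[:, 0], edges[:, 1]

# Step 1: each node creates a message (here simply its feature vector).
messages = torch.arange(8, dtype=torch.float32).view(4, 2)

# Steps 2 and 3 (sum): each target node accumulates the messages of its sources.
summed = torch.zeros_like(messages).index_add_(0, dst, messages[src])

# Step 3 (mean): divide the sum by the number of messages each node received.
counts = torch.zeros(4).index_add_(0, dst, torch.ones(len(dst)))
mean = summed / counts.unsqueeze(-1)

print(summed)
print(mean)
```

Switching between *sum* and *mean* only changes the final normalization; the message exchange itself is identical.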

Let's define the message passing in mathematical terms:

> * $x_i$ - the feature vector of node $i$. The vectors are stacked into an $N \times D$ matrix $X$, where $N$ is the number of nodes and $D$ is the number of input features.
* $\hat A = A + I$ - the sum of the adjacency matrix and the identity matrix, $\hat A \in \mathbb{R}^{N \times N}$. Adding $I$ gives every node an edge to itself.
* $\hat D$ - a diagonal node degree matrix of $\hat A$.
* $H^l$ - a feature matrix of the $l$-th layer. $H^0 = X$
* $W^l$ - a weight matrix of the $l$-th layer.
* $\sigma$ - activation function, typically of the ReLU family.

$$H^{l+1} = \sigma\left(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}H^{l}W^{l}\right)$$

**$H^l W^l$ creates the messages from the node features. Multiplying by $\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}$ aggregates the messages arriving from the neighbours (and from the node itself) and normalizes the result by the node degrees.**
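
As a rough sketch, the propagation rule can be written out with plain tensor operations. The toy adjacency matrix, the identity weights and the choice of ReLU for $\sigma$ are assumptions made for illustration.

```python
import torch

# Assumed toy graph: adjacency matrix WITHOUT self-connections for 4 nodes.
A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 1.],
                  [0., 1., 0., 1.],
                  [0., 1., 1., 0.]])
X = torch.arange(8, dtype=torch.float32).view(4, 2)  # H^0 = X
W = torch.eye(2)                                     # identity weights: messages equal features

A_hat = A + torch.eye(4)                 # add self-connections
deg = A_hat.sum(dim=-1)                  # diagonal entries of D_hat
D_inv_sqrt = torch.diag(deg.pow(-0.5))   # D_hat^{-1/2}

A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
H_next = torch.relu(A_norm @ X @ W)      # sigma is assumed to be ReLU here

print(A_norm)
print(H_next)
```

Note that the rows of `A_norm` do not sum to one in general, so the symmetric normalization yields a degree-normalized sum rather than an exact mean. Dividing by the neighbour count instead, i.e. $\hat{D}^{-1}\hat{A}$, gives the exact mean.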

### **Graph convolutional layer**

In [None]:
class GCNLayer(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.projection = nn.Linear(c_in, c_out)

    def forward(self, node_feats, adj_matrix):
        """
        Inputs:
            node_feats - Tensor with node features of shape [batch_size, num_nodes, c_in]
            adj_matrix - Batch of adjacency matrices of the graph. If there is an edge from i to j, adj_matrix[b,i,j]=1 else 0.
                         Supports directed edges via non-symmetric matrices. Assumes the identity (self-)connections have already been added.
                         Shape: [batch_size, num_nodes, num_nodes]
        """
        # TODO: write the forward pass together.
        num_neighbours = adj_matrix.sum(dim=-1, keepdim=True)  # Sum across columns to count each node's neighbours
        node_feats = self.projection(node_feats)  # Create a message for every node
        node_feats = torch.bmm(adj_matrix, node_feats)  # Pass the messages: sum over the neighbours
        node_feats = node_feats / num_neighbours  # Average the received messages

        return node_feats

### **Exercise 1.1: Understanding the GCN layer**

To further understand the GCN layer, apply it to the example graph above. 

Using `torch`, please, do the following:

*   Create the adjacency matrix for the example graph above.
  - **hint**: use `torch.Tensor`, create a matrix of the shape (4, 4) and reshape it to (1, 4, 4).
*   Create a feature tensor `node_feats` of the shape (1, 4, 2) holding the numbers from 0 to 7.
  - **hint**: check `torch.arange` and `torch.Tensor.view`; use a floating-point dtype so the tensor can be passed through `nn.Linear`.
*   Create a `gcn_layer` using the class we wrote above. Initialize the weight matrix of the linear projection as an identity matrix and set the biases to zero, so that the messages equal the input features and you can easily see the message passing mechanism in action.
  - **hint**: set `c_in` and `c_out` in such a way that the node features will have the same shape after the `self.projection` has been applied.
  - **hint**: change these two attributes: `self.projection.weight.data` and `self.projection.bias.data`
*   Create an output feature tensor `out_feats` by passing it through the GCN layer (don't forget to use the adjacency matrix in the forward pass).
*   Print: the adjacency matrix, `node_feats` and `out_feats`.
*   Verify the outputs of the nodes. Are they the averages of the neighbouring node features? Can you prove it?
*   Think about the questions: 
  - How can we pass the information between the nodes 1 and 3?
  - How does the computation time for receiving a message at node $i$ scale when we increase the number of nodes that we want to include in the message passing procedure? Assume that a path between $i$ and those nodes exists.




In [None]:
adjacency_matrix = torch.Tensor([[1, 1, 0, 0], [1, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]])  # Creating the adjacency matrix (self-connections included)
adjacency_matrix = torch.reshape(adjacency_matrix, (1, 4, 4)) #Reshaping
print(adjacency_matrix.shape)
print(adjacency_matrix)

torch.Size([1, 4, 4])
tensor([[[1., 1., 0., 0.],
         [1., 1., 1., 1.],
         [0., 1., 1., 1.],
         [0., 1., 1., 1.]]])


In [None]:
node_feats = torch.arange(start=0, end=8, step=1, dtype=torch.float32)  # float dtype so nn.Linear accepts the input
node_feats = torch.reshape(node_feats, (1, 4, 2))
print(node_feats.shape)
print(node_feats)

torch.Size([1, 4, 2])
tensor([[[0., 1.],
         [2., 3.],
         [4., 5.],
         [6., 7.]]])


In [None]:
gcn_layer = GCNLayer(c_in=2 , c_out=2) # initializing the GCNLayer class
gcn_layer.projection.weight.data = torch.eye(2)
gcn_layer.projection.bias.data = torch.zeros(2)

In [32]:
gcn_layer(
    node_feats=node_feats,
    adj_matrix=adjacency_matrix
)

tensor([[[1., 2.],
         [3., 4.],
         [4., 5.],
         [4., 5.]]], grad_fn=<DivBackward0>)

In [27]:
# Exercise 1.1
node_feats = torch.arange(8, dtype=torch.float32).view(1, 4, 2)
adj_matrix = torch.Tensor([[[1, 1, 0, 0],
                            [1, 1, 1, 1],
                            [0, 1, 1, 1],
                            [0, 1, 1, 1]]])

gcn_layer = GCNLayer(c_in=2, c_out=2)
gcn_layer.projection.weight.data = torch.Tensor([[1., 0.], [0., 1.]])
gcn_layer.projection.bias.data = torch.Tensor([0., 0.])

with torch.no_grad():
    out_feats = gcn_layer(node_feats, adj_matrix)

print("Adjacency matrix", adj_matrix)
print("Input features", node_feats)
print("Output features", out_feats) 

Adjacency matrix tensor([[[1., 1., 0., 0.],
         [1., 1., 1., 1.],
         [0., 1., 1., 1.],
         [0., 1., 1., 1.]]])
Input features tensor([[[0., 1.],
         [2., 3.],
         [4., 5.],
         [6., 7.]]])
Output features tensor([[[1., 2.],
         [3., 4.],
         [4., 5.],
         [4., 5.]]])
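
One way to explore the question about nodes 1 and 3 is to apply the mean aggregation twice. With identity weights, a GCN layer reduces to multiplying by the row-normalized adjacency matrix, so stacked layers can be emulated by repeated multiplication. The sketch below assumes that the nodes in the figure are numbered from 1, so nodes 1 and 3 correspond to indices 0 and 2; the one-hot features are a hypothetical choice that makes each node's contribution visible.

```python
import torch

adj = torch.tensor([[[1., 1., 0., 0.],
                     [1., 1., 1., 1.],
                     [0., 1., 1., 1.],
                     [0., 1., 1., 1.]]])

# Row-normalized adjacency: one mean-aggregation step with identity weights.
mean_op = adj / adj.sum(dim=-1, keepdim=True)

# One-hot features: column j of the output shows how much of node j's
# information has reached each node.
feats = torch.eye(4).unsqueeze(0)

one_layer = torch.bmm(mean_op, feats)
two_layers = torch.bmm(mean_op, one_layer)

print(one_layer[0, 0])   # tensor([0.5000, 0.5000, 0.0000, 0.0000])
print(two_layers[0, 0])  # tensor([0.3750, 0.3750, 0.1250, 0.1250])
```

After one step the first node has received nothing from the third node, while after two steps it has, because the information travels via their common neighbour.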


### **GCN and its limitation**

<center width="100%" style="padding:10px"> <img src ="https://drive.google.com/uc?id=1OQusIcuEng0MJThwtWGXSmC8vA-euFcS" width="600px"></center>


**Nodes 3 and 4 end up with the same features because they have the same set of adjacent nodes (including themselves). Therefore, GCN layers can make the network forget node-specific information if we just take a mean over all messages.** (figure credit - [Thomas Kipf, 2016](https://tkipf.github.io/graph-convolutional-networks/))

#### Solutions:
*   Residual connections.
*   Weigh the self-connections higher.
*   Define a separate weight matrix for the self-connections. 
*   **Compute weights dynamically using graph attention.**
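
The first two remedies can be sketched on the example graph. This is only an illustration under assumptions: identity weights are used so that the messages equal the features, and the self-connection weight of 2 is a hypothetical value.

```python
import torch

adj = torch.tensor([[[1., 1., 0., 0.],
                     [1., 1., 1., 1.],
                     [0., 1., 1., 1.],
                     [0., 1., 1., 1.]]])      # self-connections included
feats = torch.arange(8, dtype=torch.float32).view(1, 4, 2)

# Plain mean aggregation: the last two nodes become indistinguishable.
mean_agg = torch.bmm(adj, feats) / adj.sum(dim=-1, keepdim=True)

# Residual connection: add the node's own features back to the aggregate.
with_residual = feats + mean_agg

# Heavier self-connection: put a larger weight on the diagonal (assumed value).
self_weight = 2.0
adj_weighted = adj + (self_weight - 1.0) * torch.eye(4)
weighted_agg = torch.bmm(adj_weighted, feats) / adj_weighted.sum(dim=-1, keepdim=True)

print(mean_agg[0, 2:], with_residual[0, 2:], weighted_agg[0, 2:], sep="\n")
```

In both variants the last two nodes keep distinct features, because each node's own message now counts for more than the messages of its neighbours.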



### **Graph attention networks**


*   Attention describes a weighted average of multiple elements with the weights dynamically computed based on an input query and elements' keys (figure credit - [Tutorial on transformers and attention mechanism](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html)).

<center width="100%" style="padding:10px"> <img src ="https://drive.google.com/uc?id=1nQWAIMKknM5oAolkP7f51eHsYOqYXepY" width="700px"></center>

*   As in the GCN, the graph attention (GAT) layer creates a message for each node using a weight matrix.
*   **Query** - the message from the node itself.
*   **Keys = Values** - the messages from the neighbours and **the node itself**.
*   **The score function** - a one-layer MLP with the LeakyReLU non-linearity.
*   The graph structure is injected by masking out some of the nodes when the attention coefficients are computed. Usually only the first-order neighbours $\mathcal{N}_i$ of node $i$ are taken into account.

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(\mathbf{a}\left[\mathbf{W}h_i||\mathbf{W}h_j\right]\right)\right)}{\sum_{k\in\mathcal{N}_i} \exp\left(\text{LeakyReLU}\left(\mathbf{a}\left[\mathbf{W}h_i||\mathbf{W}h_k\right]\right)\right)}$$


<center width="100%" style="padding:10px"> <img src ="https://drive.google.com/uc?id=1ZB6cbFtCdRznoBa4sTimyyP3_cwmN5_8" width="300px"></center>

The final message of the $i$-th node is computed according to:

$$\mathbf{h_i}'=\sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}\mathbf{W}\mathbf{h_j}\right)$$
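
The two formulas above can be sketched for a single attention head as follows. The random weights, the output size `c_out`, the LeakyReLU slope of 0.2 and the use of ELU for $\sigma$ are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

adj = torch.tensor([[1., 1., 0., 0.],
                    [1., 1., 1., 1.],
                    [0., 1., 1., 1.],
                    [0., 1., 1., 1.]])        # self-connections included
h = torch.arange(8, dtype=torch.float32).view(4, 2)

c_out = 3                       # hypothetical message size
W = torch.randn(2, c_out)       # hypothetical weight matrix
a = torch.randn(2 * c_out)      # hypothetical attention vector

Wh = h @ W                                      # messages, shape [N, c_out]

# a [Wh_i || Wh_j] splits into a part that depends on i and a part that depends on j.
score_i = Wh @ a[:c_out]                        # shape [N]
score_j = Wh @ a[c_out:]                        # shape [N]
logits = F.leaky_relu(score_i.unsqueeze(1) + score_j.unsqueeze(0), negative_slope=0.2)

# Mask out non-neighbours so that they receive zero attention after the softmax.
logits = logits.masked_fill(adj == 0, float("-inf"))
alpha = torch.softmax(logits, dim=-1)           # alpha[i, j]

h_new = F.elu(alpha @ Wh)                       # sigma is assumed to be ELU here

print(alpha)
print(h_new.shape)
```

Each row of `alpha` sums to one over the neighbourhood of the corresponding node, so the update is a weighted average of the neighbours' messages.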

#### **Multi-head attention**

<center width="100%" style="padding:10px"> <img src ="https://drive.google.com/uc?id=1WCiCGqrXlQU-WAYmvfADFi1p7tTBsFM6" width="400px"></center>


*   For layers from $0$ to $L-1$:
  $$\mathbf{h_i}'= \parallel_{k=1}^{K} \sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\mathbf{h_j}\right)$$
*   For the last layer $L$:
$$\mathbf{h_i}'= \sigma\left(  \frac{1}{K} \sum_{k=1}^{K} \sum_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\mathbf{h_j}\right)$$
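
The difference between the two rules is only in how the heads are combined. In the sketch below, `head_out` is a random stand-in for the per-head aggregates $\sum_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\mathbf{h_j}$; the sizes and the choice of ReLU for $\sigma$ are assumptions.

```python
import torch

torch.manual_seed(0)
N, K, c_head = 4, 3, 2   # hypothetical numbers of nodes, heads and features per head

# Stand-in for the aggregated messages of every head, shape [K, N, c_head].
head_out = torch.randn(K, N, c_head)

# Hidden layers: apply the non-linearity per head, then concatenate the heads.
hidden = torch.cat([torch.relu(head_out[k]) for k in range(K)], dim=-1)   # [N, K * c_head]

# Last layer: average over the heads first, then apply the non-linearity.
final = torch.relu(head_out.mean(dim=0))                                  # [N, c_head]

print(hidden.shape, final.shape)
```

Concatenation grows the feature dimension by a factor of $K$, whereas averaging keeps it at the size of a single head, which is convenient when the last layer has to produce a fixed number of outputs.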
