<a href="https://colab.research.google.com/github/dp457/Graph-Neural-Network/blob/main/Heterogenous_Graph_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Many real-world datasets are stored as **heterogeneous graphs**, which motivates specialized functionality for them. For example, most graphs in the area of recommendation, such as social graphs, are heterogeneous, as they store information about different types of entities and their different types of relations. This tutorial shows how to represent heterogeneous graphs and use them as input to a GNN.

Heterogeneous graphs come with different types of information attached to nodes and edges. Thus, a single node or edge feature tensor cannot hold all node or edge features of the whole graph, due to differences in type and dimensionality. Instead, the graph needs a different data structure, and the message passing formulation changes accordingly, so that the message and update functions are computed based on node or edge type.

The data we consider is a heterogeneous graph with four node types: author, paper, institution, and field of study. Each edge is one of four types:
*   **writes** - An author writes a specific paper.
*   **affiliated with** - An author is affiliated with a specific institution.
*   **cites** - A paper cites another paper.
*   **has topic** - A paper has a topic of a specific field of study.





In [1]:
!pip install torch_geometric

Collecting torch_geometric
  Downloading torch_geometric-2.6.1-py3-none-any.whl.metadata (63 kB)
Downloading torch_geometric-2.6.1-py3-none-any.whl (1.1 MB)
Installing collected packages: torch_geometric
Successfully installed torch_geometric-2.6.1


In [2]:
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Node features
num_papers, num_authors, num_institutions, num_fields = 5, 3, 2, 4
feat_paper, feat_author, feat_inst, feat_field = 16, 8, 6, 10

data['paper'].x = torch.randn(num_papers, feat_paper)             # [5, 16]
data['author'].x = torch.randn(num_authors, feat_author)          # [3, 8]
data['institution'].x = torch.randn(num_institutions, feat_inst)  # [2, 6]
data['field_of_study'].x = torch.randn(num_fields, feat_field)    # [4, 10]

# Edge indices
num_edges_cites, num_edges_writes, num_edges_affiliated, num_edges_topic = 6, 4, 3, 5

data['paper', 'cites', 'paper'].edge_index = torch.randint(0, num_papers, (2, num_edges_cites))
data['author', 'writes', 'paper'].edge_index = torch.stack([
    torch.randint(0, num_authors, (num_edges_writes,)),
    torch.randint(0, num_papers, (num_edges_writes,))
], dim=0)
data['author', 'affiliated_with', 'institution'].edge_index = torch.stack([
    torch.randint(0, num_authors, (num_edges_affiliated,)),
    torch.randint(0, num_institutions, (num_edges_affiliated,))
], dim=0)
data['paper', 'has_topic', 'field_of_study'].edge_index = torch.stack([
    torch.randint(0, num_papers, (num_edges_topic,)),
    torch.randint(0, num_fields, (num_edges_topic,))
], dim=0)

# Edge attributes
feat_cites, feat_writes, feat_affiliated, feat_topic = 4, 3, 2, 5

data['paper', 'cites', 'paper'].edge_attr = torch.randn(num_edges_cites, feat_cites)
data['author', 'writes', 'paper'].edge_attr = torch.randn(num_edges_writes, feat_writes)
data['author', 'affiliated_with', 'institution'].edge_attr = torch.randn(num_edges_affiliated, feat_affiliated)
data['paper', 'has_topic', 'field_of_study'].edge_attr = torch.randn(num_edges_topic, feat_topic)

print(data)


HeteroData(
  paper={ x=[5, 16] },
  author={ x=[3, 8] },
  institution={ x=[2, 6] },
  field_of_study={ x=[4, 10] },
  (paper, cites, paper)={
    edge_index=[2, 6],
    edge_attr=[6, 4],
  },
  (author, writes, paper)={
    edge_index=[2, 4],
    edge_attr=[4, 3],
  },
  (author, affiliated_with, institution)={
    edge_index=[2, 3],
    edge_attr=[3, 2],
  },
  (paper, has_topic, field_of_study)={
    edge_index=[2, 5],
    edge_attr=[5, 5],
  }
)


Node or edge tensors are created automatically upon first access and indexed by string keys. Node types are identified by a single string, while edge types are identified by a triplet **(source_node_type, edge_type, destination_node_type)**. As such, the data object allows different feature dimensionalities for each type.

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import HeteroConv, GCNConv, SAGEConv, GATConv

In [10]:
class HeteroGNN(nn.Module):
  def __init__(self, metadata, hidden_channels, out_channels):
    super().__init__()

    # metadata = (node_types, edge_types), as returned by data.metadata()

    # Heterogeneous convolution: one operator per edge type.
    # in_channels=-1 lets PyG infer the input sizes on the first forward pass.
    # Results are returned only for destination node types.
    self.conv1 = HeteroConv({
      ('paper', 'cites', 'paper'): GCNConv(-1, hidden_channels, add_self_loops=True),
      ('author', 'writes', 'paper'): SAGEConv((-1, -1), hidden_channels),
      ('author', 'affiliated_with', 'institution'): SAGEConv((-1, -1), hidden_channels),
      ('paper', 'has_topic', 'field_of_study'): GATConv((-1, -1), hidden_channels, add_self_loops=False)
    }, aggr='mean')

    # One linear projection per node type
    self.lin_dict = nn.ModuleDict()
    for node_type in metadata[0]:
      self.lin_dict[node_type] = nn.Linear(hidden_channels, out_channels)

  def forward(self, x_dict, edge_index_dict, edge_attr_dict=None):
    # edge_attr_dict is accepted for convenience but not used here.

    # 1. Message passing
    x_dict = self.conv1(x_dict, edge_index_dict)

    # 2. Apply non-linearity
    x_dict = {k: F.relu(v) for k, v in x_dict.items()}

    # 3. Apply per-node-type linear classifier / projection
    out_dict = {k: self.lin_dict[k](v) for k, v in x_dict.items()}

    return out_dict

In [12]:
from torch_geometric.transforms import ToUndirected

# Grab metadata from data: (node_types, edge_types)
metadata = data.metadata()

# Add reverse edges. The model only defines convolutions for the four
# forward edge types, so HeteroConv ignores the added rev_* edge types.
data = ToUndirected()(data)

model = HeteroGNN(metadata, hidden_channels=32, out_channels=8)

out = model(data.x_dict, data.edge_index_dict, data.edge_attr_dict)
print({k: v.shape for k, v in out.items()})

{'paper': torch.Size([5, 8]), 'institution': torch.Size([2, 8]), 'field_of_study': torch.Size([4, 8])}


# Heterogeneous GNNs

Standard message passing GNNs (MP-GNNs) cannot be applied directly to heterogeneous graph data: node and edge features of different types cannot be processed by the same functions, because their feature types and dimensionalities differ.
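
The mismatch can be seen in a few lines of plain PyTorch. This is a minimal sketch with made-up feature sizes (16 for papers, 8 for authors): a single shared linear layer sized for one node type fails on the other, while one layer per node type works.

```python
import torch
import torch.nn as nn

# Hypothetical per-type node features with different dimensionalities
x_dict = {'paper': torch.randn(5, 16), 'author': torch.randn(3, 8)}

# A single shared function sized for 'paper' features...
shared = nn.Linear(16, 32)
try:
    shared(x_dict['author'])       # ...cannot process 8-dimensional inputs
    shared_ok = True
except RuntimeError:
    shared_ok = False
print('shared layer works on authors:', shared_ok)

# One function per node type handles each dimensionality separately
per_type = nn.ModuleDict({k: nn.Linear(v.size(-1), 32) for k, v in x_dict.items()})
h_dict = {k: per_type[k](v) for k, v in x_dict.items()}
print({k: tuple(v.shape) for k, v in h_dict.items()})
```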

The natural way to handle this is to implement the message and update functions individually for each edge type. At runtime, the MP-GNN would have to iterate over the edge type dictionaries during message computation and over the node type dictionaries during the node updates. To avoid unnecessary runtime overhead and to make creating heterogeneous MP-GNNs as simple as possible, there are three ways:

1. Automatically convert a homogeneous model to a heterogeneous model.
2. Define individual functions for different types using the PyG wrapper for heterogeneous convolution.
3. Deploy existing heterogeneous GNN operators.
