### Prepare environment
We use PyTorch Geometric (PyG) for this experiment. PyG provides a set of graph benchmark datasets and standard graph neural network (GNN) models.

**Note**: the code used in this experiment follows the [PyG tutorial](https://pytorch-geometric.readthedocs.io/en/latest/notes/colabs.html).

In [None]:
# install required packages for PyG
!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
!pip install -q git+https://github.com/pyg-team/pytorch_geometric.git



In [None]:
# import all necessary libraries
import time
from collections import Counter

from prettytable import PrettyTable
import torch
from torch import nn
import torch.nn.functional as F

from torch_geometric import datasets
import torch_geometric.transforms as T
from torch_geometric.nn import GCNConv

torch.manual_seed(2022)  # to reproduce the result

<torch._C.Generator at 0x7efba1d700f0>

### Dataset
We use two different datasets to compare how a GNN model performs on each.

The first dataset is the standard citation network dataset `Cora`. In this dataset, nodes represent scientific documents and edges represent citation links between them. The dataset is well cleaned, with ~2.7k nodes. Each node carries rich features derived from its text content. We use the default training and testing split from [this paper](https://arxiv.org/abs/1603.08861).

The second dataset is a collection of Amazon products ([source](https://arxiv.org/abs/1811.05868)). Nodes represent products, and an edge indicates that two products were in the same order (bought together). We use the large dataset, `Computers`, with ~13k nodes. Node features are user reviews represented as bag-of-words vectors (_claim_: not a high-quality representation). This dataset is larger and noisier than the first one, so it is inherently more challenging for our model.

Let's download the datasets and see some statistics about them!

In [None]:
# get dataset
cora_ds = datasets.Planetoid(root='data/Planetoid', name='Cora')
amazon_ds = datasets.Amazon(root='data/Amazon', name='Computers')

# get some stats
ds_names = ['Cora', 'Amazon']
data_stats = {
    'Cora': {},
    'Amazon': {}
}

for name, dataset in zip(ds_names, [cora_ds, amazon_ds]):
  data_stats[name]['Number of graphs'] = len(dataset)
  data_stats[name]['Number of features'] = dataset.num_features
  data_stats[name]['Number of classes'] = dataset.num_classes
  data_stats[name]['Number of nodes'] = dataset[0].num_nodes
  data_stats[name]['Number of edges'] = dataset[0].num_edges
  data_stats[name]['Number of node per class'] = dict(Counter(dataset[0].y.tolist()))
  data_stats[name]['Average node degree'] = round(dataset[0].num_edges / dataset[0].num_nodes)
  data_stats[name]['Train-test prepared?'] = hasattr(dataset[0], "train_mask")
  data_stats[name]['Has isolated nodes'] = dataset[0].has_isolated_nodes()
  data_stats[name]['Has self-loops'] = dataset[0].has_self_loops()
  data_stats[name]['Is undirected graph'] = dataset[0].is_undirected()

table = PrettyTable()
table.field_names = ['Attribute'] + ds_names
for attribute in sorted(data_stats[ds_names[0]].keys(), reverse=True):
  table.add_row([attribute, data_stats[ds_names[0]][attribute], data_stats[ds_names[1]][attribute]])

print('Dataset stats comparison:')
print(table)

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...
Done!
Downloading https://github.com/shchur/gnn-benchmark/raw/master/data/npz/amazon_electronics_computers.npz
Processing...


Dataset stats comparison:
+--------------------------+----------------------------------------------------------+--------------------------------------------------------------------------------------+
|        Attribute         |                           Cora                           |                                        Amazon                                        |
+--------------------------+----------------------------------------------------------+--------------------------------------------------------------------------------------+
|   Train-test prepared?   |                           True                           |                                        False                                         |
|     Number of nodes      |                           2708                           |                                        13752                                         |
| Number of node per class | {3: 818, 4: 426, 0: 351, 2: 418, 1: 217, 5: 298, 6: 180} | {4: 5158, 8

Done!


Intuitively, the Amazon dataset will be harder for our model:
  - it has more nodes, and its larger `average node degree` means the computation graph expands faster.
  - some nodes are isolated and contribute negatively to the model.
  - class imbalance: class `4` has ~5k samples, while classes `9` and `5` have only ~300 samples each.
  - ...
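
The first point can be made concrete with a small, dependency-free sketch. A 2-layer GCN needs every node within two hops of a target node, so the number of nodes touched per prediction grows with the degree. The helpers `k_hop_size` and `ring_graph` and the toy graphs below are hypothetical and only illustrate the trend; they are not part of PyG, and the degrees 4 and 36 are merely chosen to be in the ballpark of the two datasets.

```python
from collections import deque


def k_hop_size(adjacency, source, num_hops):
    """Count nodes reachable from `source` within `num_hops` edges (BFS)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == num_hops:
            continue  # do not expand beyond the requested number of hops
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return len(seen)


def ring_graph(num_nodes, neighbors_per_side):
    """Toy undirected graph: each node links to its nearest ring neighbors."""
    offsets = [d for d in range(-neighbors_per_side, neighbors_per_side + 1) if d != 0]
    return {i: [(i + d) % num_nodes for d in offsets] for i in range(num_nodes)}


sparse = ring_graph(100, 2)   # every node has degree 4
dense = ring_graph(100, 18)   # every node has degree 36

print(k_hop_size(sparse, 0, 2))  # 9 nodes within two hops
print(k_hop_size(dense, 0, 2))   # 73 nodes within two hops
```

Real graphs are not rings, so the exact counts do not transfer, but the direction does: a higher degree means a larger two-hop neighborhood per node.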

The Cora dataset is quite clean, so we do not have to do anything to it except normalize the features. The Amazon dataset, however, is imbalanced across classes and has no predefined split, so we need to choose our own training/testing sets. Based on the statistics table, we keep the training set at 100 nodes per class when splitting.
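
To illustrate what a fixed per-class training budget does to an imbalanced label set, here is a minimal pure-Python sketch. The function `split_per_class` and the toy labels are hypothetical; the split used in this notebook is produced by PyG's `T.RandomNodeSplit`, whose internals may differ.

```python
import random
from collections import Counter


def split_per_class(labels, num_train_per_class, seed=0):
    """Pick the same number of training nodes from every class."""
    rng = random.Random(seed)
    by_class = {}
    for node, label in enumerate(labels):
        by_class.setdefault(label, []).append(node)

    train_nodes = []
    for label, nodes in sorted(by_class.items()):
        # never ask for more nodes than the class actually has
        k = min(num_train_per_class, len(nodes))
        train_nodes.extend(rng.sample(nodes, k))
    return sorted(train_nodes)


# toy imbalanced labels: class 0 dominates
labels = [0] * 50 + [1] * 8 + [2] * 5
train_nodes = split_per_class(labels, num_train_per_class=3)

print(Counter(labels))                          # imbalanced input
print(Counter(labels[i] for i in train_nodes))  # 3 training nodes per class
```

Sampling per class keeps rare classes from being drowned out during training, although the test set still reflects the original imbalance.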

In [None]:
# define transformation operator for amazon dataset
# randomly choose train/test sample
transform_op = T.Compose([
  T.NormalizeFeatures(),
  T.RandomNodeSplit(split='random', num_train_per_class=100, num_test=0.2)
])

# get the first graph from the dataset and apply the transform
amazon_graph = transform_op(amazon_ds[0])
print('Amazon graph: ', amazon_graph)

# normalize cora graph
cora_graph = T.NormalizeFeatures()(cora_ds[0])
print('Cora graph: ', cora_graph)

Amazon graph:  Data(x=[13752, 767], edge_index=[2, 491722], y=[13752], train_mask=[13752], val_mask=[13752], test_mask=[13752])
Cora graph:  Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])


Let's look at the sizes of the resulting training, testing, and validation sets:

In [None]:
table = PrettyTable()
table.field_names = ['Attribute', 'Cora', 'Amazon']
table.add_row(['training', cora_graph.train_mask.sum().item(), amazon_graph.train_mask.sum().item()])
table.add_row(['testing', cora_graph.test_mask.sum().item(), amazon_graph.test_mask.sum().item()])
table.add_row(['validation', cora_graph.val_mask.sum().item(), amazon_graph.val_mask.sum().item()])

print(table)

+------------+------+--------+
| Attribute  | Cora | Amazon |
+------------+------+--------+
|  training  | 140  |  1000  |
|  testing   | 1000 |  2750  |
| validation | 500  |  500   |
+------------+------+--------+


### Model

Now we define a simple model with two GCN layers; each layer corresponds to one iteration of neural message passing.
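
To make "one iteration of message passing" concrete, the sketch below implements the propagation rule from the GCN paper, $\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2}X$, in plain Python, leaving out the learnable weight matrix. The function `gcn_propagate` and the 3-node path graph are hypothetical illustrations, not the `GCNConv` implementation.

```python
import math


def gcn_propagate(edges, features, num_nodes):
    """One message-passing step: symmetric-normalized neighbor aggregation.

    Computes D^-1/2 (A + I) D^-1/2 X without the learnable weight matrix.
    """
    # add a self-loop so each node keeps part of its own features
    neighbors = {i: {i} for i in range(num_nodes)}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # treat the toy graph as undirected

    degree = {i: len(neighbors[i]) for i in range(num_nodes)}
    num_features = len(features[0])

    out = []
    for i in range(num_nodes):
        row = [0.0] * num_features
        for j in neighbors[i]:
            weight = 1.0 / math.sqrt(degree[i] * degree[j])
            for f in range(num_features):
                row[f] += weight * features[j][f]
        out.append(row)
    return out


# path graph 0 - 1 - 2 with one scalar feature per node; only node 0 is "lit"
edges = [(0, 1), (1, 2)]
features = [[1.0], [0.0], [0.0]]

one_hop = gcn_propagate(edges, features, num_nodes=3)
two_hop = gcn_propagate(edges, one_hop, num_nodes=3)
print(one_hop)
print(two_hop)
```

After one step, node 2 still has a zero feature because node 0 is two hops away; after the second step, some of node 0's signal has reached it. This is why a 2-layer model lets each node see its two-hop neighborhood.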

In [None]:
class GCN(torch.nn.Module):
    def __init__(self, num_features, hidden_channels, num_classes):
        super().__init__()
        torch.manual_seed(11)
        self.conv1 = GCNConv(num_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, num_classes)

    def forward(self, x, edge_idx):
        # 1st neural message passing
        x = self.conv1(x, edge_idx)
        x = x.relu()
        x = F.dropout(x, p=0.5, training=self.training)

        # 2nd neural message passing
        x = self.conv2(x, edge_idx)

        return x


In [None]:
# create 2 models for 2 datasets with the same architecture
cora_model = GCN(num_features=cora_ds.num_features, hidden_channels=16, num_classes=cora_ds.num_classes)
amazon_model = GCN(num_features=amazon_ds.num_features, hidden_channels=16, num_classes=amazon_ds.num_classes)

print('GCN cora:', cora_model)
print()
print('GCN amazon:', amazon_model)

GCN cora: GCN(
  (conv1): GCNConv(1433, 16)
  (conv2): GCNConv(16, 7)
)

GCN amazon: GCN(
  (conv1): GCNConv(767, 16)
  (conv2): GCNConv(16, 10)
)


### Training

First, we train the `GCN Cora` model. Since the Cora dataset is small, it takes less computation and time to learn.

In [None]:
optimizer = torch.optim.Adam(cora_model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()


def train_cora():
    cora_model.train()
    optimizer.zero_grad()  # clear gradients from the previous iteration

    out = cora_model(cora_graph.x, cora_graph.edge_index)  # perform a single forward pass
    # compute the loss on the training nodes only
    loss = criterion(out[cora_graph.train_mask], cora_graph.y[cora_graph.train_mask])
    loss.backward()  # compute gradients
    optimizer.step()  # update the model's parameters
    return loss

def test_cora():
    cora_model.eval()
    out = cora_model(cora_graph.x, cora_graph.edge_index)
    pred = out.argmax(dim=1)  # predict the class with the highest score (logit)
    # compare predictions against the true labels in the test set
    test_correct = (pred[cora_graph.test_mask] == cora_graph.y[cora_graph.test_mask])
    test_acc = int(test_correct.sum()) / int(cora_graph.test_mask.sum())  # ratio of correct predictions
    return test_acc*100

start_time = time.time()
for epoch in range(0, 100):
    loss = train_cora()
    if epoch % 10 == 0:
      print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

print('---')
test_acc = test_cora()
print(f'Test Accuracy for Cora model: {test_acc:.4f}')

print(f'time executed: {time.time() - start_time:.2f} secs')

Epoch: 000, Loss: 1.9467
Epoch: 010, Loss: 1.8673
Epoch: 020, Loss: 1.7243
Epoch: 030, Loss: 1.5250
Epoch: 040, Loss: 1.3139
Epoch: 050, Loss: 1.1252
Epoch: 060, Loss: 0.9324
Epoch: 070, Loss: 0.8099
Epoch: 080, Loss: 0.6694
Epoch: 090, Loss: 0.5821
---
Test Accuracy for Cora model: 82.2000
time executed: 2.02 secs


Interestingly, although the training set is quite small (only 140 labeled nodes), the model learned a useful representation of Cora and classified the 1000 test nodes with an accuracy of about 82%.
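
As a sanity check on the log above, a classifier that assigns equal probability to each of C classes has a cross-entropy of ln(C). The short calculation below is a sketch, with the class counts 7 and 10 taken from the model printouts. It shows that Cora's epoch-0 loss of 1.9467 is close to this baseline, which suggests the untrained model starts from near-uniform predictions.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy of a prediction that is uniform over `num_classes`."""
    return -math.log(1.0 / num_classes)


for name, num_classes in [("Cora", 7), ("Amazon", 10)]:
    print(f"{name}: ln({num_classes}) = {uniform_cross_entropy(num_classes):.4f}")
# Cora: ln(7) = 1.9459
# Amazon: ln(10) = 2.3026
```

A first-epoch loss far above ln(C) would hint at a problem such as badly scaled inputs or a wrong label mapping.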

Next, let's train the GCN Amazon model. We keep all hyperparameters the same but train for more iterations.

In [None]:
optimizer = torch.optim.Adam(amazon_model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()


def train_amazon():
    amazon_model.train()
    optimizer.zero_grad()  # clear gradients from the previous iteration

    out = amazon_model(amazon_graph.x, amazon_graph.edge_index)  # perform a single forward pass
    # compute the loss on the training nodes only
    loss = criterion(out[amazon_graph.train_mask], amazon_graph.y[amazon_graph.train_mask])
    loss.backward()  # compute gradients
    optimizer.step()  # update the model's parameters
    return loss

def test_amazon():
    amazon_model.eval()
    out = amazon_model(amazon_graph.x, amazon_graph.edge_index)
    pred = out.argmax(dim=1)  # predict the class with the highest score (logit)
    # compare predictions against the true labels in the test set
    test_correct = (pred[amazon_graph.test_mask] == amazon_graph.y[amazon_graph.test_mask])
    test_acc = int(test_correct.sum()) / int(amazon_graph.test_mask.sum())  # ratio of correct predictions
    return test_acc*100

start_time = time.time()
for epoch in range(0, 800):
    loss = train_amazon()
    if epoch % 50 == 0:
      print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

print('---')
test_acc = test_amazon()
print(f'Test Accuracy for Amazon model: {test_acc:.4f}')

print(f'time executed: {(time.time() - start_time)/60:.2f} mins')

Epoch: 000, Loss: 2.3024
Epoch: 050, Loss: 2.2193
Epoch: 100, Loss: 2.0553
Epoch: 150, Loss: 1.8726
Epoch: 200, Loss: 1.7015
Epoch: 250, Loss: 1.5806
Epoch: 300, Loss: 1.5130
Epoch: 350, Loss: 1.4458
Epoch: 400, Loss: 1.4036
Epoch: 450, Loss: 1.3624
Epoch: 500, Loss: 1.3403
Epoch: 550, Loss: 1.3395
Epoch: 600, Loss: 1.3072
Epoch: 650, Loss: 1.3045
Epoch: 700, Loss: 1.2805
Epoch: 750, Loss: 1.2840
---
Test Accuracy for Amazon model: 69.8182
time executed: 5.82 mins


The GCN Amazon model appears to be converging, reaching about 70% test accuracy. This toy experiment shows the power of GNN models and their ability to generalize across different datasets.

However, with a larger graph (more nodes and more edges), this experiment also shows the computational challenge of neural message passing. Imagine a social network with billions of nodes: how many hours would it take to learn a good representation?
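
A rough, back-of-the-envelope way to see the gap: in a full-batch GCN, each layer passes one message per edge, so the per-epoch work scales roughly with the edge count. The sketch below uses the node and edge counts printed earlier for the two graphs. The helper name `messages_per_epoch` is hypothetical, and the estimate ignores feature dimensions, self-loops, and hardware effects.

```python
def messages_per_epoch(num_edges, num_layers=2):
    """Rough count of edge messages in one full-batch forward pass."""
    return num_edges * num_layers


graphs = {
    "Cora": {"nodes": 2708, "edges": 10556},
    "Amazon": {"nodes": 13752, "edges": 491722},
}

for name, g in graphs.items():
    avg_degree = g["edges"] / g["nodes"]
    print(f"{name}: avg degree {avg_degree:.1f}, "
          f"messages/epoch {messages_per_epoch(g['edges'])}")

ratio = graphs["Amazon"]["edges"] / graphs["Cora"]["edges"]
print(f"Amazon passes about {ratio:.0f}x more messages per epoch than Cora")
```

The Amazon graph has only about five times as many nodes as Cora but far more edges, so under this estimate the edge count, not the node count, is the main driver of cost.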