## Node Classification with DGL 

This tutorial was written with reference to the WSDM21 Conference. It covers how to perform node classification with `DGL`.

In [92]:
import dgl 
import torch
import torch.nn as nn 
import torch.optim as optim 
import torch.nn.functional as F 

In [93]:
# Perform semi-supervised node classification on the Cora dataset.
# https://arxiv.org/abs/1609.02907 : the paper that first proposed GCN. For more on GCN, see https://ok-lab.tistory.com/205?category=940094
import dgl.data 

dataset = dgl.data.CoraGraphDataset()
print('Number of Categories:', dataset.num_classes)

  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done loading data from cached files.
Number of Categories: 7


In [94]:
g = dataset[0]

print('Number of nodes:', g.num_nodes())
print('Number of edges:', g.num_edges())

Number of nodes: 2708
Number of edges: 10556


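A `DGLGraph` loaded from the dataset stores node features in `ndata` and edge features in `edata`. The graph used here carries the following node features:

- `train_mask`: a boolean value per node indicating whether it belongs to the training set.
- `val_mask`: a boolean value per node indicating whether it belongs to the validation set.
- `test_mask`: a boolean value per node indicating whether it belongs to the test set.
- `label`: the ground-truth category of each node.
- `feat`: the node features.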
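
To illustrate how the masks are meant to be used, here is a minimal plain-Python sketch of mask-based selection. The six-node label list, the mask values, and the helper name `select` are made up for illustration and are not taken from Cora; with torch tensors the same selection is written as `labels[mask]`.

```python
# Toy stand-ins for g.ndata['label'] and the three boolean masks (made-up values).
labels     = [3, 0, 5, 1, 3, 2]
train_mask = [True, True, False, False, False, False]
val_mask   = [False, False, True, False, False, False]
test_mask  = [False, False, False, True, True, False]

def select(values, mask):
    """Keep the entries of `values` whose mask entry is True."""
    return [v for v, keep in zip(values, mask) if keep]

train_labels = select(labels, train_mask)  # the only labels the loss may see
test_labels = select(labels, test_mask)    # held out for the final evaluation

# Nodes outside every mask (node 5 here) still take part in message passing,
# but their labels are never used.
unused = [i for i in range(len(labels))
          if not (train_mask[i] or val_mask[i] or test_mask[i])]

print(train_labels, test_labels, unused)
```

With the split sizes printed when the dataset was loaded (140, 500, and 1000 of 2708 nodes), at least 1068 nodes belong to no mask, which is what makes the task semi-supervised.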

In [95]:
print('Node feature names:', g.ndata.keys())
print('Edge feature names:', g.edata.keys())

Node feature names: dict_keys(['feat', 'label', 'test_mask', 'train_mask', 'val_mask'])
Edge feature names: dict_keys(['__orig__'])


In [96]:
print('Number of training nodes:', g.ndata['train_mask'].int().sum().item())
print('Number of validating nodes:', g.ndata['val_mask'].int().sum().item())
print('Number of testing nodes:', g.ndata['test_mask'].int().sum().item())

print('Number of classes:', (g.ndata['label'].max()+1).item())
print('Node feature shape:', g.ndata['feat'].shape)

Number of training nodes: 140
Number of validating nodes: 500
Number of testing nodes: 1000
Number of classes: 7
Node feature shape: torch.Size([2708, 1433])


In [97]:
from dgl.nn import GraphConv 

class GCN(nn.Module):
    """Two-layer graph convolutional network."""
    def __init__(self, in_feats, n_hidden, n_classes):
        super().__init__()
        # norm='both' applies symmetric degree normalization to the messages
        self.conv1 = GraphConv(in_feats, n_hidden, norm='both')
        self.conv2 = GraphConv(n_hidden, n_classes, norm='both')
        self.relu = nn.ReLU()

    def forward(self, g, in_feat):
        output = self.conv1(g, in_feat)
        output = self.relu(output)
        output = self.conv2(g, output)  # raw class logits, no softmax
        return output

# input feature size -> 16 hidden units -> one logit per class
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)
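
To make the `'both'` argument passed to `GraphConv` more concrete, the following is a minimal plain-Python sketch of symmetric degree normalization, as far as the DGL `GraphConv` documentation describes it: each message sent from node j to node i is divided by sqrt(out_deg(j) * in_deg(i)). The toy graph, the feature values, and the helper name `normalized_sum` are assumptions for illustration. The learned weight matrix, the bias, and self-loops are deliberately left out.

```python
import math

# Undirected toy graph 1 - 0 - 2, stored as directed edges in both directions.
edges = [(0, 1), (1, 0), (0, 2), (2, 0)]
num_nodes = 3
feats = [[1.0], [2.0], [4.0]]  # one scalar feature per node (made-up values)

out_deg = [0] * num_nodes
in_deg = [0] * num_nodes
for src, dst in edges:
    out_deg[src] += 1
    in_deg[dst] += 1

def normalized_sum(edges, feats):
    """Sum neighbor features, each scaled by 1 / sqrt(out_deg(src) * in_deg(dst))."""
    out = [[0.0] * len(feats[0]) for _ in range(num_nodes)]
    for src, dst in edges:
        # Clamp degrees to at least 1 so isolated nodes never divide by zero.
        c = math.sqrt(max(out_deg[src], 1) * max(in_deg[dst], 1))
        for k, x in enumerate(feats[src]):
            out[dst][k] += x / c
    return out

h = normalized_sum(edges, feats)
print(h)
```

Node 0 receives (2 + 4) / sqrt(2) from its two neighbors, while nodes 1 and 2 each receive 1 / sqrt(2) from node 0. A real layer would then multiply the result by a weight matrix and add a bias.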

## Training the GCN 
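
The training loop in this section reports the test accuracy from the epoch with the highest validation accuracy so far, not the highest test accuracy seen. The sketch below isolates that selection rule in plain Python. The accuracy pairs in `history` are hypothetical values chosen for illustration, not results from this tutorial.

```python
# Hypothetical (val_acc, test_acc) pairs, one per epoch.
history = [(0.60, 0.62), (0.66, 0.65), (0.64, 0.70), (0.66, 0.71), (0.70, 0.69)]

best_val_acc = 0.0
best_test_acc = 0.0
best_epoch = 0
for epoch, (val_acc, test_acc) in enumerate(history, start=1):
    # Strict '<' keeps the earliest epoch when validation accuracy ties.
    if best_val_acc < val_acc:
        best_val_acc = val_acc
        best_test_acc = test_acc
        best_epoch = epoch

print(best_epoch, best_val_acc, best_test_acc)
```

Epoch 4 has the highest test accuracy (0.71), yet the rule reports epoch 5 because only validation accuracy drives the choice. This is why the "best test acc" in the training log can be lower than the test accuracy printed for the current epoch.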

In [98]:
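def criterion(pred_y, true_y, mask):
    """Cross-entropy loss over the nodes selected by `mask`."""
    pred_y = pred_y[mask]
    true_y = true_y[mask]
    return F.cross_entropy(pred_y, true_y)

def accuracy(pred_y, true_y, mask):
    """Fraction of correctly predicted class ids among the nodes selected by `mask`."""
    pred_y = pred_y[mask]
    true_y = true_y[mask]
    return (pred_y == true_y).float().mean()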

def trainer(g, model, n_epoch, device):
    optimizer = optim.Adam(model.parameters())  # default learning rate (1e-3)

    best_val_acc = 0
    best_test_acc = 0
    # Moving the graph also moves its node features, labels, and masks.
    g = g.to(device)
    features = g.ndata['feat']
    labels = g.ndata['label']
    train_mask = g.ndata['train_mask']
    val_mask = g.ndata['val_mask']
    test_mask = g.ndata['test_mask']

    for epoch in range(1, n_epoch+1):
        # Full-graph forward pass: one row of logits per node.
        logits = model(g, features)
        # Only the training nodes contribute to the loss.
        loss = criterion(logits, labels, train_mask)
        pred_y = logits.argmax(1)
        train_acc = accuracy(pred_y, labels, train_mask)
        val_acc = accuracy(pred_y, labels, val_mask)
        test_acc = accuracy(pred_y, labels, test_mask)

        # Keep the test accuracy from the epoch with the best validation accuracy.
        if best_val_acc < val_acc:
            best_val_acc = val_acc
            best_test_acc = test_acc

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if epoch % 50 == 0:
            print(f'In epoch [{epoch}/{n_epoch}]\t loss: {loss:.3f}\t train_acc: {train_acc*100:.2f}%\t val_acc: {val_acc*100:.2f}%\t test_acc: {test_acc*100:.2f}%')
            print(f'best val acc: {best_val_acc*100:.2f}%\t best test acc: {best_test_acc*100:.2f}%\n')


device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes).to(device)

trainer(g, model, 1000, device)

In epoch [50/1000]	 loss: 1.869	 train_acc: 88.57%	 val_acc: 65.60%	 test_acc: 69.50%
best val acc: 65.60%	 best test acc: 68.90%

In epoch [100/1000]	 loss: 1.744	 train_acc: 92.86%	 val_acc: 68.00%	 test_acc: 71.00%
best val acc: 68.60%	 best test acc: 71.00%

In epoch [150/1000]	 loss: 1.577	 train_acc: 96.43%	 val_acc: 70.40%	 test_acc: 73.40%
best val acc: 70.40%	 best test acc: 73.40%

In epoch [200/1000]	 loss: 1.381	 train_acc: 97.14%	 val_acc: 73.00%	 test_acc: 73.90%
best val acc: 73.00%	 best test acc: 73.70%

In epoch [250/1000]	 loss: 1.175	 train_acc: 97.86%	 val_acc: 74.40%	 test_acc: 75.10%
best val acc: 74.40%	 best test acc: 74.90%

In epoch [300/1000]	 loss: 0.978	 train_acc: 97.86%	 val_acc: 75.40%	 test_acc: 75.30%
best val acc: 75.40%	 best test acc: 75.20%

In epoch [350/1000]	 loss: 0.803	 train_acc: 100.00%	 val_acc: 77.00%	 test_acc: 75.70%
best val acc: 77.00%	 best test acc: 75.80%

In epoch [400/1000]	 loss: 0.655	 train_acc: 100.00%	 val_acc: 76.80%	 test_