# Notebook 1: Reproduce RGCN Model with APIs (for review)

This notebook is the first in a series that demonstrates how to use GraphStorm's APIs to build your own graph machine learning setup while leveraging GraphStorm's ease of use and scalability. All of these notebooks are designed to run in GraphStorm's Standalone mode, i.e., on a single Linux machine with CPUs and GPUs.

In this notebook, we will reproduce the GraphStorm RGCN model with the necessary APIs and use it to conduct a node classification task on the ACM dataset. Working through this notebook should familiarize users with these APIs.

### Prerequisites

- GraphStorm installed using pip. See [more details on installation of GraphStorm](https://graphstorm.readthedocs.io/en/latest/install/env-setup.html#setup-graphstorm-with-pip-packages).
- ACM data created in [Notebook 0: Data Prepare](https://graphstorm.readthedocs.io/en/latest/notebooks/Notebook_0_Data_Prepare.html) and stored in the `./acm_gs_1p/` folder.
- Supporting libraries installed, e.g., matplotlib.

### Import libraries

In [1]:
# Import supporting libraries
import matplotlib.pyplot as plt

# Set the log level to INFO (numeric value 20)
import logging
logging.basicConfig(level=logging.INFO)

# Import GraphStorm APIs
import graphstorm as gs
from graphstorm.trainer import GSgnnNodePredictionTrainer
from graphstorm.dataloading import GSgnnNodeTrainData, GSgnnNodeDataLoader, GSgnnNodeInferData
from graphstorm.model import (GSgnnNodeModel,
                              GSNodeEncoderInputLayer,
                              RelationalGCNEncoder,
                              EntityClassifier,
                              ClassifyLossFunc)
from graphstorm.inference import GSgnnNodePredictionInferrer
from graphstorm.eval import GSgnnAccEvaluator
from graphstorm.tracker import GSSageMakerTaskTracker

---

### 0. Initialize the GraphStorm Standalone Environment

In [2]:
gs.initialize(ip_config=None, backend='gloo')       # TODO: set logging level
device = gs.utils.setup_device(0)

# gs.initialize(**, device=)
# device = gs.get_device()

### 1. Setup Graph Data Information

First, let's set up the information about the ACM graph data that GraphStorm needs for model training and inference.

In [3]:
acm_graph_config = './acm_gs_1p/acm.json'
graph_name = 'acm'
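
The graph name above is typed by hand, but it could probably be read from the partition config file instead. The sketch below shows one way to do that with the standard library. It assumes the JSON has a top-level `graph_name` key, which is not confirmed by this notebook, so check your own `acm.json`. `read_graph_name` is a hypothetical helper, not a GraphStorm API, and the demo uses a stand-in file rather than the real ACM data.

```python
import json
import tempfile
from pathlib import Path


def read_graph_name(part_config_path):
    """Return the graph name stored in a partition config JSON file."""
    with open(part_config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: the name is stored under a top-level 'graph_name' key.
    return config["graph_name"]


# Demo on a stand-in file so the sketch runs without the real ACM data.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_config = Path(tmp_dir) / "acm.json"
    demo_config.write_text(json.dumps({"graph_name": "acm", "num_parts": 1}))
    demo_name = read_graph_name(demo_config)

print(demo_name)
```

With the real data, the call would be `read_graph_name(acm_graph_config)`, provided the key exists in that file.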

### 2. Setup GraphStorm Dataset and DataLoaders

In [4]:
# create a GraphStorm Dataset for the ACM graph data
train_data = GSgnnNodeTrainData(graph_name=graph_name,         # need to remove this argument and get graph name from partition json file
                                part_config=acm_graph_config,
                                train_ntypes=['paper'],            # TODO: node type/s list for training
                                eval_ntypes=['paper'],             # TODO: node type/s list for evaluation and testing
                                node_feat_field={'author':['feat'], 'paper':['feat'],'subject':['feat']}, # TODO: paste the logic of handling input here?
                                label_field='label')             # TODO: dictionary of the label field name(s)

# TODO: why not just have a GSgnnNodeData?? 
# Not do it in this release

INFO:root:part 0, train: 9999, val: 1249, test: 1249


In [5]:
# setup data loaders for training, validation, and test

# TODO: define fanout as a variable
fanout = [5,5]

# TODO: one line to build three dataloaders
train_dataloader = GSgnnNodeDataLoader(dataset=train_data,
                                       target_idx=train_data.train_idxs,
                                       fanout=fanout,                    # reuse the fanout variable defined above
                                       batch_size=64,
                                       device=device,                    # TODO: remove this argument, but get it from gs.get_device()
                                       train_task=True)
val_dataloader = GSgnnNodeDataLoader(dataset=train_data,
                                     target_idx=train_data.val_idxs,
                                     fanout=fanout,
                                     batch_size=256,
                                     device=device,
                                     train_task=False)
# test_dataloader = GSgnnNodeDataLoader(dataset=train_data,
#                                       target_idx=train_data.test_idxs,
#                                       fanout=fanout,
#                                       batch_size=256,
#                                       device=device,
#                                       train_task=False)
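
As a quick sanity check on the batch sizes, the number of mini-batches per epoch can be worked out from the split sizes reported in the log above (train: 9999, val: 1249). This is plain arithmetic, not a GraphStorm call. `num_batches` is a hypothetical helper, and it assumes the last partial batch is kept rather than dropped.

```python
import math


def num_batches(num_targets, batch_size, drop_last=False):
    """Number of mini-batches needed to cover num_targets target nodes."""
    if drop_last:
        return num_targets // batch_size
    # Assumption: the final, smaller batch is still processed.
    return math.ceil(num_targets / batch_size)


train_batches = num_batches(9999, 64)    # training split, batch_size=64
val_batches = num_batches(1249, 256)     # validation split, batch_size=256

print(train_batches, val_batches)
```

Under that assumption the training loader yields 157 batches per epoch, which is in line with the training log in this notebook, where the last batch reported for epoch 0 is batch 140 at a reporting interval of 20.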

### 3. Reproduce the GraphStorm RGCN Model for Node Classification 

Next, we use a set of GraphStorm APIs to reproduce the built-in RGCN model.

A GraphStorm model should contain the following components: 
- Input encoder for nodes (and optionally edges): process and project input features and embeddings into a certain dimension;
- GNN encoder: performs message-passing on projected node/edge inputs;
- Decoder: specific for tasks on the graph.

The following code sets up a `GSgnnNodeModel` composed of `GSNodeEncoderInputLayer`, `RelationalGCNEncoder`, and `EntityClassifier`, step by step. You can also replace individual components/layers with custom ones for development purposes.

In [6]:
# TODO: build GSgnnRGCNNodeClassificationModel() ...

# create a GraphStorm model for node tasks
model = GSgnnNodeModel(alpha_l2norm=0.)    # set an alpha_l2norm default value

# set an input layer encoder
# TODO: replace feat_size, with train_data.get_feat_size()
encoder = GSNodeEncoderInputLayer(g=train_data.g,
                                  feat_size={'author':256, 'paper':256, 'subject':256}, 
                                  embed_size=64)
model.set_node_input_encoder(encoder)

# set a GNN encoder
gnn_encoder = RelationalGCNEncoder(g=train_data.g,
                                   h_dim=64,
                                   out_dim=128,
                                   num_hidden_layers=len(fanout)-1)    # MUST be len(fanout)-1 !!
model.set_gnn_encoder(gnn_encoder)

# set a decoder specific to node-classification task
decoder = EntityClassifier(in_dim=128,
                           num_classes=14,
                           multilabel=False)
model.set_decoder(decoder)

# classification loss function
model.set_loss_func(ClassifyLossFunc(multilabel=False))

# initialize model's optimizer
model.init_optimizer(lr=0.001,                           # 1. Can we let model to init the optimizer automatically??
                     sparse_optimizer_lr=0.01,           # 2. TODO:  have default settings for the sparse lr and WD.
                     weight_decay=0)

# (Optional) uncomment to display the model architecture
# model
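
The three components only fit together if their dimensions line up. A small consistency check, sketched below, makes those constraints explicit before training starts. `check_model_dims` is a hypothetical helper, not part of GraphStorm. The rule `num_hidden_layers == len(fanout) - 1` comes from the comment in the cell above. The other two rules (input `embed_size` equals the GNN `h_dim`, and the GNN `out_dim` equals the decoder `in_dim`) are inferred from the values used in that cell and should be treated as assumptions.

```python
def check_model_dims(embed_size, h_dim, out_dim, decoder_in_dim,
                     fanout, num_hidden_layers):
    """Return a list of detected mismatches; an empty list means consistent."""
    problems = []
    # Assumption: the input encoder output feeds the GNN encoder directly.
    if embed_size != h_dim:
        problems.append(f"embed_size={embed_size} != h_dim={h_dim}")
    # Assumption: the GNN encoder output feeds the decoder directly.
    if out_dim != decoder_in_dim:
        problems.append(f"out_dim={out_dim} != decoder in_dim={decoder_in_dim}")
    # From the notebook: one sampled hop per GNN layer.
    if num_hidden_layers != len(fanout) - 1:
        problems.append(
            f"num_hidden_layers={num_hidden_layers} != len(fanout)-1={len(fanout) - 1}")
    return problems


fanout = [5, 5]
# Values used in this notebook.
ok = check_model_dims(64, 64, 128, 128, fanout, len(fanout) - 1)
# A deliberately broken configuration for comparison.
bad = check_model_dims(64, 64, 128, 64, fanout, 2)

print(ok)
print(bad)
```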

### 4. Setup GraphStorm Training Pipeline

GraphStorm uses trainers to train models such as this RGCN model. A trainer handles:
1. model training/evaluation loops
2. saving and recording the best-performing models
3. early stopping
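
To make the second and third responsibilities concrete, here is a generic, framework-independent sketch of best-model tracking with patience-based early stopping. It is not GraphStorm's implementation. `BestModelTracker`, its `patience` parameter, and the scores are all hypothetical and chosen only for illustration.

```python
class BestModelTracker:
    """Track the best validation score and signal when to stop training."""

    def __init__(self, patience):
        self.patience = patience              # evaluations allowed without improvement
        self.best_score = None
        self.best_epoch = None
        self.rounds_without_improvement = 0

    def update(self, epoch, val_score):
        """Record a score; return True if training should stop."""
        if self.best_score is None or val_score > self.best_score:
            self.best_score = val_score
            self.best_epoch = epoch           # this is where a checkpoint would be kept
            self.rounds_without_improvement = 0
        else:
            self.rounds_without_improvement += 1
        return self.rounds_without_improvement >= self.patience


# Made-up validation accuracies, one per epoch (illustration only).
illustrative_scores = [0.50, 0.60, 0.58, 0.59, 0.57]

tracker = BestModelTracker(patience=2)
stopped_at = None
for epoch, score in enumerate(illustrative_scores):
    if tracker.update(epoch, score):
        stopped_at = epoch
        break

print(tracker.best_epoch, stopped_at)
```

In this toy run the best score is seen at epoch 1, and training stops at epoch 3 after two evaluations without improvement.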

In [11]:
# create a GraphStorm node task trainer for the RGCN model
trainer = GSgnnNodePredictionTrainer(model)   # TODO: set device as argument of trainers

# setup device for the trainer
trainer.setup_device(device=device)          # TODO: remove, 

# setup evaluator for the trainer:
# evaluator = GSgnnAccEvaluator(eval_frequency=100,
#                               eval_metric=['accuracy'],
#                               multilabel=False)

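# setup a task tracker to output running information
# NOTE: the evaluator above is commented out, so `evaluator.eval_frequency` is undefined here;
# the frequency is given directly and matches the evaluator's eval_frequency of 100.
task_tracker = GSSageMakerTaskTracker(log_report_frequency=100)     # TODO: set a default task tracker.
trainer.setup_task_tracker(task_tracker)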

# trainer.setup_evaluator(evaluator)

In [12]:
# Train the model with the trainer using fit() function
trainer.fit(train_loader=train_dataloader,
            # val_loader=val_dataloader,
            # test_loader=test_dataloader,
            num_epochs=10,
            save_model_path='nc_model/')

INFO:root:Part 0 | Epoch 00000 | Batch 000 | Loss: 1.0089 | Time: 0.0309
INFO:root:Part 0 | Epoch 00000 | Batch 020 | Loss: 0.8512 | Time: 0.0320
INFO:root:Part 0 | Epoch 00000 | Batch 040 | Loss: 0.8564 | Time: 0.0336
INFO:root:Part 0 | Epoch 00000 | Batch 060 | Loss: 0.8211 | Time: 0.0325
INFO:root:Part 0 | Epoch 00000 | Batch 080 | Loss: 1.2406 | Time: 0.0344
INFO:root:Step 100 | Train loss: 0.8107
INFO:root:Part 0 | Epoch 00000 | Batch 100 | Loss: 0.9378 | Time: 0.0331
INFO:root:Part 0 | Epoch 00000 | Batch 120 | Loss: 1.1321 | Time: 0.0345
INFO:root:Part 0 | Epoch 00000 | Batch 140 | Loss: 1.3479 | Time: 0.0297
INFO:root:Epoch 0 take 5.157 seconds
INFO:root:successfully save the model to nc_model/epoch-0
INFO:root:Time on save model: 0.002 seconds
INFO:root:Part 0 | Epoch 00001 | Batch 000 | Loss: 0.7831 | Time: 0.0282
INFO:root:Part 0 | Epoch 00001 | Batch 020 | Loss: 1.0633 | Time: 0.0353
INFO:root:Part 0 | Epoch 00001 | Batch 040 | Loss: 1.1163 | Time: 0.0323
INFO:root:Step 200

### 5. Visualize Model Performance History

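Next, we examine the model performance on the validation and test sets over the course of training. This relies on an evaluator having been attached to the trainer with `trainer.setup_evaluator()`; without one, `trainer.evaluator` is `None` and the cell fails with the `AttributeError` shown after it.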

In [13]:
# extract evaluation history of metrics from the trainer's evaluator:
val_metrics, test_metrics = [], []
for val_metric, test_metric in trainer.evaluator.history:
    val_metrics.append(val_metric['accuracy'])
    test_metrics.append(test_metric['accuracy'])

# plot the performance curves
fig, ax = plt.subplots()
ax.plot(val_metrics, label='val')
ax.plot(test_metrics, label='test')
ax.set(xlabel='Epoch', ylabel='Accuracy')
ax.legend(loc='best')

AttributeError: 'NoneType' object has no attribute 'history'
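
The error above occurs because `trainer.evaluator` is `None` when no evaluator has been attached. A defensive version of the extraction step could look like the sketch below. `split_history` is a hypothetical helper, and the stand-in object and its numbers are made up to mimic the `(val_metric, test_metric)` pairs that the loop above expects. They are not real results.

```python
from types import SimpleNamespace


def split_history(evaluator, metric="accuracy"):
    """Split an evaluator's history into validation and test score lists.

    Returns two empty lists when no evaluator is attached.
    """
    if evaluator is None:
        return [], []
    val_scores, test_scores = [], []
    for val_metric, test_metric in evaluator.history:
        val_scores.append(val_metric[metric])
        test_scores.append(test_metric[metric])
    return val_scores, test_scores


# Stand-in for trainer.evaluator with placeholder numbers (illustration only).
stand_in = SimpleNamespace(history=[
    ({"accuracy": 0.1}, {"accuracy": 0.2}),
    ({"accuracy": 0.3}, {"accuracy": 0.4}),
])

print(split_history(None))
print(split_history(stand_in))
```

In the notebook, `split_history(trainer.evaluator)` would return two empty lists instead of raising, and the plotting code could be skipped when they are empty.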

### 6. Inference with the Trained Model

GraphStorm automatically saves the best-performing model under the path given in the `save_model_path` argument. We can first find out which model is the best and where it is stored.

In [14]:
# after training, the best model is saved to disk:
best_model_path = trainer.get_best_model_path()
print('Best model path:', best_model_path)

Best model path: nc_model/epoch-9


In [15]:
# we can restore the model from the saved path using the model's restore_model() function.
model.restore_model(best_model_path)

INFO:root:successfully load the model from nc_model/epoch-9
INFO:root:Time on load model: 0.004 seconds


In [20]:
# Create a dataset for inference; we reuse the same ACM graph
infer_data = GSgnnNodeInferData(graph_name=graph_name,
                                part_config=acm_graph_config,
                                eval_ntypes='paper',
                                node_feat_field={'author':['feat'], 'paper':['feat'],'subject':['feat']},
                                label_field='label')

# Setup dataloader for the inference dataset
infer_dataloader = GSgnnNodeDataLoader(dataset=infer_data,
                                       target_idx=infer_data.test_idxs,
                                       fanout=[50,50],
                                       batch_size=100,
                                       device=device,
                                       train_task=False)

# Create an Inferrer object
infer = GSgnnNodePredictionInferrer(model)

In [21]:
# Run inference on the inference dataset
infer.infer(infer_dataloader,
            save_embed_path='infer/embeddings',
            save_prediction_path='infer/predictions',
            use_mini_batch_infer=True)

INFO:root:save embeddings pf paper to infer/embeddings
INFO:root:Writing GNN embeddings to infer/embeddings in pytorch format.


In [14]:
# The GNN embeddings and predictions are saved to subfolders named after the target node type ('paper')
!ls -lh infer/embeddings/paper
!ls -lh infer/predictions/paper

total 640K
-rw-rw-r-- 1 ubuntu ubuntu 626K Jan 26 20:54 embed-00000.pt
-rw-rw-r-- 1 ubuntu ubuntu  11K Jan 26 20:54 embed_nids-00000.pt
total 84K
-rw-rw-r-- 1 ubuntu ubuntu 70K Jan 26 20:54 predict-00000.pt
-rw-rw-r-- 1 ubuntu ubuntu 11K Jan 26 20:54 predict_nids-00000.pt
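
The listing shows that each output file has a companion node-ID file with the same shard suffix (`embed-00000.pt` with `embed_nids-00000.pt`, and likewise for `predict`). The sketch below pairs them up by name using only the standard library. `pair_output_files` is a hypothetical helper, and the demo runs on empty stand-in files in a temporary folder rather than on the real outputs.

```python
import tempfile
from pathlib import Path


def pair_output_files(ntype_dir, prefix):
    """Pair '<prefix>-XXXXX.pt' files with their '<prefix>_nids-XXXXX.pt' companions."""
    ntype_dir = Path(ntype_dir)
    pairs = []
    for data_file in sorted(ntype_dir.glob(f"{prefix}-*.pt")):
        shard = data_file.name[len(prefix) + 1:]          # e.g. '00000.pt'
        nid_file = ntype_dir / f"{prefix}_nids-{shard}"
        if nid_file.exists():
            pairs.append((data_file.name, nid_file.name))
    return pairs


# Demo on empty stand-in files that mimic the listing above.
with tempfile.TemporaryDirectory() as tmp_dir:
    paper_dir = Path(tmp_dir) / "paper"
    paper_dir.mkdir()
    for name in ("embed-00000.pt", "embed_nids-00000.pt"):
        (paper_dir / name).touch()
    demo_pairs = pair_output_files(paper_dir, "embed")

print(demo_pairs)
```

On the real outputs, the call would be `pair_output_files('infer/embeddings/paper', 'embed')`. Since the log says the embeddings are written in PyTorch format, each file in a pair can presumably then be read with `torch.load`.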
