
# Graphbolt Quick Walkthrough

This tutorial gives a quick walkthrough of the operators provided by the `dgl.graphbolt` package and shows how to build a GNN datapipe with it. To learn more about stochastic training of GNNs, see the [materials](https://docs.dgl.ai/tutorials/large/index.html) provided by DGL.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dmlc/dgl/blob/master/notebooks/graphbolt/walkthrough.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/dmlc/dgl/blob/master/notebooks/graphbolt/walkthrough.ipynb)

In [None]:
# Install required packages.
import os
import torch
os.environ['TORCH'] = torch.__version__
os.environ['DGLBACKEND'] = "pytorch"

!pip install --pre dgl -f https://data.dgl.ai/wheels-test/cu118/repo.html > /dev/null

try:
    import dgl.graphbolt as gb
    installed = True
except ImportError as error:
    installed = False
    print(error)
print("DGL installed!" if installed else "DGL not found!")

## Dataset

The dataset has three primary components: (1) an item set, which is iterated over as the training target; (2) a sampling graph, which the subgraph sampling algorithm uses to generate subgraphs; and (3) a feature store, which stores node, edge, and graph features.

* The **ItemSet** is created from an iterable or a tuple of iterables.

In [None]:
node_pairs = torch.tensor(
    [[7, 0], [6, 0], [1, 3], [3, 3], [2, 4], [8, 4], [1, 4], [2, 4], [1, 5],
     [9, 6], [0, 6], [8, 6], [7, 7], [7, 7], [4, 7], [6, 8], [5, 8], [9, 9],
     [4, 9], [4, 9], [5, 9], [9, 9], [5, 9], [9, 9], [7, 9]]
)
item_set = gb.ItemSet(node_pairs, names="node_pairs")
print(item_set)

* The **SamplingGraph** is used by the subgraph sampling algorithm to generate a subgraph. Graphbolt provides a canonical implementation, `CSCSamplingGraph`, which achieves state-of-the-art time and space efficiency for CPU sampling. However, the entire `CSCSamplingGraph` must fit in CPU memory.

In [None]:
indptr = torch.tensor([0, 2, 2, 2, 4, 8, 9, 12, 15, 17, 25])
indices = torch.tensor(
    [7, 6, 1, 3, 2, 8, 1, 2, 1, 9, 0, 8, 7, 7, 4, 6, 5, 9, 4, 4, 5, 9, 5, 9, 7]
)
num_edges = 25
eid = torch.arange(num_edges)
edge_attributes = {gb.ORIGINAL_EDGE_ID: eid}
graph = gb.from_csc(indptr, indices, None, None, edge_attributes, None)
print(graph)
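
To make the roles of `indptr` and `indices` concrete, here is a minimal pure-Python sketch that does not use graphbolt. It assumes the usual CSC reading, in which the in-neighbors of node `v` are `indices[indptr[v]:indptr[v + 1]]`; the helper name `csc_to_pairs` is hypothetical. Under that assumption, expanding the two arrays reproduces the `[source, destination]` pairs used for the item set above.

```python
def csc_to_pairs(indptr, indices):
    """Expand CSC arrays into [src, dst] pairs (hypothetical helper)."""
    pairs = []
    for dst in range(len(indptr) - 1):
        # Edges whose destination is `dst` occupy this slice of `indices`.
        for edge_id in range(indptr[dst], indptr[dst + 1]):
            pairs.append([indices[edge_id], dst])
    return pairs


indptr = [0, 2, 2, 2, 4, 8, 9, 12, 15, 17, 25]
indices = [7, 6, 1, 3, 2, 8, 1, 2, 1, 9, 0, 8, 7, 7, 4, 6, 5, 9, 4, 4, 5, 9, 5, 9, 7]

pairs = csc_to_pairs(indptr, indices)
# The in-degree of node v is the width of its slice.
in_degrees = [indptr[v + 1] - indptr[v] for v in range(len(indptr) - 1)]

print(pairs[:4])   # [[7, 0], [6, 0], [1, 3], [3, 3]]
print(in_degrees)  # [2, 0, 0, 2, 4, 1, 3, 3, 2, 8]
```

The position of each entry in `indices` is also its edge ID, which is why `torch.arange(num_edges)` serves as the original edge ID attribute in the cell above.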

* The **FeatureStore** stores node, edge, and graph features. Graphbolt provides `TorchBasedFeature` and related optimizations, such as `GPUCachedFeature`, for different use cases.

In [None]:
num_nodes = 10
num_edges = 25
node_feature_data = torch.rand((num_nodes, 2))
edge_feature_data = torch.rand((num_edges, 3))
node_feature = gb.TorchBasedFeature(node_feature_data)
edge_feature = gb.TorchBasedFeature(edge_feature_data)
features = {
    ("node", None, "feat") : node_feature,
    ("edge", None, "feat") : edge_feature,
}
feature_store = gb.BasicFeatureStore(features)
print(feature_store)
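
The keying scheme above, `(domain, type, feature name)`, can be illustrated with a small dict-backed stand-in. This is only a sketch: `ToyFeatureStore` and its `read` method are hypothetical names chosen for illustration, not graphbolt classes, and plain lists replace tensors.

```python
import random


class ToyFeatureStore:
    """Dict-backed stand-in for a feature store (hypothetical, not graphbolt)."""

    def __init__(self, features):
        # Maps (domain, type_name, feature_name) -> list of per-item rows.
        self._features = features

    def read(self, domain, type_name, feature_name, ids=None):
        rows = self._features[(domain, type_name, feature_name)]
        # Without ids, return every row; otherwise gather the requested rows.
        return list(rows) if ids is None else [rows[i] for i in ids]


rng = random.Random(0)
num_nodes, num_edges = 10, 25
store = ToyFeatureStore({
    ("node", None, "feat"): [[rng.random() for _ in range(2)] for _ in range(num_nodes)],
    ("edge", None, "feat"): [[rng.random() for _ in range(3)] for _ in range(num_edges)],
})

# Gather the 2-dimensional features of nodes 7, 0, and 7 again.
node_rows = store.read("node", None, "feat", ids=[7, 0, 7])
print(len(node_rows), len(node_rows[0]))  # 3 2
```

Looking features up by ID is what lets a later pipeline stage fetch only the rows a sampled subgraph needs instead of the full feature matrix.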

## DataPipe

The DataPipe in Graphbolt is an extension of the PyTorch DataPipe, designed specifically to address the challenges of training graph neural networks (GNNs). Each stage of the data pipeline loads data from a different source and can be combined with other stages to build more complex pipelines. Intermediate data is passed between stages in **MiniBatch** objects.

* **ItemSampler** iterates over the input **ItemSet** and creates subsets (mini-batches) of items.

In [None]:
datapipe = gb.ItemSampler(item_set, batch_size=3, shuffle=False)
print(next(iter(datapipe)))
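
The batching step can be sketched in plain Python. The helper `batch_items` below is hypothetical and only mirrors the `batch_size=3, shuffle=False` setting used above, assuming items are sliced in their original order; it does not reproduce the **MiniBatch** structure that graphbolt returns.

```python
def batch_items(items, batch_size, drop_last=False):
    """Yield consecutive slices of `items` (hypothetical helper, no shuffling)."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            return  # Skip the final, incomplete batch.
        yield batch


node_pairs = [
    [7, 0], [6, 0], [1, 3], [3, 3], [2, 4], [8, 4], [1, 4], [2, 4], [1, 5],
    [9, 6], [0, 6], [8, 6], [7, 7], [7, 7], [4, 7], [6, 8], [5, 8], [9, 9],
    [4, 9], [4, 9], [5, 9], [9, 9], [5, 9], [9, 9], [7, 9],
]

batches = list(batch_items(node_pairs, batch_size=3))
print(batches[0])    # [[7, 0], [6, 0], [1, 3]]
print(len(batches))  # 9
print(batches[-1])   # [[7, 9]]
```

With 25 items and a batch size of 3, the last batch holds a single pair unless incomplete batches are dropped.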

* **NegativeSampler** generates negative samples and returns a mix of positive and negative samples.

In [None]:
datapipe = datapipe.sample_uniform_negative(graph, 1)
print(next(iter(datapipe)))
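
As a rough illustration of uniform negative sampling, the sketch below keeps each positive pair's source node and draws a destination uniformly at random, once per positive pair to match the ratio of 1 used above. This is one common scheme and an assumption here, not a description of graphbolt's internals; the helper `add_uniform_negatives` is hypothetical, and sampled negatives are not checked against real edges.

```python
import random


def add_uniform_negatives(pos_pairs, num_nodes, negative_ratio, rng):
    """Append `negative_ratio` random-destination pairs per positive pair.

    Hypothetical helper: negatives are not filtered against real edges.
    """
    pairs = [list(pair) for pair in pos_pairs]
    labels = [1] * len(pos_pairs)
    for src, _dst in pos_pairs:
        for _ in range(negative_ratio):
            # Keep the source node and draw the destination uniformly at random.
            pairs.append([src, rng.randrange(num_nodes)])
            labels.append(0)
    return pairs, labels


positive_batch = [[7, 0], [6, 0], [1, 3]]  # the first three pairs of the item set
pairs, labels = add_uniform_negatives(
    positive_batch, num_nodes=10, negative_ratio=1, rng=random.Random(0)
)
print(labels)  # [1, 1, 1, 0, 0, 0]
print(pairs)   # three positive pairs followed by three sampled negatives
```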

* **SubgraphSampler** samples a subgraph around a given set of seed nodes from the larger graph.

In [None]:
fanouts = torch.tensor([1])
datapipe = datapipe.sample_neighbor(graph, [fanouts])
print(next(iter(datapipe)))
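
The idea behind fanout-limited neighbor sampling can be sketched on the CSC arrays directly. The helper `sample_in_neighbors` is hypothetical, and it assumes sampling without replacement over in-edges for a single layer; graphbolt's own sampler may differ in these details.

```python
import random


def sample_in_neighbors(indptr, indices, seeds, fanout, rng):
    """Sample up to `fanout` in-edges per seed node from CSC arrays.

    Hypothetical helper; returns (src, dst, edge_id) triples.
    """
    sampled = []
    for dst in seeds:
        edge_ids = list(range(indptr[dst], indptr[dst + 1]))
        # Keep every edge when the node has no more than `fanout` in-edges.
        if len(edge_ids) > fanout:
            edge_ids = rng.sample(edge_ids, fanout)
        sampled.extend((indices[e], dst, e) for e in edge_ids)
    return sampled


indptr = [0, 2, 2, 2, 4, 8, 9, 12, 15, 17, 25]
indices = [7, 6, 1, 3, 2, 8, 1, 2, 1, 9, 0, 8, 7, 7, 4, 6, 5, 9, 4, 4, 5, 9, 5, 9, 7]

# Node 0 has two in-edges, node 1 has none, and node 9 has eight.
edges = sample_in_neighbors(
    indptr, indices, seeds=[0, 1, 9], fanout=1, rng=random.Random(0)
)
for src, dst, edge_id in edges:
    print(f"{src} -> {dst} (edge {edge_id})")
```

With a fanout of 1, each seed contributes at most one edge, so the sampled subgraph stays small even for high-degree nodes such as node 9.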

* **FeatureFetcher** fetches node and edge features from the feature store.

In [None]:
datapipe = datapipe.fetch_feature(feature_store, node_feature_keys=["feat"], edge_feature_keys=["feat"])
print(next(iter(datapipe)))

After retrieving the required data, Graphbolt provides helper methods to convert it to the output format needed for subsequent GNN training.

* Convert to **DGLMiniBatch** format for training with DGL.

In [None]:
datapipe = datapipe.to_dgl()
print(next(iter(datapipe)))

* Copy the data to the GPU for GPU training. This step requires a CUDA-enabled runtime.

In [None]:
datapipe = datapipe.copy_to(device="cuda")
print(next(iter(datapipe)))

## Exercise: Node classification

Similarly, the following dataset is created for node classification. Can you implement the data pipeline for it?

In [None]:
# Dataset for node classification.
num_nodes = 10
nodes = torch.arange(num_nodes)
labels = torch.tensor([1, 2, 0, 2, 2, 0, 2, 2, 2, 2])
item_set = gb.ItemSet((nodes, labels), names=("seed_nodes", "labels"))

indptr = torch.tensor([0, 2, 2, 2, 4, 8, 9, 12, 15, 17, 25])
indices = torch.tensor(
    [7, 6, 1, 3, 2, 8, 1, 2, 1, 9, 0, 8, 7, 7, 4, 6, 5, 9, 4, 4, 5, 9, 5, 9, 7]
)
eid = torch.arange(25)  # One original edge ID per edge, 0 through 24.
edge_attributes = {gb.ORIGINAL_EDGE_ID: eid}
graph = gb.from_csc(indptr, indices, None, None, edge_attributes, None)

num_nodes = 10
num_edges = 25
node_feature_data = torch.rand((num_nodes, 2))
edge_feature_data = torch.rand((num_edges, 3))
node_feature = gb.TorchBasedFeature(node_feature_data)
edge_feature = gb.TorchBasedFeature(edge_feature_data)
features = {
    ("node", None, "feat") : node_feature,
    ("edge", None, "feat") : edge_feature,
}
feature_store = gb.BasicFeatureStore(features)

# Datapipe.
...
print(next(iter(datapipe)))