Merge branch 'master' into tgl-fix

dmlc · Dec 4, 2023 · a60aa22 · a60aa22
2 parents 538386b + 15d05be
commit a60aa22
Show file tree

Hide file tree

Showing 60 changed files with 568 additions and 612 deletions.
diff --git a/docs/source/api/python/dgl.graphbolt.rst b/docs/source/api/python/dgl.graphbolt.rst
@@ -15,6 +15,7 @@ APIs
     :nosignatures:
     :template: graphbolt_classtemplate.rst
 
+    DataLoader
     Dataset
     Task
     ItemSet
@@ -35,17 +36,6 @@ APIs
     CopyTo
 
 
-DataLoaders
------------
-
-.. autosummary::
-    :toctree: ../../generated/
-    :nosignatures:
-    :template: graphbolt_classtemplate.rst
-
-    SingleProcessDataLoader
-    MultiProcessDataLoader
-
 Standard Implementations
 -------------------------
 

diff --git a/docs/source/graphbolt/index.rst b/docs/source/graphbolt/index.rst
@@ -8,4 +8,5 @@ framework for GNN with high flexibility and scalability.
   :maxdepth: 2
   :titlesonly:
 
+  walkthrough.nblink
   ondisk-dataset
diff --git a/docs/source/graphbolt/walkthrough.nblink b/docs/source/graphbolt/walkthrough.nblink
@@ -0,0 +1,3 @@
+{
+    "path": "../../../notebooks/graphbolt/walkthrough.ipynb"
+}
diff --git a/docs/source/guide/minibatch-custom-sampler.rst b/docs/source/guide/minibatch-custom-sampler.rst
@@ -3,143 +3,133 @@
 6.4 Implementing Custom Graph Samplers
 ----------------------------------------------
 
-Implementing custom samplers involves subclassing the :class:`dgl.dataloading.Sampler`
-base class and implementing its abstract :attr:`sample` method.  The :attr:`sample`
-method should take in two arguments:
+Implementing custom samplers involves subclassing the
+:class:`dgl.graphbolt.SubgraphSampler` base class and implementing its abstract
+:attr:`_sample_subgraphs` method. The :attr:`_sample_subgraphs` method should
+take in seed nodes which are the nodes to sample neighbors from:
 
 .. code:: python
 
-   def sample(self, g, indices):
-       pass
+    def _sample_subgraphs(self, seed_nodes):
+        return input_nodes, sampled_subgraphs
 
-The first argument :attr:`g` is the original graph to sample from while
-the second argument :attr:`indices` is the indices of the current mini-batch
--- it generally could be anything depending on what indices are given to the
-accompanied :class:`~dgl.dataloading.DataLoader` but are typically seed node
-or seed edge IDs. The function returns the mini-batch of samples for
-the current iteration.
+The method should return the input node IDs list and a list of subgraphs. Each
+subgraph is a :class:`~dgl.graphbolt.SampledSubgraph` object.
 
-.. note::
-
-    The design here is similar to PyTorch's ``torch.utils.data.DataLoader``,
-    which is an iterator of dataset. Users can customize how to batch samples
-    using its ``collate_fn`` argument. Here in DGL, ``dgl.dataloading.DataLoader``
-    is an iterator of ``indices`` (e.g., training node IDs) while ``Sampler``
-    converts a batch of indices into a batch of graph- or tensor-type samples.
 
+Any other data that are required during sampling such as the graph structure,
+fanout size, etc. should be passed to the sampler via the constructor.
 
 The code below implements a classical neighbor sampler:
 
 .. code:: python
 
-   class NeighborSampler(dgl.dataloading.Sampler):
-       def __init__(self, fanouts : list[int]):
-           super().__init__()
+    @functional_datapipe("customized_sample_neighbor")
+    class CustomizedNeighborSampler(dgl.graphbolt.SubgraphSampler):
+       def __init__(self, datapipe, graph, fanouts):
+           super().__init__(datapipe)
+           self.graph = graph
            self.fanouts = fanouts
 
-       def sample(self, g, seed_nodes):
-           output_nodes = seed_nodes
+       def _sample_subgraphs(self, seed_nodes):
            subgs = []
            for fanout in reversed(self.fanouts):
                # Sample a fixed number of neighbors of the current seed nodes.
-               sg = g.sample_neighbors(seed_nodes, fanout)
-               # Convert this subgraph to a message flow graph.
-               sg = dgl.to_block(sg, seed_nodes)
-               seed_nodes = sg.srcdata[NID]
+               input_nodes, sg = g.sample_neighbors(seed_nodes, fanout)
                subgs.insert(0, sg)
-               input_nodes = seed_nodes
-           return input_nodes, output_nodes, subgs
+               seed_nodes = input_nodes
+           return input_nodes, subgs
 
-To use this sampler with ``DataLoader``:
+To use this sampler with :class:`~dgl.graphbolt.DataLoader`:
 
 .. code:: python
 
-    graph = ...  # the graph to be sampled from
-    train_nids = ...  # an 1-D tensor of training node IDs
-    sampler = NeighborSampler([10, 15])  # create a sampler
-    dataloader = dgl.dataloading.DataLoader(
-        graph,
-        train_nids,
-        sampler,
-        batch_size=32,    # batch_size decides how many IDs are passed to sampler at once
-        ...               # other arguments
-    )
-    for i, mini_batch in enumerate(dataloader):
-        # unpack the mini batch
-        input_nodes, output_nodes, subgs = mini_batch
-        train(input_nodes, output_nodes, subgs)
+    datapipe = gb.ItemSampler(train_set, batch_size=1024, shuffle=True)
+    datapipe = datapipe.customized_sample_neighbor(g, [10, 10]) # 2 layers.
+    datapipe = datapipe.fetch_feature(feature, node_feature_keys=["feat"])
+    datapipe = datapipe.to_dgl()
+    datapipe = datapipe.copy_to(device)
+    dataloader = gb.DataLoader(datapipe, num_workers=0)
+
+    for data in dataloader:
+        input_features = data.node_features["feat"]
+        output_labels = data.labels
+        output_predictions = model(data.blocks, input_features)
+        loss = compute_loss(output_labels, output_predictions)
+        opt.zero_grad()
+        loss.backward()
+        opt.step()
+
 
 Sampler for Heterogeneous Graphs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 To write a sampler for heterogeneous graphs, one needs to be aware that
-the argument ``g`` will be a heterogeneous graph while ``indices`` could be a
+the argument `graph` is a heterogeneous graph while `seeds` could be a
 dictionary of ID tensors. Most of DGL's graph sampling operators (e.g.,
 the ``sample_neighbors`` and ``to_block`` functions in the above example) can
 work on heterogeneous graph natively, so many samplers are automatically
-ready for heterogeneous graph. For example, the above ``NeighborSampler``
+ready for heterogeneous graph. For example, the above ``CustomizedNeighborSampler``
 can be used on heterogeneous graphs:
 
 .. code:: python
 
-    hg = dgl.heterograph({
-        ('user', 'like', 'movie') : ...,
-        ('user', 'follow', 'user') : ...,
-        ('movie', 'liked-by', 'user') : ...,
-    })
-    train_nids = {'user' : ..., 'movie' : ...}  # training IDs of 'user' and 'movie' nodes
-    sampler = NeighborSampler([10, 15])  # create a sampler
-    dataloader = dgl.dataloading.DataLoader(
-        hg,
-        train_nids,
-        sampler,
-        batch_size=32,    # batch_size decides how many IDs are passed to sampler at once
-        ...               # other arguments
+    import dgl.graphbolt as gb
+    hg = gb.FusedCSCSamplingGraph()
+    train_set = item_set = gb.ItemSetDict(
+        {
+            "user": gb.ItemSet(
+                (torch.arange(0, 5), torch.arange(5, 10)),
+                names=("seed_nodes", "labels"),
+            ),
+            "item": gb.ItemSet(
+                (torch.arange(5, 10), torch.arange(10, 15)),
+                names=("seed_nodes", "labels"),
+            ),
+        }
+    )
+    datapipe = gb.ItemSampler(train_set, batch_size=1024, shuffle=True)
+    datapipe = datapipe.customized_sample_neighbor(g, [10, 10]) # 2 layers.
+    datapipe = datapipe.fetch_feature(
+        feature, node_feature_keys={"user": ["feat"], "item": ["feat"]}
     )
-    for i, mini_batch in enumerate(dataloader):
-        # unpack the mini batch
-        # input_nodes and output_nodes are dictionary while subgs are a list of
-        # heterogeneous graphs
-        input_nodes, output_nodes, subgs = mini_batch
-        train(input_nodes, output_nodes, subgs)
-
-Exclude Edges During Sampling
+    datapipe = datapipe.to_dgl()
+    datapipe = datapipe.copy_to(device)
+    dataloader = gb.DataLoader(datapipe, num_workers=0)
+
+    for data in dataloader:
+        input_features = {
+            ntype: data.node_features[(ntype, "feat")]
+            for ntype in data.blocks[0].srctypes
+        }
+        output_labels = data.labels["user"]
+        output_predictions = model(data.blocks, input_features)["user"]
+        loss = compute_loss(output_labels, output_predictions)
+        opt.zero_grad()
+        loss.backward()
+        opt.step()
+
+
+Exclude Edges After Sampling
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The examples above all belong to *node-wise sampler* because the ``indices`` argument
-to the ``sample`` method represents a batch of seed node IDs. Another common type of
-samplers is *edge-wise sampler* which, as name suggested, takes in a batch of seed
-edge IDs to construct mini-batch data. DGL provides a utility
-:func:`dgl.dataloading.as_edge_prediction_sampler` to turn a node-wise sampler to
-an edge-wise sampler. To prevent information leakge, it requires the node-wise sampler
-to have an additional third argument ``exclude_eids``. The code below modifies
-the ``NeighborSampler`` we just defined to properly exclude edges from the sampled
-subgraph:
+In some cases, we may want to exclude seed edges from the sampled subgraph. For
+example, in link prediction tasks, we want to exclude the edges in the
+training set from the sampled subgraph to prevent information leakage. To
+do so, we need to add an additional datapipe right after sampling as follows:
 
 .. code:: python
 
-   class NeighborSampler(Sampler):
-       def __init__(self, fanouts):
-           super().__init__()
-           self.fanouts = fanouts
+    datapipe = datapipe.customized_sample_neighbor(g, [10, 10]) # 2 layers.
+    datapipe = datapipe.transform(gb.exclude_seed_edges)
 
-       # NOTE: There is an additional third argument. For homogeneous graphs,
-       #   it is an 1-D tensor of integer IDs. For heterogeneous graphs, it
-       #   is a dictionary of ID tensors. We usually set its default value to be None.
-       def sample(self, g, seed_nodes, exclude_eids=None):
-           output_nodes = seed_nodes
-           subgs = []
-           for fanout in reversed(self.fanouts):
-               # Sample a fixed number of neighbors of the current seed nodes.
-               sg = g.sample_neighbors(seed_nodes, fanout, exclude_edges=exclude_eids)
-               # Convert this subgraph to a message flow graph.
-               sg = dgl.to_block(sg, seed_nodes)
-               seed_nodes = sg.srcdata[NID]
-               subgs.insert(0, sg)
-               input_nodes = seed_nodes
-           return input_nodes, output_nodes, subgs
+Please check the API page of :func:`~dgl.graphbolt.exclude_seed_edges` for more
+details.
+
+The above API is based on :meth:`~dgl.graphbolt.SampledSubgrahp.exclude_edges`.
+If you want to exclude edges from the sampled subgraph based on some other
+criteria, you could write your own transform function. Please check the method
+for reference.
 
-Further Readings
-~~~~~~~~~~~~~~~~~~
-See :ref:`guide-minibatch-prefetching` for how to write a custom graph sampler
-with feature prefetching.
+You could also refer to examples in
+`Link Prediction <https://github.com/dmlc/dgl/blob/master/examples/sampling/graphbolt/link_prediction.py>`__.
diff --git a/docs/source/guide/minibatch-edge.rst b/docs/source/guide/minibatch-edge.rst
@@ -40,7 +40,7 @@ edges(namely, node pairs) in the training set instead of the nodes.
     datapipe = datapipe.fetch_feature(feature, node_feature_keys=["feat"])
     datapipe = datapipe.to_dgl()
     datapipe = datapipe.copy_to(device)
-    dataloader = gb.MultiProcessDataLoader(datapipe, num_workers=0)
+    dataloader = gb.DataLoader(datapipe, num_workers=0)
 
 Iterating over the DataLoader will yield :class:`~dgl.graphbolt.DGLMiniBatch`
 which contains a list of specially created graphs representing the computation
@@ -93,7 +93,7 @@ You can use :func:`~dgl.graphbolt.exclude_seed_edges` alongside with
     datapipe = datapipe.fetch_feature(feature, node_feature_keys=["feat"])
     datapipe = datapipe.to_dgl()
     datapipe = datapipe.copy_to(device)
-    dataloader = gb.MultiProcessDataLoader(datapipe, num_workers=0)
+    dataloader = gb.DataLoader(datapipe, num_workers=0)
     
 
 Adapt your model for minibatch training
@@ -275,7 +275,7 @@ only difference is that the train_set is now an instance of
     )
     datapipe = datapipe.to_dgl()
     datapipe = datapipe.copy_to(device)
-    dataloader = gb.MultiProcessDataLoader(datapipe, num_workers=0)
+    dataloader = gb.DataLoader(datapipe, num_workers=0)
 
 Things become a little different if you wish to exclude the reverse
 edges on heterogeneous graphs. On heterogeneous graphs, reverse edges

diff --git a/docs/source/guide/minibatch-gpu-sampling.rst b/docs/source/guide/minibatch-gpu-sampling.rst
@@ -3,6 +3,10 @@
 6.7 Using GPU for Neighborhood Sampling
 ---------------------------------------
 
+.. note::
+  GraphBolt does not support GPU-based neighborhood sampling yet. So this guide is
+  utilizing :class:`~dgl.dataloading.DataLoader` for illustration.
+
 DGL since 0.7 has been supporting GPU-based neighborhood sampling, which has a significant
 speed advantage over CPU-based neighborhood sampling.  If you estimate that your graph 
 can fit onto GPU and your model does not take a lot of GPU memory, then it is best to