Commit

Merge branch 'master' into tgl-fix
amarzullo24 committed Dec 4, 2023
2 parents 538386b + 15d05be commit a60aa22
Showing 60 changed files with 568 additions and 612 deletions.
12 changes: 1 addition & 11 deletions docs/source/api/python/dgl.graphbolt.rst
@@ -15,6 +15,7 @@ APIs
:nosignatures:
:template: graphbolt_classtemplate.rst

DataLoader
Dataset
Task
ItemSet
@@ -35,17 +36,6 @@ APIs
CopyTo


DataLoaders
-----------

.. autosummary::
:toctree: ../../generated/
:nosignatures:
:template: graphbolt_classtemplate.rst

SingleProcessDataLoader
MultiProcessDataLoader

Standard Implementations
-------------------------

1 change: 1 addition & 0 deletions docs/source/graphbolt/index.rst
@@ -8,4 +8,5 @@ framework for GNN with high flexibility and scalability.
:maxdepth: 2
:titlesonly:

walkthrough.nblink
ondisk-dataset
3 changes: 3 additions & 0 deletions docs/source/graphbolt/walkthrough.nblink
@@ -0,0 +1,3 @@
{
"path": "../../../notebooks/graphbolt/walkthrough.ipynb"
}
188 changes: 89 additions & 99 deletions docs/source/guide/minibatch-custom-sampler.rst
@@ -3,143 +3,133 @@
6.4 Implementing Custom Graph Samplers
----------------------------------------------

Implementing custom samplers involves subclassing the :class:`dgl.dataloading.Sampler`
base class and implementing its abstract :attr:`sample` method. The :attr:`sample`
method should take in two arguments:
Implementing custom samplers involves subclassing the
:class:`dgl.graphbolt.SubgraphSampler` base class and implementing its abstract
:attr:`_sample_subgraphs` method. The :attr:`_sample_subgraphs` method should
take in seed nodes which are the nodes to sample neighbors from:

.. code:: python
def sample(self, g, indices):
pass
def _sample_subgraphs(self, seed_nodes):
return input_nodes, sampled_subgraphs
The first argument :attr:`g` is the original graph to sample from, while
the second argument :attr:`indices` holds the indices of the current mini-batch.
These can be anything, depending on what indices are given to the
accompanying :class:`~dgl.dataloading.DataLoader`, but are typically seed node
or seed edge IDs. The function returns the mini-batch of samples for
the current iteration.
The method should return the input node IDs list and a list of subgraphs. Each
subgraph is a :class:`~dgl.graphbolt.SampledSubgraph` object.

.. note::

The design here is similar to PyTorch's ``torch.utils.data.DataLoader``,
which iterates over a dataset. Users can customize how samples are batched
using its ``collate_fn`` argument. In DGL, ``dgl.dataloading.DataLoader``
iterates over ``indices`` (e.g., training node IDs) while ``Sampler``
converts a batch of indices into a batch of graph- or tensor-type samples.

Any other data that are required during sampling such as the graph structure,
fanout size, etc. should be passed to the sampler via the constructor.

The code below implements a classical neighbor sampler:

.. code:: python
class NeighborSampler(dgl.dataloading.Sampler):
def __init__(self, fanouts : list[int]):
super().__init__()
@functional_datapipe("customized_sample_neighbor")
class CustomizedNeighborSampler(dgl.graphbolt.SubgraphSampler):
def __init__(self, datapipe, graph, fanouts):
super().__init__(datapipe)
self.graph = graph
self.fanouts = fanouts
def sample(self, g, seed_nodes):
output_nodes = seed_nodes
def _sample_subgraphs(self, seed_nodes):
subgs = []
for fanout in reversed(self.fanouts):
# Sample a fixed number of neighbors of the current seed nodes.
sg = g.sample_neighbors(seed_nodes, fanout)
# Convert this subgraph to a message flow graph.
sg = dgl.to_block(sg, seed_nodes)
seed_nodes = sg.srcdata[NID]
input_nodes, sg = self.graph.sample_neighbors(seed_nodes, fanout)
subgs.insert(0, sg)
input_nodes = seed_nodes
return input_nodes, output_nodes, subgs
seed_nodes = input_nodes
return input_nodes, subgs
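
The layer-by-layer loop in the sampler above can be illustrated without DGL. The following is only a minimal sketch under stated assumptions: ``in_neighbors`` (a plain dictionary from a node ID to its in-neighbor IDs) and ``sample_layers`` are hypothetical names introduced here, not GraphBolt APIs. It assumes that each layer's input nodes (the seeds plus the newly sampled sources) become the seeds of the next layer, and that the layer closest to the original seeds ends up last in the returned list.

```python
import random


def sample_layers(in_neighbors, seed_nodes, fanouts, rng=None):
    """Sample one edge list per layer, walking outward from the seeds.

    `in_neighbors` is a hypothetical stand-in for a sampling graph:
    it maps a node ID to the list of its in-neighbor IDs.
    """
    rng = rng or random.Random(0)
    seeds = list(seed_nodes)
    layers = []
    for fanout in reversed(fanouts):
        edges = []
        for dst in seeds:
            candidates = in_neighbors.get(dst, [])
            # Keep at most `fanout` in-neighbors of each seed node.
            picked = rng.sample(candidates, min(fanout, len(candidates)))
            edges.extend((src, dst) for src in picked)
        # Input nodes of this layer: the seeds first, then new sources.
        input_nodes = list(dict.fromkeys(seeds + [src for src, _ in edges]))
        # Prepend so that the outermost layer comes first.
        layers.insert(0, edges)
        seeds = input_nodes
    return seeds, layers


in_neighbors = {0: [1, 2, 3], 1: [2], 2: [0, 3], 3: []}
input_nodes, layers = sample_layers(in_neighbors, [0], fanouts=[2, 2])
```

Prepending each layer mirrors ``subgs.insert(0, sg)`` in the sampler: the model consumes the outermost layer first and the layer that produces the seed outputs last.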
To use this sampler with ``DataLoader``:
To use this sampler with :class:`~dgl.graphbolt.DataLoader`:

.. code:: python
graph = ... # the graph to be sampled from
train_nids = ... # a 1-D tensor of training node IDs
sampler = NeighborSampler([10, 15]) # create a sampler
dataloader = dgl.dataloading.DataLoader(
graph,
train_nids,
sampler,
batch_size=32, # batch_size decides how many IDs are passed to sampler at once
... # other arguments
)
for i, mini_batch in enumerate(dataloader):
# unpack the mini batch
input_nodes, output_nodes, subgs = mini_batch
train(input_nodes, output_nodes, subgs)
datapipe = gb.ItemSampler(train_set, batch_size=1024, shuffle=True)
datapipe = datapipe.customized_sample_neighbor(g, [10, 10]) # 2 layers.
datapipe = datapipe.fetch_feature(feature, node_feature_keys=["feat"])
datapipe = datapipe.to_dgl()
datapipe = datapipe.copy_to(device)
dataloader = gb.DataLoader(datapipe, num_workers=0)
for data in dataloader:
input_features = data.node_features["feat"]
output_labels = data.labels
output_predictions = model(data.blocks, input_features)
loss = compute_loss(output_labels, output_predictions)
opt.zero_grad()
loss.backward()
opt.step()
Sampler for Heterogeneous Graphs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To write a sampler for heterogeneous graphs, one needs to be aware that
the argument ``g`` will be a heterogeneous graph while ``indices`` could be a
the argument ``graph`` is a heterogeneous graph while ``seeds`` could be a
dictionary of ID tensors. Most of DGL's graph sampling operators (e.g.,
the ``sample_neighbors`` and ``to_block`` functions in the above example) can
work on heterogeneous graphs natively, so many samplers are automatically
ready for heterogeneous graphs. For example, the above ``NeighborSampler``
ready for heterogeneous graphs. For example, the above ``CustomizedNeighborSampler``
can be used on heterogeneous graphs:

.. code:: python
hg = dgl.heterograph({
('user', 'like', 'movie') : ...,
('user', 'follow', 'user') : ...,
('movie', 'liked-by', 'user') : ...,
})
train_nids = {'user' : ..., 'movie' : ...} # training IDs of 'user' and 'movie' nodes
sampler = NeighborSampler([10, 15]) # create a sampler
dataloader = dgl.dataloading.DataLoader(
hg,
train_nids,
sampler,
batch_size=32, # batch_size decides how many IDs are passed to sampler at once
... # other arguments
import torch
import dgl.graphbolt as gb
hg = gb.FusedCSCSamplingGraph()
train_set = item_set = gb.ItemSetDict(
{
"user": gb.ItemSet(
(torch.arange(0, 5), torch.arange(5, 10)),
names=("seed_nodes", "labels"),
),
"item": gb.ItemSet(
(torch.arange(5, 10), torch.arange(10, 15)),
names=("seed_nodes", "labels"),
),
}
)
datapipe = gb.ItemSampler(train_set, batch_size=1024, shuffle=True)
datapipe = datapipe.customized_sample_neighbor(hg, [10, 10]) # 2 layers.
datapipe = datapipe.fetch_feature(
feature, node_feature_keys={"user": ["feat"], "item": ["feat"]}
)
for i, mini_batch in enumerate(dataloader):
# unpack the mini batch
# input_nodes and output_nodes are dictionary while subgs are a list of
# heterogeneous graphs
input_nodes, output_nodes, subgs = mini_batch
train(input_nodes, output_nodes, subgs)
Exclude Edges During Sampling
datapipe = datapipe.to_dgl()
datapipe = datapipe.copy_to(device)
dataloader = gb.DataLoader(datapipe, num_workers=0)
for data in dataloader:
input_features = {
ntype: data.node_features[(ntype, "feat")]
for ntype in data.blocks[0].srctypes
}
output_labels = data.labels["user"]
output_predictions = model(data.blocks, input_features)["user"]
loss = compute_loss(output_labels, output_predictions)
opt.zero_grad()
loss.backward()
opt.step()
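
To make the "dictionary of ID tensors" idea concrete, here is a small library-free sketch of a single sampling hop on a heterogeneous graph. Everything in it is an illustrative assumption rather than GraphBolt's implementation: ``in_neighbors`` is a hypothetical dictionary keyed by ``(src_type, edge_type, dst_type)`` triples, ``seeds`` maps a node type to plain Python lists of IDs (standing in for tensors), and ``sample_hetero_one_hop`` is a made-up helper name.

```python
import random


def sample_hetero_one_hop(in_neighbors, seeds, fanout, rng=None):
    """Sample up to `fanout` in-neighbors per seed, separately per edge type."""
    rng = rng or random.Random(0)
    sampled = {}
    for (src_type, etype, dst_type), adj in in_neighbors.items():
        edges = []
        # Only seeds whose type matches the destination type are relevant.
        for dst in seeds.get(dst_type, []):
            candidates = adj.get(dst, [])
            picked = rng.sample(candidates, min(fanout, len(candidates)))
            edges.extend((src, dst) for src in picked)
        sampled[(src_type, etype, dst_type)] = edges
    return sampled


# Hypothetical toy graph: adjacency lists map a destination ID to source IDs.
in_neighbors = {
    ("user", "like", "item"): {0: [0, 1, 2], 1: [2]},
    ("user", "follow", "user"): {0: [1], 1: [0, 2]},
}
seeds = {"user": [0], "item": [1]}
sampled = sample_hetero_one_hop(in_neighbors, seeds, fanout=2)
```

The result is keyed by edge type, which is why the training loop above indexes features and labels by node type.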
Exclude Edges After Sampling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The examples above are all *node-wise samplers* because the ``indices`` argument
to the ``sample`` method represents a batch of seed node IDs. Another common type of
sampler is the *edge-wise sampler*, which, as the name suggests, takes in a batch of seed
edge IDs to construct mini-batch data. DGL provides a utility
:func:`dgl.dataloading.as_edge_prediction_sampler` to turn a node-wise sampler into
an edge-wise sampler. To prevent information leakage, it requires the node-wise sampler
to have an additional third argument ``exclude_eids``. The code below modifies
the ``NeighborSampler`` we just defined to properly exclude edges from the sampled
subgraph:
In some cases, we may want to exclude seed edges from the sampled subgraph. For
example, in link prediction tasks, we want to exclude the edges in the
training set from the sampled subgraph to prevent information leakage. To
do so, we need to add an additional datapipe right after sampling as follows:

.. code:: python
class NeighborSampler(Sampler):
def __init__(self, fanouts):
super().__init__()
self.fanouts = fanouts
datapipe = datapipe.customized_sample_neighbor(g, [10, 10]) # 2 layers.
datapipe = datapipe.transform(gb.exclude_seed_edges)
# NOTE: There is an additional third argument. For homogeneous graphs,
# it is a 1-D tensor of integer IDs. For heterogeneous graphs, it
# is a dictionary of ID tensors. We usually set its default value to be None.
def sample(self, g, seed_nodes, exclude_eids=None):
output_nodes = seed_nodes
subgs = []
for fanout in reversed(self.fanouts):
# Sample a fixed number of neighbors of the current seed nodes.
sg = g.sample_neighbors(seed_nodes, fanout, exclude_edges=exclude_eids)
# Convert this subgraph to a message flow graph.
sg = dgl.to_block(sg, seed_nodes)
seed_nodes = sg.srcdata[NID]
subgs.insert(0, sg)
input_nodes = seed_nodes
return input_nodes, output_nodes, subgs
Please check the API page of :func:`~dgl.graphbolt.exclude_seed_edges` for more
details.

The above API is based on :meth:`~dgl.graphbolt.SampledSubgraph.exclude_edges`.
If you want to exclude edges from the sampled subgraph based on some other
criteria, you could write your own transform function. Please check the method
for reference.
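
As an illustration of what such a custom transform might look like, the sketch below filters sampled edges with a different criterion: it drops each seed edge and also its reverse. This is a hedged, library-free example. ``MiniBatchSketch`` and ``exclude_seed_and_reverse_edges`` are hypothetical names, and the mini-batch is modeled as plain lists of ``(src, dst)`` pairs rather than GraphBolt's actual mini-batch and subgraph objects. It makes no claim about what ``exclude_seed_edges`` itself does internally.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MiniBatchSketch:
    # Hypothetical stand-in for a mini-batch; not a GraphBolt class.
    seed_edges: list
    layers: list  # one list of (src, dst) edges per sampled layer


def exclude_seed_and_reverse_edges(batch):
    """Drop every sampled edge that matches a seed edge in either direction."""
    banned = set(batch.seed_edges)
    banned |= {(dst, src) for src, dst in batch.seed_edges}
    filtered = [[e for e in layer if e not in banned] for layer in batch.layers]
    # Return a new mini-batch and leave the input untouched.
    return replace(batch, layers=filtered)


batch = MiniBatchSketch(
    seed_edges=[(0, 1)],
    layers=[[(2, 0), (1, 0)], [(0, 1), (3, 1)]],
)
cleaned = exclude_seed_and_reverse_edges(batch)
```

A function with this shape (mini-batch in, mini-batch out) is the kind of callable that would be passed to ``datapipe.transform(...)``. The real version would operate on the sampled subgraphs instead of edge lists.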

Further Readings
~~~~~~~~~~~~~~~~~~
See :ref:`guide-minibatch-prefetching` for how to write a custom graph sampler
with feature prefetching.
You could also refer to examples in
`Link Prediction <https://github.com/dmlc/dgl/blob/master/examples/sampling/graphbolt/link_prediction.py>`__.
6 changes: 3 additions & 3 deletions docs/source/guide/minibatch-edge.rst
@@ -40,7 +40,7 @@ edges(namely, node pairs) in the training set instead of the nodes.
datapipe = datapipe.fetch_feature(feature, node_feature_keys=["feat"])
datapipe = datapipe.to_dgl()
datapipe = datapipe.copy_to(device)
dataloader = gb.MultiProcessDataLoader(datapipe, num_workers=0)
dataloader = gb.DataLoader(datapipe, num_workers=0)
Iterating over the DataLoader will yield :class:`~dgl.graphbolt.DGLMiniBatch`
which contains a list of specially created graphs representing the computation
@@ -93,7 +93,7 @@ You can use :func:`~dgl.graphbolt.exclude_seed_edges` alongside with
datapipe = datapipe.fetch_feature(feature, node_feature_keys=["feat"])
datapipe = datapipe.to_dgl()
datapipe = datapipe.copy_to(device)
dataloader = gb.MultiProcessDataLoader(datapipe, num_workers=0)
dataloader = gb.DataLoader(datapipe, num_workers=0)
Adapt your model for minibatch training
@@ -275,7 +275,7 @@ only difference is that the train_set is now an instance of
)
datapipe = datapipe.to_dgl()
datapipe = datapipe.copy_to(device)
dataloader = gb.MultiProcessDataLoader(datapipe, num_workers=0)
dataloader = gb.DataLoader(datapipe, num_workers=0)
Things become a little different if you wish to exclude the reverse
edges on heterogeneous graphs. On heterogeneous graphs, reverse edges
4 changes: 4 additions & 0 deletions docs/source/guide/minibatch-gpu-sampling.rst
@@ -3,6 +3,10 @@
6.7 Using GPU for Neighborhood Sampling
---------------------------------------

.. note::
GraphBolt does not support GPU-based neighborhood sampling yet, so this guide
uses :class:`~dgl.dataloading.DataLoader` for illustration.

DGL since 0.7 has been supporting GPU-based neighborhood sampling, which has a significant
speed advantage over CPU-based neighborhood sampling. If you estimate that your graph
can fit onto GPU and your model does not take a lot of GPU memory, then it is best to