
Make graph compaction handle isolated nodes #1266

Closed
BarclayII opened this issue Feb 16, 2020 · 4 comments

BarclayII (Collaborator) commented Feb 16, 2020

🚀 Feature

Make graph compaction keep a given set of isolated nodes.

Motivation

In the future, our recommended way of performing minibatch training for node classification will follow the pseudocode below (see the document in #1199 for a more complete explanation):

frontiers = []
seeds = ...     # a batch of seed nodes
for _ in range(num_layers):
    frontier = dgl.sample_neighbors(graph, nodes=seeds, fanout=3)
    src, dst = frontier.all_edges()
    seeds = torch.unique(torch.cat([src, dst]))
    frontiers.insert(0, frontier)
frontiers = dgl.compact_graphs(frontiers)    # <----

input = ...   # input features
label = ...   # labels
mask = ...    # whether the node is labeled or not

# After compaction all frontiers share the same node set, so any of them
# can be used to look up the original node IDs.
h = input[frontiers[0].ndata[dgl.NID]]
y = label[frontiers[0].ndata[dgl.NID]]
m = mask[frontiers[0].ndata[dgl.NID]]
for i in range(num_layers):
    h = SAGEConv[i](frontiers[i], h)
loss = F.cross_entropy(Linear(h), y, reduction='none')[m].mean()
loss.backward()

If the seed nodes contain isolated nodes (i.e. nodes with no inbound edges), then those seed nodes would actually be removed from the sampled frontiers by compact_graphs. The consequence is that such isolated nodes would never be trained on in node classification with the pipeline above.
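
A minimal sketch of this behavior (the num_nodes keyword of dgl.graph and the exact ID ordering are assumptions; the point is that the edgeless node disappears):

import dgl

# Node 3 has no edges at all, i.e. it is isolated.
g = dgl.graph(([0, 1], [1, 2]), num_nodes=4)

cg = dgl.compact_graphs(g)
# Only nodes with at least one edge survive compaction; node 3 is dropped.
print(cg.ndata[dgl.NID])   # e.g. tensor([0, 1, 2])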

Note that link prediction does not suffer from this problem. We recommend constructing pair graphs whose edges connect the positive and negative pairs respectively, and compacting the pair graphs and the frontiers together. Every seed node therefore has at least one edge in one of the pair graphs, even if it is isolated in the frontiers, and would not be removed during graph compaction.
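
Roughly, with hypothetical names pos_pair_graph and neg_pair_graph for the two pair graphs:

# Compact the pair graphs together with the frontiers so that every seed
# node keeps at least one edge somewhere in the list.
compacted = dgl.compact_graphs([pos_pair_graph, neg_pair_graph] + frontiers)
pos_pair_graph, neg_pair_graph = compacted[0], compacted[1]
frontiers = compacted[2:]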

Alternatives

We could technically ignore the isolated nodes during training. It is not clear how ignoring those examples would impact performance on current benchmarks, but if a GNN model did fail to beat a baseline model on a dataset, it would be hard to tell whether the performance loss comes from discarding the isolated nodes.

We could also work around it by manually adding self loops for isolated nodes, but this would introduce other subtleties, such as deciding which edge type to assign to such self loops, and changing the formulation of GraphSAGE and other GNNs for this corner case (although the papers don't explicitly say how to handle isolated nodes anyway). A sketch of this workaround is shown below.
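
For reference, the workaround would look roughly like this on a homogeneous graph (a sketch only; g is an existing DGLGraph, and "isolated" follows the no-inbound-edges definition above):

# Add a self loop on every node that has no inbound edges.
isolated = (g.in_degrees() == 0).nonzero().squeeze(1)
g.add_edges(isolated, isolated)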

Pitch

Handle training of isolated nodes in the same minibatch training pipeline.

BarclayII (Collaborator, Author)

The solution is to add an optional always_preserve argument telling compact_graphs not to remove the given nodes.
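
In the pipeline above, the compaction call would then become something like the line below (the argument name is only the proposal here and may change; seed_batch is a hypothetical name for the original batch of seed nodes, since the pseudocode overwrites seeds during sampling):

frontiers = dgl.compact_graphs(frontiers, always_preserve=seed_batch)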

BarclayII self-assigned this Feb 16, 2020
BarclayII added a commit to BarclayII/dgl that referenced this issue Feb 16, 2020
classicsong (Contributor) commented Feb 17, 2020

The solution is to add an optional always_preserve argument telling compact_graphs not to remove the given nodes.

Even if you preserve the isolated nodes (as seed nodes), they still cannot obtain their embeddings through message passing during training. Will this cause problems for models that concatenate embeddings layer by layer?
How about adding an auto_self_loop option?

BarclayII (Collaborator, Author)

Even if you preserve the isolated nodes (as seed nodes), they still cannot obtain their embeddings through message passing during training. Will this cause problems for models that concatenate embeddings layer by layer?

Isolated nodes indeed do not get anything from message passing. In the case of GraphSAGE, the model naturally reduces to an MLP for the mean and LSTM aggregators according to the formulation (our implementation of the max aggregator currently returns -inf, so it will crash, but that's another problem).
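
A rough plain-PyTorch illustration of why the mean aggregator degenerates to a per-node MLP when the neighborhood is empty (this mirrors the usual formulation, not DGL's exact implementation):

import torch
import torch.nn as nn

fc_self = nn.Linear(16, 8)
fc_neigh = nn.Linear(16, 8, bias=False)

h_v = torch.randn(16)        # the isolated node's own feature
h_neigh = torch.zeros(16)    # mean over an empty neighborhood, taken as zero

# h_v' = fc_self(h_v) + fc_neigh(h_neigh) collapses to fc_self(h_v)
out = fc_self(h_v) + fc_neigh(h_neigh)
assert torch.allclose(out, fc_self(h_v))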

How about adding an auto_self_loop option?

This actually throws a wrench into heterogeneous graph training, especially if a sampling algorithm samples only some of the relations (e.g. MEIRec). Which edge type would you assign such a self loop to?

classicsong (Contributor)

Even if you preserve the isolated nodes (as seed nodes), they still cannot obtain their embeddings through message passing during training. Will this cause problems for models that concatenate embeddings layer by layer?

Isolated nodes indeed do not get anything from message passing. In the case of GraphSAGE, the model naturally reduces to an MLP for the mean and LSTM aggregators according to the formulation (our implementation of the max aggregator currently returns -inf, so it will crash, but that's another problem).

What I mean by 'concatenate embeddings layer by layer' is that the embeddings from each frontier are concatenated together. You can refer to nn.pytorch.conv.tagconv as an example; a rough sketch is below.
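
Something along these lines (a sketch of the layer-wise concatenation pattern using the names from the pipeline above, not TAGConv's exact code):

hs = []
for i in range(num_layers):
    h = SAGEConv[i](frontiers[i], h)
    hs.append(h)
h = torch.cat(hs, dim=1)   # per-layer embeddings concatenated along the feature dimension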

How about adding an auto_self_loop option?

This actually throws a wrench into heterogeneous graph training, especially if a sampling algorithm samples only some of the relations (e.g. MEIRec). Which edge type would you assign such a self loop to?

This is indeed a problem for the self-loop approach.

It seems there is no silver bullet in this case.
