
Make graph compaction handle isolated nodes #1266

Closed
BarclayII opened this issue Feb 16, 2020 · 4 comments

BarclayII (Collaborator) commented Feb 16, 2020

🚀 Feature

Make graph compaction keep a given set of isolated nodes.

Motivation

In the future, our recommended way of performing minibatch training for node classification will follow the pseudocode below (see the document in #1199 for a more complete explanation):

frontiers = []
seeds = ...     # a batch of seed nodes
for _ in range(num_layers):
    frontier = dgl.sample_neighbors(graph, nodes=seeds, fanout=3)
    src, dst = frontier.all_edges()
    seeds = torch.unique(torch.cat([src, dst]))
    frontiers.insert(0, frontier)
frontiers = dgl.compact_graphs(frontiers)    # <----

input = ...   # input features
label = ...   # labels
mask = ...    # whether the node is labeled or not

# After compaction all frontiers share the same node set, so any of them
# can be used to look up the original node IDs.
h = input[frontiers[0].ndata[dgl.NID]]
y = label[frontiers[0].ndata[dgl.NID]]
m = mask[frontiers[0].ndata[dgl.NID]]
for i in range(num_layers):
    h = SAGEConv[i](frontiers[i], h)
loss = F.cross_entropy(Linear(h), y, reduction='none')[m].mean()
loss.backward()

If the seed nodes contain isolated nodes (i.e. nodes with no inbound edges), then those seed nodes would actually be removed from the sampled frontiers by compact_graphs. The consequence is that such isolated nodes would never be trained on in node classification with the pipeline above.
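
A minimal sketch of this behavior (the num_nodes keyword of dgl.graph and the exact ID ordering are assumptions; the point is that the edgeless node disappears):

import dgl

# Node 3 has no edges at all, i.e. it is isolated.
g = dgl.graph(([0, 1], [1, 2]), num_nodes=4)

cg = dgl.compact_graphs(g)
# Only nodes with at least one edge survive compaction; node 3 is dropped.
print(cg.ndata[dgl.NID])   # e.g. tensor([0, 1, 2])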

Note that link prediction does not suffer from this problem. We recommend constructing pair graphs whose edges connect the positive and negative pairs respectively, and compacting the pair graphs and the frontiers together. Every seed node therefore has at least one edge in one of the pair graphs, even if it is isolated in the frontiers, and would not be removed during graph compaction.
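
Roughly, with hypothetical names pos_pair_graph and neg_pair_graph for the two pair graphs:

# Compact the pair graphs together with the frontiers so that every seed
# node keeps at least one edge somewhere in the list.
compacted = dgl.compact_graphs([pos_pair_graph, neg_pair_graph] + frontiers)
pos_pair_graph, neg_pair_graph = compacted[0], compacted[1]
frontiers = compacted[2:]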

Alternatives

We could technically ignore the isolated nodes during training. It is not clear how ignoring those examples would impact performance on current benchmarks, but if a GNN model did fail to beat a baseline model on a dataset, it would be hard to tell whether the performance loss comes from discarding the isolated nodes.

We could also work around it by manually adding self loops for isolated nodes, but this would introduce other subtleties, such as deciding which edge type to assign to such self loops, and changing the formulation of GraphSAGE and other GNNs for this corner case (although the papers don't explicitly say how to handle isolated nodes anyway). A sketch of this workaround is shown below.
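
For reference, the workaround would look roughly like this on a homogeneous graph (a sketch only; g is an existing DGLGraph, and "isolated" follows the no-inbound-edges definition above):

# Add a self loop on every node that has no inbound edges.
isolated = (g.in_degrees() == 0).nonzero().squeeze(1)
g.add_edges(isolated, isolated)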

Pitch

Handle training of isolated nodes in the same minibatch training pipeline.

BarclayII (Collaborator, Author)

The solution is to add an optional always_preserve argument telling compact_graphs not to remove the given nodes.
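
In the pipeline above, the compaction call would then become something like the line below (the argument name is only the proposal here and may change; seed_batch is a hypothetical name for the original batch of seed nodes, since the pseudocode overwrites seeds during sampling):

frontiers = dgl.compact_graphs(frontiers, always_preserve=seed_batch)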

BarclayII self-assigned this Feb 16, 2020
BarclayII added a commit to BarclayII/dgl that referenced this issue Feb 16, 2020
classicsong (Contributor) commented Feb 17, 2020

The solution is to add an optional always_preserve argument telling compact_graphs not to remove the given nodes.

Even if you preserve the isolated nodes (as seed nodes), they still cannot obtain their embeddings through message passing during training. Will this cause problems for models that concatenate embeddings layer by layer?
How about adding an auto_self_loop option?

BarclayII (Collaborator, Author)

Even if you preserve the isolated nodes (as seed nodes), they still cannot obtain their embeddings through message passing during training. Will this cause problems for models that concatenate embeddings layer by layer?

Isolated nodes indeed do not get anything from message passing. In the case of GraphSAGE, the model naturally reduces to an MLP for the mean and LSTM aggregators according to the formulation (our implementation of the max aggregator currently returns -inf, so it will crash, but that's another problem).
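
A rough plain-PyTorch illustration of why the mean aggregator degenerates to a per-node MLP when the neighborhood is empty (this mirrors the usual formulation, not DGL's exact implementation):

import torch
import torch.nn as nn

fc_self = nn.Linear(16, 8)
fc_neigh = nn.Linear(16, 8, bias=False)

h_v = torch.randn(16)        # the isolated node's own feature
h_neigh = torch.zeros(16)    # mean over an empty neighborhood, taken as zero

# h_v' = fc_self(h_v) + fc_neigh(h_neigh) collapses to fc_self(h_v)
out = fc_self(h_v) + fc_neigh(h_neigh)
assert torch.allclose(out, fc_self(h_v))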

How about adding an auto_self_loop option?

This actually throws a wrench into heterogeneous graph training, especially if a sampling algorithm samples only some of the relations (e.g. MEIRec). Which edge type would you assign such a self loop to?

classicsong (Contributor)

Even if you preserve the isolated nodes (as seed nodes), they still cannot obtain their embeddings through message passing during training. Will this cause problems for models that concatenate embeddings layer by layer?

Isolated nodes indeed do not get anything from message passing. In the case of GraphSAGE, the model naturally reduces to an MLP for the mean and LSTM aggregators according to the formulation (our implementation of the max aggregator currently returns -inf, so it will crash, but that's another problem).

What I mean by 'concatenate embeddings layer by layer' is that the embeddings from each frontier are concatenated together. You can refer to nn.pytorch.conv.tagconv as an example; a rough sketch is below.
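
Something along these lines (a sketch of the layer-wise concatenation pattern using the names from the pipeline above, not TAGConv's exact code):

hs = []
for i in range(num_layers):
    h = SAGEConv[i](frontiers[i], h)
    hs.append(h)
h = torch.cat(hs, dim=1)   # per-layer embeddings concatenated along the feature dimension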

How about adding an auto_self_loop option?

This actually throws a wrench into heterogeneous graph training, especially if a sampling algorithm samples only some of the relations (e.g. MEIRec). Which edge type would you assign such a self loop to?

This is indeed a problem for the self-loop approach.

It seems there is no silver bullet in this case.
