Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Improve metapath_reachable_graph for large graphs #2248

Closed
lingfanyu opened this issue Sep 27, 2020 · 7 comments
Closed

[Feature Request] Improve metapath_reachable_graph for large graphs #2248

lingfanyu opened this issue Sep 27, 2020 · 7 comments

Comments

@lingfanyu
Copy link
Collaborator

lingfanyu commented Sep 27, 2020

🚀 Feature

I suggest the following support for dgl.metapath_reachable_graph:

  • Let users specify source nodes, from which the output graph is reachable following given metapath
  • Add a neighbor sampling version of this API

Motivation

I am trying to run Heterogeneous Graph Attention Network (HAN) on large graphs. The example HAN implementation in DGL works perfectly on small graphs like ACM (only 4k paper nodes). But I find it not easy to scale to larger datasets like ogbn-mag.

The problem

HAN defines neighborhood based on metapath. However, this step makes the graph a lot denser. Take ACM dataset as an example, there are about 4k papers, 70 fields, and 4k paper-field edges in the graph. The reachable graph of metapath paper-field-paper has about 2.2M edges.

On large datasets, there are millions of nodes, and more edge types (resulting in more potential metapaths) and performing metapath_reachable_graph for full graph does not scale.

Pitch

I want two things to happen if this feature is added:

  • Mini-batch training for HAN instead of full-batch. This requires generating reachable graph from a batch of nodes provided by users
  • Performing neighbor sampling (maybe at each step in the metapath) to further reduce the size of generated reachable graph

Alternatives

One alternative way I can think of to implement what I need using DGL's current API is to use dgl.sampling.random_walk. For a mini-batch of nodes, it might be equivalent to performing dgl.sampling.random_walk(g, nodes, metapath) for K times. And then I can take only the last step of the returned paths to form a size K neighborhood for each node in the batch.

Another way (less relevant to requested feature) to scale HAN to large graph is to design a subgraph sampling approach which needs to be aware of the heterogeneity of the graph and the user-specified metapaths. Then HAN can run on the sampled much smaller heterogeneous subgraph.

@jermainewang @BarclayII @mufeili

@BarclayII
Copy link
Collaborator

Sounds like a generalization of PinSage. PinSage basically samples some neighbors on a metapath_reachable_graph with the metapath being item -> user -> item. I think you can probably adapt the PinSage example to HAN by replacing PinSAGESampler to RandomWalkNeighborSampler, with num_traversals set to 1, num_random_walks set to the number of neighbors, and metapath to the metapath you want.

@v2psv
Copy link

v2psv commented Sep 29, 2020

I use dgl.dataloading.NodeDataLoader to get reachable metapath, then sampling subgraphs using dgl.node_subgraph, and train HAN on subgraphs in each step. Maybe there are better solutions.

@lingfanyu
Copy link
Collaborator Author

@BarclayII I was planning to implement something similar using random_walk API. It's great to know we already have this RandomWalkNeighborSampler.

@v2psv What you suggested is similar to the second alternative I proposed, but you don't make use of heterogeneity and metapath information when performing subgraph sampling. I'm not sure if the sampled subgraph will be able to maintain semantic information or will be biased toward certain edge types that have more edges. It would be nice to try this out on large dataset like MAG and OAG.

@lingfanyu
Copy link
Collaborator Author

Closed due to no further discussion

@maqy1995
Copy link
Contributor

Hi @lingfanyu, have you tried mini-batch training HAN on ogbn-mag? Is it work?

@GooLiang
Copy link

I had the same problem when building the mag dataset 3 hop meta-path using this API. The build speed is too slow and directly stuck. If this API has better performance now?

@stupidoge
Copy link

stupidoge commented Aug 27, 2023

When I use this function, the memory out of the limit.. The error said: unable to allocate 538 GiB for an array with shape.... and data type int64

could you solve this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants