
[Feature] Enable UVA sampling with CPU indices #3892

Merged (24 commits from fix-cpu-uva into dmlc:master, Apr 12, 2022)

Conversation

BarclayII (Collaborator) commented Mar 29, 2022

Description

This PR enables specifying indices on CPU for UVA sampling to avoid duplicating the indices for every GPU.

It also adds a utility function dgl.multiprocessing.call_once_and_share(), which calls a function in a single process and shares the result with the other processes. This requires using dgl.multiprocessing.spawn instead of torch.multiprocessing.spawn (the signatures are identical).

Fixes #3855 and #3893.
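The call-once-and-share idea can be sketched in plain Python. This is an illustration of the pattern only, not DGL's actual implementation: the real dgl.multiprocessing.call_once_and_share presumably uses shared memory across spawned worker processes, while the sketch below simulates the ranks sequentially with a queue.

```python
import queue

def call_once_and_share(func, rank, shared_queue, world_size):
    """Sketch of the call-once-and-share pattern: rank 0 calls ``func``
    exactly once and publishes the result to every other rank via a queue.
    (Illustration only; DGL's real utility may differ.)"""
    if rank == 0:
        result = func()
        for _ in range(world_size - 1):
            shared_queue.put(result)
        return result
    # Non-zero ranks block until rank 0 has published the result.
    return shared_queue.get()

# Simulate three ranks sequentially to show the function runs only once.
q = queue.Queue()
calls = []

def expensive_build():
    calls.append(1)           # record each invocation
    return list(range(5))     # stand-in for an expensive shared structure

results = [call_once_and_share(expensive_build, rank, q, 3) for rank in range(3)]
print(len(calls))                              # 1: ran in "rank 0" only
print(results[0] == results[1] == results[2])  # True: all ranks share it
```

In the real multi-GPU setting the point is the same: the expensive object (here, the CPU index tensor) is built once rather than duplicated per worker.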

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referred to in this PR

@BarclayII BarclayII requested review from nv-dlasalle and jermainewang and removed request for nv-dlasalle March 29, 2022 15:03
dgl-bot (Collaborator) commented Mar 29, 2022

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

Review threads (all resolved):
  • examples/pytorch/graphsage/node_classification.py (two threads)
  • python/dgl/multiprocessing/pytorch.py
  • python/dgl/dataloading/dataloader.py
@jermainewang jermainewang removed their request for review April 5, 2022 15:35
jermainewang (Member) commented Apr 5, 2022

I didn't follow the context very closely, but I see no change to the user experience. I'm good with the PR; will let @nv-dlasalle approve.

yaox12 (Collaborator) commented Apr 12, 2022

Since @nv-dlasalle is out until next week, I'd like to approve this PR.

@BarclayII BarclayII merged commit e06e63d into dmlc:master Apr 12, 2022
@BarclayII BarclayII deleted the fix-cpu-uva branch April 12, 2022 12:56
kkranen (Contributor) commented Apr 14, 2022

Has anyone taken a look at how using this feature increases the load on host memory? While this is an awesome solution for enabling larger graphs, it may cause problems for some memory-bandwidth-limited setups.

kkranen (Contributor) commented Apr 14, 2022

It seems that dgl.multiprocessing.spawn does not exist on main:
module 'dgl.multiprocessing' has no attribute 'spawn'
Should we instead spawn the workers by setting a multiprocessing context?

TristonC (Collaborator)

@BarclayII Could you help to answer Kyle's question?

BarclayII (Collaborator, Author) commented Apr 15, 2022

Ah, the docstring was wrong; dgl.multiprocessing.spawn is no longer required. I can fix that.
As for @kkranen's question: you need to spawn the processes with torch.multiprocessing.spawn and set up a PyTorch distributed process group (torch.distributed.init_process_group).

BarclayII (Collaborator, Author) commented

> Has anyone taken a look at how using this feature increases load on host memory? While this is an awesome solution for enabling larger graphs, it may cause problems from some memory BW-limited solutions.

I think the load increase should be fine: this PR only changes how we handle CPU indices, and we only need to copy the CPU indices (which are the same size as the minibatch itself) to the GPU.
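A back-of-the-envelope check makes the claim concrete. The numbers below are assumptions chosen for illustration (they are not measurements from this PR): copying only the seed indices of a minibatch moves orders of magnitude less data than copying the corresponding node features would.

```python
# Assumed illustrative numbers, not measurements from this PR.
batch_size = 1024             # seed nodes per minibatch (assumption)
index_bytes = 8               # one int64 index
feat_dim = 256                # feature dimension (assumption)
feat_bytes = 4                # one float32 value

# Bytes of seed indices copied host -> GPU per minibatch.
index_copy = batch_size * index_bytes
# Bytes that would move if full features were copied instead (UVA avoids this).
feature_copy = batch_size * feat_dim * feat_bytes

print(index_copy)                   # 8192 (~8 KB per batch)
print(feature_copy // index_copy)   # 128x larger for features
```

Under these assumptions the per-batch index copy is a few kilobytes, which supports the point that the extra host-memory traffic from CPU-resident indices is negligible.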
