
[Feature] Enable UVA sampling with CPU indices #3892

Merged (24 commits from fix-cpu-uva into dmlc:master, Apr 12, 2022)

Conversation

BarclayII (Collaborator) commented Mar 29, 2022

Description

This PR enables specifying indices on CPU for UVA sampling to avoid duplicating the indices for every GPU.

It also adds a utility function dgl.multiprocessing.call_once_and_share(), which calls a function in a single process and shares the result with the other processes. This requires using dgl.multiprocessing.spawn instead of torch.multiprocessing.spawn (the signatures are identical).

Fixes #3855 and #3893.
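The call-once-and-share idea can be sketched in plain Python. This is an illustration of the pattern only, not DGL's actual implementation: the real dgl.multiprocessing.call_once_and_share presumably uses shared memory across spawned worker processes, while the sketch below simulates the ranks sequentially with a queue.

```python
import queue

def call_once_and_share(func, rank, shared_queue, world_size):
    """Sketch of the call-once-and-share pattern: rank 0 calls ``func``
    exactly once and publishes the result to every other rank via a queue.
    (Illustration only; DGL's real utility may differ.)"""
    if rank == 0:
        result = func()
        for _ in range(world_size - 1):
            shared_queue.put(result)
        return result
    # Non-zero ranks block until rank 0 has published the result.
    return shared_queue.get()

# Simulate three ranks sequentially to show the function runs only once.
q = queue.Queue()
calls = []

def expensive_build():
    calls.append(1)           # record each invocation
    return list(range(5))     # stand-in for an expensive shared structure

results = [call_once_and_share(expensive_build, rank, q, 3) for rank in range(3)]
print(len(calls))                              # 1: ran in "rank 0" only
print(results[0] == results[1] == results[2])  # True: all ranks share it
```

In the real multi-GPU setting the point is the same: the expensive object (here, the CPU index tensor) is built once rather than duplicated per worker.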

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referred to in this PR

@BarclayII BarclayII requested review from nv-dlasalle and jermainewang and removed request for nv-dlasalle March 29, 2022 15:03
dgl-bot (Collaborator) commented Mar 29, 2022

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

Review threads (all resolved):
  • examples/pytorch/graphsage/node_classification.py (two threads)
  • python/dgl/multiprocessing/pytorch.py
  • python/dgl/dataloading/dataloader.py
@jermainewang jermainewang removed their request for review April 5, 2022 15:35
jermainewang (Member) commented Apr 5, 2022

I didn't follow the context very closely, but I see no change to the user experience. I'm good with the PR; will let @nv-dlasalle approve.

yaox12 (Collaborator) commented Apr 12, 2022

Since @nv-dlasalle is out until next week, I'd like to approve this PR.

@BarclayII BarclayII merged commit e06e63d into dmlc:master Apr 12, 2022
@BarclayII BarclayII deleted the fix-cpu-uva branch April 12, 2022 12:56
kkranen (Contributor) commented Apr 14, 2022

Has anyone taken a look at how using this feature increases the load on host memory? While this is an awesome solution for enabling larger graphs, it may cause problems for some memory-bandwidth-limited setups.

kkranen (Contributor) commented Apr 14, 2022

It seems that dgl.multiprocessing.spawn does not exist on main:
module 'dgl.multiprocessing' has no attribute 'spawn'
Should we instead spawn the workers by setting a multiprocessing context?

TristonC (Collaborator)

@BarclayII Could you help to answer Kyle's question?

BarclayII (Collaborator, Author) commented Apr 15, 2022

Ah, the docstring was wrong; dgl.multiprocessing.spawn is no longer required. I can fix that.
As for @kkranen's question: you need to spawn the processes with torch.multiprocessing.spawn and set up a PyTorch distributed process group (torch.distributed.init_process_group).

BarclayII (Collaborator, Author) commented

> Has anyone taken a look at how using this feature increases load on host memory? While this is an awesome solution for enabling larger graphs, it may cause problems from some memory BW-limited solutions.

I think the load increase should be fine: this PR only changes how we handle CPU indices, and we only need to copy the CPU indices (which are the same size as the minibatch itself) to the GPU.
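A back-of-the-envelope check makes the claim concrete. The numbers below are assumptions chosen for illustration (they are not measurements from this PR): copying only the seed indices of a minibatch moves orders of magnitude less data than copying the corresponding node features would.

```python
# Assumed illustrative numbers, not measurements from this PR.
batch_size = 1024             # seed nodes per minibatch (assumption)
index_bytes = 8               # one int64 index
feat_dim = 256                # feature dimension (assumption)
feat_bytes = 4                # one float32 value

# Bytes of seed indices copied host -> GPU per minibatch.
index_copy = batch_size * index_bytes
# Bytes that would move if full features were copied instead (UVA avoids this).
feature_copy = batch_size * feat_dim * feat_bytes

print(index_copy)                   # 8192 (~8 KB per batch)
print(feature_copy // index_copy)   # 128x larger for features
```

Under these assumptions the per-batch index copy is a few kilobytes, which supports the point that the extra host-memory traffic from CPU-resident indices is negligible.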
