[Bugfix] Accessing data from the indexes stored in same device #4242

chang-l · 2022-07-11T22:42:09Z

Description

To address #4234, this PR fixes the rgcn and graphsage examples, which crash after a recent PyTorch update.

Checklist

Please feel free to remove inapplicable items for your PR.

The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented
To the best of my knowledge, examples are either not affected by this change,
or have been fixed to be compatible with this change
Related issue is referred in this PR
If the PR is for a new model/paper, I've updated the example index here.

Changes

Move the index tensor to the same device as the data it indexes.
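A minimal sketch of the pattern this change applies, assuming tensor-like objects that expose `.device` and `.to(device)` as `torch.Tensor` does. The helper name `index_on_data_device` and the `demo` function are hypothetical and do not appear in the PR; the PR itself inlines the equivalent `input_nodes.to(nfeat.device)` call.

```python
def index_on_data_device(data, index):
    """Return data[index], moving index to data's device first if needed.

    Works with any tensor-like objects that expose .device and .to(device),
    such as torch.Tensor. Hypothetical helper for illustration only.
    """
    if index.device != data.device:
        index = index.to(data.device)
    return data[index]


def demo():
    # Requires PyTorch. Falls back to CPU for both devices when CUDA is absent.
    import torch

    data_device = torch.device("cpu")
    loader_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Features stay on data_device; the dataloader yields indices on loader_device.
    nfeat = torch.arange(12, dtype=torch.float32).reshape(4, 3).to(data_device)
    input_nodes = torch.tensor([0, 2]).to(loader_device)

    # Gather on the data's device, then move only the gathered rows for compute.
    return index_on_data_device(nfeat, input_nodes).to(loader_device)
```

Calling `demo()` on a machine with PyTorch installed returns the two gathered feature rows on the dataloader's device.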

dgl-bot · 2022-07-11T22:42:40Z

To trigger regression tests:

@dgl-bot run [instance-type] [which tests] [compare-with-branch];
For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot · 2022-07-11T23:26:51Z

Commit ID: 1d1851d

Build ID: 1

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

yaox12 · 2022-07-12T01:13:27Z

examples/pytorch/graphsage/advanced/train_sampling_unsupervised.py

@@ -133,6 +133,7 @@ def run(proc_id, n_gpus, args, devices, data):
        # blocks.
        tic_step = time.time()
        for step, (input_nodes, pos_graph, neg_graph, blocks) in enumerate(dataloader):
+            input_nodes = input_nodes.to(nfeat.device)


When features are on the CPU, input_nodes will first be copied to GPU in the dataloader and then copied back to the CPU for indexing. It's not caused by this PR and can be eliminated by unifying --graph_device and --data_device just like other examples. Perhaps we should take a note and fix it when we refactor this example in the future.

chang-l · 2022-07-12

I generally agree. Copying input_nodes back and forth does seem odd, and it should be fixed when we refactor the example. Here, however, input_nodes always seems to be on the same device as the dataloader (--gpu), which is configured separately from nfeat (--data-device) and g (--graph-device).

yaox12 · 2022-07-13

With the new sampling pipeline, we can specify prefetch_node_features to the sampler and no longer need batch_inputs = nfeat[input_nodes].to(device).

chang-l · 2022-07-13

I see... Thanks Xin for noting this. I will keep it in mind :)
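The round trip discussed in this thread can be made concrete with a small bookkeeping sketch. It only models which copies happen for one mini-batch feature gather; no real tensors are moved. The function name `index_transfers` and the string device labels are hypothetical, and the model assumes the dataloader has already placed input_nodes on its own device.

```python
def index_transfers(loader_device, data_device, compute_device):
    """List the device-to-device copies needed for one feature gather.

    Models batch_inputs = nfeat[input_nodes.to(nfeat.device)].to(device):
    input_nodes start on loader_device, nfeat lives on data_device, and the
    gathered rows are needed on compute_device.
    """
    copies = []
    if loader_device != data_device:
        # The index must follow the data before it can be used for indexing.
        copies.append(("input_nodes", loader_device, data_device))
    if data_device != compute_device:
        # The gathered rows are then moved to where the model runs.
        copies.append(("batch_inputs", data_device, compute_device))
    return copies


# Features on the CPU, dataloader and model on the GPU: two copies per batch.
mixed = index_transfers("cuda:0", "cpu", "cuda:0")

# Everything on one device: no copies at all.
unified = index_transfers("cuda:0", "cuda:0", "cuda:0")
```

With the devices unified, as suggested in the review, the list of copies is empty, which is the motivation for aligning the device options in a later refactor.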

chang-l · 2022-07-12T20:17:34Z

examples/pytorch/graphsage/advanced/train_sampling_unsupervised.py

-            pos_graph = pos_graph.to(device)
-            neg_graph = neg_graph.to(device)
-            blocks = [block.int().to(device) for block in blocks]
+            blocks = [block.int() for block in blocks]


Since pos_graph, neg_graph, and blocks all reside on the same device as the dataloader (L105), moving them to device is redundant.

dgl-bot · 2022-07-12T21:04:16Z

Commit ID: 926eade

Build ID: 2

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

chang-l added 2 commits July 11, 2022 11:24

First update to fix two examples

5618b72

Update to fix RGCN/graphsage example and dataloader

1d1851d

TristonC requested a review from BarclayII July 11, 2022 22:50

yaox12 approved these changes Jul 12, 2022

chang-l mentioned this pull request Jul 12, 2022

[Example][Bug] A multi-gpu run crashed for graphsage example #4244

Closed

Update

926eade

chang-l commented Jul 12, 2022

yaox12 merged commit c56e27a into dmlc:master Jul 13, 2022

chang-l mentioned this pull request Jul 13, 2022

[Example][Bugfix] Add-on to PR#4242, fixing another case in graphsage example #4255

Merged

7 tasks

chang-l deleted the fix-gpu-dataindx-alignment branch July 13, 2022 20:39

BarclayII pushed a commit to BarclayII/dgl that referenced this pull request Aug 10, 2022

[Bugfix] Accessing data from the indexes stored in same device (dmlc#4242)

* First update to fix two examples
* Update to fix RGCN/graphsage example and dataloader
* Update

bd33086
