Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix] Accessing data from the indexes stored in same device #4242

Merged
merged 3 commits into from
Jul 13, 2022

Conversation

chang-l
Copy link
Collaborator

@chang-l chang-l commented Jul 11, 2022

Description

To address #4234, this PR fixes the crashing example cases, rgcn and graphsage, due to recent pytorch update

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature]])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

Move index to the same device as data.

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 11, 2022

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@TristonC TristonC requested a review from BarclayII July 11, 2022 22:50
@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 11, 2022

Commit ID: 1d1851d

Build ID: 1

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@@ -133,6 +133,7 @@ def run(proc_id, n_gpus, args, devices, data):
# blocks.
tic_step = time.time()
for step, (input_nodes, pos_graph, neg_graph, blocks) in enumerate(dataloader):
input_nodes = input_nodes.to(nfeat.device)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When features are on the CPU, input_nodes will first be copied to GPU in the dataloader and then copied back to the CPU for indexing. It's not caused by this PR and can be eliminated by unifying --graph_device and --data_device just like other examples. Perhaps we should take a note and fix it when we refactor this example in the future.

Copy link
Collaborator Author

@chang-l chang-l Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally agree. It does sound odd that copying input_nodes back and forth and it should be fixed when we refactor it. However, here, it seems input_nodes always belong to the same device as dataloader (--gpu), which is configured/controlled differently than nfeat (--data-device) and g (--graph-device)...

Copy link
Collaborator

@yaox12 yaox12 Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new sampling pipeline, we can specify prefetch_node_features to the sampler and don't need batch_inputs = nfeat[input_nodes].to(device) anymore.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see... Thanks Xin for noting this. I will keep it in mind :)

pos_graph = pos_graph.to(device)
neg_graph = neg_graph.to(device)
blocks = [block.int().to(device) for block in blocks]
blocks = [block.int() for block in blocks]
Copy link
Collaborator Author

@chang-l chang-l Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since pos_graph, neg_graph, blocks all reside in the same device as dataloader (L105), moving to device is redundant.

@dgl-bot
Copy link
Collaborator

dgl-bot commented Jul 12, 2022

Commit ID: 926eade

Build ID: 2

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@yaox12 yaox12 merged commit c56e27a into dmlc:master Jul 13, 2022
@chang-l chang-l deleted the fix-gpu-dataindx-alignment branch July 13, 2022 20:39
BarclayII pushed a commit to BarclayII/dgl that referenced this pull request Aug 10, 2022
…4242)

* First update to fix two examples

* Update to fix RGCN/graphsage example and dataloader

* Update
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants