[Bug] Record stream when using another CUDA stream for data transfer #4250

Merged: yaox12 merged 5 commits into dmlc:master from the fix_cuda_stream branch on Jul 14, 2022

yaox12
Collaborator

@yaox12 yaox12 commented Jul 13, 2022

Description

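This PR fixes #4247. pytorch/pytorch#23729 explains why .record_stream() is needed: when a tensor is allocated on one CUDA stream and consumed on another, it must be recorded on the consuming stream, otherwise the caching allocator may reuse its memory while that stream is still using it.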

Known issue:

Subgraph transfer still runs on the default stream because dgl.heterograph does not support .record_stream() for now. Supporting it would require exposing .record_stream() through tensoradaptor for NDArrays allocated by PyTorch.
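
For readers unfamiliar with the pattern, here is a minimal sketch of a tensor transfer on a side stream followed by .record_stream(). It is not the code from this PR: the helper name `to_device_on_side_stream` and its CPU / `None`-stream fallback are hypothetical, chosen only for illustration. It assumes PyTorch is installed and uses only `torch.cuda.stream`, `torch.cuda.current_stream`, `Stream.wait_stream`, and `Tensor.record_stream`.

```python
import torch


def to_device_on_side_stream(cpu_tensor, device, side_stream=None):
    """Copy `cpu_tensor` to `device` on `side_stream` and hand it to the current stream.

    Hypothetical helper for illustration; not part of DGL or PyTorch.
    """
    if side_stream is None or device.type != "cuda":
        # No side stream (or no GPU): fall back to an ordinary copy.
        return cpu_tensor.to(device)

    # The stream that will consume the tensor after the copy.
    consumer = torch.cuda.current_stream(device)

    # Issue the host-to-device copy on the side stream. The result is
    # allocated by the caching allocator on behalf of `side_stream`.
    with torch.cuda.stream(side_stream):
        gpu_tensor = cpu_tensor.to(device, non_blocking=True)

    # Order the consumer stream after the copy.
    consumer.wait_stream(side_stream)

    # Tell the allocator that `consumer` also uses this memory, so the
    # block is not reused until work queued on `consumer` has completed.
    gpu_tensor.record_stream(consumer)
    return gpu_tensor


if __name__ == "__main__":
    batch = torch.arange(8, dtype=torch.float32)
    if torch.cuda.is_available():
        device = torch.device("cuda")
        stream = torch.cuda.Stream(device=device)
        batch = batch.pin_memory()  # pinned memory allows an asynchronous copy
    else:
        device, stream = torch.device("cpu"), None
    out = to_device_on_side_stream(batch, device, stream)
    print(out.device, out.sum().item())
```

Without the `record_stream` call, the tensor's memory could be returned to the side stream's pool and handed out again as soon as the tensor is freed, even though kernels on the consumer stream may still be reading it. That is the kind of intermittent failure described in the linked PyTorch issue.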

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • The related issue is referenced in this PR

@dgl-bot
Collaborator

dgl-bot commented Jul 13, 2022

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Collaborator

dgl-bot commented Jul 13, 2022

Commit ID: 3cb73dd764167643382a04a21a0f131859d1cdbd

Build ID: 4

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@yaox12 yaox12 added the "Release Candidate" label (Candidate PRs for the upcoming release) Jul 14, 2022
@jermainewang
Member

Is the change required for a specific PyTorch version? Will it break other versions?

@BarclayII
Collaborator

The semantics seem to be present in at least 1.9.0. I think it's OK if we target 1.9.0+.

@yaox12
Collaborator Author

yaox12 commented Jul 14, 2022

> Is the change required for a specific PyTorch version? Will it break other versions?

No. tensor.record_stream() has existed since PyTorch 1.3 or even earlier.

@jermainewang
Member

Do we have unit tests for this feature?

@dgl-bot
Collaborator

dgl-bot commented Jul 14, 2022

Commit ID: d805d02d595cacb27e05bf5ca84373695291283b

Build ID: 5

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 14, 2022

Commit ID: 4ed13f4

Build ID: 6

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 14, 2022

Commit ID: 38490c6

Build ID: 7

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@yaox12 yaox12 merged commit 82ca781 into dmlc:master Jul 14, 2022
@yaox12 yaox12 deleted the fix_cuda_stream branch July 14, 2022 12:02
BarclayII pushed a commit to BarclayII/dgl that referenced this pull request Aug 10, 2022
…mlc#4250)

* record stream when using another cuda stream for data transfer

* fix linting

* fix None stream
@frozenbugs frozenbugs removed the "Release Candidate" label (Candidate PRs for the upcoming release) Jan 11, 2023
Successfully merging this pull request may close these issues.

[Example][Bug] Random crash of a multi-gpu run in graphsage example folder
5 participants