[Performance] In HeteroNodeView, build arange on target device, instead of on CPU and copying it #2266

Merged
merged 8 commits into from
Nov 2, 2020


nv-dlasalle
Collaborator

@nv-dlasalle nv-dlasalle commented Oct 7, 2020

Description

Create the arange directly on the target device using the backend, rather than creating it on the CPU and copying it to the target device.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage (covered by test_basics.py's test_update_routines(), test_update_0deg(), etc.)
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change

Changes

The arange method in each backend was modified to accept a device context, and HeteroNodeView was modified to use it. This avoids generating the range on the CPU and then synchronously copying it up to the GPU.
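To illustrate the difference between the two code paths, here is a minimal, framework-free sketch. `FakeTensor`, `FakeBackend`, `node_ids_old`, and `node_ids_new` are hypothetical names invented for this illustration and are not DGL's actual API; the fake backend only counts host-to-device (H2D) copies so the effect of passing a device context to `arange` is visible. In PyTorch, the analogous idea is passing `device=` to `torch.arange` instead of calling `.to(device)` on a CPU tensor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a framework tensor: values plus a device tag."""
    values: List[int]
    device: str


@dataclass
class FakeBackend:
    """Hypothetical backend that counts host-to-device (H2D) copies."""
    h2d_copies: int = 0

    def arange(self, start: int, stop: int, ctx: str = "cpu") -> FakeTensor:
        # The range is created directly on `ctx`; no copy is involved.
        return FakeTensor(list(range(start, stop)), ctx)

    def copy_to(self, tensor: FakeTensor, ctx: str) -> FakeTensor:
        # Moving CPU data to another device is the synchronous copy to avoid.
        if tensor.device == "cpu" and ctx != "cpu":
            self.h2d_copies += 1
        return FakeTensor(list(tensor.values), ctx)


def node_ids_old(backend: FakeBackend, num_nodes: int, ctx: str) -> FakeTensor:
    # Old pattern: build the range on the CPU, then copy it to the device.
    return backend.copy_to(backend.arange(0, num_nodes), ctx)


def node_ids_new(backend: FakeBackend, num_nodes: int, ctx: str) -> FakeTensor:
    # New pattern: ask the backend to build the range on the device itself.
    return backend.arange(0, num_nodes, ctx)
```

Both functions return the same values on the same device; only the old pattern increments the copy counter.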

@BarclayII
Collaborator

How was the speed up? Just curious; not knowing the number is fine.

@nv-dlasalle
Collaborator Author

The overall speed-up is pretty small, but in the region immediately following its invocation in the graphsage example, we see a reduction from 1.8 ms to 547 us (> 3x), as a result of not having to wait for the H2D copy:

Before this change: [image: original_arange]

With this change: [image: this_branch_arange]

But as more synchronization points are removed, the impact could be larger. In the above images, synchronizing to check if the COO is sorted is what limits the benefit.

        return nd.cast_to_signed(nd.from_dlpack(zerocopy_to_dlpack(data)))
    else:
        return nd.from_dlpack(zerocopy_to_dlpack(data))
    return nd.from_dlpack(zerocopy_to_dlpack(data))
Collaborator

why do we need this modification?

Collaborator

I think the effect is the same. Do you prefer keeping the old code? It doesn't look like a big deal to me.

Collaborator

It's a potential bug (feature?) in TensorFlow's DLPack support: the device reported for an int32 tensor is not accurate. I think this is fixed in the latest TF, but we need to keep it for backward compatibility.

Collaborator Author

I reverted the change to the else clause in 2abb841.

As for why this method needed to be modified: in the unit tests for int64_t, I ran into the issue that the pointer in the tensor was silently converted to a CPU pointer rather than a GPU pointer. The code was already handling the int32_t case, so I expanded it to also handle the int64_t case. This was with TensorFlow 2.3.1.

This wasn't an issue in the unit tests before this PR, as the tensor was generated on the CPU, and then explicitly copied to the GPU.
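The thread does not show the full workaround, but the `cast_to_signed` call in the snippet under review suggests that signed integer data is exported as its unsigned counterpart and cast back afterwards. The sketch below only illustrates, under that assumption, why such a round trip loses nothing: reinterpreting a 64-bit signed integer as unsigned and back preserves the bit pattern. The helper names `to_unsigned64` and `cast_to_signed64` are hypothetical and use only the standard library; they are not the DGL or TensorFlow functions.

```python
import ctypes
from typing import List


def to_unsigned64(values: List[int]) -> List[int]:
    """Reinterpret signed 64-bit integers as unsigned (same bit pattern)."""
    return [ctypes.c_uint64(v).value for v in values]


def cast_to_signed64(values: List[int]) -> List[int]:
    """Reinterpret unsigned 64-bit integers as signed (same bit pattern)."""
    return [ctypes.c_int64(v).value for v in values]


# Node IDs are non-negative in practice; negative values are included here
# only to show that the round trip holds for the full int64 range.
ids = [0, 1, 42, -1, 2**63 - 1, -(2**63)]
round_tripped = cast_to_signed64(to_unsigned64(ids))
```

Because the two casts are exact inverses on 64-bit values, the data that comes back after the export is identical to what went in, whichever device it ends up on.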

Collaborator

@BarclayII BarclayII left a comment

I'm good after Da's comment is resolved.

@zheng-da zheng-da merged commit c6890c2 into dmlc:master Nov 2, 2020
VoVAllen added a commit to VoVAllen/dgl that referenced this pull request Nov 13, 2020
…ad of on CPU and copying it (dmlc#2266)

* Build arange on target device

* Utilize arange device in viewpy:HeteroNodeView.__call__

* Work around uint64 error in TF to_dlpack

* Restore else clause

Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
Co-authored-by: Da Zheng <zhengda1936@gmail.com>
BarclayII pushed a commit to BarclayII/dgl that referenced this pull request Nov 27, 2020