[Performance] In HeteroNodeView, build arange on target device, instead of on CPU and copying it #2266

nv-dlasalle · 2020-10-07T19:31:40Z

Description

Create the arange directly on the target device using the backend, rather than creating it on the CPU and copying it to the target device.

Checklist

Please feel free to remove inapplicable items for your PR.

The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage (covered by test_basics.py's test_update_routines(), test_update_0deg(), etc.)
Code is well-documented
To the best of my knowledge, examples are either not affected by this change,
or have been fixed to be compatible with this change

Changes

The arange method in the backends was modified to accept a device context, and the HeteroNodeView was modified to make use of this, to avoid generating the range on the CPU, and then synchronously copying it up to the GPU.
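As a rough illustration of the difference described above, the following sketch contrasts the two patterns using PyTorch. The helper names `arange_then_copy` and `arange_on_device` are hypothetical and are not DGL's backend functions; the sketch only assumes that the backend's arange can forward a device argument to the framework, and it falls back to the CPU when no GPU is present.

```python
import torch


def arange_then_copy(start, stop, device):
    """Old pattern (hypothetical helper): build the range on the host,
    then copy it to the target device (a host-to-device transfer on GPU)."""
    return torch.arange(start, stop, dtype=torch.int64).to(device)


def arange_on_device(start, stop, device):
    """New pattern (hypothetical helper): ask the framework to allocate
    and fill the range on the target device, with no host-side tensor."""
    return torch.arange(start, stop, dtype=torch.int64, device=device)


# Use the GPU when one is available; otherwise both paths run on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

old = arange_then_copy(0, 5, device)
new = arange_on_device(0, 5, device)

# Both produce the same values on the same device; only the route differs.
print(old.device, new.device, torch.equal(old, new))
```

On a CPU-only machine the two helpers behave identically; the benefit discussed in this PR applies when the target device is a GPU.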

BarclayII · 2020-10-09T09:32:26Z

What was the speedup? Just curious; not knowing the number is fine.

nv-dlasalle · 2020-10-09T17:41:46Z

The overall speedup is pretty small, but in the region immediately following its invocation in the graphsage example, we see a reduction from 1.8 ms to 547 µs (> 3x), as a result of not having to wait for the H2D copy:

Before this change: [image]

With this change: [image]

But as more synchronization points are removed, the impact could be larger. In the above images, synchronizing to check if the COO is sorted is what limits the benefit.

zheng-da · 2020-10-10T05:11:35Z

python/dgl/backend/tensorflow/tensor.py

        return nd.cast_to_signed(nd.from_dlpack(zerocopy_to_dlpack(data)))
-    else:
-        return nd.from_dlpack(zerocopy_to_dlpack(data))
+    return nd.from_dlpack(zerocopy_to_dlpack(data))


Why do we need this modification?

I think the effect is the same. Do you prefer keeping the old code? It doesn't look like a big deal to me.

It's a potential bug (feature?) in TensorFlow's DLPack support: the device reported for an int32 tensor is not accurate. I think this is fixed in the latest TF, but we need to keep the workaround for backward compatibility.

I reverted the change to the else clause in 2abb841.

As to why this method needed to be modified: in the unit tests for int64_t, I ran into the issue that the pointer in the tensor was silently converted to a CPU pointer rather than a GPU pointer. The code was already handling the int32_t case, so I expanded it to also handle the int64_t case. This was with TensorFlow 2.3.1.

This wasn't an issue in the unit tests before this PR, as the tensor was generated on the CPU and then explicitly copied to the GPU.
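The diff above wraps the DLPack import in `nd.cast_to_signed`, which suggests the tensor is handed over with an unsigned dtype and then reinterpreted as signed. The sketch below shows why such a round trip is lossless, using only the Python standard library rather than TensorFlow or DGL. The helper names are hypothetical, and the assumption that the workaround exports unsigned data is an inference from the diff and commit title, not something the thread states outright.

```python
from array import array

MASK64 = (1 << 64) - 1  # keeps the low 64 bits (two's-complement pattern)


def to_unsigned_buffer(ids):
    """Hypothetical helper: store signed 64-bit ids as unsigned 64-bit words,
    preserving the underlying bit pattern."""
    return array("Q", [i & MASK64 for i in ids])


def reinterpret_as_signed(buf):
    """Hypothetical helper: view the same bytes as signed 64-bit integers.
    memoryview.cast does not copy; it only changes how the bytes are read."""
    return memoryview(buf).cast("B").cast("q")


ids = [0, 1, 2, 2**40, -1]
unsigned = to_unsigned_buffer(ids)
signed_view = reinterpret_as_signed(unsigned)

# The signed view recovers the original ids exactly, including the negative one.
print(signed_view.tolist() == ids)
```

Because only the interpretation of the bytes changes, no data is moved or converted when switching back to the signed type.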

BarclayII

I'm good after Da's comment is resolved.

…ad of on CPU and copying it (dmlc#2266)

* Build arange on target device
* Utilize arange device in viewpy:HeteroNodeView.__call__
* Work around uint64 error in TF to_dlpack
* Restore else clause

Co-authored-by: Jinjing Zhou <VoVAllen@users.noreply.github.com>
Co-authored-by: Da Zheng <zhengda1936@gmail.com>

nv-dlasalle force-pushed the device_arange branch from d574a9f to 959a1be on October 9, 2020 17:43

nv-dlasalle added 3 commits October 9, 2020 11:04

Build arange on target device

93b2907

Utilize arange device in viewpy:HeteroNodeView.__call__

14f64a4

Work around uint64 error in TF to_dlpack

e48d87b

nv-dlasalle force-pushed the device_arange branch from 959a1be to e48d87b on October 9, 2020 18:04

zheng-da reviewed Oct 10, 2020

BarclayII approved these changes Oct 12, 2020

VoVAllen and others added 5 commits October 12, 2020 13:57

Merge branch 'master' into device_arange

23fce82

Restore else clause

2abb841

Merge branch 'master' into device_arange

1286525

Merge branch 'master' into device_arange

2221042

Merge branch 'master' into device_arange

5781b0b

zheng-da merged commit c6890c2 into dmlc:master Nov 2, 2020

zheng-da Oct 10, 2020

BarclayII Oct 12, 2020

VoVAllen Oct 12, 2020

nv-dlasalle Oct 12, 2020
