Less cudaGet/SetDevice calls in Gluon execution #13764

ptrendx · 2019-01-02T22:58:01Z

Description

This PR reduces the number of cudaGetDevice/cudaSetDevice calls during Gluon execution.
Previously, on every call to allocate or free a buffer in StorageManager, DeviceStore would call cudaGetDevice once and cudaSetDevice twice (to get the current device, switch to the new device, and finally restore the original device), even when no actual allocation took place because the caching allocator served the request.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

Changes are complete (i.e. I finished coding on this PR)
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

This PR changes the DeviceStore so that cudaSetDevice calls are made only when necessary (when the new device differs from the original device).
This PR changes the Storage so that only the actual GPU and pinned CPU allocations are guarded by the DeviceStore; memory returned from the allocator cache does not need to be guarded.
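
The two changes above can be illustrated with a minimal sketch. It is not the code from this PR: `fake_cuda`, `ScopedDevice`, and `ToyPool` are hypothetical names, and the `fake_cuda` functions are stand-ins for cudaGetDevice/cudaSetDevice that only count calls, so the example compiles without a GPU. The guard switches devices only when the requested device differs from the current one, and the toy pool takes the guard only on a cache miss.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for cudaGetDevice / cudaSetDevice that count calls.
namespace fake_cuda {
int current_device = 0;
int get_calls = 0;
int set_calls = 0;
void GetDevice(int* dev) { ++get_calls; *dev = current_device; }
void SetDevice(int dev) { ++set_calls; current_device = dev; }
}  // namespace fake_cuda

// RAII guard (illustrative, not the real DeviceStore): switches device only
// when the requested one differs, and restores only if it switched.
class ScopedDevice {
 public:
  explicit ScopedDevice(int requested) {
    fake_cuda::GetDevice(&restore_device_);
    switched_ = (requested != restore_device_);
    if (switched_) fake_cuda::SetDevice(requested);
  }
  ~ScopedDevice() {
    if (switched_) fake_cuda::SetDevice(restore_device_);
  }
  ScopedDevice(const ScopedDevice&) = delete;
  ScopedDevice& operator=(const ScopedDevice&) = delete;

 private:
  int restore_device_ = 0;
  bool switched_ = false;
};

constexpr std::size_t kBlockSize = 256;

// Toy caching allocator: only a cache miss is guarded by ScopedDevice.
class ToyPool {
 public:
  explicit ToyPool(int device) : device_(device) {}
  ~ToyPool() {
    for (char* p : free_list_) delete[] p;
  }
  char* Alloc() {
    if (!free_list_.empty()) {  // cache hit: no device calls at all
      char* p = free_list_.back();
      free_list_.pop_back();
      return p;
    }
    ScopedDevice guard(device_);  // cache miss: guard the real allocation
    return new char[kBlockSize];  // stand-in for a device allocation
  }
  void Free(char* p) {
    assert(p != nullptr);
    free_list_.push_back(p);  // returned to the cache, no device calls
  }

 private:
  int device_;
  std::vector<char*> free_list_;
};
```

In this sketch, constructing `ScopedDevice` for the device that is already active costs one get call and no set calls, and an `Alloc` served from the free list makes no device calls at all.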

Comments

There is one more place that introduces potentially needless cudaSetDevice calls when using Gluon: https://github.com/apache/incubator-mxnet/blob/master/src/engine/threaded_engine_perdevice.cc#L99 is called when a temporary ThreadedOpr object is destroyed after either normal or bulked execution of ops. It seems, though, that this should be handled by caching ThreadedOpr objects and engine variables, since destroying them after every op seems quite costly. @eric-haibin-lin FYI

Roshrini · 2019-01-03T18:13:32Z

@ptrendx Can you look into the failing CI builds?
@mxnet-label-bot Add [pr-awaiting-review, Gluon]

eric-haibin-lin

Thanks for the fix! One question:

eric-haibin-lin · 2019-01-04T06:57:24Z

src/kvstore/comm.h

    for (int i = 0; i < n; ++i) {
-     device_store.SetDevice(gpus[i]);
+      // Restores active device to what it was before EnableP2P
+      mxnet::common::cuda::DeviceStore device_store(gpus[i]);


Is cudaGetDevice costly? This change would cause 2x cudaGetDevice calls.

This code is executed only during initialization, so I'm not concerned about its performance. (To answer your question, though: cudaGetDevice is slightly less costly than cudaSetDevice.)
I made the change here just because it then becomes a real RAII guard instead of just a setdevice call.
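
To make the RAII point concrete, here is a minimal sketch contrasting a bare set-device loop with a scoped guard per iteration. It is not the code in comm.h: `fake_cuda`, `ScopedDevice`, `VisitWithBareSet`, and `VisitWithGuard` are hypothetical names, and the `fake_cuda` functions merely stand in for cudaGetDevice/cudaSetDevice so the example runs without a GPU.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for cudaGetDevice / cudaSetDevice.
namespace fake_cuda {
int current_device = 0;
void GetDevice(int* dev) { *dev = current_device; }
void SetDevice(int dev) { current_device = dev; }
}  // namespace fake_cuda

// Illustrative RAII guard: remembers the active device on construction and
// restores it on destruction if it had to switch.
class ScopedDevice {
 public:
  explicit ScopedDevice(int requested) {
    fake_cuda::GetDevice(&restore_device_);
    switched_ = (requested != restore_device_);
    if (switched_) fake_cuda::SetDevice(requested);
  }
  ~ScopedDevice() {
    if (switched_) fake_cuda::SetDevice(restore_device_);
  }
  ScopedDevice(const ScopedDevice&) = delete;
  ScopedDevice& operator=(const ScopedDevice&) = delete;

 private:
  int restore_device_ = 0;
  bool switched_ = false;
};

// Bare set-device loop: whichever device came last stays active afterwards.
int VisitWithBareSet(const std::vector<int>& gpus) {
  for (int gpu : gpus) {
    fake_cuda::SetDevice(gpu);
    // ... per-device setup would go here ...
  }
  return fake_cuda::current_device;
}

// Guarded loop: each iteration's guard restores the caller's device when it
// goes out of scope, so the loop leaves the active device unchanged.
int VisitWithGuard(const std::vector<int>& gpus) {
  for (int gpu : gpus) {
    ScopedDevice guard(gpu);
    assert(fake_cuda::current_device == gpu);
    // ... per-device setup would go here ...
  }
  return fake_cuda::current_device;
}
```

Starting from device 0 and visiting devices 1, 2, 3, the bare loop in this sketch returns with device 3 active, while the guarded loop returns with device 0 still active.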

eric-haibin-lin · 2019-01-04T07:01:18Z

@ctcyang could you also take a look?

ptrendx · 2019-01-04T18:13:24Z

Not sure why the website check is showing as pending - it seems to have finished successfully in Details view.

ctcyang

Nice work! The only nitpick I have is that after these changes, the only place where cudaSetDevice is still used directly is: https://github.com/apache/incubator-mxnet/blob/e9a7aa42ec380d92b1623025d6434b8856724402/src/engine/threaded_engine_pooled.cc#L136

Could you change that to use this new API too?

ctcyang

LGTM

* Remove unnecessary cudaGetDevice/cudaSetDevice calls
* Fixes for the DeviceGuard
* Retrigger CI
* Fix for possible invalid device ordinal when using DeviceStore while driver is unloading
* Fix for RTC when the driver API call is the first call
* Added DeviceStore to pooled engine

ptrendx added 2 commits January 2, 2019 14:37

Remove unnecessary cudaGetDevice/cudaSetDevice calls

ce44b5e

Fixes for the DeviceGuard

910e43c

marcoabreu added the Gluon and pr-awaiting-review labels Jan 3, 2019

Retrigger CI

9923764

ptrendx mentioned this pull request Jan 3, 2019

I found a bug in the source code, I don't know how to define it, but I commented out that the code will run better. #13747

Closed

ptrendx added 2 commits January 3, 2019 11:16

Fix for possible invalid device ordinal when using DeviceStore while driver is unloading

fe85640

Fix for RTC when the driver API call is the first call

74efe6e

eric-haibin-lin reviewed Jan 4, 2019

ctcyang reviewed Jan 4, 2019

Added DeviceStore to pooled engine

d0b26a0

ctcyang approved these changes Jan 4, 2019

eric-haibin-lin approved these changes Jan 5, 2019

eric-haibin-lin merged commit 863fb86 into apache:master Jan 5, 2019

perdasilva mentioned this pull request Mar 13, 2019

[WIP] v1.4.x: Backports fewer cudaGet/SetDevice calls PR #14419

Closed
