This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add async GPU dependency Engine #20331

Merged
merged 8 commits into from Oct 30, 2021

Conversation

Kh4L
Contributor

@Kh4L Kh4L commented Jun 4, 2021

Description

Changes how GPU operations are synchronized in the ThreadedEngine.

Previously, GPU operators had to complete before their output variables (either NDArray or Resource) could be tagged as readable: the engine workers had to call cudaStreamSynchronize on the GPU stream to know when a variable was ready to read. This prevented optimal overlapping of GPU kernels with CPU operations.

This PR introduces a sync mechanism which leverages cudaEvent to avoid host synchronization between GPU operations.
GPU writing operators tag the variables with a cudaEvent-based sync object, and operators reading the variables sync or wait on these objects.
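
As a rough sketch of the underlying CUDA pattern (placeholders only, not the engine code: the kernels, streams and event names here are assumed), the writer records an event on its stream after launching its kernels, and a reader makes its own stream wait on that event without blocking the host:

  // Sketch of the cudaEvent-based dependency pattern (placeholders, not the
  // actual engine code).
  cudaEvent_t ready;
  cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

  // Writer: schedule kernels on writer_stream, then capture them in an event.
  producer_kernel<<<grid, block, 0, writer_stream>>>(data);
  cudaEventRecord(ready, writer_stream);

  // Reader: make reader_stream wait for the event before its kernels run.
  // The dependency is enforced by the CUDA driver; the host never blocks.
  cudaStreamWaitEvent(reader_stream, ready, 0);
  consumer_kernel<<<grid, block, 0, reader_stream>>>(data);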

Illustration

[Screenshot 2021-06-04 at 17 10 43]

Note

Use the MXNET_ENGINE_TYPE environment variable to enable the new async engine feature by appending Async to the engine type:

ThreadedEngine -> ThreadedEngineAsync
ThreadedEnginePerDevice -> ThreadedEnginePerDeviceAsync

When the Async suffix is not set, GPU operations are synchronized as before, with a host-device synchronization after the op function. For example, MXNET_ENGINE_TYPE=ThreadedEnginePerDeviceAsync enables the async behavior on the default per-device engine.

Detailed changes

The changes primarily concern the ThreadedEnginePerDevice and the ThreadedEnginePooled.

Removing synchronization

This PR removes all the stream synchronizations that used to be called after the operator function, such as rctx.get_stream<gpu>()->Wait();.

SyncObject

A new structure, SyncObject, is added to the engine variable. It carries the CUDA events used to defer the data dependencies to the CUDA driver.
Each variable (NDArray or Resource) handled by the engine therefore owns a SyncObject.

There are two categories of events: reader events and writer events.

A sync object can have 0 or more reader events: several operators might be waiting to read this variable.
However, it can carry only 0 or 1 writer event: only one writer can write to the variable at a time.
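
As a paraphrased sketch (field names taken from the review excerpts further down this page; the CUDAEvent wrapper and the exact types are assumptions), the per-variable state looks roughly like this:

  #include <cuda_runtime.h>
  #include <cstdint>
  #include <memory>
  #include <vector>

  // Minimal event wrapper assumed for these sketches; the PR's type differs.
  class CUDAEvent {
   public:
    CUDAEvent()  { cudaEventCreateWithFlags(&event_, cudaEventDisableTiming); }
    ~CUDAEvent() { cudaEventDestroy(event_); }
    cudaEvent_t GetEvent() const { return event_; }
   private:
    cudaEvent_t event_;
  };

  // One recorded event plus where it came from.
  struct EventInfo {
    std::weak_ptr<CUDAEvent> event;  // expires once the pool recycles the event
    cudaStream_t stream;             // stream the event was recorded on
    uint64_t pool_index;             // ordering counter from the per-stream pool
  };

  // Per-variable synchronization state.
  struct SyncObject {
    std::vector<EventInfo> reader_events;  // 0..N pending readers
    std::vector<EventInfo> writer_event;   // holds at most one entry
  };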

New functions: OnCompleteGPU, OnStartGPU and OnStartCPU

Previously, an OnComplete callback was called by the engine worker right after executing the op function; its main purpose was to update the data dependencies. This PR adds a new OnComplete override and another callback type, the on-start callback.
Three new hooks were added (see the sketch after this list):

  • OnCompleteGPU: the existing OnComplete callback gets a new GPU version, OnCompleteGPU. This function captures the kernels scheduled on the stream in an event and assigns it to the SyncObject of each variable.

  • OnStartGPU: this hook is called before GPU op functions and iterates through all the events of the variables read and written by the function. It calls cudaStreamWaitEvent to guarantee that all the needed variables will be ready before the current function's kernels are scheduled.

  • OnStartCPU: this hook acts similarly to the previous one, but it is called before CPU op functions and uses host-device synchronization with cudaEventSynchronize.
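
A simplified sketch of the three hooks, reusing the SyncObject/EventInfo/CUDAEvent sketch above (the real hooks take engine-internal arguments; this only shows the event handling, and OnCompleteGPU_Sketch only shows the writer-event side):

  // Illustrative only: not the actual engine signatures.
  void OnStartGPU_Sketch(SyncObject& var, cudaStream_t op_stream) {
    // Defer the dependency to the CUDA driver: the op's kernels will not start
    // before every event recorded for this variable has completed.
    for (auto& e : var.reader_events)
      if (auto ev = e.event.lock()) cudaStreamWaitEvent(op_stream, ev->GetEvent(), 0);
    for (auto& e : var.writer_event)
      if (auto ev = e.event.lock()) cudaStreamWaitEvent(op_stream, ev->GetEvent(), 0);
  }

  void OnCompleteGPU_Sketch(SyncObject& out, cudaStream_t op_stream,
                            std::shared_ptr<CUDAEvent> event, uint64_t pool_index) {
    // Capture the kernels already scheduled on op_stream in an event and make
    // it the single writer event of the output variable.
    cudaEventRecord(event->GetEvent(), op_stream);
    out.reader_events.clear();
    out.writer_event.clear();
    out.writer_event.push_back({event, op_stream, pool_index});
  }

  void OnStartCPU_Sketch(SyncObject& var) {
    // A CPU consumer blocks the host until the GPU data is actually ready.
    for (auto& e : var.writer_event)
      if (auto ev = e.event.lock()) cudaEventSynchronize(ev->GetEvent());
  }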

These changes result in a new signature for the Engine's async function:
Before:

  typedef std::function<void(RunContext, CallbackOnComplete)> AsyncFn;

After:

  typedef std::function<void(RunContext, CallbackOnStart, CallbackOnComplete)> AsyncFn;

CUDAEventPool

The events are managed by a CUDAEventPool that contains a counter. This is needed both for performance and to track the event order.
In the ThreadedEnginePerDevice, there is one pool per worker. In the ThreadedEnginePooled, there is one CUDAEventPool per stream, owned by the stream manager.
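
A minimal sketch of the idea behind the pool (illustrative only; capacity and reuse policy here are assumptions, and it reuses the CUDAEvent wrapper sketched above): events are handed out together with a monotonically increasing counter, so two events from the same per-stream pool can be ordered by comparing their pool_index.

  #include <array>
  #include <utility>

  // Per-stream event pool with an ordering counter (not the PR's implementation).
  class CUDAEventPoolSketch {
   public:
    // Returns a reusable event together with its monotonically increasing index.
    std::pair<std::shared_ptr<CUDAEvent>, uint64_t> GetNextEvent() {
      const uint64_t idx = counter_++;
      auto& slot = events_[idx % events_.size()];
      if (!slot) slot = std::make_shared<CUDAEvent>();
      return {slot, idx};  // idx becomes EventInfo::pool_index
    }
   private:
    std::array<std::shared_ptr<CUDAEvent>, 64> events_;
    uint64_t counter_ = 0;
  };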

Pooled allocator

The PooledStorageManager is also aware of the sync objects so that it can recycle them with the data Chunk. The free function now also recycles the SyncObject, which can later be synchronized at alloc time.
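
One possible way to picture this (an assumption-heavy sketch, not the PooledStorageManager code; the block layout and the choice of stream-side synchronization are simplifications): a freed block keeps its SyncObject, and any pending events are waited on when the block is handed out again.

  // Sketch only: a pooled GPU block keeps its SyncObject across free/alloc,
  // and the allocating stream waits on any still-pending events before reuse.
  struct PooledBlockSketch {
    void* dptr = nullptr;
    size_t size = 0;
    SyncObject sync;  // recycled together with the memory
  };

  void* ReuseFromPool(PooledBlockSketch& block, cudaStream_t alloc_stream) {
    for (auto& e : block.sync.reader_events)
      if (auto ev = e.event.lock()) cudaStreamWaitEvent(alloc_stream, ev->GetEvent(), 0);
    for (auto& e : block.sync.writer_event)
      if (auto ev = e.event.lock()) cudaStreamWaitEvent(alloc_stream, ev->GetEvent(), 0);
    block.sync.reader_events.clear();
    block.sync.writer_event.clear();
    return block.dptr;
  }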

WaitAll, WaitToRead and WaitToWrite updated

The NDArray synchronization methods WaitAll, WaitToRead and WaitToWrite now call cudaEventSynchronize to guarantee that the GPU data is ready.
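
For example (a simplified view, not the actual NDArray code), WaitToRead only needs the host to block on the variable's writer event:

  // Simplified view of the updated WaitToRead: block the host until the event
  // recorded by the last GPU writer of this variable has completed.
  void WaitToReadSketch(SyncObject& sync) {
    for (auto& e : sync.writer_event)
      if (auto ev = e.event.lock()) cudaEventSynchronize(ev->GetEvent());
  }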

TODO:

  • Re-enable the byteps test, once byteps is updated

Co-author: @ptrendx

@mxnet-bot

Hey @Kh4L, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-cpu, centos-gpu, sanity, windows-cpu, miscellaneous, centos-cpu, unix-gpu, windows-gpu, edge, website, clang]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@Kh4L
Contributor Author

Kh4L commented Jun 14, 2021

@mxnet-bot run ci [centos-cpu, centos-gpu, edge, unix-cpu, unix-gpu, windows-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu, windows-gpu, edge, unix-gpu, centos-gpu, centos-cpu]

@TristonC
Contributor

@szha @sandeep-krishnamurthy Do you have someone in your team or elsewhere for a code review?

@szha
Member

szha commented Jun 22, 2021

I plan to take a closer look this week. Could @ptrendx also take a look?

@ptrendx ptrendx self-requested a review June 25, 2021 05:21
include/mxnet/base.h (outdated, resolved)
*/
bool is_bulk;
void *event_pool = nullptr;
Member

Leaving this comment here just as a way to explain why is_bulk is removed. is_bulk was a way to mark operations as not needing synchronization when they are part of a single engine bulk. This PR moves the handling of the synchronization to the engine itself, so this is no longer useful.

@ptrendx
Member

ptrendx commented Jun 25, 2021

I will review this PR next week.

@Kh4L Kh4L changed the title [WIP] Add async GPU depency Engine [WIP] Add async GPU dependency Engine Jun 29, 2021

void ThreadedEngine::OnStartCPU(Engine *engine, void *opr_block,
const dmlc::Error* error) {
static bool use_new_dep_engine = dmlc::GetEnv("MXNET_ASYNC_GPU_ENGINE", false);
Member

We already have MXNET_ENGINE_TYPE which decides whether NaiveEngine (i.e. synchronous scheduler) is used, so unless there's a strong reason I'd prefer not to introduce new environment variables here.

@sandeep-krishnamurthy
Contributor

@apeforest @eric-haibin-lin can you please help review this PR?

@sandeep-krishnamurthy
Contributor

@Zha0q1 @mseth10 can you please help review this PR?

@Kh4L
Contributor Author

Kh4L commented Jul 21, 2021

@sandeep-krishnamurthy what's the review status?

Contributor

@barry-jin barry-jin left a comment

Thanks for the great work! The PR looks good to me. The synchronization in the new dlpack API #20546 will probably rely on the sync_obj introduced here. Looking forward to seeing this new feature merged.

Comment on lines +557 to +568
if ((*events_per_stream)[event_stream].pool_index < cuda_event.pool_index) {
(*events_per_stream)[event_stream] = cuda_event;
Contributor

Why should the event be replaced by one with a larger pool_index?

Member

Because the pool is per stream, so for two events from the same pool the one with the larger pool_index had to be recorded later (and synchronizing with that event will synchronize with both).

Comment on lines +679 to +697
if (sync_obj.writer_event[0].event.expired()) {
sync_obj.writer_event.clear();
Contributor

Could you elaborate on why all the writer events should be removed when the first writer event expires?

Member

There can only be 1 writer event active (vs multiple readers).

Contributor Author

There can be only one event in the writer_event vector, so the clear here could be replaced by pop_back.

Comment on lines +736 to +755
sync_obj.reader_events.clear();
sync_obj.writer_event.clear();
sync_obj.writer_event.push_back({event, worker_stream->stream_, event_pool_idx});
Contributor

This probably answers my last question. But what's the purpose of clearing all the reader and writer events? Is it because the current event should synchronize with the cleared reader and writer events?

Member

Right - since the writer to the variable already had to wait for every previous reader/writer, there is no reason to keep those events anymore.

Comment on lines 121 to 128
CallbackOnStart on_start = this->CreateOnStart(ThreadedEngine::OnStartStatic,
opr_block);
CallbackOnComplete callback = this->CreateCallback(ThreadedEngine::OnCompleteStatic,
opr_block);
Contributor

Why use OnStartStatic and OnCompleteStatic here, instead of using the GPU on-start and on-complete callbacks for the GPU case and the static ones for the CPU case?

Member

The mechanism introduced in this PR is in addition to the dependency tracking that already existed, and OnCompleteStatic contains the dependency-update code (so it has to be called from the GPU version as well).

@Kh4L Kh4L force-pushed the new_dep_engine branch 3 times, most recently from d863554 to ee5a6bb Compare September 8, 2021 16:14
@Kh4L Kh4L changed the title [WIP] Add async GPU dependency Engine Add async GPU dependency Engine Sep 13, 2021
@mseth10 mseth10 removed the pr-work-in-progress PR is still work in progress label Sep 13, 2021
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 13, 2021
@Kh4L
Contributor Author

Kh4L commented Oct 13, 2021

@mxnet-bot run ci [sanity]

@mxnet-bot

Jenkins CI successfully triggered : [sanity]

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Oct 13, 2021
Signed-off-by: Serge Panev <spanev@nvidia.com>
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 13, 2021
Signed-off-by: Serge Panev <spanev@nvidia.com>
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 14, 2021
@Kh4L
Contributor Author

Kh4L commented Oct 18, 2021

@mxnet-bot run ci [unix-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-gpu]

@Kh4L
Contributor Author

Kh4L commented Oct 20, 2021

@mxnet-bot run ci [centos-cpu, unix-cpu, windows-gpu]

@mxnet-bot

Jenkins CI successfully triggered : [windows-gpu, centos-cpu, unix-cpu]

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 20, 2021
@szha
Member

szha commented Oct 20, 2021

@Kh4L glad that the CI passed. @ptrendx @barry-jin could you take a look and see if there's any other concern to be addressed?

Contributor

@barry-jin barry-jin left a comment

LGTM, thanks! My only concern is that use_new_dep_engine is probably not tested in CI and the new environment variable is not documented on the website. But I think it's OK to do these in a separate PR.

Comment on lines +566 to +577
  static inline bool IsEngineAsync() {
    std::string type = dmlc::GetEnv("MXNET_ENGINE_TYPE", std::string(""));
    std::string async_engine_tag("Async");
    auto tag_pos = type.find(async_engine_tag);
    return tag_pos != std::string::npos;
  }

  void ThreadedEngine::OnStartCPU(Engine* engine, void* opr_block, const dmlc::Error* error) {
    static bool use_new_dep_engine = IsEngineAsync();
    if (!use_new_dep_engine) {
      return;
    }
Contributor

I think this part is not tested in CI, because I didn't see any changes that use the Async tag inside the test functions. https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh

Member

Why not test it in the CI now? @Kh4L

@@ -49,7 +49,8 @@ core_logic: {
custom_steps.test_unix_cpp_package_gpu('gpu'),
// TODO(szha): fix and reenable the hanging issue. tracked in #18098
// custom_steps.test_unix_distributed_kvstore_gpu('gpu'),
custom_steps.test_unix_byteps_gpu('gpu'),
// TODO(spanev): reenable when byteps is updated with the new dep engine API
Member

Could you add an issue to track this?

Contributor Author

Good idea, tracked here #20697

Member

@ptrendx ptrendx left a comment

LGTM

Labels
pr-awaiting-review PR is waiting for code review

8 participants