Skip to content

Commit

Permalink
[Bugfix][NCCL] Release NCCL thread_local resources in destructor (#17078
Browse files Browse the repository at this point in the history
)

Prior to this commit, allocations performed by `ncclCommInitRank` had
no corresponding call to `ncclCommDestroy`.  While `ncclCommDestroy`
does occur in the `CCLThreadLocalContext::Clear` method, there are no
calls into this method.  On worker processes, the failure to call
`ncclCommDestroy` typically had little effect.  Any destruction would
occur shortly before the process closes, and so resources would be
reclaimed by the OS when the process terminates.

However, worker0 of a Disco session is a separate thread, rather than
a separate process.  While this allows it to easily receive data from
the controller thread, resources allocated by worker0 are not
reclaimed by the OS until the entire process terminates.  As a result,
the `CCLThreadLocalContext` leaked GPU memory, as the
`ncclCommInitRank` call at the start of each
`tvm.runtime.disco.ProcessSession` was never de-allocated.  The
increase in GPU memory usage was about 1 gigabyte for each
`ProcessSession`.

This commit updates `CCLThreadLocalContext` to have a destructor that
calls the `Clear` method.  For worker0, this is called when the thread
is joined to the main thread.
  • Loading branch information
Lunderberg committed Jun 12, 2024
1 parent eb4f41c commit 0984e97
Show file tree
Hide file tree
Showing 2 changed files with 23 additions and 4 deletions.
12 changes: 12 additions & 0 deletions src/runtime/disco/nccl/nccl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -67,9 +67,21 @@ void InitCCLPerWorker(IntTuple device_ids, std::string unique_id_bytes) {
CCLThreadLocalContext* ctx = CCLThreadLocalContext::Get();
DiscoWorker* worker = DiscoWorker::ThreadLocal();
ICHECK(worker != nullptr);

CHECK_EQ(unique_id_bytes.size(), NCCL_UNIQUE_ID_BYTES)
<< "ValueError: The length of unique_id must be " << NCCL_UNIQUE_ID_BYTES << ", but got "
<< unique_id_bytes.size() << ".";

CHECK(!ctx->comm) << "Cannot initialize CCL, "
<< "the previous thread-global comm still exists, "
<< "and has not been destructed";
CHECK(!ctx->default_stream) << "Cannot initialize CCL, "
<< "the previous thread-global stream still exists, "
<< "and has not been destructed";
CHECK(!ctx->worker) << "Cannot initialize CCL, "
<< "the previous thread-global worker still exists, "
<< "and has not been destructed";

// Step up local context of NCCL
int device_id = device_ids[worker->worker_id];
SetDevice(device_id);
Expand Down
15 changes: 11 additions & 4 deletions src/runtime/disco/nccl/nccl_context.h
Original file line number Diff line number Diff line change
Expand Up @@ -118,16 +118,23 @@ inline ncclDataType_t AsNCCLDataType(runtime::DataType dtype) {
}

struct CCLThreadLocalContext {
DiscoWorker* worker;
DiscoWorker* worker = nullptr;
int device_id;
deviceStream_t default_stream = nullptr;
ncclComm_t comm;
ncclComm_t comm = nullptr;

~CCLThreadLocalContext() { Clear(); }

void Clear() {
NCCL_CALL(ncclCommDestroy(comm));
if (default_stream != nullptr) {
if (comm) {
NCCL_CALL(ncclCommDestroy(comm));
comm = nullptr;
}
if (default_stream) {
StreamDestroy(default_stream);
default_stream = nullptr;
}
worker = nullptr;
}

deviceStream_t GetDefaultStream() {
Expand Down

0 comments on commit 0984e97

Please sign in to comment.