Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix][NCCL] Release NCCL thread_local resources in destructor #17078

Merged
merged 1 commit into from
Jun 12, 2024

Conversation

Lunderberg
Copy link
Contributor

Prior to this commit, allocations performed by ncclCommInitRank had no corresponding call to ncclCommDestroy. While ncclCommDestroy does occur in the CCLThreadLocalContext::Clear method, there are no calls into this method. On worker processes, the failure to call ncclCommDestroy typically had little effect. Any destruction would occur shortly before the process closes, and so resources would be reclaimed by the OS when the process terminates.

However, worker0 of a Disco session is a separate thread, rather than a separate process. While this allows it to easily receive data from the controller thread, resources allocated by worker0 are not reclaimed by the OS until the entire process terminates. As a result, the CCLThreadLocalContext leaked GPU memory, as the ncclCommInitRank call at the start of each
tvm.runtime.disco.ProcessSession was never de-allocated. The increase in GPU memory usage was about 1 gigabyte for each ProcessSession.

This commit updates CCLThreadLocalContext to have a destructor that calls the Clear method. For worker0, this is called when the thread is joined to the main thread.

Prior to this commit, allocations performed by `ncclCommInitRank` had
no corresponding call to `ncclCommDestroy`.  While `ncclCommDestroy`
does occur in the `CCLThreadLocalContext::Clear` method, there are no
calls into this method.  On worker processes, the failure to call
`ncclCommDestroy` typically had little effect.  Any destruction would
occur shortly before the process closes, and so resources would be
reclaimed by the OS when the process terminates.

However, worker0 of a Disco session is a separate thread, rather than
a separate process.  While this allows it to easily receive data from
the controller thread, resources allocated by worker0 are not
reclaimed by the OS until the entire process terminates.  As a result,
the `CCLThreadLocalContext` leaked GPU memory, as the
`ncclCommInitRank` call at the start of each
`tvm.runtime.disco.ProcessSession` was never de-allocated.  The
increase in GPU memory usage was about 1 gigabyte for each
`ProcessSession`.

This commit updates `CCLThreadLocalContext` to have a destructor that
calls the `Clear` method.  For worker0, this is called when the thread
is joined to the main thread.
@vinx13 vinx13 merged commit 0984e97 into apache:main Jun 12, 2024
19 checks passed
@Lunderberg Lunderberg deleted the nccl_cleanup_ccl_resources branch June 13, 2024 11:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants