chore: update images for release #2672

Merged
merged 1 commit into from
Jul 7, 2021

mackrorysd (Member) commented Jul 7, 2021

Description

The combination of PyTorch 1.9 and the HOROVOD_GPU_ALLREDUCE=NCCL option (which uses NCCL for the all-reduce operation only) has been resulting in extra processes on the GPUs and consequent memory errors. Configuring NCCL for all operations appears to eliminate the problem, and we probably should have been doing that anyway. This change also works around a recent regression in Pillow by pinning to an older release (python-pillow/Pillow#5571).

See also: determined-ai/environments#110
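
As a rough illustration of the configuration change described above, the sketch below builds an environment for a Horovod build that selects NCCL for all GPU operations instead of all-reduce only. It assumes the switch is made through Horovod's `HOROVOD_GPU_OPERATIONS` build flag and that the per-operation flags must be dropped when it is set; the helper name `horovod_build_env` is hypothetical, and the actual change lives in the image build referenced above, not in this snippet.

```python
import os

# Per-operation Horovod build flags. Assumption: these conflict with
# HOROVOD_GPU_OPERATIONS, so the sketch removes them before setting it.
PER_OPERATION_FLAGS = (
    "HOROVOD_GPU_ALLREDUCE",
    "HOROVOD_GPU_ALLGATHER",
    "HOROVOD_GPU_BROADCAST",
)


def horovod_build_env(base_env=None):
    """Return a copy of the environment that selects NCCL for all GPU operations.

    Hypothetical helper for illustration; it does not run any build itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    for flag in PER_OPERATION_FLAGS:
        env.pop(flag, None)  # drop the all-reduce-only style settings
    env["HOROVOD_GPU_OPERATIONS"] = "NCCL"
    return env


if __name__ == "__main__":
    old_style = {"HOROVOD_GPU_ALLREDUCE": "NCCL", "PATH": "/usr/bin"}
    print(horovod_build_env(old_style))
```

The returned dictionary could be passed as `env=` to whatever process installs Horovod in the image; the input mapping is left unmodified.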

Test Plan

https://app.circleci.com/pipelines/github/determined-ai/determined?branch=all_operations_nccl

At the time of this writing, one job is still running but the rest have passed.

mackrorysd requested a review from shiyuann as a code owner July 7, 2021 02:49
cla-bot (bot) added the cla-signed label Jul 7, 2021
mackrorysd requested a review from ioga July 7, 2021 02:49
mackrorysd enabled auto-merge (squash) July 7, 2021 03:33
mackrorysd disabled auto-merge July 7, 2021 13:31
mackrorysd merged commit 9aa6758 into master Jul 7, 2021

mackrorysd (Member, Author) commented Jul 7, 2021

Thank you, auto-merge. CircleCI finished last night, but apparently the GitHub side doesn't recognize that.

Edit: I suspect it's because the PR was from an upstream branch and GitHub was waiting for all the optional on-hold jobs. But I had already run those separately from the PR.

bensomers pushed a commit that referenced this pull request Jul 14, 2021
dzhu deleted the all_operations_nccl branch April 19, 2022 19:02
dannysauer modified the milestones: 0.0.102, 0.16.2 Feb 6, 2024