Multi-node training not working on H100 GPUs #446

Open
Tristan-Kosciuch opened this issue Apr 11, 2024 · 3 comments

Tristan-Kosciuch commented Apr 11, 2024

Hello,

We're trying to run musicgen training/fine-tuning from the audiocraft repo using dora. We've been able to run single-node training with dora run -d solver. When we run the same solver on multiple nodes with torchrun, training fails with a thread deadlock error, and the same is true when running with dora launch. We're running on GCP with NVIDIA H100 instances, and I wonder if the H100s are not compatible with some of audiocraft's dependencies.
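
To narrow down whether the deadlock comes from audiocraft or from the cluster itself, one thing we plan to try is a bare torch.distributed all-reduce across the nodes, launched with torchrun and completely independent of audiocraft/dora. A minimal sketch (the LOCAL_RANK handling assumes torchrun is the launcher):

import os

import torch
import torch.distributed as dist

# Tiny NCCL sanity check, independent of audiocraft: launch the same file with
# torchrun on every node. If this all-reduce also hangs, the problem is in the
# cluster / NCCL setup rather than in dora or the musicgen solver.
dist.init_process_group("nccl")  # torchrun provides MASTER_ADDR, RANK, etc.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # should end up equal to the world size on every rank
print(f"rank {dist.get_rank()}: all_reduce ok, value={x.item()}")

dist.destroy_process_group()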

When attempting to use dora grid, the process quickly exits with a "FAI" status and a warning that we cannot change config values:

    raise ConfigCompositionException(
hydra.errors.ConfigCompositionException: Could not override 'lr'.
To append to your config use +lr=0.01

This is the grid we are using:

from itertools import product
from dora import Explorer, Launcher

@Explorer
def explorer(launcher: Launcher):

    sub = launcher.bind(lr=0.01)  # bind some parameter value, in a new launcher
    sub.slurm_(gpus=16)  # all jobs scheduled with `sub` will use 16 gpus.

    sub()  # Job with lr=0.01 and 16 gpus.
    sub.bind_(epochs=40)  # in-place version of bind()
    sub.slurm(partition="h100")(batch_size=16)  # lr=0.01, 16 gpus, h100, bs=16 and epochs=40.
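
Our current guess is that the override fails because lr is not a key at the top level of the musicgen config; in the launch command further down we set it as optim.lr. A variant of the grid that binds the fully-qualified Hydra keys instead (not yet verified on our side, and assuming bind() accepts a dict of dotted keys) would look like:

from dora import Explorer, Launcher

@Explorer
def explorer(launcher: Launcher):
    # Same structure as above, but binding the dotted Hydra keys
    # (optim.lr, optim.epochs, dataset.batch_size) that the launch
    # command below uses, instead of bare names like `lr`.
    sub = launcher.bind({"optim.lr": 0.01})
    sub.slurm_(gpus=16)

    sub()  # job with optim.lr=0.01 on 16 gpus
    sub.bind_({"optim.epochs": 40, "dataset.batch_size": 16})
    sub.slurm(partition="h100")()  # same overrides, scheduled on the h100 partition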

A few warnings with dora launch stand out. We have a script that starts dora launch, which we run with sh launch.sh (we don't use sbatch/srun here; should we?). Here's the script:

#!/bin/sh

# these logging exports don't do much
export HYDRA_FULL_ERROR=1
export CUDBG_USE_LEGACY_DEBUGGER=1
# export NVLOG_CONFIG_FILE=${HOME}/nvlog.stdout.config  # overridden by the line below
export NVLOG_CONFIG_FILE=${HOME}/nvlog.config
export WANDB_MODE=offline

cd /home/$USER/audiocraft/

export AUDIOCRAFT_DORA_DIR=/projects/$USER/

export TEAM=$TEAM
export USER=$USER
export NCCL_DEBUG=DEBUG
export LOGLEVEL=INFO

dora launch -a --no_git_save -p h100 -g 16 solver=musicgen/musicgen_32khz \
model/lm/model_scale=small \
conditioner=text2music \
dset=audio/data_32khz \
dataset.num_workers=0 \
continue_from=//pretrained/facebook/musicgen-small \
dataset.valid.num_samples=16 \
dataset.batch_size=64 \
schedule.cosine.warmup=500 \
optim.optimizer=dadam \
optim.lr=1e-4 \
optim.epochs=30 \
slurm.setup=[". /home/$USER/anaconda3/etc/profile.d/conda.sh","conda activate torch_ac"] \
optim.updates_per_epoch=1000

If I run this from our Slurm login node, I get the message below, which makes me question whether training is being attempted on the login node itself:

/home/$USER/anaconda3/envs/torch_ac/lib/python3.10/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
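
For what it's worth, a quick check we can run on whichever node invokes dora launch, just to confirm whether that process can see any GPUs at all (which would explain the NVML warning on the login node). A trivial sketch:

import torch

# Run on the node where `dora launch` is invoked. If this prints False / 0,
# the launching process cannot see any GPUs, which matches the NVML warning
# we get on the login node.
print("cuda available:", torch.cuda.is_available())
print("visible gpus:", torch.cuda.device_count())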

If we run the script from one of the GPU nodes, there is no NVML issue, but training times out with GPU usage stuck at 100% and eventually crashes with a message about the backward pass for gradients. I'll post that log here soon.

Any info is helpful, even an example of how to run dora launch on a Slurm cluster (is it launched with sbatch?).

@nischalj10

Following. Were you able to resolve this?

Tristan-Kosciuch commented Apr 18, 2024

Unfortunately we were not able to; we're still working on it. Have you managed it?

@nischalj10

Nope
