Multi-node training not working on H100 GPUs #446

Open
Tristan-Kosciuch opened this issue Apr 11, 2024 · 3 comments

Tristan-Kosciuch commented Apr 11, 2024

Hello,

We're trying to run musicgen training/fine-tuning from the audiocraft repo using dora. We've been able to run single-node training with dora run -d solver. When we run the same solver on multiple nodes with torchrun, training fails with a thread deadlock error, and the same is true when running with dora launch. We're running on GCP with NVIDIA H100 instances, and I wonder if the H100s are not compatible with some of audiocraft's dependencies.
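
To narrow down whether the deadlock comes from audiocraft or from the cluster itself, one thing we plan to try is a bare torch.distributed all-reduce across the nodes, launched with torchrun and completely independent of audiocraft/dora. A minimal sketch (the LOCAL_RANK handling assumes torchrun is the launcher):

import os

import torch
import torch.distributed as dist

# Tiny NCCL sanity check, independent of audiocraft: launch the same file with
# torchrun on every node. If this all-reduce also hangs, the problem is in the
# cluster / NCCL setup rather than in dora or the musicgen solver.
dist.init_process_group("nccl")  # torchrun provides MASTER_ADDR, RANK, etc.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # should end up equal to the world size on every rank
print(f"rank {dist.get_rank()}: all_reduce ok, value={x.item()}")

dist.destroy_process_group()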

When attempting to use dora grid, the process quickly exits with a "FAI" status and a warning that we cannot change config values:

    raise ConfigCompositionException(
hydra.errors.ConfigCompositionException: Could not override 'lr'.
To append to your config use +lr=0.01

This is the grid we are using:

from itertools import product
from dora import Explorer, Launcher

@Explorer
def explorer(launcher: Launcher):

    sub = launcher.bind(lr=0.01)  # bind some parameter value, in a new launcher
    sub.slurm_(gpus=16)  # all jobs scheduled with `sub` will use 16 gpus.

    sub()  # Job with lr=0.01 and 16 gpus.
    sub.bind_(epochs=40)  # in-place version of bind()
    sub.slurm(partition="h100")(batch_size=16)  # lr=0.01, 16 gpus, h100, bs=16 and epochs=40.
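
Our current guess is that the override fails because lr is not a key at the top level of the musicgen config; in the launch command further down we set it as optim.lr. A variant of the grid that binds the fully-qualified Hydra keys instead (not yet verified on our side, and assuming bind() accepts a dict of dotted keys) would look like:

from dora import Explorer, Launcher

@Explorer
def explorer(launcher: Launcher):
    # Same structure as above, but binding the dotted Hydra keys
    # (optim.lr, optim.epochs, dataset.batch_size) that the launch
    # command below uses, instead of bare names like `lr`.
    sub = launcher.bind({"optim.lr": 0.01})
    sub.slurm_(gpus=16)

    sub()  # job with optim.lr=0.01 on 16 gpus
    sub.bind_({"optim.epochs": 40, "dataset.batch_size": 16})
    sub.slurm(partition="h100")()  # same overrides, scheduled on the h100 partition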

A few warnings with dora launch stand out. We have a script that starts dora launch, which we run with sh launch.sh (we don't use sbatch/srun here; should we?). Here's the script:

#!/bin/sh

# these logging exports don't do much
export HYDRA_FULL_ERROR=1
export CUDBG_USE_LEGACY_DEBUGGER=1
# export NVLOG_CONFIG_FILE=${HOME}/nvlog.stdout.config  # overridden by the line below
export NVLOG_CONFIG_FILE=${HOME}/nvlog.config
export WANDB_MODE=offline

cd /home/$USER/audiocraft/

export AUDIOCRAFT_DORA_DIR=/projects/$USER/

export TEAM=$TEAM
export USER=$USER
export NCCL_DEBUG=DEBUG
export LOGLEVEL=INFO

dora launch -a --no_git_save -p h100 -g 16 solver=musicgen/musicgen_32khz \
model/lm/model_scale=small \
conditioner=text2music \
dset=audio/data_32khz \
dataset.num_workers=0 \
continue_from=//pretrained/facebook/musicgen-small \
dataset.valid.num_samples=16 \
dataset.batch_size=64 \
schedule.cosine.warmup=500 \
optim.optimizer=dadam \
optim.lr=1e-4 \
optim.epochs=30 \
slurm.setup=[". /home/$USER/anaconda3/etc/profile.d/conda.sh","conda activate torch_ac"] \
optim.updates_per_epoch=1000

If I run this from our Slurm login node, I get the message below, which makes me question whether training is being attempted on the login node itself:

/home/$USER/anaconda3/envs/torch_ac/lib/python3.10/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
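
For what it's worth, a quick check we can run on whichever node invokes dora launch, just to confirm whether that process can see any GPUs at all (which would explain the NVML warning on the login node). A trivial sketch:

import torch

# Run on the node where `dora launch` is invoked. If this prints False / 0,
# the launching process cannot see any GPUs, which matches the NVML warning
# we get on the login node.
print("cuda available:", torch.cuda.is_available())
print("visible gpus:", torch.cuda.device_count())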

If we run the script from one of the GPU nodes, there is no NVML issue, but training times out with GPU usage stuck at 100% and eventually crashes with a message about the backward pass for gradients. I'll post that log here soon.

Any info is helpful, even an example of how to run dora launch on a Slurm cluster (is it launched with sbatch?).

@nischalj10

Following. Were you able to resolve this?

Tristan-Kosciuch commented Apr 18, 2024

Unfortunately we were not able to; we're still working on it. Have you managed it?

@nischalj10

Nope
