Hello,

We're trying to run musicgen training/fine-tuning from the audiocraft repo using dora. Single-node training works with `dora run -d solver=...`, but when we run the same training across multiple nodes with torchrun, it fails with a thread deadlock error, and the same happens with `dora launch`. We're running on GCP with NVIDIA H100 instances, so we wonder whether the H100s are incompatible with some of audiocraft's dependencies.
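For reference, the single-node invocation that works for us looks roughly like this (the solver config name below is a placeholder, not our actual value):

```bash
# Single-node, multi-GPU run; -d tells dora to run torch distributed across the local GPUs.
dora run -d solver=<solver_config>
```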
When we attempt to use `dora grid`, the process quickly exits with a "FAI" status and a warning that we cannot change config values:
```
    raise ConfigCompositionException(
hydra.errors.ConfigCompositionException: Could not override 'lr'.
To append to your config use +lr=0.01
```
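If we understand hydra's override rules correctly, a plain `lr=0.01` only works when `lr` already exists at the top level of the composed config; in audiocraft the learning rate appears to live under a nested key (we believe `optim.lr`), so presumably the valid forms would be:

```bash
# Override an existing nested key (assuming audiocraft keeps the learning rate at optim.lr):
dora run solver=<solver_config> optim.lr=0.01
# Or append a brand-new top-level key, as the error message suggests:
dora run solver=<solver_config> +lr=0.01
```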
This is the grid we are using:

```python
from itertools import product

from dora import Explorer, Launcher


@Explorer
def explorer(launcher: Launcher):
    sub = launcher.bind(lr=0.01)  # bind a parameter value, in a new launcher
    sub.slurm_(gpus=16)  # all jobs scheduled with `sub` will use 16 gpus
    sub()  # job with lr=0.01 and 16 gpus
    sub.bind_(epochs=40)  # in-place version of bind()
    sub.slurm(partition="h100")(batch_size=16)  # lr=0.01, 16 gpus, h100 partition, bs=16, epochs=40
```
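We invoke the grid through dora's CLI like this (the grid module name is a placeholder for ours):

```bash
# Schedule and monitor all jobs declared by the explorer; dora submits them through slurm.
dora grid <grid_name>
```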
A few warnings from dora launch stand out. We have a script that starts dora launch, which we run with `sh launch.sh` (we don't use sbatch/srun here; should we?). Here's the script. If we run it from our slurm login node, we get an NVML error, which makes us question whether training is being attempted on the login node itself.
If we run the script from one of the GPU nodes instead, there is no NVML issue, but training times out with GPU usage stuck at 100% and eventually crashes with a message about the backward pass for gradients; we'll post that log here soon.
Any info is helpful, even an example of how to run dora launch on a slurm cluster (is it launched with sbatch?).