[Bug]: Multinode tasks sometimes get stuck in a loop #3130

@un-def

Description

Steps to reproduce

  1. Create a fleet (this is optional; the same happens with auto-created fleets):

      type: fleet
      name: fleet-aws
      nodes: 2
      placement: cluster
      backends: [aws]
      resources:
        cpu: 1..
        gpu: 1
        memory: 1GB..
        disk: 10GB..

      Fleet fleet-aws does not exist yet.
      Create the fleet? [y/n]: y
       FLEET      INSTANCE  BACKEND          RESOURCES                            PRICE   STATUS  CREATED
       fleet-aws  0         aws (us-east-1)  cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526  idle    10:53
                  1         aws (us-east-1)  cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526  idle    10:53

  2. Start the NCCL Tests example:

      dstack apply -f examples/clusters/nccl-tests/.dstack.yml --backend aws --gpu 1

Actual behaviour

Sometimes, after the run is finished, it starts again. The restart loop may break after just two iterations, but in some cases the run restarts up to 6-7 times.

Moreover, the second iteration usually provisions a new instance instead of reusing the existing idle ones, so the fleet ends up with three instances and exceeds the 2 nodes set in the fleet spec:

 FLEET      INSTANCE  BACKEND          RESOURCES                            PRICE   STATUS        CREATED
 fleet-aws  0         aws (us-east-1)  cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526  idle          14 mins ago
            1         aws (us-east-1)  cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526  idle          14 mins ago
            2         aws (us-east-1)  cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526  provisioning  now
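To spot this condition quickly while reproducing, the fleet listing can be checked against the configured node count. The following is only a sketch: it assumes the listing has the same column layout as the tables in this issue, that fleet names are never purely numeric, and the helper name count_fleet_instances is made up for illustration (it is not part of dstack).

```python
def count_fleet_instances(listing: str) -> dict[str, int]:
    """Count instance rows per fleet in a fleet listing shaped like the tables above.

    Assumptions (not verified against dstack itself): a row that starts a
    fleet begins with the fleet name, and continuation rows begin with the
    instance number.
    """
    counts: dict[str, int] = {}
    current = None
    for line in listing.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] == "FLEET":
            continue  # skip blank lines and the header row
        if not tokens[0].isdigit():
            # First row of a fleet: remember its name.
            current = tokens[0]
            counts.setdefault(current, 0)
        elif current is None:
            continue  # continuation row with no fleet seen yet
        counts[current] += 1
    return counts
```

For the listing above this yields 3 instances for fleet-aws, which can then be compared with nodes: 2 from the fleet configuration.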

Expected behaviour

No response

dstack version

4abb315

Server logs

dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_running_jobs:352 job(062ab6)nccl-tests-1-0: process running job, age=0:08:48.023589
dstack._internal.server.background.tasks.process_running_jobs:352 job(de292b)nccl-tests-0-0: process running job, age=0:08:48.026706
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_running_jobs:352 job(de292b)nccl-tests-0-0: process running job, age=0:09:02.834106
dstack._internal.server.background.tasks.process_running_jobs:352 job(062ab6)nccl-tests-1-0: process running job, age=0:09:02.836883
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_running_jobs:352 job(062ab6)nccl-tests-1-0: process running job, age=0:09:17.511238
dstack._internal.server.background.tasks.process_running_jobs:352 job(de292b)nccl-tests-0-0: process running job, age=0:09:17.514321
dstack._internal.server.background.tasks.process_running_jobs:762 job(de292b)nccl-tests-0-0: now is TERMINATING
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.services.runs:1185 run(e0e300)nccl-tests: scaling UP 1 replica(s)
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(da9543)nccl-tests-0-0: provisioning has started
dstack._internal.server.services.backends:347 Requesting instance offers from backends: ['aws']
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(c6018e)nccl-tests-1-0: provisioning has started
dstack._internal.server.background.tasks.process_submitted_jobs:205 job(c6018e)nccl-tests-1-0: waiting for master job to be provisioned
dstack._internal.server.background.tasks.process_submitted_jobs:743 job(da9543)nccl-tests-0-0: trying g4dn.xlarge in aws/us-east-1 for $0.5260 per hour
dstack._internal.server.background.tasks.process_terminating_jobs:91 job(062ab6)nccl-tests-1-0: terminating job
dstack._internal.server.background.tasks.process_terminating_jobs:91 job(de292b)nccl-tests-0-0: terminating job
dstack._internal.server.services.jobs:257 job(062ab6)nccl-tests-1-0: stopping container
dstack._internal.server.services.jobs:257 job(de292b)nccl-tests-0-0: stopping container
dstack._internal.core.backends.aws.compute:290 Trying provisioning g4dn.xlarge in us-east-1d
dstack._internal.server.services.jobs:324 job(de292b)nccl-tests-0-0: instance 'fleet-aws-1' has been released, new status is IDLE
dstack._internal.server.services.services:270 job(de292b)nccl-tests-0-0: service replica unregistered from receiving requests, gateway=False
dstack._internal.server.services.jobs:378 job(de292b)nccl-tests-0-0: job status is DONE, reason: DONE_BY_RUNNER
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(c6018e)nccl-tests-1-0: provisioning has started
dstack._internal.server.background.tasks.process_submitted_jobs:205 job(c6018e)nccl-tests-1-0: waiting for master job to be provisioned
dstack._internal.server.services.jobs:324 job(062ab6)nccl-tests-1-0: instance 'fleet-aws-0' has been released, new status is IDLE
dstack._internal.server.services.services:270 job(062ab6)nccl-tests-1-0: service replica unregistered from receiving requests, gateway=False
dstack._internal.server.services.jobs:378 job(062ab6)nccl-tests-1-0: job status is TERMINATED, reason: TERMINATED_BY_SERVER
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(c6018e)nccl-tests-1-0: provisioning has started
dstack._internal.server.background.tasks.process_submitted_jobs:205 job(c6018e)nccl-tests-1-0: waiting for master job to be provisioned
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(c6018e)nccl-tests-1-0: provisioning has started
dstack._internal.server.background.tasks.process_submitted_jobs:205 job(c6018e)nccl-tests-1-0: waiting for master job to be provisioned
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(c6018e)nccl-tests-1-0: provisioning has started
dstack._internal.server.background.tasks.process_submitted_jobs:205 job(c6018e)nccl-tests-1-0: waiting for master job to be provisioned
dstack._internal.server.background.tasks.process_submitted_jobs:366 job(da9543)nccl-tests-0-0: now is provisioning a new instance
dstack._internal.server.background.tasks.process_submitted_jobs:397 The job nccl-tests-0-0 created the new instance fleet-aws-2
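
Most of the log is periodic polling, so the order of state changes is easier to read after filtering. The sketch below is a hypothetical helper, not part of dstack: it assumes records start with "dstack." and carry a run(...) or job(...) tag as in the excerpt above. It joins any hard-wrapped continuation lines, drops the "processing run" and "process running job" polling messages, and collapses exact repeats.

```python
import re

# Matches the run(...)/job(...) tag, the name after it, and the message.
EVENT_RE = re.compile(
    r"(?P<kind>run|job)\((?P<id>[0-9a-f]+)\)(?P<name>\S+): (?P<msg>.*)"
)
# Periodic polling messages that carry no state change.
NOISE_PREFIXES = ("processing run", "process running job")


def unwrap(log_text):
    """Join hard-wrapped continuation lines onto the record they belong to."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("dstack.") or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def state_events(log_text):
    """Return (kind, id, name, message) tuples in order, without noise or repeats."""
    events, seen = [], set()
    for record in unwrap(log_text):
        match = EVENT_RE.search(record)
        if match is None:
            continue  # records without a run(...)/job(...) tag are skipped
        event = (match["kind"], match["id"], match["name"], match["msg"])
        if event[3].startswith(NOISE_PREFIXES) or event in seen:
            continue
        seen.add(event)
        events.append(event)
    return events
```

Applied to the excerpt above, the filtered sequence shows job de292b reported as TERMINATING, followed by the run's "scaling UP 1 replica(s)" message, and only after that the "terminating job" records for the two original jobs. Untagged records, such as the final "created the new instance fleet-aws-2" line, are dropped by this filter and need to be read from the raw log.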

Additional information

No response

Labels

bug (Something isn't working), major
