Closed
Description
Steps to reproduce
- Create a fleet (this is optional, the same happens with autocreated fleets):
type: fleet
name: fleet-aws
nodes: 2
placement: cluster
backends: [aws]
resources:
  cpu: 1..
  gpu: 1
  memory: 1GB..
  disk: 10GB..
Fleet fleet-aws does not exist yet.
Create the fleet? [y/n]: y
FLEET      INSTANCE  BACKEND          RESOURCES                            PRICE   STATUS  CREATED
fleet-aws  0         aws (us-east-1)  cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526  idle    10:53
           1         aws (us-east-1)  cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526  idle    10:53
- Start the NCCL Tests example:
dstack apply -f examples/clusters/nccl-tests/.dstack.yml --backend aws --gpu 1
Actual behaviour
Sometimes the run starts again after it has finished. The restart loop may stop after just two iterations, but in some cases the run restarts 6-7 times.
Moreover, the second iteration usually provisions a new instance instead of reusing an existing idle one, which violates the fleet spec (nodes: 2):
FLEET      INSTANCE  BACKEND          RESOURCES                            PRICE   STATUS        CREATED
fleet-aws  0         aws (us-east-1)  cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526  idle          14 mins ago
           1         aws (us-east-1)  cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526  idle          14 mins ago
           2         aws (us-east-1)  cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526  provisioning  now
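The fleet-size violation can be checked mechanically from the listing above. The following is a minimal sketch that works only on the pasted text of the listing, not on any dstack API; the helper names `count_instances` and `excess_instances` are hypothetical and exist only for this illustration.

```python
# Hypothetical triage helper: counts instance rows in a pasted fleet listing
# and compares the count with the `nodes` value from the fleet configuration.

LISTING = """\
FLEET INSTANCE BACKEND RESOURCES PRICE STATUS CREATED
fleet-aws 0 aws (us-east-1) cpu=4 mem=16GB disk=100GB T4:16GB:1 $0.526 idle 14 mins ago
1 aws (us-east-1) cpu=4 mem=16GB disk=100GB T4:16GB:1 $0.526 idle 14 mins ago
2 aws (us-east-1) cpu=4 mem=16GB disk=100GB T4:16GB:1 $0.526 provisioning now
"""


def count_instances(listing: str) -> int:
    """Count rows that describe an instance.

    A row either starts with the instance number (continuation rows) or
    with the fleet name followed by the instance number (first row).
    """
    count = 0
    for line in listing.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].isdigit():
            count += 1
        elif len(tokens) > 1 and tokens[1].isdigit():
            count += 1
    return count


def excess_instances(listing: str, nodes: int) -> int:
    """Return how many instances exceed the configured fleet size."""
    return max(0, count_instances(listing) - nodes)


if __name__ == "__main__":
    # `nodes: 2` comes from the fleet configuration in the steps to reproduce.
    print(excess_instances(LISTING, nodes=2))
```

With the listing from this report and `nodes: 2`, the helper reports one instance more than the fleet spec allows.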
Expected behaviour
No response
dstack version
Server logs
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_running_jobs:352 job(062ab6)nccl-tests-1-0: process running job,
age=0:08:48.023589
dstack._internal.server.background.tasks.process_running_jobs:352 job(de292b)nccl-tests-0-0: process running job,
age=0:08:48.026706
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_running_jobs:352 job(de292b)nccl-tests-0-0: process running job,
age=0:09:02.834106
dstack._internal.server.background.tasks.process_running_jobs:352 job(062ab6)nccl-tests-1-0: process running job,
age=0:09:02.836883
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.background.tasks.process_running_jobs:352 job(062ab6)nccl-tests-1-0: process running job,
age=0:09:17.511238
dstack._internal.server.background.tasks.process_running_jobs:352 job(de292b)nccl-tests-0-0: process running job,
age=0:09:17.514321
dstack._internal.server.background.tasks.process_running_jobs:762 job(de292b)nccl-tests-0-0: now is TERMINATING
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.services.runs:1185 run(e0e300)nccl-tests: scaling UP 1 replica(s)
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(da9543)nccl-tests-0-0: provisioning has started
dstack._internal.server.services.backends:347 Requesting instance offers from backends: ['aws']
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(c6018e)nccl-tests-1-0: provisioning has started
dstack._internal.server.background.tasks.process_submitted_jobs:205 job(c6018e)nccl-tests-1-0: waiting for master job to
be provisioned
dstack._internal.server.background.tasks.process_submitted_jobs:743 job(da9543)nccl-tests-0-0: trying g4dn.xlarge in
aws/us-east-1 for $0.5260 per hour
dstack._internal.server.background.tasks.process_terminating_jobs:91 job(062ab6)nccl-tests-1-0: terminating job
dstack._internal.server.background.tasks.process_terminating_jobs:91 job(de292b)nccl-tests-0-0: terminating job
dstack._internal.server.services.jobs:257 job(062ab6)nccl-tests-1-0: stopping container
dstack._internal.server.services.jobs:257 job(de292b)nccl-tests-0-0: stopping container
dstack._internal.core.backends.aws.compute:290 Trying provisioning g4dn.xlarge in us-east-1d
dstack._internal.server.services.jobs:324 job(de292b)nccl-tests-0-0: instance 'fleet-aws-1' has been released, new status
is IDLE
dstack._internal.server.services.services:270 job(de292b)nccl-tests-0-0: service replica unregistered from receiving
requests, gateway=False
dstack._internal.server.services.jobs:378 job(de292b)nccl-tests-0-0: job status is DONE, reason: DONE_BY_RUNNER
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(c6018e)nccl-tests-1-0: provisioning has started
dstack._internal.server.background.tasks.process_submitted_jobs:205 job(c6018e)nccl-tests-1-0: waiting for master job to
be provisioned
dstack._internal.server.services.jobs:324 job(062ab6)nccl-tests-1-0: instance 'fleet-aws-0' has been released, new status
is IDLE
dstack._internal.server.services.services:270 job(062ab6)nccl-tests-1-0: service replica unregistered from receiving
requests, gateway=False
dstack._internal.server.services.jobs:378 job(062ab6)nccl-tests-1-0: job status is TERMINATED, reason:
TERMINATED_BY_SERVER
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(c6018e)nccl-tests-1-0: provisioning has started
dstack._internal.server.background.tasks.process_submitted_jobs:205 job(c6018e)nccl-tests-1-0: waiting for master job to
be provisioned
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(c6018e)nccl-tests-1-0: provisioning has started
dstack._internal.server.background.tasks.process_submitted_jobs:205 job(c6018e)nccl-tests-1-0: waiting for master job to
be provisioned
dstack._internal.server.background.tasks.process_submitted_jobs:186 job(c6018e)nccl-tests-1-0: provisioning has started
dstack._internal.server.background.tasks.process_submitted_jobs:205 job(c6018e)nccl-tests-1-0: waiting for master job to
be provisioned
dstack._internal.server.background.tasks.process_submitted_jobs:366 job(da9543)nccl-tests-0-0: now is provisioning a new
instance
dstack._internal.server.background.tasks.process_submitted_jobs:397 The job nccl-tests-0-0 created the new instance
fleet-aws-2
Additional information
No response
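To make the ordering in the server logs easier to follow, here is a minimal sketch that joins the wrapped log lines and keeps only the status-related entries. The excerpt is copied from the logs above with intermediate lines omitted; the function names `join_wrapped` and `key_events` are hypothetical, and the assumed entry layout (`module:line message`) is inferred from the pasted logs rather than from dstack documentation.

```python
import re

# Excerpt from the server logs in this report (intermediate lines omitted).
LOG = """\
dstack._internal.server.background.tasks.process_running_jobs:762 job(de292b)nccl-tests-0-0: now is TERMINATING
dstack._internal.server.background.tasks.process_runs:158 run(e0e300)nccl-tests: processing run
dstack._internal.server.services.runs:1185 run(e0e300)nccl-tests: scaling UP 1 replica(s)
dstack._internal.server.services.jobs:378 job(de292b)nccl-tests-0-0: job status is DONE, reason: DONE_BY_RUNNER
dstack._internal.server.services.jobs:378 job(062ab6)nccl-tests-1-0: job status is TERMINATED, reason:
TERMINATED_BY_SERVER
"""

# Assumed layout of one entry: "<module>:<line number> <message>".
ENTRY = re.compile(r"^(?P<module>dstack\.[\w.]+):(?P<line>\d+) (?P<message>.*)$")

MARKERS = ("now is TERMINATING", "scaling UP", "job status is")


def join_wrapped(text: str) -> list[str]:
    """Merge terminal-wrapped continuation lines into the entry they belong to."""
    entries: list[str] = []
    for raw in text.splitlines():
        if raw.startswith("dstack.") or not entries:
            entries.append(raw.strip())
        else:
            entries[-1] += " " + raw.strip()
    return entries


def key_events(text: str) -> list[str]:
    """Return the messages of status-related entries, in log order."""
    events = []
    for entry in join_wrapped(text):
        match = ENTRY.match(entry)
        if match and any(marker in match.group("message") for marker in MARKERS):
            events.append(match.group("message"))
    return events


if __name__ == "__main__":
    for event in key_events(LOG):
        print(event)
```

Run on the excerpt, the filtered timeline shows `scaling UP 1 replica(s)` logged right after `nccl-tests-0-0` enters TERMINATING and before either job reports its final status. This is only an observation about log order, not a confirmed root cause.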