Fix Issue 3683: multiple jobs assigned to same instance #3690
Bihan wants to merge 3 commits into dstackai:master from
Conversation
@Bihan Have you tried reproducing this problem with pipelines? I suspect it is already fixed in pipelines, and we don't need to fix the scheduled task since it will be dropped right after the coming release.
And there is nothing wrong with doing `busy_blocks += offer.blocks` in Python if the resource is locked, and the instances should be locked in this case. I think the original bug is related to the lock being held correctly but `busy_blocks` being stale (read from an unlocked value). Let me know if the problem is reproducible with pipelines and we'll look for the root cause in that case. If it's not, then there is no need for the fix.
@Bihan, #3686 shouldn't be related. I suspect the JobSubmitted pipeline (the one you are trying to change) fixed it (https://github.com/Bihan/dstack/blob/master/src/dstack/_internal/server/background/pipeline_tasks/jobs_submitted.py). You can try setting DSTACK_FF_PIPELINE_PROCESSING_ENABLED to enable pipelines and try to reproduce. Scheduled tasks will be dropped after the coming release.
@r4victor With pipelines the above issue does not occur. Jobs are allocated to unique instances as expected. I think for now we can close this PR.
This PR fixes bug #3683.
Problem

When assigning a job to an existing instance, the server used a read → add in Python → write on `busy_blocks` (e.g. load the row, compute `busy_blocks += offer.blocks`). Under concurrency, two jobs read the same value, both add their share, and both write, so capacity was oversubscribed and multiple jobs ended up on one instance.

Fix

Changed how capacity is reserved: instead of a non-atomic Python update, the server issues a single SQL UPDATE that increments `busy_blocks` by the job's blocks, only if there is remaining capacity (`total_blocks` is NULL or `busy_blocks + blocks <= total_blocks`).
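To illustrate the idea, here is a minimal sketch of such a conditional UPDATE using SQLite and a hypothetical `instances` table; the table layout, column names, and `try_reserve` helper are illustrative assumptions, not dstack's actual schema or code. The guard in the WHERE clause and the increment happen in one statement, so two concurrent reservations cannot both succeed past capacity:

```python
import sqlite3

# Hypothetical schema for illustration only; dstack's real model differs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances ("
    " id INTEGER PRIMARY KEY,"
    " total_blocks INTEGER,"          # NULL means unlimited capacity
    " busy_blocks INTEGER NOT NULL)"
)
conn.execute("INSERT INTO instances VALUES (1, 4, 3)")  # 3 of 4 blocks busy

def try_reserve(conn, instance_id, blocks):
    # Single atomic UPDATE: increment busy_blocks only if capacity remains.
    # rowcount == 0 means the guard failed and nothing was reserved.
    cur = conn.execute(
        """
        UPDATE instances
        SET busy_blocks = busy_blocks + ?
        WHERE id = ?
          AND (total_blocks IS NULL OR busy_blocks + ? <= total_blocks)
        """,
        (blocks, instance_id, blocks),
    )
    return cur.rowcount == 1

print(try_reserve(conn, 1, 1))  # True: 3 + 1 <= 4, block reserved
print(try_reserve(conn, 1, 1))  # False: 4 + 1 > 4, no capacity left
```

A caller that loses the race sees `rowcount == 0` and moves on to another instance, rather than silently oversubscribing the one it read earlier.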