
Fix Issue 3683 multiple jobs assigned to same instance #3690

Draft

Bihan wants to merge 3 commits into dstackai:master from Bihan:issue_3683_multiple_jobs_assigned_to_same_instance

Conversation

Bihan (Collaborator) commented Mar 25, 2026

This PR fixes bug #3683.

Problem
When assigning a job to an existing instance, the server used a read → modify in Python → write pattern on busy_blocks (e.g., load the row, compute busy_blocks += offer.blocks, then write it back). Under concurrency, two jobs could read the same value, each add their share, and both write, so capacity was oversubscribed and multiple jobs ended up on one instance.

Fix
Changed how capacity is reserved: instead of a non-atomic Python update, the server issues a single SQL UPDATE that increments busy_blocks by the job's blocks, but only if there is remaining capacity (total_blocks IS NULL or busy_blocks + blocks <= total_blocks).
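The conditional-UPDATE reservation described above can be sketched as follows. This is a minimal illustration using sqlite3, not dstack's actual schema or code; the table and column names (instances, total_blocks, busy_blocks) mirror the PR description but are assumptions.

```python
import sqlite3

# Toy instance table: 4 total blocks, 3 already busy.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances ("
    "id INTEGER PRIMARY KEY, total_blocks INTEGER, busy_blocks INTEGER NOT NULL)"
)
conn.execute("INSERT INTO instances VALUES (1, 4, 3)")

def try_reserve(conn, instance_id, blocks):
    # Single conditional UPDATE: increments busy_blocks only if capacity
    # remains. rowcount == 0 means another job already took the capacity,
    # so the caller must pick a different instance.
    cur = conn.execute(
        """
        UPDATE instances
        SET busy_blocks = busy_blocks + ?
        WHERE id = ?
          AND (total_blocks IS NULL OR busy_blocks + ? <= total_blocks)
        """,
        (blocks, instance_id, blocks),
    )
    return cur.rowcount == 1

print(try_reserve(conn, 1, 2))  # False: 3 + 2 > 4, no row updated
print(try_reserve(conn, 1, 1))  # True: 3 + 1 <= 4, busy_blocks becomes 4
```

Because the capacity check and the increment happen in one statement, the database serializes concurrent reservations and the lost-update race cannot occur.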

@Bihan Bihan changed the title Issue 3683 multiple jobs assigned to same instance Fix Issue 3683 multiple jobs assigned to same instance Mar 25, 2026
r4victor (Collaborator) commented:

@Bihan Have you tried reproducing this problem with pipelines? I suspect it should already be fixed in pipelines, and we don't need to fix the scheduled task since it'll be dropped right after the coming release.

r4victor (Collaborator) commented:

And there is nothing wrong with doing busy_blocks += offer.blocks in Python if the resource is locked, and the instances should be locked in this case. I think the original bug is that the lock is held correctly but busy_blocks is stale (read before the lock was acquired).

Let me know if the problem is reproducible with pipelines, and we'll look for the root cause in that case. If it's not, then there is no need for this fix.
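The pattern r4victor describes can be sketched as follows: a plain read-modify-write is safe while the lock is held, and the bug shape is reading the value before taking the lock, so the in-lock write uses a stale snapshot. This is an illustrative sketch with threads and a dict, not dstack's actual locking code.

```python
import threading

lock = threading.Lock()
instance = {"busy_blocks": 0}

def reserve_correct(blocks):
    # Read and write both happen under the lock, so the value is never stale.
    with lock:
        instance["busy_blocks"] += blocks

def reserve_buggy(blocks):
    # Bug shape: the read happens OUTSIDE the lock, so the snapshot may be
    # stale by the time the write runs, silently discarding concurrent updates.
    stale = instance["busy_blocks"]
    with lock:
        instance["busy_blocks"] = stale + blocks

threads = [threading.Thread(target=reserve_correct, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(instance["busy_blocks"])  # 100: no updates lost with the correct version
```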

Bihan (Collaborator, Author) commented Mar 25, 2026

> @Bihan Have you tried reproducing this problem with pipelines? I suspect it should already be fixed in pipelines, and we don't need to fix the scheduled task since it'll be dropped right after the coming release.

@r4victor Do you mean that with #3686 merged there will be no issues?

r4victor (Collaborator) commented Mar 25, 2026

@Bihan, #3686 shouldn't be related. I suspect the JobSubmitted pipeline fixed it (the one you're trying to change): https://github.com/Bihan/dstack/blob/master/src/dstack/_internal/server/background/pipeline_tasks/jobs_submitted.py

You can try setting DSTACK_FF_PIPELINE_PROCESSING_ENABLED to enable pipelines and try to reproduce. Scheduled tasks will be dropped after the coming release.

Bihan (Collaborator, Author) commented Mar 25, 2026

> @Bihan, #3686 shouldn't be related. I suspect the JobSubmitted pipeline fixed it (the one you're trying to change): https://github.com/Bihan/dstack/blob/master/src/dstack/_internal/server/background/pipeline_tasks/jobs_submitted.py
>
> You can try setting DSTACK_FF_PIPELINE_PROCESSING_ENABLED to enable pipelines and try to reproduce. Scheduled tasks will be dropped after the coming release.

@r4victor With pipelines the above issue does not occur. Jobs are allocated to unique instances as expected. I think we can close this PR for now.

peterschmidt85 marked this pull request as draft March 27, 2026 09:26