fix(scheduling): release lock during runner init to unblock operations#760
Conversation
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In
load, you stilldefer l.unlock()after acquiring the lock but then calll.unlock()explicitly beforerun(...); with the currentguard-based lock this will double-unlock and can deadlock the goroutine—remove the defer and manage lock/unlock explicitly around the different phases instead. - When reacquiring the loader lock after
run/wait, you usel.lock(context.Background()), which ignores the original call’s cancellation and could hang indefinitely; consider reusing the originalctx(or propagating a derived one) and handling lock acquisition failure so load respects caller cancellation throughout.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `load`, you still `defer l.unlock()` after acquiring the lock but then call `l.unlock()` explicitly before `run(...)`; with the current `guard`-based lock this will double-unlock and can deadlock the goroutine—remove the defer and manage lock/unlock explicitly around the different phases instead.
- When reacquiring the loader lock after `run`/`wait`, you use `l.lock(context.Background())`, which ignores the original call’s cancellation and could hang indefinitely; consider reusing the original `ctx` (or propagating a derived one) and handling lock acquisition failure so load respects caller cancellation throughout.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Code Review
This pull request refactors the loader logic to release the lock during the potentially long-running runner initialization. This is a good improvement to prevent blocking other operations. The implementation introduces a 'loading' state for runners and exposes this via the ps command. My review found one critical issue with how the lock is re-acquired after being released, which could lead to the service hanging on shutdown or potential race conditions.
24bdd4a to
ec84ed8
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a significant improvement by releasing the loader lock during long-running runner initializations, which unblocks other operations. The changes to expose the "Loading..." status in docker model ps are also well-implemented.
However, I've identified a critical race condition in pkg/inference/scheduling/loader.go. Releasing the lock allows multiple goroutines to attempt to load the same model concurrently, as there's no check for in-progress loads. I've provided detailed comments and suggestions to address this issue by tracking in-progress loads and preventing duplicate initializations. Addressing this is crucial for the stability and efficiency of the loader.
…te in ps Signed-off-by: Dorin Geman <dorin.geman@docker.com>
ec84ed8 to
6e3a208
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the runner loading logic to release the loader lock during the potentially long-running runner initialization process. This is a significant improvement as it prevents blocking other operations like docker model ps. The changes introduce a new loading state to track in-flight initializations, ensuring that slots are reserved and that duplicate loading attempts for the same model are handled gracefully by waiting for the existing one to complete. The implementation, while complex due to manual lock management, appears to be robust and correct. The new Loading... status in the ps output is a great addition for user experience. Overall, this is a well-executed and important fix for the model runner's scheduling.
On top of #759.
Release the loader lock during long-running runner initialization (up to 5 minutes) so that other operations like docker model ps, model releases, and concurrent loads are not blocked, and expose in-flight loads in the ps output with a "Loading..." status.
With this change you can now run
docker model pswhile a model is loading and see that it's loading.Terminal 1:
Terminal 2:
Terminal 3: