feat(scheduler): optimize worker scheduling to O(1) using Valkey/Redi…#13
Open
Grant McCloskey (MushuEE) wants to merge 1 commit into
3a51efa to a570c3c
Author
|
Fixed format issue using |
Description
Addresses #12
This PR optimizes Substrate's critical scheduling path by replacing the legacy $O(N)$ scan over every worker in the pool with an $O(1)$ claim from an idle-worker Set backed by Valkey/Redis.
Architectural Rationale & Changes
Previously, when an actor was resumed, the scheduler step (`AssignWorkerStep.Execute` in `workflow_resume.go`) fetched every registered worker in the pool via `store.ListWorkers(ctx)`. For a pool of 10,000 workers, this required serial keyspace scans over all master shards, then loading and unmarshaling 10,000 JSON blocks on every single wakeup request.
This PR eliminates that bottleneck by moving idle-worker queue management into the database layer:
1. Interface Enhancements (`store.Interface`)
- Added `ClaimIdleWorker(ctx, namespace, pool, actorID, actorNamespace, actorTemplate)` to the database contract.
2. Set-Based Idle Pool Management (`ateredis.go`)
We use Valkey/Redis's in-memory `Set` to track available capacity:
- `ClaimIdleWorker` executes a single, atomic `SPOP` operation on the `pool:<namespace>:<pool>:idle_workers` Set to claim a random free worker in constant time.
- Because `SPOP` is an atomic server-side operation, concurrent scheduler instances are guaranteed to pop unique worker IDs. This eliminates optimistic lock collisions and database retries on concurrent wakeups.
- Idle Set membership is maintained in `CreateWorker`, in `DeleteWorker`, and when `ActorId` transitions to empty in `UpdateWorker`.
3. Handling Redis Cluster Slot Restrictions (`CROSSSLOT` fix)
During integration testing, multi-key transaction pipelines (which tried to write a worker record and mutate the idle set in a single transaction block) failed with `CROSSSLOT` errors because the keys hashed to different cluster slots.
We solved this by splitting the operations and running them sequentially, outside the transactions:
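- Worker record writes stay inside optimistic-lock transactions (`WATCH` transactions).
- Idle-set mutations (`SAdd`/`SRem` on the idle Set) are executed as separate, independent commands immediately upon transaction success. This avoids the `CROSSSLOT` restriction and keeps the code compatible with clustered production Valkey.
Verification & Tests Completed
- `go vet` (0 errors/warnings).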
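For reviewers who want to reason about the claim semantics without a running Valkey instance, here is a minimal in-memory sketch of the idle-set protocol described under "Set-Based Idle Pool Management". It is not the code in this PR: `idleSet`, `add`, `remove`, `claim`, and `claimConcurrently` are hypothetical names, and a mutex-guarded Go map stands in for the server-side atomicity of `SADD`, `SREM`, and `SPOP`. It only illustrates why concurrent schedulers never receive the same worker ID.

```go
package main

import (
	"fmt"
	"sync"
)

// idleSet is a hypothetical in-memory stand-in for the Set stored at
// pool:<namespace>:<pool>:idle_workers. The mutex plays the role of the
// atomicity that Valkey/Redis gives each single command.
type idleSet struct {
	mu      sync.Mutex
	members map[string]struct{}
}

func newIdleSet() *idleSet {
	return &idleSet{members: make(map[string]struct{})}
}

// add mirrors SADD: mark a worker as idle.
func (s *idleSet) add(workerID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.members[workerID] = struct{}{}
}

// remove mirrors SREM: drop a worker from the idle set.
func (s *idleSet) remove(workerID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.members, workerID)
}

// claim mirrors SPOP: atomically remove and return an arbitrary member.
// It reports false when no idle worker is left.
func (s *idleSet) claim() (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id := range s.members {
		delete(s.members, id)
		return id, true
	}
	return "", false
}

// claimConcurrently simulates several scheduler instances racing to claim
// workers and returns how many times each worker ID was handed out.
func claimConcurrently(s *idleSet, schedulers int) map[string]int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	counts := make(map[string]int)
	for i := 0; i < schedulers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if id, ok := s.claim(); ok {
				mu.Lock()
				counts[id]++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counts
}

func main() {
	idle := newIdleSet()
	for i := 0; i < 100; i++ {
		idle.add(fmt.Sprintf("worker-%d", i))
	}
	idle.remove("worker-0") // e.g. a worker that was deleted before any claim

	// 150 schedulers race for 99 idle workers.
	counts := claimConcurrently(idle, 150)
	fmt.Println("distinct workers claimed:", len(counts)) // 99
}
```

Each claim removes the member it returns in the same atomic step, so no two callers can observe the same ID, and callers that arrive after the set is empty get a clean "no capacity" result rather than a retry loop.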