
feat(scheduler): optimize worker scheduling to O(1) using Valkey/Redi…#13

Open
Grant McCloskey (MushuEE) wants to merge 1 commit into agent-substrate:main from MushuEE:feature/redis-scheduler-opt

@MushuEE

Description

Addresses #12

This PR optimizes Substrate's critical scheduling path by replacing the legacy $O(N)$ scan over every worker in a pool with an $O(1)$ claim from an idle-worker Set backed by Valkey/Redis.


Architectural Rationale & Changes

Previously, when an actor was resumed, the scheduler step (AssignWorkerStep.Execute in workflow_resume.go) would fetch all registered workers in the pool via store.ListWorkers(ctx). For a pool of 10,000 workers, this required serial keyspace scans over all master shards, loading and unmarshaling 10,000 JSON blocks on every single wakeup request.
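
To make the cost of the legacy path concrete, here is a minimal in-memory sketch of a "list everything, then pick" scheduler step. It is not the code from workflow_resume.go: the `Worker` fields, `buildPool`, `listWorkers`, and `pickIdle` are hypothetical names used only for illustration, and the raw JSON slices stand in for records fetched from the store.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Worker is a hypothetical, trimmed-down worker record.
type Worker struct {
	ID      string `json:"id"`
	ActorID string `json:"actor_id"` // empty means the worker is idle
}

// buildPool returns n serialized workers; only the last one is idle.
func buildPool(n int) [][]byte {
	records := make([][]byte, 0, n)
	for i := 0; i < n; i++ {
		w := Worker{ID: fmt.Sprintf("worker-%d", i), ActorID: fmt.Sprintf("actor-%d", i)}
		if i == n-1 {
			w.ActorID = ""
		}
		raw, _ := json.Marshal(w)
		records = append(records, raw)
	}
	return records
}

// listWorkers mimics the legacy behavior: every record is decoded,
// so the work grows linearly with the pool size.
func listWorkers(records [][]byte) ([]Worker, error) {
	workers := make([]Worker, 0, len(records))
	for _, raw := range records {
		var w Worker
		if err := json.Unmarshal(raw, &w); err != nil {
			return nil, err
		}
		workers = append(workers, w)
	}
	return workers, nil
}

// pickIdle scans the decoded slice for the first idle worker.
func pickIdle(workers []Worker) (Worker, bool) {
	for _, w := range workers {
		if w.ActorID == "" {
			return w, true
		}
	}
	return Worker{}, false
}

func main() {
	workers, err := listWorkers(buildPool(10000))
	if err != nil {
		panic(err)
	}
	w, ok := pickIdle(workers)
	fmt.Println("decoded:", len(workers), "picked:", w.ID, "found:", ok)
}
```

Every wakeup pays for decoding the whole pool even though only one worker is needed.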

This PR removes that bottleneck by moving idle-worker tracking into the database layer:

1. Interface Enhancements (store.Interface)

  • Added ClaimIdleWorker(ctx, namespace, pool, actorID, actorNamespace, actorTemplate) to the database contract.

2. Set-Based Idle Pool Management (ateredis.go)

We use a Valkey/Redis Set per pool to track which workers are idle:

  • Atomic Selection ($O(1)$): The new ClaimIdleWorker executes a single, atomic SPOP operation on the pool:<namespace>:<pool>:idle_workers Set to claim a random free worker in constant time.
  • Zero Scheduling Collisions: Because SPOP is an atomic server-side operation, concurrent scheduler instances are guaranteed to pop unique worker IDs. Two schedulers can never claim the same worker, so worker selection no longer triggers optimistic-lock collisions or retries under concurrent wakeups.
  • Automated Lifecycle Hooks:
    • Newly registered workers are added to the idle set during CreateWorker.
    • Deleted worker pods are cleanly purged from the set during DeleteWorker.
    • Suspended workers are returned to the set when their ActorId transitions to empty in UpdateWorker.

3. Handling Redis Cluster Slot Restrictions (CROSSSLOT fix)

During integration testing, multi-key transaction pipelines (writing a worker record and mutating the idle Set in a single transaction block) failed with CROSSSLOT errors because the keys hashed to different cluster hash slots.
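
For background on why two keys can land in different slots, the sketch below computes a key's hash slot the way the Redis Cluster specification describes it: CRC16 (XMODEM variant) of the key, modulo 16384, hashing only the `{...}` hash tag when a non-empty one is present. The two pool keys printed in `main` are hypothetical examples; the worker record key format is not given in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// crc16 implements CRC-16/XMODEM (polynomial 0x1021, initial value 0),
// the checksum the Redis Cluster specification uses for key hashing.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// keySlot returns the cluster hash slot (0..16383) for a key. If the key
// contains a non-empty {tag}, only the tag is hashed.
func keySlot(key string) uint16 {
	if open := strings.IndexByte(key, '{'); open >= 0 {
		if end := strings.IndexByte(key[open+1:], '}'); end > 0 {
			key = key[open+1 : open+1+end]
		}
	}
	return crc16([]byte(key)) % 16384
}

func main() {
	// Hypothetical key names, for illustration only.
	keys := []string{
		"pool:default:gpu:idle_workers",
		"worker:default:gpu:worker-1",
	}
	for _, k := range keys {
		fmt.Printf("%-32s -> slot %d\n", k, keySlot(k))
	}
}
```

A MULTI/EXEC block that touches keys in different slots is rejected by a clustered server, which matches the failure seen in testing.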

We solved this by moving the Set mutations out of the transaction and running the two steps sequentially:

  • The worker metadata record remains safely protected by optimistic locking version checks (WATCH transactions).
  • The indexing mutations (SAdd/SRem on the idle Set) are executed as separate, independent commands immediately after the transaction succeeds. No transaction spans more than one key, so the commands work on clustered production Valkey regardless of how the keys hash.

Verification & Tests Completed

  • Static Analysis: go vet reports 0 errors/warnings.
  • Unit & Integration Tests: Re-ran the entire store package test suite; all tests passed:
    go test -v ./cmd/servers/ateapi/store/...
    PASS
    ok      github.com/agent-substrate/substrate/cmd/servers/ateapi/store/ateredis  0.168s
    

Grant McCloskey (MushuEE) force-pushed the feature/redis-scheduler-opt branch from 3a51efa to a570c3c on May 20, 2026 22:09
@MushuEE
Author

Fixed format issue using make fmt
