Skip to content

feat(llm-access-kiro): spread new sessions + band latency for fair account scheduling#25

Merged
acking-you merged 4 commits into
masterfrom
feat/kiro-session-spread-latency-band
Jun 2, 2026
Merged

feat(llm-access-kiro): spread new sessions + band latency for fair account scheduling#25
acking-you merged 4 commits into
masterfrom
feat/kiro-session-spread-latency-band

Conversation

@acking-you
Copy link
Copy Markdown
Owner

Problem

Two fairness defects in the Kiro account selector (selection_ordered_kiro_routes):

  1. No-sample accounts get starved. Selection compared exact smoothed latency scores as the second key. An account without latency samples falls back to the global average; any account measured even marginally faster than the average permanently out-sorts it. The last_started_at round-robin tier almost never fired, so the single fastest account won every request. "No sample" is not "high latency" — but the old ordering treated it that way.

  2. New sessions all stack on one account. A new session (has a session id but no prior affinity) followed pure latency and was pinned to the same fastest account. Per-session affinity itself was correct (same session → same account for 6h), but additional new sessions never spread — they all landed on that one account too.

Both stem from the same root: latency was too dominant, producing greedy winner-take-all.

Changes

1. Latency banding (kiro_latency.rs)

  • Replace route_score_ms with route_score_band: quantize the smoothed score into bands of 15% of the global average (BAND_FRACTION). Accounts in the same band tie and fall through to the round-robin / balance tiers. A small latency edge no longer pins traffic.
  • Scoring logic extracted into PreparedKiroLatencySnapshot::route_score (shared); the three guards (latency_routing_enabled, snapshot staleness, finite global avg) are preserved so legacy-order fallbacks are unchanged.

2. New-session spread (kiro_session_affinity.rs, route_selection.rs, kiro_dispatch.rs)

  • KiroSessionAffinity::account_session_counts() tallies non-expired sessions per account by iterating the LRU (iter does not perturb recency), skipping TTL-expired entries.
  • selection_ordered_kiro_routes / select_kiro_route_with_account_permit take an optional session_counts. When present (new session only), fewest bound sessions is the primary balance key, with latency band as tiebreaker.
  • Wired into both the main and websearch dispatch paths. Computed once per request (stable across failover retries, since affinity only mutates on success).

Resulting sort order

  • New session (session id, no affinity): proxy health → fewest bound sessions → latency band → last_started → credits → name
  • Stateless / sticky-hit: proxy health → latency band → last_started → credits → name (unchanged shape, exact-score → band is the only diff)

Existing affinity stays sticky; the preferred-account fast path is untouched.

Notes

  • Self-balancing: as sessions bind, counts diverge, so subsequent new sessions flow to the laggards. Cold start (all counts 0) ties through to band + last_started.
  • Snapshot semantics: session_counts is a point-in-time snapshot; it does not react to sessions bound by concurrent in-flight requests between retries — acceptable for fairness.
  • State remains per-process (no cross-instance sharing), same as before.

Tests

  • account_session_counts_tallies_live_entries_per_account + ..._skips_expired_entries
  • kiro_selection_spreads_new_session_to_least_bound_account (slower account with fewer sessions wins for a new session)
  • kiro_selection_treats_same_band_accounts_as_equal_for_round_robin (no-sample account not starved by marginal edge)
  • All 5 existing ordering tests still pass (band preserves the three guards).

cargo test -p llm-access → 295 passed. cargo clippy -p llm-access -- -D warnings → clean. rustfmt on changed files only.

🤖 Generated with Claude Code

…count scheduling

Two scheduling fairness fixes in the Kiro account selector:

1. Latency banding. selection_ordered_kiro_routes compared exact smoothed
   latency scores, so any account measured marginally faster than the global
   average permanently out-sorted accounts that merely lacked samples (which
   fall back to the global average). The last_started_at round-robin tier
   almost never fired and the fastest account won everything. route_score_ms
   is replaced by route_score_band, which quantizes the score into bands of
   15% of the global average; accounts in the same band tie and fall through
   to round-robin. Missing samples are no longer a penalty.

2. New-session spread. A new session (has a session id but no existing
   affinity) followed pure latency and stacked onto the same fastest account.
   KiroSessionAffinity now exposes account_session_counts(); when dispatching a
   new session, the gateway selects the account with the fewest bound sessions
   first (least-connections), with latency band as the tiebreaker. Existing
   affinity stays sticky; stateless requests keep the band-only ordering.
   Wired into both the main and websearch dispatch paths.

Self-balancing: counts diverge as sessions bind, so subsequent new sessions
flow to the laggards. account_session_counts is a once-per-request snapshot
(stable across failover retries) computed by iterating the affinity LRU
(iter does not perturb recency), skipping TTL-expired entries.

Tests: per-account session tally incl. expired-entry skip; new-session spread
to least-bound account; same-band accounts round-robin instead of starving the
no-sample one. All 295 llm-access tests pass; clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces latency-score bands and session-balancing to prevent starving accounts with fewer samples and to distribute new sessions more evenly. Specifically, it quantizes latency scores into bands using a new BAND_FRACTION constant and adds session-affinity tracking to count active sessions per account, prioritizing accounts with fewer bound sessions during route selection. Feedback on these changes highlights a performance and lock contention hazard in account_session_counts_at, where string allocations are performed on every iteration while holding a mutex lock; using a borrowed lookup can optimize this path.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +111 to +117
let mut counts: HashMap<String, usize> = HashMap::new();
for (_, entry) in entries.iter() {
if now.saturating_duration_since(entry.updated_at) > self.ttl {
continue;
}
*counts.entry(entry.account_name.to_string()).or_insert(0) += 1;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Performance & Lock Contention Hazard

Calling entry.account_name.to_string() on every single non-expired entry in the LRU cache (which can hold up to 65,536 entries) allocates a new String on every iteration, even if the account is already present in the counts map.

This is highly inefficient and holds the global entries mutex lock for much longer than necessary, causing severe lock contention on the hot path of every new session request.

By using counts.get_mut(entry.account_name.as_ref()), we can query the map using a borrowed &str without allocating. We only allocate a String when inserting a new account name for the first time, which reduces the number of allocations from O(N) to O(A) (where A is the number of unique accounts, typically very small).

        let mut counts: HashMap<String, usize> = HashMap::new();
        for (_, entry) in entries.iter() {
            if now.saturating_duration_since(entry.updated_at) > self.ttl {
                continue;
            }
            if let Some(count) = counts.get_mut(entry.account_name.as_ref()) {
                *count += 1;
            } else {
                counts.insert(entry.account_name.to_string(), 1);
            }
        }
References
  1. Avoid allocating temporary collections (such as Vec<String>) and cloning elements solely for hashing. Instead, hash borrowed segments directly by chaining iterators or updating the hasher incrementally, especially on performance-critical hot paths.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — adopted in ac98e5e. account_session_counts_at now borrows the name for the lookup (get_mut(entry.account_name.as_ref())) and only allocates a String when first inserting an account, so allocations drop from O(entries) to O(distinct accounts) and the affinity mutex is held for much less time on the new-session hot path.

I also tightened the same hot path further: the per-account scan is now skipped entirely when a key has a single candidate account (route_strategy=fixed) via a routes.len() > 1 guard at both dispatch sites, since there is nothing to spread there. Behavior is unchanged.

acking-you and others added 2 commits June 3, 2026 00:05
…le-account keys

Address PR review on account_session_counts:

- account_session_counts_at allocated a String for every non-expired LRU entry
  (up to 65,536) while holding the affinity mutex, even for already-counted
  accounts. Borrow the name for the lookup and only allocate when first
  inserting an account, dropping allocations from O(entries) to O(distinct
  accounts) and shortening the lock hold on the new-session hot path.
- Skip the per-account scan entirely when a key has a single candidate account
  (route_strategy=fixed): there is nothing to spread, so the O(N) lock-held
  walk is pure overhead. Guarded with routes.len() > 1 at both dispatch paths.

No behavior change: single-route ordering is invariant, and the count map is
identical regardless of insertion path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… fold

Replace the imperative for-loop + pre-declared map with a filter/fold chain:
TTL-expired entries are dropped by `filter`, and the `HashMap` is `fold`'s
accumulator threaded to the return value (no separate `let mut` / trailing
return). Allocation profile is unchanged — still a borrowed `get_mut` lookup
that only allocates a key for first-seen accounts (O(distinct accounts)) —
because std `HashMap::entry` requires an owned key and the borrowed-key entry
API (`raw_entry_mut`/`entry_ref`) is not on stable. Threading the map by value
through `fold` is a register-sized header move, so codegen matches the loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5138d99f9b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +335 to +336
session_count: session_counts
.map(|counts| counts.get(&route.account_name).copied().unwrap_or(0)),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Aggregate session counts by routing identity

When a route list contains multiple account names for the same upstream Kiro user, this new-session key counts only the individual account_name, while the scheduler, cooldowns, and quota exhaustion operate on route.routing_identity (the next line already reads last-started by identity, and quota failover groups aliases by identity). In a set like alpha/beta sharing one routing identity plus gamma on another, existing sessions on alpha do not raise beta's count, so new sessions can keep being steered to the same upstream account through its alias instead of the actually less-bound identity. Aggregate the affinity counts over the candidate routes' routing identities before applying the fewest-sessions sort key.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, real bug — fixed in da51954. selection_ordered_kiro_routes now folds the per-account affinity counts onto the candidate routes' routing_identity before applying the fewest-sessions key, so aliases of one upstream account share a single load figure — matching how the scheduler, cooldowns, last-started, and quota failover already group. O(routes), keys borrow from routes (zero-alloc). Added kiro_selection_spreads_new_session_by_routing_identity_not_account_name: with alpha/beta on one identity (alpha=3 sessions) and gamma on another, a new session now goes to gamma instead of beta (which would win on per-account counts as the high-credit alias of the loaded identity).

Comment thread crates/llm-access/src/kiro_latency.rs Outdated
Comment on lines +77 to +78
let band_width = (snapshot.global_avg_first_token_ms * BAND_FRACTION).max(1.0);
Some((score / band_width).floor() as i64)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Anchor latency bands around the neutral score

Because the band index is floor(score / width) with buckets anchored at zero, the neutral no-sample score (global_avg) often sits close to a bucket boundary, so accounts that are still within the configured 15% tolerance can land in a lower band and continue to win before round-robin. For example, with a 500 ms global average the width is 75 ms: a no-sample route is band 6, while a sampled 449 ms route (only ~10% faster) is band 5 and will still permanently sort ahead until its measured latency crosses an arbitrary boundary. If the intent is to treat near-global/no-sample accounts as equal, compute the band relative to the global baseline or otherwise center the neutral score inside the same tolerance bucket.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in da51954. Bands are now centered on the neutral baseline: round((score - global_avg) / (2 * tolerance)) with tolerance = global_avg * BAND_FRACTION, so band 0 is global_avg ± tolerance and the no-sample score sits dead-center instead of on a bucket boundary. Your 500ms/449ms example now shares band 0. Reworked the same-band test to a 430ms route score (~14% under the 500ms average): the old floor-from-zero scheme bucketed it at band 5 vs neutral's band 6 (permanent win), the centered scheme keeps both in band 0 so round-robin takes over. BAND_FRACTION is now documented as the tolerance radius.

Comment on lines +110 to +115
self.entries
.lock()
.expect("kiro session affinity mutex")
.iter()
.filter(|(_, entry)| now.saturating_duration_since(entry.updated_at) <= self.ttl)
.fold(HashMap::new(), |mut counts, (_, entry)| {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid scanning affinity cache on every new session

For every brand-new session with multiple routes, this walks the entire session-affinity LRU while holding the synchronous mutex; with the configured limit raised toward LLM_ACCESS_KIRO_SESSION_AFFINITY_MAX_ENTRIES (up to 65,536 entries), a burst of new Kiro conversations can spend request-handler time serially scanning old sessions before route selection. This is avoidable by maintaining per-account counters or a smaller per-key/route snapshot instead of rebuilding counts from all cached entries on each request.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considered, but keeping the point-in-time scan for now — I think the cost is bounded and cold-path:

  • It runs only on the first request of a brand-new session: subsequent turns in that conversation hit affinity and take the sticky fast path, which never scans. And it's already skipped entirely for single-account keys (routes.len() > 1 guard).
  • At the default cap (4096) that's a tens-of-µs walk under the lock, once per new conversation, with allocation already cut to O(distinct accounts).

A maintained per-account/identity counter would have to stay consistent across LRU's implicit capacity eviction and lazy TTL expiry (entries are only reclaimed on lookup), which is materially more error-prone than rebuilding a fold each time. The scan only becomes a hot-path concern if someone raises LLM_ACCESS_KIRO_SESSION_AFFINITY_MAX_ENTRIES toward 65,536 — at which point I'd revisit with a maintained counter backed by profiling. Flagging this as a conscious trade-off rather than an oversight.

…atency bands

Address two P2 findings from Codex review on the account scheduler:

1. Aggregate new-session spread by routing identity (not account name). The
   scheduler, cooldowns, last-started, and quota failover all group by
   `routing_identity`, but the fewest-sessions key counted per `account_name`.
   When several account names alias one upstream Kiro user, sessions bound to
   one alias did not deter new sessions from another alias of the same upstream
   account. selection_ordered_kiro_routes now folds the affinity counts onto
   the candidate routes' routing identities before applying the key (O(routes),
   keys borrow from routes).

2. Center latency bands on the neutral baseline. Bands were floor(score/width)
   anchored at zero, so the no-sample score (global_avg) sat near a bucket
   boundary: a sampled route only ~10% faster could land a band lower and
   permanently out-sort the neutral score before round-robin. Bands are now
   round((score - global_avg) / (2*tolerance)), so band 0 is
   global_avg ± (global_avg * BAND_FRACTION) and near-average / no-sample
   accounts share it. BAND_FRACTION is now the tolerance radius.

Tests: new kiro_selection_spreads_new_session_by_routing_identity_not_account_name
(gamma's less-bound identity wins over a high-credit alias of the loaded
identity); same-band test reworked to a 430ms route score (~14% under average)
that the old floor scheme bucketed away from neutral but the centered scheme
keeps in band 0. 296 llm-access tests pass; clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@acking-you acking-you merged commit e6eecd2 into master Jun 2, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant