feat: add configurable limit on concurrent bulk dispatch goroutines by ycombinator · Pull Request #6751 · elastic/fleet-server

ycombinator · 2026-04-03T22:09:11Z

What is the problem this PR solves?

During a 30k Serverless scale test, 22 of 39 fleet-server pods were OOMKilled. Analysis of the captured pod logs showed:

An upgrade and policy reassignment storm caused checkin durations to spike from their normal millisecond range to an average of 5-45 seconds.
This created ~479 concurrent checkins per pod, each blocked in dispatch() waiting to enqueue onto the bulk engine's channel (capacity 32).
Elasticsearch was not the bottleneck — ES had ~88% free heap after GC with sub-30ms pause times.
Each blocked goroutine holds its stack (~4-8KB) plus the bulkT object. With no upper bound on concurrent dispatches, goroutines piled up until pods hit their memory limit (~154 Mi) and were killed.

How does this PR solve the problem?

This adds an optional cap on concurrent dispatch goroutines to bound memory usage.

The limit check runs at the top of dispatch(), before blocking on the channel:

fleet-server/internal/pkg/bulk/engine.go

Lines 608 to 622 in 1f456fb

    
    // Check pending dispatch limit before blocking on the channel.
    if limit := b.opts.maxPendingDispatches; limit > 0 {
    	pending := b.pendingDispatches.Add(1)
    	defer b.pendingDispatches.Add(-1)
    	if pending > int64(limit) {
    		zerolog.Ctx(ctx).Warn().
    			Str("mod", kModBulk).
    			Str("action", blk.action.String()).
    			Int64("pending", pending).
    			Int("limit", limit).
    			Msg("Dispatch rejected: too many pending")
    		b.freeBlk(blk)
    		return respT{err: ErrTooManyDispatches}
    	}
    }

When the limit is reached, the dispatch is rejected immediately with ErrTooManyDispatches, which maps to HTTP 429 so agents retry on their next checkin interval:

fleet-server/internal/pkg/api/error.go

Lines 181 to 189 in 1f456fb

    
    {
    	bulk.ErrTooManyDispatches,
    	HTTPErrResp{
    		http.StatusTooManyRequests,
    		"TooManyDispatches",
    		"too many pending dispatches",
    		zerolog.WarnLevel,
    	},
    },

The limit is configurable via max_pending_dispatches in the bulk config:

fleet-server/internal/pkg/config/input.go

Line 52 in 1f456fb

MaxPendingDispatches int `config:"max_pending_dispatches"`

The default is 0 (no limit) so existing deployments are unaffected. Operators opt in by setting a value appropriate for their deployment size.

How to test this PR locally

go test -race ./internal/pkg/bulk/ -run TestDispatch -v

Design Checklist

I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

I have commented my code, particularly in hard-to-understand areas
I have added tests that prove my fix is effective or that my feature works

Related issues

Relates to #6747 (fix: free bulkT on dispatch abort to prevent memory leak under load)

🤖 Generated with Claude Code

When agent count exceeds what the bulk engine can process, goroutines pile up in dispatch() waiting to send on the 32-slot channel. Each blocked goroutine holds its stack plus the bulkT object. With 30k+ agents under upgrade/policy storms, this grows unbounded until OOM.

This adds an optional cap (max_pending_dispatches) on concurrent dispatch goroutines. When the limit is reached, new dispatches are rejected immediately with ErrTooManyDispatches, which maps to HTTP 429. Agents retry on their next checkin interval, spreading load over time.

The default is 0 (no limit) so existing deployments are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mergify · 2026-04-04T22:18:52Z

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit
backport-active-all is the label that automatically backports to all active branches.
backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

michel-laterman

We really should be consistent with the type of maxPendingBulkDispatches; either have it as an int64 in the implementation + config or just an int

Also the new tests don't follow our existing test structures with the use of the require test package

michel-laterman · 2026-04-06T18:25:36Z

internal/pkg/bulk/dispatch_limit_test.go

+	if resp.err != ErrTooManyBulkDispatches {
+		t.Fatalf("expected ErrTooManyBulkDispatches, got: %v", resp.err)
+	}


As the linter says, errors.Is should be used.
Or the require package for the error check

Fixed in 9548f86.

michel-laterman · 2026-04-06T18:26:38Z

internal/pkg/bulk/dispatch_limit_test.go

+	if resp.err != nil {
+		t.Fatalf("expected no error, got: %v", resp.err)
+	}


We should add something to AGENTS.md that gets claude to use the require package

Done in a separate PR: #6756

michel-laterman · 2026-04-06T18:28:39Z

internal/pkg/bulk/opt.go

+	blockQueueSz             int
+	apikeyMaxParallel        int
+	apikeyMaxReqSize         int
+	maxPendingBulkDispatches int


Shouldn't this be int64?

Fixed in b0f71eb.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ycombinator mentioned this pull request Apr 3, 2026

fix: free bulkT on dispatch abort to prevent memory leak under load #6747

Open

5 tasks

ycombinator marked this pull request as ready for review April 3, 2026 22:11

ycombinator requested a review from a team as a code owner April 3, 2026 22:11

ycombinator requested review from michel-laterman and swiatekm April 3, 2026 22:11

refactor: rename dispatches to bulk dispatches for clarity

f22dfe6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ycombinator changed the title ~~feat: add configurable limit on concurrent dispatch goroutines~~ feat: add configurable limit on concurrent bulk dispatch goroutines Apr 3, 2026

pierrehilbert added the Team:Elastic-Agent-Control-Plane label Apr 4, 2026

style: run go fmt on changed files

b6b51a2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

michel-laterman requested changes Apr 6, 2026


ycombinator and others added 3 commits April 6, 2026 14:05

fix: use require package and errors.Is in dispatch limit tests

9548f86

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

refactor: use int64 consistently for maxPendingBulkDispatches

b0f71eb

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

style: run goimports on engine.go

77fd4bf

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ycombinator requested a review from michel-laterman April 6, 2026 21:12

ycombinator wants to merge 6 commits into elastic:main from ycombinator:fix/pending-dispatch-limit

3 participants
