Add support for quicker Fleet check-in upon component status #9982

moukoublen · 2025-09-16T05:56:51Z

What does this PR do?

At the moment, Fleet Server "holds" a long poll of 5 minutes on every Elastic Agent check-in. As a result, when an input has an issue and is in an error or degraded state, that information can take up to 5 minutes to reach Fleet.

This PR introduces a "state watch" in the Fleet gateway component that cancels an ongoing check-in request if a state change occurs while the Elastic Agent is waiting on the check-in.

The flow is this:

At check-in start, the current state is fetched (under a mutex) and a context with a cancel function is created. The cancel function is stored for later use.
On the happy path (the check-in completes successfully), the stored cancel function is cleaned up.
If a state update arrives on the state channel while waiting for the check-in response (long poll), the context is canceled with a specific cause. This makes the check-in restart with the updated state.

Worst case scenario: During the agent's start-up, potentially many state changes might occur in the inputs till the "healthy" state is achieved. At that moment, a lot of cancellations might happen. The already implemented backoff mechanism covers this case as well.

The "state watch and cancel ongoing check-in" functionality is active only when the Fleet agent config option checkin.mode is set to on_state_changed (on agentless, the agentless controller will set it). Otherwise, the flow is unchanged.

A new checkin configuration section is added under FleetAgentConfig:

fleet:
  # ...
  checkin:
    mode: "standard" # choose between `standard` or `on_state_change`
    request_backoff_init: 5s # default 60 seconds
    request_backoff_max: 10m # default 10 mins

On agentless we can set (through the agentless controller) the above config to enable fast checkin.

Why is it important?

In agentless environments, where the agent logs are not available to the customer, the only way to communicate operational issues (e.g., misconfigured credentials) is through the degraded state status message of the input/component. But an agentless agent created with wrong credentials takes up to 5 minutes to display the degraded status, and in the meantime the input appears healthy. This is a bad experience for the customer.

Checklist

I have read and understood the pull request guidelines of this project.
My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool
I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Related issues

Closes Add support for quicker Fleet check-in upon component status #9348

Questions to ask yourself

How are we going to support this in production?
How are we going to measure its adoption?
How are we going to debug this?
What are the metrics I should take care of?
...

elasticmachine · 2025-09-17T12:24:51Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

cmacknz

Worst case scenario: During the agent's start-up, potentially many state changes might occur in the inputs till the "healthy" state is achieved. At that moment, a lot of cancellations might happen. The already implemented backoff mechanism covers this case as well.

Are we sure about this? The limiting factor for the number of agents Fleet can manage is usually the frequency of authentication and API key operations in general, and this has the potential to drastically increase those. The backoff mechanism is still faster than 5 minutes.

The safest way to introduce this is to put it behind a configuration flag that is initially only turned on in agentless.

Then we can look at enabling it in the Horde based scale tests, which will need this change ported into the Horde drones and require us to add the ability for the Horde drone health status to change.

This is a very high consequence change as bugs or interactions here can potentially DDoS all existing Fleet Servers accidentally.

We are absolutely going to need more than a unit test to validate this before enabling it by default. As mentioned already, I'm OK introducing this behind a feature toggle as long as the change has no impact when it is disabled.

Probably the fleet configuration section is the place to put a configuration flag for this.

internal/pkg/agent/application/managed_mode.go

internal/pkg/agent/application/gateway/fleet/fleet_gateway.go

blakerouse

Overall this looks good. I am going to wait for a green CI before approval, but this looks good.

internal/pkg/agent/application/managed_mode.go

internal/pkg/agent/configuration/fleet.go

changelog/fragments/1758032089-quicker-fleet-check-in.yaml

@blakerouse

LGTM. Tomorrow is a holiday for me, so I have delegated approval to @blakerouse and am removing my request for changes.

elastic-sonarqube · 2025-09-29T19:32:23Z

Quality Gate passed

Issues
5 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
63.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

elasticmachine · 2025-09-29T19:59:59Z

💚 Build Succeeded

Buildkite Build
Commit: 0ff1cce

History

💚 Build #27719 succeeded dc21d56
💔 Build #27690 failed 3332130
💔 Build #27671 failed b4257d9
💔 Build #27662 failed 51fe3cb
💔 Build #27658 failed a1ff973
💚 Build #27565 succeeded fcc1752

cc @moukoublen

moukoublen · 2025-09-29T20:56:09Z

@blakerouse CI seems ok now

blakerouse

Looks good. Thanks for working through this with me and getting it to be mergeable.

moukoublen requested a review from a team as a code owner September 16, 2025 05:56

moukoublen requested review from ycombinator and straistaru September 16, 2025 05:56

moukoublen added the enhancement New feature or request label Sep 16, 2025

mergify bot assigned moukoublen Sep 16, 2025

moukoublen force-pushed the poc_fast_fail_checkin branch 2 times, most recently from 5c207f2 to 5af9eda Compare September 16, 2025 06:19

moukoublen added the backport-skip label Sep 16, 2025

elastic deleted a comment from mergify bot Sep 16, 2025

moukoublen force-pushed the poc_fast_fail_checkin branch from 5af9eda to ff78030 Compare September 16, 2025 12:55

moukoublen mentioned this pull request Sep 16, 2025

Add support for quicker Fleet check-in upon component status #9348

Closed

pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Sep 17, 2025

cmacknz requested a review from blakerouse September 17, 2025 15:50

cmacknz previously requested changes Sep 17, 2025

moukoublen added 3 commits September 23, 2025 15:36

poc checkin fast fail on state change

cced74c

add fragment

e139a8c

make fast checkin configurable

a842971

moukoublen force-pushed the poc_fast_fail_checkin branch from 6e92158 to a842971 Compare September 23, 2025 12:37

moukoublen added 2 commits September 23, 2025 16:20

add unittests

38223bf

unit test

2617b32

olegsu reviewed Sep 23, 2025

internal/pkg/agent/application/managed_mode.go

moukoublen added 2 commits September 23, 2025 17:42

rename variable

741b041

warn log; rename env var

ac956fd

moukoublen requested a review from cmacknz September 23, 2025 14:49

moukoublen commented Sep 23, 2025

internal/pkg/agent/application/gateway/fleet/fleet_gateway.go

moukoublen added 4 commits September 23, 2025 17:51

fix message

1f8c1b0

lint fix

71a2b68

fix test

e2fb045

fix test

6c6744f

moukoublen dismissed olegsu’s stale review via 62b3f0e September 27, 2025 10:16

logs on state change cancelation

a1ff973

moukoublen requested a review from cmacknz September 27, 2025 10:47

moukoublen and others added 5 commits September 27, 2025 19:00

Merge branch 'main' into poc_fast_fail_checkin

03c995b

fix and more in-depth test

51fe3cb

Merge branch 'main' into poc_fast_fail_checkin

b4257d9

Merge branch 'main' into poc_fast_fail_checkin

3332130

make backoff settings configurable from fleet checkin config

b07dc4a

blakerouse reviewed Sep 29, 2025

remove fragment

dc21d56

moukoublen added the skip-changelog label Sep 29, 2025

moukoublen requested a review from blakerouse September 29, 2025 14:45

olegsu previously approved these changes Sep 29, 2025
