moukoublen
Member

@moukoublen moukoublen commented Sep 16, 2025

What does this PR do?

At the moment, the fleet server "holds" a long poll of 5 minutes on every elastic agent check-in. As a result, when an input has an issue and is in an error or degraded state, that information can take up to 5 minutes to reach Fleet.

This PR introduces a "state watch" in the fleet gateway component that cancels an ongoing check-in request if a state change occurs while the elastic-agent is waiting on the check-in.

The flow is this:

  • When a check-in starts, the current state is fetched and a context with a cancel function is created (both under a mutex). The cancel function is stored for later use.
  • On the happy path (the check-in completes successfully), the stored cancel function is cleaned up.
  • If a state update arrives on the state channel while waiting for the check-in response (long poll), the context is canceled with a specific cause. The check-in then restarts with the updated state.

Worst case scenario: During the agent's start-up, potentially many state changes might occur in the inputs till the "healthy" state is achieved. At that moment, a lot of cancellations might happen. The already implemented backoff mechanism covers this case as well.

The "state watch and cancel ongoing check-in" functionality is active only if the fleet agent config checkin.mode is set to on_state_changed (the agentless controller will set this on agentless). Otherwise, the flow is unchanged.

A new configuration section, checkin, is added under FleetAgentConfig:

fleet:
  # ...
  checkin:
    mode: "standard" # choose between `standard` or `on_state_change`
    request_backoff_init: 5s # default 60 seconds
    request_backoff_max: 10m # default 10 mins

On agentless, we can set the above config (through the agentless controller) to enable fast check-in.

Why is it important?

In agentless environments, where the agent logs are not available to the customer, the only way to communicate operational issues (e.g., misconfigured credentials) is through the degraded-state status message of the input/component. However, an agentless agent created with wrong credentials can take up to 5 minutes to display the degraded status, and in the meantime the input appears healthy. This is a bad experience for the customer.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@moukoublen moukoublen requested a review from a team as a code owner September 16, 2025 05:56
@moukoublen moukoublen added the enhancement New feature or request label Sep 16, 2025
@moukoublen moukoublen force-pushed the poc_fast_fail_checkin branch 2 times, most recently from 5c207f2 to 5af9eda Compare September 16, 2025 06:19
@elastic elastic deleted a comment from mergify bot Sep 16, 2025
@moukoublen moukoublen force-pushed the poc_fast_fail_checkin branch from 5af9eda to ff78030 Compare September 16, 2025 12:55
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Sep 17, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz cmacknz requested a review from blakerouse September 17, 2025 15:50
Member

@cmacknz cmacknz left a comment

Worst case scenario: During the agent's start-up, potentially many state changes might occur in the inputs till the "healthy" state is achieved. At that moment, a lot of cancellations might happen. The already implemented backoff mechanism covers this case as well.

Are we sure about this? The limiting factor for the number of agents Fleet can manage is usually the frequency of authentication and API key operations, and this has the potential to drastically increase those. The backoff mechanism is still faster than 5 minutes.

The safest way to introduce this is to put it behind a configuration flag that is initially only turned on in agentless.

Then we can look at enabling it in the Horde-based scale tests, which will need this change ported into the Horde drones and require us to add the ability for the Horde drone health status to change.

This is a very high-consequence change, as bugs or interactions here could accidentally DDoS all existing Fleet Servers.

We are absolutely going to need more than a unit test to validate this before enabling it by default. As mentioned already, I'm OK introducing this behind a feature toggle as long as the change has no impact when it is disabled.

Probably the fleet configuration section is the place to put a configuration flag for this.

@moukoublen moukoublen force-pushed the poc_fast_fail_checkin branch from 6e92158 to a842971 Compare September 23, 2025 12:37
@moukoublen moukoublen requested a review from cmacknz September 23, 2025 14:49
@moukoublen moukoublen requested a review from cmacknz September 27, 2025 10:47
Contributor

@blakerouse blakerouse left a comment

Overall this looks good. I am going to wait for a green CI before approval, but this looks good.

olegsu
olegsu previously approved these changes Sep 29, 2025
@moukoublen moukoublen requested a review from cmacknz September 29, 2025 18:28
@cmacknz cmacknz dismissed their stale review September 29, 2025 18:38

LGTM. Tomorrow is a holiday for me, so I have delegated approval to @blakerouse and am removing my request for changes.

@elasticmachine
Collaborator

💚 Build Succeeded

History

cc @moukoublen

@moukoublen
Member Author

@blakerouse CI seems ok now

Contributor

@blakerouse blakerouse left a comment

Looks good. Thanks for working through this with me and getting it to be mergeable.

@moukoublen moukoublen merged commit 08dde42 into elastic:main Sep 30, 2025
23 checks passed
@moukoublen moukoublen deleted the poc_fast_fail_checkin branch September 30, 2025 07:43
Labels
backport-skip enhancement New feature or request skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for quicker Fleet check-in upon component status
6 participants