Add support for quicker Fleet check-in upon component status #9982
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Worst case scenario: During the agent's start-up, potentially many state changes might occur in the inputs till the "healthy" state is achieved. At that moment, a lot of cancellations might happen. The already implemented backoff mechanism covers this case as well.
Are we sure about this? The limiting factor for the number of agents Fleet can manage is usually the frequency of authentication and API key operations in general, and this has the potential to drastically increase those. The backoff mechanism is still faster than 5 minutes.
The safest way to introduce this is to put it behind a configuration flag, that is initially only turned on in agentless.
Then we can look at enabling it in the Horde based scale tests, which will need this change ported into the Horde drones and require us to add the ability for the Horde drone health status to change.
This is a very high consequence change as bugs or interactions here can potentially DDoS all existing Fleet Servers accidentally.
We are absolutely going to need more than a unit test to validate this before enabling it by default. As mentioned already, I'm OK introducing this behind a feature toggle as long as the change has no impact when it is disabled.
Probably the fleet configuration section is the place to put a configuration flag for this.
Overall this looks good. I am going to wait for a green CI before approval, but this looks good.
LGTM, tomorrow is a holiday for me so delegated approval to @blakerouse and removing my request for changes.
💚 Build Succeeded
cc @moukoublen
@blakerouse CI seems ok now
Looks good. Thanks for working through this with me and getting it to be mergeable.
What does this PR do?
At the moment, the fleet server "holds" a long poll of 5 minutes on every elastic agent check-in. As a result, when an input has an issue and is in an error or degraded state, the information can take up to 5 minutes to be relayed to Fleet.
In this PR, a "state watch" is introduced into the fleet gateway component to cancel ongoing check-in requests if a state change occurs while the elastic-agent waits on check-in.
The flow is this:
Worst case scenario: During the agent's start-up, potentially many state changes might occur in the inputs till the "healthy" state is achieved. At that moment, a lot of cancellations might happen. The already implemented backoff mechanism covers this case as well.
The "state watch and cancel ongoing checkin" functionality works only if the fleet agent config checkin.mode is set to on_state_changed (will be set by agentless controller on agentless). Otherwise, the flow works without any change.
New configuration checkin added under the FleetAgentConfig
On agentless we can set (through the agentless controller) the above config to enable fast checkin.
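As a rough sketch of how such a toggle could be modelled, the Go snippet below reduces the configuration to the one setting named in this description. The struct layout, the field names, and the `cancelOnStateChange` helper are assumptions made for illustration. Only the `checkin.mode` key, the `on_state_changed` value, and the `FleetAgentConfig` name come from the text above.

```go
package main

import "fmt"

// modeOnStateChanged is the only mode value named in the PR description.
const modeOnStateChanged = "on_state_changed"

// Checkin sketches the new "checkin" section (hypothetical layout).
type Checkin struct {
	Mode string
}

// FleetAgentConfig is reduced here to the single field relevant to this PR.
type FleetAgentConfig struct {
	Checkin *Checkin
}

// cancelOnStateChange reports whether the state watch should cancel ongoing
// check-ins. It is nil-safe, so a missing config or a missing checkin section
// leaves the existing flow untouched.
func (c *FleetAgentConfig) cancelOnStateChange() bool {
	return c != nil && c.Checkin != nil && c.Checkin.Mode == modeOnStateChanged
}

func main() {
	defaults := &FleetAgentConfig{}
	agentless := &FleetAgentConfig{Checkin: &Checkin{Mode: modeOnStateChanged}}

	fmt.Println("default:", defaults.cancelOnStateChange())
	fmt.Println("agentless:", agentless.cancelOnStateChange())
}
```

Treating every value other than on_state_changed as "off" matches the requirement raised in review that the change has no impact when the toggle is disabled.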
Why is it important?
On agentless environments where the agent logs are not available to the customer, the only way to communicate operational issues (e.g., misconfigured credentials) is through the degraded state status message of the input/component. But creating an agentless agent with wrong credentials will take up to 5 minutes to display the degraded status, and meanwhile, the input will appear as healthy. This provides a bad experience to the customer.
Checklist
./changelog/fragments using the changelog tool
Disruptive User Impact