[Fleet] Agent status improvements #75236

mostlyjason · 2020-08-17T22:32:10Z

Describe the feature:

Currently, the fleet page shows the status of agents, including whether they are online, offline, or have an error. It also shows whether agents are out of date, enrolling, or unenrolling. However, there is no way to see which agents have integrations that are reporting errors or are unhealthy. Instead, these agents are reported as online and green, which may be misinterpreted as healthy. We need a better way to indicate to administrators that agents are not running as expected and require attention. Endpoint security reported this use case in #74708.

I'd like to propose refactoring the statuses so that the fleet page shows:

Healthy - online and running as expected. There are no agent policy updates or automatic agent binary updates pending, but there may be manual agent binary updates available.
Unhealthy - online but requires attention from an admin because it's reporting errors at the agent or integration level, a process is reported as unhealthy by our health check API ([Agent] Implement an HealthCheck function in every beats. beats#17737), or an upgrade failed and was rolled back.
Updating - online and updating the agent policy or binary, enrolling or unenrolling. This supersedes a status of healthy or unhealthy.
Offline - has not checked in for a minimum amount of time. This supersedes the updating status, since it cannot update while it's offline.
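To make the precedence concrete, here is a minimal sketch of how the four statuses could be resolved. Every name here (`AgentSnapshot`, `rollupStatus`, `OFFLINE_AFTER_MS`) and the five-minute threshold are assumptions for illustration, not existing Kibana or Fleet code.

```typescript
// Hypothetical types and names; a sketch of the proposed precedence only.
type AgentStatus = "healthy" | "unhealthy" | "updating" | "offline";

interface AgentSnapshot {
  lastCheckinMs: number; // time of the last check-in, in epoch milliseconds
  hasErrors: boolean; // agent/integration error, failed health check, or rolled-back upgrade
  isUpdating: boolean; // policy or binary update in progress, enrolling, or unenrolling
}

// Assumed threshold; the proposal only says "a minimum amount of time".
const OFFLINE_AFTER_MS = 5 * 60 * 1000;

function rollupStatus(agent: AgentSnapshot, nowMs: number): AgentStatus {
  // Offline supersedes updating: an agent cannot update while it is offline.
  if (nowMs - agent.lastCheckinMs > OFFLINE_AFTER_MS) return "offline";
  // Updating supersedes healthy and unhealthy.
  if (agent.isUpdating) return "updating";
  if (agent.hasErrors) return "unhealthy";
  return "healthy";
}
```

Checking the conditions in this order means an agent that is both erroring and mid-update shows as updating, and one that stops checking in shows as offline regardless of its last known state.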

Additionally, we can use a separate flag to indicate when manual agent binary updates or agent policy updates are available.

The reason we'd want to provide a summary of statuses on the overview page is to provide a rollup so fleet administrators can determine what is in flux and what requires their attention. Administrators can also filter the list to see just the set of agents requiring their attention, and combine that filter with others to look at a particular agent configuration or integration. Optionally, there could be a way to display sub-status information like "Updating: enrolling".
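As a rough sketch of the rollup and filtering described above, assuming each agent row already carries a resolved status and a policy ID (`AgentRow`, `summarize`, and `needsAttention` are hypothetical names, not existing Fleet APIs):

```typescript
// Hypothetical shapes for illustration only.
type AgentStatus = "healthy" | "unhealthy" | "updating" | "offline";

interface AgentRow {
  id: string;
  status: AgentStatus;
  policyId: string;
}

// Roll up the agent list into per-status counts for the overview page.
function summarize(agents: AgentRow[]): Record<AgentStatus, number> {
  const counts: Record<AgentStatus, number> = {
    healthy: 0,
    unhealthy: 0,
    updating: 0,
    offline: 0,
  };
  for (const agent of agents) {
    counts[agent.status] += 1;
  }
  return counts;
}

// Filter to agents requiring attention, optionally combined with a policy filter.
function needsAttention(agents: AgentRow[], policyId?: string): AgentRow[] {
  return agents.filter(
    (agent) =>
      agent.status === "unhealthy" &&
      (policyId === undefined || agent.policyId === policyId)
  );
}
```

In practice this would more likely be a query against the agent index than an in-memory pass, but the shape of the result is the same.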

The agent details page will show both the overview status and the finer-grained status information to help users identify the cause of problems. It will provide a way for users to see which integrations are healthy, which are disabled due to user preference or condition, and which have errors or failed a health check along with more information on the reason why. There may be a summary of the health for each integration, and the user can see the activity log for more detail.

This also allows us to communicate the status of deployments using the same statuses, rather than having separate statuses just for deployments. #72537

Describe a specific use case for the feature:

As a Fleet administrator, I'd like to identify agents that are not operating as expected and require my attention on the fleet page. It should account not just for the agent status but the status of the integrations as well.
As a Fleet administrator, I'd like to get detailed information about why an agent is not healthy so that I can troubleshoot and fix it. I'd like to identify which specific integrations and error messages are reporting the problem.


elasticmachine · 2020-08-17T22:32:12Z

Pinging @elastic/ingest-management (Team:Ingest Management)

mostlyjason · 2020-08-17T22:32:30Z

@hbharding I'd be interested in your input on this

hbharding · 2020-08-19T18:19:55Z

Hey @mostlyjason, thanks for putting this together. I think this simplifies a lot. I especially like that we can use these same statuses to communicate deployment status.

I created a Whimsical diagram that attempts to capture everything you've described. I organized the diagram so that statuses on the left will always supersede statuses on the right if any of the conditions inside are true. For example, if a policy is "unenrolling", it cannot also be in an "unhealthy" or "healthy" state.

I shared this in our meeting yesterday with Endpoint, and there were questions about items inside the "unhealthy" status. "Unhealthy" makes sense when some integrations could have issues while other integrations are running fine. But what if the agent is "online" and has an error that prevents all data from being sent? Shouldn't we elevate this type of status so that it appears to be more critical? Perhaps it makes sense to introduce a red "error" status like so:

Some questions I have are:

I don't think it's possible to detect if an agent is "enrolling". I think "new" agents would just appear with a status of "healthy", "unhealthy", or "error".
What are some example scenarios that would cause an agent error?
For "unhealthy", you mentioned our healthcheck API could report a process as being unhealthy. What does this mean?

hbharding · 2020-08-19T18:32:11Z

Also, to recap a discussion from yesterday:

re: Integration errors, we talked about maybe adding a way to "pivot" the agent table so that it is focused on policies. If an agent is unhealthy due to an integration error (Endpoint, for example), it is likely that multiple agents will have the same issue because they use the same policy. On the Fleet page, if we report 200 agents as being "unhealthy", how can the user isolate only the agents that are unhealthy because of an Endpoint integration error?
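One way to picture the pivot is to group unhealthy agents by policy, optionally narrowed to a single failing integration. This is only a sketch: `UnhealthyAgent`, `failingIntegrations`, and `pivotByPolicy` are hypothetical, and per-integration status is not something agents report today.

```typescript
// Hypothetical shape; assumes each unhealthy agent lists its failing integrations.
interface UnhealthyAgent {
  id: string;
  policyId: string;
  failingIntegrations: string[];
}

// Group agent IDs by policy, keeping only agents where the given integration fails.
function pivotByPolicy(
  agents: UnhealthyAgent[],
  integration?: string
): Map<string, string[]> {
  const byPolicy = new Map<string, string[]>();
  for (const agent of agents) {
    if (
      integration !== undefined &&
      agent.failingIntegrations.indexOf(integration) === -1
    ) {
      continue;
    }
    const ids = byPolicy.get(agent.policyId) || [];
    ids.push(agent.id);
    byPolicy.set(agent.policyId, ids);
  }
  return byPolicy;
}
```

With 200 unhealthy agents, calling `pivotByPolicy(agents, "endpoint")` would collapse the list to the handful of policies whose agents share the Endpoint error.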

nchaulet · 2020-09-08T14:55:07Z

I don't think it's possible to detect if an agent is "enrolling". I think "new" agents would just appear with a status of "healthy", "unhealthy", or "error"

You are right, the enrolling status we have now is more of an enrolled status. Should we have an enrolled status for agents between enrollment and the first check-in?

ph · 2020-12-18T15:31:46Z

@michalpristas or @nchaulet I can't find the issue for the Elastic Agent related to this effort. Did you ever create one?

nchaulet · 2020-12-18T16:09:44Z

@ph there was no specific issue for that, but this was partially implemented in #84434 (adding the Healthy, Unhealthy, and Updating statuses). There is no per-integration status for now, as we postponed this. The status is still computed by Kibana and not reported by the agent, so we do not have the Updating Policy status.

mostlyjason · 2021-01-15T16:21:05Z

Just want to note that the goal for the next phase is to expose improved status for inputs in the Agent details page, filtered by integration. That applies to the second user story:

As a Fleet administrator, I'd like to get detailed information about why an agent is not healthy so that I can troubleshoot and fix it. I'd like to identify which specific integrations and error messages are reporting the problem.

mostlyjason added the design and Team:Fleet labels Aug 17, 2020

mostlyjason mentioned this issue Aug 20, 2020

[Ingest Manager] Show the progress of updates #72537

Closed

mostlyjason mentioned this issue Sep 22, 2020

[Fleet] Improve agent observability #78188

Open

13 tasks

This was referenced Sep 30, 2020

[Ingest Manager] Add ability to upgrade the agent in Fleet #78469

Closed

[Ingest Manager] Upgrade Agents in Fleet #78810

Merged

ph assigned nchaulet and unassigned nchaulet Oct 19, 2020

ph assigned nchaulet Dec 18, 2020

ph unassigned nchaulet Feb 19, 2021

ph added the Team:Agent Agent Team label Feb 19, 2021

jen-huang changed the title ~~[Ingest Manager] Fleet status improvements~~ [Fleet] Agent status improvements Apr 27, 2021

jen-huang mentioned this issue Jan 23, 2023

[Elastic Agent] Report running processes and their health statuses elastic/elastic-agent#2156

Closed

3 tasks

mostlyjason mentioned this issue Jul 22, 2021

Document Elastic Agent statuses elastic/observability-docs#774

Closed

mostlyjason mentioned this issue Jan 10, 2022

[Fleet] [Bug] If an agent is upgraded while in unhealthy state, its status is stuck at updating but the banner shows as unhealthy #122206

Closed
