[Fleet] Agent status improvements #75236

Open
mostlyjason opened this issue Aug 17, 2020 · 8 comments
Labels
design Team:Agent Agent Team Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@mostlyjason
Contributor

mostlyjason commented Aug 17, 2020

Describe the feature:

Currently, the fleet page shows the status of agents including whether they are online, offline, or have an error. It also shows whether agents are out of date, and enrolling or unenrolling. However, there is no way to see which agents have integrations that are reporting errors or are unhealthy. Instead these agents are reported as online and green, and this may be misinterpreted as healthy. We need a better way to indicate to administrators that agents are not running as expected and require attention. Endpoint security reported this use case in #74708.

I'd like to propose refactoring the statuses so that the fleet page shows:

  • Healthy - online and running as expected. There are no agent policy updates or automatic agent binary updates pending, but there may be manual agent binary updates available.
  • Unhealthy - online but requires attention from an admin because it is reporting errors at the agent or integration level, a process is reported as unhealthy by our health check API ([Agent] Implement an HealthCheck function in every beats. beats#17737), or an upgrade failed and was rolled back.
  • Updating - online and updating the agent policy or binary, enrolling or unenrolling. This supersedes a status of healthy or unhealthy.
  • Offline - has not checked in within a minimum amount of time. This supersedes the updating status, since it cannot update while it's offline.

Additionally, we can use a separate flag to indicate when manual agent binary updates or agent policy updates are available.
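To make the proposed precedence concrete, here is a minimal sketch of how a single status could be derived. It is an illustration only, not how Fleet computes status: the `AgentSnapshot` shape, all of its field names, and the `OFFLINE_AFTER_MS` threshold are hypothetical.

```typescript
// Hypothetical shapes for illustration only; not Fleet's real data model.
type AgentStatus = "healthy" | "unhealthy" | "updating" | "offline";

interface AgentSnapshot {
  lastCheckinMs: number;            // time of the last check-in (epoch ms)
  isEnrolling: boolean;
  isUnenrolling: boolean;
  policyUpdatePending: boolean;     // policy change not yet applied
  binaryUpgradeInProgress: boolean; // automatic binary update running
  hasErrors: boolean;               // agent- or integration-level errors
  healthCheckFailed: boolean;       // a process was reported unhealthy
  upgradeRolledBack: boolean;       // an upgrade failed and was rolled back
  manualUpgradeAvailable: boolean;  // surfaced as a flag, not a status
}

// Assumed value; the proposal only says "a minimum amount of time".
const OFFLINE_AFTER_MS = 5 * 60 * 1000;

function computeStatus(agent: AgentSnapshot, nowMs: number): AgentStatus {
  // Offline supersedes updating: an offline agent cannot update.
  if (nowMs - agent.lastCheckinMs > OFFLINE_AFTER_MS) return "offline";

  // Updating supersedes healthy and unhealthy.
  if (
    agent.isEnrolling ||
    agent.isUnenrolling ||
    agent.policyUpdatePending ||
    agent.binaryUpgradeInProgress
  ) {
    return "updating";
  }

  if (agent.hasErrors || agent.healthCheckFailed || agent.upgradeRolledBack) {
    return "unhealthy";
  }
  return "healthy";
}

// The "update available" indicator is independent of the status above.
function hasUpdateAvailable(agent: AgentSnapshot): boolean {
  return agent.manualUpgradeAvailable;
}
```

Checking the conditions from most to least dominant keeps the "supersedes" rules in one place, and keeping the update flag out of `AgentStatus` means a healthy agent can still show that a manual upgrade is available.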

The reason we'd want to provide a summary of statuses on the overview page is to provide a rollup so fleet administrators can determine what is in flux and what requires their attention. Administrators can also filter the list to see just the set of agents requiring their attention, and combine that filter with others to look at a particular agent configuration or integration. Optionally, there could be a way to display sub-status information like "Updating: enrolling".
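As a rough sketch of the rollup and filtering described above, assuming a hypothetical `AgentRow` shape (the field names are illustrative and not taken from Fleet):

```typescript
// Hypothetical row shape for illustration only.
type AgentStatus = "healthy" | "unhealthy" | "updating" | "offline";

interface AgentRow {
  id: string;
  status: AgentStatus;
  policyId: string;
  subStatus?: string; // optional finer-grained detail, e.g. "enrolling"
}

// Rollup for the overview page: how many agents are in each status.
function summarize(agents: AgentRow[]): Record<AgentStatus, number> {
  const counts: Record<AgentStatus, number> = {
    healthy: 0,
    unhealthy: 0,
    updating: 0,
    offline: 0,
  };
  for (const agent of agents) counts[agent.status] += 1;
  return counts;
}

// Combine a status filter with a policy filter; omitted criteria match all.
function filterAgents(
  agents: AgentRow[],
  filter: { status?: AgentStatus; policyId?: string }
): AgentRow[] {
  return agents.filter(
    (a) =>
      (filter.status === undefined || a.status === filter.status) &&
      (filter.policyId === undefined || a.policyId === filter.policyId)
  );
}

// Display label with optional sub-status, e.g. "Updating: enrolling".
function statusLabel(agent: AgentRow): string {
  const name = agent.status.charAt(0).toUpperCase() + agent.status.slice(1);
  return agent.subStatus ? `${name}: ${agent.subStatus}` : name;
}
```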

The agent details page will show both the overview status and the finer-grained status information to help users identify the cause of problems. It will provide a way for users to see which integrations are healthy, which are disabled due to user preference or condition, and which have errors or failed a health check along with more information on the reason why. There may be a summary of the health for each integration, and the user can see the activity log for more detail.
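One possible shape for the finer-grained data behind the details page is sketched below. The `IntegrationStatus` type, its state names, and the helper functions are assumptions made for illustration; nothing here reflects an existing Fleet or Agent API.

```typescript
// Hypothetical per-integration status for illustration only.
type IntegrationState = "healthy" | "disabled" | "error" | "failed_health_check";

interface IntegrationStatus {
  name: string;
  state: IntegrationState;
  reason?: string; // why it is disabled, errored, or failing its health check
}

// Integrations an admin should look at first.
function needingAttention(integrations: IntegrationStatus[]): IntegrationStatus[] {
  return integrations.filter(
    (i) => i.state === "error" || i.state === "failed_health_check"
  );
}

// One-line summary per integration, with the reason when one is known.
function describeIntegration(i: IntegrationStatus): string {
  return i.reason ? `${i.name}: ${i.state} (${i.reason})` : `${i.name}: ${i.state}`;
}
```

Keeping "disabled" separate from "error" lets the page distinguish integrations turned off by user preference or condition from those that actually need troubleshooting.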

This also allows us to communicate the status of deployments using the same statuses, rather than having separate statuses just for deployments. #72537

Describe a specific use case for the feature:

  • As a Fleet administrator, I'd like to identify agents that are not operating as expected and require my attention on the fleet page. It should account not just for the agent status but the status of the integrations as well.
  • As a Fleet administrator, I'd like to get detailed information about why an agent is not healthy so that I can troubleshoot and fix it. I'd like to identify which specific integrations and error messages are reporting the problem.
@mostlyjason mostlyjason added design Team:Fleet Team label for Observability Data Collection Fleet team labels Aug 17, 2020
@elasticmachine
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

@mostlyjason
Contributor Author

@hbharding I'd be interested in your input on this

@hbharding
Contributor

hbharding commented Aug 19, 2020

Hey @mostlyjason, thanks for putting this together. I think this simplifies a lot. I especially like that we can use these same statuses to communicate deployment status.

I created a Whimsical diagram that attempts to capture everything you've described. I organized the diagram so that statuses on the left will always supersede statuses on the right if any of the conditions inside are true. For example, if an agent is "unenrolling", it cannot also be in an "unhealthy" or "healthy" state.

image

I shared this in our meeting yesterday with Endpoint, and there were questions about items inside the "unhealthy" status. "Unhealthy" makes sense when some integrations could have issues while other integrations are running fine. But what if the agent is "online" and has an error that prevents all data from being sent? Shouldn't we elevate this type of status so that it appears to be more critical? Perhaps it makes sense to introduce a red "error" status like so:

image

Some questions I have are:

  • I don't think it's possible to detect if an agent is "enrolling". I think "new" agents would just appear with a status of "healthy", "unhealthy", or "error".
  • What are some example scenarios that would cause an agent error?
  • For "unhealthy", you mentioned our healthcheck API could report a process as being unhealthy. What does this mean?

@hbharding
Contributor

Also, to recap a discussion from yesterday:

re: Integration errors, we talked about maybe adding a way to "pivot" the agent table so that it is focused on policies. If an agent is unhealthy due to an integration error (Endpoint, for example), it is likely that multiple agents will have the same issue because they use the same policy. On the Fleet page, if we report 200 agents as being "unhealthy", how can the user isolate only the agents that are unhealthy because of an Endpoint integration error?
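A rough sketch of what such a pivot could compute, assuming each unhealthy agent carries a hypothetical `policyId` and a list of `failingIntegrations` (neither name comes from Fleet):

```typescript
// Hypothetical shape for illustration only.
interface UnhealthyAgent {
  id: string;
  policyId: string;
  failingIntegrations: string[]; // e.g. ["endpoint"]
}

// Group agent ids by "policyId/integration" so one shared root cause
// shows up as a single row instead of hundreds of separate agents.
function groupByPolicyAndIntegration(
  agents: UnhealthyAgent[]
): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const agent of agents) {
    for (const integration of agent.failingIntegrations) {
      const key = `${agent.policyId}/${integration}`;
      const ids = groups.get(key) ?? [];
      ids.push(agent.id);
      groups.set(key, ids);
    }
  }
  return groups;
}
```

With this grouping, the 200-agent case would collapse into a handful of policy and integration pairs, each listing the agents it affects.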

@nchaulet
Member

nchaulet commented Sep 8, 2020

I don't think it's possible to detect if an agent is "enrolling". I think "new" agents would just appear with a status of "healthy", "unhealthy", or "error".

You are right: the enrolling status we have now is more of an enrolled status. Should we have an enrolled status for agents between enrollment and the first check-in?

@ph
Contributor

ph commented Dec 18, 2020

@michalpristas or @nchaulet I can't find the issue for the Elastic Agent related to this effort. Did you ever create one?

@nchaulet
Member

@ph there was no specific issue for that, but this was partially implemented in #84434 (adding the Healthy, Unhealthy, and Updating statuses). There is no per-integration status yet, as we postponed it, and the status is still computed by Kibana rather than reported by the agent, so we do not have the Updating Policy status.

@mostlyjason
Contributor Author

Just want to note that the goal for the next phase is to expose improved status for inputs in the Agent details page, filtered by integration. That applies to the second user story:

As a Fleet administrator, I'd like to get detailed information about why an agent is not healthy so that I can troubleshoot and fix it. I'd like to identify which specific integrations and error messages are reporting the problem.
