[Fleet] Define UX design for information that is displayed when the user clicks on “Agent” to view its details #81872

Open
ravikesarwani opened this issue Oct 28, 2020 · 6 comments

@ravikesarwani
Contributor

Scenario: On the top-level Fleet page (where all Agents are listed), the user sees that a particular Agent is not “Healthy” and clicks that Agent to view its details and debug the issue.

The view is designed to present all the information (or links to it) that helps users debug an issue with the Agent.

The following details need to be captured in the design:

Status: The status of the Agent is broken into the following pieces:

  • Status of integration: Is the Agent working correctly for each of the integrations it's configured for, and is it able to collect and send data to ES?
  • Connection to Fleet: Can the Agent receive any configuration changes?
  • Connection for observability data: Can the Agent send observability data (logs and metrics) to ES?

When any of these statuses is not healthy, we need to show information that helps the user debug the issue:

  • When was the last successful connection made
  • Last error message(s)
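
One possible way to model these status pieces is sketched below in TypeScript. All type, field, and function names are hypothetical and are not taken from the Fleet codebase; the sketch only shows how each piece could carry its own debug information.

```typescript
// Hypothetical model of the three status pieces described above.
type Health = "healthy" | "unhealthy";

interface StatusPiece {
  label: string;                     // e.g. an integration name, "fleet", "observability"
  health: Health;
  lastSuccessfulConnection?: string; // ISO timestamp of the last successful connection
  lastErrors: string[];              // last error message(s)
}

interface AgentStatusDetail {
  integrations: StatusPiece[];          // one entry per configured integration
  fleetConnection: StatusPiece;         // can the Agent receive configuration changes?
  observabilityConnection: StatusPiece; // can the Agent send logs and metrics to ES?
}

// Return the pieces for which the page should show debug information.
function piecesNeedingDebugInfo(detail: AgentStatusDetail): StatusPiece[] {
  return detail.integrations
    .concat([detail.fleetConnection, detail.observabilityConnection])
    .filter((piece) => piece.health !== "healthy");
}
```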

Historical view of Agent status (nice to have):
A historical view of the Agent status is nice to have for the initial release.

Agent and its sub processes (beats) logs:
Logs help users debug issues with Agent.

  • The page provides a way for the user to explore the logs of the Agent and its sub processes.
  • The logs are shown in place on this page for a better user experience, but complex use cases can link to the Logs UI.
  • Users should be provided a way to filter logs. This helps the user find the root cause of a specific problem more easily. Example: The user sees that a specific integration (NGINX metrics) is not working and wants to filter metricbeat logs so that the root cause can be identified quickly.

Agent and its sub processes (beats) metrics:
The metrics of the Agent and its sub processes help the user find the root cause of any performance-related issues.
Key questions the user is trying to answer:

  • What resources are the Agent and its sub processes using?
  • Is the Agent able to keep up with the amount of data being generated (e.g. application logs)?

This metrics view helps the user understand:

  • Resources being used by the Agent over time (CPU, Memory, Network, System load, Open handles)
  • Whether the Agent is able to keep up with the data being generated (Event rate, Fail rate, Error)

The Agent acts as a supervisor that runs other processes (beats) to perform the main tasks.

From a user perspective, it's critical that the metrics can be viewed for the Agent as a whole (by default) and can also be filtered for a specific sub process. This is critical for debugging issues with a specific integration.

The logs and metrics should be controlled by the same filter and time picker so the user can investigate an Agent issue faster.
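
To illustrate the shared filter and time picker, here is a minimal sketch in which one filter state produces the query for both views. The interface, the function names, and the field names `agent.id` and `process.name` are assumptions made for this example, not the fields Fleet actually uses.

```typescript
// Hypothetical filter state shared by the logs view and the metrics view.
interface InvestigationFilter {
  agentId: string;
  subProcess?: string; // e.g. "metricbeat"; omitted means the Agent as a whole
  from: string;        // time picker start, e.g. "now-1h"
  to: string;          // time picker end, e.g. "now"
}

interface ViewQuery {
  kind: "logs" | "metrics";
  query: string;
  from: string;
  to: string;
}

// Field names are illustrative assumptions only.
function toQueryClauses(filter: InvestigationFilter): string {
  const clauses = [`agent.id: "${filter.agentId}"`];
  if (filter.subProcess !== undefined) {
    clauses.push(`process.name: "${filter.subProcess}"`);
  }
  return clauses.join(" and ");
}

// Both views receive the same query string and the same time range.
function buildViewQueries(filter: InvestigationFilter): ViewQuery[] {
  const query = toQueryClauses(filter);
  const kinds: Array<"logs" | "metrics"> = ["logs", "metrics"];
  return kinds.map((kind) => ({ kind, query, from: filter.from, to: filter.to }));
}
```

Because both queries derive from a single state object, changing the sub process or the time range in one place updates logs and metrics together.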

CC: @hbharding @mukeshelastic @mostlyjason @ph @nchaulet

@ph
Contributor

ph commented Oct 28, 2020

cc @jen-huang

@timroes added v7.11.0 and removed v7.11 labels Oct 29, 2020
@hbharding
Contributor

Link to wireframes (in progress, will review with team this week)

@hbharding
Contributor

hbharding commented Nov 17, 2020

Update

cc @ravikesarwani @mostlyjason @mukeshelastic @nchaulet @ph

View Figma file

During 7.11 and 7.12, we will make incremental changes to improve Agent Observability in Fleet. By 7.12, the goal is to provide detailed status information about an Elastic Agent's integrations and its inputs, as well as give the user the ability to view logs and metrics so they can diagnose and fix specific issues. Some of the more detailed information and functionality won't land until 7.12, but we will be making some UI changes in 7.11 in anticipation of this improved functionality.

I've broken this message into 2 parts for 7.11 and 7.12. There is a lot to unpack, so please ask questions or leave feedback if you have any.


7.11

In 7.11, we'd like to change how we display status information on the Agents table in Fleet. An Agent's status can be one of the following:

  • offline - The agent did not check in and cannot be found.
  • healthy - All processes are healthy.
  • updating - The agent is enrolling, unenrolling, updating agent policy, or updating agent binary.
  • unhealthy - An issue was found with one or more of the agent's processes.

Note: if it's not possible to refactor the agent status to match the above, we can reuse our existing statuses but still fit them into the design that I am proposing in the screenshot below.
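
The four statuses above could be derived roughly as follows. This is only a sketch: the `AgentSnapshot` fields are hypothetical, and the precedence (offline first, then updating, then process health) is an assumption, since the list above does not say which status wins when several conditions hold.

```typescript
type AgentStatus = "offline" | "healthy" | "updating" | "unhealthy";

// Hypothetical inputs; the real agent record may look different.
interface AgentSnapshot {
  checkedIn: boolean;        // false when the agent did not check in
  changeInProgress: boolean; // enrolling, unenrolling, or updating policy or binary
  processHealthy: boolean[]; // one flag per agent process
}

function deriveAgentStatus(agent: AgentSnapshot): AgentStatus {
  // Assumed precedence: offline, then updating, then process health.
  if (!agent.checkedIn) {
    return "offline";
  }
  if (agent.changeInProgress) {
    return "updating";
  }
  const allHealthy = agent.processHealthy.every((ok) => ok);
  return allHealthy ? "healthy" : "unhealthy";
}
```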

image

Some of this information was previously shown globally in the header area for this page next to the "Add agent" button. In 7.11, we'd like to change this so that the information is shown directly above the agent table using a colorful "status bar". The status bar shows a breakdown of agent statuses, and it will update its numbers and display based on whatever query or filters are applied above. This will help users understand the breakdown of health statuses for a filtered group of agents. One other small change is how we represent agent status. Previously, we were using the EuiHealth component in the table column, but instead we'd like to use EuiBadge, which allows more emphasis to be placed on the color. Screenshot below:
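
As a rough illustration of how the status bar numbers could follow the applied query or filters, the sketch below counts statuses only for the rows that match a filter. The `AgentRow` shape and the function name are hypothetical, and the filter is passed in as a plain predicate rather than a real search query.

```typescript
type AgentStatus = "offline" | "healthy" | "updating" | "unhealthy";

// Hypothetical row shape for the agents table.
interface AgentRow {
  hostName: string;
  agentStatus: AgentStatus;
}

type StatusBreakdown = Record<AgentStatus, number>;

// Count statuses for whichever rows survive the current query or filters.
function statusBreakdown(
  rows: AgentRow[],
  matchesFilter: (row: AgentRow) => boolean
): StatusBreakdown {
  const counts: StatusBreakdown = { offline: 0, healthy: 0, updating: 0, unhealthy: 0 };
  for (const row of rows) {
    if (matchesFilter(row)) {
      counts[row.agentStatus] += 1;
    }
  }
  return counts;
}
```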

When a user clicks on a host name from the agents table, they are taken to the agent details page. This page will have a few updates in 7.11:

image

  • Use EuiBadge to indicate agent status in the page's header area
  • Add "last activity" in the header area. This is the same information we present on the previous page in the agent table.
  • Agent metadata is now wrapped inside a panel called "Overview". Some new information we'll be adding here includes:
    • A summary of the integrations that exist on this agent using a hollow EuiBadge with the integration's icon on the left side. This information is somewhat duplicative of what is shown on the right half of this page, but it provides a scannable representation. If an agent has multiple instances of the same integration (for example, Uptime/Synthetics monitors), it will only be shown once in the "Overview" panel, whereas the information on the right hand side will list all instances of an integration which could become quite long in the event of there being many instances.
    • The agent's log level (can be info, warn, debug, or error). In a separate issue, we will add the ability for a user to set the log level for a single agent. [Fleet] Allow to config Agent log level #83330 (comment)
    • Monitor logs and monitor metrics can be true or false. This information is set at the agent policy level.
    • The Elasticsearch URL where data is sent. This information currently comes from Fleet's "global output" settings
  • On the right half of the page, we will list individual instances of an integration. Users can click the integration's name, which will take them to the "edit page" for that integration. Users can also click the expand icon to view which inputs are configured for the integration. Once expanded, users can click the "view logs" icon, which will take the user to the logs tab with a filter applied to only show logs that correspond with that process. For example, the "logs" input will show the "elastic_agent.filebeat" dataset.
    • image
  • In the event that the agent has an error, we will show a "danger" callout at the top of the agent details page. This callout will indicate the time that the error occurred, include the error message in an EuiCodeBlock, and provide a prominent "view logs" call to action which takes the user to the logs tab. @michalpristas provided some examples of agent errors. In the event that the error prevents data about the agent's metadata and integrations from being sent, we may be unable to render the Overview and Integrations information listed on this page below the callout.
    • image
  • We will replace the "Agent activity" tab with a new "Logs" tab. @jen-huang has already filed a PR for these changes. This page uses the new Logs Stream component and includes a search bar, filters for dataset and log level, a date time picker, and a link to view the logs in the Logs UI
  • image
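
The difference between the "Overview" summary and the full list on the right half of the page can be sketched as a simple de-duplication. The `IntegrationInstance` shape and the function name are hypothetical; the sketch only shows that multiple instances of the same integration collapse to one badge.

```typescript
// Hypothetical shape for one configured integration instance.
interface IntegrationInstance {
  integrationName: string; // e.g. "synthetics"
  instanceName: string;    // e.g. "synthetics-homepage"
}

// Names for the Overview badges: each integration appears once, in first-seen order.
function overviewBadgeNames(instances: IntegrationInstance[]): string[] {
  const seen: string[] = [];
  for (const instance of instances) {
    if (seen.indexOf(instance.integrationName) === -1) {
      seen.push(instance.integrationName);
    }
  }
  return seen;
}
```

The right-hand list would keep rendering every entry of `instances`, so only the summary is shortened.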

7.12

In 7.12, we intend to extend the UI so that we are able to report status information and metrics for individual integrations. If an integration is labeled as "unhealthy" (which means an error was found for one or more corresponding inputs in the agent logs), this information will extend to the agent's overall health. If all integrations are healthy, then the agent's overall status will be labeled as "healthy". In the expanded state for the list of integrations, we want to list the integration's inputs, and for each input, show the last message received (red if it's an error message), a sparkline indicating the input's event rate over the past hour, and an additional action link to view metrics about this input in an Elastic Agent dashboard. This dashboard does not exist yet and will need to be created for 7.12 (cc @ravikesarwani). The dashboard should include similar metrics that we show in Stack Monitoring for Beats instances, and (if possible) it should include filter buttons to isolate metrics for a specific integration and input.
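
The health roll-up described above can be sketched as two small functions. The types and the `errorFoundInLogs` flag are hypothetical stand-ins for whatever the agent actually reports; the logic simply follows the rule that one failing input makes the integration unhealthy, and one unhealthy integration makes the agent unhealthy.

```typescript
type Health = "healthy" | "unhealthy";

// Hypothetical per-input state; errorFoundInLogs stands in for
// "an error was found for this input in the agent logs".
interface InputState {
  inputName: string;
  errorFoundInLogs: boolean;
}

interface IntegrationState {
  integrationName: string;
  inputs: InputState[];
}

// An integration is unhealthy if one or more of its inputs has an error.
function integrationHealth(integration: IntegrationState): Health {
  return integration.inputs.some((input) => input.errorFoundInLogs) ? "unhealthy" : "healthy";
}

// The agent is healthy only if all of its integrations are healthy.
function agentHealth(integrations: IntegrationState[]): Health {
  const allHealthy = integrations.every((item) => integrationHealth(item) === "healthy");
  return allHealthy ? "healthy" : "unhealthy";
}
```

Note that in this sketch an agent with no integrations is reported as "healthy", which may or may not be the desired behavior.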

image

image

@mostlyjason
Contributor

Thanks @hbharding!

@hbharding
Contributor

Small update: per #83330, we want to provide a way for the user to change an individual agent's logging level.

I've updated the Figma file and added a select input to change the agent logging level beneath the new logs stream component. This input will only appear for agents >= 7.11.
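
The "agents >= 7.11" condition could be checked along these lines. This is a sketch with hypothetical function names; it compares only the numeric major.minor.patch parts and, as an assumption, ignores any suffix such as "-SNAPSHOT".

```typescript
// Parse "major.minor.patch" into numbers, ignoring any "-suffix".
function parseVersion(version: string): number[] {
  const core = version.split("-")[0] || "";
  return core.split(".").map((part) => parseInt(part, 10) || 0);
}

// True when `version` is greater than or equal to `minimum`.
function isAtLeast(version: string, minimum: string): boolean {
  const a = parseVersion(version);
  const b = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    const left = a[i] || 0;
    const right = b[i] || 0;
    if (left !== right) {
      return left > right;
    }
  }
  return true;
}

// Show the logging level select only for agents on 7.11 or later.
function showLogLevelSelect(agentVersion: string): boolean {
  return isAtLeast(agentVersion, "7.11.0");
}
```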

Kapture 2020-11-23 at 09 53 31

Users can also see the current agent logging level on the agent detail page where we show agent metadata

image

See screenshots for step-by-step details

1. Default state with the currently applied logging level
image

2. User selects a new logging level. A button appears that says "Apply changes"
image

3. After user clicks "apply changes", the select input becomes disabled while the system waits for a response. The button changes to a loading state that says "Applying changes..."
image

4. After a response returns, the UI displays a toast indicating what changed. The UI state returns to step 1.
image
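
The four steps above can be read as a small state machine. The sketch below is one way to express it; the state shape, action names, and toast wording are hypothetical, and error responses are not modeled because the steps above do not describe them.

```typescript
type LogLevel = "info" | "warn" | "debug" | "error";

interface LogLevelState {
  applied: LogLevel;  // level currently in effect on the agent
  selected: LogLevel; // level shown in the select input
  applying: boolean;  // true while waiting for a response
  toast?: string;     // message shown after a response returns
}

type LogLevelAction =
  | { type: "select"; level: LogLevel }
  | { type: "apply" }
  | { type: "responseReceived" };

function reduceLogLevel(state: LogLevelState, action: LogLevelAction): LogLevelState {
  switch (action.type) {
    case "select": // step 2: user picks a new level
      if (state.applying) {
        return state; // the select input is disabled while applying
      }
      return { applied: state.applied, selected: action.level, applying: false };
    case "apply": // step 3: user clicks "Apply changes"
      if (state.applying || state.selected === state.applied) {
        return state;
      }
      return { applied: state.applied, selected: state.selected, applying: true };
    case "responseReceived": // step 4: toast, then back to the default state
      if (!state.applying) {
        return state;
      }
      return {
        applied: state.selected,
        selected: state.selected,
        applying: false,
        toast: `Agent logging level changed to ${state.selected}`, // hypothetical wording
      };
    default:
      return state;
  }
}

// Button text for the current state; undefined means no button is shown (step 1).
function applyButtonLabel(state: LogLevelState): string | undefined {
  if (state.applying) {
    return "Applying changes...";
  }
  return state.selected !== state.applied ? "Apply changes" : undefined;
}
```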

@ravikesarwani
Contributor Author

Here's an example of why "Events Rate" data is so critical for the Agent metrics dashboard.
cc: @ph @nchaulet
