[Infrastructure UI] Add metric trends (KPIs) to filtered Hosts list #143535
Comments
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI) |
@pmeresanu85 We had a team time session with Yoann from the Cloud team, and I got a chance to ask about the summary graphs he's placed on their dashboards. They show a graph with a line per host for 3 metrics (CPU, memory and disk throughput). I asked why they don't show a single metric value or a trendline for the group as an average, and he said that it wouldn't tell you very much because it's just an average of averages of averages. So I think we should consider showing the same kind of graphs in our view, since they show trends for the system as well as outliers, which might hint at issues to drill into with filters. |
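To make the "average of averages" concern concrete, here is a minimal sketch with made-up numbers. The host names, sample values and the 0.3 threshold are all assumptions for illustration, not data from any real cluster. It shows how one saturated host can disappear inside a single group value while staying obvious per host.

```typescript
// Hypothetical per-host CPU usage samples (0..1); illustrative values only.
const cpuSamples: { host: string; samples: number[] }[] = [
  { host: "host-a", samples: [0.2, 0.22, 0.21] },
  { host: "host-b", samples: [0.19, 0.21, 0.2] },
  { host: "host-c", samples: [0.95, 0.97, 0.96] }, // the outlier
  { host: "host-d", samples: [0.18, 0.2, 0.19] },
];

const mean = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

// One averaged value per host (roughly what a line-per-host chart preserves).
const perHostMean = cpuSamples.map(({ host, samples }) => ({
  host,
  value: mean(samples),
}));

// A single group KPI: the average of the per-host averages.
const groupMean = mean(perHostMean.map((h) => h.value));

// Hosts whose own mean is far from the group mean (arbitrary threshold).
const outlierHosts = perHostMean
  .filter((h) => Math.abs(h.value - groupMean) > 0.3)
  .map((h) => h.host);

console.log(groupMean.toFixed(2), outlierHosts);
```

With these numbers the group value lands at roughly 0.39, which looks healthy, even though one host is running at about 96%.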
@miltonhultgren very good comment, let's follow up on Yoann's feedback. The ones above are averages (KPI trends). We also need to be mindful that this is 1 feedback point; in my opinion we shouldn't change course based on a single feedback point. I think we shouldn't be aiming to produce a perfect product in v1; it's better to have an iterative approach. I wonder if we talked to Yoann about our metric dashboards in context, which are the exact ones they have. |
Discussed with @miltonhultgren
Goal: SREs want to find the groups of hosts relevant to them.
Goal: SREs want to troubleshoot / remediate infrastructure problems in the context of metrics & alerts.
Note: Host Map / Host Graph should become part of the analysis view as a minimal functional set, showing only group context & KPIs without individual host selection.
MVP: For the MVP of the feature we would only require the 2-step workflow (listing/filtering hosts & analysing the host group in the context of metrics and alerts). If we can fit in the Host Group object from an effort perspective, this can be part of the MVP as well. |
Some more thoughts: This means there will be some things on the page that make sense for the "find group of hosts" goal and some that make sense for the "analyze problems within a group" goal. Likewise, there will be things that are missing for each goal and that don't make sense in the context of the other goal. We should accept that for now and address such problems when we have the time to split this into two views (with the persistence of groups and navigation flows to reach each type of page). On the Analysis page: On the Find hosts page: For the Host groups: Some meta thoughts: |
@miltonhultgren let's proceed based on these 2 comments above |
We spoke a bit more about this in our team time: One thing that we all acknowledge is the need for the landing page to be appealing. One idea would be to show metrics about the "fleet" of hosts instead, like how many hosts there are, and a pie breakdown of their OS type, or some other "group" type data. Rather than showing CPU trends which might be more relevant once you filter down the list. Another thing we could put in to brighten the page up is callouts for collecting data from more hosts. If for example we see that we have 100 hosts in the system module data, but APM data reports 160 hosts, we can put a call out saying "hey, you have another 60 hosts in your system that you could instrument and put into a host group". Perhaps there are other such guiding steps we can share. We also arrived at the question that it might be useful to compare two host groups to each other, but we're not clear on how that would look. If we have a global list with a "group by" sort of query, then we could do this in one page by reducing a group into a single item in the lists/graphs but it's not clear how that would fit into a design with two views. |
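The callout idea above can be sketched as a set difference between the hosts seen in each data source. The function names, the input shape (plain lists of host names) and the callout wording are assumptions for illustration. Note that subtracting counts (160 - 100 = 60) is only correct when every system-module host also reports APM data, which is why the sketch compares names.

```typescript
// Hosts that report APM data but are missing from the system module data.
// Both inputs are hypothetical lists of `host.name` values.
function findUninstrumentedHosts(
  systemHosts: string[],
  apmHosts: string[]
): string[] {
  const known = new Set(systemHosts);
  return apmHosts.filter((host) => !known.has(host));
}

// Returns the callout text, or null when there is nothing to suggest.
function buildCalloutText(missingCount: number): string | null {
  if (missingCount <= 0) {
    return null;
  }
  return `You have another ${missingCount} hosts in your system that you could instrument and put into a host group.`;
}

const missingHosts = findUninstrumentedHosts(
  ["web-1", "web-2"],
  ["web-1", "web-2", "db-1"]
);
console.log(buildCalloutText(missingHosts.length));
```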
We were never able to come up with an appropriate design to handle 2-step views and concluded to go with one view, in part because the Design team favors a single view as better UX. Unless we get input from the Design team otherwise, or they have the bandwidth to finalize a 2-step design, I think we need to go with the single one for now. CC @formgeist I still think these summary metrics of the "group" could be useful. Given the Host Map demo would show each individual host on the chart, this could be a "Host Group" summary metric. The Host Map demo showed the summary metrics based on the metrics being shown in that visualization, which are CPU, Memory Usage, RX, and TX. So it would probably make sense to show these metrics and not Disk Usage and Disk Latency. @pmeresanu85 Would you agree? |
I agree with the above paragraph. Based on the guidance we got from UX, my suggestion would be to go with a single view (vs. a 2-step design).
Agree. Let's stick to showing summary metrics of the "group". Additionally let's stick to showing CPU, Memory usage, RX and TX |
Updated issue with specific metrics to be used. It says to use the snapshot API but I'm not sure that provides the data we need here or averages across all results. If we have to calculate averages on the client I'm not sure if that will work with the results (though these might not be paginated on the server). If the API doesn't do what we want we may want to use lens visualizations instead. If this is the case let's update the description. |
@formgeist will update this issue once there is a clear design for this. |
@smith Not sure if intentional, but the description mentions "CPU utilization" and "Memory utilization" instead of the more common names "CPU usage" and "Memory usage". Is there a specific reason to use a different naming here? I would assume we'd try to keep them consistent with the Hosts table? |
@formgeist We should continue to use "usage" over utilization for consistency. |
Updated text to use "usage" instead of "utilization". |
I think we can move this out of refining. At present it can go on the single view. I understand we might be moving it into tabs later based on @formgeist designs, but that shouldn't be too difficult to move. The snapshot API doesn't support this kind of "list of hosts" summary metrics. We'd likely create a new API (which could be reused for charts or a summary of groups of hosts if we ever had a landing page for that) if we went with using the Elastic charts metric visualization. I think we decided to go with our own API / Elastic charts metric visualization to make things more consistent, instead of using Lens for some metrics and not others. It also gives us more customizable options and avoids other issues we had with the table, such as erroring out when the field doesn't exist in the Data View. |
Any objection to using either the Snapshot API or the Metric Explorer API for this (both internally run the same code to build the query)? They would provide us with all the metrics except the hosts count out of the box. The hosts count could be a new inventory model using a cardinality aggregation. |
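A minimal sketch of what a hosts count via cardinality aggregation could look like as an Elasticsearch search body. The function name `buildHostsCountRequest`, the aggregation name `hostsCount` and the `@timestamp` range filter are assumptions for illustration. This is not the actual inventory model code, and it does not show how the Snapshot or Metrics Explorer APIs would wrap it. Keep in mind that Elasticsearch's cardinality aggregation returns an approximate distinct count.

```typescript
// Builds a search body that counts distinct `host.name` values in a time range.
// `from` / `to` are ISO timestamps or date math strings such as "now-15m".
function buildHostsCountRequest(from: string, to: string) {
  return {
    size: 0, // only the aggregation result is needed, not the documents
    query: {
      bool: {
        filter: [{ range: { "@timestamp": { gte: from, lte: to } } }],
      },
    },
    aggs: {
      hostsCount: {
        cardinality: { field: "host.name" },
      },
    },
  };
}

console.log(JSON.stringify(buildHostsCountRequest("now-15m", "now"), null, 2));
```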
Yea I was thinking the same. Sounds good to me! |
Update: Great interview with Lucas Moore (one of our SREs). He really liked these charts so this is good news! |
@crespocarlos and @roshan-elastic spoke about query sizes. Given we want to prioritise performance/responsiveness/response times, we discussed some suggestions:
Suggestions
KPI Tiles
Table
Other
Next Steps
|
@roshan-elastic @neptunian @formgeist meeting notes:
Decisions
When we return the default result view:
Either result view or one which is filtered/queried:
Open Questions
Action Points
|
Update from last meeting (@formgeist @crespocarlos + @roshan-elastic )
Next Design Requirements
1. KPI Tiles - Hosts - X out of X
2. KPI Tiles - Indicate that they are affected by selected host limit, e.g. 100, 50, 10 - whatever is selected
3. KPI tiles - affected by sorting
4. Sampling - Default sorting
5. Sampling - return to default sorting
6. Query size controls
7. Long query loading prompt
Other Notes / Open Points
8. Click through on KPIs
9. Limit time range
10. Losing the displayed hosts when you change the time range
11. Responsive/mobile view
Action Points
|
To narrow the scope of this ticket, I suggest we focus on:
Regarding the item below, the page loads the KPIs based on the default 15 minutes that is set in the unified search as soon as the page loads. Should we change this behaviour?
The items below could be in a separate ticket as they aim to address a different problem
We'll need to change the query to run
Since the KPIs run the same base query, this should work without any additional change
It's currently possible to sort the table. The main change here is to sort the data on ES side as a direct impact of limiting the number of hosts returned by the API
This needs further investigation |
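To illustrate why limiting the number of hosts pushes sorting to the ES side, here is a minimal sketch. The row shape, function names and the `cpu` sub-aggregation name are assumptions, not the actual Hosts view code. The two plain functions show that sorting after truncation ranks an arbitrary subset, and the last function sketches an Elasticsearch terms aggregation ordered by a metric sub-aggregation as one possible server-side equivalent.

```typescript
// Hypothetical row shape; the real Hosts table has more columns.
interface HostRow {
  name: string;
  cpu: number;
}

// Limit first, then sort: only the arbitrary first `limit` rows get ranked.
function limitThenSort(rows: HostRow[], limit: number): HostRow[] {
  return rows.slice(0, limit).sort((a, b) => b.cpu - a.cpu);
}

// Sort first, then limit: the true top `limit` rows by CPU.
function sortThenLimit(rows: HostRow[], limit: number): HostRow[] {
  return [...rows].sort((a, b) => b.cpu - a.cpu).slice(0, limit);
}

// One way to express "sort, then limit" in Elasticsearch: a terms aggregation
// on host.name, ordered by an average CPU sub-aggregation.
function buildSortedHostsAggregation(limit: number, direction: "asc" | "desc") {
  return {
    hosts: {
      terms: { field: "host.name", size: limit, order: { cpu: direction } },
      aggs: { cpu: { avg: { field: "system.cpu.total.norm.pct" } } },
    },
  };
}

const rows: HostRow[] = [
  { name: "a", cpu: 0.2 },
  { name: "b", cpu: 0.1 },
  { name: "c", cpu: 0.9 },
  { name: "d", cpu: 0.5 },
];
console.log(limitThenSort(rows, 2).map((r) => r.name)); // a, b
console.log(sortThenLimit(rows, 2).map((r) => r.name)); // c, d
```

If the KPI tiles share the same limited query, they would presumably reflect the same sorted subset, which matches the "KPI tiles - affected by sorting" requirement noted earlier in the thread.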
Hey @crespocarlos, looking at these ones - I wasn't sure whether it would be best for you to administer this (e.g. create issues/tasks which make sense to you) or whether I should create them? Let me know what you prefer! BTW I've added a list of items for the current ticket in the top comment. Feel free to add items to this as you need:
|
I can create them. |
## Summary

closes #143535

![out](https://user-images.githubusercontent.com/2767137/209656024-51eb234b-c449-4ba5-982b-ba78d38dbd98.gif)

This PR adds metrics trend tiles to the hosts view. All of those metrics, but `hosts`, are loaded using existing features found in the Snapshot API. `hosts` will show for now only the total table row count.

I've created a basic functional test just to validate that things are loading and improved a few things.

### How to test it

- **Using metricbeat**
  - Enable system metric in metricbeat
  - Start your local ES
- **Using oblt**
  - configure kibana.dev.yml with your oblt-cluster
- **Using slingshot**
  - Clone https://github.com/elastic/slingshot and run slingshot `yarn slingshot load --config ./configs/hosts.json`
  - Start your local ES
- Start kibana
- Navigate to Infrastructure > Hosts

#### Functional tests

Start server

```bash
yarn test:ftr:server --config x-pack/test/functional/apps/infra/config.ts
```

Start runner

```bash
node scripts/functional_test_runner --config=x-pack/test/functional/apps/infra/config.ts --include x-pack/test/functional/apps/infra/hosts_view.ts
```

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
✔️ Task List
*Open Questions = Needs facilitation by PM
*Current Tasks = Can be owned individually (reach out to whoever for feedback/clarification)
Open Questions
Current Tasks
📖 Description
In order to show the user a summary of information about the search result for hosts, we would like to show some summary metrics across the top of the page, below the search bar.
⚙️ Implementation details
The look, spacing, and layout should be identical to what's specified in the screenshots below:
Default state - query size
Filtered state - number of included hosts vs. total number of hosts
✔️ Acceptance criteria
In the Hosts View, the following metrics and trends are displayed:
- node: host.name, considering only the time range filter
- compute: avg(system.cpu.total.norm.pct)
- memory: avg(system.memory.actual.used.bytes) / max(system.memory.total)
- sortLeft: counter_rate(system.network.in.bytes)
- sortLeft: counter_rate(system.network.out.bytes)
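As a rough reading of the memory and network formulas above, here is a sketch over in-memory samples. The function names and sample values are assumptions for illustration. `counter_rate` is interpreted here as the increase of a monotonically growing counter, with a drop in value treated as a counter reset; this is a simplified interpretation, not the actual implementation.

```typescript
const average = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

// avg(system.memory.actual.used.bytes) / max(system.memory.total)
function memoryUsage(usedBytes: number[], totalBytes: number[]): number {
  return average(usedBytes) / Math.max(...totalBytes);
}

// Simplified counter increase for an ever-growing counter such as
// system.network.in.bytes: sum the deltas between consecutive samples and,
// when the value drops, assume a reset and count the new value as the delta.
function counterIncrease(samples: number[]): number {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i] - samples[i - 1];
    total += delta >= 0 ? delta : samples[i];
  }
  return total;
}

// Illustrative values only.
console.log(memoryUsage([4e9, 6e9, 5e9], [16e9, 16e9, 16e9]));
console.log(counterIncrease([100, 150, 220, 30, 80]));
```

Dividing the counter increase by the bucket length in seconds would give a bytes-per-second figure for the RX and TX tiles, under the same assumptions.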