Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[System] Fix metrics overview dashboard #9771

Merged

Conversation

milan-elastic
Copy link
Contributor

@milan-elastic milan-elastic commented May 1, 2024

  • Bug

Proposed commit message

  • Fixed the Table chart visualization, where CPU usage is comming as '-'.
  • Changed palette param from 'Percent' to 'Number' for Table chart visualization.
  • Added data_stream.dataset filter at panel for Top hosts by CPU usage over time and Top hosts by memory usage over time.

Related Issues

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.

Screenshots

Before

screencapture-my-deployment-4520e7-kb-us-central1-gcp-cloud-es-io-9243-app-dashboards-2024-05-03-14_42_29

After

screencapture-127-0-0-1-64942-app-dashboards-2024-05-03-13_28_29

@elasticmachine
Copy link

elasticmachine commented May 1, 2024

🚀 Benchmarks report

To see the full report comment with /test benchmark fullreport

@milan-elastic milan-elastic marked this pull request as ready for review May 2, 2024 06:34
@milan-elastic milan-elastic requested review from a team as code owners May 2, 2024 06:34
@milan-elastic
Copy link
Contributor Author

milan-elastic commented May 2, 2024

While comparing the before migration system metrics dashboards to this PR, I have observed that the fields are getting changed. And from this change, the metrics visualization and the table chart for CPU usage are showing different values!

Before Migration
image (7)

After Migration
image (8)

Were these changes expected?
cc: @drewdaemon

@ishleenk17
Copy link
Contributor

While comparing the before migration system metrics dashboards to this PR, I have observed that the fields are getting changed. And from this change, the metrics visualization and the table chart for CPU usage are showing different values!

Before Migration image (7)

After Migration image (8)

Were these changes expected? cc: @drewdaemon

Do we have closure on this ?
Do we know the reason of CPU values being shown as 0% ?

@ishleenk17
Copy link
Contributor

@milan-elastic : Was this issue reprodicible where in 10s we don't see the CPU values?

@milan-elastic
Copy link
Contributor Author

milan-elastic commented May 3, 2024

Do we have closure on this ?

No, not yet! I am looking forward to get a confirmation from @drewdaemon on this.

Do we know the reason of CPU values being shown as 0% ?

The panel previously displayed "-" instead of the actual value, resulting in 0% being shown. This issue has been resolved as part of this pull request.

@milan-elastic
Copy link
Contributor Author

@milan-elastic : Was this issue reprodicible where in 10s we don't see the CPU values?

Yes, it is reproducible and fixed now as part of this PR

@drewdaemon
Copy link
Contributor

Hi @milan-elastic

I believe there are two things going on here.

Differences between TSVB and Lens

The number shown could be different even if the field hadn't changed since the old TSVB visualizations used an aggregation over a recent slice of the selected timeframe, while Lens's Last value function retrieves literally the last-reported value. The TSVB approach consistently confused... well... everybody.

See #1437 (comment) for more discussion about this.

Different field being used

I actually don't remember this change. But, looking at the system docs it seems like it is a correct one.

The old field system.cpu.user.norm.pct is

The percentage of CPU time spent in user space.

The new field system.process.cpu.total.norm.pct

The percentage of CPU time spent by the process since the last event. This value is normalized by the number of CPU cores and it ranges from 0 to 100%.

Feels like you'd want to see total CPU usage, instead of just the user space.

That said, I'm no expert on the data here. Maybe @cmacknz could double-check

@cmacknz
Copy link
Member

cmacknz commented May 3, 2024

The total system CPU usage definitely seems more correct.

@fearful-symmetry can you sanity check the correct CPU metrics are being used here?

@drewdaemon
Copy link
Contributor

drewdaemon commented May 6, 2024

A side-note I wanted to state out-loud here. Completely removing the reduced time range setting will mean that the table may show out-of-date metrics and this may not be clear to the user. For example, maybe the last CPU metric for host A was reported at 80% yesterday. My dashboard time range is set to the last 7 days. I will see 80% as the "current" value even though it was reported yesterday.

Not saying it's the wrong choice to remove this, just saying there's a trade-off. Another possibility would be increasing the reduced time range to a larger window to account for slight variances in the ingest frequency.

That said, the user can always check the health of their agents and ingest through other means than these visualizations.

@cmacknz
Copy link
Member

cmacknz commented May 6, 2024

Not saying it's the wrong choice to remove this, just saying there's a trade-off. Another possibility would be increasing the reduced time range to a larger window to account for slight variances in the ingest frequency.

Increasing the reduced time range seems like it might be a better choice. Showing 1 week old data for metrics feels wrong.

The data collection interval defaults to 10s, so at most you'd have one sample in that period. Using a longer one makes more sense. What value to use feels a bit arbitrary, maybe 15m instead of 10s to guarantee we have something unless there is a real problem?

I think in an ideal world we'd want this to be some fixed multiple of the configured collection period, but I'm not sure that's possible.

@drewdaemon
Copy link
Contributor

What value to use feels a bit arbitrary, maybe 15m instead of 10s to guarantee we have something unless there is a real problem?

FWIW, I think this makes sense.

I think in an ideal world we'd want this to be some fixed multiple of the configured collection period, but I'm not sure that's possible.

Not possible today. But, I think variables in the integration assets would be powerful.

@milan-elastic
Copy link
Contributor Author

After our team discussion, we've decided to adjust the time range to 15 minutes. While this won't always ensure complete data population,But any day it's a better option than sticking with a reduced time range of 10 seconds.
cc: @tommyers-elastic @ishleenk17 @lalit-satapathy

},
{
"color": "#cc5642",
"stop": 100
"stop": 1.85
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be 1.0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drewdaemon Yes, Ideally it should be 1 instead of 1.85, seems like it has calculated it's value automatically! From the UI there is no place I've found from where I can manipulate this value! To make it 1 I've changed the value manually in json file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@milan-elastic I'm sorry... I misunderstood. If this is the value the Kibana automatically set, we should stick to that. Let's revert to 1.85.

I think Kibana is probably doing its best to make sure that we still get a red color if the percentage value rises above 1

…stic/integrations into bugfix-system-metrics-overview
Copy link
Contributor

@drewdaemon drewdaemon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work.

I recommend respecting the final color stop value from Kibana (ref).

Approving to unblock.

@ishleenk17
Copy link
Contributor

@milan-elastic : I hope the dashboards json is autogenerated from Kibana and not manipulated manually.
As in changes done on dashboards and then exported to json.
As I see in some comment a change was done manually.

Nit: There are 3 files showing in diff due to extra line addition. Please take care of that.

@milan-elastic
Copy link
Contributor Author

@milan-elastic : I hope the dashboards json is autogenerated from Kibana and not manipulated manually. As in changes done on dashboards and then exported to json. As I see in some comment a change was done manually.

@ishleenk17 After @drewdaemon comment I've reverted the 1 to 1.85 that I have changed manually previously. So it's been taken care of.

Nit: There are 3 files showing in diff due to extra line addition. Please take care of that.

I think we should not revert them back, because those changes are result of elastic-package check command. so everytime anyone is going to run this command this diff will be appear!

Copy link
Contributor

@ishleenk17 ishleenk17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@milan-elastic
Copy link
Contributor Author

I require code owner review to merge this PR, can someone from @elastic/sec-linux-platform and @elastic/sec-windows-platform review this PR and provide approval ?
cc: @lalit-satapathy @ishleenk17

@elasticmachine
Copy link

💚 Build Succeeded

History

cc @milan-elastic

Copy link

Quality Gate passed Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

@milan-elastic milan-elastic merged commit 0e26f73 into elastic:main May 28, 2024
5 checks passed
@elasticmachine
Copy link

Package system - 1.58.1 containing this change is available at https://epr.elastic.co/search?package=system

bmorelli25 pushed a commit to bmorelli25/integrations that referenced this pull request Jun 3, 2024
* fix metrics overview dashboard

* update pr link in changelog

* reevaluate reducedtimerange

* updated the ndjson

* resolve review comments

---------

Co-authored-by: harnish-elastic <118714680+harnish-elastic@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Metric System] Overview - wrongly configured table
8 participants