[elastic_agent] Agent Metrics Dashboard improvements #10031

Merged
leehinman merged 5 commits into elastic:main on Jun 18, 2024

Conversation

leehinman
Contributor

Proposed commit message

The visualizations in the "[Elastic Agent] Agent Metrics" Dashboard have been improved. Specifically:

For the following visualizations, the breakdown was changed from beat.type to agent.name+component.id. This was necessary because breaking down by beat.type grouped data from different sources together, which produced incorrect values. agent.name+component.id is correct because it maps the data to the unique process id that generated it.

Also, the interval was changed from auto to minute. This is necessary because the data only arrives at 1 minute intervals. By default the visualization picked smaller intervals, which left some buckets empty and broke the visualization.

  • [Elastic Agent] Total events rate /s
  • [Elastic Agent] Output write throughput
  • [Elastic Agent] Output write errors
  • [Elastic Agent] Events acknowledged rate /s
  • [Elastic Agent] Output write batches
  • [Elastic Agent] Output batch size
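
A minimal sketch of why the breakdown field matters, assuming made-up sample documents. Only the field names beat.type, agent.name, and component.id come from the description above; the `events_total` field, the host names, and the component ids are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample documents: three separate processes that all report
# the same beat.type. Values are invented for illustration.
docs = [
    {"beat.type": "filebeat", "agent.name": "host-a",
     "component.id": "log-default", "events_total": 100},
    {"beat.type": "filebeat", "agent.name": "host-a",
     "component.id": "filestream-default", "events_total": 40},
    {"beat.type": "filebeat", "agent.name": "host-b",
     "component.id": "log-default", "events_total": 7},
]


def group_counters(docs, key_fields):
    """Group documents by the given fields and collect each group's counter values."""
    groups = defaultdict(list)
    for doc in docs:
        key = tuple(doc[field] for field in key_fields)
        groups[key].append(doc["events_total"])
    return dict(groups)


# One series mixing counters from three unrelated processes.
by_type = group_counters(docs, ["beat.type"])

# One series per process, so each counter stays on its own line.
by_process = group_counters(docs, ["agent.name", "component.id"])
```

With the beat.type breakdown, any rate computed over the single mixed series compares counters that belong to different processes. With agent.name+component.id, each series holds one process's counter.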

The following visualizations were added.

  • [Elastic Agent] Events dropped rate /s
  • [Elastic Agent] Events duplicate rate /s
  • [Elastic Agent] Events failed rate /s
  • [Elastic Agent] Events TooMany rate /s

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.

Author's Checklist

How to test this PR locally

  1. Install elastic-agent
  2. Look at [Elastic Agent] Agent Metrics Dashboard

Related issues

Screenshots

Before

Screenshot 2024-05-30 at 13 59 16

After

Screenshot 2024-05-30 at 14 04 35

All Visualization

Screenshot 2024-05-22 at 19 17 50

@leehinman requested a review from a team as a code owner May 30, 2024 20:00
@pierrehilbert added the Team:Elastic-Agent-Data-Plane label May 31, 2024
@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@fearful-symmetry
Contributor

@leehinman do you want to merge from upstream? The conflicts are probably me. Hoping I'll actually have time to get to this tomorrow.

@fearful-symmetry
Contributor

So, this is probably a me problem, but I can't get the CPU usage dashboard to show anything? This is just the "default" elastic-package deployment (elastic-package stack up) with a system integration installed and agent metrics enabled to get more data. Can you verify this is working?

It looks like the visualization is aggregating from system.process.cpu.total.value and elastic-agent.process, and there are documents matching with those fields.

The query response shows that we have matching documents, but the aggregation isn't there.

{
  "rawResponse": {
    "took": 7,
    "timed_out": false,
    "_shards": {
      "total": 15,
      "successful": 15,
      "skipped": 14,
      "failed": 0
    },
    "hits": {
      "total": 75,
      "max_score": null,
      "hits": []
    }
  }
}
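
A small sketch of how the response above can be checked programmatically, assuming the standard Elasticsearch search response shape in which aggregation results appear under an `aggregations` key. The `summarize` helper is hypothetical and only inspects the pasted JSON.

```python
import json

# The response pasted above, verbatim.
raw = """
{
  "rawResponse": {
    "took": 7,
    "timed_out": false,
    "_shards": {"total": 15, "successful": 15, "skipped": 14, "failed": 0},
    "hits": {"total": 75, "max_score": null, "hits": []}
  }
}
"""


def summarize(raw_json: str) -> dict:
    """Report the hit count and whether any aggregation results came back."""
    body = json.loads(raw_json)["rawResponse"]
    return {
        "hits": body["hits"]["total"],
        "has_aggregations": "aggregations" in body,
    }


summary = summarize(raw)
```

Here `summary` reports 75 matching documents and no `aggregations` key, which matches the observation that documents exist but the aggregation is missing.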

Again, might just be me being dumb, but it seems odd, since the Memory Usage and Open Handles metrics are working fine?

Screenshot 2024-06-05 at 1 49 14 PM

@leehinman
Contributor Author

Can you verify this is working?

I didn't touch the "CPU Usage" chart, but I figured out what is wrong. It is the fact that the data view is set to "auto" and we only get events every minute. So if auto picks buckets that are smaller than one minute (which is what happens with the default 15 min time picker), some of the buckets have no data, and things like derivatives don't work. :-(

fix coming soon.
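
The empty-bucket problem described above can be sketched as follows. This is an illustration under stated assumptions, not Kibana's implementation: the sample counter values are made up, and `bucket_max` and `derivative` are hypothetical helpers that mimic a max aggregation per time bucket followed by a difference between consecutive buckets.

```python
def bucket_max(samples, start, end, interval):
    """Split [start, end) into fixed-width buckets and keep the max value
    seen in each bucket. Buckets that receive no sample stay None."""
    count = (end - start + interval - 1) // interval
    buckets = [None] * count
    for ts, value in samples:
        if start <= ts < end:
            i = (ts - start) // interval
            buckets[i] = value if buckets[i] is None else max(buckets[i], value)
    return buckets


def derivative(buckets):
    """Difference between consecutive buckets; None when either side is empty."""
    return [
        None if prev is None or cur is None else cur - prev
        for prev, cur in zip(buckets, buckets[1:])
    ]


# Hypothetical cumulative counter reported once per minute (timestamps in seconds).
samples = [(0, 100), (60, 160), (120, 230), (180, 290)]

per_30s = derivative(bucket_max(samples, 0, 240, 30))  # every other bucket is empty
per_60s = derivative(bucket_max(samples, 0, 240, 60))  # one sample per bucket
```

With 30 second buckets every populated bucket sits next to an empty one, so no difference can be computed. With 60 second buckets each bucket holds exactly one sample and the differences are well defined.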

@leehinman
Contributor Author

@fearful-symmetry try it now

@jlind23
Contributor

jlind23 commented Jun 12, 2024

@leehinman @pierrehilbert is this good to be merged?

@leehinman requested a review from flash1293 June 12, 2024 15:26
@leehinman
Contributor Author

@leehinman @pierrehilbert is this good to be merged?

Marius is out on paternity leave, so I added Joe Reuter as a reviewer. I'd like someone with a bit more Dashboard experience to review, just to make sure I'm not missing something.

Contributor

@flash1293 left a comment

Thanks for this. Some comments.

I ran this on some sample data and the first chart errors out for me:

Screenshot 2024-06-13 at 15 01 10

Didn't dig deeper on why that's happening, but it seems to be a TSVB vis, is there a reason for this? By my understanding we shouldn't use TSVB unless absolutely required.

In terms of the charts - all the time series charts are pretty dense and the y axis label doesn't really add much as the title is explaining it quite well already - I think we can omit them:
Screenshot 2024-06-13 at 15 03 45

For the x axis, it's pretty much the same - IMHO we can just remove the label here, the "per minute" is more confusing than helpful as we normalize the value per second anyway:
Screenshot 2024-06-13 at 15 04 07

@leehinman
Copy link
Contributor Author

Thanks for this. Some comments.

I ran this on some sample data and the first chart errors out for me:
Screenshot 2024-06-13 at 15 01 10

Didn't dig deeper on why that's happening, but it seems to be a TSVB vis, is there a reason for this? By my understanding we shouldn't use TSVB unless absolutely required.

Looks like this was done when the Dashboards were first made. I'm going to leave it as TSVB until Marius gets back and we can ask him about it, since he did the original work.

I think the error is "too much data", and I'm guessing that is because I set the interval to "1m". I switched it to ">=1m", let's see if that fixes it. If we go less than 1m, we have a different problem.
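
A rough sketch of the difference between the "1m" and ">=1m" settings mentioned above, assuming ">=1m" means "use the automatically chosen interval, but never less than one minute". The `effective_interval` and `bucket_count` helpers and the automatic interval values are hypothetical; how Kibana actually picks the automatic interval is not shown here.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def effective_interval(setting: str, auto_seconds: int) -> int:
    """Return the bucket width in seconds for an interval setting.

    "auto"  -> whatever width was chosen for the time range (auto_seconds)
    "1m"    -> always 60 seconds, however long the time range is
    ">=1m"  -> the automatic width, but never below 60 seconds
    """
    if setting == "auto":
        return auto_seconds
    match = re.fullmatch(r"(>=)?(\d+)([smhd])", setting)
    if match is None:
        raise ValueError(f"unsupported interval setting: {setting!r}")
    at_least, count, unit = match.groups()
    seconds = int(count) * UNIT_SECONDS[unit]
    return max(seconds, auto_seconds) if at_least else seconds


def bucket_count(range_seconds: int, interval_seconds: int) -> int:
    """Number of buckets needed to cover the time range (rounded up)."""
    return -(-range_seconds // interval_seconds)


# Assumed automatic choices: 30 s for a 15 minute range, 12 h for a 30 day range.
short_range, short_auto = 15 * 60, 30
long_range, long_auto = 30 * 86400, 12 * 3600

fixed_long = bucket_count(long_range, effective_interval("1m", long_auto))
floor_long = bucket_count(long_range, effective_interval(">=1m", long_auto))
floor_short = bucket_count(short_range, effective_interval(">=1m", short_auto))
```

Under these assumptions a fixed "1m" interval over 30 days needs 43200 buckets, which is consistent with the "too much data" guess, while ">=1m" falls back to the wider automatic interval on long ranges and still keeps short ranges at one bucket per minute.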

In terms of the charts - all the time series charts are pretty dense and the y axis label doesn't really add much as the title is explaining it quite well already - I think we can omit them: Screenshot 2024-06-13 at 15 03 45

I didn't find a way to "remove" the text, I can change it to " ", so nothing shows up, but I can't seem to remove it totally and reclaim the space. Do you know of a better way?

For the x axis, it's pretty much the same - IMHO we can just remove the label here, the "per minute" is more confusing than helpful as we normalize the value per second anyway: Screenshot 2024-06-13 at 15 04 07

These are automatically added, so I could remove the "@timestamp" part of "@timestamp per minute", but I didn't find a way to remove the "per minute" part. Any ideas?

@leehinman
Contributor Author

Updated dashboards look like:

Screenshot 2024-06-13 at 15 15 08

@flash1293
Contributor

I think the error is "too much data", and I'm guessing that is because I set the interval to "1m", I switched it to ">=1m", lets see if that fixes it. If we go less than 1m, we have a different problem.

That should work 👍

I didn't find a way to "remove" the text, I can change it to " ", so nothing shows up, but I can't seem to remove it totally and reclaim the space. Do you know of a better way?

You can remove the axis titles in the axis menu like this:

Screenshot 2024-06-17 at 08 54 43

@leehinman
Contributor Author

@flash1293 Thanks for the help on the axis titles. 2 of the panels are different and don't have those options, but I got it fixed on the rest. Screenshot below.

Screenshot 2024-06-18 at 08 47 38

@elasticmachine

💚 Build Succeeded


Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

@flash1293
Contributor

yeah, for tsvb you can't remove it I think - LGTM from the dashboard side based on the screenshots

@leehinman merged commit e40c48b into elastic:main on Jun 18, 2024
5 checks passed
@elasticmachine

Package elastic_agent - 1.20.0 containing this change is available at https://epr.elastic.co/search?package=elastic_agent

Labels
Integration:elastic_agent, Team:Elastic-Agent-Data-Plane
Projects
None yet
7 participants