Fix the telemetry collection of Logstash with metricbeat monitoring. #182304

mashhurs · 2024-05-01T21:41:04Z

Summary

Telemetry data collection is broken for Logstash when it is monitored with metricbeat. This PR addresses the following issues:

1) Resolve cluster UUID

- With self-monitoring, Kibana creates the `.monitoring-es*` index with a `type` field in its mapping, and documents default to `type:cluster_state`. Kibana uses the `type:cluster_state` condition when fetching the cluster UUID.
- With metricbeat, however, the `type` field does not exist in the mapping that metricbeat creates, so the query returns no results and the cluster UUID is not resolved.
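
A minimal sketch of a cluster-UUID filter that tolerates both data shapes. The `type: cluster_state` term comes from the description above; the metricbeat-side value (`metricset.name: cluster_stats`) and the helper name `buildClusterUuidFilter` are assumptions for illustration, not the code in this PR.

```typescript
// Sketch only: builds an Elasticsearch bool filter that matches cluster
// documents written by either legacy self-monitoring or metricbeat.
interface TermClause {
  term: Record<string, string>;
}

interface BoolFilter {
  bool: { should: TermClause[]; minimum_should_match: number };
}

// `metricbeatMetricset` is an assumed value; verify it against the mapping
// that metricbeat actually creates before relying on it.
function buildClusterUuidFilter(metricbeatMetricset = 'cluster_stats'): BoolFilter {
  return {
    bool: {
      should: [
        { term: { type: 'cluster_state' } }, // legacy self-monitoring shape
        { term: { 'metricset.name': metricbeatMetricset } }, // metricbeat shape (assumed)
      ],
      // At least one of the two shapes must match.
      minimum_should_match: 1,
    },
  };
}
```

With a `should` clause like this, a document from either collection mode can satisfy the filter, so the UUID lookup no longer depends on the legacy-only `type` field.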

2) `type` field mismatch in queries (especially the state query), and incorrect collapse field

- metricbeat sends the state event with `metricset.name:node`, but the state fetch query ignores this condition and instead uses the legacy `type:logstash_state` condition, which does not match metricbeat data.
- In the queries, the `collapse` field is not correct: the data shape changed from legacy to metricbeat monitoring, and the queries are tightly coupled to the legacy shape (1, 2, 3).
- In the queries, `filter_path` is also not correct, in both the state query and the stats query.
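
The mismatch described above can be sketched as a small selector that picks the filter term per collection mode. The two filter terms are taken from this description; the collapse field is left to the caller because the correct field depends on the index mapping, and the names `MonitoringMode` and `buildStateQueryParts` are hypothetical.

```typescript
type MonitoringMode = 'legacy' | 'metricbeat';

interface StateQueryParts {
  filter: { term: Record<string, string> };
  collapse: { field: string };
}

// Sketch only: chooses the state-document filter by collection mode.
// `collapseField` must be a field that is mapped in the target index.
function buildStateQueryParts(mode: MonitoringMode, collapseField: string): StateQueryParts {
  const filter: StateQueryParts['filter'] =
    mode === 'metricbeat'
      ? { term: { 'metricset.name': 'node' } } // metricbeat state events
      : { term: { type: 'logstash_state' } }; // legacy state events
  return { filter, collapse: { field: collapseField } };
}
```

Keeping the mode-specific parts in one place makes it harder for a query to mix a metricbeat filter with a legacy collapse field, or the reverse.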

3) Logstash state data frequency

metricbeat sends the state event (_node/stats/pipeline?graph=true) only once, whereas legacy monitoring sends it frequently. The Kibana telemetry fetcher queries ES for the latest update period, in which the state data will not be available. Sending it once may be a reasonable design for network efficiency. If that's the case, we should decide on the expected behavior:
- to still collect: we have to tune the query so that it picks up the state data.
- to leave it empty, assuming the state didn't change (personally, I would not choose this option, since collecting only once risks data loss).
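
One way to read the "still collect" option is to drop the lower time bound for the state query only, so an older state document can still match. This is a sketch under that assumption; `buildTimeRange` and its parameters are hypothetical, and the timestamp field name depends on the data shape.

```typescript
interface RangeBounds {
  gte?: string;
  lte: string;
  format: string;
}

interface RangeFilter {
  range: Record<string, RangeBounds>;
}

// Sketch only: builds a range filter; omitting `start` leaves the lower
// bound open so that a state document sent long ago is still eligible.
function buildTimeRange(field: string, end: string, start?: string): RangeFilter {
  const bounds: RangeBounds = { lte: end, format: 'strict_date_optional_time' };
  if (start !== undefined) {
    bounds.gte = start;
  }
  return { range: { [field]: bounds } };
}

// Stats: bounded to the latest update period.
const statsRange = buildTimeRange('timestamp', '2024-05-01T21:00:00Z', '2024-05-01T20:40:00Z');
// State: no lower bound, to be combined with a descending sort on the same field.
const stateRange = buildTimeRange('timestamp', '2024-05-01T21:00:00Z');
```

The trade-off is that an unbounded state query may scan more data, so it would likely need a sort and a small `size` to stay cheap.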

Checklist

Delete any items that are not applicable to this PR.

~~[ ] Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support~~
~~[ ] Documentation was added for features that require explanation or tutorials~~
Unit or functional tests were updated or added to match the most common scenarios
~~[ ] Flaky Test Runner was used on any tests changed~~
~~[ ] Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)~~
~~[ ] Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)~~
~~[ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list~~
~~[ ] This renders correctly on smaller devices using a responsive layout. (You can test this in your browser)~~
~~[ ] This was checked for cross-browser compatibility~~

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

Risk	Probability	Severity	Mitigation/Notes
Multiple Spaces—unexpected behavior in non-default Kibana Space.	Low	High	Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces.
Multiple nodes—Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks.	High	Low	Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure.
Code should gracefully handle cases when feature X or plugin Y are disabled.	Medium	High	Unit tests will verify that any feature flag or plugin combination still results in our service operational.
See more potential risk examples

For maintainers

This was checked for breaking API changes and was labeled appropriately

apmmachine · 2024-05-01T21:41:18Z

🤖 GitHub comments

Just comment with:

/oblt-deploy : Deploy a Kibana instance using the Observability test environments.
run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

x-pack/plugins/monitoring/server/telemetry_collection/get_cluster_uuids.ts

x-pack/plugins/monitoring/server/telemetry_collection/get_logstash_stats.ts

afharo

Thank you for taking a look at this! We needed someone with knowledge about the new shape of the data to fix this. 🧡

It Looks Great To Me! My only comment is: can we normalize the response of get_es_stats so that it doesn't leak out to other plugins and functions?
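
A sketch of what such normalization could look like: map either document shape to one internal type at the boundary of `get_es_stats`, so callers never see the raw shape. The field names (`cluster_uuid`, `elasticsearch.cluster.id`) and the helper `normalizeCluster` are illustrative assumptions, not the mappings or code used in this PR.

```typescript
interface NormalizedCluster {
  clusterUuid: string;
  shape: 'legacy' | 'metricbeat';
}

// Sketch only: converts a raw monitoring document into one internal shape.
function normalizeCluster(source: Record<string, any>): NormalizedCluster | undefined {
  // Legacy shape (assumed): top-level `cluster_uuid`.
  if (typeof source.cluster_uuid === 'string') {
    return { clusterUuid: source.cluster_uuid, shape: 'legacy' };
  }
  // Metricbeat shape (assumed): nested `elasticsearch.cluster.id`.
  const nestedId = source.elasticsearch?.cluster?.id;
  if (typeof nestedId === 'string') {
    return { clusterUuid: nestedId, shape: 'metricbeat' };
  }
  // Neither shape matched; let the caller decide how to handle it.
  return undefined;
}
```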

src/plugins/telemetry/server/telemetry_collection/get_local_stats.ts

src/plugins/telemetry_collection_manager/server/types.ts

x-pack/plugins/monitoring/server/telemetry_collection/get_es_stats.ts

x-pack/plugins/monitoring/server/telemetry_collection/get_logstash_stats.ts

mashhurs · 2024-05-05T14:52:11Z

FYI: I have updated the unit test cases to align with the current changes, and will try to add cases for metricbeat.

afharo

Thank you for bearing with me and explaining the reasoning behind the changes.

FWIW, my main concern is that we're leaking code outside the monitoring plugin. Other than that, all looking good aside from the global logstash request now being moved inside the loop.

x-pack/plugins/monitoring/server/telemetry_collection/get_logstash_stats.ts

x-pack/plugins/monitoring/server/telemetry_collection/get_all_stats.ts

x-pack/plugins/monitoring/server/telemetry_collection/get_logstash_stats.ts

x-pack/plugins/monitoring/server/telemetry_collection/get_es_stats.ts

src/plugins/telemetry_collection_manager/server/types.ts

x-pack/plugins/monitoring/server/telemetry_collection/get_es_stats.ts

afharo · 2024-05-08T16:43:15Z

Hmm, the failing tests indicate that we're somehow not returning the monitoring data... Maybe the new query is failing?

UPDATE: Found it in the logs:

[00:00:06]           │ proc [kibana] [2024-05-08T00:55:24.693+00:00][WARN ][plugins.usageCollection.usage-collection.collector-set] ResponseError: search_phase_execution_exception
[00:00:06]           │ proc [kibana] 	Caused by:
[00:00:06]           │ proc [kibana] 		illegal_argument_exception: no mapping found for `logstash.node.stats.logstash.uuid` in order to collapse on
[00:00:06]           │ proc [kibana] 	Root causes:
[00:00:06]           │ proc [kibana] 		illegal_argument_exception: no mapping found for `logstash.node.stats.logstash.uuid` in order to collapse on
[00:00:06]           │ proc [kibana]     at KibanaTransport.request (/var/lib/buildkite-agent/builds/kb-n2-4-spot-a9d9a28162911021/elastic/kibana-pull-request/kibana-build-xpack/node_modules/@elastic/transport/lib/Transport.js:492:27)
[00:00:06]           │ proc [kibana]     at processTicksAndRejections (node:internal/process/task_queues:95:5)
[00:00:06]           │ proc [kibana]     at KibanaTransport.request (/var/lib/buildkite-agent/builds/kb-n2-4-spot-a9d9a28162911021/elastic/kibana-pull-request/kibana-build-xpack/node_modules/@kbn/core-elasticsearch-client-server-internal/src/create_transport.js:51:16)
[00:00:06]           │ proc [kibana]     at ClientTraced.SearchApi [as search] (/var/lib/buildkite-agent/builds/kb-n2-4-spot-a9d9a28162911021/elastic/kibana-pull-request/kibana-build-xpack/node_modules/@elastic/elasticsearch/lib/api/api/search.js:66:12)
[00:00:06]           │ proc [kibana]     at fetchLogstashStats (/var/lib/buildkite-agent/builds/kb-n2-4-spot-a9d9a28162911021/elastic/kibana-pull-request/kibana-build-xpack/node_modules/@kbn/monitoring-plugin/server/telemetry_collection/get_logstash_stats.js:225:19)
[00:00:06]           │ proc [kibana]     at getLogstashStats (/var/lib/buildkite-agent/builds/kb-n2-4-spot-a9d9a28162911021/elastic/kibana-pull-request/kibana-build-xpack/node_modules/@kbn/monitoring-plugin/server/telemetry_collection/get_logstash_stats.js:312:5)
[00:00:06]           │ proc [kibana]     at async Promise.all (index 2)
[00:00:06]           │ proc [kibana]     at getAllStats (/var/lib/buildkite-agent/builds/kb-n2-4-spot-a9d9a28162911021/elastic/kibana-pull-request/kibana-build-xpack/node_modules/@kbn/monitoring-plugin/server/telemetry_collection/get_all_stats.js:34:49)
[00:00:06]           │ proc [kibana]     at async Promise.all (index 1)
[00:00:06]           │ proc [kibana]     at Collector.fetch (/var/lib/buildkite-agent/builds/kb-n2-4-spot-a9d9a28162911021/elastic/kibana-pull-request/kibana-build-xpack/node_modules/@kbn/monitoring-plugin/server/telemetry_collection/register_monitoring_telemetry_collection.js:227:33)
[00:00:06]           │ proc [kibana]     at CollectorSet.fetchCollector (/var/lib/buildkite-agent/builds/kb-n2-4-spot-a9d9a28162911021/elastic/kibana-pull-request/kibana-build-xpack/node_modules/@kbn/usage-collection-plugin/server/collector/collector_set.js:141:24)
[00:00:06]           │ proc [kibana]     at fetch_monitoringTelemetry (/var/lib/buildkite-agent/builds/kb-n2-4-spot-a9d9a28162911021/elastic/kibana-pull-request/kibana-build-xpack/node_modules/@kbn/usage-collection-plugin/server/collector/collector_set.js:175:103) {"service":{"node":{"roles":["background_tasks","ui"]}}}
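
The error in this log says the query collapses on a field that the target index does not map. A defensive sketch, assuming the failure reason is exposed in the error's `message`: treat that specific failure as "no data" rather than failing the whole collector. The helpers `isMissingCollapseMapping` and `searchOrEmpty` are hypothetical and not part of `@elastic/elasticsearch`.

```typescript
// Sketch only: recognizes the "no mapping found ... to collapse on" failure
// by its message text, as it appears in the log.
function isMissingCollapseMapping(err: { message: string }): boolean {
  return /no mapping found for `[^`]+` in order to collapse on/.test(err.message);
}

// Runs a search callback and returns an empty result when the collapse
// field is unmapped; any other error is rethrown unchanged.
async function searchOrEmpty<T>(search: () => Promise<T[]>): Promise<T[]> {
  try {
    return await search();
  } catch (err) {
    if (err instanceof Error && isMissingCollapseMapping(err)) {
      return [];
    }
    throw err;
  }
}
```

Matching on message text is brittle, so this only illustrates the idea; checking whether monitoring data exists before running the collapse query is likely the more robust approach.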

mashhurs · 2024-05-08T21:55:57Z

@elasticmachine merge upstream

afharo

LGTM! This is great! Thanks for such an effort!

mashhurs · 2024-05-08T23:22:10Z

> LGTM! This is great! Thanks for such an effort!

Thank you so much @afharo. This happened because of your huge help, much appreciated!

kibana-ci · 2024-05-09T00:50:56Z

💚 Build Succeeded

Buildkite Build
Commit: 37f3912

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id	before	after	diff
`monitoring`	18	20	+2

Total ESLint disabled count

id	before	after	diff
`monitoring`	25	27	+2

History

💚 Build #208859 succeeded 22b51e8
💔 Build #208800 failed 1701038
💔 Build #208539 failed f86444c
💔 Build #208309 failed af3c267
💔 Build #208301 failed 73160e1
💔 Build #207998 failed 540b7ce

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @mashhurs

mashhurs · 2024-05-13T16:38:25Z

@afharo, @neptunian can we please backport this change to upcoming 8.14.x releases?

afharo · 2024-05-13T18:19:19Z

I've added the appropriate label to backport this PR to the previous minor.

Did the same with #182857

Hopefully, our kibanamachine bot backports them for us.

kibanamachine · 2024-05-13T18:20:32Z

💚 All backports created successfully

Status	Branch	Result
✅	8.14

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

miltonhultgren · 2024-05-14T13:38:20Z

@afharo Thank you so much for sharing all your knowledge here and getting this done!

mashhurs self-assigned this May 1, 2024

mashhurs commented May 1, 2024

afharo mentioned this pull request May 3, 2024

Telemetry missing when Logstash is monitored exclusively by Metricbeat #179494

Closed

afharo linked an issue May 3, 2024 that may be closed by this pull request

Telemetry missing when Logstash is monitored exclusively by Metricbeat #179494

Closed

afharo reviewed May 3, 2024

mashhurs marked this pull request as ready for review May 5, 2024 14:44

mashhurs requested review from a team as code owners May 5, 2024 14:44

mashhurs force-pushed the telemetry-fix-for-logstash-with-metricbeat-monitoring branch from 8ba410c to c20f3f9 Compare May 5, 2024 14:49

mashhurs added the release_note:fix label May 5, 2024

mashhurs requested a review from afharo May 5, 2024 14:51

afharo reviewed May 6, 2024

mashhurs force-pushed the telemetry-fix-for-logstash-with-metricbeat-monitoring branch 2 times, most recently from 152d6a8 to f86444c Compare May 7, 2024 23:55

mashhurs commented May 8, 2024

x-pack/plugins/monitoring/server/telemetry_collection/get_es_stats.ts

afharo reviewed May 8, 2024

x-pack/plugins/monitoring/server/telemetry_collection/get_es_stats.ts

mashhurs and others added 10 commits May 8, 2024 11:20

Fix the telemetry collection of Logstash with metricbeat monitoring.

45e6f2b

Fetch Elasticsearch cluster state logics now supports metricbeat data shape, telemetry fetch logics updated to save and send metricbeat data shape.

8d894a2

Fix currently failed unit tests, covers legacy collection.

c81fc64

Fix ESLint issue.

8680248

Unit test cases for metricbeat monitoring.

82cbc9d

Typescript TS18048 error fix in get_logstash_stats.ts file

9df7bc8

Get Logstash stats for each cluster UUID.

ea35059

Fix get_all_stats unit test case.

a22ebf1

Tolerate metricbeat monitoring in cluster stats structure.

2f7fbb0

Resolve cluster version and set to cluster stats. Prevent collecting stats and state if no cluster Logstash monitoring data found. Co-authored-by: Alejandro Fernández Haro <afharo@gmail.com>

1701038

mashhurs force-pushed the telemetry-fix-for-logstash-with-metricbeat-monitoring branch from f86444c to 1701038 Compare May 8, 2024 18:21

Revert the cluster state set in the all states.

9b14ebe

Merge branch 'main' into telemetry-fix-for-logstash-with-metricbeat-monitoring

22b51e8

mashhurs requested a review from afharo May 8, 2024 23:08

afharo approved these changes May 8, 2024

mashhurs enabled auto-merge (squash) May 8, 2024 23:25

mashhurs disabled auto-merge May 8, 2024 23:25

Merge branch 'main' into telemetry-fix-for-logstash-with-metricbeat-monitoring

37f3912

mashhurs enabled auto-merge (squash) May 9, 2024 00:55

neptunian approved these changes May 9, 2024

mashhurs merged commit 26f6977 into elastic:main May 9, 2024
17 checks passed

kibanamachine added the v8.15.0 and backport:skip (this commit does not require backporting) labels May 9, 2024

smith added the apm:review label May 9, 2024

mashhurs added the 8.14 candidate label May 13, 2024

afharo added the backport:prev-minor label (backport to the previous minor version, i.e. one version back from main) and removed the backport:skip label May 13, 2024

kibanamachine mentioned this pull request May 13, 2024

[8.14] Fix the telemetry collection of Logstash with metricbeat monitoring. (#182304) #183331

Merged

kibanamachine added the v8.14.0 label May 13, 2024
