
[Telemetry] Move Monitoring collection strategy to a collector #82638

Merged: 5 commits merged on Nov 18, 2020

Conversation
Conversation

@afharo (Member) commented Nov 4, 2020:

Summary

As agreed in previous discussions, this PR is migrating the telemetry collection strategy registered by Monitoring to a StatsCollector.

This change lets the Telemetry and Monitoring plugins evolve without affecting each other as much as they currently do. It also improves the amount and quality of the data we collect, fixing cases where telemetry about production monitoring clusters was skipped entirely, or where fields reported only by local collectors were missing.

For full details about the discussions, refer to the RFC.
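For illustration only, here is a minimal sketch of what registering such a stats collector roughly looks like with the usageCollection plugin. The collector type, the getAllStats helper, and the exact fetch shape are assumptions for this sketch, not necessarily the names used in this PR.

import type { UsageCollectionSetup } from 'src/plugins/usage_collection/server';

// Hypothetical callCluster-style function bound to the monitoring data.
type CallCluster = (endpoint: string, params?: Record<string, unknown>) => Promise<unknown>;

// Placeholder for Monitoring's existing stats-gathering logic over the .monitoring-* indices.
declare function getAllStats(callCluster: CallCluster): Promise<unknown[]>;

export function registerMonitoringTelemetryCollection(
  usageCollection: UsageCollectionSetup,
  callCluster: CallCluster
) {
  // Register a stats collector; its output ends up under
  // stack_stats.kibana.plugins.monitoringTelemetry in the aggregated payload,
  // where the X-Pack collection strategy later picks it up (see the reduce further down).
  const collector = usageCollection.makeStatsCollector({
    type: 'monitoringTelemetry',
    isReady: () => true,
    fetch: async () => ({ stats: await getAllStats(callCluster) }),
  });
  usageCollection.registerCollector(collector);
}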

Planned follow-up PRs

As stated in my initial self-review, for the sake of easier review, I think that we need to implement a few follow-up PRs (already added to the teams' roadmap):


@afharo added the release_note:skip (Skip the PR/issue when compiling release notes) and v7.11.0 labels on Nov 4, 2020
@afharo force-pushed the telemetry/remove-monitoring-strategy branch 3 times, most recently from b2c9661 to 2eea853, on November 5, 2020 18:09
@afharo force-pushed the telemetry/remove-monitoring-strategy branch from 2eea853 to a448d23 on November 6, 2020 09:52
@afharo (Member, Author) left a comment:

Self-review explaining a few of my thoughts (and planned follow-up PRs) about some of these changes.

@@ -13,7 +13,6 @@
],
"optionalPlugins": [
"infra",
"telemetryCollectionManager",
@afharo (Member, Author):

The monitoring plugin no longer depends on the telemetryCollectionManager 🎉

Comment on lines -257 to -264
// TODO: For the telemetry plugin to work, we need to provide the new ES client.
// The new client should be initialized with a similar config to `this.cluster` but, since we're not using
// the new client in Monitoring Telemetry collection yet, setting the local client allows progress for now.
// The usage collector `fetch` method has been refactored to accept a `collectorFetchContext` object,
// exposing both es clients and the saved objects client.
// We will update the client in a follow up PR.
this.telemetryElasticsearchClient = elasticsearch.client;
this.telemetrySavedObjectsService = savedObjects;
@afharo (Member, Author):

These clients can be removed now that the monitoring plugin does not need to catch up with the changes in the telemetryCollectionManager 🎉

@@ -174,6 +157,11 @@ export class Plugin {
});

registerCollectors(plugins.usageCollection, config, cluster.callAsInternalUser);
registerMonitoringTelemetryCollection(
plugins.usageCollection,
cluster.callAsInternalUser,
@afharo (Member, Author):

Ideally, if we added the KibanaRequest to the fetch context (only when available), the collector itself could decide how to scope its requests as needed (this would also apply to the other collectors registered in registerCollectors) and avoid exposing data to users who shouldn't see it. We can revisit this in a follow-up PR.

@TinaHeiligers (Contributor):

@afharo we had decided to hold off on adding the KibanaRequest to the fetch context until there was a need for it. Is there a new need for that now and, if so, how urgently do we need to address that?

@afharo (Member, Author):

@TinaHeiligers, thank you for confirming that. I'm aware we're holding off on that until a need is identified; that's also why I preferred to deal with this issue in a separate PR, so we can discuss it properly.
The need I see in this situation (in both lines 159 and 162) is that those collectors always use the internal user, so if a user with fewer permissions requests the telemetry sample flyout, we'll return data about the monitored clusters, potentially exposing data that user shouldn't see.
In this monitoring use case, the collectors can't use the esClient/callCluster provided in the fetch context because they need the cluster client instead (pointing to the actual monitoring data). If we decide to provide the KibanaRequest (only when available), these collectors (the new one I added and the existing ones) can properly scope their requests and avoid any unwanted exposure of data.
Regarding the urgency, I guess it's up to us to decide. Personally, I'd vote for having it sorted before 7.11's feature freeze because of the potential leak of information. Once this PR is merged, I can build a quick POC to identify the implications, if you think that's OK 🙂
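To make the concern concrete, here is a rough sketch of the kind of scoping being discussed. The optional kibanaRequest in the fetch context did not exist yet at this point in the thread (that is what the follow-up PR referenced below addresses), so the context shape and the function name here are assumptions, not the PR's actual code.

import type { IClusterClient, KibanaRequest } from 'kibana/server';

// Assumed fetch-context shape: the usual clients plus an optional KibanaRequest.
interface AssumedFetchContext {
  kibanaRequest?: KibanaRequest;
}

export function makeMonitoringFetch(monitoringClusterClient: IClusterClient) {
  return async ({ kibanaRequest }: AssumedFetchContext) => {
    // When a request is available (e.g. the telemetry sample flyout), scope the
    // monitoring cluster client to the requesting user instead of the internal user,
    // so we never return monitoring data that the user isn't allowed to see.
    const esClient = kibanaRequest
      ? monitoringClusterClient.asScoped(kibanaRequest).asCurrentUser
      : monitoringClusterClient.asInternalUser;

    const { body } = await esClient.search({ index: '.monitoring-es-*', size: 0 });
    return { totalDocs: body.hits.total };
  };
}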

@afharo (Member, Author):

Adding a reference to the existing issue: #75875 🙂

@afharo (Member, Author) commented Nov 16, 2020:

I've created the PR #83413 to fix this. Once it's merged, I'll rebase this one and update accordingly.

@afharo (Member, Author):

That PR has been merged and I've updated this one to use it.

delete stats.stack_stats.kibana!.plugins.monitoringTelemetry;
}
return [...acc, stats, ...(monitoringTelemetry || [])];
}, [] as TelemetryAggregatedStats[]);
@afharo (Member, Author):

It may feel like we are mixing concepts here, since this telemetry collection strategy needs to know about other existing collectors in order to reshape the format that is ultimately sent to the telemetry service. But, as far as I know, that's one of the duties of a telemetry collection strategy (the OSS strategy also extracts some collectors to shape them into something else).

On an unrelated note, I'm planning a follow-up PR that will change the behaviour of the telemetryCollectionManager: it will use this X-Pack strategy as a replacement for the OSS strategy (instead of the current priority-based loop that tries the next lower-priority strategy when the previous one fails).
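For context, here is a slightly expanded, self-contained sketch of what the reduce shown a few lines above does. TelemetryAggregatedStats is simplified, the function name is illustrative, and the extraction of monitoringTelemetry (the .stats nesting) is not visible in the snippet, so that part is an assumption.

// Simplified shape for illustration; the real interface carries many more fields.
interface TelemetryAggregatedStats {
  collectionSource?: string;
  stack_stats: { kibana?: { plugins: Record<string, unknown> } };
  [key: string]: unknown;
}

export function spreadMonitoringTelemetry(
  allStats: TelemetryAggregatedStats[]
): TelemetryAggregatedStats[] {
  return allStats.reduce((acc, stats) => {
    // The monitoring stats collector reports an array of per-cluster payloads under
    // stack_stats.kibana.plugins.monitoringTelemetry.stats (assumed nesting). Pull them
    // out, drop the nested copy, and emit them as top-level documents alongside the
    // local payload.
    const monitoringTelemetry = (
      stats.stack_stats.kibana?.plugins.monitoringTelemetry as
        | { stats?: TelemetryAggregatedStats[] }
        | undefined
    )?.stats;
    if (monitoringTelemetry) {
      delete stats.stack_stats.kibana!.plugins.monitoringTelemetry;
    }
    return [...acc, stats, ...(monitoringTelemetry || [])];
  }, [] as TelemetryAggregatedStats[]);
}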

const { body } = await supertest
.post('/api/telemetry/v2/clusters/_stats')
.set('kbn-xsrf', 'xxx')
.send({ timestamp, unencrypted: true })
@afharo (Member, Author):

The timestamp parameter was only used by the monitoring strategy to retrieve the right time frame of telemetry data. Now it's unused and confusing, so I'll create a follow-up PR to remove it.
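Once that follow-up lands, the request in this test could presumably be reduced to something like the following sketch:

const { body } = await supertest
  .post('/api/telemetry/v2/clusters/_stats')
  .set('kbn-xsrf', 'xxx')
  .send({ unencrypted: true })
  .expect(200);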

@afharo added the v8.0.0 label on Nov 6, 2020
@afharo marked this pull request as ready for review on November 6, 2020 11:40
@afharo requested a review from a team as a code owner on November 6, 2020 11:40
@afharo requested a review from a team on November 6, 2020 11:40
@afharo linked an issue on Nov 6, 2020 that may be closed by this pull request
@afharo (Member, Author) commented Nov 9, 2020:

@elasticmachine merge upstream

@chrisronline (Contributor) left a comment:

I'm wondering if we can de-couple even more than this. What concerns me is that there is still a good chunk of telemetry code within the monitoring plugin.

What does the telemetry code need to function outside of the monitoring plugin? I'm guessing at a minimum, it needs access to the monitoring cluster, or at least an exposed callCluster that is connected to it.

Does anyone else feel that this level of separation will be ideal and attainable?

@TinaHeiligers (Contributor) commented Nov 10, 2020:

Does anyone else feel that this level of separation will be ideal and attainable?

@chrisronline ideally, Monitoring will just be another usage collector and provide the data needed for telemetry purposes. I'll leave it up to @afharo to discuss the details, but in short, there are a number of cases that we can't really get around easily right now.

@chrisronline (Contributor) commented:
I just worry that there is code inside of the monitoring plugin that my team doesn't really own. I feel like the telemetry team owns this code and it should live in a plugin that they also own. If we agree this is desirable for both teams, maybe we can make it work by exposing various helper functions from the monitoring plugin?

I know we previously agreed upon this solution, but after seeing it in action, I wonder if we can get to a better endgame.

@afharo (Member, Author) commented Nov 16, 2020:

I'm wondering if we can de-couple even more than this. What concerns me is that there is still a good chunk of telemetry code within the monitoring plugin.

@chrisronline I'm sure we can. For starters, bulk_uploader can stop requesting the bulkFetch for the UsageCollectors (I added that to the list of follow-up PRs).

The only reason I didn't implement it here was to keep the PRs as concise and easy to review as possible; we've seen in the past that overly long PRs make it easier for bugs to slip in. However, I'm happy to push all the changes to this PR if you think it'll be clearer to see the final state 🙂

I just worry that there is code inside of the monitoring plugin that my team doesn't really own. I feel like the telemetry team owns this code and it should live in a plugin that they also own.

I understand your concerns, and I agree we need to be cautious so we don't break things downstream. On the flip side, I don't think that code can be fully owned by the telemetry team: we don't fully understand the content of the .monitoring indices. And, once usage collectors' data is no longer reported to monitoring, we can't maintain what's collected about other products like logstash, beats, app_search, ...

That's part of the reason we now report collectionSource: 'monitoring'. The telemetry cluster will receive data from both collection sources and can prioritise the data and reported properties from one or the other. This new collector effectively reports "this is the telemetry I've got from the monitoring point of view".
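As a rough illustration of how the two sources would be told apart downstream (the field values and the 'local' label for the non-monitoring source are illustrative, not taken from this PR):

// Payload reported by Kibana's own local collection.
const localPayload = {
  collectionSource: 'local', // illustrative label for the non-monitoring source
  cluster_uuid: 'abc-123',
  stack_stats: { /* gathered by this Kibana's usage collectors */ },
};

// Payload reported by the new monitoring stats collector.
const monitoringPayload = {
  collectionSource: 'monitoring',
  cluster_uuid: 'def-456',
  stack_stats: { /* rebuilt from the .monitoring-* indices */ },
};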

I also agree, though, that it shouldn't be fully owned by the monitoring team. It feels to me more like a joint effort.

If we agree this is desirable for both teams, maybe we can make it work by exposing various helper functions from the monitoring plugin?

I'm not a big fan of this solution for this specific purpose: other plugins could exploit it to gain access to .monitoring data, so I'd be cautious about following that approach.

@chrisronline (Contributor) commented:
@afharo Thanks for the thoughts, much appreciated. Let's move forward with this PR as-is!

@kibanamachine (Contributor) commented:
💚 Build Succeeded

Metrics: ✅ unchanged

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@chrisronline (Contributor) left a comment:
LGTM!


@afharo merged commit 37636f3 into elastic:master on Nov 18, 2020
@afharo deleted the telemetry/remove-monitoring-strategy branch on November 18, 2020 18:16
afharo added a commit that referenced this pull request on Nov 18, 2020
…82638) (#83686)

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
phillipb added a commit to phillipb/kibana that referenced this pull request Nov 19, 2020
… into add-logs-to-node-details

chrisronline pushed a commit to chrisronline/kibana that referenced this pull request Nov 19, 2020
…ic#82638)

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Labels: release_note:skip (Skip the PR/issue when compiling release notes), v7.11.0, v8.0.0

Successfully merging this pull request may close these issues:

Telemetry & Monitoring: Kibana Monitoring & BulkUploader

5 participants