
Investigate risk of performance regression from share-capable saved object types #113743

Closed
rudolf opened this issue Oct 4, 2021 · 11 comments
Labels
Feature:Security/Sharing Saved Objects Platform Security - Sharing Saved Objects feature impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:small Small Level of Effort Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@rudolf
Contributor

rudolf commented Oct 4, 2021

When saved object types become share-capable (#100489), the saved objects repository adds additional logic to method calls, which could introduce a performance regression. It's impossible to guess how much impact the additional Elasticsearch queries will have, so the analysis below assumes that each operation itself has zero cost (takes 0ms) and that the only impact is the extra request/response roundtrip time, i.e. latency. So if the latency is 200ms, a 3x increase in latency means that, at a minimum, the operation will now take 600ms to complete.

SavedObjectsClient::resolve()

Whereas get() uses a single Elasticsearch API call, resolve() performs the following calls serially:

  1. bulk request that performs a scripted update on any legacy URL aliases that match the resolved id
  2. mget request to fetch matched saved objects (similar to a SavedObjectsClient::get())
  3. scripted update to record resolve outcome telemetry

impact: higher write load and 3x increase in latency
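
For illustration, here's a rough sketch of that serial call pattern using the @elastic/elasticsearch client (v8 API). The index name, document ids, and Painless scripts are stand-ins, not Kibana's actual ones:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Illustrative only: three serial round trips, mirroring resolve().
async function resolveLike(type: string, id: string, space: string) {
  // (1) bulk scripted update on any matching legacy URL alias
  await client.bulk({
    operations: [
      { update: { _index: '.kibana', _id: `legacy-url-alias:${space}:${type}:${id}` } },
      { script: { source: 'ctx._source.resolveCounter += 1' }, upsert: { resolveCounter: 1 } },
    ],
  });

  // (2) mget to fetch the matched saved object(s), like get() would
  const { docs } = await client.mget({ index: '.kibana', ids: [`${type}:${id}`] });

  // (3) scripted update to record resolve-outcome telemetry
  await client.update({
    index: '.kibana',
    id: 'usage-counters:resolveOutcome', // stand-in id
    script: { source: 'ctx._source.count += 1' },
    upsert: { count: 1 },
  });

  return docs;
}
```

Each round trip must wait for the previous one, so the floor on total time is three network latencies even when every operation is trivially cheap.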

SavedObjectsClient::update()

Adds an additional get request if updating a multi-namespace saved object type. Also, if upsert==true, adds an additional mget request (if the object exists in <=3 spaces) or search request (if the object exists in >3 spaces) to find aliases.

impact: up to 3x increase in latency
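
A rough sketch of where the extra round trips come from, with isMultiNamespace and the object's spaces list as stand-ins for the repository's type registry and namespace bookkeeping (again assuming the @elastic/elasticsearch v8 client):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const isMultiNamespace = (type: string) => type === 'dashboard'; // stand-in

async function updateLike(
  type: string,
  id: string,
  attrs: Record<string, unknown>,
  spaces: string[],
  upsert: boolean
) {
  if (isMultiNamespace(type)) {
    // extra round trip: preflight get of the existing document
    await client.get({ index: '.kibana', id: `${type}:${id}` });

    if (upsert) {
      // extra round trip: look for legacy URL aliases that would conflict
      if (spaces.length <= 3) {
        // few spaces: alias ids can be enumerated and fetched directly
        await client.mget({
          index: '.kibana',
          ids: spaces.map((s) => `legacy-url-alias:${s}:${type}:${id}`),
        });
      } else {
        // many spaces: fall back to a search
        await client.search({
          index: '.kibana',
          query: { term: { 'legacy-url-alias.targetId': id } },
        });
      }
    }
  }

  // the update itself: the only round trip a single-namespace type pays for
  await client.update({ index: '.kibana', id: `${type}:${id}`, doc: attrs });
}
```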

SavedObjectsClient::bulkUpdate()

Adds an additional mget request if multi-namespace saved objects are being updated.

impact: 2x increase in latency

SavedObjectsClient::incrementCounterInternal()

Adds an additional mget request if incrementing a multi-namespace saved object type. Also, if the object exists in >3 spaces, adds a search request to find aliases.

impact: up to 3x increase in latency

SavedObjectsClient::create()

Adds an additional mget request if overwrite==true and creating a multi-namespace saved object type. Also, if the object exists in >3 spaces, adds a search request to find aliases.

impact: up to 3x increase in latency
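
Sketched out, with the same kind of stand-in helper as above (index name, ids, and isMultiNamespaceType are illustrative):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const isMultiNamespaceType = (type: string) => type === 'dashboard'; // stand-in

async function createLike(
  type: string,
  id: string,
  doc: Record<string, unknown>,
  spaces: string[],
  overwrite: boolean
) {
  if (overwrite && isMultiNamespaceType(type)) {
    // preflight mget: the existing object, plus alias docs when the
    // object lives in few enough spaces to enumerate their ids directly
    const ids = [`${type}:${id}`];
    if (spaces.length <= 3) {
      ids.push(...spaces.map((s) => `legacy-url-alias:${s}:${type}:${id}`));
    }
    await client.mget({ index: '.kibana', ids });

    // a second extra round trip only when there are too many spaces
    if (spaces.length > 3) {
      await client.search({
        index: '.kibana',
        query: { term: { 'legacy-url-alias.targetId': id } },
      });
    }
  }

  // the write itself: in the worst case this is the 3rd serial round trip
  await client.index({ index: '.kibana', id: `${type}:${id}`, document: doc });
}
```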

SavedObjectsClient::bulkCreate()

Adds an additional mget request if overwrite==true and creating 1 or more multi-namespace saved object types. Also, if any object exists in >3 spaces, adds a search request to find aliases.

impact: up to 3x increase in latency
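
The key difference from create() is that the whole batch shares the extra round trips; a sketch (stand-in names again):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const isMultiNamespaceType = (type: string) => type === 'dashboard'; // stand-in

interface ObjToCreate {
  type: string;
  id: string;
  spaces: string[];
  attributes: Record<string, unknown>;
}

async function bulkCreateLike(objects: ObjToCreate[], overwrite: boolean) {
  const multi = objects.filter((o) => isMultiNamespaceType(o.type));
  if (overwrite && multi.length > 0) {
    // one mget covers every multi-namespace object in the batch
    await client.mget({ index: '.kibana', ids: multi.map((o) => `${o.type}:${o.id}`) });

    // one search covers alias lookups for any object in more than 3 spaces
    if (multi.some((o) => o.spaces.length > 3)) {
      await client.search({
        index: '.kibana',
        query: { terms: { 'legacy-url-alias.targetId': multi.map((o) => o.id) } },
      });
    }
  }

  // the bulk write itself
  await client.bulk({
    operations: objects.flatMap((o) => [
      { index: { _index: '.kibana', _id: `${o.type}:${o.id}` } },
      o.attributes,
    ]),
  });
}
```

So the latency multiplier stays bounded (at most 3 serial round trips) regardless of batch size.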

SavedObjectsClient::delete()

Adds an additional get request if deleting a multi-namespace saved object (requires force=true if the object exists in more than one space). Also adds an additional updateByQuery request to delete any associated aliases.

impact: up to 3x increase in latency
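
A sketch of the three potential round trips; deleting aliases via a scripted updateByQuery is my reading of the description above, and the field names are stand-ins:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const isMultiNamespaceType = (type: string) => type === 'dashboard'; // stand-in

async function deleteLike(type: string, id: string, force: boolean) {
  if (isMultiNamespaceType(type)) {
    // round trip 1: preflight get to see which spaces the object is in
    const doc = await client.get<{ namespaces?: string[] }>({
      index: '.kibana',
      id: `${type}:${id}`,
    });
    const namespaces = doc._source?.namespaces ?? [];
    if (namespaces.length > 1 && !force) {
      throw new Error('object exists in more than one space; pass force=true');
    }
  }

  // round trip 2: the delete itself
  await client.delete({ index: '.kibana', id: `${type}:${id}` });

  // round trip 3: remove any legacy URL aliases pointing at the object
  await client.updateByQuery({
    index: '.kibana',
    query: {
      bool: {
        filter: [
          { term: { 'legacy-url-alias.targetType': type } },
          { term: { 'legacy-url-alias.targetId': id } },
        ],
      },
    },
    script: { source: 'ctx.op = "delete"' },
  });
}
```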

SavedObjectsClient::bulkDelete()

Adds an additional mget request if deleting multi-namespace saved objects (requires force=true if an object exists in more than one space). Also performs an additional updateByQuery request per multi-namespace saved object in the bulk to delete any associated aliases. So, e.g., deleting 1000 multi-namespace saved objects will create an mget as part of the preflight check, a bulk request to actually delete the objects, and 1000 updateByQuery requests (10 requests in parallel).

impact: up to 3x increase in latency
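
To make the "10 requests in parallel" concrete, here's a sketch of the alias-cleanup phase with a simple worker pool; the pool helper and field names are stand-ins, not Kibana's actual scheduling code:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function cleanUpAliases(targets: Array<{ type: string; id: string }>) {
  const CONCURRENCY = 10; // at most 10 updateByQuery calls in flight
  const queue = [...targets];

  const worker = async () => {
    // each worker pulls the next target off the shared queue until empty
    for (let t = queue.shift(); t !== undefined; t = queue.shift()) {
      await client.updateByQuery({
        index: '.kibana',
        query: {
          bool: {
            filter: [
              { term: { 'legacy-url-alias.targetType': t.type } },
              { term: { 'legacy-url-alias.targetId': t.id } },
            ],
          },
        },
        script: { source: 'ctx.op = "delete"' },
      });
    }
  };

  await Promise.all(Array.from({ length: CONCURRENCY }, () => worker()));
}
```

With 1000 deleted objects, that's 100 serial updateByQuery round trips per worker, on top of the preflight mget and the bulk delete.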

@rudolf rudolf added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Oct 4, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@rudolf rudolf added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Oct 4, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@rudolf
Contributor Author

rudolf commented Oct 4, 2021

@elastic/kibana-alerting-services just want to make sure you're aware of the potential performance impact, as I think your plugins are the most performance-sensitive.

@mikecote
Contributor

Thanks, @rudolf. Is master the correct branch for the alerting team to assess performance impact?

Pinging the @elastic/security-detections-response team as well. They have some logic to import pre-packaged rules, which I believe calls create() hundreds of times.

@jportner
Contributor

jportner commented Oct 13, 2021

Thanks for opening this, Rudolf!

I have some thoughts on your comments for specific methods, but first, a general note:
Every single SavedObjectsClient method already has the following overhead:

  • AuthZ: an end-to-end call to Elasticsearch to check privileges before each request
  • Audit (disabled by default): N events recorded to disk before returning a response (certainly much lower latency than a network request, but worth noting)
  • Encryption: for supported saved object types, including Alerting Rules and Connectors, objects are potentially encrypted or decrypted

All that to say: the additional ES client calls we make in the Repository for multi-namespace saved objects don't necessarily translate into a 2x or 3x increase in latency. It would be good to try to capture some performance data, though.

SavedObjectsClient::resolve()

Whereas get() uses a single Elasticsearch API call, resolve() performs the following calls serially:

  1. bulk request that performs a scripted update on any legacy URL aliases that match the resolved id
  2. mget request to fetch matched saved objects (similar to a SavedObjectsClient::get())
  3. scripted update to record resolve outcome telemetry

impact: higher write load and 3x increase in latency

The scripted update to record the resolve outcome uses refresh: false, so it should be a very quick operation.
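
For reference, a write like this (ids and script are illustrative) returns as soon as ES has accepted it, without waiting for a refresh to make the change searchable:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// refresh: false means ES acknowledges the write without forcing a refresh,
// so the telemetry update adds little latency to resolve().
await client.update({
  index: '.kibana',
  id: 'usage-counters:resolveOutcome', // stand-in id
  script: { source: 'ctx._source.count += 1' },
  upsert: { count: 1 },
  refresh: false,
});
```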

SavedObjectsClient::update()

Whereas update() on a single-namespace saved object type performs a single Elasticsearch update operation, multi-namespace updates perform the following steps serially:

  1. get request for preflightCheckIncludesNamespace
  2. if upsert==true get request for preflightGetNamespaces
  3. update operation

Looking at the code, I think we have an opportunity to combine (1) and (2) 👍 I don't see why these were ever separate in the first place.
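
Roughly what the combination could look like (not the actual repository code): a single get serves both checks, since each one only needs the existing document's namespaces:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function combinedPreflight(type: string, id: string, currentSpace: string) {
  // one round trip instead of two: the same document feeds both
  // preflightCheckIncludesNamespace and preflightGetNamespaces
  const doc = await client.get<{ namespaces?: string[] }>({
    index: '.kibana',
    id: `${type}:${id}`,
  });
  const namespaces = doc._source?.namespaces ?? [];

  // check (1): is the object visible in the current space?
  if (!namespaces.includes(currentSpace) && !namespaces.includes('*')) {
    throw new Error('Not found');
  }

  // check (2): the upsert path can reuse the same namespaces list
  return namespaces;
}
```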

SavedObjectsClient::create()

Adds an additional get request if overwrite==true and creating a multi-namespace saved object type

impact: 2x increase in latency

SavedObjectsClient::bulkCreate()

Adds an additional mget request if creating 1 or more multi-namespace saved objects

impact: 2x increase in latency

It's also worth noting that in #114693, we're going to need to add a check before create/bulkCreate to make sure that there are no conflicting aliases. But we can combine this with the existing preflight check, so the Repository should still only make 2 round trips to ES for this.

@jportner
Contributor

Looking at the code, I think we have an opportunity to combine (1) and (2) 👍 I don't see why these were ever separate in the first place.

As "luck" would have it, I discovered a bug in the current update functionality when using upsert 🤦 While fixing it, I combined (1) and (2) (see #114929).

@rudolf
Contributor Author

rudolf commented Oct 15, 2021

I've updated the description to clarify what I mean by an increase in latency. Latency is the minimum impact we would see, and it would really only affect code where the hot path consists of saved objects operations happening in series, like task manager polling and attempting to claim a task, getting a claim conflict, retrying, etc.

@mikecote yes, master, or once we create an 8.x branch

@mikecote
Contributor

@rudolf thanks, I've opened #115197 to track the investigation on the alerting side of things 🙏

@jportner jportner added the Feature:Security/Sharing Saved Objects Platform Security - Sharing Saved Objects feature label Oct 27, 2021
@jportner
Contributor

jportner commented Nov 3, 2021

With the introduction of #116007, the create, bulkCreate, update, and incrementCounter APIs all make additional calls to ES to check for aliases before creating objects.

With the introduction of #117056, the delete API will make an additional updateByQuery call to ES.

I've updated the issue description accordingly.

@exalate-issue-sync exalate-issue-sync bot added impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:small Small Level of Effort labels Nov 4, 2021
@lukeelmers
Member

We'll go ahead and close this issue, as alerting was the main case we wanted to test here, and an investigation was already conducted in #115197.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
@rudolf
Contributor Author

rudolf commented Sep 20, 2022

Related #112492
