
Investigate risk of performance regression from share-capable saved object types #113743

Closed
rudolf opened this issue Oct 4, 2021 · 11 comments
Labels
Feature:Security/Sharing Saved Objects Platform Security - Sharing Saved Objects feature impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:small Small Level of Effort Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@rudolf
Contributor

rudolf commented Oct 4, 2021

When saved object types become share-capable (#100489), the saved objects repository adds additional logic to method calls, which could introduce a performance regression. It's impossible to guess how much impact the additional Elasticsearch queries will have, so the analysis below assumes that each operation itself has zero cost (takes 0ms) and that the only impact is the extra request/response roundtrip time, i.e. latency. So if the latency is 200ms, a 3x increase in latency means that, at a minimum, the operation will now take 600ms to complete.

SavedObjectsClient::resolve()

Whereas get() uses a single Elasticsearch API call, resolve() performs the following calls serially:

  1. bulk request that performs a scripted update on any legacy URL aliases that match the resolved id
  2. mget request to fetch matched saved objects (similar to a SavedObjectsClient::get())
  3. scripted update to record resolve outcome telemetry

impact: higher write load and 3x increase in latency
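
For illustration, here's a rough sketch of that serial call pattern using the @elastic/elasticsearch client (v8 API). The index name, document ids, and Painless scripts are stand-ins, not Kibana's actual ones:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Illustrative only: three serial round trips, mirroring resolve().
async function resolveLike(type: string, id: string, space: string) {
  // (1) bulk scripted update on any matching legacy URL alias
  await client.bulk({
    operations: [
      { update: { _index: '.kibana', _id: `legacy-url-alias:${space}:${type}:${id}` } },
      { script: { source: 'ctx._source.resolveCounter += 1' }, upsert: { resolveCounter: 1 } },
    ],
  });

  // (2) mget to fetch the matched saved object(s), like get() would
  const { docs } = await client.mget({ index: '.kibana', ids: [`${type}:${id}`] });

  // (3) scripted update to record resolve-outcome telemetry
  await client.update({
    index: '.kibana',
    id: 'usage-counters:resolveOutcome', // stand-in id
    script: { source: 'ctx._source.count += 1' },
    upsert: { count: 1 },
  });

  return docs;
}
```

Each round trip must wait for the previous one, so the floor on total time is three network latencies even when every operation is trivially cheap.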

SavedObjectsClient::update()

Adds an additional get request if updating a multi-namespace saved object type. Also, if upsert==true, adds an additional mget request (if the object exists in <=3 spaces) or search request (if the object exists in >3 spaces) to find aliases.

impact: up to 3x increase in latency
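
A rough sketch of where the extra round trips come from, with isMultiNamespace and the object's spaces list as stand-ins for the repository's type registry and namespace bookkeeping (again assuming the @elastic/elasticsearch v8 client):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const isMultiNamespace = (type: string) => type === 'dashboard'; // stand-in

async function updateLike(
  type: string,
  id: string,
  attrs: Record<string, unknown>,
  spaces: string[],
  upsert: boolean
) {
  if (isMultiNamespace(type)) {
    // extra round trip: preflight get of the existing document
    await client.get({ index: '.kibana', id: `${type}:${id}` });

    if (upsert) {
      // extra round trip: look for legacy URL aliases that would conflict
      if (spaces.length <= 3) {
        // few spaces: alias ids can be enumerated and fetched directly
        await client.mget({
          index: '.kibana',
          ids: spaces.map((s) => `legacy-url-alias:${s}:${type}:${id}`),
        });
      } else {
        // many spaces: fall back to a search
        await client.search({
          index: '.kibana',
          query: { term: { 'legacy-url-alias.targetId': id } },
        });
      }
    }
  }

  // the update itself: the only round trip a single-namespace type pays for
  await client.update({ index: '.kibana', id: `${type}:${id}`, doc: attrs });
}
```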

SavedObjectsClient::bulkUpdate()

Adds an additional mget request if multi-namespace saved objects are being updated.

impact: 2x increase in latency

SavedObjectsClient::incrementCounterInternal()

Adds an additional mget request if incrementing a multi-namespace saved object type. Also, if the object exists in >3 spaces, adds a search request to find aliases.

impact: up to 3x increase in latency

SavedObjectsClient::create()

Adds an additional mget request if overwrite==true and creating a multi-namespace saved object type. Also, if the object exists in >3 spaces, adds a search request to find aliases.

impact: up to 3x increase in latency
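
Sketched out, with the same kind of stand-in helper as above (index name, ids, and isMultiNamespaceType are illustrative):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const isMultiNamespaceType = (type: string) => type === 'dashboard'; // stand-in

async function createLike(
  type: string,
  id: string,
  doc: Record<string, unknown>,
  spaces: string[],
  overwrite: boolean
) {
  if (overwrite && isMultiNamespaceType(type)) {
    // preflight mget: the existing object, plus alias docs when the
    // object lives in few enough spaces to enumerate their ids directly
    const ids = [`${type}:${id}`];
    if (spaces.length <= 3) {
      ids.push(...spaces.map((s) => `legacy-url-alias:${s}:${type}:${id}`));
    }
    await client.mget({ index: '.kibana', ids });

    // a second extra round trip only when there are too many spaces
    if (spaces.length > 3) {
      await client.search({
        index: '.kibana',
        query: { term: { 'legacy-url-alias.targetId': id } },
      });
    }
  }

  // the write itself: in the worst case this is the 3rd serial round trip
  await client.index({ index: '.kibana', id: `${type}:${id}`, document: doc });
}
```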

SavedObjectsClient::bulkCreate()

Adds an additional mget request if overwrite==true and creating 1 or more multi-namespace saved object types. Also, if any object exists in >3 spaces, adds a search request to find aliases.

impact: up to 3x increase in latency
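
The key difference from create() is that the whole batch shares the extra round trips; a sketch (stand-in names again):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const isMultiNamespaceType = (type: string) => type === 'dashboard'; // stand-in

interface ObjToCreate {
  type: string;
  id: string;
  spaces: string[];
  attributes: Record<string, unknown>;
}

async function bulkCreateLike(objects: ObjToCreate[], overwrite: boolean) {
  const multi = objects.filter((o) => isMultiNamespaceType(o.type));
  if (overwrite && multi.length > 0) {
    // one mget covers every multi-namespace object in the batch
    await client.mget({ index: '.kibana', ids: multi.map((o) => `${o.type}:${o.id}`) });

    // one search covers alias lookups for any object in more than 3 spaces
    if (multi.some((o) => o.spaces.length > 3)) {
      await client.search({
        index: '.kibana',
        query: { terms: { 'legacy-url-alias.targetId': multi.map((o) => o.id) } },
      });
    }
  }

  // the bulk write itself
  await client.bulk({
    operations: objects.flatMap((o) => [
      { index: { _index: '.kibana', _id: `${o.type}:${o.id}` } },
      o.attributes,
    ]),
  });
}
```

So the latency multiplier stays bounded (at most 3 serial round trips) regardless of batch size.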

SavedObjectsClient::delete()

Adds an additional get request if deleting a multi-namespace saved object (requires force=true if the object exists in more than one space). Also adds an additional updateByQuery request to delete any associated aliases.

impact: up to 3x increase in latency
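
A sketch of the three potential round trips; deleting aliases via a scripted updateByQuery is my reading of the description above, and the field names are stand-ins:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const isMultiNamespaceType = (type: string) => type === 'dashboard'; // stand-in

async function deleteLike(type: string, id: string, force: boolean) {
  if (isMultiNamespaceType(type)) {
    // round trip 1: preflight get to see which spaces the object is in
    const doc = await client.get<{ namespaces?: string[] }>({
      index: '.kibana',
      id: `${type}:${id}`,
    });
    const namespaces = doc._source?.namespaces ?? [];
    if (namespaces.length > 1 && !force) {
      throw new Error('object exists in more than one space; pass force=true');
    }
  }

  // round trip 2: the delete itself
  await client.delete({ index: '.kibana', id: `${type}:${id}` });

  // round trip 3: remove any legacy URL aliases pointing at the object
  await client.updateByQuery({
    index: '.kibana',
    query: {
      bool: {
        filter: [
          { term: { 'legacy-url-alias.targetType': type } },
          { term: { 'legacy-url-alias.targetId': id } },
        ],
      },
    },
    script: { source: 'ctx.op = "delete"' },
  });
}
```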

SavedObjectsClient::bulkDelete()

Adds an additional mget request if deleting multi-namespace saved objects (requires force=true if an object exists in more than one space). Also performs an additional updateByQuery request per multi-namespace saved object in the bulk to delete any associated aliases. So, e.g., deleting 1000 multi-namespace saved objects will create an mget as part of the preflight check, a bulk request to actually delete the objects, and 1000 updateByQuery requests (10 requests in parallel).

impact: up to 3x increase in latency
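
To make the "10 requests in parallel" concrete, here's a sketch of the alias-cleanup phase with a simple worker pool; the pool helper and field names are stand-ins, not Kibana's actual scheduling code:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function cleanUpAliases(targets: Array<{ type: string; id: string }>) {
  const CONCURRENCY = 10; // at most 10 updateByQuery calls in flight
  const queue = [...targets];

  const worker = async () => {
    // each worker pulls the next target off the shared queue until empty
    for (let t = queue.shift(); t !== undefined; t = queue.shift()) {
      await client.updateByQuery({
        index: '.kibana',
        query: {
          bool: {
            filter: [
              { term: { 'legacy-url-alias.targetType': t.type } },
              { term: { 'legacy-url-alias.targetId': t.id } },
            ],
          },
        },
        script: { source: 'ctx.op = "delete"' },
      });
    }
  };

  await Promise.all(Array.from({ length: CONCURRENCY }, () => worker()));
}
```

With 1000 deleted objects, that's 100 serial updateByQuery round trips per worker, on top of the preflight mget and the bulk delete.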

@rudolf rudolf added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Oct 4, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@rudolf rudolf added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Oct 4, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@rudolf
Contributor Author

rudolf commented Oct 4, 2021

@elastic/kibana-alerting-services just want to make sure you're aware of the potential performance impact, as I think your plugins are the most performance-sensitive.

@mikecote
Contributor

Thanks, @rudolf. Is master the correct branch for the alerting team to assess performance impact?

Pinging the @elastic/security-detections-response team as well. They have some logic to import pre-packaged rules, which I believe calls create() hundreds of times.

@jportner
Contributor

jportner commented Oct 13, 2021

Thanks for opening this, Rudolf!

I have some thoughts on your comments for specific methods, but first, a general note:
Every single SavedObjectsClient method already has the following overhead:

  • AuthZ: an end-to-end call to Elasticsearch to check privileges before each request
  • Audit (disabled by default): N events recorded to disk before returning a response (certainly much lower latency than a network request, but worth noting)
  • Encryption: for supported saved object types, including Alerting Rules and Connectors, objects are potentially encrypted or decrypted

All that to say: the additional ES client calls we make in the Repository for multi-namespace saved objects don't necessarily translate into a 2x or 3x increase in latency. It would be good to try to capture some performance data, though.

SavedObjectsClient::resolve()

Whereas get() uses a single Elasticsearch API call, resolve() performs the following calls serially:

  1. bulk request that performs a scripted update on any legacy URL aliases that match the resolved id
  2. mget request to fetch matched saved objects (similar to a SavedObjectsClient::get())
  3. scripted update to record resolve outcome telemetry

impact: higher write load and 3x increase in latency

The scripted update to record the resolve outcome uses refresh: false, so it should be a very quick operation.
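
For reference, a write like this (ids and script are illustrative) returns as soon as ES has accepted it, without waiting for a refresh to make the change searchable:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// refresh: false means ES acknowledges the write without forcing a refresh,
// so the telemetry update adds little latency to resolve().
await client.update({
  index: '.kibana',
  id: 'usage-counters:resolveOutcome', // stand-in id
  script: { source: 'ctx._source.count += 1' },
  upsert: { count: 1 },
  refresh: false,
});
```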

SavedObjectsClient::update()

Whereas update() on a single-namespace saved object type performs a single Elasticsearch update operation, multi-namespace updates perform the following steps serially:

  1. get request for preflightCheckIncludesNamespace
  2. if upsert==true get request for preflightGetNamespaces
  3. update operation

Looking at the code, I think we have an opportunity to combine (1) and (2) 👍 I don't see why these were ever separate in the first place.
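
Roughly what the combination could look like (not the actual repository code): a single get serves both checks, since each one only needs the existing document's namespaces:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function combinedPreflight(type: string, id: string, currentSpace: string) {
  // one round trip instead of two: the same document feeds both
  // preflightCheckIncludesNamespace and preflightGetNamespaces
  const doc = await client.get<{ namespaces?: string[] }>({
    index: '.kibana',
    id: `${type}:${id}`,
  });
  const namespaces = doc._source?.namespaces ?? [];

  // check (1): is the object visible in the current space?
  if (!namespaces.includes(currentSpace) && !namespaces.includes('*')) {
    throw new Error('Not found');
  }

  // check (2): the upsert path can reuse the same namespaces list
  return namespaces;
}
```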

SavedObjectsClient::create()

Adds an additional get request if overwrite==true and creating a multi-namespace saved object type

impact: 2x increase in latency

SavedObjectsClient::bulkCreate()

Adds an additional mget request if creating 1 or more multi-namespace saved objects

impact: 2x increase in latency

It's also worth noting that in #114693, we're going to need to add a check before create/bulkCreate to make sure that there are no conflicting aliases. But we can combine this with the existing preflight check, so the Repository should still only make 2 round trips to ES for this.

@jportner
Contributor

Looking at the code, I think we have an opportunity to combine (1) and (2) 👍 I don't see why these were ever separate in the first place.

As "luck" would have it, I discovered a bug in the current update functionality when using upsert 🤦 While fixing it, I combined (1) and (2) (see #114929).

@rudolf
Contributor Author

rudolf commented Oct 15, 2021

I've updated the description to clarify what I mean by an increase in latency. Latency is the minimum impact we would see, and it would really only affect code where the hot path consists of saved objects operations happening in series, like task manager polling and attempting to claim a task, getting a claim conflict, retrying, etc.

@mikecote yes, master, or once we create an 8.x branch

@mikecote
Contributor

@rudolf thanks, I've opened #115197 to track the investigation on the alerting side of things 🙏

@jportner jportner added the Feature:Security/Sharing Saved Objects Platform Security - Sharing Saved Objects feature label Oct 27, 2021
@jportner
Contributor

jportner commented Nov 3, 2021

With the introduction of #116007, the create, bulkCreate, update, and incrementCounter APIs all make additional calls to ES to check for aliases before creating objects.

With the introduction of #117056, the delete API will make an additional updateByQuery call to ES.

I've updated the issue description accordingly.

@exalate-issue-sync exalate-issue-sync bot added impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:small Small Level of Effort labels Nov 4, 2021
@lukeelmers
Member

We'll go ahead and close this issue, as alerting was the main case we wanted to test here, and an investigation was already conducted in #115197.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
@rudolf
Contributor Author

rudolf commented Sep 20, 2022

Related #112492
