
[Discuss] Too many saved-objects + migrations == 💥 #92933

Closed
kobelb opened this issue Feb 25, 2021 · 27 comments
Labels
discuss Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@kobelb
Contributor

kobelb commented Feb 25, 2021

Currently, when there are a lot of saved-objects, it can take an extremely long time to migrate them during version upgrades. This runs counter to the future goal of saved-object migrations requiring only a small downtime window. This recently became such a large issue for Fleet's agent events that we resorted to dropping those saved-objects during migrations as a quick fix: #92188

Saved-objects are currently the only Platform-provided "building block" for data that should abide by Kibana's entity and RBAC model. As such, having to tell developers to hand-roll a replacement for saved-objects when there's too much data is going to be a long-term problem.

@kobelb kobelb added discuss Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc labels Feb 25, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@rudolf
Contributor

rudolf commented Mar 1, 2021

The bottleneck in v1 migrations, and soon in v2 migrations (once we implement _id rewriting support, #93002), is that Kibana reads all documents from the outdated index and then writes them into the target index in batches.

This adds network latency between Kibana and Elasticsearch to every batch. For many millions of documents this could be significant overhead compared to having Elasticsearch perform a reindex operation. Note that an Elasticsearch reindex operation isn't latency-free either: my understanding is that the coordinating node still reads batches from the shards and then issues bulk index operations to the destination shards, but doing the reindex inside Elasticsearch at least removes one hop. We would have to profile this on Cloud to understand how much of an impact it has.

v1 migrations used a batch size of 100, which turned out to be way too low; in one incident, setting the batch size to 1000 meant the migration completed in a matter of minutes.
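For illustration, here is a rough sketch of the client-driven loop described above; each batch costs one search round trip plus one bulk round trip between Kibana and Elasticsearch. This is not the real migration algorithm, and migrateDoc, batchSize, and the index names are placeholders.

import { Client } from '@elastic/elasticsearch';

// Illustrative only: the real migration algorithm is a state machine with
// retries and transform-error handling, but the per-batch round trips are the
// same. migrateDoc and the index names are placeholders.
async function clientSideReindex(
  client: Client,
  sourceIndex: string,
  destIndex: string,
  migrateDoc: (source: Record<string, unknown>) => Record<string, unknown>,
  batchSize = 1000
) {
  // Round trip: open a scroll and read the first batch from the outdated index.
  let { body } = await client.search({
    index: sourceIndex,
    scroll: '1m',
    size: batchSize,
    body: { query: { match_all: {} } },
  });

  while (body.hits.hits.length > 0) {
    // Round trip: transform in memory and bulk-write the batch into the target index.
    const bulkBody = body.hits.hits.flatMap((hit: any) => [
      { index: { _index: destIndex, _id: hit._id } },
      migrateDoc(hit._source),
    ]);
    await client.bulk({ body: bulkBody });

    // Round trip: fetch the next batch.
    ({ body } = await client.scroll({ scroll_id: body._scroll_id, scroll: '1m' }));
  }
}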

Here are some ideas that could improve performance:

  1. Increase the batch size (e.g. from the v2 default of 1k to 10k). This will increase Elasticsearch and Kibana heap consumption, but maybe even 10k batches are small enough?
  2. Isolate saved objects more, so that a migration of one type doesn't have to affect all documents of all other types (will require an index per saved object type: [WIP] Scaling Saved Objects fields count #70471)
  3. If we're migrating from a version where the space information has already been removed from the _id, we could go back to letting Elasticsearch perform the reindex operation, if that improves performance (see the sketch below).
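By contrast, an Elasticsearch-side reindex (idea 3) is a single request and the document copying stays inside the cluster. A rough sketch, assuming an existing Elasticsearch client instance and illustrative index names:

// Rough sketch of idea (3): let Elasticsearch copy the documents itself.
// Index names are illustrative; wait_for_completion: false returns a task to poll.
const { body: started } = await client.reindex({
  wait_for_completion: false,
  refresh: true,
  body: {
    source: { index: '.kibana_7.12.0_001' },
    dest: { index: '.kibana_7.13.0_001' },
  },
});

// Wait for the reindex task to finish before pointing the alias at the new index.
await client.tasks.get({ task_id: started.task, wait_for_completion: true });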

@kobelb
Contributor Author

kobelb commented Mar 2, 2021

Related #91768

@joshdover
Contributor

How confident are we that increasing the batch size will scale well?

Since we're going to need to support this ID rewriting migration for all of 8.x, I wonder if we shouldn't explore the option to use a painless script mentioned in #86247, even if it means we need changes from Elasticsearch for UUIDv5:

We would either have to be able to regenerate the _id in the script by potentially adding a flag to all documents that require regeneration (painless supports uuid, but we would probably have to ask for uuidv5 to be added).

This would give us confidence that we can scale the .kibana index to millions of objects without any larger changes to the service (for example, splitting into one index per type, option (2) above). We could delay this and only do it if it becomes a problem, but I'd prefer we get this complex migration right the first time, while we have as much context as possible about how it needs to work and everything that should be considered.

My concern with going back to a Kibana-driven reindex is that we'll be handicapping the existing Saved Objects service and won't be able to support smooth upgrades for potentially valid use cases that may require 1M+ documents.

@rudolf
Contributor

rudolf commented Mar 3, 2021

My concern with going back to a Kibana-driven reindex is that we'll be handicapping the existing Saved Objects service and won't be able to support smooth upgrades for potentially valid use cases that may require 1M+ documents.

Yeah, I think this is a valid concern and worth profiling, so that we can be certain about the performance difference between a Kibana-side and an Elasticsearch-side reindex for millions of documents.

I'll add this as an action point to #86247 so that we evaluate the options as part of our plan for 7.13.

One silver lining is that we don't yet have any valid use cases for millions of saved objects and don't anticipate having any for 7.x. So even though we need to continue supporting the _id rewriting migration for all of 8.x, we could do an Elasticsearch-side reindex if upgrading from >= 8.0. If a valid use case is introduced in 8.x, it would therefore not be impacted by the _id rewriting migration. But it's still more complex than having one algorithm which solves the problem equally well when migrating from 7.x or 8.x.

@rudolf
Contributor

rudolf commented Mar 3, 2021

If a client-side reindex doesn't scale well and we decide to use the scripted reindex, migrations for millions of saved objects would be fairly fast as long as we're only migrating types which have a low number of documents. But as soon as we migrate a type that has millions of documents we will again have the same performance problem (actually it'd be somewhat worse). That probably means that, to support migrations of millions of saved objects, we would have to let plugins write painless migration scripts, so that types with huge numbers of documents can be changed efficiently even if it's a worse developer experience.

@joshdover
Contributor

So even though we need to continue supporting the _id rewriting migration for all of 8.x, we could do an Elasticsearch-side reindex if upgrading from >= 8.0.

Good point, that does give us some more flexibility in the future. We have to keep in mind that we'd have to plan for and anticipate that need, though. Without knowing every new SO type that gets added and how it's used, that could be difficult (unless we add a new process). We didn't know about fleet-agent-events until it was in production and already a problem.

That probably means that, to support migrations of millions of saved objects, we would have to let plugins write painless migration scripts, so that types with huge numbers of documents can be changed efficiently even if it's a worse developer experience.

Do you think we'd be able to make this opt-in, so that only plugins that need this kind of scale have to use painless?

@rudolf
Contributor

rudolf commented Mar 3, 2021

Do you think we'd be able to make this opt-in, so that only plugins that need this kind of scale have to use painless?

Yeah, I think it makes the most sense to keep both: painless migrations happen as a scripted reindex, and afterwards we do the "normal" migrations in Kibana. We would only recommend painless migrations if performance would be unacceptable with "normal" migrations.

We would have to collect all the painless migration scripts and compose them into a single reindex script that gets applied. Generating/compiling painless feels like something that can get complex fast and be very hard to debug, so I think we should try to keep the feature set as minimal as possible (a rough sketch follows the list below):

  • the painless script will be applied to all documents of that type, not just outdated documents, so plugins need to add their own if (doc.migrationVersion > 7.12) {...} guard
  • the script will also have to bump the migrationVersion manually
  • you will have to use Painless Lab to develop/debug your migration
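A minimal sketch of how such a plugin-provided script might be composed into the scripted reindex, under the constraints above. The registration shape, field rename, and index names are all hypothetical; none of this API exists today.

// Hypothetical: a plugin hands core a raw painless migration for one SO type.
const painlessMigration = {
  type: 'fleet-agent-event',
  script: `
    // The script runs against every document of the type, so guard on
    // migrationVersion ourselves and bump it manually when we transform.
    def mv = ctx._source.migrationVersion;
    if (mv != null && mv['fleet-agent-event'] == '7.12.0') {
      // Hypothetical field rename, purely for illustration.
      ctx._source['fleet-agent-event'].policy_id = ctx._source['fleet-agent-event'].remove('config_id');
      ctx._source.migrationVersion['fleet-agent-event'] = '7.13.0';
    }
  `,
};

// Core would compose all registered scripts into a single reindex request
// (assumes an existing Elasticsearch client instance and illustrative index names).
await client.reindex({
  wait_for_completion: false,
  body: {
    source: { index: '.kibana_7.12.0_001' },
    dest: { index: '.kibana_7.13.0_001' },
    script: { lang: 'painless', source: painlessMigration.script },
  },
});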

I'm kind of hoping this will never be necessary but this could be one way to build an escape hatch for millions of saved objects.

@kobelb
Contributor Author

kobelb commented Mar 3, 2021

Is my understanding correct that saved-object migrations v2 is "fast" as long as we don't have a specific saved-object type with a ton of instances (millions) that defines a normal migration?

@mshustov
Contributor

mshustov commented Mar 5, 2021

I'll add this as an action point to #86247 so that we evaluate the options as part of our plan for 7.13.

We met today to talk about the #86247 scope. An implementation based on a painless script has several blockers on the Elasticsearch side (we need to make sure that painless provides support for uuidv5) and the Kibana side (the Spaces plugin transforming the id). We decided it's impractical and risky to invest too much time in this implementation right now.
In #86247 we will use a Kibana-side migration. After v8.0, provided that we have requirements to scale to millions of SOs, we can reconsider this decision and experiment with alternative implementations.

@kobelb
Contributor Author

kobelb commented Mar 5, 2021

We'll also need to take into consideration that we have never explicitly stated that Elasticsearch must have scripting enabled for Kibana to work. However, various features in Kibana have become dependent on scripting being enabled. Additionally, we haven't had any reports from users about this causing issues.

Therefore, during 7.x we previously decided that Kibana should continue to generally work when scripting is disabled in Elasticsearch, but we're comfortable with some features not working. If saved-object migrations were to stop working entirely in 7.x when scripting is disabled, this would be a deviation from the prior decision. If the need is great enough, we can revisit this decision and introduce a "breaking change" in a 7.x minor.

Starting in 8.0, we intend to explicitly document that scripting must be enabled in Elasticsearch and we're comfortable with Kibana failing in a fiery ball of glory when scripting is disabled.

@kobelb
Contributor Author

kobelb commented Mar 5, 2021

We met today to talk about the #86247 scope. An implementation based on a painless script has several blockers on the Elasticsearch side (we need to make sure that painless provides support for uuidv5) and the Kibana side (the Spaces plugin transforming the id). We decided it's impractical and risky to invest too much time in this implementation right now.
In #86247 we will use a Kibana-side migration. After v8.0, provided that we have requirements to scale to millions of SOs, we can reconsider this decision and experiment with alternative implementations.

What's the practical impact of us making this decision? What happens to Fleet users who happen to have a ton of saved-objects and then upgrade to a version of Kibana that has #86247 implemented? Will their migration take hours to complete?

@mshustov
Contributor

mshustov commented Mar 5, 2021

What's the practical impact of us making this decision? What happens to Fleet users who happen to have a ton of saved-objects and then upgrade to a version of Kibana that has #86247 implemented? Will their migration take hours to complete?

When we discussed the problem, it was mentioned that the Fleet team was going to fix the root problem with the large number of SOs. @jen-huang Is it on your roadmap?

@rudolf
Contributor

rudolf commented Mar 9, 2021

What's the practical impact of us making this decision?

Like we do for v1 migrations, once #86247 is implemented we can also drop fleet-agent-events during the migration. So I think this particular bug won't affect anyone.

However, we still have a risk that a new bug / feature creates millions of saved objects and causes hours of downtime for our users. Because v2 migrations use 1000-document batches, the impact will be less than this bug combined with v1 migrations, but the risk remains. Apart from educating developers, I don't think there are really any practical controls we could put in place to catch such bugs/features and reduce this risk.

I don't think we have a good framework for prioritizing such risks. This has a high impact on users, but the likelihood is probably low, and the cost to fix it is fairly high and, timeline-wise, risky to achieve before 8.0.

As a first step we could do worst-case benchmarking once #86247 is implemented to get a sense for how many saved objects are likely to cause long downtime of e.g. > 60 minutes. And then we can start discussing and coordinating this with the Elasticsearch team so that we're unblocked for 8.0.

@jen-huang
Contributor

When we discussed the problem, it was mentioned that the Fleet team was going to fix the root problem with the large number of SOs. @jen-huang Is it on your roadmap?

We will stop using fleet-agent-events in 7.13. I guess any existing docs will still be around though, which is a pain, but as long as we don't run migrations on them and don't continue writing to that saved object type, would that be ok?

@pmuellr
Member

pmuellr commented Apr 5, 2021

One of the plugins causing problems with millions of documents is task manager. Task manager writes "task documents" to the .kibana_task_manager alias (a saved object store) when a run-once task is queued to run. If it "runs successfully", we delete the document, but if it doesn't, we keep it around. The thought behind keeping the document is that we can look at it during problem diagnosis.

In this particular case, there's no real need to even copy these "failed" task documents during a migration. The old documents are still around in the old migrated-from index, if we need them.

So, you could imagine using a filter of some kind when doing a migration of task manager documents which would not return the failed tasks, presumably greatly cutting down the number of documents that need an actual migration.
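A rough sketch of what such a filter might look like, assuming failed tasks can be identified by a task.status field (an assumption about the task document shape, not a confirmed field name):

// Hypothetical source filter for migrating task documents: copy only tasks
// that are not in the 'failed' state. The 'task.status' field name is an
// assumption about how the failed state is recorded.
const taskMigrationSourceFilter = {
  bool: {
    must: [{ term: { type: 'task' } }],
    must_not: [{ term: { 'task.status': 'failed' } }],
  },
};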

It's a pretty specific use case, and due to some changes we're making with the way alerts and actions run, we likely will have very few of these failed task documents in 7.13 or 7.14, so even for task manager purposes, we may not need this capability for long.

But thought I'd mention it ...

@mshustov
Contributor

Apart from educating developers, I don't think there are really any practical controls we could put in place to catch such bugs/features and reduce this risk.

We can track the number of SO per type in our telemetry to detect a suspiciously large number of SO proactively, without waiting for the problem to be reported.
We can track migration duration per type as well. Any other ideas?

That probably means that, to support migrations of millions of saved objects, we would have to let plugins write painless migration scripts, so that types with huge numbers of documents can be changed efficiently even if it's a worse developer experience.

Does it mean plugins will write migrations against raw ES documents, not SO objects, which defeats the purpose of the SO abstraction?

So, you could imagine using a filter of some kind when doing a migration of task manager documents which would not return the failed tasks, presumably greatly cutting down the number of documents that need an actual migration.

@pmuellr thank you for the heads up. I'm wondering how the migration system would know whether it's safe to remove an SO or not. Such logic should be implemented by the plugin owning the domain. Maybe we should extend SavedObjectMigrationFn to allow plugins to drop SOs during migration?

export type SavedObjectMigrationFn<InputAttributes = unknown, MigratedAttributes = unknown> = (
  doc: SavedObjectUnsanitizedDoc<InputAttributes>,
  context: SavedObjectMigrationContext
) => SavedObjectUnsanitizedDoc<MigratedAttributes>;

something like:

export type SavedObjectMigrationFn<InputAttributes = unknown, MigratedAttributes = unknown> = (
  doc: SavedObjectUnsanitizedDoc<InputAttributes>,
  context: SavedObjectMigrationContext
  // remove SO if the function returns null
) => SavedObjectUnsanitizedDoc<MigratedAttributes> | null;
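A hypothetical migration using that extended signature might look something like this (the type, attribute shape, and drop condition are all made up for illustration):

// Hypothetical migration for a 'task' saved object: drop failed tasks instead
// of carrying them into the new index. Returning null (per the proposed
// signature above) would signal "do not copy this document".
const dropFailedTasks: SavedObjectMigrationFn<{ status: string }, { status: string }> = (
  doc,
  context
) => {
  if (doc.attributes.status === 'failed') {
    context.log.info(`Dropping failed task ${doc.id} during migration`);
    return null;
  }
  return doc;
};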

@mshustov mshustov mentioned this issue May 11, 2021
@mshustov
Contributor

mshustov commented May 14, 2021

We can track the number of SO per type in our telemetry to detect a suspiciously large number of SO proactively, without waiting for the problem to be reported.

Implemented in #98916

@pmuellr
Member

pmuellr commented May 14, 2021

I'm wondering how the migration system would know whether it's safe to remove an SO or not. Such logic should be implemented by the plugin owning the domain. Maybe we should extend SavedObjectMigrationFn to allow plugins to drop SOs during migration?

Yah, the Task Manager migrator could certainly determine whether or not to drop an SO (not migrate it into the new index) - there's a property we can check that tells us.

The other SO type that's problematic is action task parameter objects, which are associated with task manager documents, but the determination isn't quite as easy. I believe we've been advising to check the creation date on those, and if they are greater than a certain age, they can be deleted (can't remember if we were thinking 24hr, or perhaps 1 week). Same thing though, the actions migration could make that determination during the migration.

That correct, @mikecote ?

@mikecote
Contributor

That correct, @mikecote ?

Both of the above are correct.

The other SO type that's problematic is action task parameter objects, which are associated with task manager documents, but the determination isn't quite as easy. I believe we've been advising to check the creation date on those, and if they are greater than a certain age, they can be deleted (can't remember if we were thinking 24hr, or perhaps 1 week). Same thing though, the actions migration could make that determination during the migration.

We do some periodic cleanup, first by looking up failed tasks and then deleting linked action_task_params SOs at the same time. Eventually, this saved object type will go away once the task manager supports user context (API keys) itself.

I don't think cleaning up SOs would solve this GH issue, especially if we're exploring saved objects as data, which will enable us to create 100,000 to 1,000,000 and more SOs.

@mshustov
Contributor

We do some periodic cleanup, first by looking up failed tasks and then deleting linked action_task_params SOs at the same time.

Why, then, do we have cases where the migration fails with a timeout error due to a large number of task documents? We have several incidents reported every month.

I don't think cleaning up SOs would solve this GH issue, especially if we're exploring

AFAIK SO migrations v2 was designed to support migrating up to 300,000 SOs. If we need to support the migration of 1,000,000 SOs, we will have to revise the architecture, which will take some time. That doesn't change the fact that we can already clean up obsolete SOs now to speed up the migration.

@mikecote
Contributor

Why, then, do we have cases where the migration fails with a timeout error due to a large number of task documents? We have several incidents reported every month.

To provide a short-term fix for the surge in incident reports we've been having, I added a cleanup utility in 7.13 (#96971) that goes through old task and action_task_params documents and deletes them.

AFAIK SO migrations v2 was designed to support migrating up to 300,000 SOs. If we need to support the migration of 1,000,000 SOs, we will have to revise the architecture, which will take some time. That doesn't change the fact that we can already clean up obsolete SOs now to speed up the migration.

Correct, and if we need to revise the architecture, we can do it as part of the saved objects as data story once we have requirements.

@pgayvallet
Contributor

Linking to #104083 for its "Improving the migration performance" section.

@mshustov
Contributor

#106534 added the ability to filter out SOs before the migration is started. That should reduce the number of SOs to be migrated and speed the process up.
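For reference, a sketch of how a type owner might opt into this, assuming the hook added by #106534 is the excludeOnUpgrade property on the type registration; the type name, mappings, and query below are illustrative assumptions, not the exact shipped API.

// Sketch of a saved object type registration that excludes documents from the
// migration. The excludeOnUpgrade hook is assumed to return a query matching
// documents that should NOT be migrated; names and fields are illustrative.
core.savedObjects.registerType({
  name: 'task',
  hidden: true,
  namespaceType: 'agnostic',
  mappings: taskMappings, // assumed to be defined elsewhere by the plugin
  excludeOnUpgrade: async () => ({
    bool: {
      must: [
        { term: { type: 'task' } },
        { term: { 'task.status': 'failed' } },
      ],
    },
  }),
});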

@pgayvallet
Contributor

@rudolf @gsoldevila given all the perf improvements we made to the v2 migration algorithm, do you think we're fine closing this?

@gsoldevila
Contributor

gsoldevila commented Oct 31, 2023

There are still a few things we can do to improve it further (e.g. #160038),
but personally I'm fine with closing this one.

@rudolf
Contributor

rudolf commented Oct 31, 2023

++ We'll continue to review telemetry data to ensure our scalability stays ahead of our biggest customers, but at this point in time scalability is no longer a problem.

@rudolf rudolf closed this as completed Oct 31, 2023