[Discuss] Too many saved-objects + migrations == 💥 #92933
Pinging @elastic/kibana-core (Team:Core)
The bottleneck in v1 migrations, and soon in v2 migrations (once we implement rewriting _id support #93002), is that Kibana reads all documents from the outdated index and then writes them into the target index in batches. This adds network latency between Kibana and Elasticsearch to every batch, which for many millions of documents could be significant overhead compared to having Elasticsearch perform the reindex. Note that an Elasticsearch reindex operation isn't latency free either: my understanding is that the coordinating node still reads batches from shards and then issues bulk index operations to the destination shards, but doing the reindex inside Elasticsearch at least removes one hop. We would have to profile this on Cloud to understand how much of an impact this has.

v1 migrations used a batch size of 100, which turned out to be far too low; in one incident, raising the batch size to 1000 meant the migration completed in a matter of minutes. Here are some ideas which could improve performance:
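The per-batch round-trip cost described above can be modeled roughly. This is an illustrative sketch only, not a measurement of the actual migration code; the latency figures and function names are made up for the example.

```typescript
// Rough model of client-driven reindex overhead: every batch costs
// Kibana <-> Elasticsearch round trips on top of the indexing work itself.
// Numbers are illustrative, not measurements of the real migration.
function batchRoundTrips(totalDocs: number, batchSize: number): number {
  return Math.ceil(totalDocs / batchSize);
}

function addedLatencyMs(totalDocs: number, batchSize: number, rttMs: number): number {
  // Each batch is read from the old index and written to the new one,
  // so count two round trips per batch.
  return batchRoundTrips(totalDocs, batchSize) * 2 * rttMs;
}

// 5M documents at a 5ms round trip: batch size 100 vs 1000.
console.log(addedLatencyMs(5_000_000, 100, 5));  // 500000 ms (~8.3 min of pure latency)
console.log(addedLatencyMs(5_000_000, 1000, 5)); // 50000 ms (~50 s)
```

This is why the batch-size increase from 100 to 1000 alone made such a difference in the incident mentioned above: the added latency scales inversely with batch size.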
Related: #91768
How confident are we that increasing the batch size will scale well? Since we're going to need to support this ID rewriting migration for all of 8.x, I wonder if we shouldn't explore the option to use a painless script mentioned in #86247, even if it means we need changes from Elasticsearch for UUIDv5:
This would give us confidence that we can scale the […]

My concern with going back to a Kibana-driven reindex is that we'll be handicapping the existing Saved Objects service and won't be able to support smooth upgrades for potentially valid use cases that may require 1M+ documents.
Yeah, I think this is a valid concern and worth profiling so that we can be certain about the performance difference between a Kibana-driven vs an Elasticsearch-driven reindex for millions of documents. I'll add this as an action point to #86247 so that we evaluate the options as part of our plan for 7.13.

One silver lining is that we don't yet have any valid use cases for millions of saved objects and don't anticipate having any for 7.x. So even though we need to continue supporting the _id rewriting migration for all of 8.x, we could do an Elasticsearch-side reindex when upgrading from >= 8.0. If a valid use case is introduced in 8.x it would therefore not be impacted by the _id rewriting migration. But still, this is more complex than having one algorithm which solves the problem equally well when migrating from 7.x or 8.x.
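For concreteness, an Elasticsearch-side scripted reindex like the one discussed in #86247 could look roughly like the request body below. The index names and script are hypothetical; as noted above, the blocker is that Painless has no built-in UUIDv5, so the deterministic _id rewrite can't actually be expressed yet without changes from Elasticsearch.

```typescript
// Hypothetical request body for an Elasticsearch-side scripted reindex.
// Index names are illustrative. The script is a placeholder: the _id
// rewrite needs a UUIDv5 primitive that Painless does not provide today,
// which is exactly the Elasticsearch change discussed above.
const reindexBody = {
  source: { index: '.kibana_7.12.0_001' },
  dest: { index: '.kibana_7.13.0_001' },
  script: {
    lang: 'painless',
    // Placeholder: would need a uuidv5(...) equivalent from Elasticsearch.
    source: "ctx._id = /* uuidv5(namespace, type, ctx._id) */ ctx._id;",
  },
};
```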
If a client-side reindex doesn't scale well and we decide to use the scripted reindex, migrations for millions of saved objects would be fairly fast as long as we're only migrating types which have a low number of documents. But as soon as we migrate a type that has millions of documents we will again have the same performance problem (actually it would be somewhat worse). That probably means that to support migrations of millions of saved objects we would have to support plugins writing Painless migration scripts, so that types with huge numbers of documents can be changed efficiently, even if it's a worse developer experience.
Good point, that does give us some more flexibility in the future. We have to keep in mind that we'd have to plan for and anticipate that need. Without knowing every new SO type that gets added and how they're used, that could be difficult (unless we add a new process). We didn't know about fleet-agent-events until it was in production and already a problem.

Do you think we'd be able to make this opt-in, so that only plugins that need this kind of scale have to use Painless?
Yeah, I think it makes the most sense to keep both. Painless migrations happen as a scripted reindex, and afterwards we do the "normal" migrations in Kibana. We would only recommend Painless migrations if performance would be unacceptable with "normal" migrations. We would have to collect all the Painless migration scripts and compose them into a single reindex script that gets applied. Generating/compiling Painless feels like something that can get complex fast and become very hard to debug, so I think we should try to keep the feature set as minimal as possible:
I'm kind of hoping this will never be necessary, but this could be one way to build an escape hatch for millions of saved objects.
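The composition step described above (collecting per-type Painless scripts into one reindex script) could be sketched as follows. The registry shape, type names, and script bodies are all hypothetical; the point is only that each plugin-supplied snippet would need to be gated on the document's type.

```typescript
// Hedged sketch: compose per-type Painless migration scripts into a single
// reindex script by gating each snippet on the saved object's type field.
// The interface and snippet contents are assumptions for illustration.
interface PainlessMigration {
  soType: string; // saved-object type this snippet applies to
  script: string; // Painless statements operating on ctx._source
}

function composeReindexScript(migrations: PainlessMigration[]): string {
  return migrations
    .map((m) => `if (ctx._source.type == '${m.soType}') { ${m.script} }`)
    .join('\n');
}

const composed = composeReindexScript([
  { soType: 'dashboard', script: 'ctx._source.dashboard.version = 2;' },
  { soType: 'visualization', script: "ctx._source.visualization.kind = 'lens';" },
]);
```

Even this minimal version hints at why the thread wants the feature set kept small: the composed script is a generated string, so any bug in a single plugin's snippet is hard to attribute and debug.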
Is my understanding correct that saved-object migrations v2 is "fast" as long as we don't have a specific saved-object type with a ton of instances (millions) that defines a normal migration?
We met today to talk about the scope of #86247. An implementation based on […]
We'll also need to take into consideration that we haven't explicitly ever stated that Elasticsearch must have scripts enabled for Kibana to work. However, various features in Kibana have become dependent on scripting being enabled. Additionally, we haven't had any reports from users about this causing issues. Therefore, during 7.x we previously decided that Kibana should continue to generally work when scripting is disabled in Elasticsearch, but we're comfortable with some features not working. If saved-object migrations were to stop working entirely in 7.x when scripting is disabled, this would be a deviation from the prior decision. If the need is great enough, we can revisit this decision and introduce a "breaking change" in a 7.x minor. Starting in 8.0, we intend to explicitly document that scripting must be enabled in Elasticsearch, and we're comfortable with Kibana failing in a fiery ball of glory when scripting is disabled.
What's the practical impact of us making this decision? What happens to Fleet users who happen to have a ton of saved-objects and then upgrade to a version of Kibana that has #86247 implemented? Will their migration take hours to complete?
When we discussed the problem, it was mentioned that the Fleet team was going to fix the root problem with a large number of SOs. @jen-huang Is it on your roadmap?
Like we do for v1 migrations, once #86247 is implemented we can also drop […].

However, we still have a risk that a new bug or feature creates millions of saved objects and causes hours of downtime for our users. Because v2 migrations use 1000-document batches the impact will be smaller than the same bug combined with v1 migrations, but the risk remains. Apart from educating developers, I don't think there are really any practical controls we could put in place to catch such bugs/features and reduce this risk.

I don't think we have a good framework for prioritizing such risks. This has a high impact on users, but the likelihood is probably low, and the cost to fix it is fairly high and timeline-wise risky to achieve before 8.0. As a first step we could do worst-case benchmarking once #86247 is implemented to get a sense of how many saved objects are likely to cause long downtime of e.g. > 60 minutes. Then we can start discussing and coordinating this with the Elasticsearch team so that we're unblocked for 8.0.
We will stop using […]
One of the plugins causing problems with millions of documents is task manager. Task manager writes "task documents" to the […] index.

In this particular case, there's no real need to even copy these "failed" task documents during a migration. The old documents are still around in the old migrated-from index if we need them. So you could imagine using a filter of some kind when doing a migration of task manager documents which would not return the failed tasks, presumably greatly cutting down the number of documents that need an actual migration.

It's a pretty specific use case, and due to some changes we're making to the way alerts and actions run, we will likely have very few of these failed task documents in 7.13 or 7.14, so even for task manager purposes we may not need this capability for long. But I thought I'd mention it...
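The filtering idea above boils down to a predicate applied before documents enter the migration pipeline. A minimal sketch, assuming a simplified task document shape and a `status` field that marks failed tasks (both assumptions based on this discussion, not the actual task manager schema):

```typescript
// Sketch of the filtering idea: skip "failed" task documents rather than
// migrating them; they remain available in the old migrated-from index.
// The document shape and the status field are assumptions for illustration.
interface TaskDoc {
  id: string;
  task: { status: string };
}

function shouldMigrate(doc: TaskDoc): boolean {
  return doc.task.status !== 'failed';
}

const docs: TaskDoc[] = [
  { id: 'a', task: { status: 'idle' } },
  { id: 'b', task: { status: 'failed' } },
  { id: 'c', task: { status: 'running' } },
];

// Only the non-failed tasks would be copied into the target index.
const toMigrate = docs.filter(shouldMigrate);
```

In practice this filter would more likely be expressed as an Elasticsearch query on the source index so the failed documents are never read at all, but the predicate form shows the intent.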
We can track the number of SO per type in our telemetry to detect a suspiciously large number of SO proactively, without waiting for the problem to be reported.
Does it mean plugins will write migrations against ES documents, not SO objects, which defeats the purpose of the SO abstraction?
@pmuellr thank you for the heads up. I'm wondering how the migration system would know whether it's safe to remove an SO or not. Such logic should be implemented by the plugin owning the domain. Maybe we should extend SavedObjectMigrationFn (src/core/server/saved_objects/migrations/types.ts, lines 45 to 48 at 7aec820), something like:

```typescript
export type SavedObjectMigrationFn<InputAttributes = unknown, MigratedAttributes = unknown> = (
  doc: SavedObjectUnsanitizedDoc<InputAttributes>,
  context: SavedObjectMigrationContext
  // remove the SO if the function returns null
) => SavedObjectUnsanitizedDoc<MigratedAttributes> | null;
```
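With the extended signature above, a plugin-owned migration could drop documents it knows are safe to discard by returning null. A hedged sketch, with `SavedObjectUnsanitizedDoc` stubbed and a simplified task shape (the real types live in Kibana core and are richer than this):

```typescript
// Hypothetical use of the proposed null-returning signature: a task
// manager migration that drops failed tasks instead of copying them.
// SavedObjectUnsanitizedDoc is stubbed here; the status field is an
// assumption based on the discussion, not the real schema.
type SavedObjectUnsanitizedDoc<T> = { id: string; type: string; attributes: T };

interface TaskAttributes {
  status: string;
}

const dropFailedTasks = (
  doc: SavedObjectUnsanitizedDoc<TaskAttributes>
): SavedObjectUnsanitizedDoc<TaskAttributes> | null =>
  doc.attributes.status === 'failed' ? null : doc;

const kept = dropFailedTasks({ id: '1', type: 'task', attributes: { status: 'idle' } });
const dropped = dropFailedTasks({ id: '2', type: 'task', attributes: { status: 'failed' } });
```

This keeps the "is it safe to delete?" decision with the plugin that owns the type, which is the concern raised above.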
Implemented in #98916
Yeah, the task manager migrator could certainly determine whether or not to drop an SO (i.e. not migrate it into the new index); there's a property we can check that tells us. The other SO type that's problematic is action task parameter objects, which are associated with task manager documents, but there the determination isn't quite as easy. I believe we've been advising users to check the creation date on those, and if they are older than a certain age they can be deleted (can't remember if we were thinking 24 hours or perhaps 1 week). Same thing though: the actions migration could make that determination during the migration. Is that correct, @mikecote?
Both of the above are correct.
We do some periodic cleanup, first by looking up failed tasks and then deleting linked […].

I don't think cleaning up SOs would solve this GH issue, especially if we're exploring saved objects as data, which will enable us to create 100,000-1,000,000 and more SOs.
Why then do we have cases where the migration fails with a timeout error due to a large number of […]?
AFAIK SO migrations v2 were designed to support migrating up to 300,000 SOs. If we need to support the migration of 1,000,000 SOs, we will have to revise the architecture, which will take some time. That doesn't change the fact that we can already clean up obsolete SOs now to speed up the migration.
To provide a short-term fix for the surge in incident reports we've been having, I added a cleanup utility in 7.13 (#96971) that goes through old […]
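An age-based cleanup like the one described above could be driven by a delete-by-query request. The sketch below only builds the query body; the saved-object type name, the timestamp field, and the 7-day cutoff are all assumptions for illustration, not what #96971 actually does.

```typescript
// Hedged sketch of an age-based cleanup query: select documents older
// than a cutoff for deletion. The type name ('action_task_params') and
// timestamp field ('updated_at') are assumptions, not the actual
// implementation of the 7.13 cleanup utility.
function cleanupQuery(olderThanMs: number, now: number = Date.now()) {
  const cutoff = new Date(now - olderThanMs).toISOString();
  return {
    bool: {
      filter: [
        { term: { type: 'action_task_params' } }, // hypothetical SO type
        { range: { updated_at: { lt: cutoff } } }, // hypothetical field
      ],
    },
  };
}

// e.g. everything older than 7 days:
const query = cleanupQuery(7 * 24 * 60 * 60 * 1000);
```

Running fewer documents through the migration is the whole point: every document removed before the upgrade is a document the migration never has to read, transform, or write.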
Correct, and if we need to revise the architecture, we can do it with the saved objects as data story once we have requirements.
Linking to #104083 for the […]
#106534 added the ability to filter out SOs before the migration is started. That should reduce the number of SOs to be migrated and speed the process up.
@rudolf @gsoldevila given all the perf improvements we made to the v2 migration algorithm, do you think we're fine closing this?
There are still a few things we can do to improve it further (e.g. #160038), […]
++ We'll continue to review telemetry data to ensure our scalability stays ahead of our biggest customers, but at this point in time scalability is no longer a problem.
Currently, when there are a lot of saved objects it can take an extremely long time to migrate them during version upgrades. This is antithetical to the future goal of saved-object migrations causing only a small downtime window. Additionally, this was recently such a large issue for Fleet's agent events that we resorted to dropping these saved objects during migrations as a quick fix: #92188
Saved objects are currently the only Platform-provided "building block" for data that should abide by Kibana's entity and RBAC model. As such, having to tell developers that they need to hand-roll a replacement for saved objects when there's too much data is going to be a problem long-term.