
Eventually consistent saved object/data index migrations #96626

Closed
rudolf opened this issue Apr 8, 2021 · 7 comments
Labels: Feature:Migrations, project:ResilientSavedObjectMigrations, Team:Core


rudolf commented Apr 8, 2021

When designing v2 saved object migrations, one of the implicit design tradeoffs we made was choosing strong read consistency at the cost of a longer downtime window.

The strong read consistency means plugins will always get all matching results for a search and every read will only return documents in the latest format. However, this means Kibana is down until all saved objects have been migrated. With our target downtime window of < 10 minutes, this places an upper limit on how many saved objects Kibana can store (our best guess is about 300k saved objects).

However, some plugins might want to use Kibana's authorization model (RBAC, spaces) and other saved objects features while still creating hundreds of thousands or millions of documents. We could theoretically support data streams or ILM-managed indices, which can store millions of documents depending on the user's ILM policy. Supporting migrations for these indices would require a completely different migration algorithm.

These saved object types could opt in to eventually consistent migrations. In this mode, Kibana would start the migrations of these indices but wouldn't block startup on them. Any search might receive incomplete results while the documents are being transformed and the mappings updated. Plugins would have to be designed with this in mind and might have to display a message to users like "Migrations are currently in progress, all results might not be available yet".
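
To make the opt-in concrete, registration could look something like the sketch below. This is purely illustrative: the `migrationMode` option and the type name don't exist today, and the import path is an assumption; only the rest of the `registerType` call mirrors the current API.

```ts
// Purely a sketch: `migrationMode` is NOT a real SavedObjectsType option and
// 'my-high-volume-type' is a made-up name; the rest mirrors the usual registerType call.
import type { CoreSetup } from '@kbn/core/server';

export function registerHighVolumeType(core: CoreSetup) {
  core.savedObjects.registerType({
    name: 'my-high-volume-type',
    hidden: false,
    namespaceType: 'multiple',
    mappings: {
      properties: {
        title: { type: 'text' },
      },
    },
    migrations: {
      // Placeholder no-op migration; real transforms would go here.
      '8.0.0': (doc: any) => doc,
    },
    // Proposed opt-in: don't block Kibana startup on migrating this type's documents.
    // Searches may return unmigrated or missing docs until the background migration completes.
    migrationMode: 'eventuallyConsistent',
  } as any); // cast only because `migrationMode` isn't part of the current type definition
}
```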

@rudolf added the Team:Core and Feature:Saved Objects labels Apr 8, 2021
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)


afharo commented Apr 9, 2021

How would these long-running, large-data tasks play along with Elasticsearch's approach of handling all system indices in a common thread pool?

@pgayvallet

> These saved object types could opt in to eventually consistent migrations

Is this a name commonly used for 'non-blocking' migrations? I find the term 'eventually consistent' quite misleading tbh, and would rather go with 'blocking' vs 'non-blocking' migrations.

> Plugins would have to be designed with this in mind and might have to display a message to users like "Migrations are currently in progress, all results might not be available yet".

I see a couple of things here:

  • How do we provide the migration status to the plugin? I guess we'll need a new (probably observable-based) API to let type owners be informed of the status of the non-blocking type migration? (Rough sketch after this list.)

  • How do we handle failures in such a migration? Do we just 'propagate' them to the owner via this new API, or do we want to mimic the blocking-migration behavior and terminate Kibana on 'async migration' failures?
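
To make the first question concrete, I'm imagining something vaguely like this (entirely hypothetical names and shapes, nothing like this exists in core today):

```ts
// Entirely hypothetical: a per-type, observable-based status API for non-blocking migrations.
import type { Observable } from 'rxjs';

export type NonBlockingMigrationStatus =
  | { phase: 'in_progress'; processedDocs: number; totalDocs?: number }
  | { phase: 'completed' }
  | { phase: 'failed'; error: Error };

export interface SavedObjectsNonBlockingMigrations {
  /** Type owners subscribe to the background migration status of their own type. */
  getStatus$(type: string): Observable<NonBlockingMigrationStatus>;
}

// Hypothetical consumption by a type owner:
// coreStart.savedObjects.nonBlockingMigrations
//   .getStatus$('my-high-volume-type')
//   .subscribe((status) => {
//     if (status.phase !== 'completed') {
//       // surface "results may be incomplete" in the plugin's UI
//     }
//   });
```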


rudolf commented Apr 12, 2021

> How would these long-running, large-data tasks play along with Elasticsearch's approach of handling all system indices in a common thread pool?

We've been discussing removing .kibana from the system indices thread pool, since its load is more like a data index than other system indices (like .tasks or .security, which have a very narrow focus/purpose). But we would only run these migrations on huge indices (millions of docs), and in that case the index would definitely have to be in the "normal" thread pool.

> Is this a name commonly used for 'non-blocking' migrations? I find the term 'eventually consistent' quite misleading tbh, and would rather go with 'blocking' vs 'non-blocking' migrations.

"eventually consistent" is a common database term for "when you read you might not see all the latest writes, but if you wait long enough they will show up". So yes, it's non-blocking, we won't block kibana from starting up and we won't block plugins from searching/writing, but it's really important that they design their business logic around this.

> How do we provide the migration status to the plugin? I guess we'll need a new (probably observable-based) API to let type owners be informed of the status of the non-blocking type migration?

I think we can use the status API, which would also mean there's a public HTTP API for checking progress.
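
For illustration only, progress could then be polled over HTTP roughly like this; the savedObjects entry and its `meta` payload are assumptions, and the exact response format would still need to be designed:

```ts
// Illustrative only: polling the public status endpoint for non-blocking migration progress.
// The savedObjects entry and its `meta` shape below are assumptions, not an existing format.
async function logMigrationProgress() {
  const res = await fetch('http://localhost:5601/api/status');
  const { status } = await res.json();

  // Hypothetical degraded savedObjects status while background migrations run:
  // {
  //   level: 'degraded',
  //   summary: 'non-blocking migrations in progress',
  //   meta: { 'my-high-volume-type': { processedDocs: 120000, totalDocs: 2500000 } }
  // }
  console.log(status?.core?.savedObjects);
}
```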

> How do we handle failures in such a migration? Do we just 'propagate' them to the owner via this new API, or do we want to mimic the blocking-migration behavior and terminate Kibana on 'async migration' failures?

Yeah, this is tricky... If the non-blocking migration fails, for instance after Kibana has been up for 3 hours, a lot of writes will have been accepted, so it's no longer possible to roll back without losing data. We can't let users just be stuck without a way out. So I think these migrations will have to be more lenient and just log an error and continue. It could potentially be disastrous, like if all your data suddenly becomes unusable because it all failed to migrate, but the plugin should be designed with eventual consistency in mind, so if we eventually fix the bug and the data comes back, it should all be OK. Plugins would have to do a much better job of validating writes so that it's very unlikely that we get these kinds of migration bugs.
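
Roughly, the document transform loop would take a "log and continue" stance instead of failing the whole migration, something like this sketch (the helper and type names are made up, not the actual migrator implementation):

```ts
// Sketch of the "log an error and continue" stance for non-blocking migrations.
// These helper/type names are made up; this is not the actual migrator implementation.
interface SavedObjectDoc {
  id: string;
  type: string;
  attributes: Record<string, unknown>;
}

function migrateBatchLeniently(
  docs: SavedObjectDoc[],
  transform: (doc: SavedObjectDoc) => SavedObjectDoc,
  log: { error: (msg: string) => void }
): SavedObjectDoc[] {
  const migrated: SavedObjectDoc[] = [];
  for (const doc of docs) {
    try {
      migrated.push(transform(doc));
    } catch (err) {
      // Instead of terminating Kibana, record the failure and move on;
      // the document stays in its old format until a fixed migration retries it later.
      log.error(`Failed to migrate ${doc.type}:${doc.id}: ${err}`);
    }
  }
  return migrated;
}
```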

@rudolf changed the title from "Eventually consistent saved object migrations" to "Eventually consistent saved object/data index migrations" Apr 12, 2021
@pgayvallet

> I think we can use the status API, which would also mean there's a public HTTP API for checking progress.

I see a few limitations with doing that:

  • The status model wasn't really designed for such a complex status structure (see the ServiceStatus type). Also, having the SO service degraded or unavailable until the non-blocking migrations are complete could have consequences for plugins not relying on that new feature.
  • If we are to update the migration status in real time (e.g. adding current errors to the status, or the count of currently processed objects), we will flood the already complex status observable tree.
  • We are throttling the status API output observable, which may have some impact on consumers of the async migration status.

> So I think these migrations will have to be more lenient and just log an error and continue.

I agree that this seems like the only realistic option. Do you think implementing SO quarantine would then be necessary for this feature? Having a quarantine zone would allow users to eventually fix the failing objects, which would then be re-run through the migration at the next startup.
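
Something like the following is what I have in mind for the quarantine step; the `.kibana_quarantine` index and this whole flow are assumptions, not an existing mechanism (the sketch also assumes the 7.x Elasticsearch JS client):

```ts
// Hypothetical quarantine step: copy documents that fail a non-blocking migration to a
// dedicated index so users can inspect/fix them, then retry them at the next startup.
// `.kibana_quarantine` and this flow are assumptions, not an existing mechanism.
import type { Client } from '@elastic/elasticsearch';

async function quarantineDoc(
  esClient: Client,
  doc: { _id: string; _source: Record<string, unknown> },
  error: Error
) {
  await esClient.index({
    index: '.kibana_quarantine',
    id: doc._id,
    body: {
      ...doc._source,
      quarantine: {
        reason: error.message,
        quarantinedAt: new Date().toISOString(),
      },
    },
  });
}
```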


kobelb commented Aug 19, 2021

> Yeah, this is tricky... If the non-blocking migration fails, for instance after Kibana has been up for 3 hours, a lot of writes will have been accepted, so it's no longer possible to roll back without losing data. We can't let users just be stuck without a way out. So I think these migrations will have to be more lenient and just log an error and continue. It could potentially be disastrous, like if all your data suddenly becomes unusable because it all failed to migrate, but the plugin should be designed with eventual consistency in mind, so if we eventually fix the bug and the data comes back, it should all be OK. Plugins would have to do a much better job of validating writes so that it's very unlikely that we get these kinds of migration bugs.

In my opinion, this is the biggest drawback to this approach. In the situation where a migration does have issues, rolling back isn't really an option. We'd have to tell our users that features just won't work until a newer patch version of Kibana is released that addresses the migration issue.

@pgayvallet

Outdated issue: eventually consistent migrations are already implemented for serverless (ZDT), and if we ever want to implement them for traditional Kibana, the plan is to find the best way to port ZDT.

I'll go ahead and close this

@pgayvallet closed this as not planned Jul 5, 2024