
Eventually consistent saved object/data index migrations #96626

Closed
rudolf opened this issue Apr 8, 2021 · 7 comments
Labels: Feature:Migrations, project:ResilientSavedObjectMigrations, Team:Core


rudolf commented Apr 8, 2021

When designing v2 saved object migrations, one of the implicit design tradeoffs we made was choosing strong read consistency at the cost of a longer downtime window.

The strong read consistency means plugins will always get all matching results for a search and every read will only return documents in the latest format. However, this means Kibana is down until all saved objects have been migrated. With our target downtime window of < 10 minutes, this places an upper limit on how many saved objects Kibana can store (our best guess is about 300k saved objects).

However, some plugins might want to use Kibana's authorization model (RBAC, spaces) and other saved objects features while still creating hundreds of thousands or millions of documents. We could theoretically support data streams or ILM-managed indices, which can store millions of documents depending on the user's ILM policy. Supporting migrations for these indices would require a completely different migration algorithm.

These saved object types could opt in to eventually consistent migrations. In this mode, Kibana would start the migrations of these indices but wouldn't block startup on them. Any search might receive incomplete results while the documents are being transformed and the mappings updated. Plugins would have to be designed with this in mind and might have to display a message to users like "Migrations are currently in progress, all results might not be available yet".
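
To make the opt-in concrete, registration could look something like the sketch below. This is purely illustrative: the `migrationMode` option and the type name don't exist today, and the import path is an assumption; only the rest of the `registerType` call mirrors the current API.

```ts
// Purely a sketch: `migrationMode` is NOT a real SavedObjectsType option and
// 'my-high-volume-type' is a made-up name; the rest mirrors the usual registerType call.
import type { CoreSetup } from '@kbn/core/server';

export function registerHighVolumeType(core: CoreSetup) {
  core.savedObjects.registerType({
    name: 'my-high-volume-type',
    hidden: false,
    namespaceType: 'multiple',
    mappings: {
      properties: {
        title: { type: 'text' },
      },
    },
    migrations: {
      // Placeholder no-op migration; real transforms would go here.
      '8.0.0': (doc: any) => doc,
    },
    // Proposed opt-in: don't block Kibana startup on migrating this type's documents.
    // Searches may return unmigrated or missing docs until the background migration completes.
    migrationMode: 'eventuallyConsistent',
  } as any); // cast only because `migrationMode` isn't part of the current type definition
}
```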

@rudolf added the Team:Core and Feature:Saved Objects labels Apr 8, 2021
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)


afharo commented Apr 9, 2021

How would these long-running, large-data tasks play along with Elasticsearch's approach of handling all system indices in a common thread pool?

@pgayvallet

> These saved object types could opt in to eventually consistent migrations

Is this a name commonly used for 'non-blocking' migrations? I find the term 'eventually consistent' quite misleading tbh, and would rather go with 'blocking' vs 'non-blocking' migrations.

> Plugins would have to be designed with this in mind and might have to display a message to users like "Migrations are currently in progress, all results might not be available yet".

I see a couple of things here:

  • How do we provide the migration status to the plugin? I guess we'll need a new (probably observable-based) API to let type owners be informed of the status of the non-blocking type migration? (Rough sketch after this list.)

  • How do we handle failures in such a migration? Do we just 'propagate' them to the owner via this new API, or do we want to mimic the blocking-migration behavior and terminate Kibana on 'async migration' failures?
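
To make the first question concrete, I'm imagining something vaguely like this (entirely hypothetical names and shapes, nothing like this exists in core today):

```ts
// Entirely hypothetical: a per-type, observable-based status API for non-blocking migrations.
import type { Observable } from 'rxjs';

export type NonBlockingMigrationStatus =
  | { phase: 'in_progress'; processedDocs: number; totalDocs?: number }
  | { phase: 'completed' }
  | { phase: 'failed'; error: Error };

export interface SavedObjectsNonBlockingMigrations {
  /** Type owners subscribe to the background migration status of their own type. */
  getStatus$(type: string): Observable<NonBlockingMigrationStatus>;
}

// Hypothetical consumption by a type owner:
// coreStart.savedObjects.nonBlockingMigrations
//   .getStatus$('my-high-volume-type')
//   .subscribe((status) => {
//     if (status.phase !== 'completed') {
//       // surface "results may be incomplete" in the plugin's UI
//     }
//   });
```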


rudolf commented Apr 12, 2021

> How would these long-running, large-data tasks play along with Elasticsearch's approach of handling all system indices in a common thread pool?

We've been discussing removing .kibana from the system indices thread pool, since its load is more like a data index than other system indices (like .tasks or .security, which have a very narrow focus/purpose). But we would only run these migrations on huge indices (millions of docs), and in that case the index would definitely have to be in the "normal" thread pool.

> Is this a name commonly used for 'non-blocking' migrations? I find the term 'eventually consistent' quite misleading tbh, and would rather go with 'blocking' vs 'non-blocking' migrations.

"eventually consistent" is a common database term for "when you read you might not see all the latest writes, but if you wait long enough they will show up". So yes, it's non-blocking, we won't block kibana from starting up and we won't block plugins from searching/writing, but it's really important that they design their business logic around this.

> How do we provide the migration status to the plugin? I guess we'll need a new (probably observable-based) API to let type owners be informed of the status of the non-blocking type migration?

I think we can use the status API, which would also mean there's a public HTTP API for checking progress.
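
For illustration only, progress could then be polled over HTTP roughly like this; the savedObjects entry and its `meta` payload are assumptions, and the exact response format would still need to be designed:

```ts
// Illustrative only: polling the public status endpoint for non-blocking migration progress.
// The savedObjects entry and its `meta` shape below are assumptions, not an existing format.
async function logMigrationProgress() {
  const res = await fetch('http://localhost:5601/api/status');
  const { status } = await res.json();

  // Hypothetical degraded savedObjects status while background migrations run:
  // {
  //   level: 'degraded',
  //   summary: 'non-blocking migrations in progress',
  //   meta: { 'my-high-volume-type': { processedDocs: 120000, totalDocs: 2500000 } }
  // }
  console.log(status?.core?.savedObjects);
}
```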

> How do we handle failures in such a migration? Do we just 'propagate' them to the owner via this new API, or do we want to mimic the blocking-migration behavior and terminate Kibana on 'async migration' failures?

Yeah, this is tricky... If the non-blocking migration fails, for instance after Kibana has been up for 3 hours, a lot of writes will have been accepted, so it's no longer possible to roll back without losing data. We can't let users just be stuck without a way out. So I think these migrations will have to be more lenient and just log an error and continue. It could potentially be disastrous, like if all your data suddenly becomes unusable because it all failed to migrate, but the plugin should be designed with eventual consistency in mind, so if we eventually fix the bug and the data comes back, it should all be OK. Plugins would have to do a much better job of validating writes so that it's very unlikely that we get these kinds of migration bugs.
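
Roughly, the document transform loop would take a "log and continue" stance instead of failing the whole migration, something like this sketch (the helper and type names are made up, not the actual migrator implementation):

```ts
// Sketch of the "log an error and continue" stance for non-blocking migrations.
// These helper/type names are made up; this is not the actual migrator implementation.
interface SavedObjectDoc {
  id: string;
  type: string;
  attributes: Record<string, unknown>;
}

function migrateBatchLeniently(
  docs: SavedObjectDoc[],
  transform: (doc: SavedObjectDoc) => SavedObjectDoc,
  log: { error: (msg: string) => void }
): SavedObjectDoc[] {
  const migrated: SavedObjectDoc[] = [];
  for (const doc of docs) {
    try {
      migrated.push(transform(doc));
    } catch (err) {
      // Instead of terminating Kibana, record the failure and move on;
      // the document stays in its old format until a fixed migration retries it later.
      log.error(`Failed to migrate ${doc.type}:${doc.id}: ${err}`);
    }
  }
  return migrated;
}
```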

@rudolf changed the title from "Eventually consistent saved object migrations" to "Eventually consistent saved object/data index migrations" Apr 12, 2021
@pgayvallet

> I think we can use the status API, which would also mean there's a public HTTP API for checking progress.

I see a few limitations with doing that:

  • The status model wasn't really designed for such a complex status structure (see the ServiceStatus type). Also, having the SO service degraded or unavailable until the non-blocking migrations are complete could have consequences for plugins not relying on that new feature.
  • If we are to update the migration status in real time (e.g. adding current errors to the status, or the count of currently processed objects), we will flood the already complex status observable tree.
  • We are throttling the status API output observable, which may have some impact on consumers of the async migration status.

> So I think these migrations will have to be more lenient and just log an error and continue.

I agree that this seems like the only realistic option. Do you think implementing SO quarantine would then be necessary for this feature? Having a quarantine zone would allow users to eventually fix the failing objects, which would then be re-run through the migration at the next startup.
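
Something like the following is what I have in mind for the quarantine step; the `.kibana_quarantine` index and this whole flow are assumptions, not an existing mechanism (the sketch also assumes the 7.x Elasticsearch JS client):

```ts
// Hypothetical quarantine step: copy documents that fail a non-blocking migration to a
// dedicated index so users can inspect/fix them, then retry them at the next startup.
// `.kibana_quarantine` and this flow are assumptions, not an existing mechanism.
import type { Client } from '@elastic/elasticsearch';

async function quarantineDoc(
  esClient: Client,
  doc: { _id: string; _source: Record<string, unknown> },
  error: Error
) {
  await esClient.index({
    index: '.kibana_quarantine',
    id: doc._id,
    body: {
      ...doc._source,
      quarantine: {
        reason: error.message,
        quarantinedAt: new Date().toISOString(),
      },
    },
  });
}
```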


kobelb commented Aug 19, 2021

> Yeah, this is tricky... If the non-blocking migration fails, for instance after Kibana has been up for 3 hours, a lot of writes will have been accepted, so it's no longer possible to roll back without losing data. We can't let users just be stuck without a way out. So I think these migrations will have to be more lenient and just log an error and continue. It could potentially be disastrous, like if all your data suddenly becomes unusable because it all failed to migrate, but the plugin should be designed with eventual consistency in mind, so if we eventually fix the bug and the data comes back, it should all be OK. Plugins would have to do a much better job of validating writes so that it's very unlikely that we get these kinds of migration bugs.

In my opinion, this is the biggest drawback to this approach. In the situation where a migration does have issues, rolling back isn't really an option. We'd have to tell our users that features just won't work until a newer patch version of Kibana is released that addresses the migration issue.

@pgayvallet

Outdated issue: eventually consistent migrations are already implemented for serverless (ZDT), and if we ever want to implement them for traditional Kibana, the plan is to find the best way to port ZDT.

I'll go ahead and close this

@pgayvallet closed this as not planned Jul 5, 2024