Retry INIT step when hitting incompatible_cluster_routing_allocation #131809
…luster_routing_allocation
@@ -171,7 +171,7 @@ Upgrade migrations fail because routing allocation is disabled or restricted (`c

 [source,sh]
 --------------------------------------------
-Unable to complete saved object migrations for the [.kibana] index: [unsupported_cluster_routing_allocation] The elasticsearch cluster has cluster routing allocation incorrectly set for migrations to continue. To proceed, please remove the cluster routing allocation settings with PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": null}, "persistent": {"cluster.routing.allocation.enable": null}}
+Unable to complete saved object migrations for the [.kibana] index: [incompatible_cluster_routing_allocation] The elasticsearch cluster has cluster routing allocation incorrectly set for migrations to continue. To proceed, please remove the cluster routing allocation settings with PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": null}, "persistent": {"cluster.routing.allocation.enable": null}}
`incompatible_cluster_routing_allocation` is a much better description than `unsupported_cluster_routing_allocation`. In hindsight, `unsupported` wasn't the correct term to use here and is ambiguous (leading to questions like: Is it Kibana or ES that doesn't support the setting?). Naming is hard though 😃
src/core/server/saved_objects/migrations/actions/initialize_action.ts
    const result = await task();
    expect(Either.isRight(result)).toEqual(true);
  });
  it('resolves right when valid transient settings, incompatible persistent settings', async () => {
Ah, I hadn't considered a mix of settings when first adding this additional check. Great catch!
Your PR is very similar to a draft I started working on (on local only). One thing I did overlook though (and you have here) is a mix of transient and persistent settings.
Looking good!
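For readers following the thread, the mixed-settings case can be sketched as below. This is a minimal illustration, not the code from this PR: the `ClusterSettings` shape (flat setting keys) and the function name `isRoutingAllocationCompatible` are assumptions made for the example. It relies on two points from the discussion: transient settings take preference over persistent ones, and the test above expects a valid transient value to win over an incompatible persistent value.

```typescript
// Hypothetical shape of a cluster settings response using flat setting keys.
// This is an assumption for illustration, not the type used in the PR.
interface ClusterSettings {
  transient?: { 'cluster.routing.allocation.enable'?: string | null };
  persistent?: { 'cluster.routing.allocation.enable'?: string | null };
}

const ROUTING_ALLOCATION_ENABLE = 'cluster.routing.allocation.enable';

// Returns true when the effective routing allocation setting lets migrations
// proceed: the setting is either unset or explicitly 'all'.
function isRoutingAllocationCompatible(settings: ClusterSettings): boolean {
  // Transient takes preference; fall back to persistent only when transient is unset.
  const effective =
    settings.transient?.[ROUTING_ALLOCATION_ENABLE] ??
    settings.persistent?.[ROUTING_ALLOCATION_ENABLE];
  return effective === undefined || effective === null || effective === 'all';
}
```

With this sketch, `{ transient: 'all', persistent: 'none' }` is treated as compatible, while `{ transient: 'none', persistent: 'all' }` is not, because only the effective value is checked.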
Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>
      dataArchive: Path.join(
        __dirname,
        'archives',
        '8.0.0_v1_migrations_sample_data_saved_objects.zip'
I used this archive since it doesn't contain as many saved objects, so the migration completes a bit faster
Pinging @elastic/kibana-core (Team:Core)
LGTM!
@@ -151,7 +151,7 @@ export interface ActionErrorTypeMap {
   documents_transform_failed: DocumentsTransformFailed;
   request_entity_too_large_exception: RequestEntityTooLargeException;
   unknown_docs_found: UnknownDocsFound;
-  unsupported_cluster_routing_allocation: UnsupportedClusterRoutingAllocation;
+  incompatible_cluster_routing_allocation: IncompatibleClusterRoutingAllocation;
Keeping the naming consistent is great!
@@ -41,7 +41,7 @@ describe('migration v2', () => {
     await new Promise((resolve) => setTimeout(resolve, 10000));
   });

-  it.skip('migrates the documents to the highest version', async () => {
+  it('migrates the documents to the highest version', async () => {
Ooh, nice!
The failing test is due to a broken doc link from fleet. I reached out to the team.
💚 Build Succeeded
⚪ Backport skipped
The pull request was not backported as there were no branches to backport to. If this is a mistake, please apply the desired version labels or run the backport tool manually.
…lastic#131809)

* Add reproducing test case
* Fix and add integration test
* Transient settings should take preference
* Rename unsupported_cluster_routing_allocation error to incompatible_cluster_routing_allocation
* Retry INIT when action fails with [incompatible_cluster_routing_allocation]
* Apply suggestions from code review

Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>

* Fix archive with trial licence and re-enable skipped test
* Integration test for incompatible cluster routing allocation
* Fix types after renaming UnsupportedClusterRoutingAllocation
* Attempt to fix open handle tests

Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>
Summary
When Cloud restarts an Elasticsearch cluster, it temporarily disables routing allocation. This causes migrations to fail and Kibana to bootloop until the setting is enabled again. Compared to just retrying the failing step, bootlooping is much more CPU intensive (especially because the container will only try 5 times before the whole container is restarted). This change makes Kibana poll the cluster settings by retrying the INIT step until they're compatible, and then continue with the migration.
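The retry behaviour described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation in this PR: the `Either` type is a local stand-in, and `retryInit`, `RetryOptions`, `maxAttempts`, `baseDelayMs` and `maxDelayMs` are hypothetical names. The sketch uses a plain loop with capped exponential backoff and a bounded number of attempts so that it terminates; the actual retry limits and delays in Kibana may differ.

```typescript
// Local stand-in for an Either type; an assumption for this sketch.
type Either<L, R> = { _tag: 'Left'; left: L } | { _tag: 'Right'; right: R };

interface IncompatibleClusterRoutingAllocation {
  type: 'incompatible_cluster_routing_allocation';
}

// Hypothetical options; the names and values are not taken from the PR.
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Re-runs the INIT-like step until it succeeds or the attempts are exhausted,
// instead of failing once and letting the whole process restart.
async function retryInit<R>(
  init: () => Promise<Either<IncompatibleClusterRoutingAllocation, R>>,
  opts: RetryOptions
): Promise<R> {
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    const result = await init();
    if (result._tag === 'Right') {
      return result.right; // settings are compatible, continue the migration
    }
    if (attempt < opts.maxAttempts) {
      // Capped exponential backoff between polls of the cluster settings.
      const delayMs = Math.min(opts.baseDelayMs * 2 ** (attempt - 1), opts.maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(
    `[incompatible_cluster_routing_allocation] still failing after ${opts.maxAttempts} attempts`
  );
}
```

Polling inside the process like this is cheap compared to a bootloop, since each retry costs one settings request rather than a full Kibana startup.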
Related to: #131681