Retry INIT step when hitting incompatible_cluster_routing_allocation #131809

Merged
rudolf merged 12 commits into main from retry-incompatible-cluster-settings on May 11, 2022

Conversation

@rudolf rudolf commented May 9, 2022

Summary

When Cloud restarts an Elasticsearch cluster, it temporarily disables routing allocation. This causes migrations to fail and Kibana to boot-loop until routing allocation is enabled again. Compared to simply retrying the failing step, boot-looping is much more CPU-intensive (especially because the container tries only 5 times and is then restarted as a whole). With this change, Kibana polls the cluster settings by retrying the INIT step until they are compatible, and then continues with the migration.
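The retry behaviour described above could be sketched roughly as follows. This is a simplified illustration and not the code in this PR: `InitState`, `InitResponse`, `nextInitState`, the `MAX_RETRIES` limit, and the exponential delay are all assumptions made for the example.

```typescript
// Hypothetical, simplified model of a retryable INIT step.
// None of these names are taken from the Kibana code base.
type InitResponse =
  | { ok: true }
  | { ok: false; type: 'incompatible_cluster_routing_allocation' };

interface InitState {
  controlState: 'INIT' | 'NEXT_STEP' | 'FATAL';
  retryCount: number;
  retryDelayMs: number;
  reason?: string;
}

// Assumed retry budget for this sketch only.
const MAX_RETRIES = 15;

function nextInitState(state: InitState, res: InitResponse): InitState {
  if (res.ok) {
    // Settings are compatible: reset the retry bookkeeping and move on.
    return { controlState: 'NEXT_STEP', retryCount: 0, retryDelayMs: 0 };
  }
  if (state.retryCount >= MAX_RETRIES) {
    // Give up instead of polling forever.
    return { ...state, controlState: 'FATAL', reason: `[${res.type}] retries exhausted` };
  }
  const retryCount = state.retryCount + 1;
  // Exponential back-off, capped at 64 seconds.
  const retryDelayMs = Math.min(1000 * 2 ** retryCount, 64000);
  // Stay in INIT so the same step runs again after the delay.
  return { controlState: 'INIT', retryCount, retryDelayMs };
}
```

The point of the sketch is that an incompatible response keeps the state machine in INIT with a growing delay instead of failing the whole process, which is what avoids the container restart loop.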

Related to: #131681

Checklist

Delete any items that are not applicable to this PR.

Risk Matrix

Delete this section if it is not applicable to this PR.

Before closing this PR, invite QA, stakeholders, and other developers to identify risks that should be tested prior to the change/feature release.

When forming the risk matrix, consider some of the following examples and how they may potentially impact the change:

| Risk | Probability | Severity | Mitigation/Notes |
|---|---|---|---|
| Multiple Spaces: unexpected behavior in non-default Kibana Space. | Low | High | Integration tests will verify that all features are still supported in non-default Kibana Space and when user switches between spaces. |
| Multiple nodes: Elasticsearch polling might have race conditions when multiple Kibana nodes are polling for the same tasks. | High | Low | Tasks are idempotent, so executing them multiple times will not result in logical error, but will degrade performance. To test for this case we add plenty of unit tests around this logic and document manual testing procedure. |
| Code should gracefully handle cases when feature X or plugin Y are disabled. | Medium | High | Unit tests will verify that any feature flag or plugin combination still leaves our service operational. |
See more potential risk examples

For maintainers

@rudolf rudolf added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Migrations bug Fixes for quality problems that affect the customer experience labels May 9, 2022
@@ -171,7 +171,7 @@ Upgrade migrations fail because routing allocation is disabled or restricted (`c

[source,sh]
--------------------------------------------
-Unable to complete saved object migrations for the [.kibana] index: [unsupported_cluster_routing_allocation] The elasticsearch cluster has cluster routing allocation incorrectly set for migrations to continue. To proceed, please remove the cluster routing allocation settings with PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": null}, "persistent": {"cluster.routing.allocation.enable": null}}
+Unable to complete saved object migrations for the [.kibana] index: [incompatible_cluster_routing_allocation] The elasticsearch cluster has cluster routing allocation incorrectly set for migrations to continue. To proceed, please remove the cluster routing allocation settings with PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": null}, "persistent": {"cluster.routing.allocation.enable": null}}

incompatible_cluster_routing_allocation is a much better description than unsupported_cluster_routing_allocation. In hindsight, unsupported wasn't the correct term to use here and is ambiguous (leading to questions like: Is it Kibana or ES that doesn't support the setting?). Naming is hard though 😃
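For reference, the remediation request quoted in the error message could be built along these lines. This is only a sketch: `ResetRequest` and `buildResetRoutingAllocationRequest` are hypothetical names, and the example just assembles the method, path, and body from the message text rather than calling any real client.

```typescript
// Setting name taken from the error message in the docs diff.
const ROUTING_ALLOCATION_ENABLE = 'cluster.routing.allocation.enable';

// Hypothetical description of an HTTP request; not a real client type.
interface ResetRequest {
  method: 'PUT';
  path: string;
  body: {
    transient: Record<string, null>;
    persistent: Record<string, null>;
  };
}

// Setting a cluster setting to null removes it, so the default applies again.
function buildResetRoutingAllocationRequest(): ResetRequest {
  return {
    method: 'PUT',
    path: '/_cluster/settings',
    body: {
      transient: { [ROUTING_ALLOCATION_ENABLE]: null },
      persistent: { [ROUTING_ALLOCATION_ENABLE]: null },
    },
  };
}
```

Resetting both the transient and the persistent value matters because either one can carry an incompatible setting.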

const result = await task();
expect(Either.isRight(result)).toEqual(true);
});
it('resolves right when valid transient settings, incompatible persistent settings', async () => {

Ah, I hadn't considered a mix of settings when first adding this additional check. Great catch!

@TinaHeiligers TinaHeiligers left a comment

Your PR is very similar to a draft I started working on (on local only). One thing I did overlook though (and you have here) is a mix of transient and persistent settings.
Looking good!
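The mixed transient/persistent case discussed in these comments can be illustrated with a small sketch. It assumes the settings arrive with flat keys and relies on Elasticsearch applying a transient value in preference to a persistent one, with `all` as the default. `ClusterSettings` and `isRoutingAllocationCompatible` are hypothetical names, not the helpers changed in this PR.

```typescript
// Valid values of cluster.routing.allocation.enable in Elasticsearch.
type AllocationEnable = 'all' | 'primaries' | 'new_primaries' | 'none';

// Assumed (flat-key) shape of the cluster settings response for this sketch.
interface ClusterSettings {
  transient?: { 'cluster.routing.allocation.enable'?: AllocationEnable };
  persistent?: { 'cluster.routing.allocation.enable'?: AllocationEnable };
}

function isRoutingAllocationCompatible(settings: ClusterSettings): boolean {
  const key = 'cluster.routing.allocation.enable';
  // Transient wins over persistent; when neither is set, the default is 'all'.
  const effective = settings.transient?.[key] ?? settings.persistent?.[key] ?? 'all';
  return effective === 'all';
}

// Valid transient setting, incompatible persistent setting: compatible.
console.log(
  isRoutingAllocationCompatible({
    transient: { 'cluster.routing.allocation.enable': 'all' },
    persistent: { 'cluster.routing.allocation.enable': 'none' },
  })
); // prints true
```

Checking only the persistent value, or requiring both to be `all`, would reject the case covered by the new test.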

dataArchive: Path.join(
__dirname,
'archives',
'8.0.0_v1_migrations_sample_data_saved_objects.zip'
Contributor Author

I used this archive since it doesn't contain as many saved objects, so the migration completes a bit faster

@rudolf rudolf marked this pull request as ready for review May 10, 2022 14:47
@rudolf rudolf requested a review from a team as a code owner May 10, 2022 14:47
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)

@TinaHeiligers TinaHeiligers left a comment

LGTM!

@@ -151,7 +151,7 @@ export interface ActionErrorTypeMap {
documents_transform_failed: DocumentsTransformFailed;
request_entity_too_large_exception: RequestEntityTooLargeException;
unknown_docs_found: UnknownDocsFound;
-unsupported_cluster_routing_allocation: UnsupportedClusterRoutingAllocation;
+incompatible_cluster_routing_allocation: IncompatibleClusterRoutingAllocation;

Keeping the naming consistent is great!
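To illustrate why a consistently keyed error map is convenient, here is a reduced sketch with only two entries. `ErrorTypeMapSketch` and `isErrorOfType` are hypothetical and written for this example; they are not the definitions touched by the diff above.

```typescript
interface IncompatibleClusterRoutingAllocation {
  type: 'incompatible_cluster_routing_allocation';
}

interface UnknownDocsFound {
  type: 'unknown_docs_found';
}

// Reduced stand-in for an error type map: each key matches the `type` field
// of the error it maps to.
interface ErrorTypeMapSketch {
  incompatible_cluster_routing_allocation: IncompatibleClusterRoutingAllocation;
  unknown_docs_found: UnknownDocsFound;
}

// Type guard that narrows an unknown value to the error registered under `type`.
function isErrorOfType<T extends keyof ErrorTypeMapSketch>(
  res: unknown,
  type: T
): res is ErrorTypeMapSketch[T] {
  return (
    typeof res === 'object' &&
    res !== null &&
    (res as { type?: unknown }).type === type
  );
}
```

Because the map key and the `type` string are the same, renaming the error only stays type-safe if both are renamed together, which is what the hunk above does.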

@@ -41,7 +41,7 @@ describe('migration v2', () => {
await new Promise((resolve) => setTimeout(resolve, 10000));
});

-it.skip('migrates the documents to the highest version', async () => {
+it('migrates the documents to the highest version', async () => {

Ooh, nice!

rudolf commented May 11, 2022

The failing test is due to a broken doc link from fleet. I reached out to the team.

@rudolf rudolf added the auto-backport Deprecated: Automatically backport this PR after it's merged label May 11, 2022
@rudolf rudolf enabled auto-merge (squash) May 11, 2022 13:43
@kibana-ci

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@rudolf rudolf merged commit 575c559 into main May 11, 2022
@rudolf rudolf deleted the retry-incompatible-cluster-settings branch May 11, 2022 15:05
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label May 11, 2022
@kibanamachine

⚪ Backport skipped

The pull request was not backported as there were no branches to backport to. If this is a mistake, please apply the desired version labels or run the backport tool manually.

Manual backport

To create the backport manually run:

node scripts/backport --pr 131809

Questions?

Please refer to the Backport tool documentation

academo pushed a commit to academo/kibana that referenced this pull request May 12, 2022
…lastic#131809)

* Add reproducing test case

* Fix and add integration test

* Transient settings should take preference

* Rename unsupported_cluster_routing_allocation error to incompatible_cluster_routing_allocation

* Retry INIT when action fails with [incompatible_cluster_routing_allocation]

* Apply suggestions from code review

Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>

* Fix archive with trial licence and re-enable skipped test

* Integration test for incompatible cluster routing allocation

* Fix types after renaming UnsupportedClusterRoutingAllocation

* Attempt to fix open handle tests

Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>
Labels
auto-backport (Deprecated: Automatically backport this PR after it's merged)
backport:skip (This commit does not require backporting)
bug (Fixes for quality problems that affect the customer experience)
Feature:Migrations
release_note:fix
Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)
v8.3.0
5 participants