Retry INIT step when hitting incompatible_cluster_routing_allocation #131809

rudolf · 2022-05-09T13:56:41Z

Summary

When Cloud restarts an Elasticsearch cluster, it temporarily disables routing allocation. This causes migrations to fail and Kibana to bootloop until the setting is enabled again. Bootlooping is much more CPU intensive than simply retrying the failing step, especially because the container only tries 5 times before the whole container is restarted. With this change, Kibana polls the cluster settings by retrying the INIT step until they are compatible, and then continues with the migration.
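To make the retry behaviour concrete, here is a minimal sketch of a state transition that stays in INIT while the action reports `incompatible_cluster_routing_allocation`. The type names, the `AFTER_INIT` state, the retry cap, and the backoff values are all assumptions for illustration; this is not the actual migration state machine from this PR.

```typescript
// Hypothetical, simplified types. Only the error type string comes from this PR.
type InitResponse =
  | { kind: 'ok' }
  | { kind: 'error'; type: 'incompatible_cluster_routing_allocation' };

interface MigrationState {
  controlState: 'INIT' | 'AFTER_INIT' | 'FATAL';
  retryCount: number;
  retryDelayMs: number;
  reason?: string;
}

const MAX_RETRIES = 5; // assumed cap, for illustration only
const MAX_DELAY_MS = 64000; // assumed upper bound for the backoff

// Pure transition: given the current state and the INIT action's response,
// decide whether to continue, retry INIT after a delay, or give up.
function onInitResponse(state: MigrationState, res: InitResponse): MigrationState {
  if (res.kind === 'ok') {
    // Settings are compatible: reset the retry bookkeeping and move on.
    return { controlState: 'AFTER_INIT', retryCount: 0, retryDelayMs: 0 };
  }
  if (state.retryCount >= MAX_RETRIES) {
    return {
      ...state,
      controlState: 'FATAL',
      reason: `[${res.type}] still failing after ${state.retryCount} retries`,
    };
  }
  const retryCount = state.retryCount + 1;
  return {
    controlState: 'INIT', // stay in INIT so the settings are polled again
    retryCount,
    retryDelayMs: Math.min(1000 * 2 ** retryCount, MAX_DELAY_MS),
  };
}
```

Keeping the transition pure means the retry policy can be unit tested without a running cluster; the caller is expected to wait `retryDelayMs` before running the INIT action again.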

Related to: #131681

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)
Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)
If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
This renders correctly on smaller devices using a responsive layout. (You can test this in your browser)
This was checked for cross-browser compatibility

For maintainers

This was checked for breaking API changes and was labeled appropriately

TinaHeiligers · 2022-05-09T19:41:16Z

docs/setup/upgrade/resolving-migration-failures.asciidoc

@@ -171,7 +171,7 @@ Upgrade migrations fail because routing allocation is disabled or restricted (`c

 [source,sh]
 --------------------------------------------
-Unable to complete saved object migrations for the [.kibana] index: [unsupported_cluster_routing_allocation] The elasticsearch cluster has cluster routing allocation incorrectly set for migrations to continue. To proceed, please remove the cluster routing allocation settings with PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": null}, "persistent": {"cluster.routing.allocation.enable": null}}
+Unable to complete saved object migrations for the [.kibana] index: [incompatible_cluster_routing_allocation] The elasticsearch cluster has cluster routing allocation incorrectly set for migrations to continue. To proceed, please remove the cluster routing allocation settings with PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": null}, "persistent": {"cluster.routing.allocation.enable": null}}


incompatible_cluster_routing_allocation is a much better description than unsupported_cluster_routing_allocation. In hindsight, unsupported wasn't the correct term to use here and is ambiguous (leading to questions like: Is it Kibana or ES that doesn't support the setting?). Naming is hard though 😃

src/core/server/saved_objects/migrations/actions/index.ts

src/core/server/saved_objects/migrations/actions/initialize_action.ts

TinaHeiligers · 2022-05-09T19:49:33Z

src/core/server/saved_objects/migrations/actions/initialize_action.test.ts

+    const result = await task();
+    expect(Either.isRight(result)).toEqual(true);
+  });
+  it('resolves right when valid transient settings, incompatible persistent settings', async () => {


Ah, I hadn't considered a mix of settings when first adding this additional check. Great catch!
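The mixed-settings case discussed here can be sketched as a small precedence check: the transient value, when present, wins over the persistent one. The `ClusterSettings` shape and the function name are hypothetical, and treating an unset value or `all` as compatible is an assumption based on `all` being the Elasticsearch default for this setting; the PR's actual check may differ in detail.

```typescript
// Hypothetical shape of a flat cluster settings response.
interface ClusterSettings {
  transient?: Record<string, string | undefined>;
  persistent?: Record<string, string | undefined>;
}

const ROUTING_ALLOCATION_ENABLE = 'cluster.routing.allocation.enable';

// Transient settings take preference over persistent settings, so the
// persistent value is only consulted when no transient value is set.
function isRoutingAllocationCompatible(settings: ClusterSettings): boolean {
  const effective =
    settings.transient?.[ROUTING_ALLOCATION_ENABLE] ??
    settings.persistent?.[ROUTING_ALLOCATION_ENABLE];
  // Assumption: unset (defaults to 'all') or an explicit 'all' is compatible.
  return effective === undefined || effective === 'all';
}
```

With this check, valid transient settings combined with incompatible persistent settings count as compatible, which matches the test name in the diff above.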

TinaHeiligers

Your PR is very similar to a draft I started working on (on local only). One thing I did overlook though (and you have here) is a mix of transient and persistent settings.
Looking good!

Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>

rudolf · 2022-05-10T14:11:32Z

...r/saved_objects/migrations/integration_tests/incompatible_cluster_routing_allocation.test.ts

+      dataArchive: Path.join(
+        __dirname,
+        'archives',
+        '8.0.0_v1_migrations_sample_data_saved_objects.zip'


I used this archive because it contains fewer saved objects, so the migration completes a bit faster.

elasticmachine · 2022-05-10T14:47:49Z

Pinging @elastic/kibana-core (Team:Core)

TinaHeiligers

LGTM!

TinaHeiligers · 2022-05-10T16:06:05Z

src/core/server/saved_objects/migrations/actions/index.ts

@@ -151,7 +151,7 @@ export interface ActionErrorTypeMap {
  documents_transform_failed: DocumentsTransformFailed;
  request_entity_too_large_exception: RequestEntityTooLargeException;
  unknown_docs_found: UnknownDocsFound;
-  unsupported_cluster_routing_allocation: UnsupportedClusterRoutingAllocation;
+  incompatible_cluster_routing_allocation: IncompatibleClusterRoutingAllocation;


Keeping the naming consistent is great!

TinaHeiligers · 2022-05-10T16:08:27Z

src/core/server/saved_objects/migrations/integration_tests/outdated_docs.test.ts

@@ -41,7 +41,7 @@ describe('migration v2', () => {
    await new Promise((resolve) => setTimeout(resolve, 10000));
  });

-  it.skip('migrates the documents to the highest version', async () => {
+  it('migrates the documents to the highest version', async () => {


rudolf · 2022-05-11T13:40:19Z

The failing test is due to a broken doc link from Fleet. I reached out to the team.

kibana-ci · 2022-05-11T15:05:01Z

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

💚 Build #44014 succeeded 42b716f
💔 Build #43716 failed 14ad755
💔 Build #43695 failed b0da41c
💔 Build #43478 failed a1f8fea
💔 Build #43342 failed 8707d7b

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

kibanamachine · 2022-05-11T15:05:47Z

⚪ Backport skipped

The pull request was not backported as there were no branches to backport to. If this is a mistake, please apply the desired version labels or run the backport tool manually.

Manual backport

To create the backport manually run:

node scripts/backport --pr 131809

Questions ?

Please refer to the Backport tool documentation

…lastic#131809)

* Add reproducing test case
* Fix and add integration test
* Transient settings should take preference
* Rename unsupported_cluster_routing_allocation error to incompatible_cluster_routing_allocation
* Retry INIT when action fails with [incompatible_cluster_routing_allocation]
* Apply suggestions from code review

Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>

* Fix archive with trial licence and re-enable skipped test
* Integration test for incompatible cluster routing allocation
* Fix types after renaming UnsupportedClusterRoutingAllocation
* Attempt to fix open handle tests

Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>

rudolf added 5 commits May 6, 2022 13:22

Add reproducing test case

9ab2612

Fix and add integration test

f429991

Transient settings should take preference

01c2888

Rename unsupported_cluster_routing_allocation error to incompatible_cluster_routing_allocation

3f7f9f5

Retry INIT when action fails with [incompatible_cluster_routing_allocation]

8707d7b

rudolf added the Team:Core, Feature:Migrations, and bug labels May 9, 2022

TinaHeiligers reviewed May 9, 2022

src/core/server/saved_objects/migrations/actions/index.ts (outdated, resolved)

TinaHeiligers reviewed May 9, 2022

src/core/server/saved_objects/migrations/actions/initialize_action.ts (outdated, resolved)

TinaHeiligers reviewed May 9, 2022

rudolf and others added 4 commits May 9, 2022 21:56

Merge branch 'main' into retry-incompatible-cluster-settings

78bee19

Apply suggestions from code review

a1f8fea

Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>

Fix archive with trial licence and re-enable skipped test

c447c0d

Integration test for incompatible cluster routing allocation

b0da41c

rudolf commented May 10, 2022

Fix types after renaming UnsupportedClusterRoutingAllocation

14ad755

rudolf marked this pull request as ready for review May 10, 2022 14:47

rudolf requested a review from a team as a code owner May 10, 2022 14:47

rudolf added the release_note:fix and v8.3.0 labels May 10, 2022

TinaHeiligers approved these changes May 10, 2022

Attempt to fix open handle tests

42b716f

rudolf added the auto-backport label May 11, 2022

rudolf enabled auto-merge (squash) May 11, 2022 13:43

Merge branch 'main' into retry-incompatible-cluster-settings

4274edb

rudolf merged commit 575c559 into main May 11, 2022

rudolf deleted the retry-incompatible-cluster-settings branch May 11, 2022 15:05

kibanamachine added the backport:skip label May 11, 2022
