Failing test: Jest Integration Tests.src/core/server/integration_tests/saved_objects/migrations/group3 - when splitting .kibana into multiple indices and one clone fails after resolving the problem and retrying the migration completes successfully #163253
Pinging @elastic/kibana-core (Team:Core)
Caused by the same reason as #163273. Closing.
New failure: CI Build - main
Actual error from the last CI build:
New failure: CI Build - main
/skip
Started to break on PRs and main:
New failure: kibana-on-merge - main
Skipped. main: b122b46
This is a new error:
I'll take a look at what could cause these failures. UPDATE: It looks like the same error as reported in #163253 (comment)
I have identified the place where the error happens: it's the
Even though we have a retry mechanism to keep trying until the task completes, this time we're getting a
If we take a step back and look at the test, Rudolf is trying to make migrations fail on purpose, by doing:

```ts
// cause a failure when cloning .kibana_slow_clone_* indices
await client.cluster.putSettings({ persistent: { 'cluster.max_shards_per_node': 15 } });
```

after which, we expect:

```ts
await expect(runMigrationsWhichFailsWhenCloning()).rejects.toThrowError(
  /cluster_shard_limit_exceeded/
);
```

However, it seems that we might be failing before we attempt to create enough SO indices to cause the expected failure.
## Summary

Addresses #163253 (see [comment](#163253 (comment)))

TL;DR: The PR renames the task manager SO index, forcing a reindex and preventing the `deleteByQuery`.

**Long story**

* The fact that the `.kibana_task_manager` SO index has been compatible since 7.14.0 triggers a compatible migration path for that migrator.
* This includes a cleanup step for old and excluded SO types.
* This translates to a `deleteByQuery()`, which creates an async task on ES.
* ES stores this task internally in an index.

---

* The test at stake carefully limits **shards_per_node**, effectively limiting the number of indices we can create.
* All migrators run in parallel, so whilst some of them are cloning (toward the expected failure), the `.kibana_task_manager` migrator is/was running a compatible migration, which also requires an extra index.
* This creates a race condition where either the _clone_ operations (expected!) or the _deleteByQuery_ (flaky!) can fail arbitrarily, depending on which attempts to create the index first, hence the flakiness (well, this is my theory).
Fixed by #185005
A test failed on a tracked branch
First failure: CI Build - main