[7.14] SO migration takes too long on pre-7.13 upgrades with heavy alerting usage #106308
Comments
Pinging @elastic/kibana-core (Team:Core)
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Some queries to help clean this up 1. Get the
@mikecote what's the purpose of the terms filter in step 1? Is this all the taskTypes? Or is it excluding some?
@LeeDr It's a limited set of taskTypes: basically all the types that relate to running alert actions. There are many more around running rules (alerts), telemetry, login sessions, etc. that I didn't want in the query, so I opted in only the ones I wanted.
@crisdarocha if possible, could you try to run a query against the .kibana_task_manager index (in your pre-upgrade 7.11.1 cluster) to get the count of docs that match the terms filter on "task.taskType" like in step 1 above? I just want to confirm it covers a majority of the docs.
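For anyone following along, a count of that shape might look something like the sketch below (TypeScript with the `@elastic/elasticsearch` 7.x client). The specific `actions:*` task types listed are illustrative assumptions; the full list from the earlier comment isn't reproduced in this thread.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function countActionTasks(): Promise<void> {
  const { body } = await client.count({
    index: '.kibana_task_manager',
    body: {
      query: {
        terms: {
          // Illustrative task types only; the real query opts in the
          // action-execution task types mentioned above.
          'task.taskType': ['actions:.email', 'actions:.slack', 'actions:.webhook'],
        },
      },
    },
  });
  console.log(`docs matching the taskType terms filter: ${body.count}`);
}

countActionTasks().catch(console.error);
```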
EDIT: thought of a better way, see #106308 (comment)

After thinking through this a bit, I think it's best if we implement this outside the migration algorithm, as a cleanup procedure that runs before the migration starts. The reason is that the algorithm's current architecture assumes it's only interacting with one source index; supporting these queries, which need to span two different indices, would require many changes to coordinate the migrations between those indices.

The caveat here is that we'll be touching the source index of the prior Kibana version, even in cases where the migration ultimately fails. This makes the upgrade process a bit less 'hermetic', but I think the risks are minimal considering that we already have logic in place doing similar cleanup work in 7.13.

We'll also have to run these operations before we can set the write blocks on the indices, which means that older Kibana nodes could potentially still be writing documents while these delete operations are in progress. Though this behavior isn't officially supported, users will run into this situation. @mikecote, are there any risks to running these delete queries while older Kibana nodes may still be writing?
I don't foresee any risks, as any new document should have a timestamp of ~now, meaning the query will leave those new docs untouched.
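A minimal sketch of the kind of timestamp-bounded delete being discussed, assuming the stale docs are identified by `task.status: failed` and an older `task.runAt` (both field choices are assumptions for illustration, not the exact query from this thread). Because the cutoff is captured before the cleanup runs, anything an old Kibana node writes afterwards has a timestamp of ~now and falls outside the range:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function deleteStaleActionTasks(cutoff: string): Promise<void> {
  const { body } = await client.deleteByQuery({
    index: '.kibana_task_manager',
    conflicts: 'proceed', // don't abort on version conflicts from concurrent writers
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'failed' } }, // assumed marker for stale tasks
            { range: { 'task.runAt': { lt: cutoff } } }, // assumed timestamp field
          ],
        },
      },
    },
  });
  console.log(`deleted ${body.deleted} stale task docs`);
}

// Capture the cutoff up front; documents written after this point won't match.
deleteStaleActionTasks(new Date().toISOString()).catch(console.error);
```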
As I've started to work through this, I've found a better solution which should allow us to continue to support 'hermetic' upgrades:
I believe this is much safer as it won't change any of our basic assumptions about how the SO algo works and will greatly simplify the implementation. My goal is to have a working draft of this up sometime tomorrow.
@joshdover brilliant! ❤️
@mikecote if you or someone on your team has time, it'd be helpful to prepare an Elasticsearch data archive that has something like 1k documents in this state. That way I can add an integration test pretty easily that verifies that these old docs are cleaned up. By "data archive" I essentially mean a .zip of the Elasticsearch data directory.

If creating the archive isn't going well (sorry, we don't have docs on this), even just a script to create this data (using Kibana APIs or ES APIs) would be helpful.
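A rough sketch of what such a seeding script could look like against the ES bulk API, assuming a heavily simplified task document shape (real task saved objects carry more fields such as params, state, and migration metadata) and an assumed `actions:.email` task type:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function seedTasks(failedCount = 1000, queuedCount = 500): Promise<void> {
  const ops: Array<Record<string, unknown>> = [];

  const pushTask = (i: number, status: 'failed' | 'idle') => {
    ops.push({ index: { _index: '.kibana_task_manager', _id: `task:seed-${status}-${i}` } });
    ops.push({
      type: 'task',
      task: {
        taskType: 'actions:.email', // assumed action-execution task type
        status,
        runAt: '2021-01-01T00:00:00.000Z', // old timestamp so the cleanup query matches
        attempts: status === 'failed' ? 3 : 0,
      },
    });
  };

  for (let i = 0; i < failedCount; i++) pushTask(i, 'failed');
  for (let i = 0; i < queuedCount; i++) pushTask(i, 'idle');

  const { body } = await client.bulk({ refresh: 'wait_for', body: ops });
  console.log(`indexed ${body.items.length} docs, errors: ${body.errors}`);
}

seedTasks().catch(console.error);
```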
@joshdover thanks, I'll look into it and provide an archive.
Thanks a ton 😄
@joshdover Here's an archive containing ~1500 tasks, of which ~1000 have failed (stale) and ~500 are still in the queue. There are also ~1500 action_task_params docs. The test/queries should reduce the tasks to ~500 and the action_task_params to ~500.
Quickly replying to @LeeDr's question. Most documents in my I tried three options:
Solution sounds quite elegant, I hope it works! Thanks @LeeDr @joshdover and @mikecote for taking good care of that!
I've got this working end-to-end with an integration test passing over in #106534 and I believe the risk here is fairly minimal. No fundamental changes to the migration algo were needed to implement this. 🎉 I'll have this cleaned up and the test coverage completed for review sometime on Monday, but any early eyes on the implementation now would be helpful.
Kibana version: 7.14.0
Elasticsearch version: 7.14.0
Server OS version: all, but mostly impacts Cloud
Browser version: n/a
Browser OS version: n/a
Original install method (e.g. download page, yum, from source, etc.): n/a
Describe the bug: A bug in 7.11 caused failed tasks not to clean up after themselves, leaving unneeded docs in both the .kibana and .kibana_task_manager indices. These could build up over time to hundreds of thousands of docs.
A task was added in 7.13 that would clean these up over time (the throttled task could take a month to clean up everything).
But 7.14 introduces saved object sharing, which requires every saved object to be modified. In prior releases, only saved objects that had migrations defined for that release were modified. So a 7.11 -> 7.13 migration works, but a 7.11 -> 7.14 migration can fail due to timeouts while waiting for Kibana to become available.
If a user tries to upgrade from 7.11 or 7.12 to 7.14 (skipping the 7.13 release that has the cleanup task for those objects), the saved object migration will take longer than Cloud expects for Kibana to become available; Cloud will then restart the Kibana instance several times and report the upgrade as failed. From one case, it appears that the migration will actually complete if left to run.
For on-prem installations, I believe the migration will complete but may take 15 - 20 minutes or longer depending on how many saved objects there are.
I don't know the impact on ECE, ECK, or other installation methods.
Steps to reproduce:
Expected behavior: Kibana should complete saved object migration in a "reasonable time" to not cause timeouts on Elastic Cloud or other installation methods.
Screenshots (if relevant):
Errors in browser console (if relevant):
Provide logs and/or server output (if relevant):
Any additional context: