[Fleet] Throttle package upgrade triggered by setup after stack upgrade #162772
Comments
Pinging @elastic/fleet (Team:Fleet) |
We are seeing many problems in 8.8 and 8.9, especially with the apm package, where many data streams need to be rolled over at the same time.
Great idea. Let's make sure if we implement this we use |
One more idea: it would be nice to add a flag to completely disable data stream rollovers, in case the upgrade still fails because of rollovers even with throttling. That way we could at least let the upgrade finish and do a manual rollover afterwards.
Another thing we should probably do on the Fleet side is add some kind of concurrency control to this kibana/x-pack/plugins/fleet/server/services/epm/elasticsearch/template/template.ts Lines 726 to 733 in a3a2f40
We should probably |
Raised an issue to improve the UI: #166857

My plan is to first create many data streams locally to be able to reproduce the issue, and then start testing the throttling improvements.

I found a way to reproduce one error scenario with the apm package, which is actually not related to the 1m timeout but to the simulate template bug. To reproduce, check how many shards and nodes you have (I had 1 node and 160 shards), and set the shard limit to a lower number.

After this, reinstalling the apm package fails with a max shard limit error. The failure is actually caused by the wrong index template being used for the simulate template call, which triggers an unnecessary rollover and so hits the shard limit.
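The shard limit failure in this reproduction comes down to simple arithmetic. The sketch below assumes Elasticsearch rejects an operation when the currently open shards plus the shards it would create exceed `cluster.max_shards_per_node` multiplied by the number of data nodes. The interface, the function name, and the "2 new shards per rollover" figure are hypothetical and only for illustration.

```typescript
// Hypothetical helper: names are illustrative, not part of Fleet or Elasticsearch.
interface ClusterShardState {
  dataNodes: number;
  openShards: number;
  maxShardsPerNode: number; // value of cluster.max_shards_per_node
}

// Returns true when creating `newShards` more shards would exceed the cluster-wide limit.
function wouldExceedShardLimit(state: ClusterShardState, newShards: number): boolean {
  const clusterLimit = state.maxShardsPerNode * state.dataNodes;
  return state.openShards + newShards > clusterLimit;
}

// Scenario from the comment: 1 node, 160 open shards, limit lowered to 160 (assumed value).
const state: ClusterShardState = { dataNodes: 1, openShards: 160, maxShardsPerNode: 160 };

// Assumption: a rollover creates one new backing index with 1 primary + 1 replica.
console.log(wouldExceedShardLimit(state, 2)); // true: the unnecessary rollover is rejected
console.log(wouldExceedShardLimit(state, 0)); // false: no rollover, no new shards, no error
```

This is why removing the unnecessary rollover also removes the shard limit error: a reinstall that creates no new backing index adds no shards.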
Raised a PR to fix the index template bug. With this fix, the reinstall succeeds and doesn't try to roll over or create new shards.
## Summary

Resolve #164269

Some context why I picked this up now: #162772 (comment)

To verify:
- Make sure 8.8+ apm package is installed
- Create data stream `PUT _data_stream/metrics-apm.app.default-default`
- Reinstall apm package from API or UI
- Check kibana info logs, expect to not see simulate template error and rollover like below

```
[2023-09-21T15:54:36.559+02:00][INFO ][plugins.fleet] Mappings update for metrics-apm.app.default-default failed due to ResponseError: illegal_argument_exception
	Root causes:
		illegal_argument_exception: unable to simulate template [metrics-apm.app.default] that does not exist
[2023-09-21T15:54:36.559+02:00][INFO ][plugins.fleet] Triggering a rollover for metrics-apm.app.default-default
```

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
(cherry picked from commit 3ff82f2)
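The log above shows why deriving the index template name from the data stream name is fragile: stripping the namespace from `metrics-apm.app.default-default` yields `metrics-apm.app.default`, which does not exist. Here is a minimal sketch of the difference, assuming the data stream metadata carries the name of its matching index template (the Elasticsearch get data stream API returns a `template` field). The function names and the `metrics-apm.app` template name are illustrative assumptions, not the actual Fleet code.

```typescript
// Subset of the data stream metadata used in this sketch.
interface DataStreamInfo {
  name: string;
  template: string; // index template that matched when the data stream was created
}

// Naive approach: strip the trailing "-<namespace>" and hope the remainder is a template name.
function templateNameFromDataStreamName(dataStreamName: string): string {
  const lastDash = dataStreamName.lastIndexOf("-");
  return lastDash === -1 ? dataStreamName : dataStreamName.slice(0, lastDash);
}

// Safer approach: use the template recorded on the data stream itself.
function templateNameFromDataStream(dataStream: DataStreamInfo): string {
  return dataStream.template;
}

// Assumed example: the template name "metrics-apm.app" is a guess for illustration.
const ds: DataStreamInfo = {
  name: "metrics-apm.app.default-default",
  template: "metrics-apm.app",
};

console.log(templateNameFromDataStreamName(ds.name)); // metrics-apm.app.default
console.log(templateNameFromDataStream(ds)); // metrics-apm.app
```

Simulating a template that does not exist fails, and the failure is then treated as a mapping conflict that needs a rollover, which matches the two log lines above.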
The current setup behaviour is this:
@kpollich Do you think there is still value in introducing a flag to skip rollovers during package upgrade and/or a flag to skip installing preconfigured packages during setup, to be able to stop the process if it keeps retrying and manual intervention is needed? Update: added 2 flags here to be able to skip rollovers and ignore unexpected mapping errors. |
@juliaElastic I agree with you, let's close this issue for now and reopen it if this is still a problem and we need the feature flag to be introduced.
On the second point of the issue:
Tested this by cloning the index template of one of the apm data streams
Related: #167160. We should embrace a "throttled by default" approach to most (if not all) of Fleet's operations against Elasticsearch to avoid scaling issues. |
…l to rollovers (#166775)

Fixes #166761
Ref #162772

## Summary

- Increase overall timeout for waiting to retry "stuck" installations from 1 minute to 30 minutes
- Add `pMap` concurrency control limiting concurrent `putMapping` + `rollover` requests to mitigate ES load

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
(cherry picked from commit c20d177)
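The concurrency control described above can be sketched without any dependencies. The PR uses `pMap`; the hand-rolled `mapWithConcurrency` below is only a stand-in for illustration. `updateDataStream`, the data stream names, and the limit of 2 are assumptions, not values taken from the Fleet code.

```typescript
// Runs `mapper` over `items` with at most `concurrency` calls in flight, preserving order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  mapper: (item: T) => Promise<R>,
  concurrency: number
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each worker pulls the next unclaimed index until the list is exhausted.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await mapper(items[index]);
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

let inFlight = 0;
let peakInFlight = 0;

// Hypothetical stand-in for one putMapping + rollover request pair against Elasticsearch.
async function updateDataStream(name: string): Promise<string> {
  inFlight++;
  peakInFlight = Math.max(peakInFlight, inFlight);
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  inFlight--;
  return `updated ${name}`;
}

// Illustrative data stream names.
const dataStreams = [
  "metrics-a-default",
  "metrics-b-default",
  "logs-a-default",
  "logs-b-default",
  "traces-a-default",
];

const done = mapWithConcurrency(dataStreams, updateDataStream, 2).then((results) => {
  console.log(results.length, "data streams updated; peak concurrent requests:", peakInFlight);
  return results;
});
```

With an unbounded `Promise.all`, all five requests would start at once. The worker pool keeps the number of simultaneous requests at or below the limit, which is the property that matters when hundreds of data streams need a rollover after an upgrade.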
I agree with you here - adding the feature flag might be unnecessary if our concurrency/throttling changes have impact. Let's not do that for now. |
…7008)

# Backport

This will backport the following commits from `main` to `8.10`:
- [[Fleet] fix index template from datastream name (#166941)](#166941)

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
… control to rollovers (#166775) (#167184)

# Backport

This will backport the following commits from `main` to `8.10`:
- [[Fleet] Increase package install max timeout + add concurrency control to rollovers (#166775)](#166775)

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

---------

Co-authored-by: Kyle Pollich <kyle.pollich@elastic.co>
Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
Kibana version: 8.8.1 and later
After a stack upgrade, when Fleet setup is running, it can trigger many package upgrades (e.g. bundled packages, system, elastic_agent, apm).
These packages sometimes contain changes where the data stream mappings can't be updated in place, so a rollover is required. Enabling TSDB or synthetic source also triggers a rollover.
If many rollovers are triggered at the same time, they can cause timeouts and trigger a rollback of the packages, which exacerbates the problem.
To fix the issue, find a way in the product to avoid this reaction chain:
Integration upgrades -> generates rollovers -> ES is overwhelmed -> slow to rollover and apply changes -> timeout in upgrade process -> upgrade process falls back -> rollovers again to rollback to old integration -> ES is overwhelmed -> ...