[Fleet] Throttle package upgrade triggered by setup after stack upgrade #162772

Closed
1 of 2 tasks
juliaElastic opened this issue Jul 31, 2023 · 11 comments · Fixed by #167044
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@juliaElastic
Contributor

juliaElastic commented Jul 31, 2023

Kibana version: 8.8.1 and later

After a stack upgrade, when Fleet setup is running, it can trigger many package upgrades (e.g. bundled packages, system, elastic_agent, apm)

These packages sometimes contain changes where the data stream mappings can't be updated in place and require a rollover. Enabling TSDB or synthetic source also triggers a rollover.
If many rollovers are triggered at the same time, they can cause timeouts and trigger a rollback of the packages, which exacerbates the problem.

To fix the issue, find a way in the product to avoid this reaction chain:

  • Add throttling during package installation to leave enough time for ES to finish the rollovers.
    • The chain is: Integration upgrades -> generates rollovers -> ES is overwhelmed -> slow to rollover and apply changes -> timeout in upgrade process -> upgrade process falls back -> rollovers again to rollback to old integration -> ES is overwhelmed -> ...
  • Ensure that the settings are also applied to indices with custom index templates that reference package policies of an integration (e.g. rightfully created to customize the ILM policy or a namespace); otherwise the deletion of old assets will fail.
@juliaElastic juliaElastic added bug Fixes for quality problems that affect the customer experience Team:Fleet Team label for Observability Data Collection Fleet team labels Jul 31, 2023
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@juliaElastic
Contributor Author

juliaElastic commented Sep 12, 2023

We are seeing many problems in 8.8 and 8.9, especially with the apm package, where many data streams need to be rolled over.
It is also unfortunate that if the package install_status is stuck in installing or install_failed, it is hard to determine which steps succeeded and which are still missing or failing.
It would be great to know where the installation got stuck (like a state machine) and try to make the reinstall continue from a partial installation.
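
A rough sketch of what such a resumable install could look like, assuming the last completed step is persisted after each step. The step names, `InstallProgress`, `remainingSteps` and `resumeInstall` are all hypothetical and only illustrate the idea; they are not existing Fleet code.

```typescript
// Hypothetical step names; a real install pipeline may be split differently.
const INSTALL_STEPS = [
  'save_package_info',
  'install_index_templates',
  'update_data_streams',
  'install_kibana_assets',
  'finalize',
] as const;

type InstallStep = (typeof INSTALL_STEPS)[number];

interface InstallProgress {
  // Last step that completed successfully, or undefined if none has run yet.
  lastCompletedStep?: InstallStep;
}

// Return the steps that still need to run, given what was persisted earlier.
function remainingSteps(progress: InstallProgress): InstallStep[] {
  if (progress.lastCompletedStep === undefined) {
    return [...INSTALL_STEPS];
  }
  const doneIndex = INSTALL_STEPS.indexOf(progress.lastCompletedStep);
  return INSTALL_STEPS.slice(doneIndex + 1);
}

// Run the remaining steps in order and persist progress after each one, so a
// later reinstall can continue from where this attempt stopped.
async function resumeInstall(
  progress: InstallProgress,
  runStep: (step: InstallStep) => Promise<void>,
  saveProgress: (progress: InstallProgress) => Promise<void>
): Promise<InstallProgress> {
  let current = progress;
  for (const step of remainingSteps(current)) {
    await runStep(step); // rejects on failure; saved progress stays at the previous step
    current = { lastCompletedStep: step };
    await saveProgress(current);
  }
  return current;
}
```

With this shape, the persisted `lastCompletedStep` also tells us where an installation got stuck.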

@joshdover
Member

It would be great to know where the installation got stuck (like a state machine) and try to make the reinstall continue from a partial installation.

Great idea. Let's make sure if we implement this we use refresh: false on the SO updates to the epm-package object to avoid making package installation/upgrade slower.

@juliaElastic
Contributor Author

One more idea: it would be nice to add a flag to be able to completely disable data stream rollovers, in case the upgrade didn't succeed due to rollovers even with the throttling. At least we would let the upgrade finish and could do manual rollover.

@kpollich
Member

Another thing we should probably do on the Fleet side is add some kind of concurrency control to this Promise.all block where data stream rollovers happen:

const updatedataStreamPromises = indexNameWithTemplates.map((templateEntry) => {
  return updateExistingDataStream({
    esClient,
    logger,
    dataStreamName: templateEntry.dataStreamName,
  });
});
await Promise.all(updatedataStreamPromises);

We should probably pMap this and limit the concurrency so we're not spawning a Promise for every single data stream rollover at the same time. This would probably be a fairly quick win.
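
To illustrate the idea, here is a minimal, dependency-free sketch of a concurrency-limited map. `mapWithConcurrency` is a hypothetical stand-in for p-map's `concurrency` option, and `fakeRollover` simulates a rollover call; neither is the actual Fleet code.

```typescript
// Hypothetical stand-in for p-map: run `mapper` over `items`, with at most
// `concurrency` calls in flight at any time. Results keep the input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  mapper: (item: T, index: number) => Promise<R>,
  concurrency: number
): Promise<R[]> {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError('concurrency must be a positive integer');
  }
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker claims the next unprocessed index until the list is exhausted.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await mapper(items[index], index);
    }
  };

  const workerCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Simulated rollover that records how many calls are in flight at once.
let inFlight = 0;
let maxInFlight = 0;
async function fakeRollover(dataStreamName: string): Promise<string> {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  await Promise.resolve(); // yield, as a real request would
  inFlight--;
  return `rolled over ${dataStreamName}`;
}
```

Calling `mapWithConcurrency(dataStreamNames, fakeRollover, 2)` never has more than two simulated rollovers running at once, whereas the `Promise.all` version starts all of them immediately.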

@juliaElastic
Contributor Author

juliaElastic commented Sep 20, 2023

Raised an issue to improve the UI: #166857

My plan is to first create many data streams locally to be able to reproduce the issue, and then start testing the throttling improvements.

I found a way to reproduce one error scenario with apm package, which is actually not related to the 1m timeout but to the simulate template bug:

To reproduce, check how many shards and nodes you have (I had 1 node and 160 shards), and set the shard limit to a lower number.
Make sure that all or many of the data streams are created. Use synthtrace from Kibana for this, or create the data streams manually, e.g.

node scripts/synthtrace distributed_trace_long.ts

After this, reinstalling the apm package fails with a max shard limit error. The failure is actually caused by the wrong index template being used for simulate template, which triggers an unnecessary rollover and so hits the shard limit.

// check node count
GET _cat/nodes?v=true

// check shard count
GET _cluster/stats?filter_path=indices.shards.total

// set shard limit
PUT _cluster/settings
{
  "persistent" : {
    "cluster.max_shards_per_node": 100
  }
}

// try reinstall epm with latest version
POST kbn:/api/fleet/epm/packages/apm/8.11.0-preview-1695202343
{
  "force": true
}

// unnecessary rollover - reaching shard count limit
[2023-09-21T15:13:33.090+02:00][DEBUG][plugins.fleet] Updating settings for metrics-apm.service_transaction.10m-1
[2023-09-21T15:13:33.091+02:00][INFO ][plugins.fleet] Mappings update for metrics-apm.app.default-default failed due to ResponseError: illegal_argument_exception
        Root causes:
                illegal_argument_exception: unable to simulate template [metrics-apm.app.default] that does not exist
[2023-09-21T15:13:33.092+02:00][INFO ][plugins.fleet] Triggering a rollover for metrics-apm.app.default-default      

[2023-09-21T15:13:33.153+02:00][WARN ][plugins.fleet] Failure to install package [apm]: [ResponseError: validation_exception
        Root causes:
                validation_exception: Validation Failed: 1: this action would add [2] shards, but this cluster currently has [283]/[100] maximum normal shards open;]

Raised a PR to fix the index template bug. With this fix, the reinstall succeeds and doesn't try to roll over or create new shards.
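
For reference, the arithmetic behind the validation error in the log above can be sketched like this. It assumes the cluster-wide limit is `cluster.max_shards_per_node` multiplied by the number of data nodes; `ShardBudget` and `wouldExceedShardLimit` are hypothetical names used only for illustration.

```typescript
interface ShardBudget {
  maxShardsPerNode: number; // value of cluster.max_shards_per_node
  dataNodes: number; // number of data nodes counted toward the limit (assumed)
  openShards: number; // shards currently open in the cluster
}

// True if opening `newShards` more shards would exceed the cluster-wide limit.
function wouldExceedShardLimit(budget: ShardBudget, newShards: number): boolean {
  const clusterLimit = budget.maxShardsPerNode * budget.dataNodes;
  return budget.openShards + newShards > clusterLimit;
}

// Values from the log above: 1 node, limit 100, 283 open shards, rollover adds 2.
const blocked = wouldExceedShardLimit(
  { maxShardsPerNode: 100, dataNodes: 1, openShards: 283 },
  2
);
```

Here `blocked` is `true`, so any rollover is rejected, including the unnecessary one triggered by the simulate template error.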

juliaElastic added a commit that referenced this issue Sep 22, 2023
## Summary

Resolve #164269

Some context why I picked this up now:
#162772 (comment)

To verify:
- Make sure 8.8+ apm package is installed
- Create data stream `PUT _data_stream/metrics-apm.app.default-default`
- Reinstall apm package from API or UI
- Check kibana info logs, expect to not see simulate template error and
rollover like below

```
[2023-09-21T15:54:36.559+02:00][INFO ][plugins.fleet] Mappings update for metrics-apm.app.default-default failed due to ResponseError: illegal_argument_exception
        Root causes:
                illegal_argument_exception: unable to simulate template [metrics-apm.app.default] that does not exist
[2023-09-21T15:54:36.559+02:00][INFO ][plugins.fleet] Triggering a rollover for metrics-apm.app.default-default
```


### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Sep 22, 2023
(cherry picked from commit 3ff82f2)
@juliaElastic
Contributor Author

juliaElastic commented Sep 25, 2023

The current setup behaviour is this:

  • install preconfigured packages: fleet_server, apm, system and synthetics are preconfigured in cloud
  • the packages are installed sequentially, not in parallel
  • with the fix of simulateTemplate and the concurrency control in Kyle's pr I think we will eliminate most issues that we saw recently
  • raised a PR to only roll over on expected errors from ES, so we don't keep rolling over if the error is unexpected

@kpollich Do you think there is still value in introducing a flag to skip rollovers during package upgrade and/or a flag to skip installing preconfigured packages during setup, to be able to stop the process if it keeps retrying and manual intervention is needed?
I would lean to seeing how the current enhancements work and add flags later if needed.

Update: added 2 flags here to be able to skip rollovers and ignore unexpected mapping errors.
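
A small sketch of the decision logic described in this comment: roll over only on expected errors, with two flags to skip rollovers and to ignore unexpected mapping errors. The flag names, the `MappingUpdateAction` type and the contents of the expected-error set are assumptions made for illustration, not the names used in the PR.

```typescript
type MappingUpdateAction = 'rollover' | 'skip' | 'fail';

interface RolloverFlags {
  skipRollover: boolean; // hypothetical flag name
  ignoreUnexpectedMappingErrors: boolean; // hypothetical flag name
}

// Decide what to do after a mappings update failed with the given ES error type.
// `expectedErrorTypes` is supplied by the caller; its contents are an assumption.
function decideMappingUpdateAction(
  errorType: string,
  flags: RolloverFlags,
  expectedErrorTypes: ReadonlySet<string>
): MappingUpdateAction {
  if (expectedErrorTypes.has(errorType)) {
    // Expected error: a rollover would normally fix it, unless rollovers are disabled.
    return flags.skipRollover ? 'skip' : 'rollover';
  }
  // Unexpected error: never roll over; either ignore it or fail the install.
  return flags.ignoreUnexpectedMappingErrors ? 'skip' : 'fail';
}
```

The point of the split is that an unexpected error never leads to a rollover, so a persistent failure cannot keep generating new backing indices.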

@jlind23
Contributor

jlind23 commented Sep 25, 2023

@juliaElastic I agree with you, let's close this issue for now and reopen it if this is still a problem and we need the feature flag to be introduced.

@juliaElastic
Contributor Author

juliaElastic commented Sep 25, 2023

On the second point of the issue:

Ensure that the settings are also applied to indices with custom index templates that reference package policies of an integration (e.g. rightfully created to customize the ILM policy or a namespace); otherwise the deletion of old assets will fail.

Tested this by cloning the index template of one of the apm data streams, metrics-apm.app.synth-android-default, and doing a force upgrade of the package.
The fix for simulateTemplate makes sure that the actual index template is updated (even if cloned). In this case I saw in the logs:

[2023-09-25T15:33:24.978+02:00][INFO ][plugins.fleet] Attempt to update the mappings for the metrics-apm.app.synth-android-default (write_index_only)
[2023-09-25T15:33:25.001+02:00][DEBUG][plugins.fleet] Updating settings for metrics-apm.app.synth-android-default

@kpollich
Member

Related: #167160. We should embrace a "throttled by default" approach to most (if not all) of Fleet's operations against Elasticsearch to avoid scaling issues.

kpollich added a commit that referenced this issue Sep 25, 2023
…l to rollovers (#166775)

Fixes #166761
Ref #162772

## Summary

- Increase overall timeout for waiting to retry "stuck" installations
from 1 minute to 30 minutes
- Add `pMap` concurrency control limiting concurrent `putMapping` +
`rollover` requests to mitigate ES load

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Sep 25, 2023
(cherry picked from commit c20d177)
@kpollich
Member

Do you think there is still value in introducing a flag to skip rollovers during package upgrade and/or a flag to skip installing preconfigured packages during setup, to be able to stop the process if it keeps retrying and manual intervention is needed?
I would lean to seeing how the current enhancements work and add flags later if needed.

I agree with you here - adding the feature flag might be unnecessary if our concurrency/throttling changes have impact. Let's not do that for now.

kibanamachine added a commit that referenced this issue Sep 26, 2023
…7008)

# Backport

This will backport the following commits from `main` to `8.10`:
- [[Fleet] fix index template from datastream name
(#166941)](#166941)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
kibanamachine added a commit that referenced this issue Sep 26, 2023
… control to rollovers (#166775) (#167184)

# Backport

This will backport the following commits from `main` to `8.10`:
- [[Fleet] Increase package install max timeout + add concurrency
control to rollovers
(#166775)](#166775)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


---------

Co-authored-by: Kyle Pollich <kyle.pollich@elastic.co>
Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>