[Fleet] Packages with a large number of saved objects in them cause Kibana to crash #147695

Closed
Tracked by #174166
xcrzx opened this issue Dec 16, 2022 · 5 comments · Fixed by #148141
Labels
8.7 candidate bug Fixes for quality problems that affect the customer experience Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team:Fleet Team label for Observability Data Collection Fleet team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.7.0

Comments

@xcrzx
Contributor

xcrzx commented Dec 16, 2022

We have recently encountered an issue where Kibana crashes when installing a Fleet package that contains a large number of saved objects. The crash occurs during the installation process and seems to be caused by the deletion of the previous package version.

Steps to reproduce:

  1. Install a Fleet package that contains a large number of saved objects (e.g. over 10,000) using POST /api/fleet/epm/packages/<package>/<version>.
    You could follow the steps from this ticket to generate a package with a large number of saved objects and install it.
  2. Observe that Kibana crashes during the installation process.
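
For reference, the request in step 1 could be built along these lines. This is only a sketch: `buildInstallRequest` is a hypothetical helper, the `localhost:5601` base URL in the usage note is a placeholder, authentication is left out, and the `kbn-xsrf` header is included on the assumption that Kibana rejects non-GET API calls without it.

```typescript
// Hypothetical helper (not part of Kibana): describes the install request from step 1.
interface InstallRequest {
  url: string;
  init: { method: string; headers: Record<string, string> };
}

function buildInstallRequest(kibanaUrl: string, pkg: string, version: string): InstallRequest {
  // Drop trailing slashes so the path is not doubled up.
  const base = kibanaUrl.replace(/\/+$/, '');
  return {
    url: `${base}/api/fleet/epm/packages/${encodeURIComponent(pkg)}/${encodeURIComponent(version)}`,
    init: {
      method: 'POST',
      // Assumption: Kibana expects this header on non-GET API requests.
      headers: { 'kbn-xsrf': 'true' },
    },
  };
}
```

The result can be passed to `fetch(req.url, req.init)` once credentials are added, e.g. with `buildInstallRequest('http://localhost:5601', 'security_detection_engine', '8.4.1')`.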

Expected result:
The Fleet package should be installed successfully without crashing Kibana.

Actual result:
Kibana crashes during the installation process. Elasticsearch logs show dozens of warnings similar to this:

block until refresh ran out of slots and forced a refresh: [BulkShardRequest [[.kibana_8.7.0_001][0]] containing [delete {[.kibana_8.7.0][security-rule:d8fc1cca-93ed-43c1-bbb6-c0dd3eff2958:102.0.6]}] blocking until refresh]

During that time, all requests to Kibana fail with:

{"statusCode":503,"error":"Service Unavailable","message":"connect EADDRNOTAVAIL 127.0.0.1:9200 - Local (0.0.0.0:0)"}

Notes:

This issue does not occur with smaller packages containing fewer saved objects.

The issue can be temporarily resolved by manually deleting the saved objects from the previous package version before installing the new one, but this is not a permanent solution.

APM logs show hundreds of DELETE requests sent in parallel. They seem to overwhelm Elasticsearch, making it unresponsive:

Screenshot 2022-12-16 at 15 01 23

@xcrzx xcrzx added bug Fixes for quality problems that affect the customer experience Team:Fleet Team label for Observability Data Collection Fleet team labels Dec 16, 2022
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@xcrzx
Contributor Author

xcrzx commented Dec 16, 2022

Hey @joshdover, our team is experimenting with Fleet packages that could contain a significant number of saved objects of the security-rule type (potentially tens of thousands). Here are two PoCs for reference:

  1. [Security Solution] Historical rules packages PoC #145851
  2. [Security Solution] PoC of the rule upgrade and installation workflows #144060

While working on those PoCs, we've encountered some limitations on Fleet's side, presumably related to how package assets are tracked and then deleted. Specifically, all installed package assets are listed in the installed_kibana property of the package's saved object:

installed_kibana: {
  type: 'nested',
  properties: {
    id: { type: 'keyword' },
    type: { type: 'keyword' },
  },
},

installed_kibana is used during a package upgrade to delete assets from the previous version. From what I see, the listed saved objects are deleted individually, one request per object, which leads to the performance problems described in the ticket. Would it be an option for Fleet to switch to the recently introduced bulkDelete saved object method to improve package upgrade performance?
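
To make the difference concrete, here is a rough sketch of the two deletion strategies. `SavedObjectsClientLike` is a pared-down stand-in written for this example, not Kibana's actual client type, and the `statuses` response shape is an assumption about what `bulkDelete` returns; the real signatures should be checked before relying on it.

```typescript
interface AssetRef {
  id: string;
  type: string;
}

// Assumption: minimal stand-in for the two saved objects client methods discussed here.
interface SavedObjectsClientLike {
  delete(type: string, id: string): Promise<unknown>;
  bulkDelete(
    objects: AssetRef[]
  ): Promise<{ statuses: Array<{ id: string; type: string; success: boolean }> }>;
}

// Per-object approach: one request per asset, all in flight at the same time.
async function deleteAssetsInParallel(
  client: SavedObjectsClientLike,
  assets: AssetRef[]
): Promise<void> {
  await Promise.all(assets.map(({ id, type }) => client.delete(type, id)));
}

// Bulk approach: a single request covering every asset.
// Returns the assets that were reported as not deleted.
async function deleteAssetsInBulk(
  client: SavedObjectsClientLike,
  assets: AssetRef[]
): Promise<AssetRef[]> {
  if (assets.length === 0) {
    return [];
  }
  const { statuses } = await client.bulkDelete(assets);
  return statuses
    .filter((status) => !status.success)
    .map(({ id, type }) => ({ id, type }));
}
```

With N assets, the first function issues N client calls at once, while the second issues one regardless of N.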

cc @banderror

@joshdover
Contributor

> Would it be an option for Fleet to switch to the recently introduced bulkDelete saved object method to improve package upgrade performance?

Yes, we definitely should switch over now that this is available. We'd be open to accepting a PR for that if your team has the time.

@xcrzx xcrzx self-assigned this Dec 23, 2022
@banderror banderror changed the title Fleet packages with a large number of saved objects in them cause Kibana to crash [Fleet] Packages with a large number of saved objects in them cause Kibana to crash Dec 29, 2022
@banderror banderror added Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Rule Management Security Detection Rule Management Team Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules labels Dec 29, 2022
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@banderror banderror added 8.7 candidate v8.7.0 impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. labels Dec 29, 2022
xcrzx added a commit that referenced this issue Jan 3, 2023
…ects (#148141)

**Resolves:** #147695, #148174
**Related to:** #145851, #137420

## Summary

This PR improves the stability of installing Fleet packages that contain
many saved objects.

1. Changed mappings of the `installed_kibana` and `package_assets`
fields from `nested` to `object` with `enabled: false`. Values of those
fields were only ever read from `_source`; no queries or aggregations
were run against them, so the mappings were unused. Meanwhile, installing
a package with more than 10,000 saved objects failed with an error
caused by the nested field limit:

   ```
   Error installing security_detection_engine 8.4.1: The number of nested
   documents has exceeded the allowed limit of [10000].
   This limit can be set by changing the
   [index.mapping.nested_objects.limit] index level setting.
   ```
2. Improved the deletion of previous package assets by switching from
sending multiple `savedObjectsClient.delete` requests in parallel to a
single `savedObjectsClient.bulkDelete` request. Multiple parallel
requests were causing the Elasticsearch cluster to stop responding for
some time; see [this
ticket](#147695) for more info.
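
The mapping change from item 1 can be sketched as follows. The "before" shape of `installed_kibana` is the one quoted earlier in this issue; the "after" shape is reconstructed from the description above rather than copied from the diff, and the constant names are made up for illustration. The previous `package_assets` properties are not shown in this thread, so they are omitted.

```typescript
// Before: each array element is indexed as a separate nested document,
// so the field counts against index.mapping.nested_objects.limit.
const mappingBefore = {
  installed_kibana: {
    type: 'nested',
    properties: {
      id: { type: 'keyword' },
      type: { type: 'keyword' },
    },
  },
};

// After (reconstructed from the PR summary): the values are kept in _source
// but are not parsed or indexed, so no nested documents are created.
const mappingAfter = {
  installed_kibana: { type: 'object', enabled: false },
  package_assets: { type: 'object', enabled: false },
};
```

The trade-off is that fields mapped this way can no longer be queried or aggregated on, which the summary says was never needed here.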

**Before**
![Screenshot 2022-12-28 at 11 09
35](https://user-images.githubusercontent.com/1938181/209816219-ade6dd0a-0d56-4acc-929e-b88571f0fe81.png)

**After**
![Screenshot 2022-12-28 at 13 56
44](https://user-images.githubusercontent.com/1938181/209816209-16c69922-4ae2-4589-9aa4-5a28050037f4.png)