@maximpn
Contributor

@maximpn maximpn commented Oct 31, 2025

Partially addresses: elastic/kibana#188090

Summary

This PR integrates the Prebuilt Rules OOM testing Buildkite pipeline into the Pull Request Buildkite pipeline.

Details

The Pull Request Buildkite pipeline script has been extended in a generic way to support custom package checker scripts located under <repo-root>/.buildkite/scripts/packages/<package-name>.sh. This makes it possible to run any custom verification and testing logic specific to a package.
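
The lookup described above could be sketched roughly as follows. This is an illustration only, not the code from this PR: the function name run_package_checker and its arguments are hypothetical, and only the script path convention comes from the description.

```shell
# Hypothetical helper: runs the custom checker for a package if one exists.
# Arguments: <repo-root> <package-name>
run_package_checker() {
  local repo_root="$1"
  local package_name="$2"
  # Path convention taken from the PR description.
  local checker="${repo_root}/.buildkite/scripts/packages/${package_name}.sh"

  if [[ -f "${checker}" ]]; then
    echo "Running custom checker for ${package_name}"
    # The checker's exit status becomes the function's exit status.
    bash "${checker}"
  else
    echo "No custom checker for ${package_name}, skipping"
  fi
}
```

With such a helper, a package without a checker script is simply skipped, so adding a check for another package would only require dropping a new <package-name>.sh file into that directory.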

This PR adds the .buildkite/scripts/packages/security_detection_engine.sh script file. This script runs only for the security_detection_engine package and triggers the Prebuilt Rules Out-Of-Memory testing pipeline. The triggered pipeline performs e2e testing to reveal potential blockers caused by Kibana instance Out-Of-Memory failures when performing actions upon the package (installing the package, reviewing the prebuilt rules available in the package, installing prebuilt rules from the package, etc.).
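
As a rough sketch of what triggering another pipeline from a script can look like, the function below prints a Buildkite trigger step for one stack version. The pipeline slug prebuilt-rules-oom-testing, the STACK_VERSION variable, and the function name are assumptions made for illustration; the actual script in this PR may be structured differently.

```shell
# Prints a Buildkite trigger step for one stack version.
# Assumptions: the pipeline slug "prebuilt-rules-oom-testing" and the
# STACK_VERSION environment variable are hypothetical names.
print_oom_trigger_step() {
  local stack_version="$1"
  cat <<EOF
steps:
  - label: "Prebuilt Rules OOM testing (${stack_version})"
    trigger: "prebuilt-rules-oom-testing"
    build:
      env:
        STACK_VERSION: "${stack_version}"
EOF
}

print_oom_trigger_step "9.2.2-SNAPSHOT"
```

In a running build, output like this would typically be piped to buildkite-agent pipeline upload so that the step is added to the current build.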

Tested stack versions

For now, .buildkite/scripts/packages/security_detection_engine.sh triggers the Prebuilt Rules OOM testing Buildkite pipeline against compatible minor versions under development. The versions under development are taken from Kibana's versions.json, while compatibility is determined via the conditions.kibana.version field in the package's manifest.yml.

For example, if conditions.kibana.version has a ^9.2.0 restriction and 9.2.2 and 9.3.0 are under development, the OOM tests will run against 9.2.2-SNAPSHOT and 9.3.0-SNAPSHOT.
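
The selection in this example can be sketched as below. It is a simplified illustration, not the PR's implementation: the function names are hypothetical, the development versions are passed as arguments instead of being read from versions.json, and 8.19.8 is a made-up extra version that shows a non-matching case. The caret check assumes plain X.Y.Z versions with a major version of 1 or higher.

```shell
# Returns 0 when VERSION satisfies the caret range ^BASE
# (same major version, and VERSION >= BASE).
# Assumption: major >= 1 and plain X.Y.Z versions; 0.x caret rules and
# pre-release tags are not handled in this sketch.
satisfies_caret() {
  local version="$1" base="${2#^}"
  local v_major v_minor v_patch b_major b_minor b_patch
  IFS=. read -r v_major v_minor v_patch <<< "${version}"
  IFS=. read -r b_major b_minor b_patch <<< "${base}"

  (( v_major == b_major )) || return 1
  (( v_minor > b_minor )) && return 0
  (( v_minor == b_minor && v_patch >= b_patch ))
}

# Prints a SNAPSHOT target for every development version matching the range.
select_snapshot_targets() {
  local range="$1"
  shift
  local version
  for version in "$@"; do
    if satisfies_caret "${version}" "${range}"; then
      echo "${version}-SNAPSHOT"
    fi
  done
}

# Prints 9.2.2-SNAPSHOT and 9.3.0-SNAPSHOT, one per line.
select_snapshot_targets "^9.2.0" 8.19.8 9.2.2 9.3.0
```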

We are considering extending the testing surface to the latest released patch versions after collecting more data from CI runs.

Affected teams

@elastic/threat-research-and-detection-engineering,

FYI this PR will affect the security_detection_engine package release process. Every PR containing changes to the security_detection_engine package will trigger the Prebuilt Rules OOM testing ECH Buildkite pipeline.

Further improvements

  • Pushing commits to this repo in quick succession may leave rogue resources in the cloud. This happens due to the cancel_intermediate_builds: true configuration of the Integrations PR Buildkite build. Pushing a fresh commit cancels the currently running PR build, which in turn cancels the triggered build. As a result, the cleanup steps in the triggered build can't execute and clean up resources in the cloud.
  • We may speed up the build by using an elastic-package Docker container published to docker.elastic.co. elastic-package installation is a complex process requiring a chain of installations: GVM -> Go -> elastic-package. It takes on average 3 minutes per integration (integrations build in parallel). On top of that, the Prebuilt Rules OOM testing Buildkite pipeline has to install elastic-package as well. That adds up to 6 minutes, which could be reduced.

@maximpn maximpn self-assigned this Oct 31, 2025
@maximpn maximpn added enhancement New feature or request automation Integration:security_detection_engine Prebuilt Security Detection Rules labels Oct 31, 2025
@maximpn maximpn force-pushed the integrate-oom-testing-for-security-detection-engine branch 5 times, most recently from e55d865 to cd9fc1a Compare October 31, 2025 11:08
@andrewkroh andrewkroh removed the Integration:security_detection_engine Prebuilt Security Detection Rules label Oct 31, 2025
@maximpn maximpn force-pushed the integrate-oom-testing-for-security-detection-engine branch from cd9fc1a to 88cbc88 Compare October 31, 2025 15:20
@maximpn maximpn added Integration:security_detection_engine Prebuilt Security Detection Rules and removed Integration:security_detection_engine Prebuilt Security Detection Rules labels Oct 31, 2025
@maximpn maximpn force-pushed the integrate-oom-testing-for-security-detection-engine branch 10 times, most recently from 1af6780 to 96ff986 Compare November 1, 2025 08:36
@maximpn maximpn force-pushed the integrate-oom-testing-for-security-detection-engine branch 6 times, most recently from 9f42d25 to 1a5982e Compare November 3, 2025 16:40
@mrodm mrodm requested a review from a team November 19, 2025 14:51
@maximpn
Contributor Author

maximpn commented Nov 19, 2025

@mrodm,

Pushing commits to this repo in quick succession may leave rogue resources in the cloud. This happens due to the cancel_intermediate_builds: true configuration of the Integrations PR Buildkite build. Pushing a fresh commit cancels the currently running PR build, which in turn cancels the triggered build. As a result, the cleanup steps in the triggered build can't execute and clean up resources in the cloud.

That setting is added to the Buildkite configuration to try to avoid "wasting" resources if there is a new commit pushed to the branch. I think it is interesting to keep that setting as it is.
An option that you could follow is adding some daily Buildkite job or similar to clean up those resources that could be left behind.

Ideally it should be possible to prevent the cleanup steps in the triggered pipeline from being cancelled. I've explored the possibilities, but it may take time to figure out a proper solution. So it will be a follow-up PR when there is time to address this task.

@maximpn maximpn force-pushed the integrate-oom-testing-for-security-detection-engine branch from 44a27e0 to 687f98b Compare November 19, 2025 16:46
@maximpn maximpn requested a review from mrodm November 19, 2025 17:41
@maximpn
Contributor Author

maximpn commented Nov 19, 2025

@mrodm,

I've addressed your comments. Could you have a look?

It seems the last build failed for reasons unrelated to this PR.

@maximpn
Contributor Author

maximpn commented Nov 20, 2025

@mrodm,

About the improvements mentioned in the description:

We may speed up the build by using an elastic-package Docker container published to docker.elastic.co. elastic-package installation is a complex process requiring a chain of installations: GVM -> Go -> elastic-package. It takes on average 3 minutes per integration (integrations build in parallel). On top of that, the Prebuilt Rules OOM testing Buildkite pipeline has to install elastic-package as well. That adds up to 6 minutes, which could be reduced.

I'm not totally sure about this; it could lead to unexpected failures, since elastic-package already relies on Docker to run some commands (e.g. elastic-package test system or even elastic-package stack up). Creating a Docker image for elastic-package would involve ensuring that everything works in a Docker-in-Docker scenario. I think I would try to avoid that, WDYT @elastic/ecosystem ?

I just wanted to clarify my suggestion.

AFAIK Buildkite jobs run in a Kubernetes environment. For example, elastic/integrations pipelines already use Go and Ubuntu images.

Consequently, it opens up the possibility of building a custom image with elastic-package from a standard Go image and using it for Buildkite jobs/steps requiring elastic-package.

@mrodm
Collaborator

mrodm commented Nov 20, 2025

About the improvements mentioned in the description:

We may speed up the build by using an elastic-package Docker container published to docker.elastic.co. elastic-package installation is a complex process requiring a chain of installations: GVM -> Go -> elastic-package. It takes on average 3 minutes per integration (integrations build in parallel). On top of that, the Prebuilt Rules OOM testing Buildkite pipeline has to install elastic-package as well. That adds up to 6 minutes, which could be reduced.

I'm not totally sure about this; it could lead to unexpected failures, since elastic-package already relies on Docker to run some commands (e.g. elastic-package test system or even elastic-package stack up). Creating a Docker image for elastic-package would involve ensuring that everything works in a Docker-in-Docker scenario. I think I would try to avoid that, WDYT @elastic/ecosystem ?

I just wanted to clarify my suggestion.

AFAIK Buildkite jobs run in a Kubernetes environment. For example, elastic/integrations pipelines already use Go and Ubuntu images.

Consequently, it opens up the possibility of building a custom image with elastic-package from a standard Go image and using it for Buildkite jobs/steps requiring elastic-package.

Ah! Now I know what you meant @maximpn

Not all steps in CI run in a Kubernetes environment. There are steps that run on VMs, for instance all the steps testing the packages.

I'm not sure about moving elastic-package into the base image itself, since I think old images are deleted after some time. For backport branches it must also be possible to run old versions of elastic-package, which, IIUC, could be missing when they are needed. And I don't know if there could be other issues, for instance how easy it would be to revert an elastic-package release in integrations.

But I see that there could be other advantages of using a custom base image for those steps, having some software installed.

Collaborator


You could create those PRs manually in the backport branches.

Or, I saw that in some other PRs mergify was used to create those PRs: #13856 (comment)
I don't know whether there would be conflicts if they are created by mergify.

@maximpn maximpn force-pushed the integrate-oom-testing-for-security-detection-engine branch from 23ecd78 to 099bd2f Compare November 20, 2025 13:54
@maximpn maximpn requested a review from mrodm November 20, 2025 15:25
@maximpn
Copy link
Contributor Author

maximpn commented Nov 20, 2025

@mrodm,

I've applied your suggestion and got rid of the elif section to make the implementation cleaner. Could you have a look?

Collaborator

@mrodm mrodm left a comment


LGTM!
Just a minor comment to avoid an error in bash.

FYI @elastic/ecosystem this PR introduces a way to add custom steps or tests to each package in CI, that could be used in other packages too.

@jsoriano
Member

We may speed up the build by using an elastic-package Docker container published to docker.elastic.co. elastic-package installation is a complex process requiring a chain of installations: GVM -> Go -> elastic-package. It takes on average 3 minutes per integration (integrations build in parallel). On top of that, the Prebuilt Rules OOM testing Buildkite pipeline has to install elastic-package as well. That adds up to 6 minutes, which could be reduced.

I'm not totally sure about this; it could lead to unexpected failures, since elastic-package already relies on Docker to run some commands (e.g. elastic-package test system or even elastic-package stack up). Creating a Docker image for elastic-package would involve ensuring that everything works in a Docker-in-Docker scenario. I think I would try to avoid that, WDYT @elastic/ecosystem ?

I agree that it could be interesting to remove the build of elastic-package, but maybe the easiest path is to download the binaries from the release page.

I would keep the code to build from source too because this is useful to test elastic-package and package-spec branches, but by default I agree that it could be interesting to use the pre-built binaries.

FYI @elastic/ecosystem this PR introduces a way to add custom steps or tests to each package in CI, that could be used in other packages too.

Nice, this is great.

@maximpn maximpn force-pushed the integrate-oom-testing-for-security-detection-engine branch from 099bd2f to 5d55d01 Compare November 24, 2025 12:47
@maximpn maximpn requested a review from mrodm November 24, 2025 12:48
@mrodm
Collaborator

mrodm commented Nov 24, 2025

/test

@elasticmachine

elasticmachine commented Nov 24, 2025

Collaborator

@mrodm mrodm left a comment


Thanks @maximpn !

@maximpn maximpn merged commit 3808997 into elastic:main Nov 25, 2025
7 checks passed
@maximpn maximpn deleted the integrate-oom-testing-for-security-detection-engine branch November 25, 2025 11:04

Labels

automation enhancement New feature or request


5 participants