[ML] Split ES integration tests into parallel steps #2990

Merged
edsavage merged 2 commits into elastic:main from edsavage:feature/split-es-integration-tests on Mar 17, 2026

@edsavage
Contributor

Summary

  • Split each per-architecture Elasticsearch integration test Buildkite step into two independent parallel steps: Multi-Node Tests (javaRestTest) and YAML REST Tests (yamlRestTest).
  • Based on build 2309 timings, this reduces the critical path by ~11 min on x86_64 (40 min → 29 min) and ~14 min on aarch64 (53 min → 39 min).
  • A new ES_TEST_SUITE environment variable in dev-tools/run_es_tests.sh controls which suite to run. When unset, both run sequentially (backward compatible for local use).
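
For illustration, the ES_TEST_SUITE dispatch described in the last bullet might look roughly like the sketch below. This is not the actual contents of dev-tools/run_es_tests.sh: the accepted values (assumed here to match the suite names javaRestTest and yamlRestTest), the function names, and the bare ./gradlew command lines are all assumptions, and the commands are echoed rather than executed.

```shell
# Sketch only: values, function names and Gradle invocations are hypothetical.

# Print the suite(s) selected by ES_TEST_SUITE, one per line.
select_es_test_tasks() {
    case "${ES_TEST_SUITE:-}" in
        javaRestTest) echo "javaRestTest" ;;
        yamlRestTest) echo "yamlRestTest" ;;
        "")           printf '%s\n' "javaRestTest" "yamlRestTest" ;;  # unset: both suites
        *)
            echo "Unknown ES_TEST_SUITE: ${ES_TEST_SUITE}" >&2
            return 1
            ;;
    esac
}

# Walk the selected suites in order. With ES_TEST_SUITE unset this covers
# both suites sequentially, matching the backward-compatible behaviour.
run_selected_tasks() {
    local tasks task
    tasks="$(select_es_test_tasks)" || return 1
    for task in $tasks; do
        # Placeholder: a real script would invoke Gradle here instead of echoing.
        echo "+ ./gradlew ${task}"
    done
}
```

Under these assumptions, a CI step would export ES_TEST_SUITE=yamlRestTest before calling run_selected_tasks and get a single Gradle invocation, while a local run with the variable unset would get two.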

Timing breakdown (from build 2309)

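| Architecture | javaRestTest | yamlRestTest | Old total | New wall-clock |
|--------------|--------------|--------------|-----------|----------------|
| x86_64       | 27 min       | 12 min       | 40 min    | ~29 min        |
| aarch64      | 37 min       | 15 min       | 53 min    | ~39 min        |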
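
The savings quoted in the summary follow directly from these figures. The short sketch below redoes the arithmetic; the function names are made up for illustration. Note that the reported new wall-clock times sit about 2 minutes above the slower suite on each architecture, and the PR does not say what accounts for that gap.

```shell
# Minutes removed from the critical path: old total minus new wall-clock.
saved_minutes() {
    echo $(( $1 - $2 ))
}

# With two suites running in parallel, the slower one bounds the wall-clock time.
slower_suite_minutes() {
    if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi
}

echo "x86_64:  slower suite $(slower_suite_minutes 27 12) min, saved $(saved_minutes 40 29) min"
echo "aarch64: slower suite $(slower_suite_minutes 37 15) min, saved $(saved_minutes 53 39) min"
# prints:
# x86_64:  slower suite 27 min, saved 11 min
# aarch64: slower suite 37 min, saved 14 min
```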

Additional benefits

  • Immediate failure attribution — visible in Buildkite which suite failed
  • Selective retry of just the failed suite
  • Independent timeout budgets per suite
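
As a rough picture of how separate steps give these properties, the sketch below prints Buildkite step definitions for one architecture. It is not the pipeline code from this PR: the labels, step keys, build-step key (build_<arch>), timeout values, and the bare command string are all hypothetical. Only the general Buildkite step attributes (label, key, command, depends_on, timeout_in_minutes, env) are standard.

```shell
# Sketch only: emits two hypothetical Buildkite steps per architecture.
emit_es_test_steps() {
    arch="$1"
    for suite in javaRestTest yamlRestTest; do
        # Hypothetical per-suite timeout budgets, in minutes.
        case "$suite" in
            javaRestTest) timeout=90 ;;
            *)            timeout=45 ;;
        esac
        # Each step depends only on the (assumed) build step for its
        # architecture, so the two suites can start at the same time.
        cat <<EOF
  - label: "ES ${suite} (${arch})"
    key: "es_${suite}_${arch}"
    command: "dev-tools/run_es_tests.sh"
    depends_on: "build_${arch}"
    timeout_in_minutes: ${timeout}
    env:
      ES_TEST_SUITE: "${suite}"
EOF
    done
}

echo "steps:"
emit_es_test_steps x86_64
emit_es_test_steps aarch64
```

Because each suite is its own step, a failure shows up against a single label, that step can be retried alone, and its timeout can be tuned without affecting the other suite.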

Test plan

  • PR build shows 4 integration test steps (2 per arch) instead of 2
  • All 4 steps run in parallel (each depends only on its build step, not on another test step)
  • Each step runs only its designated suite (check logs for single ./gradlew invocation)
  • Running dev-tools/run_es_tests.sh locally without ES_TEST_SUITE still runs both suites
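
The log check in the third item could be scripted along the lines below. This is a sketch under the assumption that each Gradle invocation appears in the step log as a line containing the literal text ./gradlew; the actual log format and the function names are not taken from the PR.

```shell
# Count log lines that mention ./gradlew (assumed marker of an invocation).
count_gradlew_invocations() {
    # -c prints the number of matching lines, -F matches the pattern literally.
    # grep exits non-zero when nothing matches, so mask that with `|| true`.
    grep -c -F './gradlew' "$1" || true
}

# Succeed only if the log shows exactly one Gradle invocation.
check_single_suite() {
    count="$(count_gradlew_invocations "$1")"
    [ "$count" -eq 1 ]
}

# Usage (path is a placeholder):
#   check_single_suite /path/to/step.log && echo "single suite ran"
```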

Made with Cursor

Split each per-architecture Elasticsearch integration test step into
two independent steps that run in parallel:

- Multi-Node Tests (javaRestTest) — ~27 min on x86_64, ~37 min on aarch64
- YAML REST Tests (yamlRestTest) — ~12 min on x86_64, ~15 min on aarch64

Previously these ran sequentially in a single step, making the total
wall-clock time ~40 min (x86_64) / ~53 min (aarch64). Running them
in parallel reduces the critical path to the duration of the slower
suite, saving ~11-14 minutes per PR build.

The split also improves failure attribution (immediately visible which
suite failed) and enables selective retry of just the failed suite.

A new ES_TEST_SUITE environment variable controls which Gradle command
to run. When unset, both suites run sequentially for backward
compatibility with local developer use.

Made-with: Cursor
@prodsecmachine

prodsecmachine commented Mar 12, 2026

Snyk checks have passed. No issues have been found so far.

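| Scan Engine          | Critical | High | Medium | Low | Total (0) |
|----------------------|----------|------|--------|-----|-----------|
| Open Source Security | 0        | 0    | 0      | 0   | 0 issues  |
| Licenses             | 0        | 0    | 0      | 0   | 0 issues  |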

Contributor

@valeriy42 left a comment

LGTM

Resolve conflict in dev-tools/run_es_tests.sh: combine the
ES_TEST_SUITE case statement from this branch with the Gradle build
cache support (init script, GCS restore/upload) from upstream/main.

Made-with: Cursor
@edsavage edsavage merged commit e85af07 into elastic:main Mar 17, 2026
18 checks passed
edsavage added a commit to edsavage/ml-cpp that referenced this pull request Mar 17, 2026
Resolve conflict in dev-tools/run_es_tests.sh: incorporate
ES_TEST_SUITE support from elastic#2990 (parallel javaRestTest/yamlRestTest
steps) into our thin-wrapper architecture that delegates to
run_es_tests_common.sh.

Made-with: Cursor