Summary
The KB functional test step currently pulls the Elasticsearch and Kibana Docker images at runtime (~1 min) and then waits for ES to fully boot with security enabled (~7-10 min on cold agents). Even with our current workaround of starting ES before the npm build phase, this adds significant overhead and is sensitive to agent variability.
Kibana's own CI avoids this entirely by pre-baking the ES snapshot/image into the Buildkite agent and using a local cache directory ($ES_CACHE_DIR/cache), so no network pull is needed at test time.
Proposed solution
Pre-bake the Elasticsearch (and optionally Kibana) Docker images into the family/core-ubuntu-2204 GCP agent image used by the kb-functional pipeline step, similar to what the Kibana team does with their setup_es_snapshot_cache.sh script.
At test time, docker run would use the already-present image layer cache rather than pulling from the registry, reducing cold-start time from ~8 minutes to ~1-2 minutes.
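The pre-baking step could be sketched as a small script run while the agent image is built. This is a minimal sketch, not the Kibana team's script: the function names, the ES_IMAGE_REPO and KIBANA_IMAGE_REPO variables, the PREBAKE_KIBANA toggle, and the default repository paths are assumptions for illustration and should be replaced with whatever image references the kb-functional step actually runs.

```shell
# Sketch only: pre-pull stack images while the agent image is being built.
# All names below (variables and functions) are hypothetical.
set -eu

# Assumed image repositories; override to match what the pipeline really uses.
ES_IMAGE_REPO="${ES_IMAGE_REPO:-docker.elastic.co/elasticsearch/elasticsearch}"
KIBANA_IMAGE_REPO="${KIBANA_IMAGE_REPO:-docker.elastic.co/kibana/kibana}"

# Pull an image only if it is not already in the local Docker image store.
prebake_image() {
  if docker image inspect "$1" >/dev/null 2>&1; then
    echo "already cached: $1"
  else
    docker pull "$1"
    echo "pulled: $1"
  fi
}

# Pre-bake Elasticsearch, and Kibana too when PREBAKE_KIBANA=true.
prebake_stack_images() {
  prebake_image "${ES_IMAGE_REPO}:$1"
  if [ "${PREBAKE_KIBANA:-false}" = "true" ]; then
    prebake_image "${KIBANA_IMAGE_REPO}:$1"
  fi
}
```

During the image build this would be called as prebake_stack_images "$STACK_VERSION", once per stack version the pipeline tests against. Kibana is opt-in here because the proposal lists it as optional.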
References
- Kibana's implementation: .buildkite/scripts/setup_es_snapshot_cache.sh + ES_CACHE_DIR env var on their agents
- Our current workaround: starting ES before npm run build to overlap boot time with the build phase (elastic/cli#279)
- Affected pipeline step: kb-functional in .buildkite/pipeline.yml
Acceptance criteria
- The elasticsearch:${STACK_VERSION} image is available on the agent without a registry pull
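One way this criterion might be checked from inside the pipeline step is to inspect the local image store before starting the container. The sketch below is an assumption-laden illustration: assert_image_prebaked and ES_IMAGE_REPO are hypothetical names, and the default repository path may differ from the one the step uses.

```shell
# Sketch only: fail fast if the image would need a registry pull.
# assert_image_prebaked and ES_IMAGE_REPO are hypothetical names.
set -eu

# Assumed image repository; override to match what the pipeline really uses.
ES_IMAGE_REPO="${ES_IMAGE_REPO:-docker.elastic.co/elasticsearch/elasticsearch}"

# Succeeds only when the image is already in the agent's local image store.
assert_image_prebaked() {
  if docker image inspect "$1" >/dev/null 2>&1; then
    echo "OK: $1 is present locally"
  else
    echo "MISSING: $1 is not pre-baked on this agent" >&2
    return 1
  fi
}
```

The step would call assert_image_prebaked "${ES_IMAGE_REPO}:${STACK_VERSION}" before starting ES. Passing --pull=never to docker run gives a similar guarantee at start time, since Docker then errors out instead of silently pulling a missing image.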