
Add Fleet and Kibana checks in create env workflow#4092

Draft
seanrathier wants to merge 9 commits into main from seanrathier/cdr-kibana-readiness-check

Conversation

Contributor

@seanrathier seanrathier commented Apr 2, 2026

Summary of your changes

Adds a Kibana and Fleet readiness check to the CDR composite action so that Fleet Server is fully operational before the integration installation scripts run. Previously, integrations were installed as soon as Kibana reported itself available, which raced with Fleet Server initialization and caused intermittent install failures.

Root causes addressed:

  • `GET /api/fleet/status` returns 404 on ESS deployments — replaced with `POST /api/fleet/setup` (triggers Fleet initialization) followed by stability polling on `GET /api/fleet/epm/packages`
  • Fleet Server passes a single readiness check but crashes under load — now requires 5 consecutive 200s before proceeding
  • `perform_api_call` only retried 5xx errors — 429 (TLS handshake timeout surfaced as rate limit) and `400 "not available with current configuration"` (Fleet mid-init) are now also retried
  • `get_package_version` silently returned `None` on failure, causing 16 install scripts to POST a null package version and receive a confusing `400 "expected string but got null"` — now raises immediately
  • `delete_env.sh` jq expression had an operator precedence bug causing `Cannot index string with string "StackName"` when filtering CloudFormation stacks
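The "5 consecutive 200s" requirement can be illustrated with a small polling helper. This is a minimal sketch, not the code in `action.yml`: the function name `wait_for_stable`, the injectable `probe` callable, and the 5-second interval are assumptions made for illustration; the 60-attempt ceiling is borrowed from the commit message for the earlier 3-pass version and may differ in the final step.

```python
import time
from typing import Callable


def wait_for_stable(
    probe: Callable[[], int],
    required_successes: int = 5,
    max_attempts: int = 60,          # assumption: taken from the earlier 3-pass commit
    interval_seconds: float = 5.0,   # assumption: not stated in the PR
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `probe` reports HTTP 200 `required_successes` times in a row.

    `probe` is a hypothetical callable that performs one request (for example
    GET /api/fleet/epm/packages) and returns the HTTP status code.
    """
    streak = 0
    for attempt in range(1, max_attempts + 1):
        if probe() == 200:
            streak += 1
            if streak >= required_successes:
                return True
        else:
            # Any non-200 (such as a 502/503 during a restart) resets the streak.
            streak = 0
        if attempt < max_attempts:
            sleep(interval_seconds)
    return False


if __name__ == "__main__":
    # Canned statuses standing in for a Fleet Server that restarts once.
    statuses = iter([200, 200, 503, 200, 200, 200, 200, 200])
    print(wait_for_stable(lambda: next(statuses), sleep=lambda _s: None))
```

Resetting the streak on any failure is what distinguishes this from a plain "wait until first 200" loop: a single passing poll between two restart windows no longer counts as ready.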

Files changed:

  • `.github/actions/cdr/action.yml` — new "Wait for Kibana and Fleet to be ready" step: Kibana available → `POST /api/fleet/setup` → 5× consecutive `GET /api/fleet/epm/packages` 200; adds `serverless-mode` input to skip the Fleet check on serverless deployments
  • `.github/workflows/test-environment.yml` — passes `serverless-mode` through to the CDR action
  • `tests/fleet_api/base_call_api.py` — retries 429 and the Fleet-not-ready 400; increases default `max_retries` from 3 to 8 with backoff capped at 30s
  • `tests/fleet_api/common_api.py` — `get_package_version` raises on failure instead of returning `None`
  • `deploy/test-environments/delete_env.sh` — fixes jq operator precedence bug in CloudFormation stack filtering
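The retry changes described for `base_call_api.py` can be sketched as two small pure functions. This is an illustration under stated assumptions, not the actual `perform_api_call` implementation: the names `is_retryable` and `backoff_delays` are hypothetical, the 5-second base delay is inferred from the schedule quoted in the commit message, and the substring used to detect the Fleet-not-ready 400 is an assumption about the response body.

```python
from typing import List

# Assumption: a loose substring of the Fleet "not available with ... configuration"
# message; the exact wording of the real response body is not confirmed here.
FLEET_NOT_READY_MARKER = "not available with"


def is_retryable(status_code: int, body: str = "") -> bool:
    """Decide whether a response looks transient and is worth retrying."""
    if 500 <= status_code < 600:
        return True                      # server errors were already retried
    if status_code == 429:
        return True                      # rate limit, treated as transient
    if status_code == 400 and FLEET_NOT_READY_MARKER in body:
        return True                      # Fleet still initializing
    return False


def backoff_delays(
    max_retries: int = 8,
    base_seconds: float = 5.0,           # assumption: inferred from "5, 10, 20, 30, ..."
    cap_seconds: float = 30.0,
) -> List[float]:
    """Return the waits between attempts: doubling from the base, capped.

    Assumes `max_retries` counts attempts, so there are max_retries - 1 waits.
    """
    return [min(base_seconds * 2 ** i, cap_seconds) for i in range(max_retries - 1)]


if __name__ == "__main__":
    delays = backoff_delays()
    print(delays, sum(delays))
```

With the defaults, the schedule is seven waits totalling 155 seconds, which lines up with the roughly 2.5 minutes mentioned in the commit history.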

Screenshot/Data

Related Issues

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have added the necessary README/documentation (if appropriate)

Introducing a new rule?

mergify bot commented Apr 2, 2026

This pull request does not have a backport label. Could you fix it @seanrathier? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

seanrathier and others added 8 commits April 2, 2026 10:52
The /api/fleet/status endpoint returns 404 on ESS deployments running
Kibana 9.x. Switch to /api/fleet/agent_policies which is the first
endpoint the integration scripts rely on and confirms Fleet is truly
ready. Also log the response body on non-200 to aid future debugging,
add kbn-xsrf header, and add serverless-mode input to skip the Fleet
check on serverless deployments where Fleet is managed by Elastic Cloud.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GET /api/fleet/agent_policies returning 200 only confirms Kibana's Fleet
plugin is up, not that Fleet Server is ready for writes. Switch to
POST /api/fleet/setup which is idempotent and only returns
isInitialized:true once Fleet Server is fully configured and accepting
connections.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

POST /api/fleet/setup returning isInitialized:true only confirms Fleet
data is written to Elasticsearch; Fleet Server itself can still be
starting up. Add a second polling stage on GET /api/fleet/epm/packages
which is the first call every install script makes and requires the full
Fleet Server stack to be operational before returning 200.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

A single 200 from /api/fleet/epm/packages is not enough — Fleet Server
cycles through restart windows and one passing poll can be followed
immediately by 502/503. Require 3 consecutive 200s (up to 60 attempts)
before declaring Fleet stable.

Also retry the transient 400 "not available with the current
configuration" in perform_api_call — Fleet emits this during
initialisation but it resolves quickly and should not kill the script
on first occurrence.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Require 5 consecutive 200s (up from 3) from GET /api/fleet/epm/packages
before declaring Fleet Server stable — a 3-pass window was too narrow
to catch restart cycles.

Increase perform_api_call max_retries default from 3 to 8 and cap
exponential backoff at 30s (5, 10, 20, 30, 30, 30, 30s ≈ 2.5 min),
giving scripts enough time to ride through a Fleet Server restart before
giving up.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

429 responses from Fleet (e.g. TLS handshake timeout surfaced as rate
limit) are transient and should be retried alongside 5xx errors.

get_package_version silently returned None on failure, causing all 16
install scripts to POST a null package version and receive a confusing
400 "expected string but got null". Re-raise the exception so scripts
fail immediately with a clear error instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
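
The fail-fast behaviour described for `get_package_version` can be sketched as follows. This is a hypothetical stand-in, not the function in `common_api.py`: the name `resolve_package_version`, the `PackageVersionError` class, the injectable `fetch_package_info` callable, and the `"version"` key are all assumptions for illustration. The commit describes re-raising the original exception; this sketch wraps it instead so the package name appears in the message.

```python
from typing import Any, Callable, Dict


class PackageVersionError(RuntimeError):
    """Hypothetical error type for a package version that could not be resolved."""


def resolve_package_version(
    fetch_package_info: Callable[[str], Dict[str, Any]],
    package_name: str,
) -> str:
    """Return the package version, raising instead of returning None on failure."""
    try:
        info = fetch_package_info(package_name)
    except Exception as exc:
        # Surface the failure here rather than letting callers POST a null version.
        raise PackageVersionError(
            f"could not fetch version for package {package_name!r}"
        ) from exc

    version = info.get("version")
    if not isinstance(version, str) or not version:
        raise PackageVersionError(
            f"package {package_name!r} returned no usable version: {version!r}"
        )
    return version


if __name__ == "__main__":
    # A fake fetcher standing in for the real Fleet API call.
    print(resolve_package_version(lambda name: {"version": "1.2.3"}, "example_package"))
```

Raising at the lookup keeps the error next to its cause, so an install script stops with a message about the version lookup rather than a later 400 about a null field.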
seanrathier changed the title from "status check" to "Add Fleet and Kibana checks in create env workflow" on Apr 2, 2026