Add Fleet and Kibana checks in create env workflow #4092
Draft
seanrathier wants to merge 9 commits into main
This pull request does not have a backport label. Could you fix it @seanrathier? 🙏
The /api/fleet/status endpoint returns 404 on ESS deployments running Kibana 9.x. Switch to /api/fleet/agent_policies, which is the first endpoint the integration scripts rely on and confirms Fleet is truly ready. Also log the response body on non-200 responses to aid future debugging, add the kbn-xsrf header, and add a serverless-mode input to skip the Fleet check on serverless deployments, where Fleet is managed by Elastic Cloud. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GET /api/fleet/agent_policies returning 200 only confirms that Kibana's Fleet plugin is up, not that Fleet Server is ready for writes. Switch to POST /api/fleet/setup, which is idempotent and only returns isInitialized:true once Fleet Server is fully configured and accepting connections. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
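As a rough illustration of the check this commit describes, the sketch below decides whether one response from POST /api/fleet/setup counts as "initialized". The function name `fleet_setup_initialized` is hypothetical and not taken from the PR; the only detail drawn from the commit message is the `isInitialized` field. Any other field in the response body is ignored.

```python
import json


def fleet_setup_initialized(status: int, body: str) -> bool:
    """Hypothetical helper: True only for a 200 response whose JSON body
    contains isInitialized == true. Anything else means "keep polling"."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A proxy error page or a truncated body is treated as "not ready".
        return False
    return isinstance(payload, dict) and payload.get("isInitialized") is True
```

A caller would invoke this once per poll and continue waiting while it returns False.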
POST /api/fleet/setup returning isInitialized:true only confirms that Fleet data is written to Elasticsearch; Fleet Server itself can still be starting up. Add a second polling stage on GET /api/fleet/epm/packages, which is the first call every install script makes and requires the full Fleet Server stack to be operational before returning 200. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A single 200 from /api/fleet/epm/packages is not enough — Fleet Server cycles through restart windows and one passing poll can be followed immediately by 502/503. Require 3 consecutive 200s (up to 60 attempts) before declaring Fleet stable. Also retry the transient 400 "not available with the current configuration" in perform_api_call — Fleet emits this during initialisation but it resolves quickly and should not kill the script on first occurrence. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
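The consecutive-success rule described in this commit could look roughly like the sketch below. It is not the PR's actual implementation: the function name, the `poll` callback (which stands in for the GET /api/fleet/epm/packages request and returns an HTTP status code), and the 5-second interval are all assumptions made for illustration. The key behaviour is that any non-200 response resets the streak to zero.

```python
import time
from typing import Callable


def wait_for_consecutive_successes(
    poll: Callable[[], int],
    required: int = 3,
    max_attempts: int = 60,
    interval: float = 5.0,  # assumed value; the commit does not state one
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `poll` has returned 200 `required` times in a row,
    or False if `max_attempts` polls pass without such a streak."""
    streak = 0
    for _ in range(max_attempts):
        if poll() == 200:
            streak += 1
            if streak >= required:
                return True
        else:
            # A 502/503 in the middle of a restart window resets the count.
            streak = 0
        sleep(interval)
    return False
```

Injecting `sleep` keeps the function testable without real waiting; a workflow step would use the default `time.sleep`.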
Require 5 consecutive 200s (up from 3) from GET /api/fleet/epm/packages before declaring Fleet Server stable — a 3-pass window was too narrow to catch restart cycles. Increase perform_api_call max_retries default from 3 to 8 and cap exponential backoff at 30s (5, 10, 20, 30, 30, 30, 30s ≈ 2.5 min), giving scripts enough time to ride through a Fleet Server restart before giving up. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
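The delay schedule quoted in this commit (5, 10, 20, 30, 30, 30, 30 seconds) is consistent with doubling from a 5-second base and capping at 30 seconds, with one delay between each pair of the 8 attempts. The sketch below reproduces that arithmetic; the function name `backoff_delays` and the 5-second base are inferred from the listed numbers rather than taken from the PR's code.

```python
from typing import List


def backoff_delays(max_retries: int = 8, base: float = 5.0, cap: float = 30.0) -> List[float]:
    """Delays between attempts: exponential growth from `base`, capped at `cap`.
    With N attempts there are N - 1 waits."""
    return [min(base * 2 ** i, cap) for i in range(max_retries - 1)]


delays = backoff_delays()
total_seconds = sum(delays)  # 155 seconds for the defaults
```

With the defaults, the waits add up to 155 seconds, which matches the roughly 2.5 minute budget stated above.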
429 responses from Fleet (e.g. TLS handshake timeout surfaced as rate limit) are transient and should be retried alongside 5xx errors. get_package_version silently returned None on failure, causing all 16 install scripts to POST a null package version and receive a confusing 400 "expected string but got null". Re-raise the exception so scripts fail immediately with a clear error instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
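Taken together, the commits above treat 5xx, 429, and one specific 400 as transient, and make a failed version lookup raise instead of returning None. The sketch below shows one way that could be expressed. It is only an assumption-laden illustration: `is_retryable`, the `fetch` callback, and the `"version"` key are hypothetical, and the real `perform_api_call` and `get_package_version` in the PR may be structured differently.

```python
from typing import Any, Callable, Mapping

# Message fragment quoted in the commit that introduced the 400 retry.
TRANSIENT_400_MARKER = "not available with the current configuration"


def is_retryable(status: int, body: str = "") -> bool:
    """True for responses the commits above describe as transient."""
    if status == 429 or 500 <= status <= 599:
        return True
    return status == 400 and TRANSIENT_400_MARKER in body


def get_package_version(fetch: Callable[[], Mapping[str, Any]]) -> str:
    """Hypothetical shape: `fetch` performs the lookup and returns a mapping
    with a "version" key. Failures are re-raised rather than swallowed, so a
    caller never goes on to POST a null version."""
    try:
        version = fetch()["version"]
    except Exception as exc:
        raise RuntimeError("could not determine package version") from exc
    if not isinstance(version, str):
        raise RuntimeError(f"unexpected package version: {version!r}")
    return version
```

Failing fast here surfaces the real cause at the lookup, instead of a later 400 "expected string but got null" from the install request.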
Summary of your changes
Adds a Kibana and Fleet readiness check to the CDR composite action, ensuring Fleet Server is fully operational before integration installation scripts run. Previously, integrations were installed immediately after Kibana reported available, causing race conditions where Fleet Server was still initializing.
Root causes addressed:
Files changed:
Screenshot/Data
Related Issues
Checklist
Introducing a new rule?