
feat(kb): add kibana functional tests #279

Merged

margaretjgu merged 39 commits into main from feat/kb-functional-tests on May 5, 2026

Conversation

@margaretjgu
Contributor

margaretjgu commented May 4, 2026

Summary

Closes #219

  • Adds hand-authored functional tests for 5 Kibana API namespaces: data-views, spaces, alerting, connectors, saved-objects
  • Each test exercises the full CRUD lifecycle (create, get, list, delete / export, import) using the same set -euo pipefail + elastic --json + jq assertion pattern as the ES functional tests (sketched after this list)
  • Fixes two pre-existing bugs in kibana-client.ts (introduced in feat(kb): Add Kibana API support with generated command definitions #195) required to make saved-objects work:
    • application/x-ndjson responses are now parsed line-by-line into a JSON array instead of throwing a JSON parse error
    • multipart/form-data requests are now sent via FormData (file upload) instead of application/json, fixing the 415 Unsupported Media Type error from Kibana
  • Adds requestType and responseType fields to KbApiDefinition so the request builder and client know which transport mode to use (depends on elastic/elastic-client-generator-js feat/kb-ndjson-multipart)
  • Adds a Buildkite CI step (run-kb-tests.sh) that starts ES + Kibana Docker containers and runs the suite on Node 22 and 24
  • Adds test:functional:kb npm script for local execution
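
For illustration, the shared shell pattern looks roughly like this. The kibana subcommand and flag names below are hypothetical placeholders, not the CLI's actual interface:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Create a resource, capture its id, assert the round-trip with jq, clean up.
# Subcommand/flag names are illustrative only.
created="$(elastic --json kibana data-views create --body '{"data_view":{"title":"ci-logs-*"}}')"
id="$(echo "$created" | jq -r '.data_view.id')"

fetched="$(elastic --json kibana data-views get --id "$id")"
echo "$fetched" | jq -e --arg id "$id" '.data_view.id == $id' >/dev/null

elastic --json kibana data-views delete --id "$id"
```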

Future optimization: #280

Test plan

  • All 5 tests pass locally against a live Kibana 9.3.0 via npm run test:functional:kb (5 passed, 0 failed)
  • Generator PR (elastic-client-generator-js feat/kb-ndjson-multipart) merged first
  • Buildkite KB functional tests step passes in CI

margaretjgu changed the title from "Feat/kb functional tests" to "feat(kb): add kibana functional tests" on May 4, 2026
@github-actions
github-actions bot commented May 4, 2026

⚠️ MegaLinter analysis: Success with warnings

| Descriptor | Linter | Files | Fixed | Errors | Warnings | Elapsed time |
|------------|--------|-------|-------|--------|----------|--------------|
| ⚠️ BASH | shellcheck | 8 | | 1 | 0 | 0.46s |
| ✅ COPYPASTE | jscpd | yes | | no | no | 5.26s |
| ✅ REPOSITORY | gitleaks | yes | | no | no | 92.27s |
| ✅ REPOSITORY | git_diff | yes | | no | no | 0.46s |
| ✅ REPOSITORY | secretlint | yes | | no | no | 17.72s |
| ✅ REPOSITORY | trivy | yes | | no | no | 15.94s |
| ✅ TYPESCRIPT | eslint | 7 | | 0 | 0 | 6.07s |
| ✅ YAML | yamllint | 1 | | 0 | 0 | 0.78s |

Detailed Issues

⚠️ BASH / shellcheck - 1 error
In .buildkite/run-kb-tests.sh line 65:
  docker load < "$(ls "$ES_CACHE_DIR/elasticsearch-$STACK_VERSION"*.tar.gz | head -1)"
                   ^-- SC2012 (info): Use find instead of ls to better handle non-alphanumeric filenames.

For more information:
  https://www.shellcheck.net/wiki/SC2012 -- Use find instead of ls to better ...

See detailed reports in MegaLinter artifacts
Set VALIDATE_ALL_CODEBASE: true in mega-linter.yml to validate all sources, not only the diff


@JoshMock
Member

JoshMock commented May 5, 2026

Not sure what's up with Megalinter, but otherwise this LGTM.

@margaretjgu
Contributor Author

> Not sure what's up with Megalinter, but otherwise this LGTM.

Yea, it was flagging some dummy keys as a leak. Updated the linter to ignore these specific keys.

@margaretjgu
Contributor Author

ES startup timeout: root cause and fixes

Getting ES running in CI for the Kibana tests turned out to be surprisingly tricky. Here is what happened and how we fixed it.

Why ES kept timing out

When you run docker run --detach, Docker does not just start the process. It first has to unpack the image layers into the container filesystem (via overlay2). For the 695MB Elasticsearch image, this alone took 7+ minutes on cold CI agents. Our health check timer started immediately after docker run returned, so we were timing out before ES had even booted its JVM.

The script was doing things in this order:

npm install → npm build → docker pull → docker run ES → wait 6 min → timeout

Meanwhile, ES was not even alive yet.

What we changed

1. Start ES before the build

npm install + npm build takes about 15 minutes. We now start the ES and Kibana containers before that, so they boot in the background while the build runs. By the time the build finishes, both services are ready and waiting.

docker pull → docker run ES → docker run Kibana → npm install → npm build → check health (instant)
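
In script form, the reordering looks roughly like this (the helper functions here are hypothetical stand-ins for the real script's steps):

```bash
# Start the slow-to-boot services first; they warm up while the build runs.
docker pull "docker.elastic.co/elasticsearch/elasticsearch:$STACK_VERSION"
start_elasticsearch_container   # hypothetical helper wrapping `docker run -d`
start_kibana_container          # hypothetical helper wrapping `docker run -d`

# ~15 minutes of npm work doubles as the services' boot window.
npm install
npm run build

# By now both containers are up, so the health checks return immediately.
wait_for_es                     # hypothetical helper polling /_cluster/health
wait_for_kibana                 # hypothetical helper polling /api/status
```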

2. Better agent image

Switched from family/core-ubuntu-2204 to family/kibana-ubuntu-2404 (same image the Kibana team uses) with machineType: n2-standard-4. The newer N2 hardware has faster local storage I/O, which is the main bottleneck for Docker image unpacking.

3. Two-phase ES health check

Even after /_cluster/health returns 200, the .security-7 index may not exist yet. Kibana's alerting and connectors plugins need it to store API keys. We now explicitly probe POST /_security/api_key to confirm the security index is bootstrapped before starting Kibana. This is borrowed from Kibana's own wait_for_security_index.ts.
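
Roughly, assuming an ES_URL and the elastic superuser password in ELASTIC_PASSWORD:

```bash
# Phase 1: wait until the cluster answers at all.
until curl -fsS -u "elastic:$ELASTIC_PASSWORD" "$ES_URL/_cluster/health" >/dev/null; do
  sleep 2
done

# Phase 2: creating an API key only succeeds once the security index is
# bootstrapped, which is exactly what alerting/connectors will need.
until curl -fsS -u "elastic:$ELASTIC_PASSWORD" -X POST "$ES_URL/_security/api_key" \
    -H 'Content-Type: application/json' -d '{"name":"ci-readiness-probe"}' >/dev/null; do
  sleep 2
done
```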

4. ES snapshot cache

The kibana-ubuntu-2404 agent image may have ES Docker layers pre-cached on disk. The script now checks $ES_CACHE_DIR first and loads from there before falling back to a registry pull, which is the same approach Kibana uses.
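
Sketched below (using find rather than ls, per the shellcheck SC2012 note above):

```bash
# Prefer a pre-cached image tarball on the agent; fall back to a registry pull.
cached="$(find "$ES_CACHE_DIR" -name "elasticsearch-$STACK_VERSION*.tar.gz" 2>/dev/null | head -1)"
if [ -n "$cached" ]; then
  docker load < "$cached"
else
  docker pull "docker.elastic.co/elasticsearch/elasticsearch:$STACK_VERSION"
fi
```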

…lth checks

Ubuntu 24.04 agents use nftables which can break Docker --publish port forwarding.
Resolve container IPs on the Docker bridge network immediately after docker run
and use those for all health checks and CLI config. The Linux host can always
reach bridge network container IPs directly without going through port publishing.
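
The IP resolution itself is a docker inspect one-liner (sketch; container name assumed):

```bash
# Grab the bridge-network IP right after `docker run` and use it for all
# subsequent health checks instead of a published localhost port.
ES_IP="$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' elasticsearch)"
curl -fsS "http://$ES_IP:9200/_cluster/health"
```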
margaretjgu added 13 commits May 5, 2026 16:03
Docker bridge/port-publishing both fail on kibana-ubuntu-2404 agents
(Ubuntu 24.04 uses nftables which breaks Docker NAT rules). --network host
puts both containers directly on the host network stack so localhost:9200
and localhost:5601 work unconditionally — equivalent to how Kibana's own CI
runs ES as a native process via node scripts/es snapshot.
…unner container

On kibana-ubuntu-2404 agents, all host→container networking is broken:
--network host is blocked (user namespace remapping), --publish doesn't work
(nftables replaces iptables), and direct bridge IPs are not routed to the host.

The only reliable networking is inter-container communication on a custom
bridge network. This matches how Kibana's own CI works: Kibana runs ES
natively so localhost always works; we achieve the same by running a dedicated
test-runner container (node:NODE_VERSION-bookworm-slim) on the same network as
ES and Kibana, using Docker DNS aliases (elasticsearch:9200, kibana:5601). The
built workspace is mounted read-only so node dist/cli.js works without rebuilding.
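
In outline (network name, env var names, and mount path are illustrative):

```bash
# One user-defined bridge network gives reliable container-to-container DNS.
docker network create kb-tests

# ES and Kibana each join the network under a DNS alias (other run flags as in
# the earlier steps).
docker run -d --network kb-tests --network-alias elasticsearch \
  "docker.elastic.co/elasticsearch/elasticsearch:$STACK_VERSION"
docker run -d --network kb-tests --network-alias kibana \
  "docker.elastic.co/kibana/kibana:$STACK_VERSION"

# The runner joins the same network, reaches the services by alias, and runs
# the already-built workspace mounted read-only.
docker run --rm --network kb-tests \
  -v "$PWD:/workspace:ro" -w /workspace \
  -e ES_URL=http://elasticsearch:9200 -e KB_URL=http://kibana:5601 \
  "node:${NODE_VERSION}-bookworm-slim" npm run test:functional:kb
```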
The test runner container couldn't reach ES via Docker DNS (elasticsearch:9200)
in build 268 for an unknown reason. This adds:

- Network diagnostics in the runner (resolv.conf, routes, first verbose curl)
  so the next failure gives us the exact error (DNS failure, TCP refused, etc.)
- IP fallback: docker inspect fetches ES/Kibana container IPs on the host and
  passes them as ES_IP / KB_IP; the runner uses these if DNS lookup fails
- docker logs for ES and Kibana in the cleanup trap for visibility
- Lower Node.js heap limit (6GB -> 4GB) to reduce memory pressure during build
ES 8.0+ auto-enables HTTPS on the HTTP layer when ELASTIC_PASSWORD is set.
This caused "Empty reply from server" errors because the test runner and Kibana
were connecting via http:// to a port expecting TLS. Kibana then crashed and
its DNS entry was removed, explaining the secondary "kibana DNS failed" symptom.

Setting xpack.security.http.ssl.enabled=false keeps security (auth, RBAC, API
keys) enabled while allowing plain HTTP access, which is fine for CI.
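
Concretely, the relevant ES container settings look like this (sketch; other flags elided):

```bash
# Security stays on (auth, RBAC, API keys); only HTTP-layer TLS is disabled.
docker run -d --name elasticsearch --network kb-tests \
  -e discovery.type=single-node \
  -e ELASTIC_PASSWORD="$ELASTIC_PASSWORD" \
  -e xpack.security.enabled=true \
  -e xpack.security.http.ssl.enabled=false \
  "docker.elastic.co/elasticsearch/elasticsearch:$STACK_VERSION"
```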
Two fixes:
- Kibana was crashing silently because --rm auto-removed the container before
  the cleanup trap could collect logs. Removed --rm so docker logs always works.
- Kibana was likely crashing because it tried to connect to ES before ES finished
  bootstrapping. Moved Kibana startup to after the npm build (~3 min buffer),
  so ES is fully ready when Kibana first connects. ES still starts early.
- Also adds xpack.security.transport.ssl.enabled=false to ES for consistency
  with the http.ssl flag (aligns with elastic/start-local reference setup).
Kibana 9.x explicitly forbids ELASTICSEARCH_USERNAME=elastic with a fatal
config validation error. We must use kibana_system instead.

Since the host cannot reach ES directly on this agent, a one-shot Node.js
container (setup-kibana.js) runs on the same Docker network, waits for ES
cluster health and the security index, sets the kibana_system password, then
exits. Kibana is then started with ELASTICSEARCH_USERNAME=kibana_system.
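
The password step itself is one call to ES's security API; a curl equivalent of what the one-shot container does (hostname and env var names assumed):

```bash
# After ES is healthy, set the kibana_system password before Kibana starts.
curl -fsS -u "elastic:$ELASTIC_PASSWORD" -X POST \
  "http://elasticsearch:9200/_security/user/kibana_system/_password" \
  -H 'Content-Type: application/json' \
  -d "{\"password\":\"$KIBANA_SYSTEM_PASSWORD\"}"
```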
The repo has "type": "module" in package.json so .js files are treated as
ESM, causing "require is not defined in ES module scope". Renaming to .cjs
forces Node to treat it as CommonJS regardless of package.json. Also adds
the missing SPDX-License-Identifier header to pass the test:spdx check.
The /api/actions/connector_types and /api/alerting/rules/_find health checks
were timing out because the Fleet plugin's retry loop (FleetEncryptedSaved
ObjectEncryptionKeyRequired for agent binary source) was causing those
endpoints to return non-200 responses. Fleet's issue is unrelated to our tests.

Replace the 30-retry polling loop with a 15-second sleep after Kibana reports
"available". By that point all essential plugins (alerting, actions) are
initialised as part of the "available" state.
… check

Three root causes addressed:

1. `params` (alerting create) and `config`/`secrets` (connector create/update)
   were typed as "string" in the generated API definitions.  The CLI factory
   only JSON-parses flag values for "object"/"array" typed params, so these
   were sent as raw string literals instead of JSON objects, producing 400
   errors from the Kibana API.  Fixed in the generator (elastic-client-generator-js#174)
   and reflected here.

2. The previous `sleep 15` after Kibana's "available" status was not reliable.
   Kibana's actions plugin serves 403 "license information is not available"
   until its license subscription fires after connecting to ES.  Replaced with
   an active poll on GET /api/actions/connector_types which directly confirms
   the license is loaded and the actions API is ready.

3. Added stderr capture (2>/tmp/cli-err.txt + cat on failure) to the first
   CLI call in alerting.sh and connectors.sh so the actual HTTP error is
   visible in the Buildkite log if any future failure occurs.
Polling GET /api/actions/connector_types directly was causing repeated
500 Server Errors in Kibana's HTTP access log (the actions plugin HTTP
context is not yet wired when Kibana first reports 'available', so early
requests get a 500).  This looked like the Fleet error resurfacing.

Switch to polling /api/status and checking
  .status.plugins.actions.level == 'available'
  .status.plugins.alerting.level == 'available'

The status endpoint always returns 200 and never causes log noise.
Fleet degradation appears only in plugins.fleet and does not affect
plugins.actions or plugins.alerting.
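
A sketch of that poll (Kibana URL assumed):

```bash
# /api/status always returns 200, so polling it produces no error-log noise.
until curl -fsS "http://kibana:5601/api/status" | jq -e '
    .status.plugins.actions.level == "available" and
    .status.plugins.alerting.level == "available"' >/dev/null; do
  sleep 2
done
```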
Kibana's Docker entrypoint only processes environment variables in
SCREAMING_SNAKE_CASE format (e.g. XPACK_ENCRYPTEDSAVEDOBJECTS_ENCRYPTIONKEY).
Dotted-notation names (e.g. xpack.encryptedSavedObjects.encryptionKey) are
not picked up, so encryptedSavedObjects.canEncrypt stayed false in CI.

Every call to getActionsClient() checks canEncrypt and throws
  'Unable to create actions client because the Encrypted Saved Objects
   plugin is missing encryption key'
causing a 500 on both POST /api/alerting/rule and GET /api/actions/connector_types.

Confirmed: local Kibana (start-local) sets XPACK_ENCRYPTEDSAVEDOBJECTS_ENCRYPTIONKEY
and all 5 functional tests pass (5/5 locally).
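
So the Kibana container gets the key in the entrypoint's expected format (sketch; values assumed to come from the CI environment):

```bash
# XPACK_ENCRYPTEDSAVEDOBJECTS_ENCRYPTIONKEY must be at least 32 characters.
docker run -d --name kibana --network kb-tests \
  -e ELASTICSEARCH_HOSTS=http://elasticsearch:9200 \
  -e ELASTICSEARCH_USERNAME=kibana_system \
  -e ELASTICSEARCH_PASSWORD="$KIBANA_SYSTEM_PASSWORD" \
  -e XPACK_ENCRYPTEDSAVEDOBJECTS_ENCRYPTIONKEY="$KB_ENCRYPTION_KEY" \
  "docker.elastic.co/kibana/kibana:$STACK_VERSION"
```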
margaretjgu enabled auto-merge (squash) May 5, 2026 23:08
margaretjgu merged commit adb6cc7 into main May 5, 2026
20 of 35 checks passed
margaretjgu deleted the feat/kb-functional-tests branch May 5, 2026 23:19
