Fix OTLP traces e2e test stability #13752

Merged
wu-sheng merged 6 commits into master from fix/otlp-traces-e2e-stability
Mar 20, 2026
@wu-sheng
Member

Fix OTLP traces e2e test instability

The OTLP traces e2e test has been flaky due to infrastructure issues (not sampling rate — that's 100% everywhere).

Root causes identified and fixed:

  1. No health checks on OTel demo containers — trigger fired before services were ready, producing no traces during the retry window.

    • Added healthcheck with TCP checks (same pattern as base-compose OAP)
    • Added depends_on: condition: service_healthy for proper startup ordering
  2. Non-existent service endpoints causing 20-30s timeouts. CURRENCY_SERVICE_ADDR: no.exist:80 and FEATURE_FLAG_GRPC_SERVICE_ADDR: no.exist:80 caused DNS resolution failures and gRPC dial timeouts on every request, making /api/products slow or fail entirely.

    • Changed to productcatalogservice:3550 (reachable endpoint, fast gRPC "unimplemented" error instead of hanging)
  3. Tight memory limits. productcatalogservice at 20M and frontend at 200M could OOM under CI load.

    • Bumped to 40M and 300M respectively
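
The first two fixes can be sketched as a compose fragment. This is a minimal illustration, not the PR's actual file: the nc-based probe, the timing values, the startup order (frontend waiting on productcatalogservice), and the placement of each environment variable are assumptions. The service names, port 3550, and the variable names come from the description above.

```yaml
services:
  productcatalogservice:
    environment:
      # Was no.exist:80; a reachable address fails fast instead of hanging.
      # (Which service carries this variable is an assumption.)
      FEATURE_FLAG_GRPC_SERVICE_ADDR: productcatalogservice:3550
    healthcheck:
      # Assumed TCP probe: exits 0 once something listens on port 3550.
      test: ["CMD", "sh", "-c", "nc -nz 127.0.0.1 3550"]
      interval: 5s
      timeout: 60s
      retries: 120

  frontend:
    environment:
      CURRENCY_SERVICE_ADDR: productcatalogservice:3550
    depends_on:
      productcatalogservice:
        # Start frontend only after the healthcheck above passes.
        condition: service_healthy
```

With condition: service_healthy, Compose holds back the dependent container until the dependency's health check reports healthy, so requests are not sent to a service that is still starting.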

Also adds e2e expectation specification documents (CLAUDE.md and protocol-specific specs) for AI-assisted e2e test development.

  • Explain briefly why the bug exists and how to fix it.

    • The test containers had no health checks, so the e2e trigger started calling endpoints before services were ready. Combined with DNS timeout on non-existent service addresses, requests took 20-30s each instead of completing quickly, starving the test of valid traces within the verify window.
  • Update the CHANGES log.

1. Fix docker-compose for OTLP traces e2e test:
   - Add health checks to frontend and productcatalogservice containers
   - Use depends_on with condition: service_healthy for proper startup ordering
   - Replace no.exist:80 endpoints with reachable addresses to avoid DNS/connection timeouts
   - Increase memory limits (200M->300M frontend, 20M->40M productcatalogservice)

2. Add e2e expectation specification documents covering all query protocols:
   - Core template syntax (contains, notEmpty, gt/ge/lt/le, b64enc, regexp)
   - GraphQL/MQE with .graphqls schema references
   - LogQL, PromQL, TraceQL, Zipkin v2, Status/Debug endpoints
   - CLAUDE.md guide for navigating e2e test structure
@wu-sheng added the test and AI Assistant labels and removed the test label on Mar 20, 2026

OTel demo images are Alpine-based (no bash, only sh). The bash
/dev/tcp trick doesn't work. Use nc -nz (same as base-compose
BanyanDB healthcheck pattern).
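
A hedged sketch of the difference this commit describes. The frontend port 8080 and the timing values are assumptions, and the commented-out bash form is a reconstruction of the kind of /dev/tcp probe the message refers to, not a copy of the original line.

```yaml
services:
  frontend:
    healthcheck:
      # Fails on Alpine-based images: /dev/tcp is a bash feature and the
      # image ships only sh, so the probe can never succeed.
      # test: ["CMD", "bash", "-c", "cat < /dev/null > /dev/tcp/127.0.0.1/8080"]

      # Runs under plain sh: nc -z only checks that the port accepts a
      # connection, and -n skips DNS lookups (hence the numeric address).
      test: ["CMD", "sh", "-c", "nc -nz 127.0.0.1 8080"]
      interval: 5s
      timeout: 60s
      retries: 120
```

A probe that cannot run leaves the container permanently unhealthy, so anything gated on service_healthy never starts. That is consistent with the CI failure attributed to the bash healthcheck.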

The CI failure was caused by the bash healthcheck, not memory.
Restore original limits: 200M frontend, 20M productcatalogservice.
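
For reference, the restored limits might look like the fragment below. It assumes the compose file expresses limits through deploy.resources.limits.memory; the actual file may use a different key (for example mem_limit).

```yaml
services:
  frontend:
    deploy:
      resources:
        limits:
          memory: 200M   # restored original value
  productcatalogservice:
    deploy:
      resources:
        limits:
          memory: 20M    # restored original value
```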

Remove 7 protocol-specific docs that duplicated info already in
.graphqls schemas and existing expected files. Replace with a single
CLAUDE.md focused on how to write accurate expectation files:
- Exact function semantics (notEmpty, gt/ge/lt/le, regexp, b64enc)
- contains action for unordered subset matching
- Practical recipes: key existence, list min N records, nested objects,
  null handling, variable reuse, conditional rendering
- Common pitfalls (type mismatches, ordering, whitespace)
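
As an illustration of the contains action and b64enc, an expected file for a service-list query could look roughly like this. It is a sketch only: the service name, layer, and field set are hypothetical and are not taken from the PR's expected files.

```yaml
# The actual list may hold more services, in any order; the check passes
# if at least one element matches the entry below.
{{- contains . }}
- id: {{ b64enc "frontend" }}.1   # id derived from the name, not hard-coded
  name: frontend                  # hypothetical service name
  shortname: frontend
  group: ""
  normal: true
  layers:
    - GENERAL                     # hypothetical layer
{{- end }}
```

Using contains avoids the ordering pitfall listed above: an exact list comparison would fail whenever the backend returns the same services in a different order.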

The core policy: verify exact values for all meaningful fields
(service names, labels, components, layers, scopes, tag keys/values),
only use notEmpty/gt for genuinely dynamic values (timestamps, UUIDs,
IPs). Updated all recipes to demonstrate accurate verification with
real SkyWalking field patterns.
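
A small sketch of that policy applied to a trace-list result. The field names and the endpoint value are hypothetical; the point is only which fields get exact values and which get a function.

```yaml
traces:
{{- contains .traces }}
- segmentid: {{ notEmpty .segmentid }}   # dynamic: generated per run
  endpointnames:
    - GET:/api/products                  # stable: assert the exact value
  duration: {{ ge .duration 0 }}         # dynamic: only bound it
  start: {{ notEmpty .start }}           # dynamic: timestamp
  iserror: false                         # stable: assert the exact value
{{- end }}
```

Replacing the endpoint name or iserror with notEmpty would still pass, but the test would no longer catch a wrong endpoint or an unexpected error flag.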

When writing expected files, Claude often cannot determine which
fields are meaningful domain values vs dynamic runtime values.
Added explicit workflow: show complete raw output, propose expected
file with per-field reasoning, ask developer to confirm or correct.

@wu-sheng wu-sheng added this to the 10.4.0 milestone Mar 20, 2026
@wu-sheng wu-sheng merged commit 4f70645 into master Mar 20, 2026
186 checks passed
@wu-sheng wu-sheng deleted the fix/otlp-traces-e2e-stability branch March 20, 2026 07:17
Labels

AI Assistant (Claude and other AI Coding Tooling), test (Test requirements about performance, feature or before release)