Fix OTLP traces e2e test stability by wu-sheng · Pull Request #13752 · apache/skywalking

wu-sheng · 2026-03-20T01:50:02Z

Fix OTLP traces e2e test instability

The OTLP traces e2e test has been flaky due to infrastructure issues, not the sampling rate, which is 100% everywhere.

Root causes identified and fixed:

1. No health checks on OTel demo containers: the trigger fired before services were ready, producing no traces during the retry window.
   - Added healthcheck with TCP checks (same pattern as base-compose OAP)
   - Added depends_on: condition: service_healthy for proper startup ordering
2. Non-existent service endpoints causing 20-30s timeouts: CURRENCY_SERVICE_ADDR: no.exist:80 and FEATURE_FLAG_GRPC_SERVICE_ADDR: no.exist:80 caused DNS resolution failures and gRPC dial timeouts on every request, making /api/products slow or causing it to fail entirely.
   - Changed to productcatalogservice:3550 (a reachable endpoint that returns a fast gRPC "unimplemented" error instead of hanging)
3. Tight memory limits: productcatalogservice at 20M and frontend at 200M could OOM under CI load.
   - Bumped to 40M and 300M respectively
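
The startup-ordering and endpoint fixes above could look roughly like the following docker-compose fragment. This is a minimal sketch, not the PR's actual file: the image names, the frontend port 8080, and the healthcheck timing values are assumptions chosen for illustration. Only the two environment variable names and productcatalogservice:3550 come from the description.

```yaml
# Illustrative sketch only. Image names, the frontend port (8080) and the
# interval/timeout/retries values are assumptions, not the PR's real values.
services:
  productcatalogservice:
    image: example/productcatalogservice:placeholder   # hypothetical image
    healthcheck:
      # TCP probe: healthy once something is listening on 3550
      test: ["CMD", "sh", "-c", "nc -nz 127.0.0.1 3550"]
      interval: 5s
      timeout: 60s
      retries: 120

  frontend:
    image: example/frontend:placeholder                # hypothetical image
    environment:
      # Point unused dependencies at a reachable host so calls fail fast
      # instead of waiting on DNS resolution for a non-existent name.
      CURRENCY_SERVICE_ADDR: productcatalogservice:3550
      FEATURE_FLAG_GRPC_SERVICE_ADDR: productcatalogservice:3550
    depends_on:
      productcatalogservice:
        condition: service_healthy   # wait for the probe above to pass
    healthcheck:
      test: ["CMD", "sh", "-c", "nc -nz 127.0.0.1 8080"]  # assumed port
      interval: 5s
      timeout: 60s
      retries: 120
```

With condition: service_healthy, Compose starts frontend only after productcatalogservice reports healthy, so anything that in turn depends on frontend being healthy does not fire against services that are still starting.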

Also adds e2e expectation specification documents (CLAUDE.md and protocol-specific specs) for AI-assisted e2e test development.

Explain briefly why the bug exists and how to fix it.
- The test containers had no health checks, so the e2e trigger started calling endpoints before services were ready. Combined with DNS timeout on non-existent service addresses, requests took 20-30s each instead of completing quickly, starving the test of valid traces within the verify window.
Update the CHANGES log.

1. Fix docker-compose for OTLP traces e2e test:
   - Add health checks to frontend and productcatalogservice containers
   - Use depends_on with condition: service_healthy for proper startup ordering
   - Replace no.exist:80 endpoints with reachable addresses to avoid DNS/connection timeouts
   - Increase memory limits (200M->300M frontend, 20M->40M productcatalogservice)
2. Add e2e expectation specification documents covering all query protocols:
   - Core template syntax (contains, notEmpty, gt/ge/lt/le, b64enc, regexp)
   - GraphQL/MQE with .graphqls schema references
   - LogQL, PromQL, TraceQL, Zipkin v2, Status/Debug endpoints
   - CLAUDE.md guide for navigating e2e test structure

OTel demo images are Alpine-based (no bash, only sh). The bash /dev/tcp trick doesn't work. Use nc -nz (same as base-compose BanyanDB healthcheck pattern).
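
As a rough illustration of the difference, the two healthcheck forms might be written as below. This is a sketch under assumptions: port 3550 is the productcatalogservice port mentioned in the description, the timing values are illustrative, and it presumes nc is present in the image.

```yaml
# Sketch only: timing values are assumptions, not the PR's actual settings.
healthcheck:
  # Needs bash, so it cannot run in Alpine-based images that only ship sh:
  # test: ["CMD", "bash", "-c", "cat < /dev/null > /dev/tcp/127.0.0.1/3550"]

  # Runs under plain sh; nc -z only checks that the port accepts connections,
  # and -n skips DNS lookups by requiring a numeric address.
  test: ["CMD", "sh", "-c", "nc -nz 127.0.0.1 3550"]
  interval: 5s
  timeout: 60s
  retries: 120
```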

The CI failure was caused by the bash healthcheck, not memory. Restore original limits: 200M frontend, 20M productcatalogservice.

Remove 7 protocol-specific docs that duplicated info already in .graphqls schemas and existing expected files. Replace with a single CLAUDE.md focused on how to write accurate expectation files:
- Exact function semantics (notEmpty, gt/ge/lt/le, regexp, b64enc)
- contains action for unordered subset matching
- Practical recipes: key existence, list min N records, nested objects, null handling, variable reuse, conditional rendering
- Common pitfalls (type mismatches, ordering, whitespace)

The core policy: verify exact values for all meaningful fields (service names, labels, components, layers, scopes, tag keys/values), and use notEmpty/gt only for genuinely dynamic values (timestamps, UUIDs, IPs). Updated all recipes to demonstrate accurate verification with real SkyWalking field patterns.
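
A hedged sketch of what this policy might look like in an expected file for a trace list query follows. The field names and the endpoint value are hypothetical and chosen only to illustrate the split between exact and dynamic fields. They are not copied from the PR.

```yaml
# Hypothetical expected file; field names and values are illustrative only.
traces:
{{- contains .traces }}
# Dynamic per run: only assert that a value is present.
- segmentid: {{ notEmpty .segmentid }}
  # Meaningful domain value: assert it exactly.
  endpointnames:
    - /api/products
  # Dynamic but bounded: assert a numeric condition instead of a value.
  duration: {{ ge .duration 0 }}
  start: {{ notEmpty .start }}
  # Meaningful domain value: assert it exactly.
  iserror: false
{{- end }}
```

Here contains treats the listed item as a subset to find among the actual results, so extra traces and ordering differences do not fail the check, while the exact fields still have to match.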

When writing expected files, Claude often cannot determine which fields are meaningful domain values vs dynamic runtime values. Added explicit workflow: show complete raw output, propose expected file with per-field reasoning, ask developer to confirm or correct.

wu-sheng added the test and AI Assistant labels and removed the test label Mar 20, 2026

wu-sheng added 5 commits March 20, 2026 11:38

Fix healthcheck: use nc instead of bash /dev/tcp for Alpine images

869e693

Revert memory limit changes - no evidence of OOM issues

c2f57b9

wu-sheng added this to the 10.4.0 milestone Mar 20, 2026

wankai123 approved these changes Mar 20, 2026

wu-sheng merged commit 4f70645 into master Mar 20, 2026
186 checks passed

wu-sheng deleted the fix/otlp-traces-e2e-stability branch March 20, 2026 07:17

wu-sheng merged 6 commits into master from fix/otlp-traces-e2e-stability
2 participants