58 changes: 56 additions & 2 deletions .claude/skills/new-monitoring-feature/SKILL.md
@@ -11,6 +11,14 @@ There is one skill per narrow concern. This one is the wiring map.

---

## Working posture

- **Default scope for a new monitoring feature is narrow.** For most new monitoring targets, traces are **re-used as-is** — you do not write or modify a trace analyzer. Metrics and log extensions via OTLP (MAL + LAL) are the common shape. OTLP / Zipkin span-based metrics analysis (`SpanListener`) already exists and only needs a new implementation if your feature introduces genuinely new semantic-convention attributes to extract (e.g. `gen_ai.*`, `mcp.*` for the GenAI layer). Do **not** touch `CommonAnalysisListener` / `RPCAnalysisListener` / `SegmentAnalysisListener` unless the OAP core protocol gains a new `SpanLayer` enum value — details in §2.
- **If you find yourself editing more than this skill suggests, stop and confirm.** Scope creep on trace analysis, span-layer mapping, receiver handlers, or protobuf/proto contracts is a strong signal the framing is wrong. Ask the developer before committing that direction.
- **Verify every new feature locally before pushing to CI.** Compile + checkstyle + license-eye are minimums, but they only tell you the code builds — they do **not** tell you the new MAL rule compiles its closures at startup, the LAL layer:auto dispatch works, the OTLP payload shape matches the MAL labels, the UI template renders against real data, or the layer shows up in `swctl service ly <LAYER>`. Run the new e2e case end-to-end locally against a live OAP (see [run-e2e skill](../run-e2e/SKILL.md)) and fire every verify step with `swctl` by hand at least once. Pushing to CI without a local green run wastes shared CI cycles and slow-loops review.

---

## 0. Register the `Layer` — the feature's entry point

A `Layer` is how OAP slices services / instances / endpoints by data source. **Every new feature needs a new `Layer` enum value.** The UI, storage partitioning, menu navigation, and OAL aggregation all key off it.
@@ -63,6 +71,36 @@ All five DSLs (OAL, MAL, LAL, Hierarchy + SpanListener SPI) compile via ANTLR4 +

SkyWalking OAP ingests traces through two distinct entry points. Pick based on source.

> **Before writing any trace extension — ask "do I actually need one?"**
>
> For **most new layers you do not touch trace analysis at all.** Segments (native or
> OTLP) flow through the existing pipeline, `RPCAnalysisListener.parseExit` produces
> the outbound `ServiceRelation` edges, the backend-side agent produces the inbound
> metrics via its own `parseEntry`, and the new-layer Service / Instance / Endpoint
> entities come from **MAL + LAL**, not from trace analysis. This is how browser
> (`Layer.BROWSER`), iOS (`Layer.IOS`), mobile mini-programs
> (`Layer.WECHAT_MINI_PROGRAM` / `Layer.ALIPAY_MINI_PROGRAM`), and most
> scraped-metric layers work.
>
> Extend trace analysis **only** when the feature genuinely needs per-span
> interpretation that the existing pipeline cannot express:
>
> - **A new `SpanLayer`** — OAP core protobuf has a new `SpanLayer` enum value that
> callers of `identifyServiceLayer` / `identifyRemoteServiceLayer` must dispatch on.
> That means touching `CommonAnalysisListener` / `RPCAnalysisListener`.
> - **A genuinely new signal class not covered by existing listeners** — e.g. the GenAI
> work added `GenAISpanListener` for LLM / MCP instrumentation because OTLP spans
> carry semantic-convention attributes (`gen_ai.*`, `mcp.*`) that no prior listener
> extracted. That is rare.
>
> **Anti-pattern we've been bitten by:** using a componentId mapping in
> `CommonAnalysisListener` to assign a client-side layer to exit-only segments. This
> does not produce inbound metrics anyway (no entry spans → no
> `service_cpm` / `endpoint_cpm` from OAL), and it is orthogonal to the Service entity
> layer (which MAL / LAL sets). Mirror the browser pattern instead: ship `Layer` +
> `component-libraries.yml` + MAL + LAL + dashboards, and leave the listener chain
> alone.

### 2.1 SkyWalking native segments — `AnalysisListener`

**Source protocol**: `apm-protocol/apm-network/src/main/proto/language-agent/Tracing.proto` (`TraceSegmentObject`).
@@ -112,7 +150,9 @@ Return `SpanListenerResult.builder().shouldPersist(false).build()` to prevent a

### 3.1 OAL (for entities already produced by core sources)

`oap-server/server-starter/src/main/resources/oal/core.oal` already defines the standard `service_cpm`, `service_resp_time`, `service_sla`, `endpoint_cpm`, etc. — keyed off the `Service`, `ServiceInstance`, `Endpoint` source scopes. If your new layer emits those entities with the right `Layer`, you get the core metrics for free.

**Who emits those entities matters.** `service_cpm` / `service_resp_time` / `service_sla` / `endpoint_cpm` / `endpoint_resp_time` come from `RPCAnalysisListener.parseEntry` → `callingIn.toService()` / `toEndpoint()`. That only fires on **inbound entry spans**. Client-side / edge layers (browser, iOS, mini-programs) only emit exit spans, so these OAL metrics do **not** populate under those layers — their Service entities are produced by MAL / LAL, and their dashboards use MAL metrics (`meter_<layer>_*`) for load / latency, not `service_cpm` / `service_resp_time`. Don't wire dashboards to OAL-derived metrics that the layer never produces.
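
For reference, the core.oal definitions in play here (reproduced from memory of the shipped file; verify against `core.oal` itself). Each is keyed off the `Service` source, which only `parseEntry` populates:

```
service_cpm = from(Service.*).cpm();
service_resp_time = from(Service.latency).longAvg();
service_sla = from(Service.*).percent(status == true);
```

An exit-only layer never feeds the `Service` source from trace analysis, so all three stay empty no matter how the dashboard is wired.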

Additional OAL rules go in the same file. OAL grammar: `oap-server/oal-grammar/src/main/antlr4/OALParser.g4`. Syntax and examples: [`docs/en/concepts-and-designs/oal.md`](../../../docs/en/concepts-and-designs/oal.md).

@@ -169,6 +209,8 @@ JSON files under `oap-server/server-starter/src/main/resources/ui-initialized-te
- `<feature>-endpoint.json` — per-endpoint dashboard.
- Menu link in `oap-server/server-starter/src/main/resources/ui-initialized-templates/menu.yaml`.

> **Collaborate with the developer on dashboard design.** Dashboard layout and metric selection are product decisions, not derivable from the MAL rules alone. Before authoring the JSON, check in with the developer on: which metric goes on which panel, which percentiles / aggregations to show, what's on the root list vs. per-service vs. per-instance vs. per-endpoint pages, what stays in the Overview tab vs. a sub-tab, widget sizes / ordering, and whether Trace / Log tabs belong at all (depends on whether the feature emits traces and logs). Then **manually set up the feature against a live OAP** — emit a few real OTLP payloads, open each dashboard in the UI, confirm every widget renders non-empty, and walk the drill-down from root → service → instance → endpoint. Only after that eyes-on pass is the dashboard ready to ship.
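
A skeletal root template, for orientation only. Field names follow the general shape of shipped templates and are not guaranteed; copy `ios/ios-root.json` as the actual starting point rather than trusting this sketch:

```json
[
  {
    "id": "ExampleLayer-Root",
    "configuration": {
      "name": "ExampleLayer-Root",
      "isRoot": true,
      "layer": "EXAMPLE_LAYER",
      "entity": "All",
      "children": [
        { "name": "Services", "children": [] }
      ]
    }
  }
]
```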

Details that always bite:

- **Tab chrome eats ~4 rows.** Set outer `Tab.h = content_max_y + 4` or you get an inner scrollbar on top of the page scrollbar.
@@ -227,7 +269,7 @@ Feature-specific extras:
3. **OTLP data hygiene**:
- `spanId` is exactly 16 hex chars; `traceId` is exactly 32. Protobuf-JSON decodes them as base64; other lengths return HTTP 400.
- Setup-step curl loops must fail loudly: `curl -sS -f --retry 30 --retry-delay 5 --retry-connrefused --retry-all-errors --max-time 10 ...` + `set -e`. The pattern `for …; do curl && break || sleep 5; done` silently succeeds when every attempt is connection-refused (the final `sleep` returns 0), and OAP takes ~50 s to start in CI, so every attempt can fail before it is ready.
4. **Verify each case by hand before pushing** and use a published simulator image where the upstream SDK has one. See [`run-e2e` skill §6–§9](../run-e2e/SKILL.md) for verify-query triage, expected-file authoring, and the list of expected-file traps (YAML quoting on colons, nested `contains` fragility, `timeUnixNano: "0"`, setup-step curl loops that swallow failure).
5. **Config dump**: when you modify any `application.yml` default (rule list, handler list, ...), also update `test/e2e-v2/cases/storage/expected/config-dump.yml`.
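
The id-hygiene rule in point 3 is easy to satisfy mechanically: generate ids at the right byte width instead of hand-typing them. A sketch:

```shell
# traceId must be exactly 32 hex chars, spanId exactly 16 --
# protobuf-JSON decodes other lengths as base64 and OAP returns HTTP 400.
TRACE_ID=$(openssl rand -hex 16)   # 16 random bytes -> 32 hex chars
SPAN_ID=$(openssl rand -hex 8)     # 8 random bytes  -> 16 hex chars
echo "traceId=${TRACE_ID} (${#TRACE_ID}) spanId=${SPAN_ID} (${#SPAN_ID})"
```

Interpolate the variables into the OTLP JSON payload at setup time instead of committing literal ids.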

---
@@ -250,6 +292,18 @@ Run in order; each has a dedicated skill:
| Trap | Symptom | Fix |
|---|---|---|
| Layer folder named with hyphens | `ui-initialized-templates/my-layer/*.json` on disk but dashboard empty | Folder must be `Layer.name().toLowerCase()` (underscores). `wechat_mini_program/`, not `wechat-mini-program/` |
| Extending `CommonAnalysisListener` for a client-side layer | Added componentId → layer mapping "to get topology for free"; then `service_cpm` etc. don't populate anyway | Client-side (exit-only) layers don't need trace-analysis changes. Ship `Layer` + `component-libraries.yml` + MAL + LAL. See §2 header for when extending trace analysis is actually justified |
| Dashboards wired to OAL-derived metrics on a client-side layer | Charts stay empty — `service_cpm` / `service_resp_time` / `endpoint_cpm` never populate | Those come from inbound `parseEntry`, which mini-program / browser / iOS don't emit. Use MAL metrics (`meter_<layer>_*`) instead |
| LAL `#` comment in DSL body | OAP boot fails with `extraneous input 'metrics'` at the `#` line | LAL grammar accepts `//` and `/* */` only. No `#` comments |
| LAL bare def-var value access — `tag 'x': myvar` | OAP boot fails with `Cannot resolve expression '[myvar]'` | `LALValueCodegen` only treats def-vars as valid when there's a method chain (`myvar?.foo`). Inline `sourceAttribute(...)` or `tag(...)` in the three value-access sites instead of storing to a def, or add a no-op chain like `myvar.toString()`. Conditions (`if (myvar == ...)`) go through a different codegen path and DO handle bare def refs |
| MAL rule file with multi-doc YAML (`---` separator) | `MALFilterExecutionTest` fails with `expected a single document`; OAP would also fail at startup via `Rules.loadRules` (calls `Yaml.loadAs`, single-doc) | Split into one file per top-level `expSuffix` / `filter` / `metricPrefix` combo. Reference each file by name in `application.yml` (`lalFiles`, `malFiles`, `enabledOtelMetricsRules`) |
| Histogram bucket bounds in ms with default `.histogram()` call | Percentile values come out 1000× inflated (`P50 = 100000` for a 100 ms request) | MAL's default `defaultHistogramBucketUnit = SECONDS` multiplies `le` by 1000 when converting to stored ms. For SDKs that emit ms bounds directly (mini-program, most in-house JS SDKs), use `.histogram("le", TimeUnit.MILLISECONDS)`. `TimeUnit` is registered in `MALCodegenHelper.ENUM_FQCN` so rule YAML can reference it |
| `.rate()` / `.increase()` on a DELTA counter / histogram `_count` | Compute delta-of-delta — nonsense values | OTLP DELTA means each flush is already the per-flush count. MAL's per-minute aggregator sums directly to requests-per-minute. For CUMULATIVE (Prometheus default), use `.rate("PT1M")` for per-sec or `.increase("PT1M")` for per-min; they are **not** equivalent (differ by 60×) |
| MAL bucket-family name with `_histogram` suffix | Rule compiles, metric stays empty | `PrometheusMetricConverter` emits the bucket family as `<metric_name>` (no suffix), with `<metric_name>_count` and `<metric_name>_sum` as the companion families. Reference the bucket family by its bare name in the rule |
| LAL `tag("key") == null` check on a missing tag | Check never triggers — LAL's `tag()` returns empty string `""` for absent tags, not `null`, so conditional abort is a no-op | Guard on both: `tag("key") == null || tag("key") == ""`. Same for `sourceAttribute("key")` |
| MAL rule file with `expSuffix: service(...)` mixing service-scoped and endpoint-scoped rules | Endpoint rules' metrics appear under the service entity (no endpoint_traffic rows created); `swctl endpoint ls` returns empty. `MetricConvert.formatExp:132` appends `(exp).expSuffix` to every rule, so `.endpoint(...)` in the rule's expression is followed by `.service(...)` and the service scope wins. | Follow the APISIX pattern — leave `expSuffix:` empty and explicitly chain `.service(...)` / `.endpoint(...)` on each rule. If every rule in a file is same-scope (e.g. all instance-scoped in `*-instance.yaml`), `expSuffix` is still fine |

| Missing layer-root template | Menu item lands on empty "no dashboard" view | `Layer.vue:41-44` requires a template with `isRoot: true`; ship a `<layer>-root.json` (precedent: `ios/ios-root.json`) |
| New layer not selectable in UI / swctl | `service ly <LAYER>` returns empty | Add the enum in `Layer.java` with a fresh id; see §0 |
| Stale dist in image | `make docker.oap` succeeds but behavior is old | See [`package` skill](../package/SKILL.md); rebuild dist first, then image |

E2E-side authoring traps (log-content YAML quoting, nested `contains` fragility, `timeUnixNano: "0"`, setup-step curl loops that swallow failure) live in the [`run-e2e` skill §9](../run-e2e/SKILL.md).
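
Several of the MAL rows combine into one rule-file shape. A minimal sketch that sidesteps the `expSuffix` scope clash, the ms-bucket inflation, and delta double-rating; metric, label, and prefix names are illustrative, not shipped rules:

```yaml
# Hypothetical rule file -- names are illustrative, not shipped rules.
expSuffix:            # left empty: scope is chained explicitly per rule (APISIX pattern)
metricPrefix: meter_myapp
metricsRules:
  # DELTA counter: each flush is already the per-flush count, so the per-minute
  # aggregation is already requests-per-minute -- no .rate()/.increase() on top.
  - name: request_total
    exp: myapp_request_total.sum(['service']).service(['service'], Layer.GENERAL)
  # SDK emits bucket bounds in milliseconds: override the default SECONDS
  # unit or percentiles come out 1000x inflated.
  - name: request_duration
    exp: myapp_request_duration.sum(['service','le']).histogram('le', TimeUnit.MILLISECONDS).service(['service'], Layer.GENERAL)
```

Note the bucket family is referenced by its bare name (`myapp_request_duration`), not with a `_histogram` suffix.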
Expand Down
21 changes: 21 additions & 0 deletions .claude/skills/run-e2e/SKILL.md
@@ -197,6 +197,27 @@ For a new verify case, the workflow is:
3. Write the expected file. If the response is a list, wrap the items in `{{- contains . }} ... {{- end }}` so ordering and extra actual items don't fail the match.
4. Re-run `e2e verify` alone (the containers are still up from the previous run); iterate on the expected file without rebuilding.

### 9. Expected-file authoring traps

These pass locally but burn CI cycles. Each was learned the hard way.

**Unquoted `content: {{ notEmpty .content }}` with `:` inside the value.** Sim-generated or real log content routinely includes colons (`POST https://api.example.com/cart failed: 500`, `HTTP/1.1 500: Internal Server Error`). Without quoting, the template renders to invalid YAML (snakeyaml parses `failed:` as a nested key) and the *whole log entry* marshals to `nil`. Symptom: diff shows `- nil` at *every* position in the expected logs list vs real maps in actual. Fix: wrap in single quotes — `content: '{{ notEmpty .content }}'`. Single-quoted YAML preserves `:` in the scalar; only fails on embedded `'`. Double quotes also work unless the content has `"`.
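
A sketch of the safe form (surrounding field names are illustrative):

```yaml
# Single-quoted scalar keeps the ':' inside the value; only an embedded '
# would break it.
- content: '{{ notEmpty .content }}'
# Broken form for comparison (renders invalid YAML when content has a colon):
#   content: {{ notEmpty .content }}
```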

**Nested `contains` with multiple per-element pattern assertions against a varied stream.** The template renders the block body once per actual element; when the outer block body has *multiple* inner `contains` patterns asserting specific tag key/value pairs, and only *some* actuals satisfy all the inner patterns, `go-cmp` with `contains` can end up comparing `[rendered_for_A0, nil, nil, ...]` vs `[A0, A1, A2, ...]` and fail despite `contains` being permissive on extras. Specifically: outer `contains .logs` with a single log pattern + inner `contains .tags` asserting two distinct key/value pairs. On a simulator emitting heterogeneous errors (js + promise + ajax + pageNotFound), only a subset satisfy the inner assertion. Passes locally with 1–2 logs, fails in CI with 6+.
- Keep the outer `contains` body lenient: field-shape checks (`notEmpty`, `gt`), one discriminator tag that *every* element in the stream carries.
- Cover per-category assertions via separate labeled-metric verify cases, not inside the nested template.
- Rule of thumb: "at least one log exists with the right layer routing" inside the logs expected; per-category coverage via `meter_*_count{label=X}` verify cases.
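
Putting those rules together, a lenient outer-body sketch (field and tag names are illustrative):

```yaml
# Outer body stays lenient: shape checks plus one discriminator tag that
# every element in the stream carries.
logs:
  {{- contains .logs }}
  - serviceName: {{ notEmpty .serviceName }}
    content: '{{ notEmpty .content }}'
    tags:
      {{- contains .tags }}
      - key: LAYER
        value: EXAMPLE_LAYER
      {{- end }}
  {{- end }}
```

Per-category assertions (js vs promise vs ajax errors) then live in separate labeled-metric verify cases.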

**Hand-crafted OTLP curl payloads drift from real SDK output.** When the upstream SDK ships a published simulator image (mini-program-monitor's `sim-wechat` / `sim-alipay`, browser-client-js sim, etc.), prefer driving the e2e with that image in `MODE=timed` with a bounded `DURATION_MS` over hand-rolling the OTLP JSON. Hand-crafted payloads miss real-world shape issues: delta-vs-cumulative temporality, label-cardinality surprises, stacktrace formatting variance, attribute key names that changed between SDK versions. Pin to a released tag (`v0.4.0`), not `:latest` or HEAD SHA — reproducibility.
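
A sketch of wiring a published simulator into the case's compose file. The image path and env var names here are assumptions (check the simulator's README); only `MODE=timed`, `DURATION_MS`, and released-tag pinning come from this skill:

```yaml
services:
  sim-wechat:
    image: ghcr.io/example/sim-wechat:v0.4.0   # pin a released tag, never :latest
    environment:
      MODE: timed              # bounded run, then exit
      DURATION_MS: "60000"
      OTLP_ENDPOINT: http://oap:4318   # assumed variable name
    depends_on:
      oap:
        condition: service_healthy
```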

**`timeUnixNano: "0"` in an OTLP metric datapoint.** The receiver propagates this into MAL's bucket computation and the metric lands in the 1970 time bucket — `swctl metrics exec` over the "last N minutes" window won't find it. Either use `$(date +%s)000000000` at setup time or omit the field if the receiver accepts "now" as default.
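
A sketch of the setup-time stamping (assumes a `date` that supports `+%s`):

```shell
# Stamp datapoints with "now" in nanoseconds; a literal "0" lands the metric
# in the 1970 bucket and last-N-minutes queries never see it.
TIME_UNIX_NANO="$(date +%s)000000000"   # seconds since epoch + 9 zeros
echo "$TIME_UNIX_NANO"
```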

**Setup-step curl loop with `|| sleep` pattern.** The shell line `for ... do curl && break || sleep 5; done` exits 0 when every attempt is connection-refused, because the final `sleep 5` returns 0. OAP takes ~50 s to start in CI, so all attempts fail before OAP is ready and the setup step silently succeeds with zero traffic ingested. Fix: `curl -sS -f --retry 30 --retry-delay 5 --retry-connrefused --retry-all-errors --max-time 10 ...` + `set -e` at step top.
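
The swallowed failure is reproducible without any network; in this sketch `false` stands in for a refused curl:

```shell
set -e
# Broken pattern: when every attempt fails, the final `sleep` exits 0,
# so the loop -- and the whole setup step -- reports success.
for i in 1 2 3; do
  false && break || sleep 0    # `false` = connection-refused curl
done
echo "loop finished with status $?"   # prints 0: the failure was swallowed

# Robust form: let curl do the retrying and fail loudly on its own:
#   curl -sS -f --retry 30 --retry-delay 5 --retry-connrefused \
#        --retry-all-errors --max-time 10 http://oap:12800/...
```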

**`swctl` flag rejected.** If a verify case uses a flag the pinned `swctl` commit doesn't support (`service ls --layer` vs `service ly`), the whole case fails 20× before CI gives up. Fire each verify query by hand once before pushing (step 6 above).

**Published image cache miss in CI.** `docker compose pull` sometimes hits rate limits or unreachable registries; the test spins until timeout with "dependency failed to start". Look at the CI log for `Error response from daemon: pull access denied` or `manifest unknown`. If you see that, pin a different image tag that's definitely published (check `docker manifest inspect <tag>` locally), not a floating one.

## Common test cases

| Shorthand | Path |
4 changes: 4 additions & 0 deletions .github/workflows/skywalking.yaml
@@ -642,6 +642,10 @@ jobs:
config: test/e2e-v2/cases/zipkin-virtual-genai/e2e.yaml
- name: iOS Monitoring
config: test/e2e-v2/cases/ios/e2e.yaml
- name: WeChat Mini Program Monitoring
config: test/e2e-v2/cases/miniprogram/wechat/e2e.yaml
- name: Alipay Mini Program Monitoring
config: test/e2e-v2/cases/miniprogram/alipay/e2e.yaml

- name: Nginx
config: test/e2e-v2/cases/nginx/e2e.yaml