58 changes: 56 additions & 2 deletions .claude/skills/new-monitoring-feature/SKILL.md
@@ -11,6 +11,14 @@ There is one skill per narrow concern. This one is the wiring map.

---

## Working posture

- **Default scope for a new monitoring feature is narrow.** For most new monitoring targets, traces are **re-used as-is** — you do not write or modify a trace analyzer. Metrics and log extensions via OTLP (MAL + LAL) are the common shape. OTLP / Zipkin span-based metrics analysis (`SpanListener`) already exists and only needs a new implementation if your feature introduces genuinely new semantic-convention attributes to extract (e.g. `gen_ai.*`, `mcp.*` for the GenAI layer). Do **not** touch `CommonAnalysisListener` / `RPCAnalysisListener` / `SegmentAnalysisListener` unless the OAP core protocol gains a new `SpanLayer` enum value — details in §2.
- **If you find yourself editing more than this skill suggests, stop and confirm.** Scope creep on trace analysis, span-layer mapping, receiver handlers, or protobuf/proto contracts is a strong signal the framing is wrong. Ask the developer before committing that direction.
- **Verify every new feature locally before pushing to CI.** Compile + checkstyle + license-eye are minimums, but they only tell you the code builds — they do **not** tell you the new MAL rule compiles its closures at startup, the LAL layer:auto dispatch works, the OTLP payload shape matches the MAL labels, the UI template renders against real data, or the layer shows up in `swctl service ly <LAYER>`. Run the new e2e case end-to-end locally against a live OAP (see [run-e2e skill](../run-e2e/SKILL.md)) and fire every verify step with `swctl` by hand at least once. Pushing to CI without a local green run wastes shared CI cycles and slow-loops review.

---

## 0. Register the `Layer` — the feature's entry point

A `Layer` is how OAP slices services / instances / endpoints by data source. **Every new feature needs a new `Layer` enum value.** The UI, storage partitioning, menu navigation, and OAL aggregation all key off it.
@@ -63,6 +71,36 @@ All five DSLs (OAL, MAL, LAL, Hierarchy + SpanListener SPI) compile via ANTLR4 +

SkyWalking OAP ingests traces through two distinct entry points. Pick based on source.

> **Before writing any trace extension — ask "do I actually need one?"**
>
> For **most new layers you do not touch trace analysis at all.** Segments (native or
> OTLP) flow through the existing pipeline, `RPCAnalysisListener.parseExit` produces
> the outbound `ServiceRelation` edges, the backend-side agent produces the inbound
> metrics via its own `parseEntry`, and the new-layer Service / Instance / Endpoint
> entities come from **MAL + LAL**, not from trace analysis. This is how browser
> (`Layer.BROWSER`), iOS (`Layer.IOS`), mobile mini-programs
> (`Layer.WECHAT_MINI_PROGRAM` / `Layer.ALIPAY_MINI_PROGRAM`), and most
> scraped-metric layers work.
>
> Extend trace analysis **only** when the feature genuinely needs per-span
> interpretation that the existing pipeline cannot express:
>
> - **A new `SpanLayer`** — OAP core protobuf has a new `SpanLayer` enum value that
> callers of `identifyServiceLayer` / `identifyRemoteServiceLayer` must dispatch on.
> That means touching `CommonAnalysisListener` / `RPCAnalysisListener`.
> - **A genuinely new signal class not covered by existing listeners** — e.g. the GenAI
> work added `GenAISpanListener` for LLM / MCP instrumentation because OTLP spans
> carry semantic-convention attributes (`gen_ai.*`, `mcp.*`) that no prior listener
> extracted. That is rare.
>
> **Anti-pattern we've been bitten by:** using a componentId mapping in
> `CommonAnalysisListener` to assign a client-side layer to exit-only segments. This
> does not produce inbound metrics anyway (no entry spans → no
> `service_cpm` / `endpoint_cpm` from OAL), and it is orthogonal to the Service entity
> layer (which MAL / LAL sets). Mirror the browser pattern instead: ship `Layer` +
> `component-libraries.yml` + MAL + LAL + dashboards, and leave the listener chain
> alone.

### 2.1 SkyWalking native segments — `AnalysisListener`

**Source protocol**: `apm-protocol/apm-network/src/main/proto/language-agent/Tracing.proto` (`TraceSegmentObject`).
@@ -112,7 +150,9 @@ Return `SpanListenerResult.builder().shouldPersist(false).build()` to prevent a

### 3.1 OAL (for entities already produced by core sources)

`oap-server/server-starter/src/main/resources/oal/core.oal` already defines the standard `service_cpm`, `service_resp_time`, `service_sla`, `endpoint_cpm`, etc. — keyed off the `Service`, `ServiceInstance`, `Endpoint` source scopes. If your new layer emits those entities with the right `Layer`, you get the core metrics for free.

**Who emits those entities matters.** `service_cpm` / `service_resp_time` / `service_sla` / `endpoint_cpm` / `endpoint_resp_time` come from `RPCAnalysisListener.parseEntry` → `callingIn.toService()` / `toEndpoint()`. That only fires on **inbound entry spans**. Client-side / edge layers (browser, iOS, mini-programs) only emit exit spans, so these OAL metrics do **not** populate under those layers — their Service entities are produced by MAL / LAL, and their dashboards use MAL metrics (`meter_<layer>_*`) for load / latency, not `service_cpm` / `service_resp_time`. Don't wire dashboards to OAL-derived metrics that the layer never produces.
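
For reference, the core.oal definitions in play here (reproduced from memory of the shipped file; verify against `core.oal` itself). Each is keyed off the `Service` source, which only `parseEntry` populates:

```
service_cpm = from(Service.*).cpm();
service_resp_time = from(Service.latency).longAvg();
service_sla = from(Service.*).percent(status == true);
```

An exit-only layer never feeds the `Service` source from trace analysis, so all three stay empty no matter how the dashboard is wired.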

Additional OAL rules go in the same file. OAL grammar: `oap-server/oal-grammar/src/main/antlr4/OALParser.g4`. Syntax and examples: [`docs/en/concepts-and-designs/oal.md`](../../../docs/en/concepts-and-designs/oal.md).

@@ -169,6 +209,8 @@ JSON files under `oap-server/server-starter/src/main/resources/ui-initialized-te
- `<feature>-endpoint.json` — per-endpoint dashboard.
- Menu link in `oap-server/server-starter/src/main/resources/ui-initialized-templates/menu.yaml`.

> **Collaborate with the developer on dashboard design.** Dashboard layout and metric selection are product decisions, not derivable from the MAL rules alone. Before authoring the JSON, check in with the developer on: which metric goes on which panel, which percentiles / aggregations to show, what's on the root list vs. per-service vs. per-instance vs. per-endpoint pages, what stays in the Overview tab vs. a sub-tab, widget sizes / ordering, and whether Trace / Log tabs belong at all (depends on whether the feature emits traces and logs). Then **manually set up the feature against a live OAP** — emit a few real OTLP payloads, open each dashboard in the UI, confirm every widget renders non-empty, and walk the drill-down from root → service → instance → endpoint. Only after that eyes-on pass is the dashboard ready to ship.
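
A skeletal root template, for orientation only. Field names follow the general shape of shipped templates and are not guaranteed; copy `ios/ios-root.json` as the actual starting point rather than trusting this sketch:

```json
[
  {
    "id": "ExampleLayer-Root",
    "configuration": {
      "name": "ExampleLayer-Root",
      "isRoot": true,
      "layer": "EXAMPLE_LAYER",
      "entity": "All",
      "children": [
        { "name": "Services", "children": [] }
      ]
    }
  }
]
```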

Details that always bite:

- **Tab chrome eats ~4 rows.** Set outer `Tab.h = content_max_y + 4` or you get an inner scrollbar on top of the page scrollbar.
@@ -227,7 +269,7 @@ Feature-specific extras:
3. **OTLP data hygiene**:
- `spanId` is exactly 16 hex chars; `traceId` is exactly 32. Protobuf-JSON decodes them as base64; other lengths return HTTP 400.
- Setup-step curl loops must fail loudly: `curl -sS -f --retry 30 --retry-delay 5 --retry-connrefused --retry-all-errors --max-time 10 ...` + `set -e`. The pattern `for …; do curl && break || sleep 5; done` silently succeeds when every attempt is connection-refused (the final `sleep` returns 0), and OAP takes ~50 s to start in CI, so every attempt can fail before it is ready.
4. **Verify each case by hand before pushing** and use a published simulator image where the upstream SDK has one. See [`run-e2e` skill §6–§9](../run-e2e/SKILL.md) for verify-query triage, expected-file authoring, and the list of expected-file traps (YAML quoting on colons, nested `contains` fragility, `timeUnixNano: "0"`, setup-step curl loops that swallow failure).
5. **Config dump**: when you modify any `application.yml` default (rule list, handler list, ...), also update `test/e2e-v2/cases/storage/expected/config-dump.yml`.
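
The id-hygiene rule in point 3 is easy to satisfy mechanically: generate ids at the right byte width instead of hand-typing them. A sketch:

```shell
# traceId must be exactly 32 hex chars, spanId exactly 16 --
# protobuf-JSON decodes other lengths as base64 and OAP returns HTTP 400.
TRACE_ID=$(openssl rand -hex 16)   # 16 random bytes -> 32 hex chars
SPAN_ID=$(openssl rand -hex 8)     # 8 random bytes  -> 16 hex chars
echo "traceId=${TRACE_ID} (${#TRACE_ID}) spanId=${SPAN_ID} (${#SPAN_ID})"
```

Interpolate the variables into the OTLP JSON payload at setup time instead of committing literal ids.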

---
@@ -250,6 +292,18 @@ Run in order; each has a dedicated skill:
| Trap | Symptom | Fix |
|---|---|---|
| Layer folder named with hyphens | `ui-initialized-templates/my-layer/*.json` on disk but dashboard empty | Folder must be `Layer.name().toLowerCase()` (underscores). `wechat_mini_program/`, not `wechat-mini-program/` |
| Extending `CommonAnalysisListener` for a client-side layer | Added componentId → layer mapping "to get topology for free"; then `service_cpm` etc. don't populate anyway | Client-side (exit-only) layers don't need trace-analysis changes. Ship `Layer` + `component-libraries.yml` + MAL + LAL. See §2 header for when extending trace analysis is actually justified |
| Dashboards wired to OAL-derived metrics on a client-side layer | Charts stay empty — `service_cpm` / `service_resp_time` / `endpoint_cpm` never populate | Those come from inbound `parseEntry`, which mini-program / browser / iOS don't emit. Use MAL metrics (`meter_<layer>_*`) instead |
| LAL `#` comment in DSL body | OAP boot fails with `extraneous input 'metrics'` at the `#` line | LAL grammar accepts `//` and `/* */` only. No `#` comments |
| LAL bare def-var value access — `tag 'x': myvar` | OAP boot fails with `Cannot resolve expression '[myvar]'` | `LALValueCodegen` only treats def-vars as valid when there's a method chain (`myvar?.foo`). Inline `sourceAttribute(...)` or `tag(...)` in the three value-access sites instead of storing to a def, or add a no-op chain like `myvar.toString()`. Conditions (`if (myvar == ...)`) go through a different codegen path and DO handle bare def refs |
| MAL rule file with multi-doc YAML (`---` separator) | `MALFilterExecutionTest` fails with `expected a single document`; OAP would also fail at startup via `Rules.loadRules` (calls `Yaml.loadAs`, single-doc) | Split into one file per top-level `expSuffix` / `filter` / `metricPrefix` combo. Reference each file by name in `application.yml` (`lalFiles`, `malFiles`, `enabledOtelMetricsRules`) |
| Histogram bucket bounds in ms with default `.histogram()` call | Percentile values come out 1000× inflated (`P50 = 100000` for a 100 ms request) | MAL's default `defaultHistogramBucketUnit = SECONDS` multiplies `le` by 1000 when converting to stored ms. For SDKs that emit ms bounds directly (mini-program, most in-house JS SDKs), use `.histogram("le", TimeUnit.MILLISECONDS)`. `TimeUnit` is registered in `MALCodegenHelper.ENUM_FQCN` so rule YAML can reference it |
| `.rate()` / `.increase()` on a DELTA counter / histogram `_count` | Compute delta-of-delta — nonsense values | OTLP DELTA means each flush is already the per-flush count. MAL's per-minute aggregator sums directly to requests-per-minute. For CUMULATIVE (Prometheus default), use `.rate("PT1M")` for per-sec or `.increase("PT1M")` for per-min; they are **not** equivalent (differ by 60×) |
| MAL bucket-family name with `_histogram` suffix | Rule compiles, metric stays empty | `PrometheusMetricConverter` emits the bucket family as `<metric_name>` (no suffix), with `<metric_name>_count` and `<metric_name>_sum` as the companion families. Reference the bucket family by its bare name in the rule |
| LAL `tag("key") == null` check on a missing tag | Check never triggers — LAL's `tag()` returns empty string `""` for absent tags, not `null`, so conditional abort is a no-op | Guard on both: `tag("key") == null || tag("key") == ""`. Same for `sourceAttribute("key")` |
| MAL rule file with `expSuffix: service(...)` mixing service-scoped and endpoint-scoped rules | Endpoint rules' metrics appear under the service entity (no endpoint_traffic rows created); `swctl endpoint ls` returns empty. `MetricConvert.formatExp:132` appends `(exp).expSuffix` to every rule, so `.endpoint(...)` in the rule's expression is followed by `.service(...)` and the service scope wins. | Follow the APISIX pattern — leave `expSuffix:` empty and explicitly chain `.service(...)` / `.endpoint(...)` on each rule. If every rule in a file is same-scope (e.g. all instance-scoped in `*-instance.yaml`), `expSuffix` is still fine |

| Missing layer-root template | Menu item lands on empty "no dashboard" view | `Layer.vue:41-44` requires a template with `isRoot: true`; ship a `<layer>-root.json` (precedent: `ios/ios-root.json`) |
| New layer not selectable in UI / swctl | `service ly <LAYER>` returns empty | Add the enum in `Layer.java` with a fresh id; see §0 |
| Stale dist in image | `make docker.oap` succeeds but behavior is old | See [`package` skill](../package/SKILL.md); rebuild dist first, then image |

E2E-side authoring traps (log-content YAML quoting, nested `contains` fragility, `timeUnixNano: "0"`, setup-step curl loops that swallow failure) live in the [`run-e2e` skill §9](../run-e2e/SKILL.md).
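
Several of the MAL rows combine into one rule-file shape. A minimal sketch that sidesteps the `expSuffix` scope clash, the ms-bucket inflation, and delta double-rating; metric, label, and prefix names are illustrative, not shipped rules:

```yaml
# Hypothetical rule file -- names are illustrative, not shipped rules.
expSuffix:            # left empty: scope is chained explicitly per rule (APISIX pattern)
metricPrefix: meter_myapp
metricsRules:
  # DELTA counter: each flush is already the per-flush count, so the per-minute
  # aggregation is already requests-per-minute -- no .rate()/.increase() on top.
  - name: request_total
    exp: myapp_request_total.sum(['service']).service(['service'], Layer.GENERAL)
  # SDK emits bucket bounds in milliseconds: override the default SECONDS
  # unit or percentiles come out 1000x inflated.
  - name: request_duration
    exp: myapp_request_duration.sum(['service','le']).histogram('le', TimeUnit.MILLISECONDS).service(['service'], Layer.GENERAL)
```

Note the bucket family is referenced by its bare name (`myapp_request_duration`), not with a `_histogram` suffix.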
Expand Down
21 changes: 21 additions & 0 deletions .claude/skills/run-e2e/SKILL.md
@@ -197,6 +197,27 @@ For a new verify case, the workflow is:
3. Write the expected file. If the response is a list, wrap the items in `{{- contains . }} ... {{- end }}` so ordering and extra actual items don't fail the match.
4. Re-run `e2e verify` alone (the containers are still up from the previous run); iterate on the expected file without rebuilding.

### 9. Expected-file authoring traps

These pass locally but burn CI cycles. Each was learned the hard way.

**Unquoted `content: {{ notEmpty .content }}` with `:` inside the value.** Sim-generated or real log content routinely includes colons (`POST https://api.example.com/cart failed: 500`, `HTTP/1.1 500: Internal Server Error`). Without quoting, the template renders to invalid YAML (snakeyaml parses `failed:` as a nested key) and the *whole log entry* marshals to `nil`. Symptom: diff shows `- nil` at *every* position in the expected logs list vs real maps in actual. Fix: wrap in single quotes — `content: '{{ notEmpty .content }}'`. Single-quoted YAML preserves `:` in the scalar; only fails on embedded `'`. Double quotes also work unless the content has `"`.
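
A sketch of the safe form (surrounding field names are illustrative):

```yaml
# Single-quoted scalar keeps the ':' inside the value; only an embedded '
# would break it.
- content: '{{ notEmpty .content }}'
# Broken form for comparison (renders invalid YAML when content has a colon):
#   content: {{ notEmpty .content }}
```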

**Nested `contains` with multiple per-element pattern assertions against a varied stream.** The template renders the block body once per actual element; when the outer block body has *multiple* inner `contains` patterns asserting specific tag key/value pairs, and only *some* actuals satisfy all the inner patterns, `go-cmp` with `contains` can end up comparing `[rendered_for_A0, nil, nil, ...]` vs `[A0, A1, A2, ...]` and fail despite `contains` being permissive on extras. Specifically: outer `contains .logs` with a single log pattern + inner `contains .tags` asserting two distinct key/value pairs. On a simulator emitting heterogeneous errors (js + promise + ajax + pageNotFound), only a subset satisfy the inner assertion. Passes locally with 1–2 logs, fails in CI with 6+.
- Keep the outer `contains` body lenient: field-shape checks (`notEmpty`, `gt`), one discriminator tag that *every* element in the stream carries.
- Cover per-category assertions via separate labeled-metric verify cases, not inside the nested template.
- Rule of thumb: "at least one log exists with the right layer routing" inside the logs expected; per-category coverage via `meter_*_count{label=X}` verify cases.
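
Putting those rules together, a lenient outer-body sketch (field and tag names are illustrative):

```yaml
# Outer body stays lenient: shape checks plus one discriminator tag that
# every element in the stream carries.
logs:
  {{- contains .logs }}
  - serviceName: {{ notEmpty .serviceName }}
    content: '{{ notEmpty .content }}'
    tags:
      {{- contains .tags }}
      - key: LAYER
        value: EXAMPLE_LAYER
      {{- end }}
  {{- end }}
```

Per-category assertions (js vs promise vs ajax errors) then live in separate labeled-metric verify cases.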

**Hand-crafted OTLP curl payloads drift from real SDK output.** When the upstream SDK ships a published simulator image (mini-program-monitor's `sim-wechat` / `sim-alipay`, browser-client-js sim, etc.), prefer driving the e2e with that image in `MODE=timed` with a bounded `DURATION_MS` over hand-rolling the OTLP JSON. Hand-crafted payloads miss real-world shape issues: delta-vs-cumulative temporality, label-cardinality surprises, stacktrace formatting variance, attribute key names that changed between SDK versions. Pin to a released tag (`v0.4.0`), not `:latest` or HEAD SHA — reproducibility.
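
A sketch of wiring a published simulator into the case's compose file. The image path and env var names here are assumptions (check the simulator's README); only `MODE=timed`, `DURATION_MS`, and released-tag pinning come from this skill:

```yaml
services:
  sim-wechat:
    image: ghcr.io/example/sim-wechat:v0.4.0   # pin a released tag, never :latest
    environment:
      MODE: timed              # bounded run, then exit
      DURATION_MS: "60000"
      OTLP_ENDPOINT: http://oap:4318   # assumed variable name
    depends_on:
      oap:
        condition: service_healthy
```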

**`timeUnixNano: "0"` in an OTLP metric datapoint.** The receiver propagates this into MAL's bucket computation and the metric lands in the 1970 time bucket — `swctl metrics exec` over the "last N minutes" window won't find it. Either use `$(date +%s)000000000` at setup time or omit the field if the receiver accepts "now" as default.
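
A sketch of the setup-time stamping (assumes a `date` that supports `+%s`):

```shell
# Stamp datapoints with "now" in nanoseconds; a literal "0" lands the metric
# in the 1970 bucket and last-N-minutes queries never see it.
TIME_UNIX_NANO="$(date +%s)000000000"   # seconds since epoch + 9 zeros
echo "$TIME_UNIX_NANO"
```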

**Setup-step curl loop with `|| sleep` pattern.** The shell line `for ... do curl && break || sleep 5; done` exits 0 when every attempt is connection-refused, because the final `sleep 5` returns 0. OAP takes ~50 s to start in CI, so all attempts fail before OAP is ready and the setup step silently succeeds with zero traffic ingested. Fix: `curl -sS -f --retry 30 --retry-delay 5 --retry-connrefused --retry-all-errors --max-time 10 ...` + `set -e` at step top.
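
The swallowed failure is reproducible without any network; in this sketch `false` stands in for a refused curl:

```shell
set -e
# Broken pattern: when every attempt fails, the final `sleep` exits 0,
# so the loop -- and the whole setup step -- reports success.
for i in 1 2 3; do
  false && break || sleep 0    # `false` = connection-refused curl
done
echo "loop finished with status $?"   # prints 0: the failure was swallowed

# Robust form: let curl do the retrying and fail loudly on its own:
#   curl -sS -f --retry 30 --retry-delay 5 --retry-connrefused \
#        --retry-all-errors --max-time 10 http://oap:12800/...
```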

**`swctl` flag rejected.** If a verify case uses a flag the pinned `swctl` commit doesn't support (`service ls --layer` vs `service ly`), the whole case fails 20× before CI gives up. Fire each verify query by hand once before pushing (step 6 above).

**Published image cache miss in CI.** `docker compose pull` sometimes hits rate limits or unreachable registries; the test spins until timeout with "dependency failed to start". Look at the CI log for `Error response from daemon: pull access denied` or `manifest unknown`. If you see that, pin a different image tag that's definitely published (check `docker manifest inspect <tag>` locally), not a floating one.

## Common test cases

| Shorthand | Path |
4 changes: 4 additions & 0 deletions .github/workflows/skywalking.yaml
@@ -642,6 +642,10 @@ jobs:
config: test/e2e-v2/cases/zipkin-virtual-genai/e2e.yaml
- name: iOS Monitoring
config: test/e2e-v2/cases/ios/e2e.yaml
- name: WeChat Mini Program Monitoring
config: test/e2e-v2/cases/miniprogram/wechat/e2e.yaml
- name: Alipay Mini Program Monitoring
config: test/e2e-v2/cases/miniprogram/alipay/e2e.yaml

- name: Nginx
config: test/e2e-v2/cases/nginx/e2e.yaml