Runtime rule hot-update for MAL and LAL #13851

Merged
wu-sheng merged 5 commits into master from feature/runtime-rule-hot-update on Apr 30, 2026

Conversation


@wu-sheng wu-sheng commented Apr 28, 2026

Runtime rule hot-update for MAL and LAL

Operators can now hot-update OTEL MAL, log MAL, telegraf MAL, and LAL rule files
without restarting OAP. A new admin REST surface on port 17128 (off by default)
persists rule changes to management storage; every node in an OAP cluster converges
on the new content within ~30 s. Engine state, cluster Suspend/Resume, BanyanDB
schema-watch fence, and per-backend dropTable lifecycle make the apply end-to-end
safe across structural cutovers.

⚠️ The endpoint has no built-in authentication and is disabled by default.
Enable it only behind a gateway with an IP allow-list and authentication. See
docs/en/security/README.md for the full operator responsibilities.


Catalogs

| Catalog | What it holds |
|---|---|
| otel-rules | OTEL MAL rule YAML files |
| log-mal-rules | Log-derived MAL rule YAML files |
| telegraf-rules | Telegraf MAL rule YAML files |
| lal | LAL rule YAML files |

name mirrors the static filesystem layout — relative path under the catalog root
without extension. Segments [A-Za-z0-9._-]+ joined by /. No leading slash,
no .., no empty segments. Examples: nginx, aws-gateway/gateway-service,
k8s/node.
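The constraints above can be checked client-side before a POST. A hypothetical pre-flight helper (not part of OAP), assuming the documented rules and nothing more:

```shell
# Hypothetical pre-flight check mirroring the documented name rules:
# segments matching [A-Za-z0-9._-]+ joined by /, no leading slash, no "..",
# no empty segments.
valid_rule_name() {
  case "$1" in
    ..|../*|*/..|*/../*) return 1 ;;  # ".." alone would pass the regex (dots are allowed)
  esac
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9._-]+(/[A-Za-z0-9._-]+)*$'
}

for n in nginx aws-gateway/gateway-service k8s/node /bad a//b a/../b; do
  valid_rule_name "$n" && echo "ok  $n" || echo "bad $n"
done
```

The three documented examples pass; the leading slash, empty segment, and `..` cases fail.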


API surface

Write endpoints

POST /runtime/rule/addOrUpdate

Creates or replaces a rule. Body is the raw YAML (same shape you'd ship as a
static file under oap-server/server-starter/src/main/resources/<catalog>/).

curl -X POST --data-binary @vm.yaml \
  -H 'Content-Type: text/plain' \
  'http://OAP:17128/runtime/rule/addOrUpdate?catalog=otel-rules&name=vm'

Optional flags:

  • allowStorageChange=true — required for edits that move storage identity
    (scope change, downsampling change, single/labeled/histogram switch on MAL;
    outputType change or rule-key add/remove on LAL). Drops the existing
    measure's data on BanyanDB; orphans old rows on JDBC/ES.
    Off by default.
  • force=true — recovery flag. Bypasses the byte-identical no_change HTTP
    shortcut so a re-post of known-good content is treated as a fresh apply
    request: the persisted row is re-written and any peers stuck mid-Suspend
    are re-Resumed. Engine state is content-keyed, so a true no-op against a
    healthy node remains a no-op even with this flag.
# Recovery: re-post a known-good payload to unstick a peer + bypass no_change
curl -X POST --data-binary @vm-previous-known-good.yaml \
  'http://OAP:17128/runtime/rule/addOrUpdate?catalog=otel-rules&name=vm&allowStorageChange=true&force=true'

POST /runtime/rule/inactivate

Soft-pause. OAP stops emitting metrics for the rule; the backend measure and
historical data are preserved for a later reactivation. The "off" intent is
durable across reboots: bundled rules on disk are not auto-resurrected while
the inactive runtime override remains (use delete if the operator wants
bundled to take over).

curl -X POST 'http://OAP:17128/runtime/rule/inactivate?catalog=otel-rules&name=vm'

POST /runtime/rule/delete

Removes an INACTIVE row (active rules return 409 requires_inactivate_first).
The OAP-side teardown is uniform; the storage-side effect splits on whether a
bundled YAML exists on disk for (catalog, name):

  • No bundled twin → destructive: backend resource is dropped and the rule
    is fully gone.
  • Bundled twin exists → non-destructive revert: backend resources runtime
    claimed that bundled does NOT claim (or claims at a different shape) are
    dropped; bundled-shared at matching shape is preserved (no data loss for
    measures bundled will reuse). The runtime row is removed; bundled is
    reinstalled into a static: loader on the local node. Peers converge via
    the periodic reconcile.

?mode=revertToBundled is an explicit operator hint that fails with
400 no_bundled_twin when no bundled YAML exists — useful for scripts that
want to fail loudly on assumption mismatch. To kill a bundled rule entirely,
the bundled YAML on disk must be edited or removed (it's part of the OAP
image).

| Backend | After destructive /delete (no twin) | Old data still queryable? |
|---|---|---|
| BanyanDB | Measure / stream group + schema dropped | No |
| Elasticsearch | dropTable no-op (merging index stays) | Yes (until TTL) |
| JDBC (H2/MySQL/PostgreSQL/TiDB/OceanBase) | dropTable no-op (table stays) | Yes (until TTL) |
# Destructive removal of a runtime-only rule:
curl -X POST 'http://OAP:17128/runtime/rule/inactivate?catalog=otel-rules&name=my-custom'
curl -X POST 'http://OAP:17128/runtime/rule/delete?catalog=otel-rules&name=my-custom'

# Revert a bundled rule's runtime override back to the bundled YAML:
curl -X POST 'http://OAP:17128/runtime/rule/inactivate?catalog=otel-rules&name=vm'
curl -X POST 'http://OAP:17128/runtime/rule/delete?catalog=otel-rules&name=vm&mode=revertToBundled'

Read endpoints

GET /runtime/rule

Fetch one rule. Default order: runtime row first, bundled YAML second, 404
otherwise. Raw YAML by default; JSON envelope on Accept: application/json.
Supports ETag and If-None-Match.

?source=bundled reads the on-disk bundled YAML even when a runtime row
exists — useful for "compare runtime vs bundled" / "fetch bundled body for a
revert via addOrUpdate" workflows. Returns 404 not_found when the rule
has no bundled twin.

# raw YAML — runtime first, bundled fallback
curl 'http://OAP:17128/runtime/rule?catalog=otel-rules&name=vm'

# JSON envelope (carries contentHash, source, updateTime, content)
curl -H 'Accept: application/json' \
  'http://OAP:17128/runtime/rule?catalog=otel-rules&name=vm'

# Read bundled body even though a runtime override exists
curl 'http://OAP:17128/runtime/rule?catalog=otel-rules&name=vm&source=bundled'
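Since the endpoint supports ETag / If-None-Match, a poller can skip re-downloading unchanged YAML. A sketch against the placeholder host used throughout this description; the header-parsing pipeline is an illustration, not a shipped tool:

```shell
# First fetch: save the body and capture the ETag from the response headers.
etag=$(curl -s -D - -o vm.yaml \
  'http://OAP:17128/runtime/rule?catalog=otel-rules&name=vm' \
  | awk 'tolower($1) == "etag:" {print $2}' | tr -d '\r')

# Re-poll: prints 304 when the content is unchanged (no body transferred),
# 200 when it changed.
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "If-None-Match: ${etag}" \
  'http://OAP:17128/runtime/rule?catalog=otel-rules&name=vm'
```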

GET /runtime/rule/bundled

Bundled rules for one catalog as JSON. withContent=true (default) ships the
full YAML body; withContent=false omits it. Each item flags whether an
operator override exists.

# every bundled rule, with body
curl 'http://OAP:17128/runtime/rule/bundled?catalog=otel-rules' | jq

# manifest only (lighter; useful for diff vs filesystem)
curl 'http://OAP:17128/runtime/rule/bundled?catalog=otel-rules&withContent=false' | jq

GET /runtime/rule/list

Single JSON envelope {generatedAt, loaderStats, rules} merging stored rules
with this node's local state. Each row carries status, localState,
suspendOrigin, loaderGc, loaderKind (RUNTIME / STATIC / NONE),
loaderName, contentHash, bundled (whether a YAML exists on disk),
bundledContentHash (when bundled=true), updateTime, and
lastApplyError. loaderStats exposes process-wide manager counters
(active and pending). Optional catalog= filters; unknown values return
400 invalid_catalog.

# all rules, all catalogs
curl 'http://OAP:17128/runtime/rule/list' | jq

# only one catalog
curl 'http://OAP:17128/runtime/rule/list?catalog=otel-rules' | jq

# find rules with apply errors
curl 'http://OAP:17128/runtime/rule/list' | jq -c '.rules[] | select(.lastApplyError != "")'

# rules whose runtime override has drifted from the bundled YAML
curl 'http://OAP:17128/runtime/rule/list' \
  | jq -c '.rules[] | select(.bundled == true and .contentHash != .bundledContentHash)'

GET /runtime/rule/dump

Downloads a tar.gz of stored runtime rules + manifest.yaml. Useful for DR
backup or for replaying through addOrUpdate to restore state.

# all catalogs
curl -o rules.tar.gz 'http://OAP:17128/runtime/rule/dump'

# one catalog
curl -o otel.tar.gz 'http://OAP:17128/runtime/rule/dump/otel-rules'

# extract one rule and re-apply
tar -xzf rules.tar.gz runtime-rule-dump/otel-rules/vm.yaml
curl -X POST --data-binary @runtime-rule-dump/otel-rules/vm.yaml \
  'http://OAP:17128/runtime/rule/addOrUpdate?catalog=otel-rules&name=vm&force=true'
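For DR replay, the extract-and-re-apply step can be looped over every rule in the archive. A hypothetical helper (not shipped with OAP) that assumes only the dump layout shown above; with no base URL it just prints what it would restore:

```shell
# Walk runtime-rule-dump/<catalog>/<name>.yaml and re-post every rule.
# Pass e.g. http://OAP:17128 as $2 to actually POST via addOrUpdate.
restore_dump() {
  root=$1; base=${2:-}
  find "$root" -name '*.yaml' | while IFS= read -r f; do
    rel=${f#"$root"/}                   # <catalog>/<name>.yaml
    catalog=${rel%%/*}                  # first segment is the catalog
    name=${rel#*/}; name=${name%.yaml}  # rest is the rule name (may contain /)
    echo "restore ${catalog}/${name}"
    if [ -n "$base" ]; then
      curl -sf -X POST --data-binary "@${f}" \
        "${base}/runtime/rule/addOrUpdate?catalog=${catalog}&name=${name}&force=true"
    fi
  done
}

# Usage: tar -xzf rules.tar.gz && restore_dump runtime-rule-dump http://OAP:17128
```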

Catalog shortcut routes

Implicit catalog — useful when scripting against a single catalog:

| Shortcut path | Resolves to |
|---|---|
| /runtime/mal/otel/{addOrUpdate,inactivate,delete} | catalog=otel-rules |
| /runtime/mal/log/{addOrUpdate,inactivate,delete} | catalog=log-mal-rules |
| /runtime/lal/{addOrUpdate,inactivate,delete} | catalog=lal |
# equivalent to /runtime/rule/addOrUpdate?catalog=otel-rules&name=vm
curl -X POST --data-binary @vm.yaml \
  'http://OAP:17128/runtime/mal/otel/addOrUpdate?name=vm'

telegraf-rules does not have a shortcut; use the canonical /runtime/rule/...
routes.


Response shape

Write endpoints return JSON: {applyStatus, catalog, name, message}. Successful
applies use applyStatus of applied, filter_only_applied, structural_applied,
reactivated, inactivated, static_tombstoned, deleted,
reverted_to_bundled, reverted_to_bundled_partial, or no_change. Failures
map to specific codes (e.g. requires_inactivate_first, no_bundled_twin,
invalid_catalog, invalid_mode, delete_refused, compile_failed,
storage_change_requires_explicit_approval, cluster_not_ready,
forward_unknown_operation, etc.) — full table in
backend-runtime-rule-api.md.
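For scripting, the JSON envelope makes success checks straightforward. A sketch that gates a deploy step on applyStatus; the success set shown mirrors the addOrUpdate outcomes listed above (the host is the usual placeholder):

```shell
status=$(curl -s -X POST --data-binary @vm.yaml \
  'http://OAP:17128/runtime/rule/addOrUpdate?catalog=otel-rules&name=vm' \
  | jq -r '.applyStatus')
case "$status" in
  applied|filter_only_applied|structural_applied|no_change)
    echo "ok: $status" ;;
  *)
    echo "apply failed: $status" >&2; exit 1 ;;
esac
```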


Notices

Security

  • No built-in authentication. Module disabled by default; enable only when
    the network around port 17128 is operator-controlled.
  • Never expose port 17128 to the public internet. Bind to localhost or a
    private interface; reach it through an operator-controlled gateway with
    IP allow-list + authentication.
  • Audit every request. Rule content is arbitrary YAML that compiles into
    the OAP JVM — a malicious rule could exfiltrate data, spike resource use,
    or create metric-name collisions. Treat POST /runtime/rule/* as
    equivalent to shell access on the OAP host.
  • Keep the port off the cluster-external interface even in cluster mode.
    The cluster-internal Suspend / Forward RPCs ride on the cluster bus
    gRPC server (shared with RemoteService / HealthCheck) — that is a
    separate transport from 17128 and follows the same security posture
    as the rest of the cluster bus.

Full security notice: docs/en/security/README.md.

Data-loss guardrail

allowStorageChange=true is an explicit "I accept data loss" affirmation, not
a routine flag — it drops the existing measure's data on BanyanDB and orphans
old rows on JDBC / ES. Prefer a rename (new metric name, new rule name) so the
old data keeps accumulating until TTL and new data starts fresh under a clean
identity.

Consistency model

  • Persist is commit. Once /addOrUpdate returns 200, the cluster will
    converge on that content.
  • Last write wins. Concurrent writes to different nodes serialize on the
    cluster main; the second write wins.
  • Bounded convergence. Healthy structural commits land cluster-wide
    within 30 s (one periodic scan). Aborted commits self-heal within 60 s.
    Filter-only edits land locally in milliseconds and on every other node
    within 30 s.
  • No quorum, no leader election, no two-phase commit. The runtime-rule
    entry in storage is the single source of truth.
  • Samples for an affected metric are dropped during a structural cutover.
    This is by design — the schema is moving and in-flight samples have no
    valid landing.
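The bounded-convergence claim can be observed directly by polling each node's /list until every one reports the expected content hash. A hypothetical wait loop; NODES and WANT are placeholders, and a per-row name field on /list rows is an assumption beyond the documented field list:

```shell
NODES="oap-1:17128 oap-2:17128"   # placeholder: your OAP admin endpoints
WANT="sha256:..."                 # placeholder: contentHash of the write you posted
for i in $(seq 1 12); do          # 12 x 5 s, roughly one 30 s scan plus slack
  pending=0
  for n in $NODES; do
    got=$(curl -s "http://${n}/runtime/rule/list?catalog=otel-rules" \
      | jq -r '.rules[] | select(.name == "vm") | .contentHash')
    [ "$got" = "$WANT" ] || pending=$((pending + 1))
  done
  [ "$pending" -eq 0 ] && echo converged && break
  sleep 5
done
```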

BanyanDB schema-cutover fence

After firing a schema install or drop, OAP waits up to a bounded window
(default 2 s) for every BanyanDB data node to apply the change before
resuming dispatch. Best-effort by design: when all nodes confirm in
time, the 200 OK marks the moment the cluster's data boundary moves
(samples after 200 use the new shape, samples before use the old). On
laggard timeout, OAP logs a warning naming the laggards and proceeds
anyway so a single slow node doesn't wedge the apply.

Persistence + boot-time merge

Runtime rules persist in management storage and survive OAP restart. At boot,
OAP merges bundled rule files (under
oap-server/server-starter/src/main/resources/<catalog>/) with persisted
runtime rules, so the cluster never silently regresses to the bundled
defaults after a restart.

For BanyanDB specifically, schema mismatches at boot are visible, not
silent: if BanyanDB already holds a resource whose shape doesn't match
the declared model (e.g., a rule was edited on disk while OAP was offline),
OAP skips the resource, logs an ERROR with the declared-vs-backend diff,
and continues booting. Re-shape the metric via
POST /runtime/rule/addOrUpdate.

/delete storage semantics differ per backend

For the destructive path (no bundled twin), the OAP-side teardown is uniform;
the on-disk effect is not:

  • BanyanDB drops the measure / stream group + schema → old rows gone.
  • ES / JDBC keep the index / table (append-only TTL contract) → old
    rows queryable until TTL expires.

If you need the data gone immediately on ES / JDBC, drop the table
out-of-band with the storage backend's own tools after /delete returns.

When a bundled twin exists, /delete is non-destructive by default:
runtime-only / shape-mismatched measures are dropped, but bundled-shared
measures at matching shape are preserved (bundled re-registers against
them). To force destructive removal of a bundled rule, edit the bundled
YAML on disk and restart — bundled rules are part of the OAP image, not
data.


Documentation

Tests

  • UTs across the new receiver plugin (engine / cluster / state / apply / rest).
  • E2E under test/e2e-v2/cases/runtime-rule/:
    • mal-storage/{banyandb,elasticsearch,postgresql} — full 10-phase lifecycle
      (CREATE → FILTER_ONLY → STRUCTURAL → DUMP → 4× ILLEGAL → SHAPE-BREAK →
      INACTIVATE → ACTIVATE → DELETE → DUMP) with a per-phase step label so
      verification queries attribute data back to the phase that wrote it.
    • lal — LAL hot-swap with a log-mal aggregation rule; swctl asserts the
      extracted metric carries the swap-flipped step label.
    • cluster — 2-OAP convergence over ZooKeeper.

Dependency bumps

Driven by the new BanyanDB schema-consistency RPCs whose generated validation
code requires the protobuf-java 4.x runtime:

  • gRPC 1.70.0 → 1.80.0
  • protobuf-java 3.25.5 → 4.33.1
  • pgv (protoc-gen-validate) 1.2.1 → 1.3.0
  • Netty 4.2.10.Final → 4.2.12.Final
  • Netty-tcnative 2.0.75 → 2.0.77

Checklist

  • Update the documentation to include this new feature.
  • Tests (UT, IT, E2E) added to verify the new feature.
  • Update the CHANGES log.
  • If it's UI related, attach screenshots below. (no UI changes)

@wu-sheng wu-sheng force-pushed the feature/runtime-rule-hot-update branch from 96bf220 to 53e180e Compare April 28, 2026 08:02
@wu-sheng wu-sheng added core feature Core and important feature. Sometimes, break backwards compatibility. backend OAP backend related. complexity:high Relate to multiple(>4) components of SkyWalking labels Apr 28, 2026
@wu-sheng wu-sheng added this to the 10.5.0 milestone Apr 28, 2026
@wu-sheng wu-sheng force-pushed the feature/runtime-rule-hot-update branch 4 times, most recently from 66f89e6 to d1369a8 Compare April 29, 2026 05:38
Adds /runtime/rule/{addOrUpdate,inactivate,delete,list,bundled,dump,get} on a
new admin port (default 17128, disabled by default) for cluster-wide MAL/LAL
rule management — push, soft-pause, drop, list, dump, single-rule fetch. The
endpoint converges through a single elected main per (catalog, name) with
Suspend/Resume RPCs broadcast to peers across the OAP cluster bus. Rules are
stored in the management storage layer (BanyanDB / Elasticsearch / JDBC) so
hot-updates survive OAP restart — at boot the merged view of bundled YAML and
persisted runtime rows takes effect without a regression to defaults.

The endpoint has no built-in authentication: operators must gateway-protect
it with IP allow-lists and never expose it to the public internet.

Engine model
------------
Three layers, with one boundary between each:

  scheduler (DSLManager + REST handler)  — DSL-agnostic. Lock acquisition,
      cluster Suspend/Resume RPC fan-out, persistence, classloader graveyard,
      cross-file ownership enforcement, tick scheduling, self-heal.
  orchestrators (DSLRuntimeApply, DSLRuntimeUnregister, DSLRuntimeDelete) —
      drive the per-DSL phase pipeline through the engine SPI.
  engines (MalRuleEngine, LalRuleEngine) — DSL-specific. Implement compile /
      verify / commit / rollback / unregister / dropBackend / reloadStatic
      against a per-engine ApplyContext; classify, claimedKeys, storageImpactKeys
      drive cross-DSL routing and the storage-change guardrail.

DSLClassLoaderManager (server-core)
-----------------------------------
Process-wide singleton that owns every per-file RuleClassLoader. Boot-time
bundled rules continue to load into the OAP main classloader (shared); per-
file `static:` loaders only mint after a runtime override is removed and the
bundled YAML must serve again. Loader name format
`<kind>:<catalog>/<rule>@<MMdd-HHmmss>` is observable in stack traces and the
graveyard's INFO/WARN log lines. The manager runs an internal daemon sweeper
that observes phantom-reference collection and warns on retired loaders that
stay alive past the configured threshold (the leak signal).

API surface:
  newBuilder(catalog, rule, kind, hash) — mints a loader for compile; not yet
                                          registered as active.
  commit(loader)                        — promotes to active, returns the
                                          displaced prior so the caller decides
                                          whether to retire.
  retire(loader)                        — graveyard a specific loader for GC
                                          observability.
  dropRuntime(catalog, rule)            — drops + retires the active loader.
  active(catalog, rule), activeCount(), pendingCount() — diagnostics.

The split between newBuilder and commit means a failed compile leaves the
live loader untouched: the failed loader is just garbage-collected, the
manager's active map still points at the previously-serving one.

/inactivate semantics (Design A)
--------------------------------
/inactivate stamps localState=NOT_LOADED. Bundled rules do NOT auto-
resurrect on /inactivate even when a bundled twin exists on disk — the
operator's "off" intent is preserved across reboots. To bring bundled back,
the operator runs /delete (drops the row, gone-keys reconcile reloads
bundled) or /addOrUpdate (with bundled YAML or their own).

/delete semantics
-----------------
Default mode: removes the runtime row.

  No bundled twin    — destructive cascade fires (BanyanDB measure / ES
                       index / JDBC table dropped). Rule fully gone.
  Bundled twin exists — non-destructive: backend resources runtime claimed
                       that bundled doesn't (or claims at different shape)
                       are dropped; bundled-shared at matching shape is
                       preserved. The runtime row is removed; bundled is
                       reinstalled into a `static:` loader synchronously on
                       the local node. Peers converge via gone-keys reconcile
                       on their next tick.

?mode=revertToBundled is an explicit operator hint that requires a bundled
twin (returns 400 no_bundled_twin when none exists) — useful for scripts
that want to fail loudly on assumption mismatch.

REST surface
------------
* /list returns a single JSON envelope {generatedAt, loaderStats, rules}
  (was NDJSON). Each row carries status, localState, suspendOrigin, loaderGc,
  loaderKind (RUNTIME/STATIC/NONE), loaderName, contentHash, bundled,
  bundledContentHash, updateTime, lastApplyError. status=BUNDLED replaces
  the prior STATIC for bundled-only rules.
* /get accepts ?source=bundled to read bundled YAML even when a runtime row
  exists — closes the "compare runtime to bundled" gap for editor flows.
* All JSON build sites use Gson.

Catalog enum at REST boundary
-----------------------------
The catalog query parameter parses to org.apache.skywalking.oap.server.core
.classloader.Catalog at the REST boundary. RuntimeRuleService's public
methods (addOrUpdate / inactivate / delete / get / listBundled / dumpCatalog)
take Catalog; unknown wire values return 400 invalid_catalog uniformly.
Internal helpers convert via getWireName() at DAO / cluster-RPC edges.

DeleteMode enum
---------------
?mode= parses to DeleteMode at the REST boundary. The string never leaves
the handler.

ForwardTarget interface removed
-------------------------------
Single production implementation (RuntimeRuleRestHandler) and zero test
fakes. The cluster gRPC service now references RuntimeRuleService directly;
the REST handler is left as a pure transport adapter (route bindings + parameter
parsing). The Result POJO becomes RuntimeRuleService.ForwardResult.

Cluster
-------
Suspend / Resume / Forward RPCs over the cluster gRPC bus. Single-main
routing (deterministic mainFor(catalog, name)) with REST forward-to-main.
Self-heal sweeps SUSPENDED bundles whose main crashed mid-apply (60 s
default).

Storage
-------
RuntimeRuleManagementDAO — per-backend upsert / read / delete on the rule
rows. Implementations for BanyanDB / ES / JDBC. /inactivate runs under
StorageManipulationOpt.localCacheOnly so the backend measure and history
stay; /delete fires the destructive cascade unless a bundled twin makes
delta-drop the right path.

Per-rule lifecycle docs
-----------------------
docs/en/setup/backend/backend-runtime-rule-api.md walks the full operator
surface: routes, applyStatus matrix, per-row status decoding (status x
loaderKind x bundled), reading bundled-vs-runtime YAML, consistency model.

E2E
---
test/e2e-v2/cases/runtime-rule/ covers the lifecycle on BanyanDB / ES /
JDBC, plus a 2-node cluster scenario (Suspend/Resume + main routing) and
the LAL pipeline. Verified end-to-end on BanyanDB locally:
CREATE → UPDATE-FILTER → UPDATE-STRUCTURAL → DUMP → 4× illegal →
SHAPE-BREAK → INACTIVATE → ACTIVATE → DELETE → DUMP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wu-sheng wu-sheng force-pushed the feature/runtime-rule-hot-update branch from d1369a8 to 12226e5 Compare April 29, 2026 05:42
…d flow

Rename `StorageManipulationOpt` factory methods + `Mode` constants throughout core
and consumers (`fullInstall` → `withSchemaChange`, `localCacheOnly` →
`withoutSchemaChange`, `localCacheVerify` → `verifySchemaOnly`, `createIfAbsent` →
`schemaCreateIfAbsent`).

Rename `Kind.STATIC` → `Kind.BUNDLED` to align with `Status.BUNDLED`. Loader name
prefix flips from `static:` to `bundled:` so diagnostics match the vocabulary.

Rename runtime-rule SPI: `loadStaticRuleFile` → `recordBundledClaims` (it stamps
claim metadata, not load), `reloadStatic` → `installBundled`, `dropBackend` →
`installRuntime` (purpose: install runtime DSL locally for delta computation), and
`DSLRuntimeApply.applyInline` → `apply`. Thread `Kind` through
`RuleEngine.compile`, `DSLRuntimeApply.apply`, and `compileAndVerify`.

Unify `/delete?mode=revertToBundled` into a single `DSLRuntimeDelete.revertToBundled`
method that runs the standard apply pipeline against the bundled YAML: install
runtime locally → `apply(bundled, STRUCTURAL, BUNDLED, withSchemaChange)` so the
commit's delta drops runtime-only metrics and installs bundled-only ones → reset
state to boot-seeded. Eliminates the prior re-register-then-drop dance and reuses
the same code path operators already exercise via `/addOrUpdate`.

Default `/delete` no longer drops backend schema; the row is removed and any
backend resource is left as an inert artefact (matches bundled-rule deletion
semantics on disk). The schema-change moment lives only on the explicit
`?mode=revertToBundled` path.

Fix StaticRuleLoader.loadAll: now uses `rules.compute` to overlay bundled content
and RUNNING state on the engine-installed Applied (`putIfAbsent` was a no-op
because `recordBundledClaims` had already created the entry, so bundled-only
rules were missing content/state for `/list`, suspend, and first-edit classify).

Fix RuleSync.cleanupGoneKeys: pass `withoutSchemaChange` unconditionally so a
peer-promoted-to-main node cannot drop the backend during gone-keys cleanup
(contradicting the operator-facing contract that default `/delete` preserves
backend resources).

Fix revertToBundled rollback: when bundled apply fails, unregister the step-1
runtime install so local state matches the persisted INACTIVE row (previously
left runtime serving silently after a failed revert).

Add `requires_revert_to_bundled` 409 response: default `/delete` against a rule
with a bundled YAML twin is refused so letting bundled silently take over the
`(catalog, name)` requires an explicit operator decision. Reword
`requires_inactivate_first` and `revert_to_bundled_failed` responses to match
new behavior.

Documentation: update `backend-runtime-rule-api.md` (status table, error codes,
loaderKind values, `/delete` storage semantics per backend),
`runtime-rule-hot-update.md` design doc (four `/delete` paths spelled out), and
`changes.md` changelog entry.
@wu-sheng wu-sheng requested a review from wankai123 April 29, 2026 16:00
application.yml gained a new receiver-runtime-rule provider block (port 17128,
disabled by default, no authentication). configuration-vocabulary.md was missing
this entry — add it with the nine env-var-backed knobs (selector, REST host /
port / context path / idle timeout / accept queue / max header size, reconciler
interval, self-heal threshold).
Reviewer feedback: 'telemetry' previously read as numeric metrics + response
times only, but the same trust model applies to log lines — and log payloads
are far more likely to carry attacker-controllable text (URIs, headers,
exception messages from poisoned input) than numeric samples.

- Rewrite the trust paragraph to say 'metrics, traces, and logs' explicitly
  and call out log data as the most common XSS/RCE vector landing in OAP/UI.
- Add an explicit policy item recommending operators build a gateway /
  sidecar / service-mesh validation layer between agents and OAP. Several
  security vendors ship this; OAP does not validate telemetry itself.
Reviewer feedback (apache/skywalking-mailing-list, 2026-04-29): the existing
'all telemetry data should be validated' wording read as numeric-metric-only
to some readers. Make explicit that the validation contract covers every
field of every category — metrics (names + label keys + values), traces
(span names / tags / span logs / endpoints), logs (body + structured
fields), profiling results, HTTP capture/debugging dumps, and any future
telemetry surface.

Add the operator-facing recommendation to deploy a gateway / sidecar /
service-mesh validation layer between agents and OAP as a security
enhancement (several security vendors ship this; OAP does not validate
telemetry itself).

Frame the bullet list as examples, not an enumeration: the rule is
'validate every field,' not 'validate the ones we enumerated here.'
@wu-sheng wu-sheng merged commit 7754e3e into master Apr 30, 2026
431 of 437 checks passed
@wu-sheng wu-sheng deleted the feature/runtime-rule-hot-update branch April 30, 2026 01:06