Runtime rule hot-update for MAL and LAL #13851
Merged
Adds /runtime/rule/{addOrUpdate,inactivate,delete,list,bundled,dump,get} on a
new admin port (default 17128, disabled by default) for cluster-wide MAL/LAL
rule management — push, soft-pause, drop, list, dump, single-rule fetch. The
endpoint converges through a single elected main per (catalog, name) with
Suspend/Resume RPCs broadcast to peers across the OAP cluster bus. Rules are
stored in the management storage layer (BanyanDB / Elasticsearch / JDBC) so
hot-updates survive OAP restart — at boot the merged view of bundled YAML and
persisted runtime rows takes effect without a regression to defaults.
The endpoint has no built-in authentication: operators must gateway-protect
it with IP allow-lists and never expose it to the public internet.
Engine model
------------
Three layers, with one boundary between each:
* scheduler (DSLManager + REST handler): DSL-agnostic. Lock acquisition,
  cluster Suspend/Resume RPC fan-out, persistence, classloader graveyard,
  cross-file ownership enforcement, tick scheduling, self-heal.
* orchestrators (DSLRuntimeApply, DSLRuntimeUnregister, DSLRuntimeDelete):
  drive the per-DSL phase pipeline through the engine SPI.
* engines (MalRuleEngine, LalRuleEngine): DSL-specific. Implement compile /
  verify / commit / rollback / unregister / dropBackend / reloadStatic
  against a per-engine ApplyContext; classify, claimedKeys, storageImpactKeys
  drive cross-DSL routing and the storage-change guardrail.
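To make the orchestrator/engine boundary concrete, here is a minimal Python sketch of a phase pipeline driving an engine through compile, verify, commit, and rollback. It is an illustration only: `RecordingEngine` and `apply_rule` are hypothetical names, and the real orchestrators (Java) run more phases than shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecordingEngine:
    """Toy engine (hypothetical) that records which phases were invoked."""
    fail_on: str = ""
    calls: List[str] = field(default_factory=list)

    def _step(self, phase: str) -> None:
        self.calls.append(phase)
        if phase == self.fail_on:
            raise ValueError(f"{phase} failed")

    def compile(self, content: str) -> str:
        self._step("compile")
        return content

    def verify(self, compiled: str) -> None:
        self._step("verify")

    def commit(self, compiled: str) -> None:
        self._step("commit")

    def rollback(self, compiled: str) -> None:
        self.calls.append("rollback")


def apply_rule(engine: RecordingEngine, content: str) -> str:
    """Orchestrator sketch: the engine owns the DSL work, this owns the ordering."""
    try:
        compiled = engine.compile(content)
    except ValueError:
        return "compile_failed"  # nothing was installed, so nothing to roll back
    try:
        engine.verify(compiled)
        engine.commit(compiled)
    except ValueError:
        engine.rollback(compiled)  # undo whatever the failed phase left behind
        return "failed"
    return "applied"
```

The orchestrator never inspects the rule content; swapping in a different engine changes what each phase does but not the order in which phases run.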
DSLClassLoaderManager (server-core)
-----------------------------------
Process-wide singleton that owns every per-file RuleClassLoader. Boot-time
bundled rules continue to load into the OAP main classloader (shared); per-
file `static:` loaders only mint after a runtime override is removed and the
bundled YAML must serve again. Loader name format
`<kind>:<catalog>/<rule>@<MMdd-HHmmss>` is observable in stack traces and the
graveyard's INFO/WARN log lines. The manager runs an internal daemon sweeper
that observes phantom-reference collection and warns on retired loaders that
stay alive past the configured threshold (the leak signal).
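The loader name format can be illustrated with a small Python sketch that builds and splits names of the form `<kind>:<catalog>/<rule>@<MMdd-HHmmss>`. The helper names are hypothetical, and the lowercase `runtime` kind prefix is an assumption (only `static:` appears in the text above).

```python
from datetime import datetime


def loader_name(kind: str, catalog: str, rule: str, minted_at: datetime) -> str:
    """Format <kind>:<catalog>/<rule>@<MMdd-HHmmss>."""
    return f"{kind}:{catalog}/{rule}@{minted_at.strftime('%m%d-%H%M%S')}"


def parse_loader_name(name: str):
    """Split a loader name back into (kind, catalog, rule, stamp)."""
    kind, rest = name.split(":", 1)
    path, stamp = rest.rsplit("@", 1)
    catalog, rule = path.split("/", 1)  # rule names may themselves contain "/"
    return kind, catalog, rule, stamp
```

Splitting the catalog on the first `/` only keeps nested rule names such as `k8s/node` intact.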
API surface:
* newBuilder(catalog, rule, kind, hash): mints a loader for compile; not yet
  registered as active.
* commit(loader): promotes to active and returns the displaced prior so the
  caller decides whether to retire it.
* retire(loader): graveyards a specific loader for GC observability.
* dropRuntime(catalog, rule): drops and retires the active loader.
* active(catalog, rule), activeCount(), pendingCount(): diagnostics.
The split between newBuilder and commit means a failed compile leaves the
live loader untouched: the failed loader is just garbage-collected, the
manager's active map still points at the previously-serving one.
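The newBuilder/commit split can be sketched in Python as a two-phase registry. This is a simplified model, not the Java implementation: loaders are plain dicts, `LoaderManager` is a hypothetical name, and the graveyard is a list holding strong references, whereas the real manager observes collection through phantom references.

```python
class LoaderManager:
    """Toy two-phase registry: minting a loader does not make it active."""

    def __init__(self):
        self._active = {}    # (catalog, rule) -> loader currently serving
        self._retired = []   # graveyard; simplified to a plain list here

    def new_builder(self, catalog, rule, kind, content_hash):
        # Minted for compile only; the active map is untouched.
        return {"key": (catalog, rule), "kind": kind, "hash": content_hash}

    def commit(self, loader):
        # Promote to active and hand back the displaced prior, if any.
        prior = self._active.get(loader["key"])
        self._active[loader["key"]] = loader
        return prior

    def retire(self, loader):
        self._retired.append(loader)

    def drop_runtime(self, catalog, rule):
        loader = self._active.pop((catalog, rule), None)
        if loader is not None:
            self.retire(loader)
        return loader

    def active(self, catalog, rule):
        return self._active.get((catalog, rule))

    def pending_count(self):
        return len(self._retired)
```

A caller that mints a loader and then hits a compile error simply never calls `commit`, so `active(...)` keeps returning the previously serving loader.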
/inactivate semantics (Design A)
--------------------------------
/inactivate stamps localState=NOT_LOADED. Bundled rules do NOT auto-
resurrect on /inactivate even when a bundled twin exists on disk — the
operator's "off" intent is preserved across reboots. To bring bundled back,
the operator runs /delete (drops the row, gone-keys reconcile reloads
bundled) or /addOrUpdate (with bundled YAML or their own).
/delete semantics
-----------------
Default mode: removes the runtime row.
* No bundled twin: destructive cascade fires (BanyanDB measure / ES
  index / JDBC table dropped). Rule fully gone.
* Bundled twin exists: non-destructive. Backend resources that runtime
  claimed and bundled doesn't (or claims at a different shape) are dropped;
  bundled-shared resources at matching shape are preserved. The runtime row
  is removed; bundled is reinstalled into a `static:` loader synchronously
  on the local node. Peers converge via gone-keys reconcile on their next
  tick.
?mode=revertToBundled is an explicit operator hint that requires a bundled
twin (returns 400 no_bundled_twin when none exists) — useful for scripts
that want to fail loudly on assumption mismatch.
REST surface
------------
* /list returns a single JSON envelope {generatedAt, loaderStats, rules}
(was NDJSON). Each row carries status, localState, suspendOrigin, loaderGc,
loaderKind (RUNTIME/STATIC/NONE), loaderName, contentHash, bundled,
bundledContentHash, updateTime, lastApplyError. status=BUNDLED replaces
the prior STATIC for bundled-only rules.
* /get accepts ?source=bundled to read bundled YAML even when a runtime row
exists — closes the "compare runtime to bundled" gap for editor flows.
* All JSON build sites use Gson.
Catalog enum at REST boundary
-----------------------------
The catalog query parameter parses to org.apache.skywalking.oap.server.core
.classloader.Catalog at the REST boundary. RuntimeRuleService's public
methods (addOrUpdate / inactivate / delete / get / listBundled / dumpCatalog)
take Catalog; unknown wire values return 400 invalid_catalog uniformly.
Internal helpers convert via getWireName() at DAO / cluster-RPC edges.
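As a rough illustration of parsing at the boundary, the Python sketch below maps a wire string to an enum once and returns a uniform `400 invalid_catalog` otherwise. The wire values are the four catalogs named in the PR description; the member names and the `parse_catalog` helper are hypothetical and do not mirror the Java `Catalog` class.

```python
from enum import Enum


class Catalog(Enum):
    OTEL_RULES = "otel-rules"
    LOG_MAL_RULES = "log-mal-rules"
    TELEGRAF_RULES = "telegraf-rules"
    LAL = "lal"

    @property
    def wire_name(self) -> str:
        """String form used at DAO / RPC edges."""
        return self.value


def parse_catalog(raw):
    """Return (catalog, None) on success or (None, (status, code)) on failure."""
    try:
        return Catalog(raw), None
    except ValueError:
        return None, (400, "invalid_catalog")
```

Everything past the handler then takes `Catalog` rather than a string, so an unknown value cannot reach the service layer.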
DeleteMode enum
---------------
?mode= parses to DeleteMode at the REST boundary. The string never leaves
the handler.
ForwardTarget interface removed
-------------------------------
Single production implementation (RuntimeRuleRestHandler) and zero test
fakes. The cluster gRPC service now references RuntimeRuleService directly;
the REST handler is left as a pure transport adapter (route bindings + parameter
parsing). The Result POJO becomes RuntimeRuleService.ForwardResult.
Cluster
-------
Suspend / Resume / Forward RPCs over the cluster gRPC bus. Single-main
routing (deterministic mainFor(catalog, name)) with REST forward-to-main.
Self-heal sweeps SUSPENDED bundles whose main crashed mid-apply (60 s
default).
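One way to get a deterministic main per `(catalog, name)` is to hash the key onto a sorted node list, as in the Python sketch below. This is an assumption for illustration: the commit does not describe how `mainFor` is actually computed, and `main_for` is a hypothetical name.

```python
import hashlib


def main_for(catalog: str, name: str, nodes):
    """Pick the same node for a given (catalog, name) on every caller."""
    if not nodes:
        raise ValueError("no cluster nodes available")
    ordered = sorted(nodes)  # every node must see the same ordering
    digest = hashlib.sha256(f"{catalog}/{name}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]
```

Because the choice depends only on the key and the membership list, any node receiving a REST write can compute the main locally and forward to it without coordination.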
Storage
-------
RuntimeRuleManagementDAO — per-backend upsert / read / delete on the rule
rows. Implementations for BanyanDB / ES / JDBC. /inactivate runs under
StorageManipulationOpt.localCacheOnly so the backend measure and history
stay; /delete fires the destructive cascade unless a bundled twin makes
delta-drop the right path.
Per-rule lifecycle docs
-----------------------
docs/en/setup/backend/backend-runtime-rule-api.md walks the full operator
surface: routes, applyStatus matrix, per-row status decoding (status x
loaderKind x bundled), reading bundled-vs-runtime YAML, consistency model.
E2E
---
test/e2e-v2/cases/runtime-rule/ covers the lifecycle on BanyanDB / ES /
JDBC, plus a 2-node cluster scenario (Suspend/Resume + main routing) and
the LAL pipeline. Verified end-to-end on BanyanDB locally:
CREATE → UPDATE-FILTER → UPDATE-STRUCTURAL → DUMP → 4× illegal →
SHAPE-BREAK → INACTIVATE → ACTIVATE → DELETE → DUMP.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d flow

Rename `StorageManipulationOpt` factory methods + `Mode` constants throughout core and consumers (`fullInstall` → `withSchemaChange`, `localCacheOnly` → `withoutSchemaChange`, `localCacheVerify` → `verifySchemaOnly`, `createIfAbsent` → `schemaCreateIfAbsent`). Rename `Kind.STATIC` → `Kind.BUNDLED` to align with `Status.BUNDLED`. Loader name prefix flips from `static:` to `bundled:` so diagnostics match the vocabulary.

Rename runtime-rule SPI: `loadStaticRuleFile` → `recordBundledClaims` (it stamps claim metadata, not load), `reloadStatic` → `installBundled`, `dropBackend` → `installRuntime` (purpose: install runtime DSL locally for delta computation), and `DSLRuntimeApply.applyInline` → `apply`. Thread `Kind` through `RuleEngine.compile`, `DSLRuntimeApply.apply`, and `compileAndVerify`.

Unify `/delete?mode=revertToBundled` into a single `DSLRuntimeDelete.revertToBundled` method that runs the standard apply pipeline against the bundled YAML: install runtime locally → `apply(bundled, STRUCTURAL, BUNDLED, withSchemaChange)` so the commit's delta drops runtime-only metrics and installs bundled-only ones → reset state to boot-seeded. Eliminates the prior re-register-then-drop dance and reuses the same code path operators already exercise via `/addOrUpdate`.

Default `/delete` no longer drops backend schema; the row is removed and any backend resource is left as an inert artefact (matches bundled-rule deletion semantics on disk). The schema-change moment lives only on the explicit `?mode=revertToBundled` path.

Fix StaticRuleLoader.loadAll: now uses `rules.compute` to overlay bundled content and RUNNING state on the engine-installed Applied (`putIfAbsent` was a no-op because `recordBundledClaims` had already created the entry, so bundled-only rules were missing content/state for `/list`, suspend, and first-edit classify).

Fix RuleSync.cleanupGoneKeys: pass `withoutSchemaChange` unconditionally so a peer-promoted-to-main node cannot drop the backend during gone-keys cleanup (which would contradict the operator-facing contract that default `/delete` preserves backend resources).

Fix revertToBundled rollback: when bundled apply fails, unregister the step-1 runtime install so local state matches the persisted INACTIVE row (previously left runtime serving silently after a failed revert).

Add `requires_revert_to_bundled` 409 response: default `/delete` against a rule with a bundled YAML twin is refused so letting bundled silently take over the `(catalog, name)` requires an explicit operator decision. Reword `requires_inactivate_first` and `revert_to_bundled_failed` responses to match new behavior.

Documentation: update `backend-runtime-rule-api.md` (status table, error codes, loaderKind values, `/delete` storage semantics per backend), `runtime-rule-hot-update.md` design doc (four `/delete` paths spelled out), and `changes.md` changelog entry.
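The refusal rules this commit describes for `/delete` can be summarized as a small decision function. The Python sketch below is a reading of the text, not the handler's code: `decide_delete` is a hypothetical name, the order in which the checks run is an assumption, and so is applying `requires_inactivate_first` to both modes.

```python
def decide_delete(row_status: str, has_bundled_twin: bool, mode=None):
    """Return (http_status, code) for a /delete request.

    Assumption: checks run in this order; the real handler may differ.
    """
    if mode not in (None, "revertToBundled"):
        return 400, "invalid_mode"
    if row_status != "INACTIVE":
        return 409, "requires_inactivate_first"
    if mode == "revertToBundled":
        if not has_bundled_twin:
            return 400, "no_bundled_twin"
        return 200, "reverted_to_bundled"
    if has_bundled_twin:
        # Default /delete must not let bundled silently take over.
        return 409, "requires_revert_to_bundled"
    return 200, "deleted"
```

The point of the 409 on the default path is that handing `(catalog, name)` back to the bundled YAML is the only case that changes schema, so it has to be requested explicitly.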
application.yml gained a new receiver-runtime-rule provider block (port 17128, disabled by default, no authentication). configuration-vocabulary.md was missing this entry — add it with the nine env-var-backed knobs (selector, REST host / port / context path / idle timeout / accept queue / max header size, reconciler interval, self-heal threshold).
Reviewer feedback: 'telemetry' previously read as numeric metrics + response times only, but the same trust model applies to log lines, and log payloads are far more likely to carry attacker-controllable text (URIs, headers, exception messages from poisoned input) than numeric samples.
- Rewrite the trust paragraph to say 'metrics, traces, and logs' explicitly and call out log data as the most common XSS/RCE vector landing in OAP/UI.
- Add an explicit policy item recommending operators build a gateway / sidecar / service-mesh validation layer between agents and OAP. Several security vendors ship this; OAP does not validate telemetry itself.
Reviewer feedback (apache/skywalking-mailing-list, 2026-04-29): the existing 'all telemetry data should be validated' wording read as numeric-metric-only to some readers. Make explicit that the validation contract covers every field of every category — metrics (names + label keys + values), traces (span names / tags / span logs / endpoints), logs (body + structured fields), profiling results, HTTP capture/debugging dumps, and any future telemetry surface. Add the operator-facing recommendation to deploy a gateway / sidecar / service-mesh validation layer between agents and OAP as a security enhancement (several security vendors ship this; OAP does not validate telemetry itself). Frame the bullet list as examples, not an enumeration: the rule is 'validate every field,' not 'validate the ones we enumerated here.'
wankai123 approved these changes on Apr 30, 2026
Runtime rule hot-update for MAL and LAL
Operators can now hot-update OTEL MAL, log MAL, telegraf MAL, and LAL rule files
without restarting OAP. A new admin REST surface on port 17128 (off by default)
persists rule changes to management storage; every node in an OAP cluster converges
on the new content within ~30 s. Engine state, cluster Suspend/Resume, BanyanDB
schema-watch fence, and per-backend `dropTable` lifecycle make the apply
end-to-end safe across structural cutovers.
Catalogs
* `otel-rules`
* `log-mal-rules`
* `telegraf-rules`
* `lal`

`name` mirrors the static filesystem layout: relative path under the catalog root
without extension. Segments `[A-Za-z0-9._-]+` joined by `/`. No leading slash,
no `..`, no empty segments. Examples: `nginx`, `aws-gateway/gateway-service`, `k8s/node`.

API surface
Write endpoints

POST /runtime/rule/addOrUpdate

Creates or replaces a rule. Body is the raw YAML (same shape you'd ship as a
static file under `oap-server/server-starter/src/main/resources/<catalog>/`).
Optional flags:
* `allowStorageChange=true`: required for edits that move storage identity
  (scope change, downsampling change, single/labeled/histogram switch on MAL;
  outputType change or rule-key add/remove on LAL). Drops the existing
  measure's data on BanyanDB; orphans old rows on JDBC/ES. Off by default.
* `force=true`: recovery flag. Bypasses the byte-identical `no_change` HTTP
  shortcut so a re-post of known-good content is treated as a fresh apply
  request: the persisted row is re-written and any peers stuck mid-Suspend
  are re-Resumed. Engine state is content-keyed, so a true no-op against a
  healthy node remains a no-op even with this flag.
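For scripting against `addOrUpdate`, a client may want to check the rule name before sending it. The Python sketch below validates a name against the segment rules given under Catalogs and assembles the request URL with the two optional flags. The helper names are hypothetical, and passing `catalog` and `name` as query parameters is inferred from the `inactivate` curl example rather than stated for this endpoint.

```python
import re
from urllib.parse import urlencode

_SEGMENT = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_rule_name(name: str) -> bool:
    """Segments of [A-Za-z0-9._-]+ joined by '/', no leading slash, no '..'."""
    if not name or name.startswith("/"):
        return False
    for segment in name.split("/"):
        # ".." passes the character class, so it needs its own check.
        if segment == ".." or not _SEGMENT.fullmatch(segment):
            return False
    return True


def add_or_update_url(base, catalog, name, allow_storage_change=False, force=False):
    """Build the addOrUpdate URL; the YAML body is posted separately."""
    if not is_valid_rule_name(name):
        raise ValueError(f"invalid rule name: {name!r}")
    params = {"catalog": catalog, "name": name}
    if allow_storage_change:
        params["allowStorageChange"] = "true"
    if force:
        params["force"] = "true"
    return f"{base}/runtime/rule/addOrUpdate?{urlencode(params)}"
```

Both flags default to off in the sketch, matching the intent that `allowStorageChange` in particular is an explicit opt-in.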
POST /runtime/rule/inactivate

Soft-pause. OAP stops emitting metrics for the rule; backend measure +
historical data preserved for a later reactivation. The "off" intent is
durable across reboots: bundled rules on disk are not auto-resurrected
when an `inactivate` removes the runtime override (use `delete` if the
operator wants bundled to take over).

    curl -X POST 'http://OAP:17128/runtime/rule/inactivate?catalog=otel-rules&name=vm'

POST /runtime/rule/delete

Removes an `INACTIVE` row (active rules return `409 requires_inactivate_first`).
The OAP-side teardown is uniform; the storage-side effect splits on whether a
bundled YAML exists on disk for `(catalog, name)`:
* No bundled twin: … is fully gone.
* Bundled twin exists: … claimed that bundled does NOT claim (or claims at a
  different shape) are dropped; bundled-shared at matching shape is preserved
  (no data loss for measures bundled will reuse). The runtime row is removed;
  bundled is reinstalled into a `static:` loader on the local node. Peers
  converge via the periodic reconcile.

`?mode=revertToBundled` is an explicit operator hint that fails with
`400 no_bundled_twin` when no bundled YAML exists, which is useful for scripts that
want to fail loudly on assumption mismatch. To kill a bundled rule entirely,
the bundled YAML on disk must be edited or removed (it's part of the OAP
image).

`/delete` (no twin):
* `dropTable` no-op (merging index stays)
* `dropTable` no-op (table stays)

Read endpoints
GET /runtime/rule

Fetch one rule. Default order: runtime row first, bundled YAML second, 404
otherwise. Raw YAML by default; JSON envelope on `Accept: application/json`.
Supports `ETag` and `If-None-Match`.
`?source=bundled` reads the on-disk bundled YAML even when a runtime row
exists, which is useful for "compare runtime vs bundled" / "fetch bundled body for a
revert via `addOrUpdate`" workflows. Returns `404 not_found` when the rule
has no bundled twin.

GET /runtime/rule/bundled

Bundled rules for one catalog as JSON. `withContent=true` (default) ships the
full YAML body; `withContent=false` omits it. Each item flags whether an
operator override exists.

GET /runtime/rule/list

Single JSON envelope `{generatedAt, loaderStats, rules}` merging stored rules
with this node's local state. Each row carries `status`, `localState`,
`suspendOrigin`, `loaderGc`, `loaderKind` (`RUNTIME`/`STATIC`/`NONE`),
`loaderName`, `contentHash`, `bundled` (whether a YAML exists on disk),
`bundledContentHash` (when `bundled=true`), `updateTime`, and `lastApplyError`.
`loaderStats` exposes process-wide manager counters (`active` and `pending`).
Optional `catalog=` filters; unknown values return `400 invalid_catalog`.

GET /runtime/rule/dump

Downloads a tar.gz of stored runtime rules + `manifest.yaml`. Useful for DR
backup or for replaying through `addOrUpdate` to restore state.

Catalog shortcut routes
Implicit catalog, useful when scripting against a single catalog:
* `/runtime/mal/otel/{addOrUpdate,inactivate,delete}`: `catalog=otel-rules`
* `/runtime/mal/log/{addOrUpdate,inactivate,delete}`: `catalog=log-mal-rules`
* `/runtime/lal/{addOrUpdate,inactivate,delete}`: `catalog=lal`

`telegraf-rules` does not have a shortcut; use the canonical `/runtime/rule/...`
routes.
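A client script could pick between the shortcut and canonical routes with a lookup like the following Python sketch. `route_for` is a hypothetical helper; it returns the path plus any query parameters still needed, and falls back to the canonical route for `telegraf-rules` and for operations the shortcuts do not cover.

```python
SHORTCUT_PREFIX = {
    "otel-rules": "/runtime/mal/otel",
    "log-mal-rules": "/runtime/mal/log",
    "lal": "/runtime/lal",
}
SHORTCUT_OPS = {"addOrUpdate", "inactivate", "delete"}


def route_for(catalog: str, op: str):
    """Return (path, extra_query_params) for a catalog/operation pair."""
    prefix = SHORTCUT_PREFIX.get(catalog)
    if prefix is not None and op in SHORTCUT_OPS:
        return f"{prefix}/{op}", {}          # catalog is implied by the path
    return f"/runtime/rule/{op}", {"catalog": catalog}
```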
Response shape

Write endpoints return JSON: `{applyStatus, catalog, name, message}`. Successful
applies use `applyStatus` of `applied`, `filter_only_applied`,
`structural_applied`, `reactivated`, `inactivated`, `static_tombstoned`,
`deleted`, `reverted_to_bundled`, `reverted_to_bundled_partial`, or `no_change`.
Failures map to specific codes (e.g. `requires_inactivate_first`,
`no_bundled_twin`, `invalid_catalog`, `invalid_mode`, `delete_refused`,
`compile_failed`, `storage_change_requires_explicit_approval`,
`cluster_not_ready`, `forward_unknown_operation`, etc.); the full table is in
`backend-runtime-rule-api.md`.
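A script consuming write responses might classify them by `applyStatus` as in the Python sketch below. The success set is copied from the list above; the helper name and the sample bodies in the usage are hypothetical, and treating every other value as a failure is an assumption.

```python
import json

SUCCESS_STATUSES = frozenset({
    "applied", "filter_only_applied", "structural_applied", "reactivated",
    "inactivated", "static_tombstoned", "deleted", "reverted_to_bundled",
    "reverted_to_bundled_partial", "no_change",
})


def is_success(body: str) -> bool:
    """True when the JSON body carries one of the documented success statuses."""
    response = json.loads(body)
    return response.get("applyStatus") in SUCCESS_STATUSES
```

For example, `is_success('{"applyStatus": "no_change", "catalog": "otel-rules", "name": "vm", "message": ""}')` is true, while a body whose `applyStatus` is `compile_failed` is not.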
Notices

Security
* … the network around port 17128 is operator-controlled.
* … `localhost` or a private interface; reach it through an operator-controlled
  gateway with IP allow-list + authentication.
* … the OAP JVM: a malicious rule could exfiltrate data, spike resource use,
  or create metric-name collisions. Treat `POST /runtime/rule/*` as
  equivalent to shell access on the OAP host.
* The cluster-internal Suspend / Forward RPCs ride on the cluster bus
  gRPC server (shared with RemoteService / HealthCheck); that is a
  separate transport from 17128 and follows the same security posture
  as the rest of the cluster bus.

Full security notice: docs/en/security/README.md.
Data-loss guardrail

`allowStorageChange=true` is an explicit "I accept data loss" affirmation, not
a routine flag: it drops the existing measure's data on BanyanDB and orphans
old rows on JDBC / ES. Prefer a rename (new metric name, new rule name) so the
old data stays queryable until its TTL expires and new data starts fresh under
a clean identity.
Consistency model
* … `/addOrUpdate` returns 200, the cluster will converge on that content.
* … cluster main; the second write wins.
* … within 30 s (one periodic scan). Aborted commits self-heal within 60 s.
  Filter-only edits land locally in milliseconds and on every other node
  within 30 s.
* … entry in storage is the single source of truth.
* … This is by design: the schema is moving and in-flight samples have no
  valid landing.
BanyanDB schema-cutover fence

After firing a schema install or drop, OAP waits up to a bounded window
(default 2 s) for every BanyanDB data node to apply the change before
resuming dispatch. Best-effort by design: when all nodes confirm in
time, the `200 OK` marks the moment the cluster's data boundary moves
(samples after `200` use the new shape, samples before use the old). On
laggard timeout, OAP logs a warning naming the laggards and proceeds
anyway so a single slow node doesn't wedge the apply.
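The best-effort fence amounts to a bounded wait that reports laggards instead of failing. A minimal Python sketch, assuming a hypothetical `has_applied(node)` probe and polling (the real fence is described as schema-watch based, not polling):

```python
import time


def wait_for_schema(nodes, has_applied, timeout_s=2.0, poll_s=0.05):
    """Wait until every node confirms or the window closes.

    Returns the sorted list of laggards; an empty list means all confirmed.
    The caller proceeds either way and only logs a warning for laggards.
    """
    deadline = time.monotonic() + timeout_s
    pending = set(nodes)
    while True:
        pending = {node for node in pending if not has_applied(node)}
        if not pending or time.monotonic() >= deadline:
            return sorted(pending)
        time.sleep(poll_s)
```

Returning the laggards rather than raising mirrors the design choice above: one slow data node produces a warning that names it, not a wedged apply.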
Persistence + boot-time merge
Runtime rules persist in management storage and survive OAP restart. At boot,
OAP merges bundled rule files (under
`oap-server/server-starter/src/main/resources/<catalog>/`) with persisted
runtime rules, so the cluster never silently regresses to the bundled
defaults after a restart.
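The boot-time merge can be pictured as an overlay of persisted rows on the bundled set. The Python sketch below is a simplified model under stated assumptions: the `status` and `content` field names and the `ACTIVE` value are hypothetical, and an `INACTIVE` row is taken to keep the rule off even when a bundled twin exists, as the `inactivate` description says.

```python
def merge_at_boot(bundled: dict, runtime_rows: dict) -> dict:
    """Compute which rule content each (catalog, name) key should serve.

    bundled:      key -> YAML content found on disk
    runtime_rows: key -> {"status": "ACTIVE" | "INACTIVE", "content": str}
    """
    effective = {key: ("bundled", content) for key, content in bundled.items()}
    for key, row in runtime_rows.items():
        if row["status"] == "ACTIVE":
            effective[key] = ("runtime", row["content"])  # override wins
        else:
            effective.pop(key, None)  # operator's "off" survives the restart
    return effective
```

Keys with no persisted row fall through to the bundled content, which is the only case where the defaults serve.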
For BanyanDB specifically, schema mismatches at boot are visible, not
silent: if BanyanDB already holds a resource whose shape doesn't match
the declared model (e.g., a rule was edited on disk while OAP was offline),
OAP skips the resource, logs an ERROR with the declared-vs-backend diff,
and continues booting. Re-shape the metric via
`POST /runtime/rule/addOrUpdate`.

`/delete` storage semantics differ per backend

For the destructive path (no bundled twin), the OAP-side teardown is uniform;
the on-disk effect is not:
* … rows queryable until TTL expires.

If you need the data gone immediately on ES / JDBC, drop the table
out-of-band with the storage backend's own tools after `/delete` returns.

When a bundled twin exists, `/delete` is non-destructive by default:
runtime-only / shape-mismatched measures are dropped, but bundled-shared
measures at matching shape are preserved (bundled re-registers against
them). To force destructive removal of a bundled rule, edit the bundled
YAML on disk and restart; bundled rules are part of the OAP image, not
data.
Documentation

Tests

`test/e2e-v2/cases/runtime-rule/`:
* `mal-storage/{banyandb,elasticsearch,postgresql}`: full 10-phase lifecycle
  (CREATE → FILTER_ONLY → STRUCTURAL → DUMP → 4× ILLEGAL → SHAPE-BREAK →
  INACTIVATE → ACTIVATE → DELETE → DUMP) with a per-phase `step` label so
  verification queries attribute data back to the phase that wrote it.
* `lal`: LAL hot-swap with a log-mal aggregation rule; swctl asserts the
  extracted metric carries the swap-flipped step label.
* `cluster`: 2-OAP convergence over ZooKeeper.

Dependency bumps

Driven by the new BanyanDB schema-consistency RPCs whose generated validation
code requires the protobuf-java 4.x runtime:
* 1.70.0 → 1.80.0
* 3.25.5 → 4.33.1
* 1.2.1 → 1.3.0
* 4.2.10.Final → 4.2.12.Final
* 2.0.75 → 2.0.77

Checklist
* … `CHANGES` log.