From 12226e5fe1ad04c2927a5ab5ff62f7c073bfa116 Mon Sep 17 00:00:00 2001
From: Wu Sheng
Date: Wed, 29 Apr 2026 13:40:35 +0800
Subject: [PATCH 1/5] Runtime rule hot-update for MAL and LAL
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds /runtime/rule/{addOrUpdate,inactivate,delete,list,bundled,dump,get}
on a new admin port (default 17128, disabled by default) for
cluster-wide MAL/LAL rule management: push, soft-pause, drop, list,
dump, single-rule fetch. The endpoint converges through a single
elected main per (catalog, name) with Suspend/Resume RPCs broadcast to
peers across the OAP cluster bus. Rules are stored in the management
storage layer (BanyanDB / Elasticsearch / JDBC) so hot-updates survive
OAP restart; at boot the merged view of bundled YAML and persisted
runtime rows takes effect without a regression to defaults.

The endpoint has no built-in authentication: operators must
gateway-protect it with IP allow-lists and never expose it to the
public internet.

Engine model
------------
Three layers, with one boundary between each:

  scheduler (DSLManager + REST handler) - DSL-agnostic. Lock
    acquisition, cluster Suspend/Resume RPC fan-out, persistence,
    classloader graveyard, cross-file ownership enforcement, tick
    scheduling, self-heal.
  orchestrators (DSLRuntimeApply, DSLRuntimeUnregister,
    DSLRuntimeDelete) - drive the per-DSL phase pipeline through the
    engine SPI.
  engines (MalRuleEngine, LalRuleEngine) - DSL-specific. Implement
    compile / verify / commit / rollback / unregister / dropBackend /
    reloadStatic against a per-engine ApplyContext; classify,
    claimedKeys, storageImpactKeys drive cross-DSL routing and the
    storage-change guardrail.

DSLClassLoaderManager (server-core)
-----------------------------------
Process-wide singleton that owns every per-file RuleClassLoader.
Boot-time bundled rules continue to load into the OAP main classloader
(shared); per-file `static:` loaders are only minted after a runtime
override is removed and the bundled YAML must serve again. Loader name
format `:/@` is observable in stack traces and the graveyard's
INFO/WARN log lines. The manager runs an internal daemon sweeper that
observes phantom-reference collection and warns on retired loaders
that stay alive past the configured threshold (the leak signal).

API surface:

  newBuilder(catalog, rule, kind, hash) - mints a loader for compile;
    not yet registered as active.
  commit(loader) - promotes to active, returns the displaced prior so
    the caller decides whether to retire.
  retire(loader) - moves a specific loader to the graveyard for GC
    observability.
  dropRuntime(catalog, rule) - drops + retires the active loader.
  active(catalog, rule), activeCount(), pendingCount() - diagnostics.

The split between newBuilder and commit means a failed compile leaves
the live loader untouched: the failed loader is just garbage-collected,
and the manager's active map still points at the previously-serving
one.

/inactivate semantics (Design A)
--------------------------------
/inactivate stamps localState=NOT_LOADED. Bundled rules do NOT
auto-resurrect on /inactivate even when a bundled twin exists on disk;
the operator's "off" intent is preserved across reboots. To bring
bundled back, the operator runs /delete (drops the row, gone-keys
reconcile reloads bundled) or /addOrUpdate (with bundled YAML or their
own).

/delete semantics
-----------------
Default mode: removes the runtime row.

  No bundled twin - destructive cascade fires (BanyanDB measure / ES
    index / JDBC table dropped). Rule fully gone.
  Bundled twin exists - non-destructive: backend resources runtime
    claimed that bundled doesn't (or claims at different shape) are
    dropped; bundled-shared at matching shape is preserved. The
    runtime row is removed; bundled is reinstalled into a `static:`
    loader synchronously on the local node.
    Peers converge via gone-keys reconcile on their next tick.

?mode=revertToBundled is an explicit operator hint that requires a
bundled twin (returns 400 no_bundled_twin when none exists); useful
for scripts that want to fail loudly on assumption mismatch.

REST surface
------------
* /list returns a single JSON envelope {generatedAt, loaderStats,
  rules} (was NDJSON). Each row carries status, localState,
  suspendOrigin, loaderGc, loaderKind (RUNTIME/STATIC/NONE),
  loaderName, contentHash, bundled, bundledContentHash, updateTime,
  lastApplyError. status=BUNDLED replaces the prior STATIC for
  bundled-only rules.
* /get accepts ?source=bundled to read bundled YAML even when a
  runtime row exists; closes the "compare runtime to bundled" gap for
  editor flows.
* All JSON build sites use Gson.

Catalog enum at REST boundary
-----------------------------
The catalog query parameter parses to
org.apache.skywalking.oap.server.core.classloader.Catalog at the REST
boundary. RuntimeRuleService's public methods (addOrUpdate /
inactivate / delete / get / listBundled / dumpCatalog) take Catalog;
unknown wire values return 400 invalid_catalog uniformly. Internal
helpers convert via getWireName() at DAO / cluster-RPC edges.

DeleteMode enum
---------------
?mode= parses to DeleteMode at the REST boundary. The string never
leaves the handler.

ForwardTarget interface removed
-------------------------------
Single production implementation (RuntimeRuleRestHandler) and zero
test fakes. The cluster gRPC service now references RuntimeRuleService
directly; the REST handler is left as a pure transport adapter (route
bindings + parameter parsing). The Result POJO becomes
RuntimeRuleService.ForwardResult.

Cluster
-------
Suspend / Resume / Forward RPCs over the cluster gRPC bus. Single-main
routing (deterministic mainFor(catalog, name)) with REST
forward-to-main. Self-heal sweeps SUSPENDED bundles whose main crashed
mid-apply (60 s default).
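
Single-main routing only works if every node computes the same main
for a given (catalog, name) without talking to its peers. A minimal
sketch of one way to get that property, assuming (this is not taken
from the patch) that all nodes see the same set of peer addresses and
pick an index from a stable hash of the key. The class name
MainForSketch, the host:port peer strings and the CRC32 choice are
hypothetical; the real MainRouter may use a different scheme.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical illustration only; not the MainRouter from this patch. */
public final class MainForSketch {
    private MainForSketch() {
    }

    /**
     * Picks one peer for (catalog, name). Sorting a copy of the peer list
     * makes the result independent of the order in which each node
     * happened to discover its peers.
     */
    public static String mainFor(final String catalog,
                                 final String name,
                                 final List<String> peers) {
        if (peers.isEmpty()) {
            throw new IllegalArgumentException("no peers");
        }
        final List<String> sorted = new ArrayList<>(peers);
        Collections.sort(sorted);
        final CRC32 crc = new CRC32();
        // The separator keeps ("a", "bc") and ("ab", "c") from colliding.
        crc.update((catalog + '\n' + name).getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is a non-negative long, so the modulo is a valid index.
        return sorted.get((int) (crc.getValue() % sorted.size()));
    }
}
```

With a plain modulo like this, a membership change can move the main
for many keys at once, which is one reason a real router might prefer
a different selection rule; the sketch only shows the determinism
property that REST forward-to-main depends on.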
Storage
-------
RuntimeRuleManagementDAO - per-backend upsert / read / delete on the
rule rows. Implementations for BanyanDB / ES / JDBC. /inactivate runs
under StorageManipulationOpt.localCacheOnly so the backend measure and
history stay; /delete fires the destructive cascade unless a bundled
twin makes delta-drop the right path.

Per-rule lifecycle docs
-----------------------
docs/en/setup/backend/backend-runtime-rule-api.md walks the full
operator surface: routes, applyStatus matrix, per-row status decoding
(status x loaderKind x bundled), reading bundled-vs-runtime YAML,
consistency model.

E2E
---
test/e2e-v2/cases/runtime-rule/ covers the lifecycle on BanyanDB / ES /
JDBC, plus a 2-node cluster scenario (Suspend/Resume + main routing)
and the LAL pipeline. Verified end-to-end on BanyanDB locally:
CREATE → UPDATE-FILTER → UPDATE-STRUCTURAL → DUMP → 4× illegal →
SHAPE-BREAK → INACTIVATE → ACTIVATE → DELETE → DUMP.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .claude/skills/gh-pull-request/SKILL.md | 40 +
 .github/workflows/skywalking.yaml | 16 +-
 .gitignore | 1 +
 .licenserc.yaml | 4 +-
 CLAUDE.md | 22 +
 apm-protocol/apm-network/pom.xml | 4 +-
 dist-material/release-docs/LICENSE | 87 +-
 docker/.env | 2 +-
 docker/oap/docker-entrypoint.sh | 1 +
 docs/en/changes/changes.md | 50 +-
 .../runtime-rule-hot-update.md | 386 ++++
 docs/en/security/README.md | 31 +
 docs/en/setup/backend/backend-health-check.md | 29 +-
 .../setup/backend/backend-runtime-rule-api.md | 454 ++++
 docs/menu.yml | 4 +
 oap-server-bom/pom.xml | 4 +-
 oap-server/ai-pipeline/pom.xml | 4 +-
 .../meter/process/MeterProcessorTest.java | 4 +-
 .../v2/compiler/LALClassGenerator.java | 75 +-
 .../oap/log/analyzer/v2/dsl/DSL.java | 27 +-
 .../dsl/spec/extractor/MetricExtractor.java | 17 +-
 .../analyzer/v2/module/LogAnalyzerModule.java | 19 +-
 .../log/analyzer/v2/provider/LALConfigs.java | 113 +-
 .../provider/LogAnalyzerModuleProvider.java | 153 +-
 .../log/listener/LogFilterListener.java | 292 ++-
.../oap/meter/analyzer/v2/Analyzer.java | 127 +- .../analyzer/v2/MalConverterRegistry.java | 61 + .../oap/meter/analyzer/v2/MetricConvert.java | 152 +- .../v2/compiler/MALBytecodeHelper.java | 24 +- .../v2/compiler/MALClassGenerator.java | 77 +- .../v2/compiler/MALMetadataExtractor.java | 4 +- .../oap/meter/analyzer/v2/dsl/DSL.java | 57 +- .../analyzer/v2/dsl/FilterExpression.java | 47 +- .../analyzer/v2/prometheus/rule/Rules.java | 105 +- oap-server/exporter/pom.xml | 4 +- oap-server/server-alarm-plugin/pom.xml | 4 +- .../core/alarm/provider/AlarmKernel.java | 88 + .../alarm/provider/AlarmModuleProvider.java | 3 + .../core/alarm/provider/RunningRule.java | 50 + .../grpc-configuration-sync/pom.xml | 4 +- oap-server/server-core/pom.xml | 4 +- .../oap/server/core/CoreModule.java | 4 +- .../oap/server/core/CoreModuleProvider.java | 13 +- .../server/core/alarm/AlarmKernelService.java | 56 + .../oap/server/core/alarm/AlarmModule.java | 7 +- .../core/analysis/meter/MeterSystem.java | 498 ++++- .../worker/ManagementStreamProcessor.java | 9 +- .../worker/MetricsAggregateWorker.java | 29 + .../worker/MetricsPersistentMinWorker.java | 20 + .../worker/MetricsPersistentWorker.java | 27 + .../worker/MetricsStreamProcessor.java | 325 ++- .../analysis/worker/NoneStreamProcessor.java | 9 +- .../worker/RecordStreamProcessor.java | 9 +- .../analysis/worker/TopNStreamProcessor.java | 9 +- .../classloader/BytecodeClassDefiner.java | 58 + .../oap/server/core/classloader/Catalog.java | 55 + .../core/classloader/ClassLoaderGc.java | 161 ++ .../classloader/DSLClassLoaderManager.java | 280 +++ .../core/classloader/RuleClassLoader.java | 101 + .../management/runtimerule/RuntimeRule.java | 127 ++ .../server/core/rule/ext/RuleSetMerger.java | 192 ++ .../rule/ext/RuntimeRuleOverrideResolver.java | 151 ++ .../core/rule/ext/StaticRuleRegistry.java | 170 ++ .../core/source/DefaultScopeDefine.java | 1 + .../server/core/storage/StorageModule.java | 12 +- .../annotation/ValueColumnMetadata.java | 10 
+ .../management/RuntimeRuleManagementDAO.java | 106 + .../core/storage/model/ModelInstaller.java | 155 +- .../core/storage/model/ModelRegistry.java | 85 + .../storage/model/StorageManipulationOpt.java | 485 ++++ .../core/storage/model/StorageModels.java | 186 +- .../core/worker/IWorkerInstanceSetter.java | 9 + .../core/worker/WorkerInstancesService.java | 8 + .../core/analysis/meter/MeterSystemTest.java | 5 +- .../ManagementPersistentWorkerTest.java | 57 + .../MetricsStreamProcessorSuspendTest.java | 191 ++ .../DSLClassLoaderManagerTest.java | 113 + .../core/classloader/RuleClassLoaderTest.java | 84 + .../core/rule/ext/StaticRuleRegistryTest.java | 100 + .../core/storage/model/StorageModelsTest.java | 55 +- .../fetcher-proto/pom.xml | 4 +- .../library-banyandb-client/pom.xml | 4 +- .../banyandb/v1/client/AbstractWrite.java | 2 +- .../banyandb/v1/client/BanyanDBClient.java | 257 ++- .../banyandb/v1/client/SchemaWatcher.java | 146 ++ .../v1/client/grpc/MetadataClient.java | 37 +- .../metadata/GroupMetadataRegistry.java | 29 +- .../IndexRuleBindingMetadataRegistry.java | 29 +- .../metadata/IndexRuleMetadataRegistry.java | 19 +- .../metadata/MeasureMetadataRegistry.java | 15 +- .../metadata/PropertyMetadataRegistry.java | 15 +- .../v1/client/metadata/Serializable.java | 2 +- .../metadata/StreamMetadataRegistry.java | 15 +- .../TopNAggregationMetadataRegistry.java | 19 +- .../metadata/TraceMetadataRegistry.java | 15 +- .../library-banyandb-client/src/main/proto | 2 +- .../v1/client/BanyanDBClientTestCI.java | 19 +- .../server/library/batchqueue/BatchQueue.java | 48 +- .../library/it/BanyanDBTestContainer.java | 94 + .../library-pprof-parser/pom.xml | 10 +- .../traceql-plugin/pom.xml | 4 +- .../aws-firehose-receiver/pom.xml | 4 +- .../envoy/EnvoyMetricReceiverProvider.java | 9 +- .../otel/OtelMetricReceiverModule.java | 3 +- .../otel/OtelMetricReceiverProvider.java | 15 +- .../OpenTelemetryMetricRequestProcessor.java | 76 +- ...RequestProcessorConverterRegistryTest.java 
| 79 + oap-server/server-receiver-plugin/pom.xml | 1 + .../receiver-proto/pom.xml | 4 +- .../pom.xml | 161 ++ .../receiver/runtimerule/apply/DSLDelta.java | 126 ++ .../runtimerule/apply/DeltaClassifier.java | 351 +++ .../runtimerule/apply/LalFileApplier.java | 396 ++++ .../runtimerule/apply/MalFileApplier.java | 420 ++++ .../runtimerule/apply/MalShapeExtractor.java | 210 ++ .../runtimerule/cluster/MainRouter.java | 87 + .../cluster/RuntimeRuleClusterClient.java | 216 ++ .../RuntimeRuleClusterServiceImpl.java | 369 ++++ .../runtimerule/engine/ApplyContext.java | 67 + .../runtimerule/engine/ApplyInputs.java | 47 + .../runtimerule/engine/Classification.java | 52 + .../runtimerule/engine/CompiledDSL.java | 43 + .../engine/EngineCompileException.java | 42 + .../runtimerule/engine/RuleEngine.java | 362 +++ .../engine/RuleEngineRegistry.java | 68 + .../engine/lal/CompiledLalDSL.java | 50 + .../engine/lal/LalApplyContext.java | 49 + .../runtimerule/engine/lal/LalRuleEngine.java | 500 +++++ .../engine/mal/CompiledMalDSL.java | 62 + .../engine/mal/MalApplyContext.java | 55 + .../runtimerule/engine/mal/MalRuleEngine.java | 813 +++++++ .../DbOverrideRuntimeRuleResolver.java | 158 ++ .../runtimerule/metrics/LockMetrics.java | 194 ++ .../module/RuntimeRuleModule.java} | 30 +- .../module/RuntimeRuleModuleConfig.java | 46 + .../module/RuntimeRuleModuleProvider.java | 445 ++++ .../runtimerule/reconcile/DSLManager.java | 767 +++++++ .../reconcile/DSLRuntimeApply.java | 240 ++ .../reconcile/DSLRuntimeDelete.java | 184 ++ .../reconcile/DSLRuntimeUnregister.java | 151 ++ .../runtimerule/reconcile/DSLScriptKey.java | 88 + .../reconcile/PendingApplyCommit.java | 63 + .../runtimerule/reconcile/RuleSync.java | 264 +++ .../reconcile/StaticRuleLoader.java | 196 ++ .../StructuralCommitCoordinator.java | 166 ++ .../runtimerule/reconcile/SuspendResult.java | 43 + .../reconcile/SuspendResumeCoordinator.java | 328 +++ .../receiver/runtimerule/rest/DeleteMode.java | 63 + 
.../rest/RuntimeRuleRestHandler.java | 225 ++ .../runtimerule/rest/RuntimeRuleService.java | 1968 +++++++++++++++++ .../runtimerule/state/AppliedRuleScript.java | 147 ++ .../runtimerule/state/DSLRuntimeState.java | 312 +++ .../runtimerule/state/EngineApplied.java | 96 + .../runtimerule/util/ContentHash.java | 52 + .../src/main/proto/runtime-rule-cluster.proto | 161 ++ ....core.rule.ext.RuntimeRuleOverrideResolver | 18 + ...ing.oap.server.library.module.ModuleDefine | 19 + ...g.oap.server.library.module.ModuleProvider | 19 + .../analyzer/v2/dsl/TestSampleFamily.java | 65 + .../runtimerule/apply/DSLDeltaTest.java | 127 ++ .../apply/DeltaClassifierTest.java | 285 +++ .../runtimerule/apply/LalFileApplierTest.java | 383 ++++ .../runtimerule/apply/MalFileApplierTest.java | 231 ++ .../runtimerule/cluster/MainRouterTest.java | 44 + .../rest/GuardrailIntegrationTest.java | 354 +++ .../rest/RuntimeRuleRestHandlerTest.java | 566 +++++ .../state/AppliedRuleScriptLockTest.java | 111 + .../state/DSLRuntimeStateTest.java | 103 + .../runtimerule/util/ContentHashTest.java | 81 + .../provider/TelegrafReceiverProvider.java | 18 +- .../zabbix/provider/ZabbixMetricsTest.java | 5 +- oap-server/server-starter/pom.xml | 5 + .../src/main/resources/application.yml | 17 + .../banyandb/BanyanDBIndexInstaller.java | 518 ++++- .../BanyanDBRuntimeRuleManagementDAO.java | 118 + .../banyandb/BanyanDBStorageProvider.java | 11 +- .../bulk/AbstractBulkWriteProcessor.java | 4 +- .../banyandb/stream/AbstractBanyanDBDAO.java | 2 +- .../storage/plugin/banyandb/BanyanDBIT.java | 356 --- .../StorageModuleElasticsearchProvider.java | 15 +- .../base/StorageEsInstaller.java | 3 +- .../query/RuntimeRuleManagementEsDAO.java | 116 + .../jdbc/common/JDBCStorageProvider.java | 15 +- .../jdbc/common/JDBCTableInstaller.java | 3 +- .../dao/JDBCRuntimeRuleManagementDAO.java | 150 ++ .../profile/core/MockCoreModuleProvider.java | 4 +- .../core/mock/MockWorkerInstancesService.java | 4 + pom.xml | 19 +- 
.../runtime-rule/cluster/cluster-flow.sh | 178 ++ .../runtime-rule/cluster/docker-compose.yml | 91 + .../cases/runtime-rule/cluster/e2e.yaml | 59 + .../runtime-rule/cluster/expected/ok.txt | 1 + .../cases/runtime-rule/lal/docker-compose.yml | 60 + test/e2e-v2/cases/runtime-rule/lal/e2e.yaml | 66 + .../cases/runtime-rule/lal/expected/ok.txt | 1 + .../e2e-v2/cases/runtime-rule/lal/lal-flow.sh | 209 ++ .../runtime-rule/lal/log-emitter/Dockerfile | 22 + .../runtime-rule/lal/log-emitter/emitter.py | 78 + .../runtime-rule/lal/seed-rules/lal-v1.yaml | 40 + .../runtime-rule/lal/seed-rules/lal-v2.yaml | 39 + .../runtime-rule/lal/seed-rules/log-mal.yaml | 26 + .../mal-storage/banyandb/docker-compose.yml | 69 + .../mal-storage/banyandb/e2e.yaml | 82 + .../mal-storage/banyandb/expected/ok.txt | 1 + .../elasticsearch/docker-compose.yml | 78 + .../mal-storage/elasticsearch/e2e.yaml | 70 + .../runtime-rule/mal-storage/expected/ok.txt | 1 + .../mal-storage/otlp-emitter/Dockerfile | 21 + .../mal-storage/otlp-emitter/emitter.py | 129 ++ .../mal-storage/postgresql/docker-compose.yml | 79 + .../mal-storage/postgresql/e2e.yaml | 71 + .../mal-storage/runtime-rule-flow.sh | 578 +++++ .../seed-rules/illegal-duplicate-metric.yaml | 25 + .../seed-rules/illegal-malformed.yaml | 24 + .../seed-rules/illegal-shape-flip.yaml | 24 + .../seed-rules/seed-rule-filter-only.yaml | 24 + .../seed-rules/seed-rule-instance.yaml | 27 + .../seed-rules/seed-rule-structural.yaml | 28 + .../mal-storage/seed-rules/seed-rule.yaml | 28 + .../e2e-mock-baseline-server/pom.xml | 2 +- .../java-test-service/e2e-protocol/pom.xml | 2 +- .../opentelemetry-proto/pom.xml | 2 +- test/e2e-v2/script/env | 2 +- 223 files changed, 23599 insertions(+), 958 deletions(-) create mode 100644 docs/en/concepts-and-designs/runtime-rule-hot-update.md create mode 100644 docs/en/setup/backend/backend-runtime-rule-api.md create mode 100644 
oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MalConverterRegistry.java create mode 100644 oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/AlarmKernel.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/alarm/AlarmKernelService.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/BytecodeClassDefiner.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/Catalog.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/ClassLoaderGc.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManager.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoader.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/management/runtimerule/RuntimeRule.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/RuleSetMerger.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/RuntimeRuleOverrideResolver.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/StaticRuleRegistry.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/management/RuntimeRuleManagementDAO.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelRegistry.java create mode 100644 oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java create mode 100644 
oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementPersistentWorkerTest.java create mode 100644 oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessorSuspendTest.java create mode 100644 oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManagerTest.java create mode 100644 oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoaderTest.java create mode 100644 oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/rule/ext/StaticRuleRegistryTest.java create mode 100644 oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/SchemaWatcher.java create mode 100644 oap-server/server-library/library-integration-test/src/main/java/org/apache/skywalking/oap/server/library/it/BanyanDBTestContainer.java create mode 100644 oap-server/server-receiver-plugin/otel-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessorConverterRegistryTest.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/pom.xml create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DSLDelta.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DeltaClassifier.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplier.java create mode 100644 
oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalShapeExtractor.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/MainRouter.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/ApplyContext.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/ApplyInputs.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/Classification.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/CompiledDSL.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/EngineCompileException.java create mode 100644 
oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngine.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngineRegistry.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/CompiledLalDSL.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalApplyContext.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalRuleEngine.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/CompiledMalDSL.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalApplyContext.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalRuleEngine.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/extension/DbOverrideRuntimeRuleResolver.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/metrics/LockMetrics.java rename 
oap-server/{server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelCreator.java => server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModule.java} (53%) create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeApply.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeUnregister.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLScriptKey.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/PendingApplyCommit.java create mode 100644 
oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/RuleSync.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StaticRuleLoader.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StructuralCommitCoordinator.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/SuspendResult.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/SuspendResumeCoordinator.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/DeleteMode.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/AppliedRuleScript.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/DSLRuntimeState.java create mode 100644 
oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/EngineApplied.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/util/ContentHash.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/proto/runtime-rule-cluster.proto create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.core.rule.ext.RuntimeRuleOverrideResolver create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.library.module.ModuleDefine create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.library.module.ModuleProvider create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/TestSampleFamily.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DSLDeltaTest.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DeltaClassifierTest.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplierTest.java create mode 100644 
oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/MainRouterTest.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/AppliedRuleScriptLockTest.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/DSLRuntimeStateTest.java create mode 100644 oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/util/ContentHashTest.java create mode 100644 oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBRuntimeRuleManagementDAO.java delete mode 100644 oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIT.java create mode 100644 oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/query/RuntimeRuleManagementEsDAO.java create mode 100644 
oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/dao/JDBCRuntimeRuleManagementDAO.java create mode 100755 test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh create mode 100644 test/e2e-v2/cases/runtime-rule/cluster/docker-compose.yml create mode 100644 test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/cluster/expected/ok.txt create mode 100644 test/e2e-v2/cases/runtime-rule/lal/docker-compose.yml create mode 100644 test/e2e-v2/cases/runtime-rule/lal/e2e.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/lal/expected/ok.txt create mode 100755 test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh create mode 100644 test/e2e-v2/cases/runtime-rule/lal/log-emitter/Dockerfile create mode 100644 test/e2e-v2/cases/runtime-rule/lal/log-emitter/emitter.py create mode 100644 test/e2e-v2/cases/runtime-rule/lal/seed-rules/lal-v1.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/lal/seed-rules/lal-v2.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/lal/seed-rules/log-mal.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/docker-compose.yml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/e2e.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/expected/ok.txt create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/docker-compose.yml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/e2e.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/expected/ok.txt create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/otlp-emitter/Dockerfile create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/otlp-emitter/emitter.py create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/docker-compose.yml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/e2e.yaml create mode 
100755 test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-duplicate-metric.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-malformed.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-shape-flip.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-filter-only.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-instance.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-structural.yaml create mode 100644 test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule.yaml diff --git a/.claude/skills/gh-pull-request/SKILL.md b/.claude/skills/gh-pull-request/SKILL.md index fa08cd598fc5..657cd50c42e3 100644 --- a/.claude/skills/gh-pull-request/SKILL.md +++ b/.claude/skills/gh-pull-request/SKILL.md @@ -32,6 +32,46 @@ license-eye header check If invalid files are found, fix with `license-eye header fix` and re-check. +### 3. Unnecessary fully-qualified class names + +The project checkstyle forbids inline FQCNs — every type reference in code should resolve +through an `import`, not a fully-qualified name. Checkstyle does not always catch this (it +misses cases like inline `java.util.HashMap`, `java.util.concurrent.TimeUnit`, or +`org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics.Timer` used as a local +variable type, generic parameter, or `new` target). 
Audit the files the branch touched +before pushing: + +Use the `Grep` tool (ripgrep) rather than BSD `grep` on macOS — the scan below relies on a +negative lookahead that BSD `grep` doesn't support and GNU `grep -P` does: + +``` +pattern: ^(?!\s*(import |package |\s*\*)).*\b(java\.util\.|java\.io\.|java\.nio\.|java\.util\.concurrent\.|javassist\.|org\.apache\.skywalking\.)[A-Z][A-Za-z0-9_]* +glob: *.java +output_mode: content +-n: true +``` + +Scope the scan to files the branch touched, not the whole tree — pre-existing FQCNs on +unrelated files generate noise. Use `git diff --name-only master...HEAD -- '*.java'` to get +the changed list, then run the ripgrep pattern against each. + +Acceptable exceptions (same as the `CLAUDE.md` rule): + - Two classes with the same simple name would collide if both imported. + - A Javadoc `{@link}` where the short name would be ambiguous to the reader. + - Inside a string literal (e.g., a class name passed to `Class.forName`). + +Fix every other hit — add an `import` and switch to the short name. This includes +`new java.util.HashMap<>()`, `java.util.Set` parameter types, and +`org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics.Timer` as a local +variable type. Field declarations, method signatures, local variables, and generic +type arguments should all use the imported short name. + +Re-run checkstyle after the fix — a sloppy `sed`/`replace_all` can corrupt the `import` +line itself (e.g., turning `import java.util.concurrent.locks.ReentrantLock;` into +`import ReentrantLock;`), which causes a cryptic checkstyle `Range [0, -1) out of +bounds for length N` error, not a normal violation line. If you see that error, inspect +the imports block first. 
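The FQCN scan described above can be sketched in a few lines of Python, which may help when checking how the lookahead behaves on individual lines. This is only an illustration under stated assumptions, not part of the patch: the function name `find_inline_fqcns` and the `SAMPLE` source are hypothetical, and the regular expression is a widened variant of the pattern in the text that additionally allows lowercase sub-package segments between a listed prefix and the class name, so that deeper names such as `org.apache.skywalking.oap.server.core.remote.client.RemoteClientManager` are also flagged.

```python
import re
from typing import List, Tuple

# Widened variant of the scan pattern (an assumption of this sketch): after one
# of the package prefixes listed in the text, allow any number of lowercase
# sub-package segments before the capitalised class name.
INLINE_FQCN = re.compile(
    r"^(?!\s*(import |package |\*))"  # skip import/package lines and Javadoc bodies
    r".*\b(java\.util\.|java\.io\.|java\.nio\.|javassist\.|org\.apache\.skywalking\.)"
    r"(?:[a-z][a-z0-9_]*\.)*[A-Z][A-Za-z0-9_]*"
)


def find_inline_fqcns(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) for lines that look like inline FQCN use."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if INLINE_FQCN.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical Java source used only to exercise the pattern.
SAMPLE = """\
package org.apache.skywalking.oap.demo;

import java.util.concurrent.locks.ReentrantLock;

/**
 * See java.util.HashMap for details.
 */
public class Demo {
    private final ReentrantLock lock = new ReentrantLock();
    private java.util.Map<String, String> cache = new java.util.HashMap<>();
    private java.util.concurrent.TimeUnit unit;
}
"""

if __name__ == "__main__":
    for number, line in find_inline_fqcns(SAMPLE):
        print(f"{number}: {line}")
```

On the sample, the scan skips the `package`, `import`, and Javadoc lines and reports only the two field declarations. Like the pattern in the text, it cannot tell a string literal from code, so hits inside strings still need the manual exception check listed above.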
+ ## Commit and push After checks pass, commit and push: diff --git a/.github/workflows/skywalking.yaml b/.github/workflows/skywalking.yaml index 88f19ea1ed54..784d60eb7c34 100644 --- a/.github/workflows/skywalking.yaml +++ b/.github/workflows/skywalking.yaml @@ -294,7 +294,9 @@ jobs: distribution: temurin - name: Integration test run: | - # Exclude slow integration tests and run those tests separately below. + # Exclude slow integration tests (run in slow-integration-test). Runtime-rule + # and BanyanDB storage CRUD are verified end-to-end in the dedicated e2e cases + # (see test/e2e-v2/cases/runtime-rule/ and test/e2e-v2/cases/banyandb). ./mvnw -B clean integration-test -Dcheckstyle.skip -DskipUTs=true -DexcludedGroups=slow || \ ./mvnw -B clean integration-test -Dcheckstyle.skip -DskipUTs=true -DexcludedGroups=slow @@ -394,6 +396,18 @@ jobs: config: test/e2e-v2/cases/storage/es/es-sharding/e2e.yaml env: ES_VERSION=8.18.8 + - name: Runtime Rule MAL Storage BanyanDB + config: test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/e2e.yaml + - name: Runtime Rule MAL Storage PostgreSQL + config: test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/e2e.yaml + - name: Runtime Rule MAL Storage Elasticsearch 8.18.8 + config: test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/e2e.yaml + env: ES_VERSION=8.18.8 + - name: Runtime Rule LAL Hot-Update + config: test/e2e-v2/cases/runtime-rule/lal/e2e.yaml + - name: Runtime Rule Cluster Convergence + config: test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml + - name: Alarm ES config: test/e2e-v2/cases/alarm/es/e2e.yaml - name: Alarm ES Sharding diff --git a/.gitignore b/.gitignore index 78364b6f65f1..c261fb546ada 100644 --- a/.gitignore +++ b/.gitignore @@ -43,4 +43,5 @@ test/script-cases/scripts/**/*.generated-classes/ # Claude Code local settings .claude/settings.local.json +.claude/scheduled_tasks.lock *.generated-classes/ diff --git a/.licenserc.yaml b/.licenserc.yaml index 8ed6da744a31..83abf66806af 100644 --- 
a/.licenserc.yaml +++ b/.licenserc.yaml @@ -134,10 +134,10 @@ dependency: version: 1.12.0 license: Apache-2.0 - name: build.buf.protoc-gen-validate:pgv-java-stub - version: 1.2.1 + version: 1.3.0 license: Apache-2.0 - name: build.buf.protoc-gen-validate:protoc-gen-validate - version: 1.2.1 + version: 1.3.0 license: Apache-2.0 - name: com.aayushatharva.brotli4j:service version: 1.20.0 diff --git a/CLAUDE.md b/CLAUDE.md index e1586e50600c..dacd9a41bafe 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -90,6 +90,17 @@ public class XxxModuleProvider extends ModuleProvider { - No star imports (`import xxx.*`) - No unused or redundant imports - No empty statements (standalone `;`) +- No fully-qualified class names inline in code — always add an `import` statement and + use the short name. Acceptable exceptions: (a) two classes with the same simple name + would collide if both imported, (b) the class appears exactly once in a Javadoc + `{@link}` where the short name would be ambiguous to the reader. Field declarations, + method signatures, local variables, and generic type arguments should always use the + imported short name — `private RemoteClientManager rcm;`, not `private + org.apache.skywalking.oap.server.core.remote.client.RemoteClientManager rcm;`. +- No one-line delegate methods. A wrapper whose only body is a single forwarding call + to another class (`return Other.foo(a, b);`) adds a hop without value. Inline the + call at the use site, or call the underlying object directly (including via method + reference: `obj::foo` instead of `this::wrapperOfFoo`). 
**Required patterns:** - `@Override` annotation required for overridden methods @@ -105,6 +116,13 @@ public class XxxModuleProvider extends ModuleProvider { - Package names: `org.apache.skywalking.*` or `test.apache.skywalking.*` - Type names: `PascalCase` or `UPPER_CASE_WITH_UNDERSCORES` - Local variables/parameters/members: `camelCase` +- **Function-oriented naming, not abstract metaphor**: classes and methods are named for + what they do, not for an abstract concept. Prefer concrete verbs (`load`, `apply`, + `unregister`, `compile`, `verify`, `commit`, `rollback`) over metaphorical ones + (`seed`, `hydrate`, `bootstrap`, `prime`). Class names follow the same rule — + `StaticRuleLoader` (loads static rules), not `StaticBundleSeeder`; `DSLSyncTimer` (syncs + DB → state on a timer), not `TickRunner`. If you can't name a method without reaching + for a metaphor, the method is probably doing too much; split it. **File limits:** - Max file length: 3000 lines @@ -257,6 +275,10 @@ Actions owned by `actions/*` (GitHub), `github/*`, and `apache/*` are always all 10. **Relative paths in docs are valid**: Relative file paths (e.g., `../../../oap-server/...`) in documentation work both in the repo and on the documentation website, supported by website build tooling 11. **Module service registration**: When adding a service to `CoreModule.services()`, update ALL `CoreModuleProvider` implementations — not just the main one. Search with `grep -rn "extends CoreModuleProvider" oap-server/ --include="*.java"`. The `MockCoreModuleProvider` in `server-tools/profile-exporter/` also needs it, or the profile exporter e2e test will fail at startup. 12. **Multiple OAP packagings**: The OAP server is not only the main `server-starter`. The `server-tools/` directory contains standalone tools (e.g., profile exporter) that boot with mock module providers and a subset of modules. Changes to core module contracts (services, required modules) must be reflected in these tools too. +13. 
**`moduleManager.find(X.NAME)` requires `X.NAME` in `requiredModules()`**: every call to `moduleManager.find(SomeModule.NAME)` (direct or through a helper) must have `SomeModule.NAME` in the provider's `requiredModules()` array. Missing declarations cause runtime exceptions the first time the code path fires — not at module boot. Wrapping the call in `try { ... } catch (Throwable)` is NOT a substitute; declare the module and keep the try/catch only for defensive handling of transient provider outages. When auditing a branch, grep for `moduleManager.find(` across the touched module and verify each target name appears in `requiredModules()`. Example modules that frequently catch teams out: `AlarmModule` (used by alarm-kernel reset), `LogAnalyzerModule` (used by LAL factory lookup). +14. **Don't look up `ClusterModule` services directly**: the `ClusterModule` (ZooKeeper / K8s / Nacos coordination) exposes `ClusterRegister` / `ClusterNodesQuery` / `ClusterCoordinator`. Most receiver / analyzer modules don't declare `ClusterModule` in `requiredModules()`, so calling `moduleManager.find(ClusterModule.NAME)` will throw at runtime. Instead, go through `CoreModule`'s `RemoteClientManager` service — it's already populated by the cluster module and exposes the peer list every OAP needs. If a module genuinely needs cluster-coordinator primitives, declare `ClusterModule.NAME` in `requiredModules()` explicitly. +15. **No `ThreadLocal` side-channels to hijack downstream behaviour**: routing a caller's intent through a `ThreadLocal` that downstream code reads (e.g., `if (PeerMode.isActive()) skipSomething()`) is almost always the wrong answer — it creates invisible coupling between far-apart code paths, leaks across async hand-offs (executors, gRPC threads, Armeria event loops), and makes the behaviour impossible to understand from a method signature. 
The correct fix is almost always to **extend the interface** — add a parameter, a new method, a new mode enum that appears in the signature. Rare exceptions: propagating OpenTelemetry context where the whole industry has standardised on `ThreadLocal`, or security principals enforced by a framework. In all other cases, prefer an explicit API extension, even if it costs more lines. +16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: every BanyanDB Create / Update / Delete returns an etcd `mod_revision` (0 on a delete that didn't record a tombstone). After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks until every data node has caught up, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom existed before the `SchemaBarrierService` proto landed and has been replaced — do not reintroduce it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence. ## Analysis and Design Principles diff --git a/apm-protocol/apm-network/pom.xml b/apm-protocol/apm-network/pom.xml index 2be48f693915..cc60598b8f62 100644 --- a/apm-protocol/apm-network/pom.xml +++ b/apm-protocol/apm-network/pom.xml @@ -92,11 +92,11 @@ protobuf-java version that grpc depends on. 
--> - com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/dist-material/release-docs/LICENSE b/dist-material/release-docs/LICENSE index db841d94b50d..f9d0e55bd7c2 100644 --- a/dist-material/release-docs/LICENSE +++ b/dist-material/release-docs/LICENSE @@ -208,8 +208,8 @@ Apache-2.0 licenses ======================================================================== The following components are provided under the Apache-2.0 License. See project link for details. The text of each license is the standard Apache 2.0 license. - https://mvnrepository.com/artifact/build.buf.protoc-gen-validate/pgv-java-stub/1.2.1 Apache-2.0 - https://mvnrepository.com/artifact/build.buf.protoc-gen-validate/protoc-gen-validate/1.2.1 Apache-2.0 + https://mvnrepository.com/artifact/build.buf.protoc-gen-validate/pgv-java-stub/1.3.0 Apache-2.0 + https://mvnrepository.com/artifact/build.buf.protoc-gen-validate/protoc-gen-validate/1.3.0 Apache-2.0 https://mvnrepository.com/artifact/com.aayushatharva.brotli4j/brotli4j/1.20.0 Apache-2.0 https://mvnrepository.com/artifact/com.aayushatharva.brotli4j/service/1.20.0 Apache-2.0 https://mvnrepository.com/artifact/com.alibaba.nacos/nacos-auth-plugin/2.3.2 Apache-2.0 @@ -226,7 +226,7 @@ The text of each license is the standard Apache 2.0 license. 
https://mvnrepository.com/artifact/com.fasterxml.jackson.datatype/jackson-datatype-jsr310/2.20.1 Apache-2.0 https://mvnrepository.com/artifact/com.fasterxml.jackson.module/jackson-module-kotlin/2.13.4 Apache-2.0 https://mvnrepository.com/artifact/com.fasterxml/classmate/1.5.1 Apache-2.0 - https://mvnrepository.com/artifact/com.google.api.grpc/proto-google-common-protos/2.48.0 Apache-2.0 + https://mvnrepository.com/artifact/com.google.api.grpc/proto-google-common-protos/2.63.2 Apache-2.0 https://mvnrepository.com/artifact/com.google.auto.service/auto-service-annotations/1.0.1 Apache-2.0 https://mvnrepository.com/artifact/com.google.code.findbugs/jsr305/3.0.2 Apache-2.0 https://mvnrepository.com/artifact/com.google.code.gson/gson/2.9.0 Apache-2.0 @@ -236,7 +236,6 @@ The text of each license is the standard Apache 2.0 license. https://mvnrepository.com/artifact/com.google.guava/guava/32.0.1-jre Apache-2.0 https://mvnrepository.com/artifact/com.google.guava/listenablefuture/9999.0-empty-to-avoid-conflict-with-guava Apache-2.0 https://mvnrepository.com/artifact/com.google.inject/guice/4.1.0 Apache-2.0 - https://mvnrepository.com/artifact/com.google.j2objc/j2objc-annotations/2.8 Apache-2.0 https://mvnrepository.com/artifact/com.graphql-java/java-dataloader/3.2.1 Apache-2.0 https://mvnrepository.com/artifact/com.linecorp.armeria/armeria/1.34.2 Apache-2.0 https://mvnrepository.com/artifact/com.linecorp.armeria/armeria-graphql/1.34.2 Apache-2.0 @@ -254,7 +253,7 @@ The text of each license is the standard Apache 2.0 license. 
https://mvnrepository.com/artifact/commons-codec/commons-codec/1.11 Apache-2.0 https://mvnrepository.com/artifact/commons-io/commons-io/2.17.0 Apache-2.0 https://mvnrepository.com/artifact/commons-net/commons-net/3.9.0 Apache-2.0 - https://mvnrepository.com/artifact/commons-validator/commons-validator/1.9.0 Apache-2.0 + https://mvnrepository.com/artifact/commons-validator/commons-validator/1.10.1 Apache-2.0 https://npmjs.com/package/d3-flame-graph/v/4.1.3 4.1.3 Apache-2.0 https://npmjs.com/package/echarts/v/5.4.1 5.4.1 Apache-2.0 https://mvnrepository.com/artifact/io.etcd/jetcd-api/0.6.1 Apache-2.0 @@ -290,46 +289,47 @@ The text of each license is the standard Apache 2.0 license. https://mvnrepository.com/artifact/io.fabric8/kubernetes-model-scheduling/6.7.1 Apache-2.0 https://mvnrepository.com/artifact/io.fabric8/kubernetes-model-storageclass/6.7.1 Apache-2.0 https://mvnrepository.com/artifact/io.fabric8/zjsonpatch/0.3.0 Apache-2.0 - https://mvnrepository.com/artifact/io.grpc/grpc-api/1.70.0 Apache-2.0 - https://mvnrepository.com/artifact/io.grpc/grpc-context/1.70.0 Apache-2.0 - https://mvnrepository.com/artifact/io.grpc/grpc-core/1.70.0 Apache-2.0 - https://mvnrepository.com/artifact/io.grpc/grpc-grpclb/1.70.0 Apache-2.0 - https://mvnrepository.com/artifact/io.grpc/grpc-netty/1.70.0 Apache-2.0 - https://mvnrepository.com/artifact/io.grpc/grpc-protobuf/1.70.0 Apache-2.0 - https://mvnrepository.com/artifact/io.grpc/grpc-protobuf-lite/1.70.0 Apache-2.0 - https://mvnrepository.com/artifact/io.grpc/grpc-services/1.70.0 Apache-2.0 - https://mvnrepository.com/artifact/io.grpc/grpc-stub/1.70.0 Apache-2.0 - https://mvnrepository.com/artifact/io.grpc/grpc-util/1.70.0 Apache-2.0 + https://mvnrepository.com/artifact/io.grpc/grpc-api/1.80.0 Apache-2.0 + https://mvnrepository.com/artifact/io.grpc/grpc-context/1.80.0 Apache-2.0 + https://mvnrepository.com/artifact/io.grpc/grpc-core/1.80.0 Apache-2.0 + https://mvnrepository.com/artifact/io.grpc/grpc-grpclb/1.80.0 Apache-2.0 + 
https://mvnrepository.com/artifact/io.grpc/grpc-netty/1.80.0 Apache-2.0 + https://mvnrepository.com/artifact/io.grpc/grpc-protobuf/1.80.0 Apache-2.0 + https://mvnrepository.com/artifact/io.grpc/grpc-protobuf-lite/1.80.0 Apache-2.0 + https://mvnrepository.com/artifact/io.grpc/grpc-services/1.80.0 Apache-2.0 + https://mvnrepository.com/artifact/io.grpc/grpc-stub/1.80.0 Apache-2.0 + https://mvnrepository.com/artifact/io.grpc/grpc-util/1.80.0 Apache-2.0 https://mvnrepository.com/artifact/io.micrometer/context-propagation/1.2.0 Apache-2.0 https://mvnrepository.com/artifact/io.micrometer/micrometer-commons/1.14.4 Apache-2.0 https://mvnrepository.com/artifact/io.micrometer/micrometer-core/1.14.4 Apache-2.0 https://mvnrepository.com/artifact/io.micrometer/micrometer-observation/1.14.4 Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-buffer/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-codec/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-codec-base/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-codec-compression/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-codec-dns/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-codec-haproxy/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-codec-http/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-codec-http2/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-codec-marshalling/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-codec-protobuf/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-codec-socks/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-common/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-handler/4.2.10.Final Apache-2.0 - 
https://mvnrepository.com/artifact/io.netty/netty-handler-proxy/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-resolver/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-resolver-dns/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-resolver-dns-classes-macos/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-resolver-dns-native-macos/4.2.10.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-buffer/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-codec/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-codec-base/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-codec-compression/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-codec-dns/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-codec-haproxy/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-codec-http/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-codec-http2/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-codec-marshalling/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-codec-protobuf/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-codec-socks/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-common/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-handler/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-handler-proxy/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-resolver/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-resolver-dns/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-resolver-dns-classes-macos/4.2.12.Final Apache-2.0 + 
https://mvnrepository.com/artifact/io.netty/netty-resolver-dns-native-macos/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-tcnative-boringssl-static/2.0.77.Final Apache-2.0 https://mvnrepository.com/artifact/io.netty/netty-tcnative-boringssl-static/2.0.75.Final Apache-2.0 https://mvnrepository.com/artifact/io.netty/netty-tcnative-classes/2.0.75.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-transport/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-transport-classes-epoll/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-transport-classes-kqueue/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-transport-native-epoll/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-transport-native-kqueue/4.2.10.Final Apache-2.0 - https://mvnrepository.com/artifact/io.netty/netty-transport-native-unix-common/4.2.10.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-transport/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-transport-classes-epoll/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-transport-classes-kqueue/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-transport-native-epoll/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-transport-native-kqueue/4.2.12.Final Apache-2.0 + https://mvnrepository.com/artifact/io.netty/netty-transport-native-unix-common/4.2.12.Final Apache-2.0 https://mvnrepository.com/artifact/io.perfmark/perfmark-api/0.27.0 Apache-2.0 https://mvnrepository.com/artifact/io.prometheus/simpleclient/0.6.0 Apache-2.0 https://mvnrepository.com/artifact/io.prometheus/simpleclient_common/0.6.0 Apache-2.0 @@ -341,6 +341,7 @@ The text of each license is the standard Apache 2.0 license. 
https://mvnrepository.com/artifact/javax.inject/javax.inject/1 Apache-2.0 https://mvnrepository.com/artifact/joda-time/joda-time/2.10.5 Apache-2.0 https://mvnrepository.com/artifact/net.jodah/failsafe/2.4.4 Apache-2.0 + https://mvnrepository.com/artifact/org.apache.commons/commons-compress/1.21 Apache-2.0 https://mvnrepository.com/artifact/org.apache.commons/commons-lang3/3.18.0 Apache-2.0 https://mvnrepository.com/artifact/org.apache.commons/commons-text/1.4 Apache-2.0 https://mvnrepository.com/artifact/org.apache.curator/curator-client/4.3.0 Apache-2.0 @@ -394,8 +395,8 @@ BSD-3-Clause licenses The following components are provided under the BSD-3-Clause License. See project link for details. The text of each license is also included in licenses/LICENSE-[project].txt. - https://mvnrepository.com/artifact/com.google.protobuf/protobuf-java/3.25.5 BSD-3-Clause - https://mvnrepository.com/artifact/com.google.protobuf/protobuf-java-util/3.25.5 BSD-3-Clause + https://mvnrepository.com/artifact/com.google.protobuf/protobuf-java/4.33.1 BSD-3-Clause + https://mvnrepository.com/artifact/com.google.protobuf/protobuf-java-util/4.33.1 BSD-3-Clause https://npmjs.com/package/d3-collection/v/1.0.7 1.0.7 BSD-3-Clause https://npmjs.com/package/d3-ease/v/3.0.1 3.0.1 BSD-3-Clause https://npmjs.com/package/d3-tip/node_modules/d3-selection/v/1.4.2 1.4.2 BSD-3-Clause @@ -541,7 +542,7 @@ The text of each license is also included in licenses/LICENSE-[project].txt. 
https://npmjs.com/package/monaco-editor/v/0.34.1 0.34.1 MIT https://npmjs.com/package/nanoid/v/3.3.11 3.3.11 MIT https://mvnrepository.com/artifact/org.checkerframework/checker-qual/3.33.0 MIT - https://mvnrepository.com/artifact/org.codehaus.mojo/animal-sniffer-annotations/1.24 MIT + https://mvnrepository.com/artifact/org.codehaus.mojo/animal-sniffer-annotations/1.26 MIT https://mvnrepository.com/artifact/org.curioswitch.curiostack/protobuf-jackson/2.8.1 MIT https://mvnrepository.com/artifact/org.slf4j/slf4j-api/1.7.30 MIT https://npmjs.com/package/pinia/v/2.0.28 2.0.28 MIT @@ -588,7 +589,7 @@ https://golang.org/LICENSE licenses The following components are provided under the https://golang.org/LICENSE License. See project link for details. The text of each license is also included in licenses/LICENSE-[project].txt. - https://mvnrepository.com/artifact/com.google.re2j/re2j/1.7 https://golang.org/LICENSE + https://mvnrepository.com/artifact/com.google.re2j/re2j/1.8 https://golang.org/LICENSE ======================================================================== https://opensource.org/licenses/BSD-2-Clause;description=BSD 2-Clause License licenses diff --git a/docker/.env b/docker/.env index 8ca45571cd8f..3a0bd16f619a 100644 --- a/docker/.env +++ b/docker/.env @@ -6,6 +6,6 @@ # docker compose up ELASTICSEARCH_IMAGE=docker.elastic.co/elasticsearch/elasticsearch-oss:7.4.2 -BANYANDB_IMAGE=ghcr.io/apache/skywalking-banyandb:7568a326bb7b10b6aa804bf0f4239904c347d9d5 +BANYANDB_IMAGE=ghcr.io/apache/skywalking-banyandb:69c8f4d20ebb6532ea4c16a7ed7114dd6ec9770b OAP_IMAGE=ghcr.io/apache/skywalking/oap:latest UI_IMAGE=ghcr.io/apache/skywalking/ui:latest diff --git a/docker/oap/docker-entrypoint.sh b/docker/oap/docker-entrypoint.sh index 9e7a2784eaa3..be8439e2926c 100755 --- a/docker/oap/docker-entrypoint.sh +++ b/docker/oap/docker-entrypoint.sh @@ -1,3 +1,4 @@ + #!/bin/sh # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. 
See the NOTICE file diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index 3f18f9b22d4a..ff8d7151c6d8 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -1,8 +1,57 @@ ## 10.5.0 #### Project + +* **Runtime rule hot-update for MAL and LAL.** Operators can now ship metric (MAL) and log + (LAL) rule changes without restarting OAP. A push to a new admin endpoint persists the rule + to the configured storage backend, and every node in the cluster converges to the new + content within ~30 seconds. Common workflows: + * `addOrUpdate` — create or replace a rule. Body is the raw YAML you would normally ship + with OAP's static rule files. Returns 200 once the rule is applied locally and + persisted; peers pick it up on their next periodic scan (≤ 30 s). + * `inactivate` — soft-pause a rule. The OAP stops emitting metrics for that rule but the + backend measure (and its history) is preserved, so a later `addOrUpdate` to the same + `(catalog, name)` is lossless. The "off" intent is durable across reboots; bundled rules + on disk are not auto-resurrected when an `inactivate` removes the runtime override. + This is the safe way to take a rule offline. + * `delete` — removes an `INACTIVE` row (active rules return `409 requires_inactivate_first`). + For runtime-only rules with no bundled YAML on disk, the backend measure is dropped and + the rule is fully gone. For rules that have a bundled YAML, `delete` is non-destructive: + backend resources runtime claimed that bundled does not (or claims at a different shape) + are dropped, bundled-shared at matching shape is preserved, the row is removed, and the + bundled rule is reinstalled into a `static:` loader on the local node — peers converge + via the periodic reconcile. `?mode=revertToBundled` is an explicit operator hint that + fails with `400 no_bundled_twin` when no bundled YAML exists. 
+ * `get` / `bundled` / `list` / `dump` — read-side endpoints for fetching a single rule's + YAML (with `ETag` support; `?source=bundled` reads the on-disk bundled YAML even when a + runtime override is in place), listing the bundled-vs-runtime overlay per catalog, + inspecting cluster-wide rule state as a JSON envelope (`{generatedAt, loaderStats, + rules}` — each row carries `status`/`localState`/`loaderKind`/`bundled`/`bundledContentHash` + so a UI can render override badges without a second roundtrip), and exporting all rules + as a tar.gz for backup / DR. + Hot-updates survive OAP restart: at boot OAP merges bundled rule files with persisted + runtime rules, so the cluster never silently regresses to the bundled defaults. + **The endpoint is disabled by default and listens on port `17128` when enabled. It has + no built-in authentication — operators must gateway-protect it with IP allow-lists and + never expose it to the public internet.** +* **BanyanDB schema mismatches are now visible at boot, not silent.** If BanyanDB already + holds a resource whose shape doesn't match what the current rule declares (e.g., a rule + was edited on disk while OAP was offline), OAP now skips that resource, logs an ERROR + with the declared-vs-backend diff, and continues booting — previously the mismatch was + silently accepted and samples for the affected resource were quietly dropped. To + re-shape a mismatched metric, push the desired YAML through + `POST /runtime/rule/addOrUpdate`. * Bump infra-e2e to testcontainers-go v0.42.0 (apache/skywalking-infra-e2e#146), which uses Docker Compose v2 plugin natively and removes docker-compose v1 dependency. * Remove deprecated `version` field from all docker-compose files for Compose v2 compatibility. 
+* **Best-effort schema-cutover fence for BanyanDB.** After firing a schema install or drop + OAP now waits up to a bounded window (default 2s) for every BanyanDB data node to apply + the change before resuming dispatch — the typical case gets a clean cutover where + samples after `200 OK` use the new shape. On laggard timeout, OAP logs a warning and + proceeds anyway so a single slow node doesn't wedge the apply. +* Bump dependencies: gRPC `1.70.0` → `1.80.0`, protobuf-java `3.25.5` → `4.33.1`, Netty + `4.2.10.Final` → `4.2.12.Final`, Netty-tcnative `2.0.75` → `2.0.77`, pgv (protoc-gen-validate) + `1.2.1` → `1.3.0`. Driven by the new BanyanDB schema-consistency RPCs whose generated + validation code requires the `protobuf-java 4.x` runtime. #### OAP Server * Add Zipkin Virtual GenAI e2e test. Use `zipkin_json` exporter to avoid protobuf dependency conflict @@ -54,4 +103,3 @@ * Add WeChat / Alipay Mini Program monitoring setup documentation, plus a client-side-monitoring section in the security guide covering public-internet ingress (OTLP + `/v3/segments`) for mobile / browser / mini-program SDKs. All issues and pull requests are [here](https://github.com/apache/skywalking/issues?q=milestone:10.5.0) - diff --git a/docs/en/concepts-and-designs/runtime-rule-hot-update.md b/docs/en/concepts-and-designs/runtime-rule-hot-update.md new file mode 100644 index 000000000000..9ee963f4cca6 --- /dev/null +++ b/docs/en/concepts-and-designs/runtime-rule-hot-update.md @@ -0,0 +1,386 @@ +# Runtime Rule Hot-Update — Architecture + +Operators change MAL and LAL rule files at runtime without restarting OAP. Changes +persist in management storage, survive reboots, and propagate across every node in +an OAP cluster within a bounded window. This page explains the contract: what the +feature guarantees, how the cluster stays consistent, and what to expect when +something goes wrong. 
The HTTP surface is documented separately in +[Runtime Rule Hot-Update API](../setup/backend/backend-runtime-rule-api.md). + +## Vocabulary + +- **Runtime rule entry** — the unit of operator state: one entry per `(catalog, name)`, + carrying the full rule-file content plus a status of `ACTIVE` or `INACTIVE`. Entries + live in management storage (the same persistence layer used for UI templates, UI + menus, and other cluster-wide operator state). +- **Catalog** — the rule group named in the API, currently `otel-rules`, `log-mal-rules`, + `telegraf-rules`, or `lal`. It mirrors the on-disk directory layout so a rule's + `(catalog, name)` identity is portable between disk and the runtime-rule entry store. +- **Main** — the single OAP node designated to run the on-demand workflow. Every node + can compute it locally from the sorted cluster peer list; no election. +- **Peer** — every node other than the main. +- **Periodic scan** — every OAP node re-reads management storage every 30 s (default, + configurable) and brings its in-memory state into line with what storage holds. This + is the convergence loop the consistency contract is built on. + +## Scope: MAL and LAL, not OAL + +Runtime hot-update covers only the **MAL** (`otel-rules`, `log-mal-rules`, +`telegraf-rules`) and **LAL** (`lal`) catalogs. OAL rules are deliberately out of +scope. Three reasons, in order of weight: + +1. **OAL targets SkyWalking-native traffic sources; MAL and LAL target third-party + data.** OAL rules derive metrics from the fixed set of sources the platform + already knows how to collect — distributed traces, service / instance / endpoint + traffic, Istio ALS records, native-agent telemetry. The source catalog doesn't + change between deployments; OAL expresses the metrics SkyWalking itself exposes + over that catalog. MAL and LAL are where third-party data lands: Prometheus + scrapes, OpenTelemetry meters, Telegraf / Zabbix / SNMP pollers, log-to-metric + extraction, custom receivers. 
New integrations, label cleanups, filter + adjustments — the edits operators make most often — all live in MAL or LAL + rule files. +2. **Operators iterate on MAL and LAL far more often than on OAL.** A new + Prometheus target comes online, a log format changes upstream, a filter needs + tightening to exclude a noisy source. These are production-frequency changes + that a hot-update path removes the restart tax from. OAL edits, when they + happen, are usually one-time decisions about which built-in metrics an + installation exposes — decisions that fit the deployment cycle. +3. **OAL is a deeper integration; MAL and LAL are contained extension points.** + OAL lives inside the analytical pipeline at the heart of SkyWalking. MAL and + LAL already sit at a known extension boundary, so adding "change this + configuration without restart" is a local concern. Touching the core to give + OAL the same capability is a much larger effort and was deferred. + +Operators who need to change OAL behavior still restart OAP, the way they did before +this feature. Everything below this section is scoped to MAL and LAL. + +## The consistency contract + +This is the headline. Everything else in this document derives from it. + +> **Persist is commit. Every node converges to the persisted state on its next +> periodic scan. Healthy structural commits land cluster-wide within 30 s; aborted +> commits self-heal within 60 s. No quorum, no leader election, no two-phase +> protocol.** + +Concretely: + +- The runtime-rule entry in management storage is the **single source of truth**. + Once `POST /addOrUpdate` returns 200, the entry is durable and every node in + the cluster will eventually run that exact content. +- **Last write wins.** If two operators push the same rule to different nodes, + whichever write hits storage second wins; the cluster converges on the next scan. 
+- **Local in-memory state is provisional.** A node can lag the entry briefly (during + a periodic scan, during a structural apply, immediately after rejoining the + cluster), but never indefinitely. Convergence bounds: + +| Event | Convergence bound | +|---------------------------------------------------------------|------------------------------------------------------------------------------| +| Healthy structural commit | ≤ 30 s for every peer (one scan). | +| Main aborts mid-structural (after pause broadcast, before persist) | ≤ 60 s (peer self-heal). | +| Main crashes mid-structural | ≤ 60 s. | +| Pause broadcast dropped to one peer | ≤ 30 s (the peer notices on the next scan). | +| Peer partitioned during apply, rejoins later | ≤ 30 s after rejoin. | +| Two operators applying the same file to different nodes | Last write wins; cluster converges within 30 s of the second commit. | +| Management storage unavailable | In-memory state held stable; resumes within ≤ 30 s of storage return. | + +Operators reading `/runtime/rule/list` see two timestamps that make convergence +observable: the persisted `updateTime` (storage) and the per-node `localState` +(this OAP's in-memory view). When they agree the node is converged; when they +disagree a periodic scan is in flight or the node is mid-apply. + +## Three workflows + +The feature is three cooperating workflows: + +1. **Boot** — OAP starts; static rule files on disk are loaded, with persisted + runtime-rule entries substituted in or skipped at load time. Backend schema is + read-only at boot — never reshaped, never dropped. +2. **On-demand** — an operator calls `POST /addOrUpdate`, `/inactivate`, or + `/delete`. This is the **only** workflow that may change backend schema, because + the operator explicitly asked for it. +3. **Periodic scan** — every OAP node re-reads every runtime-rule entry every 30 s + and converges its local state to match. This is what closes the convergence + bounds above; nothing else does. 
+ +``` + ┌─────────────────────────┐ + │ management storage │ ← runtime-rule entries + │ (one entry per file) │ (ACTIVE / INACTIVE) + └─────────────────────────┘ + ▲ ▲ + │ write │ read + │ │ + ┌─── boot ───┐ ┌── on-demand ──┐ ┌── periodic scan ─┐ + │ static │ │ admin HTTP │ │ every 30 s on │ + │ files │ │ add / update │ │ every node │ + │ on disk │ │ inactivate │ │ │ + │ │ │ │ delete │ │ diff entries vs │ + │ ▼ │ │ │ │ │ in-memory state, │ + │ runtime │ │ ▼ │ │ apply each delta │ + │ entry │ │ pause peers │ │ │ + │ overrides │ │ → storage │ │ main: storage │ + │ static or │ │ check │ │ peer: local │ + │ skips it │ │ → persist → │ │ state only │ + │ │ │ resume │ │ │ + └────────────┘ └───────────────┘ └──────────────────┘ +``` + +## Boot workflow + +At boot each analyzer module loads its static rule files from disk. The runtime-rule +plugin intercepts each file before compilation and decides per-`(catalog, name)`: + +- No persisted entry → compile the disk file as-is. +- Persisted `ACTIVE` entry → compile the entry's content in place of the disk file. +- Persisted `INACTIVE` entry → skip the file; the operator has tombstoned the rule. + +Compilation registers each rule under **create-if-absent** semantics: missing backend +resources are created; resources that already exist with a different shape are +**skipped with an ERROR log**, the affected metric is disabled until the operator +reconciles via `/addOrUpdate`. Boot is **never** allowed to silently reshape the +backend — that would mask edits made while OAP was offline. + +After every analyzer has loaded, the runtime-rule plugin runs **one synchronous +scan** so any runtime-only entries (no static disk file) are applied before +receivers open ingress. From that point on, the periodic-scan workflow takes over. + +### Why boot cannot reshape the backend + +If boot were allowed to update backend storage, a shape mismatch could silently +rewrite the BanyanDB measure or the Elasticsearch mapping. 
Before this feature, the +backends behaved inconsistently: + +- BanyanDB and JDBC silently accepted the mismatch — samples were written against + the old schema and quietly truncated or rejected later. +- Elasticsearch hard-failed boot on a strict-mapping type conflict. + +Create-if-absent unifies the contract: schema mismatches are always surfaced as an +explicit ERROR, boot always continues, the affected metric is disabled, and the +operator reconciles explicitly through the on-demand workflow. The same shape is +visible across every backend. + +## On-demand workflow + +Triggered by an HTTP call to one of the admin endpoints. A request arriving at any +node is forwarded to the **main** (the node selected from the cluster view); the main +runs the workflow and the receiving node relays the response to the operator. + +Two paths, picked from the diff between the new content and the current entry: + +- **Filter-only path** — body, filter, and tag tweaks that preserve every metric's + storage identity. The main applies the change locally, persists the row, and + returns. Peers pick up the new content on their next periodic scan and apply + the same fast path. No cluster pause, no backend schema change, no alarm reset. +- **Structural path** — anything that moves metric identity (metric set added or + removed, scope or downsampling function changed, LAL `(layer, ruleName)` set + changed). The main runs: + 1. **Pause the cluster** — broadcast a pause to every peer over the cluster bus. + Peers stop dispatching samples for the affected metrics and drain in-flight + batches. Unreachable peers are logged and skipped; they self-recover via the + periodic scan. + 2. **Update backend storage on this node**, including the schema-visibility fence on + BanyanDB (see below). + 3. **Persist the entry** — this is the cluster-wide commit point. + 4. **Resume the cluster** — broadcast a resume so peers re-open dispatch. Peers + that missed the resume self-heal within 60 s. + 5. 
**Reset alarm windows** for any metric whose identity changed, so accumulated + state doesn't carry across the change. + +If any step before persist fails, the entry is **not** advanced, the local node +rolls back to the previous rule state, peers self-heal back to the old content within +60 s, and the operator gets `HTTP 500` with `applyStatus` indicating the failure. + +If persist itself fails, the same rollback happens — the durable state never moved, +so neither does the cluster. + +If persist succeeds but the local finishing step fails (a rare path), the operator +gets `HTTP 500 commit_deferred`: storage holds the new content (peers will converge +on it), but this node hasn't fully applied it yet and will retry on its next scan. + +### Lifecycle + +A rule moves through three observable states: + +``` + [absent] ──/addOrUpdate──► ACTIVE ──/inactivate──► INACTIVE ──/delete──► [absent] + ▲ │ + └─────/addOrUpdate────────┘ (reactivate) +``` + +The three endpoints split deactivation, destruction, and cleanup so an operator +never destroys data they might want back: + +- **`/addOrUpdate`** is the only path that *enters* `ACTIVE`. It handles "new rule" + and "reactivate" the same way — a post against an `INACTIVE` entry runs the full + structural pipeline so backend schema and dispatch handlers are re-created from + the posted bytes. Posting the same content against an `INACTIVE` row counts as a + reactivation, not a no-op, because the *status* is what matters. +- **`/inactivate`** is the **soft-pause** path. The OAP-internal state for the rule + is torn down (dispatch handlers removed, compiled rule dropped, alarm windows + reset), but the **backend measure and its data are explicitly preserved**. + Re-activation via `/addOrUpdate` reuses the existing measure; the cost is a + recompile, not a backfill or a metric-identity change. +- **`/delete`** is the **destructive** endpoint — the **only** one that drops + data. 
It refuses to operate on an `ACTIVE` row (returns `HTTP 409 + requires_inactivate_first`), so destruction always goes through the explicit + two-step `/inactivate → /delete` workflow. On an `INACTIVE` row it drops the + backend measure and removes the entry; on an absent row it is an idempotent + `200 not_found`. + +If a static version of the rule exists on disk, `/delete` of the runtime entry +causes the rule to revert to the static content on the next periodic scan. This is +the intended recovery path for "undo all operator state, go back to what ships in +the OAP distribution." + +### Inactive rules still hold their names + +`/inactivate` clears the runtime rule from memory but the entry is preserved with status +`INACTIVE`. The cross-file ownership guards (MAL metric names; LAL `(layer, +ruleName)` keys) treat that entry as **still owning its names**: a new file +claiming any of them is rejected with `held by inactive `. The operator's +recourse is to update the inactive rule (re-`/addOrUpdate`) or `/delete` it before +reusing its names elsewhere. This keeps `/inactivate` reversible without ever +risking name collisions or accidental backend-data loss across rules that share a +metric name. + +## Periodic scan + +Every node runs the periodic scan independently, every 30 s by default +(configurable). The scan: + +1. Reads every runtime-rule entry from management storage. +2. Diffs the entries against the in-memory state and classifies each difference. +3. Applies each delta. The main may update backend storage; peers update only their + local in-memory state. Peers never write schema changes to the backend. +4. **Self-heals** any rule that has been paused by a peer for more than the + self-heal threshold (60 s default) and whose underlying entry hasn't moved — + this is the recovery path for a main that crashed mid-apply. +5. **Catches up** any rule whose paused state was missed because the cluster pause + broadcast didn't arrive (RPC drop, partition). 
The peer re-applies the new + content from the entry without waiting for a fresh pause. + +The periodic scan is the only mechanism that closes the consistency bounds. Cluster +pause broadcasts and the live application path are optimisations on top — they make +healthy commits visible faster than 30 s — but the periodic scan is what guarantees +convergence will happen at all, even after pause RPCs are dropped, peers crash, or +the cluster topology flaps. + +## Cluster model + +- **Coordinator-agnostic.** Runs on any OAP cluster coordinator (Zookeeper / + Kubernetes / Standalone / Etcd / Nacos) without adding a coordinator of its own. +- **Single main.** The lexicographically-first node in the sorted peer list is + the main; every node computes this locally with no negotiation. Main changes + only when the first node joins or leaves the cluster, which is rare. +- **Forwarding.** Calls to non-main nodes are forwarded to the main over the + cluster bus; the operator gets the main's response transparently. A narrow + fail-safe returns `HTTP 421 cluster_view_split` if a forwarded request arrives + at a node whose own view also says it is not the main — this signals a + transient peer-list disagreement, not a data problem. +- **Pause / resume broadcasts** are **best-effort**. They make the cluster + converge in seconds rather than the 30 s scan window, but the system is correct + even when every broadcast is lost — the periodic scan still converges within + bounds. +- **No two-phase commit.** The on-demand workflow takes a single backend write + (the entry upsert) as the cluster-wide commit point. Everything before it is + reversible; everything after it eventually appears on every node. + +### Concurrent same-file writes to different nodes + +Under stable topology only one node is the main, so concurrent writes serialize +on the main's per-file lock — second write wins, both operators get an honest 200, +the cluster converges within one scan. 
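
The single-main rule from the cluster model above can be sketched in a few lines of shell. This is an illustration only: `pick_main` is a hypothetical helper, the `host:port` peer format is an assumption, and a byte-wise sort stands in for whatever ordering OAP applies to its peer list internally.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: every node runs the same deterministic computation
# over the same peer list, so no negotiation is needed to agree on the main.
pick_main() {
  # Byte-wise (C locale) sort keeps the ordering identical on every host,
  # then the first entry is treated as the main.
  printf '%s\n' "$@" | LC_ALL=C sort | head -n 1
}
```

Under these assumptions, `pick_main oap-2:11800 oap-0:11800 oap-1:11800` selects `oap-0:11800` regardless of argument order. Note that a lexicographic order is not a numeric one: `oap-10:11800` sorts ahead of `oap-2:11800`.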
+ +Under a brief topology flip where two nodes both believe they are the main, both +start their workflow locally and one detects the conflict via the pause broadcast +(the other side is already paused by `SELF`, not `PEER`). The detecting main +returns `HTTP 409 split_brain_detected` to its operator, broadcasts a resume, and +aborts; the surviving workflow runs to completion. **Even if detection misses the +window** and both operators get 200, the periodic scan resolves it: every node +re-reads the entry on the next scan and converges to whichever write reached +storage second. The 409 is an operator-feedback optimisation, not a correctness +gate — without it the system still converges. + +## Schema-visibility fence (BanyanDB) + +BanyanDB's distributed mode propagates registry writes from the meta-server to +every data node asynchronously. A naive flow — register the schema, immediately +resume dispatch — has a race: the registry holds the new measure but a data node +may not yet have caught up, so the first sample after the apply lands on an +unprepared node. + +For runtime hot-updates this would mean the operator's `200 OK` could come back +before the cluster's data boundary actually moved. The runtime-rule install path +narrows the gap on a best-effort basis: every BanyanDB schema write returns an +etcd `mod_revision`, and the installer waits — synchronously, before resuming +dispatch, up to a bounded timeout (default 2s) — for every BanyanDB data node +to catch up to the highest revision the apply produced. + +The visible contract for operators is: + +- Between operator request and `200 OK`, all sample dispatch for the affected + metric is paused on every node. In-flight samples are dropped (this is by + design: a structural change means the schema is moving and in-flight data has + no valid landing). 
+- When all data nodes confirm within the bounded window, the `200 OK` marks the + moment the cluster's data boundary moves: samples written at or after the `200` + use the new shape; samples written before use the old shape. +- When one or more nodes haven't applied within the window, OAP logs a warning + naming the laggards and resumes dispatch anyway. The schema is already + authoritative in etcd, so late nodes apply it asynchronously through their + watcher — until they do, samples landing on those specific nodes for that + metric may be rejected by the local data node briefly. This trades strict + cluster-wide cutover for not wedging an apply behind a single slow node; + operators who need strict behavior should fix the slow node, not loosen the + timeout. + +Elasticsearch and JDBC don't have multi-node schema fan-out; their storage change is +visible when the call returns, so the fence is a no-op for those backends. + +## Failure handling — what operators see + +The feature is designed so failures are visible without tailing logs and so the +recovery path is the same path operators already use. + +- **Rule parse / compile error** — `HTTP 400 compile_failed` with the parser + message. The entry was not persisted; this node and the cluster keep serving + the prior rule for every metric. +- **Storage shape conflict the operator didn't acknowledge** — `HTTP 409 + storage_change_requires_explicit_approval`. No pause broadcast, no persist, no + side effects. Re-push with `?allowStorageChange=true` if the change is + intentional. +- **Backend storage verification failed mid-apply** — `HTTP 500 ddl_verify_failed`. + Newly added metrics are rolled back so the backend doesn't accumulate orphans; the prior + rule keeps serving every metric that wasn't being added or reshaped. + `lastApplyError` on `/runtime/rule/list` carries the failure message. +- **Persist failed** — `HTTP 500 persist_failed`. 
Local state is rolled back to + the pre-apply rule; peers self-heal within 60 s. The cluster never advanced + past the failure. +- **Persist succeeded but the local finishing step failed** — `HTTP 500 commit_deferred`. + Storage is authoritative (peers will converge), but this node will retry on + its next periodic scan. +- **Cluster routing fail-safe** — `HTTP 421 cluster_view_split` when a forwarded + request reaches a node that also doesn't believe it's the main. Wait for the + peer-list to settle (seconds) and retry. + +`GET /runtime/rule/list` is the canonical operator view of cluster state: persisted +status, per-node `localState`, and `lastApplyError` for any rule whose most recent +apply failed. There is no separate alert channel — `/list` plus the OAP log are +the entire diagnostic surface. + +## What this feature does not do + +- **OAL hot-update** is out of scope (see "Scope" above). +- **Authentication** is not built in. The admin endpoint is disabled by default; + when enabled it must be gateway-protected. See the + [API doc](../setup/backend/backend-runtime-rule-api.md) for setup guidance. +- **Bulk import.** `/dump` produces a tar.gz for backup, but restore is "extract + one file, POST it to `/addOrUpdate`". There is no single-call cluster import. +- **Rule rollback.** Storage is last-write-wins; there is no automatic + "previous version" history. Operators who need rollback should keep their + rule YAMLs in version control and re-push the desired version through + `/addOrUpdate`. +- **Across OAP-version clusters.** Different OAP binaries ship different static + rule content; the runtime entries override consistently, but unoverridden static + rules diverge along the version split. Use deployment discipline. diff --git a/docs/en/security/README.md b/docs/en/security/README.md index f2c91994b2cb..1dfb7ace34d1 100644 --- a/docs/en/security/README.md +++ b/docs/en/security/README.md @@ -24,6 +24,37 @@ Remote Code Execution (RCE) issues. 
 For some sensitive environment, consider to limit the telemetry report frequency in case of DoS/DDoS for exposed OAP and UI services. +## Runtime Rule Admin Surface (port 17128) + +The `skywalking-runtime-rule-receiver-plugin` exposes an HTTP admin API on port 17128 that +lets operators **add, override, inactivate, and delete MAL/LAL rule files at runtime** without +restarting OAP. Rules are compiled and loaded into the OAP JVM on the fly. This surface is +**far more powerful than the telemetry receiver ports** — a request can register new Javassist-compiled +bytecode, mutate `MeterSystem` state, and drop backend schema (BanyanDB measures). + +The module is **disabled by default**. Enabling it (via `SW_RECEIVER_RUNTIME_RULE=default` or +the YAML selector) opens port 17128 with **no authentication**. This is intentional for now — +the design goal is a simple admin socket that a gateway / service mesh wraps with the +operator's existing auth story. + +Required operator actions when enabling: + +1. **Never expose port 17128 to the public internet.** Bind to a private network interface or + `localhost` and reach it through an operator-controlled gateway. +2. **Gateway-protect with IP allow-list + authentication.** Only the operator team should be + able to reach the endpoint. +3. **Audit every request.** Rule content is arbitrary YAML that compiles into the OAP JVM — + a malicious rule could exfiltrate data, spike resource use, or create metric-name + collisions. Treat `POST /runtime/rule/*` as equivalent to shell access on the OAP host. +4. **Keep the port off the cluster-external interface even in cluster mode.** The + cluster-internal Suspend RPC is registered on the OAP cluster-bus gRPC server (shared with + RemoteService / HealthCheck) — that is a separate transport from 17128 and follows the + same security posture as the rest of the cluster bus.
+ +Without these protections an attacker with network reach to port 17128 can execute arbitrary +code inside the OAP JVM. See `docs/en/setup/backend/backend-runtime-rule-api.md` for the full +API surface. + ## Client-Side Monitoring Client-side applications — iOS/iPadOS apps (via OpenTelemetry Swift SDK), browser web apps diff --git a/docs/en/setup/backend/backend-health-check.md b/docs/en/setup/backend/backend-health-check.md index c717851ca16d..e98d6b6d7913 100644 --- a/docs/en/setup/backend/backend-health-check.md +++ b/docs/en/setup/backend/backend-health-check.md @@ -67,4 +67,31 @@ You may use the [grpc-health-probe](https://github.com/grpc-ecosystem/grpc-healt health of OAP gRPC services. ## CLI tool -Please follow the [CLI doc](https://github.com/apache/skywalking-cli#checkhealth) to get the health status score directly through the `checkhealth` command. + +The `swctl` CLI ships a `health` subcommand that runs the GraphQL `checkHealth` +query (and, by default, the gRPC `HealthCheck` service) and exits with a +non-zero status when the OAP is unhealthy. + +```bash +# Plain gRPC +swctl --base-url=http://OAP:12800/graphql health + +# OAP gRPC with TLS (cert verification is intentionally skipped) +swctl --base-url=http://OAP:12800/graphql health --grpcTLS=true +``` + +### Reading the response + +A healthy OAP returns the same `score: 0` envelope shown in the GraphQL +section above and the process exits 0. 
A failing run prints the GraphQL / +gRPC error and exits non-zero — straightforward to wire into a shell readiness +loop: + +```bash +if swctl --base-url=http://OAP:12800/graphql health >/dev/null 2>&1; then + echo "OAP healthy" +else + echo "OAP not healthy" + exit 1 +fi +``` diff --git a/docs/en/setup/backend/backend-runtime-rule-api.md b/docs/en/setup/backend/backend-runtime-rule-api.md new file mode 100644 index 000000000000..40243f1699e6 --- /dev/null +++ b/docs/en/setup/backend/backend-runtime-rule-api.md @@ -0,0 +1,454 @@ +# Runtime Rule Hot-Update API + +The runtime rule receiver plugin lets operators add, override, inactivate, and delete +MAL and LAL rule files at runtime without restarting OAP. Changes are saved in the +configured storage backend (JDBC, Elasticsearch, or BanyanDB) and propagate across +every node in an OAP cluster within 30 s by default. + +> For the consistency contract, the three workflows (boot / on-demand / periodic scan), +> the lifecycle, and how cluster failures are handled, see +> [Runtime Rule Hot-Update — Architecture](../../concepts-and-designs/runtime-rule-hot-update.md). +> This page focuses on the REST API surface. + +## ⚠️ Security notice + +The admin port has **no authentication** in this release. The module is therefore +**disabled by default**; enabling it opens an HTTP endpoint that can change metric and +log-processing rules while OAP is running. + +Operators enabling the module MUST: + +1. Gateway-protect the port with an IP allow-list and separate authentication rules. +2. Never expose port 17128 to the public internet. +3. Bind the HTTP server to `localhost` or a private network interface if remote access is + not required. + +## Enabling the module + +Set the selector to `default` in `application.yml` or via env var: + +```bash +SW_RECEIVER_RUNTIME_RULE=default +``` + +Default port is `17128`. 
All config knobs are in `application.yml` under the +`receiver-runtime-rule` block — host, port, periodic-scan interval, self-heal threshold. + +## HTTP surface + +`/addOrUpdate` takes the rule file as the raw request body and identifies the rule with +the `catalog` and `name` query parameters. `/inactivate` and `/delete` use the same +parameters with an empty body. There is no JSON request envelope, so shell scripts can +send a YAML file directly with `--data-binary @file.yaml`. + +### Content encoding + +Rule content is **UTF-8 YAML text**. The API never base64-encodes content. + +- Raw responses (`Content-Type: application/x-yaml; charset=utf-8`) and raw + `/addOrUpdate` request bodies carry content byte-identically. +- JSON responses encode content as a **standard JSON string** — special characters are + JSON-escaped (`"`, `\`, control characters become `\u00XX`, newlines become `\n`). A + standard JSON parser yields the original UTF-8 YAML; no additional decode step. + +Example JSON response from `GET /runtime/rule?catalog=otel-rules&name=vm` with +`Accept: application/json`: + +```json +{ + "catalog": "otel-rules", + "name": "vm", + "status": "ACTIVE", + "source": "runtime", + "contentHash": "5f9b8d3e...", + "updateTime": 1714102400000, + "content": "expSuffix: instance(['host_name','host_ip'], Layer.OS_LINUX)\nmetricPrefix: meter_vm\nmetricsRules:\n - name: cpu_total_percentage\n exp: avg_node_cpu_utilization\n" +} +``` + +The `\n` in the `content` string is the JSON escape for newline. After `JSON.parse()` +the value is the YAML body that `/addOrUpdate` would accept verbatim. + +`contentHash` is the SHA-256 of the UTF-8 content bytes (lowercase hex). It is identical +across raw and JSON modes; the JSON envelope's escaping does not affect the hash. The +same hash appears on `GET /runtime/rule/list`, where `localState` shows whether this node +has already applied that stored version. 
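
Because `contentHash` is defined as the SHA-256 of the UTF-8 content bytes in lowercase hex, it can be reproduced locally to check whether a file on disk matches what the server reports. The sketch below assumes GNU coreutils `sha256sum` is available; `content_hash` is a hypothetical helper name, not part of OAP or `swctl`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: recompute a rule's contentHash from a local YAML file.
content_hash() {
  # Hash the raw bytes from stdin so the file name never appears in the output,
  # then keep only the first field (the lowercase hex digest).
  sha256sum < "$1" | cut -d ' ' -f 1
}
```

The digest covers the exact bytes, so the local file must be byte-identical to the posted body. A differing trailing newline or line-ending style produces a different hash.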
+ +`/addOrUpdate` decodes the request body as UTF-8 regardless of the `Content-Type` header. +Send valid UTF-8 YAML. If the decoded content cannot be parsed or compiled as a rule, the +server returns `400 compile_failed`. + +### Canonical routes + +**Write endpoints** + +| Method | Path | Body | Effect | +|--------|------------------------------------------------------------------------------------|---------------|--------| +| POST | `/runtime/rule/addOrUpdate?catalog=&name=[&allowStorageChange=true][&force=true]` | raw rule YAML | Creates or replaces a rule. Edits that keep the same metric storage shape are applied without pausing the cluster. Edits that add, remove, or reshape metrics pause affected traffic, update and verify backend storage, save the rule, and then resume. If the posted content exactly matches the current `ACTIVE` rule, the server returns `no_change`; `force=true` skips that shortcut for recovery. | +| POST | `/runtime/rule/inactivate?catalog=&name=` | empty | Soft-pauses a rule. OAP stops using the rule and saves it as `INACTIVE`, while the backend measure and historical data remain available for reactivation. | +| POST | `/runtime/rule/delete?catalog=&name=[&mode=revertToBundled]` | empty | Removes an `INACTIVE` runtime row. Active rules return `409 requires_inactivate_first`. **No bundled twin on disk** → destructive: backend resource is dropped and the rule is fully gone. **Bundled twin on disk** → non-destructive: backend is preserved (bundled will reuse it), the row is removed, and the bundled rule is reinstalled into a `static:` loader on the local node. Peers converge via the gone-keys reconcile path on their next tick. `?mode=revertToBundled` is an explicit operator hint that requires a bundled twin (returns `400 no_bundled_twin` when none exists) — useful for scripts that want to fail loudly if their assumption was wrong. 
The OAP-side teardown (cluster-wide unparking, dispatcher / worker / catalog / model removal, stored-rule removal) is uniform; the **storage-side** effect is per-backend (see below). | + +**Read endpoints** + +| Method | Path | Effect | +|--------|-------------------------------------------------------------------------------------|-----------------------------------------------------------------------------| +| GET | `/runtime/rule?catalog=&name=` | Fetches one rule. Runtime rule first, bundled rule second, otherwise 404. Raw YAML by default; JSON envelope on `Accept: application/json`. Supports `ETag` and `If-None-Match`. | +| GET | `/runtime/rule/bundled?catalog=&withContent=false` | Returns bundled rules for one catalog as JSON. `withContent` defaults to true; `false` omits each YAML body. Each item includes whether an operator override exists. | +| GET | `/runtime/rule/list[?catalog=]` | Returns a single JSON envelope `{generatedAt, loaderStats, rules}` merged from stored rules and this node's local state. Each row carries `loaderKind`, `loaderName`, `bundled`, and `bundledContentHash` so a UI can render override badges without a second roundtrip. Optional `catalog=` narrows the output; unknown values return `400 invalid_catalog`. | +| GET | `/runtime/rule/dump[/]` | Downloads a tar.gz of stored runtime rules plus `manifest.yaml`. The server has no bulk import endpoint; the CLI restore command replays individual `addOrUpdate` and `inactivate` calls. | + +### `/delete` storage semantics — per backend + +`/delete` always tears down the rule on the OAP side: the cluster unparks the affected +dispatchers, removes the workers, drops the model from the in-memory registry, removes +the stored rule, and the rule no longer appears in `/runtime/rule/list`. What happens to +the **on-disk data** depends on the storage plugin: + +| Backend | After `/delete` | Old data still queryable? 
| +|---|---|---| +| **BanyanDB** | The measure / stream group + schema are dropped (`dropMeasure` / `dropStream`). | No — rows are gone. | +| **Elasticsearch** | `dropTable` is a documented **no-op**. The merging index (e.g. `metrics-all`) and any per-metric index stay. | Yes — historical samples remain in place until TTL expires. | +| **JDBC (H2 / MySQL / PostgreSQL / TiDB / OceanBase)** | `dropTable` is a documented **no-op**. The merging table (e.g. `meter_sum_`) stays. | Yes — historical samples remain in place until TTL expires. | + +The ES / JDBC behaviour is intentional and consistent with how the static catalog treats +table lifecycle on those backends: tables are append-only, and TTL — not DDL — reclaims +space. If you need the data gone immediately, drop the table out-of-band with the storage +backend's own tools after `/delete` returns. + +A re-`addOrUpdate` of the same rule (same name, same scope and downsampling) replays +schema registration. On BanyanDB this re-creates the measure; on ES / JDBC this is a +no-op against the existing index / table. In both cases new samples land alongside any +retained history. + +### Catalog shortcut routes + +Implicit catalog in the path — useful when scripting against a single catalog: + +- `/runtime/mal/otel/{addOrUpdate,inactivate,delete}` → `catalog=otel-rules` +- `/runtime/mal/log/{addOrUpdate,inactivate,delete}` → `catalog=log-mal-rules` +- `/runtime/lal/{addOrUpdate,inactivate,delete}` → `catalog=lal` + +`telegraf-rules` is supported by the canonical `/runtime/rule/...` routes; it does not +currently have a shortcut route. + +### Valid catalogs + names + +| Catalog | What it holds | +|---|---| +| `otel-rules` | OTEL MAL rule YAML files | +| `log-mal-rules` | Log-derived MAL rule YAML files | +| `telegraf-rules` | Telegraf MAL rule YAML files | +| `lal` | LAL rule YAML files | + +Rule `name` mirrors the static filesystem layout — a relative path under the catalog root +without extension. 
Segments match `[A-Za-z0-9._-]+`, joined by `/`. No leading slash, +no `..`, no empty segments, no backslash. Examples: `nginx`, `aws-gateway/gateway-service`, +`k8s/node`. + +### `allowStorageChange` parameter + +`/addOrUpdate` (and the three catalog shortcut variants) accept an optional +`allowStorageChange` query parameter. Default is **false** when absent. + +The server rejects any update that would move storage identity unless this flag is set: + +- **MAL**: scope type change (`service(...)` → `instance(...)`), explicit downsampling + function change (`.downsampling(SUM)` → `.downsampling(MAX)`), switching between single / + labeled / histogram variants. +- **LAL**: changing `outputType` on any rule, adding or removing a rule key within a file. + +These are the edits that drop an existing BanyanDB measure's data, change how new samples +are stored, or leave old rows behind on JDBC / Elasticsearch. Body, filter, and tag tweaks +that preserve each metric's storage identity are always accepted and do not reset alarm +windows. + +Accepted truthy forms (case-insensitive): `true`, `1`, `yes`. Anything else is treated as +false. + +> **Recommendation — avoid storage-wipe edits in production.** Passing +> `allowStorageChange=true` drops the existing measure's data on BanyanDB and orphans the +> old rows on JDBC / Elasticsearch; any alarm rules, dashboards, and historical queries +> that reference the old shape will miss the pre-change window. Unless the data loss is +> understood and intended — typically only on a staging cluster or during a planned +> schema migration — leave the flag off. Prefer a rename (new metric name, new rule +> name) so the old data keeps accumulating until TTL and the new data starts fresh under +> a clean identity. Treat the flag as an explicit "I accept data loss" affirmation, not a +> convenience toggle. 
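
The rule-name constraints and the truthy forms for `allowStorageChange` described above can be mirrored in a client-side pre-flight check before calling the API. This is a sketch only: `valid_rule_name` and `is_truthy` are hypothetical helper names, and the server's own validation may differ in edge cases (for example, this sketch rejects only a segment that is exactly `..`).

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; the OAP server remains the authority.

# Succeeds when the name is one or more [A-Za-z0-9._-]+ segments joined by "/",
# with no leading slash, no empty segment, no backslash, and no ".." segment.
valid_rule_name() {
  local name=$1
  local re='^[A-Za-z0-9._-]+(/[A-Za-z0-9._-]+)*$'
  [[ $name =~ $re ]] || return 1
  case "/$name/" in
    */../*) return 1 ;;   # a segment that is exactly ".."
  esac
  return 0
}

# Succeeds for the documented truthy forms (case-insensitive): true, 1, yes.
is_truthy() {
  local value
  value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$value" in
    true|1|yes) return 0 ;;
    *) return 1 ;;
  esac
}
```

A script might call `valid_rule_name "$name" || exit 1` before building the request URL, so that a malformed name fails locally rather than as a `400 invalid_*` response.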
+ +### Recovery from a failed apply + +When an `/addOrUpdate` fails during validation or apply, the node does **not** lose the +previous rule version. The pre-change rule keeps serving every metric that was not being +changed, and the response includes an `applyStatus` explaining the failure. + +**What to expect during a failure:** + +- The node keeps serving the prior rule for **unchanged** metrics. Samples continue + flowing to the existing measures; dashboards and alarm rules against those metrics keep + working. +- Metrics that were **newly added** by the failed attempt are rolled back (no orphan + measures left on BanyanDB). +- Metrics in the **storage-changing** set — where the rule changed a metric's function or scope + and was allowed through with `allowStorageChange=true` — are lost. The old measure was + removed before the new one attempted registration; a mid-flight failure leaves neither. + This is the documented cost of `allowStorageChange`. +- `/runtime/rule/list` reports the rule's `lastApplyError` so the failure is visible + without tailing logs. +- Severe backend or apply failures also write an `ERROR` log line naming the catalog, rule, + and reason. +- Peers either self-heal back to running on the old content (if the row was never + committed) or retry the same broken content on their next periodic scan and fail the same + way (if the row did advance). Either way they never serve samples against a moved schema + while the main's apply was in flight. + +**Recovery flow:** + +1. **Inspect.** `curl -s "http://OAP:17128/runtime/rule/list" | jq '.rules[] | select((.lastApplyError // "") != "")'`. Rules live in the `rules` array of the list envelope, and a healthy row carries an empty `lastApplyError`, so the filter keeps only rows with a non-empty error. Confirm + which rule is degraded and read the error message. +2. **Diagnose.** Check the OAP log when the list output is not enough. Typical causes: + - Rule syntax or parse error — fix the YAML and re-push via `/addOrUpdate`.
+ - Storage schema moved without the guardrail — re-push with + `?allowStorageChange=true&force=true` (see below), or rename the metric so the old + measure keeps accumulating until TTL and new data flows under a new identity. + - Backend unavailable during a schema update — retry once the backend is healthy; the + next periodic scan will also retry without operator action. +3. **Apply the fix.** Two options: + + **Option A: re-push via `/runtime/rule/addOrUpdate` with the recovery flags.** + + ```bash + curl -X POST --data-binary @vm-previous-known-good.yaml \ + "http://OAP:17128/runtime/rule/addOrUpdate?catalog=otel-rules&name=vm&allowStorageChange=true&force=true" + ``` + + Two flags layer on top of the regular addOrUpdate: + - `allowStorageChange=true` — accepts shape-breaking edits the guardrail would otherwise + reject with 409. + - `force=true` — bypasses the same-content `no_change` HTTP shortcut so a re-post of + known-good bytes is treated as a fresh apply request. The persisted row (if any) is + re-written and any peers stuck mid-Suspend are re-Resumed; **schema and dispatch + handlers are not rebuilt** — the in-memory engine state is content-keyed, so a true + no-op against a healthy node remains a no-op even with `force=true`. This is the + unstick path for "the prior push raced a transient backend failure and a peer is + still suspended"; if you need the engine to recompile (e.g., after manually editing + a backend table), re-post the content with a single character change first, then the + real content. + Combine both when the recovery target re-shapes the measure. Same failure modes as a + normal `/addOrUpdate` — bad rule content still fails `400 compile_failed` and the prior rule + keeps serving. 
+ + **Option B: Manual restore from a prior `/dump` tarball.** If you have a dump taken + before the broken push, extract the specific file and re-post it with the recovery + flags: + + ```bash + # assuming runtime-rule-dump-2026-04-22.tar.gz was taken before the broken push. + # Archive entries are under runtime-rule-dump//.yaml. + tar -xzf runtime-rule-dump-2026-04-22.tar.gz runtime-rule-dump/otel-rules/vm.yaml + curl -X POST --data-binary @runtime-rule-dump/otel-rules/vm.yaml \ + "http://OAP:17128/runtime/rule/addOrUpdate?catalog=otel-rules&name=vm&allowStorageChange=true&force=true" + ``` + +4. **Verify.** Re-run `list` and confirm `lastApplyError` is cleared and `localState` is + `RUNNING`. Watch the OAP log for the apply-OK confirmation. +5. **(Best practice)** Take a fresh `/runtime/rule/dump` immediately after a successful + recovery so the new baseline is captured for any future incident. + +**What the recovery flags do NOT do:** + +- They do not roll the rule content back to a previous version automatically. Runtime-rule + storage is last-write-wins; `/addOrUpdate` (with or without `force=true`) is a write + path, not a rollback path. The operator supplies the content to restore. +- They do not bypass rule compile errors. If the content is syntactically invalid, the node + returns `400 compile_failed` whether or not `force=true` is set. The flags accept + storage-level changes the guardrail would block and re-drive a stuck same-content + shortcut; they do not accept broken rule content. + +### Response codes + +Write endpoints return JSON: `{applyStatus, catalog, name, message}`. Read endpoints use +the response formats listed above; their error responses use the same JSON shape. 
+ +**Success** + +| Status | `applyStatus` | Meaning | +|---------------|----------------------------|--------------------------------------------------------------------------------------------------------| +| 200 OK | `no_change` | content byte-identical to current row; nothing to do | +| 200 OK | `filter_only_applied` | body / filter edits applied via fast path; no backend storage change, no alarm reset | +| 200 OK | `structural_applied` | storage-changing edit applied: cluster pause, backend update and check, persist, cluster resume all succeeded | +| 200 OK | `inactivated` | row flipped to `INACTIVE`; backend measure and data preserved | +| 200 OK | `static_tombstoned` | `/inactivate` against a rule that exists only on disk; an `INACTIVE` tombstone row is now persisted | +| 200 OK | `already_inactive` | `/inactivate` against an already-inactive row; idempotent no-op | +| 200 OK | `deleted` | row hard-deleted; backend measure dropped (MAL) or in-process handlers removed (LAL) | +| 200 OK | `not_found` | `/inactivate` or `/delete` against an absent rule; idempotent no-op | +| 200 OK | `filter_only_persisted` | row persisted but the in-memory swap threw on this node; converges on the next periodic scan | + +**Client error — caller has to act** + +| Status | `applyStatus` | Meaning | +|-------------------|-----------------------------------------------|------------------------------------------------------------------------------------------------------------------------| +| 400 Bad Request | `compile_failed`, `empty_body`, `invalid_*` | rule parse failure or request validation failure; row was NOT persisted | +| 409 Conflict | `storage_change_requires_explicit_approval` | update would move storage identity and `allowStorageChange` was not set — no cluster pause, no persist, no side effects | +| 409 Conflict | `update_in_progress` | another apply is already in flight for this rule; retry after a few seconds | +| 409 Conflict | `requires_inactivate_first` | 
`/delete` against an `ACTIVE` row; run `/inactivate` first, then `/delete` | +| 503 Service Unavailable | `storage_unavailable` | storage could not be read while checking the current rule; retry when storage is healthy | + +**Cluster-routing errors — usually transient** + +| Status | `applyStatus` | Meaning | +|--------------------------------|--------------------------|--------------------------------------------------------------------------------------------------------------------------| +| 409 Conflict | `origin_conflict` | a peer rejected the cluster pause because it was already running its own apply (split-brain); the loser aborts and resumes | +| 409 Conflict | `split_brain_detected` | this node detected a competing main during the cluster pause; aborted and broadcast a resume | +| 421 Misdirected Request | `cluster_view_split` | the receiving node's peer-list disagreed with the sender's; refused to re-forward. Wait a few seconds for the peer-list to settle, then retry | +| 502 Bad Gateway | `forward_failed` | could not reach the cluster main to forward the request; transport error message in `message` | + +**Server error — apply or persist failed** + +| Status | `applyStatus` | Meaning | +|---------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------| +| 500 Internal Server Error | `ddl_verify_failed` | backend storage was changed but the post-apply check rejected the new shape; new metrics rolled back, prior rule preserved | +| 500 Internal Server Error | `apply_failed` | server failed while applying the rule; partial changes rolled back, prior rule preserved | +| 500 Internal Server Error | `persist_failed` | row write failed; on filter-only this node still serves the pre-edit rule, on structural the local node rolled back and resumed peers | +| 500 Internal Server Error | `commit_deferred` | apply succeeded and row was persisted, 
but the local finishing step failed on this node. Storage is authoritative and peers will converge; this node will retry on its next periodic scan | +| 500 Internal Server Error | `teardown_deferred` | row was inactivated, but local cleanup failed; this node retries on the next periodic scan | +| 500 Internal Server Error | `dao_unavailable`, `inactivate_failed`, `delete_backend_drop_failed`, `delete_failed`, other `*_failed` | management storage or backend cleanup failed; no destructive row removal is completed unless the backend cleanup succeeded | + +## Per-node list output + +`GET /runtime/rule/list` returns a single JSON envelope: + +```json +{ + "generatedAt": 1730000000000, + "loaderStats": { "active": 27, "pending": 0 }, + "rules": [ + { + "catalog": "otel-rules", + "name": "vm", + "status": "ACTIVE", + "localState": "RUNNING", + "suspendOrigin": "NONE", + "loaderGc": "LIVE", + "loaderKind": "RUNTIME", + "loaderName": "runtime-rule:otel-rules/vm@0428-153042", + "contentHash": "7c3a…", + "bundled": true, + "bundledContentHash": "c3d4…", + "updateTime": 1730000000000, + "lastApplyError": "" + } + ] +} +``` + +Bundled-only rows (no operator override) and recently deleted rows omit fields that +do not exist in storage, such as `updateTime` and `lastApplyError`. UI clients call +`fetch().json()` once; operators can `jq '.rules[]'` for line-oriented inspection. + +### Rule status by source (bundled vs. runtime) + +The combination of `status`, `loaderKind`, and `bundled` tells you which copy of a rule +the OAP is actually serving on this node. Reading these three fields together: + +| Operator action history | `status` | `loaderKind` | `bundled` | What is serving | +|---|---|---|---|---| +| Bundled rule shipped on disk; operator never touched it | `BUNDLED` | `NONE` | `true` | Bundled YAML, served from the OAP's shared default classloader (registered at boot by the catalog loaders). 
| +| Operator pushed `/addOrUpdate` overriding a bundled rule | `ACTIVE` | `RUNTIME` | `true` | Runtime override in a per-file `runtime-rule:` loader. Compare `contentHash` with `bundledContentHash` to detect drift. | +| Operator pushed `/addOrUpdate` for a brand-new rule (no bundled twin) | `ACTIVE` | `RUNTIME` | `false` | Runtime override in a per-file `runtime-rule:` loader. No bundled fallback. | +| Operator `/inactivate`d a runtime override of a bundled rule | `INACTIVE` | `NONE` | `true` | Nothing — handlers are unregistered. The bundled rule does **not** auto-resurrect; to turn it back on, push `/addOrUpdate` (with the bundled YAML or your own) or call `/delete` (which reverts to bundled). | +| Operator `/inactivate`d a bundled-only rule | `INACTIVE` | `NONE` | `true` | Nothing — same as above. The `INACTIVE` row is a tombstone carrying the bundled YAML at inactivate-time. | +| Operator `/inactivate`d a brand-new runtime rule | `INACTIVE` | `NONE` | `false` | Nothing — handlers gone. To turn back on: `/addOrUpdate` (with new content) or `/delete` (rule is fully gone). | +| `/delete` propagating after a bundled-twin row was removed | `n/a` (no row) | `STATIC` | `true` | Bundled rule, freshly compiled into a `static:` loader. Equivalent to a fresh boot of bundled. | + +Quick decision rules for an operator reading `/list`: + +- `status=BUNDLED` → comes from disk only. +- `status=ACTIVE` + `bundled=true` + `contentHash != bundledContentHash` → runtime override is *modified* relative to bundled. UIs typically render this as "Override (modified)". +- `status=ACTIVE` + `bundled=true` + `contentHash == bundledContentHash` → runtime override matches bundled. UIs typically render this as "Override (matches bundled)" — common after an explicit `/addOrUpdate ?source=bundled` revert. +- `status=ACTIVE` + `bundled=false` → runtime-only rule, no on-disk twin. +- `status=INACTIVE` → soft-paused. 
The DAO row preserves the content the operator last had; `/list` does not surface it (call `GET /runtime/rule` for the YAML). +- `loaderKind=STATIC` → a `static:` loader is currently serving (transient, between `/delete` and the next clean state). +- `loaderKind=NONE` → no per-file loader. For `BUNDLED` this is normal (shared default loader). For `INACTIVE` this is the rule being off. + +- `status` — `ACTIVE` or `INACTIVE` for stored rows. `BUNDLED` and `n/a` are synthesized + list values: + - `BUNDLED` — shipped on disk, no operator override. Healthy steady state; no runtime + row exists. + - `n/a` — transient: a runtime row was just removed and this node hasn't swept it yet. + Cleared on the next periodic scan. +- `localState` — per-node transient: `RUNNING` | `SUSPENDED` | `NOT_LOADED`. Distinct from + `status`; a node mid-structural-apply is `ACTIVE` + `SUSPENDED`. After `/inactivate`, + `localState` is `NOT_LOADED` regardless of whether a bundled twin exists on disk — + `/inactivate` is a soft-pause that respects the operator's "off" intent. Bundled + fall-over only fires on `/delete` (default mode for a bundled-twin row) or the gone-keys + reconcile path. +- `suspendOrigin` — when `localState=SUSPENDED`, who paused this node: + - `SELF` — this node is running its own apply. + - `PEER` — the cluster main paused this node for its apply. + - `BOTH` — should not appear under correct routing; presence signals a transient + split-brain that clears via the normal handshake or the 60 s self-heal. +- `loaderGc` — diagnostic indicator showing whether the per-rule isolation has been retired + for this rule and (if so) whether the JVM has reclaimed it. Operators normally don't + need to act on this; a value other than `LIVE` for an `ACTIVE` row would suggest a + rule cleanup issue worth investigating. +- `loaderKind` — origin of the per-file class loader currently serving this rule: + - `RUNTIME` — operator-pushed runtime override. 
+ - `STATIC` — bundled rule serving via static fall-over (a runtime override was previously + in place, then removed; the bundled YAML was reloaded into a fresh `static:` loader). + - `NONE` — no per-file loader (typical for bundled-only rules served from the shared + default loader; also a row whose loader has been retired but not yet replaced). +- `loaderName` — formatted loader name (`:/@`), the same + string the JVM surfaces in stack traces and the loader graveyard's INFO/WARN log lines. + Empty when `loaderKind` is `NONE`. +- `contentHash` — SHA-256 of the stored content for runtime rows, or the local content for + bundled-only and recently deleted rows. Matching hashes plus `localState=RUNNING` mean two + nodes are serving the same content for that rule. +- `bundled` — `true` when a bundled YAML exists on disk for `(catalog, name)`. Set on every + row regardless of status, so a UI can render an "Override" / "Modified from bundled" + badge by comparing `contentHash` with `bundledContentHash`. +- `bundledContentHash` — SHA-256 of the bundled YAML, present only when `bundled=true`. + A diff between `contentHash` and `bundledContentHash` indicates a runtime override that + has drifted from the bundled rule. +- `lastApplyError` — most recent local apply error. Empty when the last apply succeeded, + no attempt has been made, or the rule was inactivated (the inactive path clears stale + errors so `/list` doesn't surface an error against a rule that is already down). +- `pendingUnregister` — only set for `status=n/a` entries; the row was just deleted and + teardown is scheduled for the next periodic scan. + +The `loaderStats` envelope counter exposes process-wide DSL classloader bookkeeping — +`active` is the number of rules currently served by per-file loaders, `pending` is the +number of retired loaders the JVM has not yet collected. A steadily elevated `pending` +across many polls is the leak signal the OAP also surfaces as a WARN log line. 
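
The field descriptions above can be turned into quick `jq` filters over the list envelope. The two helpers below are a sketch under stated assumptions: the names `failed_rules` and `drifted_overrides` are hypothetical, `jq` must be installed, and the input is the `{generatedAt, loaderStats, rules}` envelope shown earlier.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; each reads a /runtime/rule/list envelope on stdin.

# Rules whose most recent local apply failed (non-empty lastApplyError).
# Rows that omit the field are treated the same as an empty string.
failed_rules() {
  jq -r '.rules[]
         | select((.lastApplyError // "") != "")
         | "\(.catalog)/\(.name): \(.lastApplyError)"'
}

# ACTIVE runtime overrides whose content differs from the bundled twin.
drifted_overrides() {
  jq -r '.rules[]
         | select(.status == "ACTIVE" and .bundled == true
                  and .contentHash != .bundledContentHash)
         | "\(.catalog)/\(.name)"'
}

# Example:
#   curl -s "http://OAP:17128/runtime/rule/list" | failed_rules
```

Both helpers print one `catalog/name` entry per line, which keeps the output easy to feed into further shell tooling.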
+ +### Reading a single rule's content — `GET /runtime/rule` + +`GET /runtime/rule?catalog=…&name=…` returns the YAML body for one rule. By default +(`source=runtime` or omitted) the runtime row wins — bundled YAML is returned only when no +runtime row exists. Pass `?source=bundled` to read the bundled YAML even when a runtime +override is in place; the response 404s with `not_found` when the rule has no bundled twin. + +This makes the "compare runtime override against bundled" workflow a two-call sequence: +fetch the runtime body with the default request, then fetch the bundled body with +`?source=bundled` and diff in the editor. `POST /runtime/rule/delete` drops the runtime +override; the next `/list` will show the row served by the bundled fall-over +(`loaderKind=STATIC`). + +## Consistency model — at a glance + +The full contract is in the +[architecture doc](../../concepts-and-designs/runtime-rule-hot-update.md#the-consistency-contract). +The headline: + +- **Persist is commit.** Once `/addOrUpdate` returns 200, the cluster will converge on + that content. +- **Last write wins.** Concurrent writes to different nodes serialize on the cluster main; + the second write wins. The losing operator gets `409 split_brain_detected` if the cluster + detected the race; otherwise both operators see 200 and the second commit's content is + what every node ends up running. +- **Bounded convergence.** Healthy structural commits land cluster-wide within 30 s + (one periodic scan). Aborted commits self-heal within 60 s. Filter-only edits land + locally in milliseconds and on every other node within 30 s. +- **No quorum, no leader election, no two-phase commit.** The runtime-rule entry in + storage is the single source of truth. +- **Samples for an affected metric are dropped during a structural cutover.** This is by + design — the schema is moving and in-flight samples have no valid landing. 
diff --git a/docs/menu.yml b/docs/menu.yml index f806393dee1c..94cd2e8e23d5 100644 --- a/docs/menu.yml +++ b/docs/menu.yml @@ -390,6 +390,8 @@ catalog: path: "/en/concepts-and-designs/mal" - name: "Analysis Logs" path: "/en/concepts-and-designs/lal" + - name: "Runtime Rule Hot-Update" + path: "/en/concepts-and-designs/runtime-rule-hot-update" - name: "Profiling" path: "/en/concepts-and-designs/profiling" - name: "Service Hierarchy Configuration" @@ -400,6 +402,8 @@ catalog: catalog: - name: "Dynamic Code Generation and Debugging" path: "/en/operation/dynamic-code-generation-debugging" + - name: "Runtime Rule Hot-Update" + path: "/en/setup/backend/backend-runtime-rule-api" - name: "Security Notice" path: "/en/security/readme" - name: "Academy" diff --git a/oap-server-bom/pom.xml b/oap-server-bom/pom.xml index b03e64845e04..06b1fa3cf56b 100644 --- a/oap-server-bom/pom.xml +++ b/oap-server-bom/pom.xml @@ -40,8 +40,8 @@ 3.5.7 32.0.1-jre 2.0 - 3.25.5 - 3.25.5 + 4.33.1 + 4.33.1 1.11 3.18.0 2.17.0 diff --git a/oap-server/ai-pipeline/pom.xml b/oap-server/ai-pipeline/pom.xml index 4ec64c5f9933..ea14d0ea97b3 100644 --- a/oap-server/ai-pipeline/pom.xml +++ b/oap-server/ai-pipeline/pom.xml @@ -93,11 +93,11 @@ protobuf-java version that grpc depends on. 
--> - com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/oap-server/analyzer/agent-analyzer/src/test/java/org/apache/skywalking/oap/server/analyzer/provider/meter/process/MeterProcessorTest.java b/oap-server/analyzer/agent-analyzer/src/test/java/org/apache/skywalking/oap/server/analyzer/provider/meter/process/MeterProcessorTest.java index 8d9e94f7399e..e0248be718e4 100644 --- a/oap-server/analyzer/agent-analyzer/src/test/java/org/apache/skywalking/oap/server/analyzer/provider/meter/process/MeterProcessorTest.java +++ b/oap-server/analyzer/agent-analyzer/src/test/java/org/apache/skywalking/oap/server/analyzer/provider/meter/process/MeterProcessorTest.java @@ -88,7 +88,9 @@ public void setup() throws StorageException, ModuleStartException { "PROCESSOR", mockProcessor ); - doNothing().when(mockProcessor).create(any(), (StreamDefinition) any(), any()); + // MetricsStreamProcessor.create now takes a StorageManipulationOpt on every path so + // the shape-mismatch gate at the installer level can surface to stream registration. 
+ doNothing().when(mockProcessor).create(any(), (StreamDefinition) any(), any(), any()); final MeterProcessService processService = new MeterProcessService(moduleManager); List config = MeterConfigs.loadConfig("meter-analyzer-config", Arrays.asList("config")); processService.start(config); diff --git a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/compiler/LALClassGenerator.java b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/compiler/LALClassGenerator.java index 59854cc96198..3d55c3cf6159 100644 --- a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/compiler/LALClassGenerator.java +++ b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/compiler/LALClassGenerator.java @@ -20,10 +20,14 @@ import java.io.DataOutputStream; import java.io.File; import java.io.FileOutputStream; +import java.io.IOException; import java.util.ArrayList; +import java.util.Collections; import java.util.HashMap; +import java.util.HashSet; import java.util.List; import java.util.Map; +import java.util.Set; import java.util.concurrent.atomic.AtomicInteger; import javassist.ClassPool; import javassist.CtClass; @@ -31,6 +35,8 @@ import javassist.CtNewMethod; import lombok.extern.slf4j.Slf4j; import org.apache.skywalking.oap.log.analyzer.v2.compiler.rt.LalExpressionPackageHolder; +import org.apache.skywalking.oap.server.core.classloader.BytecodeClassDefiner; +import org.apache.skywalking.oap.server.core.source.LogBuilder; import org.apache.skywalking.oap.log.analyzer.v2.dsl.LalExpression; import org.apache.skywalking.oap.server.core.WorkPath; import org.apache.skywalking.oap.server.library.util.StringUtil; @@ -61,10 +67,18 @@ public final class LALClassGenerator { private static final String H = "org.apache.skywalking.oap.log.analyzer.v2.compiler.rt.LalRuntimeHelper"; - private static final java.util.Set USED_CLASS_NAMES = - 
java.util.Collections.synchronizedSet(new java.util.HashSet<>()); + private static final Set USED_CLASS_NAMES = + Collections.synchronizedSet(new HashSet<>()); private final ClassPool classPool; + /** + * When non-null, generated LAL classes are defined in this ClassLoader via + * {@code ctClass.toClass(loader, null)} — used by the runtime-rule hot-update path so one + * YAML file's full LAL class family lives in a single per-file {@code RuleClassLoader} and + * drops together on unregister. Null = legacy startup path: uses the neighbor-class form + * with {@link LalExpressionPackageHolder} so classes land in the OAP app loader. + */ + private final ClassLoader targetClassLoader; private File classOutputDir; private String classNameHint; private Class inputType; @@ -187,14 +201,25 @@ void restoreProtoVarState(final Object[] state) { } public LALClassGenerator() { - this(ClassPool.getDefault()); + this(ClassPool.getDefault(), null); if (StringUtil.isNotEmpty(System.getenv("SW_DYNAMIC_CLASS_ENGINE_DEBUG"))) { classOutputDir = new File(WorkPath.getPath().getParentFile(), "lal-rt"); } } public LALClassGenerator(final ClassPool classPool) { + this(classPool, null); + } + + /** + * Runtime-rule constructor: caller supplies the per-file {@link ClassPool} (already scoped + * to a per-file {@code RuleClassLoader} via {@code LoaderClassPath}) and the target + * {@link ClassLoader}. Every class this generator emits will be loaded into + * {@code targetClassLoader} rather than the OAP app loader. 
+ */ + public LALClassGenerator(final ClassPool classPool, final ClassLoader targetClassLoader) { this.classPool = classPool; + this.targetClassLoader = targetClassLoader; } public void setClassOutputDir(final File dir) { @@ -255,6 +280,15 @@ private String buildHintedName() { } private String dedupClassName(final String base) { + // Runtime-rule hot-update path: every apply gets a fresh per-file RuleClassLoader, so + // two applies of the same rule can safely carry the same generated class name — they live + // in different classloader namespaces. Skip the process-wide dedup set to keep it from + // growing without bound over thousands of hot-updates. The legacy startup path + // (targetClassLoader == null) still needs dedup because it defines classes into the + // shared app loader via LalExpressionPackageHolder. + if (targetClassLoader != null) { + return base; + } if (USED_CLASS_NAMES.add(base)) { return base; } @@ -439,7 +473,7 @@ public LalExpression compileFromModel(final LALScriptModel model) throws Excepti final ParserType parserType = detectParserType(model.getStatements()); final Class resolvedOutput = this.outputType != null ? this.outputType - : org.apache.skywalking.oap.server.core.source.LogBuilder.class; + : LogBuilder.class; // inputType is only meaningful for parser-less rules (NONE) where parsed.* // generates direct proto getter calls.
When a parser is present (json/yaml/text), // parsed.* reads from the parsed map and tag() reads from LogData.Builder tags, @@ -501,11 +535,40 @@ public LalExpression compileFromModel(final LALScriptModel model) throws Excepti writeClassFile(ctClass); - final Class clazz = ctClass.toClass(LalExpressionPackageHolder.class); + final Class clazz = defineClass(ctClass); ctClass.detach(); return (LalExpression) clazz.getDeclaredConstructor().newInstance(); } + /** + * Loads a generated class through the configured {@link #targetClassLoader} when set + * (runtime-rule hot-update path: class lands in the per-file {@code RuleClassLoader}), + * or via the neighbor-class form when {@code targetClassLoader} is {@code null} + * (startup path: class lands in the OAP app loader alongside + * {@link LalExpressionPackageHolder}). + * + *

{@link BytecodeClassDefiner} loaders (the runtime-rule {@code RuleClassLoader}) + * receive the {@code CtClass.toBytecode()} bytes via their public {@code defineClass} + * — bypasses Javassist's deprecated {@code toClass(loader, ProtectionDomain)} reflection + * path so we don't need {@code --add-opens java.base/java.lang} on the OAP container. + * Same shape as {@code MALClassGenerator}; both DSLs share the contract. + */ + private Class defineClass(final CtClass ctClass) throws javassist.CannotCompileException { + if (targetClassLoader != null) { + if (targetClassLoader instanceof BytecodeClassDefiner) { + try { + return ((BytecodeClassDefiner) targetClassLoader) + .defineClass(ctClass.getName(), ctClass.toBytecode()); + } catch (final IOException e) { + throw new javassist.CannotCompileException( + "failed to serialise " + ctClass.getName() + " bytes", e); + } + } + return ctClass.toClass(targetClassLoader, null); + } + return ctClass.toClass(LalExpressionPackageHolder.class); + } + private static boolean hasParsedAccess( final List stmts) { for (final LALScriptModel.FilterStatement stmt : stmts) { @@ -629,7 +692,7 @@ public String generateSource(final String dsl) { final LALScriptModel model = LALScriptParser.parse(dsl); final Class resolvedOutput = this.outputType != null ? this.outputType - : org.apache.skywalking.oap.server.core.source.LogBuilder.class; + : LogBuilder.class; final ParserType pt = detectParserType(model.getStatements()); final GenCtx genCtx = new GenCtx( pt, pt == ParserType.NONE ? 
this.inputType : null, resolvedOutput); diff --git a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/dsl/DSL.java b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/dsl/DSL.java index b279744a4ceb..ad850c216a4e 100644 --- a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/dsl/DSL.java +++ b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/dsl/DSL.java @@ -17,6 +17,7 @@ package org.apache.skywalking.oap.log.analyzer.v2.dsl; +import javassist.ClassPool; import lombok.AccessLevel; import lombok.RequiredArgsConstructor; import lombok.extern.slf4j.Slf4j; @@ -65,8 +66,32 @@ public static DSL of(final ModuleManager moduleManager, final Class outputType, final String ruleName, final String yamlSource) throws ModuleStartException { + return of(moduleManager, config, dsl, inputType, outputType, ruleName, + yamlSource, null, null); + } + + /** + * Runtime-rule overload: compile with a per-file {@link ClassPool} and target + * {@link ClassLoader}. The generated {@code LalExpression} class is defined in the + * supplied loader instead of the shared OAP app loader. The caller-supplied pool must + * already be scoped to the loader via {@code appendClassPath(new LoaderClassPath(loader))}. + * + *

When both {@code pool} and {@code targetClassLoader} are null this uses the legacy + * default pool + app loader — startup path, unchanged. + */ + public static DSL of(final ModuleManager moduleManager, + final LogAnalyzerModuleConfig config, + final String dsl, + final Class inputType, + final Class outputType, + final String ruleName, + final String yamlSource, + final ClassPool pool, + final ClassLoader targetClassLoader) throws ModuleStartException { try { - final LALClassGenerator generator = new LALClassGenerator(); + final LALClassGenerator generator = (pool != null && targetClassLoader != null) + ? new LALClassGenerator(pool, targetClassLoader) + : new LALClassGenerator(); generator.setInputType(inputType); generator.setOutputType(outputType); generator.setClassNameHint(ruleName); diff --git a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/dsl/spec/extractor/MetricExtractor.java b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/dsl/spec/extractor/MetricExtractor.java index ffcfeef1d2a4..2ff07fff85fa 100644 --- a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/dsl/spec/extractor/MetricExtractor.java +++ b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/dsl/spec/extractor/MetricExtractor.java @@ -30,7 +30,6 @@ import org.apache.skywalking.oap.log.analyzer.v2.module.LogAnalyzerModule; import org.apache.skywalking.oap.log.analyzer.v2.provider.LogAnalyzerModuleConfig; import org.apache.skywalking.oap.log.analyzer.v2.provider.LogAnalyzerModuleProvider; -import org.apache.skywalking.oap.meter.analyzer.v2.MetricConvert; import org.apache.skywalking.oap.meter.analyzer.v2.dsl.Sample; import org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily; import org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamilyBuilder; @@ -48,16 +47,19 @@ * compile-time setter resolution in the generated code. 
*/ public class MetricExtractor extends AbstractSpec { - private final List metricConverts; + /** + * Resolved once at construction — cheaper than re-looking-up on every sample. The provider's + * converter registry is mutated under runtime-rule hot-update, but the provider reference + * itself never changes for the lifetime of this JVM, so caching it is safe. + */ + private final LogAnalyzerModuleProvider provider; public MetricExtractor(final ModuleManager moduleManager, final LogAnalyzerModuleConfig moduleConfig) throws ModuleStartException { super(moduleManager, moduleConfig); - LogAnalyzerModuleProvider provider = (LogAnalyzerModuleProvider) moduleManager + this.provider = (LogAnalyzerModuleProvider) moduleManager .find(LogAnalyzerModule.NAME).provider(); - - metricConverts = provider.getMetricConverts(); } public SampleBuilder prepareMetrics(final ExecutionContext ctx) { @@ -79,7 +81,10 @@ public void submitMetrics(final ExecutionContext ctx, final SampleBuilder builde if (possibleMetricsContainer.isPresent()) { possibleMetricsContainer.get().add(sampleFamily); } else { - metricConverts.forEach(it -> it.toMeter( + // Re-read the converter snapshot on every submit. Hot-updates publish a new map + // reference through LogAnalyzerModuleProvider, so reading at this point picks up + // freshly-applied runtime rules without an extra signal from the reconciler. 
+ provider.getMetricConverts().forEach(it -> it.toMeter( ImmutableMap.builder() .put(sample.getName(), sampleFamily) .build() diff --git a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/module/LogAnalyzerModule.java b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/module/LogAnalyzerModule.java index 39faa0826e0f..0137d993c28e 100644 --- a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/module/LogAnalyzerModule.java +++ b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/module/LogAnalyzerModule.java @@ -18,6 +18,8 @@ package org.apache.skywalking.oap.log.analyzer.v2.module; import org.apache.skywalking.oap.log.analyzer.v2.provider.log.ILogAnalyzerService; +import org.apache.skywalking.oap.log.analyzer.v2.provider.log.listener.LogFilterListener; +import org.apache.skywalking.oap.meter.analyzer.v2.MalConverterRegistry; import org.apache.skywalking.oap.server.library.module.ModuleDefine; public class LogAnalyzerModule extends ModuleDefine { @@ -30,7 +32,22 @@ public LogAnalyzerModule() { @Override public Class[] services() { return new Class[] { - ILogAnalyzerService.class + ILogAnalyzerService.class, + // LAL rule store — keyed by (Layer, ruleName). Each entry is a compiled LAL + // {@code DSL} (the `LalExpression` class generated from a `lal/*.yaml` filter + // block) that decides how to PARSE a log and EXTRACT fields. Owns the + // `lal` runtime-rule catalog: hot-update mutates this Factory directly via + // `addOrReplace` / `remove`, reusing the same `compile` helper the startup + // path uses — no duplicate DSL wiring. + LogFilterListener.Factory.class, + // Inline-MAL converter store — keyed by string `:`. Each entry + // is a `MetricConvert` compiled from a `log-mal-rules/*.yaml` rule that + // AGGREGATES samples (emitted by LAL `metrics {}` blocks) into metrics. 
Owns + // the `log-mal-rules` runtime-rule catalog. The same `MalConverterRegistry` + // SPI lives in the meter-analyzer artifact (which is a library, not a + // ModuleDefine); the OTel receiver implements the same interface for the + // `otel-rules` catalog. Two implementations, two catalogs, one shared API. + MalConverterRegistry.class, }; } } diff --git a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/LALConfigs.java b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/LALConfigs.java index 9b3ec8ec49a3..0e22b883c28b 100644 --- a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/LALConfigs.java +++ b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/LALConfigs.java @@ -18,18 +18,23 @@ package org.apache.skywalking.oap.log.analyzer.v2.provider; +import java.io.ByteArrayInputStream; import java.io.File; import java.io.FileNotFoundException; -import java.io.FileReader; import java.io.IOException; +import java.io.InputStreamReader; import java.io.Reader; -import java.util.Arrays; +import java.nio.charset.StandardCharsets; +import java.nio.file.Files; +import java.util.ArrayList; import java.util.Collections; +import java.util.HashMap; import java.util.List; -import java.util.Objects; -import java.util.stream.Collectors; +import java.util.Map; import lombok.Data; import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.rule.ext.RuleSetMerger; +import org.apache.skywalking.oap.server.library.module.ModuleManager; import org.apache.skywalking.oap.server.library.module.ModuleStartException; import org.apache.skywalking.oap.server.library.util.ResourceUtils; import org.yaml.snakeyaml.Yaml; @@ -45,6 +50,35 @@ public class LALConfigs { private List rules; public static List load(final String path, final List files) throws Exception { + return 
loadInternal(path, files, null, /* useInstalledManager= */ true); + } + + /** + * Load LAL config rules merging the disk allow-list with every + * {@link org.apache.skywalking.oap.server.core.rule.ext.RuntimeRuleOverrideResolver} + * discovered on the classpath. {@code manager} is threaded through to the resolvers so + * the runtime-rule DB resolver can find its DAO; pass {@code null} from test paths that + * have no module context (resolvers needing the manager return empty contributions in + * that case). + * + *

Compared with the legacy disk-only path: + *

<ul> + * <li>Files in {@code files} but missing on disk are still loaded if a resolver + * contributes ACTIVE content for them (DB-only LAL rules).</li> + * <li>Files on disk + in allow-list with an INACTIVE resolver entry are skipped.</li> + * <li>Files on disk + in allow-list with an ACTIVE resolver entry are parsed from + * resolver bytes (override).</li> + * <li>Files on disk + in allow-list with no resolver opinion are parsed from disk.</li> + * </ul>
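The four cases listed above can be sketched as a small merge function. This is an illustration only: `MergeSketch`, `ResolverEntry`, and the `merge` signature are hypothetical names and not the patch's `RuleSetMerger` API, and iterating over the allow-list is an assumption drawn from the bullets rather than from the real implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the merge rules described in the Javadoc list above.
final class MergeSketch {
    enum Status { ACTIVE, INACTIVE }

    // One resolver opinion about a rule file (assumed shape, for illustration).
    static final class ResolverEntry {
        final Status status;
        final String content;

        ResolverEntry(final Status status, final String content) {
            this.status = status;
            this.content = content;
        }
    }

    // Returns rule name -> YAML content that should be parsed.
    static Map<String, String> merge(final List<String> allowList,
                                     final Map<String, String> disk,
                                     final Map<String, ResolverEntry> resolver) {
        final Map<String, String> out = new LinkedHashMap<>();
        for (final String name : allowList) {
            final ResolverEntry entry = resolver.get(name);
            if (entry == null) {
                // No resolver opinion: fall back to the disk copy, if any.
                final String fromDisk = disk.get(name);
                if (fromDisk != null) {
                    out.put(name, fromDisk);
                }
            } else if (entry.status == Status.ACTIVE) {
                // ACTIVE override (or DB-only rule): resolver content wins.
                out.put(name, entry.content);
            }
            // INACTIVE: the rule is skipped even when a disk copy exists.
        }
        return out;
    }
}
```

In this sketch an INACTIVE entry suppresses the disk file entirely, which matches the "skipped" wording of the second bullet.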
+ */ + public static List load(final String path, final List files, + final ModuleManager manager) throws Exception { + return loadInternal(path, files, manager, /* useInstalledManager= */ false); + } + + private static List loadInternal(final String path, final List files, + final ModuleManager manager, + final boolean useInstalledManager) throws Exception { if (isEmpty(files)) { return Collections.emptyList(); } @@ -54,28 +88,55 @@ public static List load(final String path, final List files) try { final File[] rules = ResourceUtils.getPathFiles(path); - return Arrays.stream(rules) - .filter(File::isFile) - .filter(it -> { - //noinspection UnstableApiUsage - return files.contains(getNameWithoutExtension(it.getName())); - }) - .map(f -> { - try (final Reader r = new FileReader(f)) { - final LALConfigs configs = - new Yaml().loadAs(r, LALConfigs.class); - if (configs != null && configs.getRules() != null) { - final String src = f.getName(); - configs.getRules().forEach(c -> c.setSourceName(src)); - } - return configs; - } catch (IOException e) { - log.debug("Failed to read file {}", f, e); - } - return null; - }) - .filter(Objects::nonNull) - .collect(Collectors.toList()); + // Build the disk baseline keyed by rule name (basename without extension); the + // sourceFileName side-table preserves the on-disk file name so post-merge config + // entries can carry it on their `sourceName` field for diagnostics. 
+ final Map diskBytes = new HashMap<>(); + final Map sourceFileName = new HashMap<>(); + for (final File f : rules) { + if (!f.isFile()) { + continue; + } + //noinspection UnstableApiUsage + final String ruleName = getNameWithoutExtension(f.getName()); + if (!files.contains(ruleName)) { + continue; + } + try { + diskBytes.put(ruleName, Files.readAllBytes(f.toPath())); + sourceFileName.put(ruleName, f.getName()); + } catch (final IOException ioe) { + log.debug("Failed to read file {}", f, ioe); + } + } + + // No-manager overload picks up the process-wide ModuleManager set by core. + // Explicit-manager overload bypasses it. + final Map merged = useInstalledManager + ? RuleSetMerger.merge("lal", diskBytes) + : RuleSetMerger.merge("lal", diskBytes, manager); + + final List out = new ArrayList<>(merged.size()); + for (final Map.Entry e : merged.entrySet()) { + final String ruleName = e.getKey(); + final byte[] bytes = e.getValue(); + try (final Reader r = new InputStreamReader( + new ByteArrayInputStream(bytes), + StandardCharsets.UTF_8)) { + final LALConfigs configs = new Yaml().loadAs(r, LALConfigs.class); + if (configs == null || configs.getRules() == null) { + continue; + } + // sourceFileName is only present for entries that came from disk; resolver- + // only rules synthesise a name so diagnostics still print something. 
+ final String src = sourceFileName.getOrDefault(ruleName, ruleName + ".yaml"); + configs.getRules().forEach(c -> c.setSourceName(src)); + out.add(configs); + } catch (final IOException ioe) { + log.debug("Failed to parse LAL rule {}", ruleName, ioe); + } + } + return out; } catch (FileNotFoundException e) { throw new ModuleStartException("Failed to load LAL config rules", e); } diff --git a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/LogAnalyzerModuleProvider.java b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/LogAnalyzerModuleProvider.java index f2612cfff80e..37ef9c8d5e36 100644 --- a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/LogAnalyzerModuleProvider.java +++ b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/LogAnalyzerModuleProvider.java @@ -17,30 +17,75 @@ package org.apache.skywalking.oap.log.analyzer.v2.provider; -import java.util.List; -import java.util.stream.Collectors; +import java.util.Collection; +import java.util.Collections; +import java.util.LinkedHashMap; +import java.util.Map; import lombok.Getter; import org.apache.skywalking.oap.log.analyzer.v2.module.LogAnalyzerModule; import org.apache.skywalking.oap.log.analyzer.v2.provider.log.ILogAnalyzerService; import org.apache.skywalking.oap.log.analyzer.v2.provider.log.LogAnalyzerServiceImpl; import org.apache.skywalking.oap.log.analyzer.v2.provider.log.listener.LogFilterListener; +import org.apache.skywalking.oap.meter.analyzer.v2.MalConverterRegistry; import org.apache.skywalking.oap.meter.analyzer.v2.MetricConvert; import org.apache.skywalking.oap.server.configuration.api.ConfigurationModule; import org.apache.skywalking.oap.server.core.CoreModule; import org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem; +import org.apache.skywalking.oap.server.core.storage.StorageModule; 
import org.apache.skywalking.oap.server.library.module.ModuleDefine; import org.apache.skywalking.oap.server.library.module.ModuleProvider; import org.apache.skywalking.oap.server.library.module.ModuleStartException; import org.apache.skywalking.oap.server.library.module.ServiceNotProvidedException; +/** + * Owns the analyzer-side state for both rule catalogs whose runtime is hosted in this + * module: + *
<ul> + * <li>{@code lal}: the {@link LogFilterListener.Factory} registry of compiled LAL + * rules. The factory is registered as a service so the runtime-rule plugin + * reaches it by service class without a compile-time dep.</li> + * <li>{@code log-mal-rules}: the volatile {@link Map} of active inline-MAL + * converters keyed by {@code "log-mal-rules:"}, plus the + * {@link MalConverterRegistry} service the runtime-rule plugin uses to + * hot-mutate that map.</li> + * </ul>
+ * + *

Boot path: {@link #prepare()} constructs the LAL factory and the + * {@link MalConverterRegistry} service; {@link #start()} loads the static + * {@code log-mal-rules} files into the converter map and the static {@code lal} + * files into the filter factory, after which both registries are open for runtime + * mutation. Both static and runtime-rule entries share the same key scheme, so an + * operator override of a shipped rule lands in place — without a separate "delete + * the bundled file first" step. + * + *

The provider also exposes {@link ILogAnalyzerService} so the OTel-log / + * SkyWalking-native log receivers can dispatch parsed records into LAL. + */ public class LogAnalyzerModuleProvider extends ModuleProvider { @Getter private LogAnalyzerModuleConfig moduleConfig; - @Getter - private List metricConverts; + /** + * Active inline-MAL converters ({@code log-mal-rules} catalog — metrics extracted from + * logs), keyed by a stable handle so runtime-rule hot-update can replace or drop + * individual entries without touching the others. All entries — both boot-loaded static + * rules and runtime-rule pushes — share the {@code "log-mal-rules:"} key scheme + * (boot-time loading happens in {@link #start()} from {@code moduleConfig.malConfigs()}; + * runtime mutations come through {@link MalConverterRegistry} which delegates to + * {@link #addOrReplaceMetricConvert} / {@link #removeMetricConvert}). A runtime + * {@code /addOrUpdate} replaces the entry in place over whichever rule (boot or prior + * runtime push) occupies that key; {@code /inactivate} drops it. This is what lets an + * operator override a shipped log-mal rule without first deleting it — the update lands + * under the same key and takes over dispatch. + * + *

Volatile + copy-on-write: readers in {@link org.apache.skywalking.oap.log.analyzer.v2.dsl.spec.extractor.MetricExtractor} observe a consistent + * snapshot without a lock; writers replace the map reference under {@link #convertersWriteLock}. + */ + private volatile Map metricConverts = Collections.emptyMap(); + private final Object convertersWriteLock = new Object(); private LogAnalyzerServiceImpl logAnalyzerService; + private LogFilterListener.Factory factory; @Override public String name() { @@ -71,19 +116,56 @@ public void onInitialized(final LogAnalyzerModuleConfig initialized) { public void prepare() throws ServiceNotProvidedException, ModuleStartException { logAnalyzerService = new LogAnalyzerServiceImpl(getManager(), moduleConfig); this.registerServiceImplementation(ILogAnalyzerService.class, logAnalyzerService); + + // Register both module-declared services in prepare(): {@link BootstrapFlow#start} + // runs {@code requiredCheck} against {@code services()} BEFORE this provider's + // {@code start()}, and the count-equals check ({@code requiredServices.length == + // services.size()}) requires every declared service to be registered already. + // Both objects are config-only at construction — the Factory's heavy + // rule-compile pass is deferred to {@link LogFilterListener.Factory#loadStaticRules} + // (called from {@link #start()} where moduleManager.find is allowed); the + // MalConverterRegistry is a pure delegate to this provider's own volatile map and + // never reaches across modules. + try { + this.factory = new LogFilterListener.Factory(getManager(), moduleConfig); + this.registerServiceImplementation(LogFilterListener.Factory.class, this.factory); + } catch (final Exception e) { + throw new ModuleStartException("Failed to create LAL listener factory.", e); + } + // MalConverterRegistry for the log-mal-rules catalog. 
The runtime-rule plugin looks + // this up by module name + service class, so it does not need a compile-time dep on + // log-analyzer's concrete provider. Delegates directly to this provider's volatile + // map so ingest code (MetricExtractor) and runtime mutations share exactly one state. + this.registerServiceImplementation(MalConverterRegistry.class, new MalConverterRegistry() { + @Override + public void addOrReplaceConverter(final String key, final MetricConvert convert) { + addOrReplaceMetricConvert(key, convert); + } + + @Override + public void removeConverter(final String key) { + removeMetricConvert(key); + } + }); } @Override public void start() throws ServiceNotProvidedException, ModuleStartException { MeterSystem meterSystem = getManager().find(CoreModule.NAME).provider().getService(MeterSystem.class); - metricConverts = moduleConfig.malConfigs() - .stream() - .map(it -> new MetricConvert(it, meterSystem)) - .collect(Collectors.toList()); + for (final var rule : moduleConfig.malConfigs()) { + // Use the catalog:name key convention so a runtime-rule /addOrUpdate for the same + // (catalog, name) replaces this static entry in place instead of running two + // converters against the same sample stream. + addOrReplaceMetricConvert("log-mal-rules:" + rule.getName(), new MetricConvert(rule, meterSystem)); + } try { - logAnalyzerService.addListenerFactory(new LogFilterListener.Factory(getManager(), moduleConfig)); + // Light up the Factory now that all peer modules are past prepare: + // loadStaticRules calls compile() which constructs RecordSinkListener.Factory + // which calls moduleManager.find() — only safe outside prepare. 
+ this.factory.loadStaticRules(); + logAnalyzerService.addListenerFactory(this.factory); } catch (final Exception e) { - throw new ModuleStartException("Failed to create LAL listener.", e); + throw new ModuleStartException("Failed to load static LAL rules.", e); } } @@ -92,11 +174,60 @@ public void notifyAfterCompleted() throws ServiceNotProvidedException { } + /** + * Live snapshot of active MAL converters for {@code log-mal-rules}. Consumed by + * {@link org.apache.skywalking.oap.log.analyzer.v2.dsl.spec.extractor.MetricExtractor} at + * ingest time. The returned collection is a read-only view; concurrent updates from + * {@link #addOrReplaceMetricConvert} / {@link #removeMetricConvert} do not invalidate + * in-flight iteration because writers publish a new map reference rather than mutating + * the one this method returned. + */ + public Collection getMetricConverts() { + return metricConverts.values(); + } + + /** + * Install or replace a single inline-MAL converter identified by {@code key}. Thread-safe + * against concurrent readers and other writers; readers observe either the pre-call or the + * post-call snapshot, never a torn intermediate state. Called by the runtime-rule plugin + * when an operator's {@code /addOrUpdate} commits a new bundle under the + * {@code log-mal-rules} catalog; boot-time loading also uses this method so there is + * exactly one installation path. + */ + public void addOrReplaceMetricConvert(final String key, final MetricConvert convert) { + synchronized (convertersWriteLock) { + final Map copy = new LinkedHashMap<>(metricConverts); + copy.put(key, convert); + metricConverts = Collections.unmodifiableMap(copy); + } + } + + /** + * Drop the inline-MAL converter previously installed under {@code key}. No-op if the key + * is not present — {@code /delete} on a runtime rule that already tore down on this node + * shouldn't surface an error. 
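The volatile, copy-on-write publication that `addOrReplaceMetricConvert` and `removeMetricConvert` rely on can be shown in isolation. The sketch below is a minimal stand-in, assuming a generic value type instead of `MetricConvert`; the class name `CowRegistry` and its method names are hypothetical and do not appear in the patch.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the provider's converter map (illustration only).
final class CowRegistry<V> {
    // Readers dereference this field once and then iterate an immutable map.
    private volatile Map<String, V> entries = Collections.emptyMap();
    // Writers serialize on this lock so concurrent updates are not lost.
    private final Object writeLock = new Object();

    // Lock-free read: the returned view is backed by a map that is never mutated.
    Collection<V> snapshot() {
        return entries.values();
    }

    void addOrReplace(final String key, final V value) {
        synchronized (writeLock) {
            final Map<String, V> copy = new LinkedHashMap<>(entries);
            copy.put(key, value);
            // Publish a new reference; earlier snapshots stay unchanged.
            entries = Collections.unmodifiableMap(copy);
        }
    }

    void remove(final String key) {
        synchronized (writeLock) {
            if (!entries.containsKey(key)) {
                return; // no-op when the key is absent
            }
            final Map<String, V> copy = new LinkedHashMap<>(entries);
            copy.remove(key);
            entries = Collections.unmodifiableMap(copy);
        }
    }
}
```

A reader that obtained `snapshot()` before a write keeps iterating the old map, which is why in-flight iteration is not invalidated by a concurrent update.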
+ */ + public void removeMetricConvert(final String key) { + synchronized (convertersWriteLock) { + if (!metricConverts.containsKey(key)) { + return; + } + final Map copy = new LinkedHashMap<>(metricConverts); + copy.remove(key); + metricConverts = Collections.unmodifiableMap(copy); + } + } + @Override public String[] requiredModules() { + // StorageModule must start before this provider so the runtime_rule management + // table is materialised by the time LALConfigs.load and Rules.loadRules consult + // the RuntimeRuleOverrideResolver chain (the DB-backed resolver needs a query- + // ready DAO to contribute overrides). return new String[] { CoreModule.NAME, - ConfigurationModule.NAME + ConfigurationModule.NAME, + StorageModule.NAME }; } } diff --git a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/log/listener/LogFilterListener.java b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/log/listener/LogFilterListener.java index 3df31df3df2d..c3f4af9265df 100644 --- a/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/log/listener/LogFilterListener.java +++ b/oap-server/analyzer/log-analyzer/src/main/java/org/apache/skywalking/oap/log/analyzer/v2/provider/log/listener/LogFilterListener.java @@ -20,9 +20,12 @@ import java.util.ArrayList; import java.util.Collection; +import java.util.Collections; +import java.util.HashSet; import java.util.List; import java.util.Map; import java.util.ServiceLoader; +import java.util.Set; import java.util.stream.Collectors; import lombok.extern.slf4j.Slf4j; @@ -39,6 +42,7 @@ import org.apache.skywalking.oap.server.core.source.LogMetadata; import org.apache.skywalking.oap.server.library.module.ModuleManager; import org.apache.skywalking.oap.server.library.module.ModuleStartException; +import org.apache.skywalking.oap.server.library.module.Service; /** * Runtime listener that executes compiled 
LAL rules against incoming log data. @@ -123,16 +127,56 @@ public LogAnalysisListener parse(final LogMetadata metadata, *

At runtime, {@link #create(Layer)} returns a {@link LogFilterListener} * containing all DSL instances for the requested layer. */ - public static class Factory implements LogAnalysisListenerFactory { - private final Map> dsls; - private final Map autoDsls; + public static class Factory implements LogAnalysisListenerFactory, Service { + /** + * Volatile + copy-on-write so readers in {@link #create(Layer)} are lock-free and the + * runtime-rule hot-update path can mutate without blocking sample evaluation. Inner maps + * are wholesale-replaced via the writeLock below rather than mutated in place, so the + * reference observed by a reader is always fully-populated for its generation. + */ + private volatile Map> dsls; + private volatile Map autoDsls; + + /** + * Suspended rule keys — encoded as {@code layer.name() + "|" + ruleName} for layer-keyed + * rules and {@code "|" + ruleName} for auto-layer rules. When {@link #create} is + * asked for a layer, it filters out entries whose key is in this set so samples arriving + * during a hot-update Suspend window never hit the prior DSL. Volatile + CoW replace so + * readers stay lock-free — matches the {@link #dsls} / {@link #autoDsls} contract. + */ + private volatile Set suspendedKeys = Collections.emptySet(); + private static final String AUTO_LAYER_PREFIX = "|"; + + /** Serializes runtime mutations (addOrReplace / remove / suspend / resume). Startup + * writes are single-threaded. */ + private final Object writeLock = new Object(); + + private final ModuleManager moduleManager; + private final LogAnalyzerModuleConfig analyzerConfig; + private final Map spiProviders; + + /** + * Two-phase init — the constructor wires fields and runs the SPI scan, leaving the + * {@code dsls} / {@code autoDsls} registry empty so this instance is safe to register + * as a module service from {@code prepare()}. 
The rule-compile pass invokes the + * {@code RecordSinkListener.Factory} constructor, which calls {@code moduleManager.find()} + * and is therefore illegal during the prepare stage, so it is deferred to + * {@link #loadStaticRules()}, which the provider invokes from {@code start()} once + * the manager is past prepare. + * + *

This shape mirrors {@code OpenTelemetryMetricRequestProcessor}: the receiver + * registers a config-only object in {@code prepare()} and lights it up in + * {@code start()}, so cross-module {@code requiredCheck} resolves the service name + * cleanly. + */ public Factory(final ModuleManager moduleManager, final LogAnalyzerModuleConfig config) throws Exception { - dsls = new HashMap<>(); - autoDsls = new HashMap<>(); + this.moduleManager = moduleManager; + this.analyzerConfig = config; - // Scan SPI providers for default inputType/outputType per layer - final Map spiProviders = new HashMap<>(); + // Scan SPI providers for default inputType/outputType per layer. SPI lookup uses + // the JDK's {@code ServiceLoader}, so no moduleManager.find is required and it is safe in prepare. + this.spiProviders = new HashMap<>(); for (final LALSourceTypeProvider p : ServiceLoader.load(LALSourceTypeProvider.class)) { spiProviders.put(p.layer(), p); log.info("LALSourceTypeProvider: layer={}, inputType={}, outputType={}", @@ -140,37 +184,202 @@ public Factory(final ModuleManager moduleManager, final LogAnalyzerModuleConfig p.outputType() != null ? p.outputType().getName() : "default(Log)"); } - final List configList = LALConfigs.load(config.getLalPath(), config.lalFiles()) + // Empty registry, populated by {@link #loadStaticRules}. + this.dsls = new HashMap<>(); + this.autoDsls = new HashMap<>(); + } + + /** + * Compile every static LAL rule the {@link LogAnalyzerModuleConfig} configures and + * publish the resulting registry. Provider must call this from {@code start()}, + * never from {@code prepare()}, because {@code compile} reaches into the + * {@code RecordSinkListener.Factory} constructor, which calls {@code moduleManager.find()} + * and asserts the manager is past prepare. + * + *

Call exactly once per boot: the Factory stays in its empty post-construct state + * only until this fires once, so calling it twice on the same instance is a + * programming error. + */ + public void loadStaticRules() throws Exception { + final Map> initDsls = new HashMap<>(); + final Map initAutoDsls = new HashMap<>(); + final List configList = LALConfigs.load(analyzerConfig.getLalPath(), analyzerConfig.lalFiles()) .stream() .flatMap(it -> it.getRules().stream()) .collect(Collectors.toList()); for (final LALConfig c : configList) { - final boolean isAuto = LALConfig.LAYER_AUTO.equalsIgnoreCase(c.getLayer()); - final Layer layer = isAuto ? null : Layer.nameOf(c.getLayer()); - final LALSourceTypeProvider spiProvider = isAuto ? null : spiProviders.get(layer); - - // Per-rule resolution: explicit YAML > SPI > null - final Class resolvedInputType = resolveInputType(c, spiProvider); - final Class resolvedOutputType = resolveOutputType(c, spiProvider); - - final DSL dsl = DSL.of( - moduleManager, config, c.getDsl(), - resolvedInputType, resolvedOutputType, - c.getName(), c.getSourceName()); - - if (isAuto) { - if (autoDsls.put(c.getName(), dsl) != null) { + final CompiledLAL compiled = compile(c); + if (compiled.layer == null) { + if (initAutoDsls.put(c.getName(), compiled.dsl) != null) { throw new ModuleStartException( "Auto-layer rules have duplicate name: " + c.getName()); } } else { - final Map layerDsls = this.dsls.computeIfAbsent(layer, k -> new HashMap<>()); - if (layerDsls.put(c.getName(), dsl) != null) { + final Map layerDsls = initDsls.computeIfAbsent(compiled.layer, k -> new HashMap<>()); + if (layerDsls.put(c.getName(), compiled.dsl) != null) { throw new ModuleStartException( - "Layer " + layer.name() + " has already set " + c.getName() + " rule."); + "Layer " + compiled.layer.name() + " has already set " + c.getName() + " rule."); } } } + // Publish: readers from now on see the startup-complete registry.
+ this.dsls = initDsls; + this.autoDsls = initAutoDsls; + } + + /** + * Compile a single LALConfig into a runnable {@link DSL}. Used by both the startup + * path ({@link #loadStaticRules}) and the runtime-rule hot-update path (LalFileApplier). + */ + public CompiledLAL compile(final LALConfig c) throws ModuleStartException { + return compile(c, null, null); + } + + /** + * Runtime-rule overload: compile with a per-file {@link javassist.ClassPool} and target + * {@link ClassLoader} so the generated {@code LalExpression} class is defined in the + * caller's per-file loader. When both args are null this delegates to the legacy + * startup path. Called by {@code LalFileApplier.apply} with the per-file + * {@code RuleClassLoader} it creates on every compile. + */ + public CompiledLAL compile(final LALConfig c, + final javassist.ClassPool pool, + final ClassLoader targetClassLoader) throws ModuleStartException { + final boolean isAuto = LALConfig.LAYER_AUTO.equalsIgnoreCase(c.getLayer()); + final Layer layer = isAuto ? null : Layer.nameOf(c.getLayer()); + final LALSourceTypeProvider spiProvider = isAuto ? null : spiProviders.get(layer); + final Class resolvedInputType = resolveInputType(c, spiProvider); + final Class resolvedOutputType = resolveOutputType(c, spiProvider); + final DSL dsl = DSL.of( + moduleManager, analyzerConfig, c.getDsl(), + resolvedInputType, resolvedOutputType, + c.getName(), c.getSourceName(), + pool, targetClassLoader); + return new CompiledLAL(layer, c.getName(), dsl); + } + + /** + * Install a compiled rule under {@code (layer, ruleName)}, replacing any prior binding + * for the same key. Runtime hot-update use only; the startup path goes through + * {@link #loadStaticRules}. This method does not itself reject a key owned by a + * DIFFERENT sourceName (cross-file collision inside a layer): callers must check + * {@link #contains} first, and remove the old binding when re-registering the same file. 
+ */ + public void addOrReplace(final CompiledLAL compiled) { + synchronized (writeLock) { + if (compiled.layer == null) { + final Map next = new HashMap<>(autoDsls); + next.put(compiled.ruleName, compiled.dsl); + autoDsls = next; + } else { + final Map> next = new HashMap<>(dsls); + final Map layerMap = + new HashMap<>(next.getOrDefault(compiled.layer, new HashMap<>())); + layerMap.put(compiled.ruleName, compiled.dsl); + next.put(compiled.layer, layerMap); + dsls = next; + } + } + } + + /** Runtime remove. No-op when the key isn't present. */ + public void remove(final Layer layer, final String ruleName) { + synchronized (writeLock) { + if (layer == null) { + if (!autoDsls.containsKey(ruleName)) { + return; + } + final Map next = new HashMap<>(autoDsls); + next.remove(ruleName); + autoDsls = next; + } else { + final Map layerMap = dsls.get(layer); + if (layerMap == null || !layerMap.containsKey(ruleName)) { + return; + } + final Map> next = new HashMap<>(dsls); + final Map newLayerMap = new HashMap<>(layerMap); + newLayerMap.remove(ruleName); + if (newLayerMap.isEmpty()) { + next.remove(layer); + } else { + next.put(layer, newLayerMap); + } + dsls = next; + } + } + } + + /** Check whether {@code (layer, ruleName)} is already owned — used by hot-update to + * detect cross-file collisions before registering. */ + public boolean contains(final Layer layer, final String ruleName) { + if (layer == null) { + return autoDsls.containsKey(ruleName); + } + final Map layerMap = dsls.get(layer); + return layerMap != null && layerMap.containsKey(ruleName); + } + + /** + * Mark the given rules as suspended so {@link #create} excludes them until + * {@link #resume} is called. Runtime hot-update path: the reconciler invokes this before + * it broadcasts cluster Suspend so local LAL dispatch for the bundle is paused at the + * same moment peer dispatch goes away. Idempotent — repeating the same keys is a no-op. 
+ */ + public void suspend(final Collection keys) { + if (keys == null || keys.isEmpty()) { + return; + } + synchronized (writeLock) { + final Set next = new HashSet<>(suspendedKeys); + if (!next.addAll(keys)) { + return; + } + suspendedKeys = Collections.unmodifiableSet(next); + } + } + + /** + * Reverse of {@link #suspend}. Removes the given keys from the suspended set so + * subsequent {@link #create} calls see the rule again. Idempotent — keys not currently + * suspended are silently skipped. + */ + public void resume(final Collection keys) { + if (keys == null || keys.isEmpty()) { + return; + } + synchronized (writeLock) { + if (suspendedKeys.isEmpty()) { + return; + } + final Set next = new HashSet<>(suspendedKeys); + if (!next.removeAll(keys)) { + return; + } + suspendedKeys = next.isEmpty() ? Collections.emptySet() : Collections.unmodifiableSet(next); + } + } + + /** + * Encode a {@code (layer, ruleName)} pair the way {@link #suspendedKeys} stores it. Null + * layer → auto-layer prefix. Callers outside this class use this helper so the encoding + * stays private to the factory. + */ + public static String ruleKey(final Layer layer, final String ruleName) { + return (layer == null ? AUTO_LAYER_PREFIX : layer.name() + "|") + ruleName; + } + + /** Compact result of {@link #compile(LALConfig)} so callers don't handle Layer / DSL separately. */ + public static final class CompiledLAL { + public final Layer layer; + public final String ruleName; + public final DSL dsl; + + public CompiledLAL(final Layer layer, final String ruleName, final DSL dsl) { + this.layer = layer; + this.ruleName = ruleName; + this.dsl = dsl; + } } private static Class resolveInputType(final LALConfig config, @@ -253,18 +462,45 @@ private static Class resolveOutputType( @Override public LogAnalysisListener create(Layer layer) { + // Snapshot the suspended set once so a concurrent suspend/resume can't flip behaviour + // mid-iteration. 
The reference is volatile + copy-on-write so this is lock-free. + final Set susp = suspendedKeys; if (layer == null) { // null layer → route to auto-layer rules if (autoDsls.isEmpty()) { return null; } - return new LogFilterListener(autoDsls.values(), true); + final Collection eligible = susp.isEmpty() + ? autoDsls.values() + : filterSuspended(autoDsls, susp, null); + if (eligible.isEmpty()) { + return null; + } + return new LogFilterListener(eligible, true); } final Map dsl = dsls.get(layer); if (dsl == null) { return null; } - return new LogFilterListener(dsl.values(), false); + final Collection eligible = susp.isEmpty() + ? dsl.values() + : filterSuspended(dsl, susp, layer); + if (eligible.isEmpty()) { + return null; + } + return new LogFilterListener(eligible, false); + } + + private static Collection filterSuspended(final Map source, + final Set suspended, + final Layer layer) { + final List out = new ArrayList<>(source.size()); + for (final Map.Entry e : source.entrySet()) { + if (!suspended.contains(ruleKey(layer, e.getKey()))) { + out.add(e.getValue()); + } + } + return out; } } } diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/Analyzer.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/Analyzer.java index fa65613cbe4a..3a607214a4ec 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/Analyzer.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/Analyzer.java @@ -29,6 +29,7 @@ import java.util.function.Predicate; import java.util.stream.Stream; import lombok.AccessLevel; +import lombok.Getter; import lombok.RequiredArgsConstructor; import lombok.ToString; import lombok.extern.slf4j.Slf4j; @@ -53,6 +54,7 @@ import org.apache.skywalking.oap.server.core.analysis.manual.service.ServiceTraffic; import 
org.apache.skywalking.oap.server.core.analysis.meter.MeterEntity; import org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.core.analysis.meter.ScopeType; import org.apache.skywalking.oap.server.core.analysis.meter.function.AcceptableValue; import org.apache.skywalking.oap.server.core.analysis.meter.function.BucketedValues; @@ -127,15 +129,95 @@ public static Analyzer build(final String metricName, final String expression, final MeterSystem meterSystem, final String yamlSource) { - Expression e = DSL.parse(metricName, expression, yamlSource); + return build(metricName, filter, expression, meterSystem, yamlSource, null, null); + } + + /** + * Overload used by the runtime-rule hot-update path. When {@code pool} and + * {@code targetClassLoader} are non-null, every class compiled for this rule — the + * generated {@code MalExpression} subclass, its closure companion classes, and the + * storage-side {@code Metrics} subclass emitted by {@code MeterSystem.create} — is + * built in the caller-supplied Javassist pool and loaded through the caller-supplied + * per-file {@code RuleClassLoader}. The whole class family for one YAML file then + * drops together when the reconciler retires the loader on unregister. + * + *
<p>
With null args this delegates to the legacy startup path that goes through the + * shared {@code DSL.GENERATOR} singleton and the default-pool {@code MeterSystem.create} + * overload — unchanged. + */ + public static Analyzer build(final String metricName, + final FilterExpression filter, + final String expression, + final MeterSystem meterSystem, + final String yamlSource, + final javassist.ClassPool pool, + final ClassLoader targetClassLoader) { + final Analyzer analyzer = prepare( + metricName, filter, expression, meterSystem, yamlSource, pool, targetClassLoader); + analyzer.register(); + return analyzer; + } + + /** + * Compile-only factory: parses the MAL expression into a {@code MalExpression} class + * under the per-file loader and populates runtime state ({@code samples}, {@code metricType}, + * {@code percentiles}) from the extracted metadata, but does NOT call + * {@link MeterSystem#create} yet. Used by {@code MetricConvert} to split the apply of a + * rule file into two phases — compile everything first, register everything only if all + * compiles succeed. That way a compile error on a later rule doesn't leave earlier + * rules with measures already provisioned on the storage backend. + * + *
<p>
{@link #register()} completes the second phase per analyzer. + */ + public static Analyzer prepare(final String metricName, + final FilterExpression filter, + final String expression, + final MeterSystem meterSystem, + final String yamlSource, + final javassist.ClassPool pool, + final ClassLoader targetClassLoader) { + // Static boot / default path: create-if-absent. Runtime-rule on-demand apply passes + // fullInstall() via the explicit-opt overload. + return prepare(metricName, filter, expression, meterSystem, yamlSource, pool, targetClassLoader, + StorageManipulationOpt.createIfAbsent()); + } + + /** + * Prepare overload that carries a {@link StorageManipulationOpt}. Runtime-rule peer-side + * apply passes {@link StorageManipulationOpt#localCacheOnly()} so subsequent + * {@link #register()} call skips server-side DDL. + */ + public static Analyzer prepare(final String metricName, + final FilterExpression filter, + final String expression, + final MeterSystem meterSystem, + final String yamlSource, + final javassist.ClassPool pool, + final ClassLoader targetClassLoader, + final StorageManipulationOpt storageOpt) { + Expression e = DSL.parse(metricName, expression, yamlSource, pool, targetClassLoader); ExpressionMetadata ctx = e.parse(); Analyzer analyzer = new Analyzer(metricName, filter, e, meterSystem, ctx); - analyzer.init(); + analyzer.pool = pool; + analyzer.targetClassLoader = targetClassLoader; + analyzer.storageOpt = storageOpt == null ? StorageManipulationOpt.createIfAbsent() : storageOpt; + analyzer.resolveTypeFromMetadata(); return analyzer; } + /** + * Register the prepared analyzer with the {@link MeterSystem}. Separate from + * {@link #prepare} so {@code MetricConvert} can batch all prepare calls for a rule file + * before any DDL fires. Idempotent at the {@code MeterSystem} level — the receiver + * short-circuits on identical-shape re-registration. 
+ */ + public void register() { + createMetric(ctx.getScopeType(), metricType.literal, ctx.getDownsampling()); + } + private List samples; + @Getter private final String metricName; private final FilterExpression filterExpression; @@ -150,6 +232,20 @@ public static Analyzer build(final String metricName, private int[] percentiles; + /** Per-file Javassist pool for runtime-rule hot-update, null on startup path. */ + private javassist.ClassPool pool; + /** Per-file target classloader for runtime-rule hot-update, null on startup path. */ + private ClassLoader targetClassLoader; + /** + * Storage-install policy threaded through to {@link MeterSystem#create}. Startup uses + * {@link StorageManipulationOpt#createIfAbsent()} (the default when callers don't set + * it — never reshape the backend at boot). Main-node on-demand apply sets + * {@link StorageManipulationOpt#fullInstall()}. Peer-node apply sets + * {@link StorageManipulationOpt#localCacheOnly()} so local Metrics classes + BanyanDB + * MetadataRegistry populate without server-side DDL. + */ + private StorageManipulationOpt storageOpt = StorageManipulationOpt.createIfAbsent(); + /** * Analyse the full sample family map and produce meter-system metrics. * @@ -289,13 +385,15 @@ private enum MetricType { } /** - * Initializes runtime state from compile-time metadata. + * Resolves {@link #samples}, {@link #metricType}, {@link #percentiles} from the + * compile-time metadata. Side-effect free — no {@link MeterSystem} interaction. * - *
<p>
{@code ctx.getSamples()} provides the Prometheus metric names this expression references - * (e.g., ["node_cpu_seconds_total"]). These are used at runtime to select relevant entries - * from the full sample family map, avoiding unnecessary expression evaluation. + *
<p>
{@code ctx.getSamples()} provides the Prometheus metric names this expression + * references (e.g. {@code ["node_cpu_seconds_total"]}). These are used at runtime to + * select relevant entries from the full sample family map, avoiding unnecessary + * expression evaluation. */ - private void init() { + private void resolveTypeFromMetadata() { this.samples = ctx.getSamples(); if (ctx.isHistogram()) { if (ctx.getPercentiles() != null && ctx.getPercentiles().length > 0) { @@ -311,7 +409,6 @@ private void init() { metricType = MetricType.labeled; } } - createMetric(ctx.getScopeType(), metricType.literal, ctx.getDownsampling()); } private void createMetric(final ScopeType scopeType, @@ -319,7 +416,19 @@ private void createMetric(final ScopeType scopeType, final DownsamplingType downsamplingType) { String downSamplingStr = CaseUtils.toCamelCase(downsamplingType.toString().toLowerCase(), false, '_'); String functionName = String.format("%s%s", downSamplingStr, StringUtils.capitalize(dataType)); - meterSystem.create(metricName, functionName, scopeType); + // Default path (startup): pool + neighbor null → MeterSystem uses its own default pool + // + MeterClassPackageHolder as loader neighbor. Runtime-rule path: caller provided a + // per-file RuleClassLoader + ClassPool, and MeterSystem.create's pool/neighbor + // overload puts the generated Metrics subclass in the per-file loader so the whole + // bundle drops together on hot-remove. + if (pool != null && targetClassLoader != null) { + // Per-file: generated Metrics class goes directly into the supplied RuleClassLoader. + // storageOpt controls server-side DDL: fullInstall() on main, localCacheOnly() + // on peer — see the Analyzer class-level Javadoc for the main/peer contract. 
+ meterSystem.create(metricName, functionName, scopeType, pool, targetClassLoader, storageOpt); + } else { + meterSystem.create(metricName, functionName, scopeType); + } } private void send(final AcceptableValue v, final long time) { diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MalConverterRegistry.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MalConverterRegistry.java new file mode 100644 index 000000000000..da5c986e8144 --- /dev/null +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MalConverterRegistry.java @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.meter.analyzer.v2; + +import org.apache.skywalking.oap.server.library.module.Service; + +/** + * Shared contract for the MAL converter registry each MAL-consuming receiver exposes, so the + * runtime-rule hot-update plugin can add / replace / remove individual converters without + * reshuffling anyone's boot-time list. + * + *
<p>Today two receivers implement it: + * <ul>
+ * <li>{@code OpenTelemetryMetricRequestProcessor} for the {@code otel-rules} catalog — OTLP + * metrics flow.</li>
+ * <li>The {@code log-analyzer} module for the {@code log-mal-rules} catalog — inline MAL + * extracted from LAL {@code metrics {}} blocks.</li>
+ * </ul>
+ * <p>
The registry is keyed by a stable string the caller picks — today that string is + * {@code ":"} (e.g., {@code "otel-rules:vm"}). The key namespace is + * deliberately shared between boot-registered converters and runtime-rule-registered + * converters: runtime-rule's {@code /addOrUpdate} replaces-in-place over whichever entry the + * boot catalog or a prior runtime push left behind, and {@code /inactivate} drops that same + * entry. This is what lets an operator override a shipped static rule without first deleting + * it — the update lands under the same key and takes over dispatch. + * + *
<p>
Implementations must be thread-safe — ingest threads iterate concurrently while + * runtime-rule mutates. The expected idiom is volatile map + copy-on-write under a private + * write lock; readers take a reference snapshot without locking. + */ +public interface MalConverterRegistry extends Service { + + /** + * Install or replace a converter under {@code key}. Idempotent: repeated calls with the + * same {@code key} replace the entry atomically. + */ + void addOrReplaceConverter(String key, MetricConvert convert); + + /** + * Remove the converter previously installed under {@code key}. No-op if absent — the + * runtime-rule delete / teardown path treats a missing entry as "already converged". + */ + void removeConverter(String key); +} diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MetricConvert.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MetricConvert.java index 34064d5486a4..cd721e809dac 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MetricConvert.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MetricConvert.java @@ -22,15 +22,20 @@ import com.google.common.base.Strings; import com.google.common.collect.ImmutableMap; import io.vavr.control.Try; +import java.util.Collections; +import java.util.LinkedHashSet; import java.util.List; +import java.util.Set; import java.util.StringJoiner; import java.util.stream.IntStream; import java.util.stream.Stream; +import lombok.Getter; import lombok.extern.slf4j.Slf4j; import org.apache.commons.lang3.StringUtils; import org.apache.skywalking.oap.meter.analyzer.v2.dsl.FilterExpression; import org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily; import org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem; +import 
org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import static java.util.stream.Collectors.toList; @@ -74,24 +79,118 @@ public static Stream log(Try t, String debugMessage) { private final List analyzers; public MetricConvert(MetricRuleConfig rule, MeterSystem service) { + // Static boot default: create-if-absent semantics. Runtime-rule on-demand callers use + // the explicit-opt overload and pass fullInstall() to get reshape permission. + this(rule, service, null, null, StorageManipulationOpt.createIfAbsent()); + } + + public MetricConvert(final MetricRuleConfig rule, final MeterSystem service, + final javassist.ClassPool pool, + final ClassLoader targetClassLoader) { + this(rule, service, pool, targetClassLoader, + StorageManipulationOpt.createIfAbsent()); + } + + /** + * Runtime-rule overload carrying per-file classloader + storage policy. + * + * @param rule the MAL rule config to compile + * @param service MeterSystem target for registration + * @param pool per-file Javassist pool, or null to use the shared default + * @param targetClassLoader per-file ClassLoader, or null to use the shared default + * @param storageOpt policy for backend-side install; main-node passes fullInstall, + * peer-node passes localCacheOnly to skip server DDL + */ + public MetricConvert(final MetricRuleConfig rule, final MeterSystem service, + final javassist.ClassPool pool, + final ClassLoader targetClassLoader, + final StorageManipulationOpt storageOpt) { Preconditions.checkState(!Strings.isNullOrEmpty(rule.getMetricPrefix())); final String sourceName = rule.getSourceName(); - final FilterExpression filter = buildFilter(rule); + final FilterExpression filter = buildFilter(rule, pool, targetClassLoader); final List rules = rule.getMetricsRules(); - this.analyzers = IntStream.range(0, rules.size()).mapToObj( + + // Two-phase apply at file granularity so a compile error on a later rule never + // leaves earlier rules with measures already provisioned on the 
storage backend. + // + // Phase 1 — prepare every Analyzer: runs DSL.parse (Javassist codegen into the + // per-file ClassLoader when running on the runtime-rule path) + metadata + // extraction, but does NOT call MeterSystem.create. On any failure, the whole + // file apply aborts before any DDL fires; partial Javassist classes die with + // the (throwaway) per-file loader. + // + // Phase 2 — register: walks the prepared list and calls Analyzer.register which + // drives MeterSystem.create → StorageModels.add → per-backend listener DDL. + // On partial register failure the caller (MalFileApplier / Reconciler) rolls + // back only the metrics that this apply attempt actually created. Phase 2 + // failures are rare in practice — MeterSystem.create is idempotent for same- + // shape re-registration and the runtime-rule path pre-removes shape-break + // metrics before reaching here. + final List prepared = IntStream.range(0, rules.size()).mapToObj( i -> { final MetricRuleConfig.RuleConfig r = rules.get(i); final String yamlSource = sourceName != null ? sourceName + ".yaml:" + i : null; - return buildAnalyzer( + return prepareAnalyzer( formatMetricName(rule, r.getName()), filter, formatExp(rule.getExpPrefix(), rule.getExpSuffix(), r.getExp()), service, - yamlSource + yamlSource, + pool, + targetClassLoader, + storageOpt ); } ).collect(toList()); + // Phase 2 — register. Track each metric name as it's successfully registered so a + // mid-phase throw gives the caller an accurate "actually registered" set. The previous + // design left the caller using the full enumerated metric list for rollback, which was + // catastrophic for FILTER_ONLY edits: a compile surprise between register() calls would + // wipe the old bundle's metrics that this apply attempt never touched. 
+ final Set registered = new LinkedHashSet<>(prepared.size()); + for (final Analyzer a : prepared) { + try { + a.register(); + } catch (final Throwable t) { + throw new PartialRegistrationException( + "phase-2 register failed for " + a.getMetricName(), + t, Collections.unmodifiableSet(new LinkedHashSet<>(registered))); + } + registered.add(a.getMetricName()); + } + this.analyzers = prepared; + this.registeredMetricNames = Collections.unmodifiableSet(registered); + } + + /** + * Metric names that completed phase-2 register on this instance — the set the caller would + * unregister to undo a successful apply. Same as {@code analyzers.stream().map(getMetricName)} + * for a fully-constructed instance; the field exists so {@link PartialRegistrationException} + * can carry the same value for the partial case. + */ + @Getter + private final Set registeredMetricNames; + + /** + * Thrown from the ctor when phase-2 register throws after at least one metric was already + * registered. Carries the subset that did land, so the caller can unregister exactly what + * this apply attempt touched and leave the old bundle's unchanged metrics alone. + * + *
<p>
Phase-1 (compile) failures do NOT use this exception — nothing was registered, the + * original Throwable propagates unwrapped. + */ + public static final class PartialRegistrationException extends RuntimeException { + @Getter + private final Set registeredBeforeFailure; + + public PartialRegistrationException(final String message, final Throwable cause, + final Set registeredBeforeFailure) { + super(message, cause); + this.registeredBeforeFailure = registeredBeforeFailure == null + ? Collections.emptySet() + : registeredBeforeFailure; + } } Analyzer buildAnalyzer(final String metricsName, @@ -99,16 +198,55 @@ Analyzer buildAnalyzer(final String metricsName, final String exp, final MeterSystem service, final String yamlSource) { + return buildAnalyzer(metricsName, filter, exp, service, yamlSource, null, null); + } + + Analyzer buildAnalyzer(final String metricsName, + final FilterExpression filter, + final String exp, + final MeterSystem service, + final String yamlSource, + final javassist.ClassPool pool, + final ClassLoader targetClassLoader) { return Analyzer.build( metricsName, filter, exp, service, - yamlSource + yamlSource, + pool, + targetClassLoader + ); + } + + /** + * Compile-only counterpart to {@link #buildAnalyzer}. The ctor uses this in phase 1 so + * every rule's MAL expression is parsed + typed before any {@code MeterSystem.create} + * call fires. Phase 2 runs {@link Analyzer#register} on the returned objects. 
+ */ + Analyzer prepareAnalyzer(final String metricsName, + final FilterExpression filter, + final String exp, + final MeterSystem service, + final String yamlSource, + final javassist.ClassPool pool, + final ClassLoader targetClassLoader, + final StorageManipulationOpt storageOpt) { + return Analyzer.prepare( + metricsName, + filter, + exp, + service, + yamlSource, + pool, + targetClassLoader, + storageOpt ); } - private static FilterExpression buildFilter(final MetricRuleConfig rule) { + private static FilterExpression buildFilter(final MetricRuleConfig rule, + final javassist.ClassPool pool, + final ClassLoader targetClassLoader) { final String filterText = rule.getFilter(); if (Strings.isNullOrEmpty(filterText)) { return null; @@ -116,7 +254,7 @@ private static FilterExpression buildFilter(final MetricRuleConfig rule) { final String sourceName = rule.getSourceName(); final String yamlSource = sourceName != null ? sourceName + ".yaml" : null; - return new FilterExpression(filterText, "filter", yamlSource); + return new FilterExpression(filterText, "filter", yamlSource, pool, targetClassLoader); } private String formatExp(final String expPrefix, String expSuffix, String exp) { diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALBytecodeHelper.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALBytecodeHelper.java index cd4f38c12f82..e8291917ca91 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALBytecodeHelper.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALBytecodeHelper.java @@ -21,6 +21,8 @@ import java.io.File; import java.io.FileOutputStream; import java.util.ArrayList; +import java.util.Collections; +import java.util.HashSet; import java.util.List; import java.util.Set; import 
java.util.concurrent.atomic.AtomicInteger; @@ -48,11 +50,18 @@ final class MALBytecodeHelper { private static final AtomicInteger CLASS_COUNTER = new AtomicInteger(0); private static final Set USED_CLASS_NAMES = - java.util.Collections.synchronizedSet(new java.util.HashSet<>()); + Collections.synchronizedSet(new HashSet<>()); private File classOutputDir; private String classNameHint; private String yamlSource; + /** + * When true, each apply gets its own per-file classloader, so generated class names are + * scoped to that loader and don't need the process-wide {@link #USED_CLASS_NAMES} dedup. + * Set by {@link MALClassGenerator} when its {@code targetClassLoader} is non-null — the + * runtime-rule hot-update path. Legacy startup (shared OAP app loader) keeps dedup on. + */ + private boolean perFileClassLoader; void setClassOutputDir(final File dir) { this.classOutputDir = dir; @@ -70,6 +79,10 @@ void setYamlSource(final String yamlSource) { this.yamlSource = yamlSource; } + void setPerFileClassLoader(final boolean perFileClassLoader) { + this.perFileClassLoader = perFileClassLoader; + } + // ==================== Class naming ==================== /** @@ -112,6 +125,13 @@ private String buildHintedName() { } private String dedupClassName(final String base) { + // Runtime-rule hot-update gives every apply its own RuleClassLoader — same class name + // across applies lands in different loader namespaces. Skip the process-wide dedup set + // so it doesn't grow without bound across thousands of hot-updates. Legacy startup + // path (shared app loader) still needs dedup. 
+ if (perFileClassLoader) { + return base; + } if (USED_CLASS_NAMES.add(base)) { return base; } @@ -279,7 +299,7 @@ void addLocalVariableTable(final javassist.CtMethod method, */ void addRunLocalVariableTable(final javassist.CtMethod method, final String className, - final java.util.Set varNames) { + final Set varNames) { final String sfDesc = "L" + MALCodegenHelper.SF.replace('.', '/') + ";"; final String[][] vars = new String[1 + varNames.size()][]; diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClassGenerator.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClassGenerator.java index bf232306d87b..1310ca0f5514 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClassGenerator.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClassGenerator.java @@ -18,14 +18,18 @@ package org.apache.skywalking.oap.meter.analyzer.v2.compiler; import java.io.File; +import java.io.IOException; import java.util.ArrayList; +import java.util.HashMap; import java.util.List; +import java.util.Map; import javassist.ClassPool; import javassist.CtClass; import javassist.CtNewConstructor; import javassist.CtNewMethod; import lombok.extern.slf4j.Slf4j; import org.apache.skywalking.oap.meter.analyzer.v2.compiler.rt.MalExpressionPackageHolder; +import org.apache.skywalking.oap.server.core.classloader.BytecodeClassDefiner; import org.apache.skywalking.oap.meter.analyzer.v2.dsl.ExpressionMetadata; import org.apache.skywalking.oap.meter.analyzer.v2.dsl.MalExpression; import org.apache.skywalking.oap.meter.analyzer.v2.dsl.MalFilter; @@ -57,8 +61,18 @@ public final class MALClassGenerator { private final ClassPool classPool; private final MALBytecodeHelper bytecodeHelper; + /** + * When non-null, generated MAL classes (MalExpression, 
MalFilter, closure companions) + * are defined in this ClassLoader via {@code ctClass.toClass(loader, null)} — used by + * the runtime-rule hot-update path so the whole MAL class family for one YAML file + * lives in a single per-file {@code RuleClassLoader} and drops together on unregister. + * Null = legacy startup path: uses neighbor-class form with + * {@link MalExpressionPackageHolder} so classes land in the OAP app loader. + */ + private final ClassLoader targetClassLoader; + public MALClassGenerator() { - this(createClassPool()); + this(createClassPool(), null); if (StringUtil.isNotEmpty(System.getenv("SW_DYNAMIC_CLASS_ENGINE_DEBUG"))) { bytecodeHelper.setClassOutputDir( new File(WorkPath.getPath().getParentFile(), "mal-rt")); @@ -74,8 +88,22 @@ private static ClassPool createClassPool() { } public MALClassGenerator(final ClassPool classPool) { + this(classPool, null); + } + + /** + * Runtime-rule constructor: caller supplies the per-file {@link ClassPool} (already + * scoped to a per-file {@code RuleClassLoader} via {@code LoaderClassPath}) and the + * target {@link ClassLoader}. Every class this generator emits will be loaded into + * {@code targetClassLoader} rather than the OAP app loader. + */ + public MALClassGenerator(final ClassPool classPool, final ClassLoader targetClassLoader) { this.classPool = classPool; this.bytecodeHelper = new MALBytecodeHelper(); + this.targetClassLoader = targetClassLoader; + // Per-file loader mode: generated class names are scoped to this loader's namespace so + // the helper can skip its process-wide dedup set (the leak finding). 
+ this.bytecodeHelper.setPerFileClassLoader(targetClassLoader != null); } public void setClassOutputDir(final File dir) { @@ -159,11 +187,43 @@ public MalFilter compileFilter(final String filterExpression) throws Exception { bytecodeHelper.writeClassFile(ctClass); - final Class clazz = ctClass.toClass(MalExpressionPackageHolder.class); + final Class clazz = defineClass(ctClass); ctClass.detach(); return (MalFilter) clazz.getDeclaredConstructor().newInstance(); } + /** + * Loads a generated class through the configured {@link #targetClassLoader} when set + * (runtime-rule hot-update path: class lands in the per-file {@code RuleClassLoader}), + * or via the neighbor-class form when {@code targetClassLoader} is {@code null} + * (startup path: class lands in the OAP app loader alongside + * {@link MalExpressionPackageHolder}). + * + *
<p>
When {@code targetClassLoader} implements + * {@link org.apache.skywalking.oap.server.core.classloader.BytecodeClassDefiner + * BytecodeClassDefiner} (the runtime-rule {@code RuleClassLoader} does), we hand + * the loader the {@code CtClass.toBytecode()} bytes and let it invoke its public + * {@code defineClass} directly — no Javassist {@code toClass(loader, + * ProtectionDomain)} reflection, no {@code --add-opens java.base/java.lang} + * requirement on JDK 17+. Otherwise we fall back to the legacy 2-arg toClass for + * back-compat, but no shipped loader uses that path today. + */ + private Class defineClass(final CtClass ctClass) throws javassist.CannotCompileException { + if (targetClassLoader != null) { + if (targetClassLoader instanceof BytecodeClassDefiner) { + try { + return ((BytecodeClassDefiner) targetClassLoader) + .defineClass(ctClass.getName(), ctClass.toBytecode()); + } catch (final IOException e) { + throw new javassist.CannotCompileException( + "failed to serialise " + ctClass.getName() + " bytes", e); + } + } + return ctClass.toClass(targetClassLoader, null); + } + return ctClass.toClass(MalExpressionPackageHolder.class); + } + /** * Compiles from a pre-parsed AST model. */ @@ -178,8 +238,8 @@ public MalExpression compileFromModel(final String metricName, final List closureFieldNames = new ArrayList<>(); final List closureInterfaceTypes = new ArrayList<>(); - final java.util.Map closureNameCounts = - new java.util.HashMap<>(); + final Map closureNameCounts = + new HashMap<>(); for (int i = 0; i < closures.size(); i++) { final String purpose = closures.get(i).methodName; final int count = closureNameCounts.getOrDefault(purpose, 0); @@ -262,14 +322,13 @@ public MalExpression compileFromModel(final String metricName, // 6. 
Load companions, then main class for (final CtClass companion : companionClasses) { bytecodeHelper.writeClassFile(companion); - companion.toClass(MalExpressionPackageHolder.class); + defineClass(companion); companion.detach(); } bytecodeHelper.writeClassFile(ctClass); - final Class clazz = - ctClass.toClass(MalExpressionPackageHolder.class); + final Class clazz = defineClass(ctClass); ctClass.detach(); return (MalExpression) clazz.getDeclaredConstructor().newInstance(); @@ -288,8 +347,8 @@ public String generateSource(final String expression) { cc.collectClosures(ast, closures); final List fieldNames = new ArrayList<>(); - final java.util.Map nameCounts = - new java.util.HashMap<>(); + final Map nameCounts = + new HashMap<>(); for (final MALClosureCodegen.ClosureInfo ci : closures) { final String purpose = ci.methodName; final int count = nameCounts.getOrDefault(purpose, 0); diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALMetadataExtractor.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALMetadataExtractor.java index 53427a4328b8..e5886ff21aeb 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALMetadataExtractor.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALMetadataExtractor.java @@ -40,7 +40,7 @@ *
<p>
Also generates the {@code metadata()} method source that returns * {@link ExpressionMetadata} at runtime. */ -final class MALMetadataExtractor { +public final class MALMetadataExtractor { private MALMetadataExtractor() { } @@ -52,7 +52,7 @@ private MALMetadataExtractor() { * extracts samples=["metric"], scopeType=SERVICE, scopeLabels=["svc"], * aggregationLabels=["svc"]. */ - static ExpressionMetadata extractMetadata(final MALExpressionModel.Expr ast) { + public static ExpressionMetadata extractMetadata(final MALExpressionModel.Expr ast) { final Set sampleNames = new LinkedHashSet<>(); collectSampleNames(ast, sampleNames); diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/DSL.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/DSL.java index c29c6ba9c21d..3fc72f470173 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/DSL.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/DSL.java @@ -17,8 +17,12 @@ package org.apache.skywalking.oap.meter.analyzer.v2.dsl; +import javassist.ClassPool; import lombok.extern.slf4j.Slf4j; import org.apache.skywalking.oap.meter.analyzer.v2.compiler.MALClassGenerator; +import org.apache.skywalking.oap.meter.analyzer.v2.compiler.MALExpressionModel; +import org.apache.skywalking.oap.meter.analyzer.v2.compiler.MALMetadataExtractor; +import org.apache.skywalking.oap.meter.analyzer.v2.compiler.MALScriptParser; /** * DSL compiles MAL expression strings into {@link Expression} objects @@ -52,9 +56,58 @@ public static Expression parse(final String metricName, final String expression) public static Expression parse(final String metricName, final String expression, final String yamlSource) { + return parse(metricName, expression, yamlSource, null, null); + } + + /** + * Runtime-rule overload: compile with a per-file {@link 
ClassPool} and target + * {@link ClassLoader}. Every class generated for this expression — the main + * {@code MalExpression} subclass plus any closure companions — is defined in the + * supplied loader instead of the shared OAP app loader. The caller-supplied pool must + * already be scoped to the loader via {@code appendClassPath(new LoaderClassPath(loader))}. + * + *
<p>
When {@code pool} and {@code targetClassLoader} are both null, this delegates to + * the shared {@link #GENERATOR} singleton (startup path, unchanged). Passing null for + * only one of the two is treated as "startup path" — there is no half-isolated mode. + */ + /** + * Extract compile-time {@link ExpressionMetadata} from a MAL expression string without + * running Javassist codegen. Returns scope type, sample names, aggregation labels, + * histogram flag + percentiles, and downsampling — the inputs the runtime-rule classifier + * needs to derive the storage shape tuple {@code (functionName, scopeType)} for a metric + * and decide FILTER_ONLY vs STRUCTURAL. + * + *
<p>
Throws {@link IllegalStateException} on parse failure — wraps the upstream ANTLR + * error listener so callers have a single exception type to catch. + */ + public static ExpressionMetadata extractMetadata(final String expression) { + try { + final MALExpressionModel.Expr ast = MALScriptParser.parse(expression); + return MALMetadataExtractor.extractMetadata(ast); + } catch (final Exception e) { + throw new IllegalStateException( + "Failed to parse MAL expression for metadata: " + expression, e); + } + } + + public static Expression parse(final String metricName, + final String expression, + final String yamlSource, + final ClassPool pool, + final ClassLoader targetClassLoader) { try { - GENERATOR.setYamlSource(yamlSource); - final MalExpression malExpr = GENERATOR.compile(metricName, expression); + final MalExpression malExpr; + if (pool != null && targetClassLoader != null) { + // Per-file generator: one instance per compile is fine — it's just a thin + // orchestrator over ClassPool. Prevents cross-contamination of classNameHint / + // yamlSource state that the shared GENERATOR carries between calls. 
+ final MALClassGenerator perFile = new MALClassGenerator(pool, targetClassLoader); + perFile.setYamlSource(yamlSource); + malExpr = perFile.compile(metricName, expression); + } else { + GENERATOR.setYamlSource(yamlSource); + malExpr = GENERATOR.compile(metricName, expression); + } return new Expression(metricName, expression, malExpr); } catch (Exception e) { throw new IllegalStateException( diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/FilterExpression.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/FilterExpression.java index 8464f8617d2c..49a6ef2d3e96 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/FilterExpression.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/FilterExpression.java @@ -20,6 +20,7 @@ import java.util.HashMap; import java.util.Map; import java.util.Objects; +import javassist.ClassPool; import lombok.ToString; import lombok.extern.slf4j.Slf4j; import org.apache.skywalking.oap.meter.analyzer.v2.compiler.MALClassGenerator; @@ -47,17 +48,45 @@ public FilterExpression(final String literal, final String filterNameHint) { public FilterExpression(final String literal, final String filterNameHint, final String yamlSource) { + this(literal, filterNameHint, yamlSource, null, null); + } + + /** + * Runtime-rule overload: compile the filter with a per-file {@link ClassPool} and target + * {@link ClassLoader} so the generated {@code MalFilter} class lands in the caller's + * per-file loader alongside the {@code MalExpression} classes for the same YAML file. + * + *
<p>
When both {@code pool} and {@code targetClassLoader} are null this delegates to the + * shared startup-path {@link #GENERATOR}, unchanged. + */ + public FilterExpression(final String literal, + final String filterNameHint, + final String yamlSource, + final ClassPool pool, + final ClassLoader targetClassLoader) { this.literal = literal; try { - if (filterNameHint != null) { - GENERATOR.setClassNameHint(filterNameHint); - } - GENERATOR.setYamlSource(yamlSource); - try { - this.malFilter = GENERATOR.compileFilter(literal); - } finally { - GENERATOR.setClassNameHint(null); - GENERATOR.setYamlSource(null); + if (pool != null && targetClassLoader != null) { + // Dedicated generator per filter — avoids mutating the shared singleton's + // classNameHint/yamlSource state and keeps runtime-rule compiles isolated + // from startup compiles running on the shared GENERATOR. + final MALClassGenerator perFile = new MALClassGenerator(pool, targetClassLoader); + if (filterNameHint != null) { + perFile.setClassNameHint(filterNameHint); + } + perFile.setYamlSource(yamlSource); + this.malFilter = perFile.compileFilter(literal); + } else { + if (filterNameHint != null) { + GENERATOR.setClassNameHint(filterNameHint); + } + GENERATOR.setYamlSource(yamlSource); + try { + this.malFilter = GENERATOR.compileFilter(literal); + } finally { + GENERATOR.setClassNameHint(null); + GENERATOR.setYamlSource(null); + } } } catch (Exception e) { throw new IllegalStateException( diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/prometheus/rule/Rules.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/prometheus/rule/Rules.java index c179e774a70b..9596f03e02da 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/prometheus/rule/Rules.java +++ 
b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/prometheus/rule/Rules.java @@ -18,25 +18,27 @@ package org.apache.skywalking.oap.meter.analyzer.v2.prometheus.rule; +import java.io.ByteArrayInputStream; import java.io.File; -import java.io.FileReader; import java.io.IOException; +import java.io.InputStreamReader; import java.io.Reader; - +import java.nio.charset.StandardCharsets; import java.nio.file.FileSystems; import java.nio.file.Files; import java.nio.file.Path; -import java.util.Collections; +import java.util.HashMap; import java.util.List; import java.util.Map; -import java.util.Objects; import java.util.stream.Collectors; import java.util.stream.Stream; import org.apache.skywalking.oap.server.core.UnexpectedException; +import org.apache.skywalking.oap.server.core.rule.ext.RuleSetMerger; +import org.apache.skywalking.oap.server.library.module.ModuleManager; import org.apache.skywalking.oap.server.library.util.ResourceUtils; import org.slf4j.Logger; @@ -49,11 +51,32 @@ public class Rules { private static final Logger LOG = LoggerFactory.getLogger(Rule.class); - public static List loadRules(final String path) throws IOException { - return loadRules(path, Collections.emptyList()); + /** + * Default-manager entry point. Picks up the process-wide {@link ModuleManager} set by + * core during start, so receivers don't have to thread it through their own loaders. + * Tests with no core boot get an empty resolver list and pure disk-only loading. + */ + public static List loadRules(final String path, List enabledRules) throws IOException { + return loadInternal(path, enabledRules, null, /* useInstalledManager= */ true); } - public static List loadRules(final String path, List enabledRules) throws IOException { + /** + * Explicit-manager entry point — primarily for receivers that already hold a + * {@link ModuleManager} and want to bypass the process-wide one. + * + *
<p>
The {@code path} doubles as the catalog identifier passed to resolvers — rule + * directories under {@code server-starter/src/main/resources/} (for instance + * {@code otel-rules}, {@code log-mal-rules}, {@code envoy-metrics-rules}) already align + * with the runtime-rule catalog namespace. + */ + public static List loadRules(final String path, List enabledRules, + final ModuleManager manager) throws IOException { + return loadInternal(path, enabledRules, manager, /* useInstalledManager= */ false); + } + + private static List loadInternal(final String path, List enabledRules, + final ModuleManager manager, + final boolean useInstalledManager) throws IOException { final Path root = ResourceUtils.getPath(path); @@ -70,26 +93,33 @@ public static List loadRules(final String path, List enabledRules) return rule; }) .collect(Collectors.toMap(rule -> rule, $ -> false)); - List rules; + + // Disk baseline: every file under `root` that matches the enabled-rules glob, keyed + // by relative path without extension (the rule name). + final Map diskBytes = new HashMap<>(); try (Stream stream = Files.walk(root)) { - rules = stream - .filter(it -> formedEnabledRules.keySet().stream() - .anyMatch(rule -> { - boolean matches = FileSystems.getDefault().getPathMatcher("glob:" + rule) - .matches(root.relativize(it)); - if (matches) { - formedEnabledRules.put(rule, true); - } - return matches; - })) - .map(pathPointer -> { - // Use relativized file path without suffix as the rule name. 
- String relativizePath = root.relativize(pathPointer).toString(); - String ruleName = relativizePath.substring(0, relativizePath.lastIndexOf(".")); - return getRulesFromFile(ruleName, pathPointer); - }) - .filter(Objects::nonNull) - .collect(Collectors.toList()) ; + stream.filter(p -> { + File f = p.toFile(); + if (!f.isFile() || f.isHidden()) { + return false; + } + return formedEnabledRules.keySet().stream().anyMatch(rule -> { + boolean matches = FileSystems.getDefault().getPathMatcher("glob:" + rule) + .matches(root.relativize(p)); + if (matches) { + formedEnabledRules.put(rule, true); + } + return matches; + }); + }).forEach(p -> { + final String rel = root.relativize(p).toString(); + final String ruleName = rel.substring(0, rel.lastIndexOf('.')); + try { + diskBytes.put(ruleName, Files.readAllBytes(p)); + } catch (IOException e) { + throw new UnexpectedException("Load rule file " + p.getFileName() + " failed", e); + } + }); } if (formedEnabledRules.containsValue(false)) { @@ -98,15 +128,22 @@ public static List loadRules(final String path, List enabledRules) .collect(Collectors.toList()); throw new UnexpectedException("Some configuration files of enabled rules are not found, enabled rules: " + rulesNotFound); } - return rules; + + // Merge with classpath-discovered resolvers (runtime-rule DB, plus any future + // priority-ranked source). Resolvers contributing INACTIVE drop their entries; + // ACTIVE substitutes content. Resolver-only rules (not on disk) are included. + final Map merged = useInstalledManager + ? 
RuleSetMerger.merge(path, diskBytes) + : RuleSetMerger.merge(path, diskBytes, manager); + + return merged.entrySet().stream() + .map(e -> parseRule(e.getKey(), e.getValue())) + .filter(java.util.Objects::nonNull) + .collect(Collectors.toList()); } - private static Rule getRulesFromFile(String ruleName, Path path) { - File file = path.toFile(); - if (!file.isFile() || file.isHidden()) { - return null; - } - try (Reader r = new FileReader(file)) { + private static Rule parseRule(final String ruleName, final byte[] bytes) { + try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) { Rule rule = new Yaml().loadAs(r, Rule.class); if (rule == null) { return null; @@ -114,7 +151,7 @@ private static Rule getRulesFromFile(String ruleName, Path path) { rule.setName(ruleName); return rule; } catch (IOException e) { - throw new UnexpectedException("Load rule file" + file.getName() + " failed", e); + throw new UnexpectedException("Load rule " + ruleName + " failed", e); } } } diff --git a/oap-server/exporter/pom.xml b/oap-server/exporter/pom.xml index 33392941e077..1b648a9d5c9d 100644 --- a/oap-server/exporter/pom.xml +++ b/oap-server/exporter/pom.xml @@ -68,11 +68,11 @@ protobuf-java version that grpc depends on. --> - com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/oap-server/server-alarm-plugin/pom.xml b/oap-server/server-alarm-plugin/pom.xml index 5fd7206274c3..9f4494dd6de9 100644 --- a/oap-server/server-alarm-plugin/pom.xml +++ b/oap-server/server-alarm-plugin/pom.xml @@ -78,10 +78,10 @@ protobuf-java directly, you will be transitively depending on the protobuf-java version that grpc depends on. 
--> - com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/AlarmKernel.java b/oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/AlarmKernel.java new file mode 100644 index 000000000000..92bd39701b57 --- /dev/null +++ b/oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/AlarmKernel.java @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.core.alarm.provider; + +import java.util.List; +import java.util.Map; +import java.util.Set; +import lombok.RequiredArgsConstructor; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.alarm.AlarmKernelService; + +/** + * Default implementation of {@link AlarmKernelService}. 
Walks every {@link RunningRule} held + * by {@link AlarmRulesWatcher} and, for any rule whose MQE expression references one of the + * supplied metric names, invokes {@link RunningRule#resetWindows} to discard accumulated + * per-entity window state so firing state does not carry across a metric-semantics boundary. + * + *
<p>
Match criterion uses the authoritative {@code includeMetrics} set + * ({@link RunningRule#getIncludeMetrics}) computed by {@code AlarmMQEVerifyVisitor} at + * rule-load time directly from the parsed MQE tree. That set is the same filter the rule's + * {@code in()} already uses to accept / drop incoming samples, so matching against it here is + * symmetrical: we reset exactly the windows that would observe a semantics change. + * + *
<p>
Concurrency: {@code RunningRule.windows} is a ConcurrentHashMap and + * {@code resetWindows()} is a single atomic {@code clear()}. Sample evaluation that arrives + * mid-reset may briefly observe fewer entities; the worst case is one missed evaluation tick + * for one entity, which a real alarm recovers from within the next period. + */ +@Slf4j +@RequiredArgsConstructor +public class AlarmKernel implements AlarmKernelService { + + private final AlarmRulesWatcher rulesWatcher; + + @Override + public void reset(final Set affectedMetricNames) { + if (affectedMetricNames == null || affectedMetricNames.isEmpty()) { + return; + } + final Map> running = rulesWatcher.getRunningContext(); + if (running == null || running.isEmpty()) { + return; + } + int matched = 0; + for (final Map.Entry> entry : running.entrySet()) { + for (final RunningRule rule : entry.getValue()) { + final Set ruleMetrics = rule.getIncludeMetrics(); + if (ruleMetrics == null || ruleMetrics.isEmpty()) { + continue; + } + String matchedMetric = null; + for (final String metric : affectedMetricNames) { + if (ruleMetrics.contains(metric)) { + matchedMetric = metric; + break; + } + } + if (matchedMetric == null) { + continue; + } + rule.resetWindows(); + log.info("alarm-kernel reset: rule={} affected-metric={} (windows cleared)", + rule.getRuleName(), matchedMetric); + matched++; + } + } + if (matched > 0) { + log.info("alarm-kernel reset: {} rule(s) had their windows cleared for affected metrics {}", + matched, affectedMetricNames); + } + } +} diff --git a/oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/AlarmModuleProvider.java b/oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/AlarmModuleProvider.java index 6d4f61002eb3..ed4513a916a9 100644 --- a/oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/AlarmModuleProvider.java +++ 
b/oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/AlarmModuleProvider.java @@ -24,6 +24,7 @@ import org.apache.skywalking.oap.server.configuration.api.ConfigurationModule; import org.apache.skywalking.oap.server.configuration.api.DynamicConfigurationService; import org.apache.skywalking.oap.server.core.CoreModule; +import org.apache.skywalking.oap.server.core.alarm.AlarmKernelService; import org.apache.skywalking.oap.server.core.alarm.AlarmModule; import org.apache.skywalking.oap.server.core.alarm.AlarmRulesWatcherService; import org.apache.skywalking.oap.server.core.alarm.AlarmStandardPersistence; @@ -63,6 +64,8 @@ public void prepare() throws ServiceNotProvidedException, ModuleStartException { this.registerServiceImplementation(MetricsNotify.class, notifyHandler); this.registerServiceImplementation(AlarmRulesWatcherService.class, alarmRulesWatcher); this.registerServiceImplementation(AlarmStatusWatcherService.class, new AlarmStatusWatcher(getManager())); + this.registerServiceImplementation(AlarmKernelService.class, + new AlarmKernel(alarmRulesWatcher)); } @Override diff --git a/oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java b/oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java index a0d356336283..0bc5e77e9a46 100644 --- a/oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java +++ b/oap-server/server-alarm-plugin/src/main/java/org/apache/skywalking/oap/server/core/alarm/provider/RunningRule.java @@ -198,6 +198,26 @@ private boolean validate(String target, List includeList, List e return true; } + /** + * Discard every per-entity {@link Window} this rule owns. 
Used by runtime rule hot-update + * ({@link org.apache.skywalking.oap.server.core.alarm.AlarmKernelService}) when a metric's + * semantics move so alarm state does not carry across the boundary — the next arriving + * sample re-creates a fresh Window via {@link java.util.concurrent.ConcurrentHashMap#computeIfAbsent} + * in {@link #in}. + * + *
<p>
Each Window is reset under its own {@code Window.lock} so any in-flight + * {@link Window#add} / {@link Window#checkAlarm} finishes before we wipe its values and + * state-machine counters. The subsequent {@code windows.clear()} drops the map entries; + * any Window that {@code in()} allocated concurrently during the iteration is itself + * fresh (post-{@code init()}) and is removed cleanly by the clear. Next arriving sample + * for any entity allocates a new Window with state-machine at {@code NORMAL} and no + * carried-over values. + */ + public void resetWindows() { + windows.values().forEach(Window::reset); + windows.clear(); + } + /** * Move the buffer window to give time. * @@ -471,6 +491,25 @@ public void scanWindowValues(Consumer>> scanFunc } } + /** + * Atomic per-window reset invoked by {@link RunningRule#resetWindows()}. Holds the + * same {@code lock} that {@link #add}, {@link #moveTo}, and {@link #isMatch} use, so + * any in-flight alarm evaluation for this window finishes before the values and + * state-machine are wiped. + */ + public void reset() { + lock.lock(); + try { + init(); + endTime = null; + lastAlarmMessage = null; + mqeMetricsSnapshot = null; + stateMachine.reset(); + } finally { + lock.unlock(); + } + } + private void init() { values = new LinkedList<>(); for (int i = 0; i < size; i++) { @@ -584,6 +623,17 @@ private void resetCountdowns() { this.recoveryObservationCountdown = this.recoveryObservationPeriod; } + /** + * Reset to freshly-constructed state. Invoked by {@link Window#reset()} during + * runtime-rule hot-update so alarm firings built up against the old metric + * semantics don't carry across the boundary. 
+ */ + public void reset() { + this.currentState = State.NORMAL; + this.silenceCountdown = -1; + this.recoveryObservationCountdown = recoveryObservationPeriod; + } + } } diff --git a/oap-server/server-configuration/grpc-configuration-sync/pom.xml b/oap-server/server-configuration/grpc-configuration-sync/pom.xml index 74f726b4c266..dcae0e083f1e 100644 --- a/oap-server/server-configuration/grpc-configuration-sync/pom.xml +++ b/oap-server/server-configuration/grpc-configuration-sync/pom.xml @@ -95,11 +95,11 @@ protobuf-java version that grpc depends on. --> - com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/oap-server/server-core/pom.xml b/oap-server/server-core/pom.xml index 484507e63bd0..98a0b6d4032b 100644 --- a/oap-server/server-core/pom.xml +++ b/oap-server/server-core/pom.xml @@ -148,11 +148,11 @@ protobuf-java version that grpc depends on. 
--> - com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/CoreModule.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/CoreModule.java index 416c70e0ad03..8a18a9375e81 100755 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/CoreModule.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/CoreModule.java @@ -69,7 +69,7 @@ import org.apache.skywalking.oap.server.core.status.ServerStatusService; import org.apache.skywalking.oap.server.core.storage.model.IModelManager; import org.apache.skywalking.oap.server.core.trace.SpanListenerManager; -import org.apache.skywalking.oap.server.core.storage.model.ModelCreator; +import org.apache.skywalking.oap.server.core.storage.model.ModelRegistry; import org.apache.skywalking.oap.server.core.storage.model.ModelManipulator; import org.apache.skywalking.oap.server.core.worker.IWorkerInstanceGetter; import org.apache.skywalking.oap.server.core.worker.IWorkerInstanceSetter; @@ -178,7 +178,7 @@ private void addServerInterface(List classes) { } private void addInternalServices(List classes) { - classes.add(ModelCreator.class); + classes.add(ModelRegistry.class); classes.add(IModelManager.class); classes.add(ModelManipulator.class); classes.add(RemoteClientManager.class); diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/CoreModuleProvider.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/CoreModuleProvider.java index 801620258931..273b4bfbb6b1 100755 --- 
a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/CoreModuleProvider.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/CoreModuleProvider.java @@ -23,6 +23,7 @@ import org.apache.skywalking.oap.server.configuration.api.ConfigurationModule; import org.apache.skywalking.oap.server.configuration.api.DynamicConfigurationService; import org.apache.skywalking.oap.server.core.analysis.ApdexThresholdConfig; +import org.apache.skywalking.oap.server.core.rule.ext.RuleSetMerger; import org.apache.skywalking.oap.server.core.analysis.DisableRegister; import org.apache.skywalking.oap.server.core.analysis.StreamAnnotationListener; import org.apache.skywalking.oap.server.core.analysis.meter.MeterEntity; @@ -102,7 +103,7 @@ import org.apache.skywalking.oap.server.core.storage.PersistenceTimer; import org.apache.skywalking.oap.server.core.storage.StorageException; import org.apache.skywalking.oap.server.core.storage.model.IModelManager; -import org.apache.skywalking.oap.server.core.storage.model.ModelCreator; +import org.apache.skywalking.oap.server.core.storage.model.ModelRegistry; import org.apache.skywalking.oap.server.core.storage.model.ModelManipulator; import org.apache.skywalking.oap.server.core.storage.model.StorageModels; import org.apache.skywalking.oap.server.core.storage.ttl.DataTTLKeeperTimer; @@ -296,7 +297,7 @@ public void prepare() throws ServiceNotProvidedException, ModuleStartException { this.registerServiceImplementation(IWorkerInstanceSetter.class, instancesService); this.registerServiceImplementation(RemoteSenderService.class, new RemoteSenderService(getManager())); - this.registerServiceImplementation(ModelCreator.class, storageModels); + this.registerServiceImplementation(ModelRegistry.class, storageModels); this.registerServiceImplementation(IModelManager.class, storageModels); this.registerServiceImplementation(ModelManipulator.class, storageModels); @@ -429,6 +430,14 @@ public void start() 
throws ModuleStartException { throw new ModuleStartException(e.getMessage(), e); } + // Install the process-wide ModuleManager for RuleSetMerger so MAL/LAL static-rule + // loaders pick up classpath-discovered RuntimeRuleOverrideResolvers (notably the + // runtime-rule DB resolver) without having to thread the manager through every + // signature. Done after the management streams are registered (annotationScan above + // creates the runtime_rule table on the backend) and before analyzers start loading + // static MAL/LAL files. + RuleSetMerger.installManager(getManager()); + Address gRPCServerInstanceAddress = new Address(moduleConfig.getGRPCHost(), moduleConfig.getGRPCPort(), true); TelemetryRelatedContext.INSTANCE.setId(gRPCServerInstanceAddress.toString()); ClusterCoordinator coordinator = this.getManager() diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/alarm/AlarmKernelService.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/alarm/AlarmKernelService.java new file mode 100644 index 000000000000..7f81ee4505db --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/alarm/AlarmKernelService.java @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.core.alarm; + +import java.util.Set; +import org.apache.skywalking.oap.server.library.module.Service; + +/** + * Kernel operations on the alarm subsystem for cross-module callers. Named broadly so future + * alarm-kernel operations (force-fire, pause-rule, inspect-state, etc.) can extend the same + * interface without adding a new module contract for each. + * + *
<p>
First method: {@link #reset(Set)} — invoked by the runtime-rule hot-update pipeline at + * the tail of a successful structural apply. When a metric name's semantics move (function + * change, scope change, metric added or removed), any alarm rule whose expression references + * that metric holds window values that are no longer semantically comparable to new samples; + * a reset zeroes the window and state-machine so firing state doesn't carry across the + * boundary. + */ +public interface AlarmKernelService extends Service { + + /** + * Reset the evaluation window of every running alarm rule that references any of the + * supplied metric names. Specifically: clear accumulated window values, reset the rule's + * state-machine to OK, zero silence and recovery countdowns, and reset {@code endTime}. + * + *
<p>
Best-effort: a failure to reset a single rule logs a warn and continues — the alarm + * subsystem self-heals within one evaluation period anyway; the reset is a quality-of-life + * nudge to avoid false firings across the metric-semantics boundary. + * + *
<p>
No-op when {@code affectedMetricNames} is null or empty. Safe to call from any + * thread; the implementation is expected to serialize per-rule resets with concurrent + * sample evaluation on that rule so observers never see a torn state. + * + * @param affectedMetricNames metric names whose semantics just moved. Typically derived + * from the runtime-rule apply pipeline's union of added / + * removed / shape-changed metric sets. + */ + void reset(Set affectedMetricNames); +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/alarm/AlarmModule.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/alarm/AlarmModule.java index 8f963a2fd8f8..b9f91ff4e187 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/alarm/AlarmModule.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/alarm/AlarmModule.java @@ -34,6 +34,11 @@ public AlarmModule() { @Override public Class[] services() { - return new Class[] {MetricsNotify.class, AlarmRulesWatcherService.class, AlarmStatusWatcherService.class}; + return new Class[] { + MetricsNotify.class, + AlarmRulesWatcherService.class, + AlarmStatusWatcherService.class, + AlarmKernelService.class, + }; } } diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystem.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystem.java index 3e0344091c75..90c225adf530 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystem.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystem.java @@ -26,6 +26,7 @@ import java.util.HashMap; import java.util.Map; import java.util.Objects; +import java.util.Set; import javassist.CannotCompileException; import javassist.ClassPool; import javassist.CtClass; @@ -39,6 
+40,7 @@ import org.apache.commons.lang3.JavaVersion; import org.apache.commons.lang3.SystemUtils; import org.apache.skywalking.oap.server.core.UnexpectedException; +import org.apache.skywalking.oap.server.core.classloader.BytecodeClassDefiner; import org.apache.skywalking.oap.server.core.analysis.StreamDefinition; import org.apache.skywalking.oap.server.core.analysis.TimeBucket; import org.apache.skywalking.oap.server.core.analysis.meter.dynamic.MeterClassPackageHolder; @@ -47,6 +49,9 @@ import org.apache.skywalking.oap.server.core.analysis.metrics.Metrics; import org.apache.skywalking.oap.server.core.analysis.worker.MetricsStreamProcessor; import org.apache.skywalking.oap.server.core.storage.StorageException; +import org.apache.skywalking.oap.server.core.storage.model.ModelRegistry; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.core.CoreModule; import org.apache.skywalking.oap.server.library.module.ModuleManager; import org.apache.skywalking.oap.server.library.module.Service; @@ -144,7 +149,440 @@ public synchronized void create(String metricsName, String functionName, ScopeType type, Class dataType) throws IllegalArgumentException { - /** + // Static boot path: create-if-absent semantics so a backend that already holds this + // metric under a different shape is preserved and reported, not silently reshaped. 
+ createInternal(metricsName, functionName, type, dataType, classPool, MeterClassPackageHolder.class, + StorageManipulationOpt.createIfAbsent()); + } + + /** + * Runtime-rule overload at the 3-arg entry point: resolves {@code dataType} reflectively + * from the function's {@link AcceptableValue} parameterization (same derivation as the + * no-pool 3-arg overload at {@link #create(String, String, ScopeType)}) and threads the + * caller-supplied per-file {@code ClassPool} + {@code ClassLoader} through so the + * generated Metrics subclass is defined directly IN the runtime-rule loader (not in the + * neighbor's loader). Lets the runtime-rule bundle drop every class it created together. + */ + /** + * Runtime-rule entry point: create a streaming calculation under a caller-supplied + * per-file {@code ClassPool} + {@code ClassLoader}, with a caller-specified + * {@link StorageManipulationOpt} policy. Main-node apply passes + * {@link StorageManipulationOpt#fullInstall()} (the usual install path); peer-node apply + * passes {@link StorageManipulationOpt#localCacheOnly()} so local state is populated + * (MeterSystem meterPrototypes, BanyanDB MetadataRegistry, StorageModels entry) without + * firing server-side {@code createMeasure} / {@code update}. 
+ */ + public synchronized void create(String metricsName, + String functionName, + ScopeType type, + ClassPool pool, + ClassLoader targetClassLoader, + StorageManipulationOpt opt) throws IllegalArgumentException { + final Class meterFunction = functionRegister.get(functionName); + if (meterFunction == null) { + throw new IllegalArgumentException("Function " + functionName + " can't be found."); + } + Type acceptance = null; + for (final Type genericInterface : meterFunction.getGenericInterfaces()) { + if (genericInterface instanceof ParameterizedType) { + ParameterizedType parameterizedType = (ParameterizedType) genericInterface; + if (parameterizedType.getRawType().getTypeName().equals(AcceptableValue.class.getName())) { + Type[] arguments = parameterizedType.getActualTypeArguments(); + acceptance = arguments[0]; + break; + } + } + } + try { + createInternal(metricsName, functionName, type, + Class.forName(Objects.requireNonNull(acceptance).getTypeName()), + pool, targetClassLoader, opt); + } catch (ClassNotFoundException e) { + throw new IllegalArgumentException(e); + } + } + + private void createInternal(final String metricsName, + final String functionName, + final ScopeType type, + final Class dataType, + final ClassPool pool, + final ClassLoader targetClassLoader, + final StorageManipulationOpt opt) throws IllegalArgumentException { + final Class meterFunction = functionRegister.get(functionName); + if (meterFunction == null) { + throw new IllegalArgumentException("Function " + functionName + " can't be found."); + } + boolean foundDataType = false; + String acceptance = null; + for (final Type genericInterface : meterFunction.getGenericInterfaces()) { + if (genericInterface instanceof ParameterizedType) { + ParameterizedType parameterizedType = (ParameterizedType) genericInterface; + if (parameterizedType.getRawType().getTypeName().equals(AcceptableValue.class.getName())) { + Type[] arguments = parameterizedType.getActualTypeArguments(); + if 
(arguments[0].equals(dataType)) { + foundDataType = true; + } else { + acceptance = arguments[0].getTypeName(); + } + } + if (foundDataType) { + break; + } + } + } + if (!foundDataType) { + throw new IllegalArgumentException("Function " + functionName + + " requires <" + acceptance + "> in AcceptableValue" + + " but using " + dataType.getName() + " in the creation"); + } + final CtClass parentClass; + try { + parentClass = pool.get(meterFunction.getCanonicalName()); + if (!Metrics.class.isAssignableFrom(meterFunction)) { + throw new IllegalArgumentException( + "Function " + functionName + " doesn't inherit from Metrics."); + } + } catch (NotFoundException e) { + throw new IllegalArgumentException("Function " + functionName + " can't be found by javaassist."); + } + final String className = formatName(metricsName); + // Prototype-first short-circuit (fires on runtime FILTER_ONLY re-apply). Every + // runtime apply hands in a fresh {@code ClassPool}, so the pool-based existence + // check below cannot see a Metrics class the previous apply defined in a now-dead + // pool. Without this guard, every FILTER_ONLY update generated a new Metrics class, + // new MetricsStreamProcessor workers, and a new prototype that shadowed the old + // one in {@link #meterPrototypes} — a removeMetric by name could only tear down the + // latest generation, leaving prior workers + classloaders pinned forever. Match on + // scope + data type + function class; any of those differing is a genuine shape + // change and the existing IllegalArgumentException on the pool path fires below. 
+ final MeterDefinition existingDefinition = meterPrototypes.get(metricsName); + if (existingDefinition != null + && existingDefinition.getScopeType() == type + && existingDefinition.getDataType().equals(dataType) + && existingDefinition.getMeterPrototype().getClass().getSuperclass() == meterFunction) { + log.debug("Metric {} already registered with matching shape; reusing existing " + + "Metrics class + workers (FILTER_ONLY re-apply path).", metricsName); + return; + } + try { + CtClass existingMetric = pool.get(METER_CLASS_PACKAGE + className); + if (existingMetric.getSuperclass() != parentClass + || type != meterPrototypes.get(metricsName).getScopeType()) { + throw new IllegalArgumentException( + metricsName + " has been defined, but calculate function or/are scope type is/are different."); + } + log.info("Metric {} is already defined, so skip the metric creation.", metricsName); + return; + } catch (NotFoundException ignored) { + // proceed — class not yet defined in this pool + } + CtClass metricsClass = pool.makeClass(METER_CLASS_PACKAGE + className, parentClass); + try { + metricsClass.addConstructor(CtNewConstructor.make("public " + className + "() {}", metricsClass)); + metricsClass.addMethod(CtNewMethod.make( + "public org.apache.skywalking.oap.server.core.analysis.meter.function.AcceptableValue createNew() {" + + " org.apache.skywalking.oap.server.core.analysis.meter.function.AcceptableValue meterVar = new " + METER_CLASS_PACKAGE + className + "();" + + " ((org.apache.skywalking.oap.server.core.analysis.meter.Meter)meterVar).initMeta(\"" + metricsName + "\", " + type.getScopeId() + ");" + + " return meterVar;" + + "}", + metricsClass)); + } catch (CannotCompileException e) { + throw new UnexpectedException(e.getMessage(), e); + } + final Class targetClass; + try { + // Explicit targetClassLoader — the generated class goes directly into the + // per-file RuleClassLoader, not the neighbor class's loader. 
Two paths: + // + // - {@link BytecodeClassDefiner} loaders (the runtime-rule {@code + // RuleClassLoader} is the only known implementor today): hand the loader + // the raw bytecode via its public {@code defineClass(String, byte[])}. + // This sidesteps Javassist's deprecated 2-arg {@code toClass(loader, + // ProtectionDomain)} which reflects into {@code java.lang.ClassLoader. + // defineClass} and requires {@code --add-opens java.base/java.lang} on + // JDK 17+ — a JVM-flag tax we don't want to put on every operator. + // + // - Other loaders (legacy callers): keep the 2-arg toClass for back-compat. + // No new constraints on existing static rule paths; they don't use this + // overload anyway, so this branch is effectively dead today and exists as + // a safety net. + if (targetClassLoader instanceof BytecodeClassDefiner) { + targetClass = ((BytecodeClassDefiner) targetClassLoader) + .defineClass(METER_CLASS_PACKAGE + className, metricsClass.toBytecode()); + } else { + targetClass = metricsClass.toClass(targetClassLoader, null); + } + AcceptableValue prototype = (AcceptableValue) targetClass.newInstance(); + meterPrototypes.put(metricsName, new MeterDefinition(type, prototype, dataType, true)); + MetricsStreamProcessor.getInstance().create( + manager, + new StreamDefinition( + metricsName, type.getScopeId(), prototype.builder(), MetricsStreamProcessor.class), + targetClass, + opt); + // Roll back the prototype if the installer refused to reshape the backend + // (SKIPPED_SHAPE_MISMATCH recorded on opt). Leaving the prototype in + // meterPrototypes would mean dispatch lookups succeed for a metric whose + // storage workers MetricsStreamProcessor refused to register — samples would + // fail silently later rather than at registration time. Boot continues with + // this metric inactive; operator reshapes explicitly via the runtime-rule + // on-demand endpoint. 
+ if (opt.hasShapeMismatch()) { + meterPrototypes.remove(metricsName); + } + } catch (CannotCompileException | IllegalAccessException | InstantiationException + | StorageException | IOException e) { + // Also roll back on exception paths — an unsuccessful create must not leave a + // prototype stranded in the registry. {@link IOException} surfaces from + // {@code metricsClass.toBytecode()} when serialising the generated bytes; treat + // it the same as a Javassist compile failure so the apply rolls back cleanly. + meterPrototypes.remove(metricsName); + throw new UnexpectedException(e.getMessage(), e); + } + } + + /** + * Runtime-rule overload: create a streaming calculation whose dynamically-generated + * {@code Metrics} subclass is made in the caller-supplied {@code ClassPool} and loaded through + * the caller-supplied {@code classLoaderNeighbor} (a class already loaded by the target + * per-file {@code RuleClassLoader}). + * + *
<p>
Used by MAL/LAL hot-update so all of a rule file's generated classes — the + * {@code MalExpression} / {@code LalExpression} + closure companions produced by the DSL + * generators AND the {@code Metrics} subclass produced here — share one classloader and can + * all be dropped together for GC on hot-remove. Without this overload, the Metrics class + * would remain pinned in the default pool and the default loader, blocking shape-breaking + * re-registration and leaking classes across churn. + * + *
<p>
Startup path is unchanged — the existing overloads continue to use the instance-field + * default pool and {@link MeterClassPackageHolder} as the loader neighbor. + * + * @param metricsName storage entity name + * @param functionName function provided through {@link MeterFunction} + * @param type scope type + * @param dataType accepted value data type + * @param pool per-file Javassist pool, typically constructed as + * {@code new ClassPool(ClassPool.getDefault())} with + * {@code LoaderClassPath(ruleLoader)} appended + * @param classLoaderNeighbor a class loaded by the per-file {@code RuleClassLoader}; used + * by Javassist's {@code toClass(Class)} on Java 9+ to resolve + * the target loader. On Java 8, its classloader is passed to + * the legacy {@code toClass(ClassLoader, ProtectionDomain)} + */ + public synchronized void create(String metricsName, + String functionName, + ScopeType type, + Class dataType, + ClassPool pool, + Class classLoaderNeighbor) throws IllegalArgumentException { + if (pool == null) { + throw new IllegalArgumentException("pool must not be null"); + } + if (classLoaderNeighbor == null) { + throw new IllegalArgumentException("classLoaderNeighbor must not be null"); + } + createInternal(metricsName, functionName, type, dataType, pool, classLoaderNeighbor, + StorageManipulationOpt.fullInstall()); + } + + /** + * Remove a previously-registered metric by name. Symmetric to {@link #create(String, String, + * ScopeType, Class)} / the pool-aware overload. Used by runtime rule hot-remove (MAL/LAL) + * to retire a metric class cleanly. + * + *
<p>
Steps:
+ * <ol>
+ *   <li>Drops the {@link #meterPrototypes} entry so {@link #buildMetrics(String, Class)} + * rejects further builds for this name.</li>
+ *   <li>Delegates to {@link MetricsStreamProcessor#removeMetric} — L1/L2 drain, worker + * deregistration, shared-queue handler removal.</li>
+ *   <li>Cascades through {@link ModelRegistry#remove(Class, StorageManipulationOpt)} to drop every downsampling + * variant's {@code Model} from the registry; listener {@code whenRemoving} fires for + * each (BanyanDB drops the measure, JDBC/ES no-op).</li>
+ *   <li>Detaches the {@link CtClass} from the default {@link #classPool} so a later + * shape-breaking re-create (e.g. {@code sum}→{@code histogram}) passes the pre-check at + * {@link #create(String, String, ScopeType, Class)} rather than failing with "already + * defined... calculate function or/are scope type is/are different". For runtime-path + * metrics the CtClass lives in a per-file pool owned by the bundle; dropping the + * bundle's pool reference collects the CtClass automatically, so the default-pool + * detach is a best-effort no-op in that case ({@link NotFoundException} is expected).</li>
+ * </ol>
+ * + *
<p>
Not safe to call concurrently with {@link #create(String, String, ScopeType, Class)} or + * another {@link #removeMetric} — the MeterSystem monitor serializes them. Callers must hold + * no other lock that could invert with the MeterSystem monitor. The runtime-rule + * module's per-file lock is always acquired before this monitor, never after. + * + * @param metricsName the metric name to retire + * @return {@code true} if a metric was found and removed, {@code false} otherwise + */ + public synchronized boolean removeMetric(final String metricsName) { + return removeMetric(metricsName, StorageManipulationOpt.fullInstall()); + } + + /** + * Opt-aware {@code removeMetric}. Runtime-rule peer-side callers pass + * {@link StorageManipulationOpt#localCacheOnly()} so {@code ModelInstaller.dropTable} is + * NOT invoked on the shared storage — the cluster main owns that side-effect. + * + *
<p>
Order is backend-first / local-state-second so failure is retriable. The earlier + * version evicted {@code meterPrototypes} and called the cascade in parallel; if the + * cascade threw (BanyanDB {@code dropTable} failure), the local state was already torn + * down — there was nothing for the next {@code /inactivate} or reconciler tick to drive + * a backend retry against, and the operator could not recover the orphaned measure + * without an OAP restart. Now the cascade runs first; on success we drop the local + * caches; on failure we leave {@code meterPrototypes} populated and the CtClass attached + * so a retry hits the backend again. + * + *
<p>
Failure surface: under {@code fullInstall} the storage-model cascade failure is + * propagated as a {@link RuntimeException}. The REST {@code /inactivate} path depends on + * this to surface 500 {@code teardown_deferred} when BanyanDB's delete-measure threw — + * without it the handler would return 200 inactivated despite the measure still being + * live. Under {@code localCacheOnly} the cascade fires {@code whenRemoving} but the + * peer's {@code ModelInstaller.dropTable} is suppressed by policy, so any throw is + * logged and swallowed — the peer has no backend debt. Streaming-chain drain failures + * are always logged and swallowed: stale workers self-drain within one tick. + */ + public synchronized boolean removeMetric(final String metricsName, final StorageManipulationOpt opt) { + final MeterDefinition def = meterPrototypes.get(metricsName); + if (def == null) { + return false; + } + final Class prototypeClass = def.getMeterPrototype().getClass(); + + // Cascade storage-model removal (Hour / Day / Minute) FIRST. ModelRegistry.remove + // fires whenRemoving on every listener, so each backend's ModelInstaller.dropTable + // runs — real delete for BanyanDB, no-op for JDBC / Elasticsearch, skipped outright + // when the caller is a peer-side (LOCAL_CACHE_ONLY) apply. If a listener throws, + // ModelRegistry.remove keeps the model in its registry so this retry path stays + // open: the caller (Reconciler unregisterBundle) preserves appliedMal[key] and the + // next tick (or operator retry) re-enters this method, finds meterPrototypes still + // populated, and re-fires the cascade. 
+ try { + final ModelRegistry modelCreator = manager.find(CoreModule.NAME) + .provider() + .getService(ModelRegistry.class); + modelCreator.remove(prototypeClass, opt); + } catch (final Throwable t) { + log.error("Failed to cascade storage-model removal for metric {}", metricsName, t); + if (opt.getFlags().isEscalateToCaller()) { + throw new RuntimeException( + "Storage-model cascade failed for metric " + metricsName + + "; backend drop did not complete. Local state preserved for retry.", + t); + } + // Non-escalating opt (peer-side localCacheOnly, etc.) — backend drop is + // suppressed by policy anyway, so a listener throw here is the listener's + // local bookkeeping misbehaving, not real backend debt. Fall through to + // clear local state. + } + + // Backend cascade succeeded (or local-cache-only, where it doesn't matter). Drop + // the prototype and drain the workers. Worker drain failure is non-fatal and + // logged: stale workers self-drain within one tick, and the prototype + Model are + // already gone so future samples can't reach them. + meterPrototypes.remove(metricsName); + try { + MetricsStreamProcessor.getInstance().removeMetric(manager, (Class) prototypeClass); + } catch (final Throwable t) { + log.error("Failed to remove streaming chain for metric {}; prototype + storage " + + "model already gone.", metricsName, t); + } + + // Detach the CtClass from the default pool so a future shape-breaking re-create passes + // the pre-check at the head of createInternal. Static-path metrics own a CtClass in + // the instance default pool and must be detached explicitly; runtime-path metrics + // live in a per-file pool that goes away with the bundle, so the detach here is + // unnecessary. The flag is authoritative — do not reach for the default pool and + // swallow NotFoundException as a substitute. 
+ if (!def.isRuntimeManaged()) { + try { + final CtClass staleCtClass = classPool.get(METER_CLASS_PACKAGE + formatName(metricsName)); + staleCtClass.detach(); + } catch (final NotFoundException e) { + log.warn("removeMetric({}): static-path metric was expected in default pool but " + + "was not present; shape-break re-registration may fail the pre-check. " + + "This indicates the metric was registered through an unexpected path.", + metricsName); + } + } + return true; + } + + /** + * Reversible pause of streaming dispatch for a set of metric names. Used by the + * runtime-rule Suspend phase: the receiving OAP node instructs peers (and itself, for + * local consistency) to stop serving a bundle while the main node applies the structural + * DDL + verify. Peers resume via {@link #resumeDispatch} once the row is upserted with + * the new content. + * + *
<p>
Delegates to {@link MetricsStreamProcessor#suspendDispatch(Class)} per metric. The + * measure, persistent workers, and storage-model registration stay live — only the entry + * dispatch is parked. Idempotent: names not registered or already suspended are skipped. + * + * @return count of metrics that actually transitioned into the suspended state. + */ + public synchronized int suspendDispatch(final Set metricsNames) { + if (metricsNames == null || metricsNames.isEmpty()) { + return 0; + } + int suspended = 0; + final MetricsStreamProcessor processor = MetricsStreamProcessor.getInstance(); + for (final String name : metricsNames) { + final MeterDefinition def = meterPrototypes.get(name); + if (def == null) { + continue; + } + final Class prototypeClass = def.getMeterPrototype().getClass(); + try { + if (processor.suspendDispatch((Class) prototypeClass)) { + suspended++; + } + } catch (final Throwable t) { + log.warn("suspendDispatch failed for metric {}; continuing with the rest.", name, t); + } + } + return suspended; + } + + /** + * Inverse of {@link #suspendDispatch}: re-installs the parked entry workers so samples + * dispatch again. Idempotent; names not currently parked are skipped. + * + * @return count of metrics that actually transitioned back to live dispatch. 
+ */ + public synchronized int resumeDispatch(final Set metricsNames) { + if (metricsNames == null || metricsNames.isEmpty()) { + return 0; + } + int resumed = 0; + final MetricsStreamProcessor processor = MetricsStreamProcessor.getInstance(); + for (final String name : metricsNames) { + final MeterDefinition def = meterPrototypes.get(name); + if (def == null) { + continue; + } + final Class prototypeClass = def.getMeterPrototype().getClass(); + try { + if (processor.resumeDispatch((Class) prototypeClass)) { + resumed++; + } + } catch (final Throwable t) { + log.warn("resumeDispatch failed for metric {}; continuing with the rest.", name, t); + } + } + return resumed; + } + + private void createInternal(final String metricsName, + final String functionName, + final ScopeType type, + final Class dataType, + final ClassPool pool, + final Class classLoaderNeighbor, + final StorageManipulationOpt opt) throws IllegalArgumentException { + /* * Create a new meter class dynamically. */ final Class meterFunction = functionRegister.get(functionName); @@ -179,7 +617,7 @@ public synchronized void create(String metricsName, final CtClass parentClass; try { - parentClass = classPool.get(meterFunction.getCanonicalName()); + parentClass = pool.get(meterFunction.getCanonicalName()); if (!Metrics.class.isAssignableFrom(meterFunction)) { throw new IllegalArgumentException( "Function " + functionName + " doesn't inherit from Metrics."); @@ -189,11 +627,28 @@ public synchronized void create(String metricsName, } final String className = formatName(metricsName); - /** + /* + * Prototype-first short-circuit for runtime FILTER_ONLY re-apply — see the same + * guard in the default-pool {@code createInternal} above for the full rationale. The + * pool-based check below can't detect an existing registration because every + * runtime apply hands in a fresh pool; without this, each FILTER_ONLY iteration + * leaks a new Metrics class + a new worker chain + a new classloader. 
+ */ + final MeterDefinition existingDefinition = meterPrototypes.get(metricsName); + if (existingDefinition != null + && existingDefinition.getScopeType() == type + && existingDefinition.getDataType().equals(dataType) + && existingDefinition.getMeterPrototype().getClass().getSuperclass() == meterFunction) { + log.debug("Metric {} already registered with matching shape; reusing existing " + + "Metrics class + workers (FILTER_ONLY re-apply path).", metricsName); + return; + } + + /* * Check whether the metrics class is already defined or not */ try { - CtClass existingMetric = classPool.get(METER_CLASS_PACKAGE + className); + CtClass existingMetric = pool.get(METER_CLASS_PACKAGE + className); if (existingMetric.getSuperclass() != parentClass || type != meterPrototypes.get(metricsName) .getScopeType()) { throw new IllegalArgumentException( @@ -202,11 +657,12 @@ public synchronized void create(String metricsName, log.info("Metric {} is already defined, so skip the metric creation.", metricsName); return; } catch (NotFoundException e) { + // proceed — class not yet defined in this pool } - CtClass metricsClass = classPool.makeClass(METER_CLASS_PACKAGE + className, parentClass); + CtClass metricsClass = pool.makeClass(METER_CLASS_PACKAGE + className, parentClass); - /** + /* * Create empty construct */ try { @@ -218,7 +674,7 @@ public synchronized void create(String metricsName, throw new UnexpectedException(e.getMessage(), e); } - /** + /* * Generate `AcceptableValue createNew()` method. 
*/ try { @@ -238,12 +694,12 @@ public synchronized void create(String metricsName, Class targetClass; try { if (SystemUtils.isJavaVersionAtMost(JavaVersion.JAVA_1_8)) { - targetClass = metricsClass.toClass(MeterSystem.class.getClassLoader(), null); + targetClass = metricsClass.toClass(classLoaderNeighbor.getClassLoader(), null); } else { - targetClass = metricsClass.toClass(MeterClassPackageHolder.class); + targetClass = metricsClass.toClass(classLoaderNeighbor); } AcceptableValue prototype = (AcceptableValue) targetClass.newInstance(); - meterPrototypes.put(metricsName, new MeterDefinition(type, prototype, dataType)); + meterPrototypes.put(metricsName, new MeterDefinition(type, prototype, dataType, false)); log.debug("Generate metrics class, " + metricsClass.getName()); @@ -251,10 +707,20 @@ public synchronized void create(String metricsName, manager, new StreamDefinition( metricsName, type.getScopeId(), prototype.builder(), MetricsStreamProcessor.class), - targetClass + targetClass, + opt ); + // Same shape-mismatch guard as the static-catalog createInternal path. Under + // full-install mode the opt shouldn't ever carry a SKIPPED_SHAPE_MISMATCH + // outcome (the installer reshapes), but defensively roll the prototype back if + // it does — a dispatch lookup against a prototype whose workers never came up + // produces confusing silent-drop failures later. 
+ if (opt.hasShapeMismatch()) { + meterPrototypes.remove(metricsName); + } } catch (CannotCompileException | IllegalAccessException | InstantiationException | StorageException e) { log.error("Can't compile/load/init " + className + ".", e); + meterPrototypes.remove(metricsName); throw new UnexpectedException(e.getMessage(), e); } } @@ -308,5 +774,15 @@ private static class MeterDefinition { private final ScopeType scopeType; private final AcceptableValue meterPrototype; private final Class dataType; + /** + * {@code true} when the generated {@code Metrics} class lives in a caller-supplied + * per-file {@code ClassPool}/{@code ClassLoader} (runtime-rule hot-update path — + * {@link #create(String, String, ScopeType, Class, ClassPool, Class, StorageManipulationOpt)}). + * {@code false} when the class lives in the instance {@link #classPool} (static boot + * path). Read by {@link #removeMetric} to decide whether a default-pool {@code CtClass} + * detach is required (static path) or a no-op (runtime path — the per-file pool + * goes away with the bundle, taking the CtClass with it). 
+ */ + private final boolean runtimeManaged; } } diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementStreamProcessor.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementStreamProcessor.java index d13a228ec49b..992466530e7a 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementStreamProcessor.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementStreamProcessor.java @@ -31,7 +31,8 @@ import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.core.storage.annotation.Storage; import org.apache.skywalking.oap.server.core.storage.model.Model; -import org.apache.skywalking.oap.server.core.storage.model.ModelCreator; +import org.apache.skywalking.oap.server.core.storage.model.ModelRegistry; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.core.storage.type.StorageBuilder; import org.apache.skywalking.oap.server.library.module.ModuleDefineHolder; @@ -76,9 +77,11 @@ public void create(final ModuleDefineHolder moduleDefineHolder, final Stream str .getSimpleName() + " none stream record DAO failure.", e); } - ModelCreator modelSetter = moduleDefineHolder.find(CoreModule.NAME).provider().getService(ModelCreator.class); + ModelRegistry modelSetter = moduleDefineHolder.find(CoreModule.NAME).provider().getService(ModelRegistry.class); // Management stream doesn't read data from database during the persistent process. Keep the timeRelativeID == false always. 
- Model model = modelSetter.add(streamClass, stream.scopeId(), new Storage(stream.name(), false, DownSampling.None)); + Model model = modelSetter.add(streamClass, stream.scopeId(), + new Storage(stream.name(), false, DownSampling.None), + StorageManipulationOpt.createIfAbsent()); final ManagementPersistentWorker persistentWorker = new ManagementPersistentWorker(moduleDefineHolder, model, managementDAO); workers.put(streamClass, persistentWorker); diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsAggregateWorker.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsAggregateWorker.java index 4af51a9cd187..f95cbe45424e 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsAggregateWorker.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsAggregateWorker.java @@ -21,6 +21,7 @@ import java.util.LinkedHashMap; import java.util.List; import java.util.Map; +import lombok.Getter; import lombok.extern.slf4j.Slf4j; import org.apache.skywalking.oap.server.core.analysis.data.MergableBufferedData; import org.apache.skywalking.oap.server.core.analysis.meter.Meter; @@ -76,6 +77,11 @@ public class MetricsAggregateWorker extends AbstractWorker { private final MergableBufferedData mergeDataCache; private final CounterMetrics abandonCounter; private final CounterMetrics aggregationCounter; + private final Class metricsClass; + /** Stream/model name — exposed to {@link MetricsStreamProcessor#removeMetric} so it can + * match this aggregate worker to its down-sampling persistent workers. 
*/ + @Getter + private final String modelName; private long lastSendTime = 0; MetricsAggregateWorker(final ModuleDefineHolder moduleDefineHolder, @@ -87,6 +93,8 @@ public class MetricsAggregateWorker extends AbstractWorker { this.nextWorker = nextWorker; this.mergeDataCache = new MergableBufferedData<>(); this.l1FlushPeriod = l1FlushPeriod; + this.metricsClass = metricsClass; + this.modelName = modelName; this.l1Queue = BatchQueueManager.getOrCreate(L1_QUEUE_NAME, L1_QUEUE_CONFIG); final MetricsCreator metricsCreator = moduleDefineHolder.find(TelemetryModule.NAME) @@ -175,6 +183,27 @@ private void flush() { } } + /** + * Drain and deregister this worker's L1 handler for hot-remove (MAL/LAL runtime rule + * removal). Flushes any in-flight merged metrics from {@link #mergeDataCache} to the next + * stage unconditionally (bypassing the {@link #l1FlushPeriod} guard), then unregisters this + * metric class from the shared L1 queue. + * + *

<p>After this call, any samples still buffered in the L1 queue partitions for this class + * will hit the null-handler path and be dropped (logged once by {@code BatchQueue}). Callers + * must have already removed the route from {@code MetricsStreamProcessor.entryWorkers} so no + * new samples arrive here during the window. + * + *

Not safe to call concurrently with other {@code addHandler}/{@code removeHandler} on the + * same L1 queue — the runtime-rule module serializes via its per-file lock. + */ + public void drainAndDeregister() { + // Unconditional flush: ignore lastSendTime so no merged data is left behind. + mergeDataCache.read().forEach(nextWorker::in); + lastSendTime = System.currentTimeMillis(); + l1Queue.removeHandler(metricsClass); + } + private class L1Handler implements HandlerConsumer { @Override public void consume(final List data) { diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentMinWorker.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentMinWorker.java index 1e0725adb722..5e68e385a124 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentMinWorker.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentMinWorker.java @@ -158,6 +158,26 @@ private void updateQueueUsageGauges() { } } + /** + * Deregister this worker's L2 handler for runtime-rule hot-remove. Does not flush pending + * data — callers must invoke {@link #drainPendingRequests()} and submit the returned requests + * via {@code IBatchDAO.flush} first if they want the L2 cache drained to storage. + * + *

<p>Any samples still buffered in the L2 queue partition for this metric class after this + * call will hit the null-handler path and be dropped (logged once). Callers must have already + * unregistered from {@code MetricsStreamProcessor.entryWorkers} and drained the L1 side + * ({@link MetricsAggregateWorker#drainAndDeregister()}) so no new samples enter the L2 queue. + * + *

Not safe to call concurrently with {@code addHandler}/{@code removeHandler} on the same + * L2 queue — the runtime-rule module serializes via its per-file lock. + * + * @param metricsClass the metrics class whose handler should be removed; must match the class + * this worker was constructed with. + */ + public void deregisterFromL2Queue(final Class metricsClass) { + l2Queue.removeHandler(metricsClass); + } + private class L2Handler implements HandlerConsumer { @Override public void consume(List data) { diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java index 07d0a6e40fb7..1ef03b296de4 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsPersistentWorker.java @@ -189,7 +189,16 @@ public List buildBatchRequests() { if (persistentCounter++ % persistentMod != 0) { return Collections.emptyList(); } + return buildBatchRequestsUnconditionally(); + } + /** + * Unconditional variant of {@link #buildBatchRequests()} that bypasses the + * {@link #persistentMod} guard. Used by {@link #drainPendingRequests()} for runtime rule + * hot-remove, where any pending data must be submitted to storage before the worker is + * dropped — regardless of whether this tick would otherwise be a skip-tick. + */ + private List buildBatchRequestsUnconditionally() { final List lastCollection = getCache().read(); long start = System.currentTimeMillis(); @@ -349,6 +358,24 @@ public void endOfRound() { sessionCache.removeExpired(); } + /** + * Return any pending metrics in the in-memory cache as {@link PrepareRequest} batch items, + * without the {@link #persistentCounter} skip-tick guard. 
Also calls {@code sessionCache.removeExpired()} to prune the session cache + * before the caller drops this worker. + * + *

Used for runtime-rule hot-remove: callers execute the returned requests synchronously + * via {@code IBatchDAO.flush} and then retire the worker. Any samples still in-flight through + * the L2 queue partition for this metric class after deregistration hit the null-handler path + * and are dropped by design (structural window). + * + * @return the list of prepare requests, possibly empty + */ + public List drainPendingRequests() { + final List requests = buildBatchRequestsUnconditionally(); + sessionCache.removeExpired(); + return requests; + } + /** * Check the metrics whether in the cache, and whether the worker should go further to load from database. * diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessor.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessor.java index aa517c537198..d147de336c76 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessor.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessor.java @@ -20,6 +20,7 @@ import lombok.Getter; import lombok.Setter; +import lombok.extern.slf4j.Slf4j; import org.apache.skywalking.oap.server.core.CoreModule; import org.apache.skywalking.oap.server.core.UnexpectedException; import org.apache.skywalking.oap.server.core.analysis.DownSampling; @@ -30,6 +31,7 @@ import org.apache.skywalking.oap.server.core.analysis.metrics.Metrics; import org.apache.skywalking.oap.server.core.config.DownSamplingConfigService; import org.apache.skywalking.oap.server.core.query.TTLStatusQuery; +import org.apache.skywalking.oap.server.core.storage.IBatchDAO; import org.apache.skywalking.oap.server.core.storage.IMetricsDAO; import org.apache.skywalking.oap.server.core.storage.StorageBuilderFactory; import 
org.apache.skywalking.oap.server.core.storage.StorageDAO; @@ -37,16 +39,20 @@ import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.core.storage.annotation.Storage; import org.apache.skywalking.oap.server.core.storage.model.Model; -import org.apache.skywalking.oap.server.core.storage.model.ModelCreator; +import org.apache.skywalking.oap.server.core.storage.model.ModelRegistry; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.core.storage.type.StorageBuilder; import org.apache.skywalking.oap.server.core.worker.IWorkerInstanceSetter; +import org.apache.skywalking.oap.server.library.client.request.PrepareRequest; import org.apache.skywalking.oap.server.library.module.ModuleDefineHolder; import java.lang.reflect.InvocationTargetException; import java.util.ArrayList; -import java.util.HashMap; import java.util.List; import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.CopyOnWriteArrayList; +import java.util.concurrent.atomic.AtomicLong; /** * MetricsStreamProcessor represents the entrance and creator of the metrics streaming aggregation work flow. @@ -55,6 +61,7 @@ * * {@link #create(ModuleDefineHolder, Stream, Class)} creates the workers and work flow for every metrics. */ +@Slf4j public class MetricsStreamProcessor implements StreamProcessor { /** * Singleton instance. @@ -62,15 +69,57 @@ public class MetricsStreamProcessor implements StreamProcessor { private static final MetricsStreamProcessor PROCESSOR = new MetricsStreamProcessor(); /** - * Worker table hosts all entrance workers. + * Worker table hosts all entrance workers. Lock-free concurrent reads from {@link #in(Metrics)} + * — hot path. Writes are serialized by callers (startup-once via {@link MeterSystem}'s + * {@code synchronized create}, runtime rule hot-update via its per-file lock). 
The switch from + * HashMap to ConcurrentHashMap closes a latent race that was previously safe only because all + * writes happened at boot. */ - private Map, MetricsAggregateWorker> entryWorkers = new HashMap<>(); + private final Map, MetricsAggregateWorker> entryWorkers = new ConcurrentHashMap<>(); /** - * Worker table hosts all persistent workers. + * Worker table hosts all persistent workers. CopyOnWriteArrayList is mandatory (not + * optional): {@code PersistenceTimer.extractDataAndSave} calls + * {@code workers.addAll(getPersistentWorkers())} which iterates the source list under its + * own synchronization — with a plain ArrayList and concurrent runtime-rule add/remove, that + * iteration could CME or drop entries. CoW makes the snapshot iterator-safe for readers and + * accepts the write amplification, which only occurs during rule mutation (rare). */ @Getter - private List persistentWorkers = new ArrayList<>(); + private final List persistentWorkers = new CopyOnWriteArrayList<>(); + + /** + * Counts samples arriving at {@link #in(Metrics)} for a class that is not registered in + * {@link #entryWorkers}. Bumped during structural rollout windows (new metric not yet known + * on this node) and during hot-remove drain (handler gone but samples still in flight). The + * counter is read by runtime-rule observability; emission to the telemetry pipeline is wired + * by the runtime-rule module (avoiding a hard dependency on TelemetryModule here). + */ + @Getter + private final AtomicLong unroutableSampleCount = new AtomicLong(); + + /** + * Tracks which metric classes have already produced an unroutable warning, so the log line + * fires at most once per class per process lifetime. Allows operators to see the transition + * without flooding the log during extended rollout windows. 
+ */ + private final Map<Class<? extends Metrics>, Boolean> warnedUnroutableClasses = new ConcurrentHashMap<>(); + + /** + * Entry workers that are temporarily out of {@link #entryWorkers} but whose underlying + * persistent workers, handler state, and storage model are all still live. A metric class + * lands here during the Suspend phase of a runtime-rule structural update. Samples arriving + * during Suspend hit the null-worker path in {@link #in(Metrics)} and their drops accumulate + * on {@link #unroutableSampleCount}, which matches the design contract that samples for the + * bundle in flux are dropped for the duration. The entry is restored to {@link #entryWorkers} + * on the matching {@link #resumeDispatch(Class)}, and the pre-suspend worker resumes + * processing with its merge buffer / lastSendTime intact. + * + *

<p>Distinct from {@link #removeMetric} which is destructive: Suspend keeps the measure and + * class alive so a short-lived pause (seconds to a minute) is reversible without repeating + * DDL or losing persistent-worker state. + */ + private final Map<Class<? extends Metrics>, MetricsAggregateWorker> suspendedWorkers = new ConcurrentHashMap<>(); /** * The period of L1 aggregation flush. Unit is ms. @@ -93,6 +142,16 @@ public void in(Metrics metrics) { MetricsAggregateWorker worker = entryWorkers.get(metrics.getClass()); if (worker != null) { worker.in(metrics); + return; + } + // Unknown class: either a structural rollout window (new metric not yet registered on + // this node), a hot-remove window (handler gone, sample still in flight from a peer), + // or a genuine bug (sample for a class never registered). Bump the counter and warn + // at most once per class so operators can see rollout transitions without log flood. + unroutableSampleCount.incrementAndGet(); + if (warnedUnroutableClasses.putIfAbsent(metrics.getClass(), Boolean.TRUE) == null) { + log.warn("Dropped sample for unregistered metric class {}; further drops for this " + + "class will be silent until it is registered again.", metrics.getClass().getName()); } } @@ -118,10 +177,33 @@ public void create(ModuleDefineHolder moduleDefineHolder, this.create(moduleDefineHolder, stream, meterClass, MetricStreamKind.MAL); } + /** + * Opt-aware variant invoked from the runtime-rule MAL path. Peer nodes pass + * {@link StorageManipulationOpt#localCacheOnly()} so every downstream {@code ModelRegistry.add} + * records per-resource outcomes and suppresses server-side install. Main-node on-demand + * callers (REST {@code /addOrUpdate}) pass {@link StorageManipulationOpt#fullInstall()}. + * Startup-path callers (stream registration for static rules) pass + * {@link StorageManipulationOpt#createIfAbsent()} so boot never reshapes the backend.
+ */ + public void create(ModuleDefineHolder moduleDefineHolder, + StreamDefinition stream, + Class meterClass, + StorageManipulationOpt opt) throws StorageException { + this.create(moduleDefineHolder, stream, meterClass, MetricStreamKind.MAL, opt); + } + private void create(ModuleDefineHolder moduleDefineHolder, StreamDefinition stream, Class metricsClass, MetricStreamKind kind) throws StorageException { + this.create(moduleDefineHolder, stream, metricsClass, kind, StorageManipulationOpt.createIfAbsent()); + } + + private void create(ModuleDefineHolder moduleDefineHolder, + StreamDefinition stream, + Class metricsClass, + MetricStreamKind kind, + StorageManipulationOpt opt) throws StorageException { final StorageBuilderFactory storageBuilderFactory = moduleDefineHolder.find(StorageModule.NAME) .provider() .getService(StorageBuilderFactory.class); @@ -137,7 +219,7 @@ private void create(ModuleDefineHolder moduleDefineHolder, throw new UnexpectedException("Create " + stream.getBuilder().getSimpleName() + " metrics DAO failure.", e); } - ModelCreator modelSetter = moduleDefineHolder.find(CoreModule.NAME).provider().getService(ModelCreator.class); + ModelRegistry modelSetter = moduleDefineHolder.find(CoreModule.NAME).provider().getService(ModelRegistry.class); DownSamplingConfigService configService = moduleDefineHolder.find(CoreModule.NAME) .provider() .getService(DownSamplingConfigService.class); @@ -165,14 +247,18 @@ private void create(ModuleDefineHolder moduleDefineHolder, if (supportDownSampling) { if (configService.shouldToHour()) { Model model = modelSetter.add( - metricsClass, stream.getScopeId(), new Storage(stream.getName(), timeRelativeID, DownSampling.Hour) + metricsClass, stream.getScopeId(), + new Storage(stream.getName(), timeRelativeID, DownSampling.Hour), + opt ); int hourTTL = ttlStatusQuery.getMetricsTTL(model); hourPersistentWorker = downSamplingWorker(moduleDefineHolder, metricsDAO, model, supportUpdate, kind, hourTTL); } if 
(configService.shouldToDay()) { Model model = modelSetter.add( - metricsClass, stream.getScopeId(), new Storage(stream.getName(), timeRelativeID, DownSampling.Day) + metricsClass, stream.getScopeId(), + new Storage(stream.getName(), timeRelativeID, DownSampling.Day), + opt ); int dayTTL = ttlStatusQuery.getMetricsTTL(model); dayPersistentWorker = downSamplingWorker(moduleDefineHolder, metricsDAO, model, supportUpdate, kind, dayTTL); @@ -183,8 +269,27 @@ private void create(ModuleDefineHolder moduleDefineHolder, } Model model = modelSetter.add( - metricsClass, stream.getScopeId(), new Storage(stream.getName(), timeRelativeID, DownSampling.Minute) + metricsClass, stream.getScopeId(), + new Storage(stream.getName(), timeRelativeID, DownSampling.Minute), + opt ); + + // Shape-mismatch gate — boot registers under create-if-absent, which records + // SKIPPED_SHAPE_MISMATCH outcomes when the backend already holds a shape that + // differs from what the model declares. We must NOT register workers against a + // backend the installer refused to reshape: ingest would silently write against an + // inconsistent schema (or land rows the query side can't decode). Boot continues + // with the metric disabled — operator reconciles explicitly via the runtime-rule + // on-demand endpoint (the only workflow that may change backend schema). + if (opt.hasShapeMismatch()) { + log.error("Shape mismatch for metric {} — installer refused to reshape the " + + "backend; skipping worker registration so ingest won't write against an " + + "inconsistent schema. Operator action: reshape via POST /runtime/rule/addOrUpdate " + + "or align the rule shape with the backend. 
First mismatch: {}", + metricsClass.getName(), opt.firstShapeMismatch()); + return; + } + int minuteTTL = ttlStatusQuery.getMetricsTTL(model); MetricsPersistentWorker minutePersistentWorker = minutePersistentWorker( moduleDefineHolder, metricsDAO, model, transWorker, supportUpdate, kind, metricsClass, minuteTTL); @@ -235,4 +340,204 @@ private MetricsPersistentWorker downSamplingWorker(ModuleDefineHolder moduleDefi return persistentWorker; } + + /** + * Remove the streaming-calculation chain for a runtime-removed metric class. Symmetric to + * {@link #create(ModuleDefineHolder, Stream, Class)} / {@link #create(ModuleDefineHolder, + * StreamDefinition, Class)}: drops the L1 entry worker, drains and deregisters the L2 min + * worker, removes all down-sampling persistent workers that share the same model name. + * + *

<p>Order matters and is load-bearing (draining before deregistering prevents data loss):
+ * <ol>
+ *   <li>Remove from {@link #entryWorkers} first; this stops new {@link #in(Metrics)} routes.</li>
+ *   <li>Drain L1: flush {@code MergableBufferedData} downstream to L2, then deregister the
+ *       L1 queue handler.</li>
+ *   <li>Drain L2: unconditionally build batch requests from the min worker's cache and
+ *       submit via {@code IBatchDAO.flush}. Wait for the future to complete before
+ *       deregistering the L2 queue handler, so no pending-flush data is orphaned.</li>
+ *   <li>Remove all {@link MetricsPersistentWorker} entries (minute + any down-sampling
+ *       variants) for this model name from {@link #persistentWorkers}.</li>
+ * </ol>
+ *
+ * <p>
Any samples in-flight through the shared L1 / L2 queue partitions for this class after + * the corresponding handler is removed will hit the null-handler path and be dropped (bumped + * on {@code BatchQueue}'s dropped-type warn-once counter). This is the accepted cost of + * moving metric shape atomically during a structural apply. + * + *

Not safe to call concurrently with {@link #create(ModuleDefineHolder, Stream, Class)} or + * another {@link #removeMetric} for the same class — the runtime-rule module serializes via + * its per-file lock; startup registrations are single-threaded. Safe to call concurrently + * with {@link #in(Metrics)} for unrelated metric classes. + * + * @param moduleDefineHolder pointer of the module define, used to obtain {@code IBatchDAO} + * for synchronous L2 drain submission + * @param metricsClass the metric class to deregister + * @return {@code true} if an entry worker existed and was removed, {@code false} if no such + * metric class was registered (idempotent, caller can ignore) + */ + public boolean removeMetric(final ModuleDefineHolder moduleDefineHolder, + final Class metricsClass) { + // removeMetric supersedes any pending Suspend. If the class is currently suspended + // (parked in suspendedWorkers, absent from entryWorkers), we still need to drain L1, + // flush L2, and deregister the queue handlers — a structural removal has to reach + // the same end state regardless of whether dispatch was paused at the time. Pull the + // worker out of whichever map holds it and feed the same drain path below. + MetricsAggregateWorker aggregateWorker = entryWorkers.remove(metricsClass); + if (aggregateWorker == null) { + aggregateWorker = suspendedWorkers.remove(metricsClass); + } else { + // Not suspended at removal time, but belt-and-suspenders: clear any stale + // suspended entry so post-remove resumeDispatch doesn't resurrect anything. + suspendedWorkers.remove(metricsClass); + } + if (aggregateWorker == null) { + return false; + } + // Clear the dropped-sample memo so a re-register of the same class after a remove-add + // cycle gets a fresh warn on any accidental samples arriving during its own rollout. 
+ warnedUnroutableClasses.remove(metricsClass); + + // L1 drain + deregister: safe because no new samples can arrive here — the entryWorkers + // entry was atomic-removed above. + try { + aggregateWorker.drainAndDeregister(); + } catch (final Throwable t) { + log.error("L1 drain failed for metric class {}; proceeding with L2 drain anyway", + metricsClass.getName(), t); + } + + // Find every persistent worker that belongs to this metric's model name. Downsampling + // variants share the same model name but different DownSampling enum values. + // MetricsPersistentMinWorker is the only subclass that registers on the L2 queue, so + // deregistration is targeted only at min workers. + final String modelName = aggregateWorker.getModelName(); + final List victims = new ArrayList<>(); + for (final MetricsPersistentWorker w : persistentWorkers) { + if (modelName != null && modelName.equals(w.getModel().getName())) { + victims.add(w); + } + } + + // L2 drain: pull all pending requests from every victim's cache and submit synchronously. + final IBatchDAO batchDAO; + try { + batchDAO = moduleDefineHolder.find(StorageModule.NAME).provider().getService(IBatchDAO.class); + } catch (final Throwable t) { + log.error("Cannot resolve IBatchDAO for metric class {} drain; pending L2 data will " + + "be orphaned (accepted in structural window).", metricsClass.getName(), t); + // Proceed with worker removal anyway — do not block the hot-remove on storage resolution. 
+ finalizeRemoval(moduleDefineHolder, metricsClass, modelName, victims); + return true; + } + + final List pending = new ArrayList<>(); + for (final MetricsPersistentWorker w : victims) { + try { + pending.addAll(w.drainPendingRequests()); + } catch (final Throwable t) { + log.error("L2 drain collect failed for model {} on worker {}; continuing", + modelName, w.getClass().getSimpleName(), t); + } + } + if (!pending.isEmpty()) { + try { + batchDAO.flush(pending).join(); + } catch (final Throwable t) { + log.error("L2 flush failed for metric class {}; {} pending requests orphaned " + + "(accepted structural-window loss).", metricsClass.getName(), pending.size(), t); + } + } + + finalizeRemoval(moduleDefineHolder, metricsClass, modelName, victims); + return true; + } + + /** + * Reversible pause of streaming dispatch for a single metric class. Used by the runtime-rule + * Suspend phase — the peer receives a cluster RPC telling it "stop serving this bundle while + * the main node moves its schema" and calls through to this primitive. + * + *

<p>Semantics:
+ * <ul>
+ *   <li>The entry worker is removed from {@link #entryWorkers} and parked in
+ *       {@link #suspendedWorkers}. Samples arriving at {@link #in(Metrics)} for this class
+ *       hit the null-worker path and increment {@link #unroutableSampleCount}, which is
+ *       the accepted cost of the structural window.</li>
+ *   <li>Persistent workers (L2) stay registered. Already-buffered L2 data continues flushing
+ *       to storage on the normal timer, so data that has already reached L2 is not lost.</li>
+ *   <li>{@code StorageModels} registration is untouched: the measure stays, no DDL fires.</li>
+ * </ul>
+ *
+ * <p>
Reverse via {@link #resumeDispatch(Class)} — the parked worker is put back atomically, + * retaining its {@code MergableBufferedData} state and {@code lastSendTime} across the pause. + * Idempotent: calling suspend on an already-suspended class returns {@code false}. + * + * @return {@code true} if a live entry worker was parked, {@code false} if no entry worker + * was present (already suspended, or not registered at all). + */ + public boolean suspendDispatch(final Class metricsClass) { + final MetricsAggregateWorker worker = entryWorkers.remove(metricsClass); + if (worker == null) { + return false; + } + suspendedWorkers.put(metricsClass, worker); + return true; + } + + /** + * Inverse of {@link #suspendDispatch(Class)}: re-install the parked entry worker into + * {@link #entryWorkers} so {@link #in(Metrics)} dispatches to it again. Idempotent: returns + * {@code false} if nothing was parked for this class. Called by the runtime-rule apply path + * on the SUSPENDED → RUNNING transition and by the Suspend-aborted rollback path when a + * main-node verify failure rolls the peers back to the pre-suspend handler set. + */ + public boolean resumeDispatch(final Class metricsClass) { + final MetricsAggregateWorker worker = suspendedWorkers.remove(metricsClass); + if (worker == null) { + return false; + } + entryWorkers.put(metricsClass, worker); + return true; + } + + /** Test/observability-only: whether a metric class is currently parked. */ + public boolean isDispatchSuspended(final Class metricsClass) { + return suspendedWorkers.containsKey(metricsClass); + } + + private void finalizeRemoval(final ModuleDefineHolder moduleDefineHolder, + final Class metricsClass, + final String modelName, + final List victims) { + // Deregister L2 handler for the min worker, then drop all victims from the shared list. 
+ for (final MetricsPersistentWorker w : victims) { + if (w instanceof MetricsPersistentMinWorker) { + try { + ((MetricsPersistentMinWorker) w).deregisterFromL2Queue(metricsClass); + } catch (final Throwable t) { + log.error("L2 queue deregister failed for metric class {}", + metricsClass.getName(), t); + } + } + } + persistentWorkers.removeAll(victims); + + // Drop the {modelName}_rec entry from WorkerInstancesService — the counterpart of the + // put(...) call in create(). Without this the receiver-name slot stays occupied, and + // a subsequent re-register of the same metric name (shape-break remove+apply, + // operator recovery push via /addOrUpdate?force=true, reconciler STRUCTURAL path) + // fails with "Duplicate worker + // name". Idempotent — the remove silently ignores unknown keys. + if (modelName != null) { + try { + final IWorkerInstanceSetter workerInstanceSetter = moduleDefineHolder + .find(CoreModule.NAME).provider().getService(IWorkerInstanceSetter.class); + workerInstanceSetter.remove(modelName + "_rec"); + } catch (final Throwable t) { + log.error("Failed to deregister worker-instance slot {}_rec for metric {}; " + + "a subsequent re-register of the same name will fail with \"Duplicate " + + "worker name\".", modelName, metricsClass.getName(), t); + } + } + } } diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/NoneStreamProcessor.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/NoneStreamProcessor.java index 0d59ff3b566c..f0f3fa2efc1f 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/NoneStreamProcessor.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/NoneStreamProcessor.java @@ -34,7 +34,8 @@ import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.core.storage.annotation.Storage; import 
org.apache.skywalking.oap.server.core.storage.model.Model; -import org.apache.skywalking.oap.server.core.storage.model.ModelCreator; +import org.apache.skywalking.oap.server.core.storage.model.ModelRegistry; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.core.storage.type.StorageBuilder; import org.apache.skywalking.oap.server.library.module.ModuleDefineHolder; @@ -75,9 +76,11 @@ public void create(ModuleDefineHolder moduleDefineHolder, Stream stream, ClassBackground: on JDK 9+ the deprecated {@code toClass(ClassLoader, ProtectionDomain)} + * reflectively calls {@code java.lang.ClassLoader.defineClass} via {@code setAccessible}, + * which the strong-encapsulation rule blocks at runtime with + * {@code InaccessibleObjectException} unless the operator explicitly opens + * {@code java.base/java.lang} via {@code --add-opens}. Static MAL/LAL boot is unaffected + * because {@code MeterClassPackageHolder}'s package access works through Javassist's + * neighbor-class API on the default loader. Runtime-rule's per-file loader has no such + * neighbor at the first {@code toClass} call; the only pre-loaded classes are inherited + * via parent delegation and so live in the parent's loader, not the rule loader. + * + *

This contract sidesteps the issue entirely: a class loader that implements it + * publishes a {@code defineClass} method as part of its API, and the runtime-rule + * generator path calls it directly with {@code CtClass.toBytecode()} bytes — no + * reflection, no deprecated overload, no {@code --add-opens} requirement on the OAP + * JVM. Production loaders (the static path) keep working through their existing + * {@code toClass(Class)} neighbor-based path. + * + *

The interface is intentionally minimal — a single bytecode-defining method that + * mirrors what {@link ClassLoader#defineClass(String, byte[], int, int)} does. Lifecycle + * (parent delegation, URL search) stays the implementor's concern. + */ +public interface BytecodeClassDefiner { + + /** + * Define {@code bytecode} as a {@link Class} in this loader's namespace. + * + * @param className fully-qualified binary name, must match the class's + * {@code this_class} attribute in the bytecode. + * @param bytecode a complete classfile (e.g., from {@code CtClass.toBytecode()}). + * @return the resolved {@link Class} object loaded by this defining loader. + */ + Class defineClass(String className, byte[] bytecode); +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/Catalog.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/Catalog.java new file mode 100644 index 000000000000..3aa3ce9e0482 --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/Catalog.java @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.core.classloader; + +import lombok.Getter; + +/** + * Strongly-typed catalog identifier consumed by {@link DSLClassLoaderManager} so the manager's + * keys are comparable without string-typo risk. Each entry's {@link #wireName} matches the + * catalog string used everywhere else in the runtime-rule pipeline (REST surface, DAO row, + * loader-name prefix, {@code StaticRuleRegistry} key) so conversion at the boundary via + * {@link #of(String)} keeps the rest of the codebase on plain {@code String catalog}. + */ +public enum Catalog { + OTEL_RULES("otel-rules"), + LOG_MAL_RULES("log-mal-rules"), + TELEGRAF_RULES("telegraf-rules"), + LAL("lal"); + + @Getter + private final String wireName; + + Catalog(final String wireName) { + this.wireName = wireName; + } + + /** + * Resolve {@code wireName} to the matching enum. Throws {@link IllegalArgumentException} on + * an unknown catalog so callers fail fast rather than silently dropping the rule. + */ + public static Catalog of(final String wireName) { + for (final Catalog c : values()) { + if (c.wireName.equals(wireName)) { + return c; + } + } + throw new IllegalArgumentException("Unknown DSL catalog: " + wireName); + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/ClassLoaderGc.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/ClassLoaderGc.java new file mode 100644 index 000000000000..9937ce6ae37a --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/ClassLoaderGc.java @@ -0,0 +1,161 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.core.classloader; + +import java.lang.ref.PhantomReference; +import java.lang.ref.Reference; +import java.lang.ref.ReferenceQueue; +import java.util.Collection; +import java.util.Collections; +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.concurrent.atomic.AtomicLong; +import lombok.Getter; +import lombok.extern.slf4j.Slf4j; + +/** + * Tracks every retired {@link RuleClassLoader} through a {@link ReferenceQueue} of + * {@link PhantomReference}s so GC of the loader is observable rather than silent. + * + *

Dropping a bundle's strong reference to its loader is necessary but not sufficient for + * class unloading: any lingering reference (a handler left registered, a sample still sitting + * in a DataCarrier slot, a Javassist CtClass left attached in the default pool, an async task + * holding the class) will pin the loader and its classes indefinitely. Without observability, + * that kind of leak is invisible until heap pressure surfaces hours later. + * + *

 This graveyard is internal to {@link DSLClassLoaderManager}. The manager retires loaders + * here via {@code dropRuntime} (full teardown) and {@code retire} (engine-decided "displaced + * prior is dead"); a daemon sweeper thread the manager owns drains collected phantoms and + * WARNs on stale entries. No external caller touches this class; every consumer goes + * through the manager. + */ +@Slf4j +final class ClassLoaderGc { + + private final ReferenceQueue<RuleClassLoader> queue = new ReferenceQueue<>(); + private final Map<PhantomReference<RuleClassLoader>, Retired> pending = new ConcurrentHashMap<>(); + + @Getter + private final AtomicLong collectedTotal = new AtomicLong(); + + /** + * Register a loader as retired. The caller must drop the last strong reference it holds + * to {@code loader} immediately after this call; otherwise the phantom reference will + * never be enqueued and the pending entry will stay forever. + */ + void retire(final RuleClassLoader loader) { + if (loader == null) { + return; + } + final PhantomReference<RuleClassLoader> ref = new PhantomReference<>(loader, queue); + final Retired retired = new Retired( + loader.getKind(), loader.getCatalog(), loader.getRule(), loader.getContentHash(), + System.currentTimeMillis(), ref); + pending.put(ref, retired); + } + + /** + * Drain collected phantoms from the queue. Returns the entries the JVM confirmed as + * unreachable since the last sweep. Entries that remain in {@link #pending()} after this + * call are still suspected leaks. Called by the manager's internal sweeper thread. + */ + Collection<Retired> sweep() { + final java.util.ArrayList<Retired> drained = new java.util.ArrayList<>(); + Reference<? extends RuleClassLoader> r; + while ((r = queue.poll()) != null) { + final Retired done = pending.remove(r); + if (done != null) { + collectedTotal.incrementAndGet(); + log.info("rule loader collected: {}:{}/{} hash={} ttg={}ms", + done.kind() == DSLClassLoaderManager.Kind.STATIC ? 
"static" : "runtime-rule", + done.catalog().getWireName(), done.rule(), done.contentHashShort(), + System.currentTimeMillis() - done.retiredAtMs()); + drained.add(done); + } + } + return drained; + } + + /** + * @return snapshot of retired-but-not-yet-GC'd entries. A count that stays elevated + * across many sweeps indicates a leak; the manager's sweeper logs WARN per entry + * older than the configured threshold. + */ + Collection<Retired> pending() { + return Collections.unmodifiableCollection(pending.values()); + } + + /** Informational record surfaced to the sweeper. Identity is immutable; only the + * {@code warned} latch can flip, and only once per lifetime of this {@code Retired}. */ + static final class Retired { + private final DSLClassLoaderManager.Kind kind; + private final Catalog catalog; + private final String rule; + private final String contentHash; + private final long retiredAtMs; + @SuppressWarnings("unused") // strong reference retained so the phantom isn't itself GC'd + private final PhantomReference<RuleClassLoader> ref; + /** Single-shot latch flipped by {@link #markWarnedIfNotAlready()} so the stale-loader + * WARN fires once per retired loader rather than once per sweep per entry. */ + private final AtomicBoolean warned = new AtomicBoolean(false); + + Retired(final DSLClassLoaderManager.Kind kind, final Catalog catalog, final String rule, + final String contentHash, final long retiredAtMs, + final PhantomReference<RuleClassLoader> ref) { + this.kind = kind; + this.catalog = catalog; + this.rule = rule; + this.contentHash = contentHash; + this.retiredAtMs = retiredAtMs; + this.ref = ref; + } + + DSLClassLoaderManager.Kind kind() { + return kind; + } + + Catalog catalog() { + return catalog; + } + + String rule() { + return rule; + } + + String contentHash() { + return contentHash; + } + + long retiredAtMs() { + return retiredAtMs; + } + + String contentHashShort() { + if (contentHash == null || contentHash.length() <= 8) { + return contentHash == null ? 
"none" : contentHash; + } + return contentHash.substring(0, 8); + } + + boolean markWarnedIfNotAlready() { + return warned.compareAndSet(false, true); + } + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManager.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManager.java new file mode 100644 index 000000000000..78766183ee49 --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManager.java @@ -0,0 +1,280 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.core.classloader; + +import java.util.Collection; +import java.util.Map; +import java.util.Objects; +import java.util.Optional; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.Executors; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicBoolean; +import lombok.extern.slf4j.Slf4j; + +/** + * Process-wide owner of every per-file {@link RuleClassLoader} the OAP creates for a DSL rule + * (MAL, LAL, future OAL). 
Both static-bundled rules (after a runtime override is removed and + * the bundled rule must serve again) and runtime-pushed overrides go through this singleton, + * so there is one place to: mint loaders with a uniform name format, retire them when a newer + * version replaces them, and observe the JVM's collection of retired loaders so leaks surface + * as a WARN instead of silent heap growth. + * + *

Why a singleton, not a {@code Service}. Loaders are foundational JVM state — the + * meter / log compile paths reach for {@code DSLClassLoaderManager.INSTANCE} from places that + * have no {@code ModuleManager} in scope (the LAL / MAL applier static methods, test + * fixtures). Threading the manager through every constructor would be churn for zero benefit; + * lifetime is the JVM's, not a module's. + * + *

Static fall-over contract. Bundled rules at boot are loaded the same way as + * before — into the OAP main classloader, no per-file static loader, no entry in this map. + * Per-file static loaders only appear after a runtime override on a bundled rule is removed + * (via {@code /inactivate} or {@code /delete}): the runtime loader retires here, then the + * engine reloads the bundled YAML from {@code StaticRuleRegistry} and calls + * {@link #newBuilder} with {@link Kind#STATIC} to mint a fresh loader hosting the bundled + * compile output. So at any moment there is at most one per-file loader for a given key, and + * only when the key has actually fallen over. + * + *

 GC sweep is internal. A daemon-thread scheduled executor inside this manager + * polls the phantom queue and WARNs on stale entries; no external code calls a sweep API. The + * sweeper starts lazily on the first {@link #newBuilder} or {@link #retire} call so tests that + * touch {@link RuleClassLoader} directly never spawn the thread. + */ +@Slf4j +public final class DSLClassLoaderManager { + + /** Origin of a loader. {@code RUNTIME} loaders host operator-pushed runtime-rule overrides; + * {@code STATIC} loaders host bundled rules brought back into service after a runtime + * override on the same key was removed. There is at most one active loader per key; + * manager keys are {@code (catalog, rule)}, not {@code (catalog, rule, kind)}. */ + public enum Kind { + STATIC, RUNTIME + } + + /** Process-wide singleton. */ + public static final DSLClassLoaderManager INSTANCE = new DSLClassLoaderManager(); + + /** Sweep cadence for the internal phantom-queue drainer: 30 s. Comfortably past a typical + * young-gen pause cycle so most retired loaders surface as collected within a couple of + * ticks; short enough that a leaked loader's WARN doesn't wait minutes. */ + private static final long SWEEP_INTERVAL_SECONDS = 30L; + /** A retired loader still alive past this threshold is WARN'd as a suspected leak: 5 min. + * Long enough that a slow GC cycle doesn't cry wolf, short enough that an actual leak is + * surfaced before heap pressure triggers a full GC pause. */ + private static final long STALE_LOADER_WARN_THRESHOLD_MS = 5L * 60L * 1000L; + + private final Map<Key, RuleClassLoader> active = new ConcurrentHashMap<>(); + private final ClassLoaderGc graveyard = new ClassLoaderGc(); + private final AtomicBoolean sweeperStarted = new AtomicBoolean(false); + /** Captured on first {@link #newBuilder} call so DSL compile paths can mint loaders before + * the OAP main classloader is fully initialised in surprising boot orders. Volatile + * so the lazy init publishes safely to other threads. 
*/ + private volatile ClassLoader capturedParent; + + private DSLClassLoaderManager() { + } + + /** + * Mint a fresh loader for {@code (catalog, rule)}. The loader is returned for the caller + * to compile classes into; it is not yet registered as the active loader for this + * key. The caller promotes it via {@link #commit(RuleClassLoader)} after a successful + * compile / register. A failed compile simply discards the returned loader (no + * displacement, no false leak signal in {@link #pendingCount()}). + * + *

This split exists because the loader has to exist during compile (Javassist + * defines classes into it), but the manager's "active" view should reflect only loaders + * whose rule successfully reached the dispatcher. Otherwise a transient compile failure + * would replace the diagnostic record while the prior bundle is still actually serving. + */ + public RuleClassLoader newBuilder(final Catalog catalog, final String rule, final Kind kind, + final String contentHash) { + Objects.requireNonNull(catalog, "catalog"); + Objects.requireNonNull(rule, "rule"); + Objects.requireNonNull(kind, "kind"); + ensureSweeperStarted(); + final ClassLoader parent = resolveParent(); + return new RuleClassLoader(kind, catalog, rule, contentHash, parent); + } + + /** + * Promote a freshly-compiled loader to the active slot for its {@code (catalog, rule)}. + * Returns the loader that was active before the swap (if any) so the caller can decide + * whether to retire it: MAL STRUCTURAL / NEW commit and LAL commit retire the prior; + * MAL FILTER_ONLY commit does not (the prior loader's {@code Metrics} subclasses are + * still the storage target via {@code MeterSystem.meterPrototypes}). + * + *

 {@link #newBuilder} on its own does not register the loader in {@code active}; + * {@code commit} is the only path that does, so a failed apply leaves the active map + * pointing at whatever was there before. + */ + public Optional<RuleClassLoader> commit(final RuleClassLoader loader) { + Objects.requireNonNull(loader, "loader"); + final Key key = new Key(loader.getCatalog(), loader.getRule()); + final RuleClassLoader prior = active.put(key, loader); + return Optional.ofNullable(prior); + } + + /** + * Send a loader to the internal graveyard for collection observability. Used by engine + * commit paths that displace a prior loader and know it should be GC'd (the {@link + * #commit} return value carries the prior loader for exactly this purpose). The + * graveyard's daemon sweeper logs INFO when the JVM confirms collection and WARN when a + * retired loader stays alive past the threshold. + */ + public void retire(final RuleClassLoader loader) { + if (loader == null) { + return; + } + ensureSweeperStarted(); + graveyard.retire(loader); + } + + /** + * Drop the active loader (regardless of {@link Kind}) for {@code (catalog, rule)} and + * retire it via the graveyard. Returns the loader that was active before the drop, or + * {@link Optional#empty()} when no loader was registered (already dropped, or never + * installed). Used by the engine's full-teardown path (unregister) to remove the active + * loader and observe its eventual GC. + */ + public Optional<RuleClassLoader> dropRuntime(final Catalog catalog, final String rule) { + final Key key = new Key(catalog, rule); + final RuleClassLoader current = active.remove(key); + if (current == null) { + return Optional.empty(); + } + graveyard.retire(current); + return Optional.of(current); + } + + /** + * Diagnostic: current loader for {@code (catalog, rule)} if any. Used by tests and by + * the runtime-rule {@code /list} surface to show whether a key is currently served by a + * runtime override or a static fall-over. 
+ */ + public Optional<RuleClassLoader> active(final Catalog catalog, final String rule) { + return Optional.ofNullable(active.get(new Key(catalog, rule))); + } + + /** + * Diagnostic: number of currently-active loaders the manager owns (one per + * {@code (catalog, rule)} key with at least one successful {@link #commit} that + * hasn't been {@link #dropRuntime}d). Surfaced for operator visibility through whichever + * receiver exposes loader stats; the runtime-rule REST handler can join this with its + * own {@code /list} per-rule view if the operator wants both numbers in one place. + */ + public int activeCount() { + return active.size(); + } + + /** + * Diagnostic: number of retired loaders the JVM has not yet collected. A reading that + * stays elevated is the leak signal the sweeper WARNs on; it is surfaced here so + * operators can graph it independently. + */ + public int pendingCount() { + return graveyard.pending().size(); + } + + /** Lazy parent-classloader capture. First call wins; subsequent calls return the cached + * reference. {@code Thread.currentThread().getContextClassLoader()} is the OAP app + * loader at every realistic call site (DSL compile is driven from receiver / analyzer + * threads after module boot). 
*/ + private ClassLoader resolveParent() { + ClassLoader local = capturedParent; + if (local != null) { + return local; + } + synchronized (this) { + local = capturedParent; + if (local == null) { + local = Thread.currentThread().getContextClassLoader(); + if (local == null) { + local = ClassLoader.getSystemClassLoader(); + } + capturedParent = local; + } + return local; + } + } + + private void ensureSweeperStarted() { + if (sweeperStarted.compareAndSet(false, true)) { + final ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor(r -> { + final Thread t = new Thread(r, "dsl-classloader-gc"); + t.setDaemon(true); + return t; + }); + exec.scheduleWithFixedDelay(this::sweepInternal, + SWEEP_INTERVAL_SECONDS, SWEEP_INTERVAL_SECONDS, TimeUnit.SECONDS); + } + } + + private void sweepInternal() { + try { + final Collection collected = graveyard.sweep(); + if (!collected.isEmpty() && log.isDebugEnabled()) { + log.debug("dsl-classloader-gc: {} loader(s) confirmed collected", collected.size()); + } + final long nowMs = System.currentTimeMillis(); + for (final ClassLoaderGc.Retired r : graveyard.pending()) { + final long ageMs = nowMs - r.retiredAtMs(); + if (ageMs > STALE_LOADER_WARN_THRESHOLD_MS && r.markWarnedIfNotAlready()) { + log.warn("rule loader leak suspected: {}:{}/{} hash={} pending {} ms " + + "(threshold {}). Check for lingering handler registrations or " + + "samples buffered in DataCarrier partitions.", + r.kind() == Kind.STATIC ? "static" : "runtime-rule", + r.catalog().getWireName(), r.rule(), r.contentHashShort(), ageMs, + STALE_LOADER_WARN_THRESHOLD_MS); + } + } + } catch (final Throwable t) { + log.warn("dsl-classloader-gc sweep failed; will retry next interval", t); + } + } + + /** Composite key — equality + hashCode are catalog-and-rule, never include the loader + * identity. ConcurrentHashMap keys must be hash-stable. 
*/ + private static final class Key { + private final Catalog catalog; + private final String rule; + + Key(final Catalog catalog, final String rule) { + this.catalog = catalog; + this.rule = rule; + } + + @Override + public boolean equals(final Object o) { + if (this == o) { + return true; + } + if (!(o instanceof Key)) { + return false; + } + final Key k = (Key) o; + return catalog == k.catalog && rule.equals(k.rule); + } + + @Override + public int hashCode() { + return 31 * catalog.hashCode() + rule.hashCode(); + } + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoader.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoader.java new file mode 100644 index 000000000000..623946c0e751 --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoader.java @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.core.classloader; + +import java.net.URL; +import java.net.URLClassLoader; +import java.time.LocalDateTime; +import java.time.format.DateTimeFormatter; +import lombok.Getter; + +/** + * Per-file classloader that isolates all classes generated for one DSL rule file. Created + * exclusively through {@link DSLClassLoaderManager#newBuilder}; callers never instantiate + * directly so the manager owns the install / retire lifecycle uniformly across static and + * runtime origins. + * + *

Hosts three artifact families produced by a MAL rule file's compile step — the + * {@code MalExpression} subclass, closure companion classes, and the {@code Metrics} subclass + * that {@code MeterSystem} generates for each declared metric name. For LAL, hosts the + * {@code LalExpression} subclass + sub-method classes plus any inline {@code Metrics} + * subclasses reached through the LAL→MAL bridge. + * + *

All three families living in one classloader gives a single drop-point on hot-remove: + * when the bundle retires, the manager releases the last strong reference and the JVM can + * collect every class the loader defined. The internal phantom-reference queue observes the + * collection so an operator-visible WARN fires when a retired loader stays alive past the + * configured threshold (the leak signal). + * + *

Parent is the OAP app classloader so parent-first lookup resolves shipped classes like + * {@code SumFunction}, {@code HistogramFunction}, {@code MalExpression}, {@code LalExpression} + * without shadowing. No URLs are added; we rely on the Javassist companion {@code ClassPool} + * (parented to {@code ClassPool.getDefault()} with {@code LoaderClassPath(this)} appended) + * to inject generated bytecode back into this loader via {@code defineClass}. + */ +public final class RuleClassLoader extends URLClassLoader implements BytecodeClassDefiner { + private static final DateTimeFormatter NAME_TS = DateTimeFormatter.ofPattern("MMdd-HHmmss"); + + /** Origin tag — {@link DSLClassLoaderManager.Kind#STATIC} for fall-over reload of bundled + * rules; {@link DSLClassLoaderManager.Kind#RUNTIME} for operator-pushed overrides. Visible + * in {@link #getName()} via the {@code static:} / {@code runtime-rule:} prefix. */ + @Getter + private final DSLClassLoaderManager.Kind kind; + @Getter + private final Catalog catalog; + /** Rule identity within the catalog. Renamed from the previous {@code name} field to avoid + * shadowing {@link URLClassLoader#getName()} — that getter must still surface the formatted + * loader name {@code ":/@"} so log output stays unambiguous. */ + @Getter + private final String rule; + @Getter + private final String contentHash; + + public RuleClassLoader(final DSLClassLoaderManager.Kind kind, final Catalog catalog, + final String rule, final String contentHash, final ClassLoader parent) { + super(buildLoaderName(kind, catalog, rule), new URL[0], parent); + this.kind = kind; + this.catalog = catalog; + this.rule = rule; + this.contentHash = contentHash; + } + + /** + * Bytecode-injection entry point used by {@link + * org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem MeterSystem}'s + * runtime-rule path and by Javassist's MAL / LAL generators after they call {@code + * CtClass.toBytecode()}. 
Goes straight through {@code URLClassLoader.defineClass} — + * no Javassist {@code toClass(loader, ProtectionDomain)} reflection, no JDK 17+ + * {@code --add-opens java.base/java.lang} requirement on the OAP container. + * + *

Implements {@link BytecodeClassDefiner} so callers in {@code server-core} can + * type-check against the contract without taking a compile-time dep on the runtime-rule + * receiver plugin. + */ + @Override + public Class defineClass(final String className, final byte[] bytecode) { + return defineClass(className, bytecode, 0, bytecode.length); + } + + private static String buildLoaderName(final DSLClassLoaderManager.Kind kind, + final Catalog catalog, final String rule) { + final String prefix = kind == DSLClassLoaderManager.Kind.STATIC ? "static" : "runtime-rule"; + return prefix + ":" + catalog.getWireName() + "/" + rule + + "@" + LocalDateTime.now().format(NAME_TS); + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/management/runtimerule/RuntimeRule.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/management/runtimerule/RuntimeRule.java new file mode 100644 index 000000000000..5034ed6baec4 --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/management/runtimerule/RuntimeRule.java @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.core.management.runtimerule; + +import lombok.EqualsAndHashCode; +import lombok.Getter; +import lombok.Setter; +import org.apache.skywalking.oap.server.core.analysis.Stream; +import org.apache.skywalking.oap.server.core.analysis.management.ManagementData; +import org.apache.skywalking.oap.server.core.analysis.worker.ManagementStreamProcessor; +import org.apache.skywalking.oap.server.core.source.ScopeDeclaration; +import org.apache.skywalking.oap.server.core.storage.StorageID; +import org.apache.skywalking.oap.server.core.storage.annotation.Column; +import org.apache.skywalking.oap.server.core.storage.type.Convert2Entity; +import org.apache.skywalking.oap.server.core.storage.type.Convert2Storage; +import org.apache.skywalking.oap.server.core.storage.type.StorageBuilder; + +import static org.apache.skywalking.oap.server.core.source.DefaultScopeDefine.RUNTIME_RULE; + +/** + * RuntimeRule is the persisted representation of a runtime-managed MAL or LAL rule file. + * + *

One row per (catalog, name) pair mirroring the on-disk static layout: + *

<ul>
+ *   <li>{@code catalog}: {@code otel-rules} | {@code log-mal-rules} |
+ *       {@code telegraf-rules} | {@code lal}</li>
+ *   <li>{@code name}: relative path under the catalog root without extension, may contain
+ *       {@code /} (e.g. {@code aws-gateway/gateway-service})</li>
+ *   <li>{@code content}: raw file bytes, byte-identical to the original request body; marked
+ *       {@code storageOnly=true} so no backend tries to index the blob</li>
+ *   <li>{@code status}: {@code ACTIVE} or {@code INACTIVE}; runtime rows always take precedence
+ *       over static rules with the same (catalog, name) at load time</li>
+ *   <li>{@code updateTime}: last modification epoch millis</li>
+ * </ul>
+ *
+ *

No {@code version} column: writes are last-write-wins and reconciliation on peer nodes is + * driven by content-hash comparison rather than a monotonic counter. No {@code lastApplyError} + * column either: compile/apply happens per-node after persistence and can diverge across nodes, + * so errors surface inline in the HTTP response, in OAP server logs, and in a per-node in-memory + * map exposed via the runtime-rule list API. + */ +@Setter +@Getter +@ScopeDeclaration(id = RUNTIME_RULE, name = "RuntimeRule") +@Stream(name = RuntimeRule.INDEX_NAME, scopeId = RUNTIME_RULE, builder = RuntimeRule.Builder.class, processor = ManagementStreamProcessor.class) +@EqualsAndHashCode(of = { + "catalog", "name" +}, callSuper = false) +public class RuntimeRule extends ManagementData { + public static final String INDEX_NAME = "runtimerule"; + public static final String CATALOG = "catalog"; + public static final String NAME = "name"; + public static final String CONTENT = "content"; + public static final String STATUS = "status"; + public static final String UPDATE_TIME = "update_time"; + + public static final String STATUS_ACTIVE = "ACTIVE"; + public static final String STATUS_INACTIVE = "INACTIVE"; + + @Column(name = CATALOG) + private String catalog; + @Column(name = NAME) + private String name; + /** + * Raw file bytes (MAL YAML / LAL YAML). Blob stored only, never queried or filtered. + * Size limit matches {@code UITemplate.configuration} — 1 MB is generous for a single rule + * file; larger files should be split. + */ + @Column(name = CONTENT, storageOnly = true, length = 1_000_000) + private String content; + @Column(name = STATUS) + private String status; + /** + * Boxed {@link Long} (not primitive {@code long}) so the column type matches sibling + * {@code ManagementData} entities (UITemplate, UIMenu) that share the ES merging-index + * {@code sw_management}; a primitive-vs-boxed mismatch is rejected by + * {@code IndexController.checkModelColumnConflicts} at startup. 
+ */ + @Column(name = UPDATE_TIME) + private Long updateTime = 0L; + + @Override + public StorageID id() { + return new StorageID().append(CATALOG, catalog).append(NAME, name); + } + + public static class Builder implements StorageBuilder { + @Override + public RuntimeRule storage2Entity(final Convert2Entity converter) { + final RuntimeRule rule = new RuntimeRule(); + rule.setCatalog((String) converter.get(CATALOG)); + rule.setName((String) converter.get(NAME)); + rule.setContent((String) converter.get(CONTENT)); + rule.setStatus((String) converter.get(STATUS)); + final Object updateTime = converter.get(UPDATE_TIME); + if (updateTime != null) { + rule.setUpdateTime(((Number) updateTime).longValue()); + } + return rule; + } + + @Override + public void entity2Storage(final RuntimeRule entity, final Convert2Storage converter) { + converter.accept(CATALOG, entity.getCatalog()); + converter.accept(NAME, entity.getName()); + converter.accept(CONTENT, entity.getContent()); + converter.accept(STATUS, entity.getStatus()); + converter.accept(UPDATE_TIME, entity.getUpdateTime()); + } + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/RuleSetMerger.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/RuleSetMerger.java new file mode 100644 index 000000000000..d22240174c71 --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/RuleSetMerger.java @@ -0,0 +1,192 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.core.rule.ext; + +import java.util.ArrayList; +import java.util.Comparator; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.ServiceLoader; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.library.module.ModuleManager; + +/** + * Folds a disk-loaded baseline + every {@link RuntimeRuleOverrideResolver} discovered on the + * classpath into a single {@code (name -> bytes)} map per catalog. The MAL and LAL static-file + * loaders feed the result directly into their compile pipelines. + * + *

Merge order

+ * Resolvers are applied in ascending {@link RuntimeRuleOverrideResolver#priority()} + * order so higher-priority entries overwrite lower-priority ones. The disk map is the initial + * baseline (priority −∞). + * + *

Side effect: {@link StaticRuleRegistry}

+ * The merger calls {@link StaticRuleRegistry#record(String, String, byte[])} for every disk + * baseline entry before merging, so the runtime apply pipeline's delta classifier can compare + * a later REST {@code /addOrUpdate} body against the original on-disk content even when a + * resolver has substituted the boot-time bytes. Recording happens whether or not any resolver + * is present. + */ +@Slf4j +public final class RuleSetMerger { + + private RuleSetMerger() { + } + + /** + * Process-wide {@link ModuleManager} stashed by {@code CoreModuleProvider} during start. + * Callers (MAL / LAL static loaders) reach it via {@link #merge(String, Map)} so they + * don't have to thread {@code ModuleManager} through every signature. Tests that don't + * boot core leave this {@code null} and resolvers needing the manager return empty + * contributions. + */ + private static volatile ModuleManager INSTALLED_MANAGER; + + /** + * Set the process-wide module manager. Called once from + * {@code CoreModuleProvider.start()} after the management streams are registered. + * Tests may call with {@code null} to reset between cases. + */ + public static void installManager(final ModuleManager manager) { + INSTALLED_MANAGER = manager; + } + + /** + * Default-manager overload — the path most production callers take. Looks up the + * process-wide {@link ModuleManager} installed by core, discovers every + * {@link RuntimeRuleOverrideResolver} via {@link ServiceLoader}, and merges with the + * supplied disk baseline. + */ + public static Map merge(final String catalog, final Map diskBytes) { + return merge(catalog, diskBytes, discoverResolvers(), INSTALLED_MANAGER); + } + + /** + * Explicit-manager overload for callers that already hold a {@link ModuleManager} (e.g. + * receivers being updated to thread it through directly). Same merge semantics as the + * default overload; bypasses the static manager. + * + * @param catalog catalog identifier (e.g. {@code "otel-rules"}, {@code "lal"}). 
Recorded + * on each {@link StaticRuleRegistry} entry for the runtime delta classifier. + * @param diskBytes raw disk content keyed by rule name (file basename without extension). + * Already filtered by the loader's allow-list (e.g. enabled rules). + * @param manager OAP {@link ModuleManager}, threaded through to each resolver's + * {@link RuntimeRuleOverrideResolver#loadAll(String, ModuleManager)}. May be + * {@code null} when the caller has no module context (tests) — resolvers + * that need it return an empty map gracefully. + * @return ordered merge of disk + resolvers; entries the merge resolved as + * {@link RuntimeRuleOverrideResolver.Decision#INACTIVE} are absent. + */ + public static Map merge(final String catalog, + final Map diskBytes, + final ModuleManager manager) { + return merge(catalog, diskBytes, discoverResolvers(), manager); + } + + /** + * Variant with an explicit resolver list — primarily for tests that want to bypass + * {@link ServiceLoader}. + */ + public static Map merge(final String catalog, + final Map diskBytes, + final List resolvers, + final ModuleManager manager) { + // Snapshot the on-disk baseline into StaticRuleRegistry before we start mutating the + // working map; the runtime-rule delta classifier reads original bytes from there even + // when a high-priority resolver has substituted them in `out`. 
+ final StaticRuleRegistry registry = StaticRuleRegistry.active(); + if (registry != null) { + diskBytes.forEach((name, bytes) -> registry.record(catalog, name, bytes)); + } + + final Map out = new HashMap<>(diskBytes); + + if (resolvers == null || resolvers.isEmpty()) { + return out; + } + + final List ordered = new ArrayList<>(resolvers); + ordered.sort(Comparator.comparingInt(RuntimeRuleOverrideResolver::priority)); + + for (final RuntimeRuleOverrideResolver resolver : ordered) { + final Map contributions; + try { + contributions = resolver.loadAll(catalog, manager); + } catch (final Throwable t) { + log.warn("RuntimeRuleOverrideResolver {} loadAll({}) threw — skipping resolver", + resolver.getClass().getName(), catalog, t); + continue; + } + if (contributions == null || contributions.isEmpty()) { + continue; + } + contributions.forEach((name, res) -> { + if (res == null || res.getDecision() == null) { + return; + } + switch (res.getDecision()) { + case ACTIVE: + if (res.getContent() == null) { + log.warn("Resolver {} returned ACTIVE with null content for {}/{} — ignored", + resolver.getClass().getName(), catalog, name); + return; + } + out.put(name, res.getContent()); + break; + case INACTIVE: + out.remove(name); + break; + default: + // unreachable — enum is closed + } + }); + } + return out; + } + + /** + * Cache the discovered resolver list per process. {@code ServiceLoader} is cheap to + * iterate but instantiating fresh resolvers per call would defeat their internal + * caches. Lazy-initialised so tests that swap in stubs via {@link #merge(String, Map, List)} + * never trigger the discovery path. 
+ */ + private static volatile List CACHED_RESOLVERS; + + private static List discoverResolvers() { + List cached = CACHED_RESOLVERS; + if (cached != null) { + return cached; + } + synchronized (RuleSetMerger.class) { + cached = CACHED_RESOLVERS; + if (cached != null) { + return cached; + } + final List discovered = new ArrayList<>(); + for (final RuntimeRuleOverrideResolver r : ServiceLoader.load(RuntimeRuleOverrideResolver.class)) { + discovered.add(r); + log.info("RuntimeRuleOverrideResolver registered: {} (priority={})", + r.getClass().getName(), r.priority()); + } + CACHED_RESOLVERS = discovered; + return discovered; + } + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/RuntimeRuleOverrideResolver.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/RuntimeRuleOverrideResolver.java new file mode 100644 index 000000000000..c3c323672424 --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/RuntimeRuleOverrideResolver.java @@ -0,0 +1,151 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.core.rule.ext; + +import java.util.Map; +import org.apache.skywalking.oap.server.library.module.ModuleManager; + +/** + * Boot-time resolver SPI consulted by MAL / LAL static-file loaders. Each implementation + * contributes its own view of "what rules should be live for a catalog at boot" — a DB + * resolver in the runtime-rule plugin, a future GitOps resolver in another plugin, etc. + * + *

Discovery

+ * Loaded via {@link java.util.ServiceLoader}. Plugins ship a + * {@code META-INF/services/org.apache.skywalking.oap.server.core.rule.ext.RuntimeRuleOverrideResolver} + * line per implementation. Implementations MUST have a public no-arg constructor. + * + *

Merge semantics

+ * {@link RuleSetMerger} folds the disk view + every resolver's {@link #loadAll} into a single + * {@code (name -> bytes)} map per catalog, with priority deciding ties: + *
    + *
  • Lower {@link #priority} resolvers are applied first; higher priority overwrites.
  • + *
  • {@link Decision#ACTIVE} substitutes the resolver's content into the merged set.
  • + *
  • {@link Decision#INACTIVE} removes the entry from the merged set — even if the disk + * file or a lower-priority resolver had content for that key.
  • + *
  • A resolver omits a key from {@link #loadAll} when it has no opinion; the next lower + * priority resolver (or the disk baseline) remains the source of truth.
  • + *
+ * + *

Examples

+ *
+ *   Resolver A (priority 100, runtime-rule DB):
+ *     "vm"          => Resolution(ACTIVE,    bytes-from-DB)
+ *     "noisy-rule"  => Resolution(INACTIVE, null)
+ *
+ *   Result for catalog "otel-rules":
+ *     - if disk has "vm.yaml":          merged["vm"]          = bytes-from-DB
+ *     - if disk has "noisy-rule.yaml":  merged drops "noisy-rule" entirely
+ *     - if disk lacks "vm.yaml":        merged["vm"]          = bytes-from-DB (DB-only rule)
+ * 
+ */ +public interface RuntimeRuleOverrideResolver { + + /** + * Resolver priority. Higher number wins on conflict (last-write-wins under + * ascending-priority application). Default {@code 0}. + * + *

Suggested ranges: + *

    + *
  • {@code 0–99} — defaults, low-trust sources
  • + *
  • {@code 100} — runtime-rule DB ({@code DbOverrideRuntimeRuleResolver})
  • + *
  • {@code 200–999} — externally-managed config (GitOps, k8s ConfigMap, etc.)
  • + *
+ * Resolvers with equal priority are applied in classpath / ServiceLoader iteration + * order — explicit priority is preferred over relying on that. + * + * @return higher = stronger override. + */ + default int priority() { + return 0; + } + + /** + * Every {@code (name, Resolution)} this resolver wants to contribute for the given + * catalog. Names not present in the returned map mean "I have no opinion" — the + * merge engine leaves the disk baseline (or a lower-priority resolver's contribution) + * in place for those keys. + * + *

Implementations are expected to cache. The {@code manager} reference lets a + * resolver look up services it needs (e.g. {@code RuntimeRuleManagementDAO} via + * the storage module). It may be {@code null} when called from test paths or from + * loaders that don't have a module context — resolvers that need a manager should + * return an empty map in that case rather than throw. + * + * @param catalog one of {@code "otel-rules"}, {@code "log-mal-rules"}, {@code "lal"}, + * or a future catalog name. Resolvers should return an empty map for + * catalogs they don't recognise. + * @param manager OAP module manager, or {@code null} when the caller has no module + * context (tests). + * @return per-name Resolution; never {@code null} (return an empty map instead). + */ + Map loadAll(String catalog, ModuleManager manager); + + /** + * Per-key decision. Names follow the {@code RuntimeRule} status enum so wire vocabulary + * stays consistent across the API surface (REST endpoint statuses, DB column values, + * resolver decisions). + */ + enum Decision { + /** This resolver wants the rule live with the supplied content. */ + ACTIVE, + /** This resolver wants the rule removed regardless of disk content. */ + INACTIVE + } + + /** + * One resolver's opinion about a single rule. Immutable. + */ + final class Resolution { + private final Decision decision; + private final byte[] content; + + public Resolution(final Decision decision, final byte[] content) { + this.decision = decision; + this.content = content; + } + + /** + * Convenience constructor for {@link Decision#ACTIVE} resolutions — the only kind + * that carries content. + */ + public static Resolution active(final byte[] content) { + return new Resolution(Decision.ACTIVE, content); + } + + /** + * Convenience for {@link Decision#INACTIVE} resolutions — content is null. 
+ */ + public static Resolution inactive() { + return new Resolution(Decision.INACTIVE, null); + } + + public Decision getDecision() { + return decision; + } + + /** + * Raw rule bytes when {@link #getDecision()} is {@link Decision#ACTIVE}; {@code null} + * for {@link Decision#INACTIVE}. + */ + public byte[] getContent() { + return content; + } + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/StaticRuleRegistry.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/StaticRuleRegistry.java new file mode 100644 index 000000000000..7e2777bd1d19 --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/rule/ext/StaticRuleRegistry.java @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.core.rule.ext; + +import java.nio.charset.StandardCharsets; +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; +import java.util.Map; +import java.util.Optional; +import java.util.concurrent.ConcurrentHashMap; + +/** + * Process-wide snapshot of the on-disk static rule content seen at boot, recorded by + * {@link RuleSetMerger} before any {@link RuntimeRuleOverrideResolver} substitutes operator + * overrides. The runtime-rule REST handler reads from this registry to compute + * {@code priorContent} for the delta classifier when no DB row yet exists for a + * {@code (catalog, name)}. + * + *

Singleton because resolvers populate it during analyzer-module {@code start()}, before + * the receiver modules hosting the runtime-rule admin surface boot. + */ +public final class StaticRuleRegistry { + + private static final StaticRuleRegistry ACTIVE = new StaticRuleRegistry(); + + /** + * @return the process-wide singleton. Always non-null; calls on a fresh registry return + * {@link Optional#empty()} until {@link #record} has been invoked for that + * {@code (catalog, name)}. + */ + public static StaticRuleRegistry active() { + return ACTIVE; + } + + /** Map key is {@code catalog + ":" + name}; matches the runtime-rule catalog naming. */ + private final ConcurrentHashMap staticContent = new ConcurrentHashMap<>(); + + private StaticRuleRegistry() { + } + + /** + * Record the raw disk content for one static rule. Idempotent — repeated calls for the + * same key replace the recorded bytes, which is the desired behaviour when a boot pass + * re-reads the same file. + * + * @param catalog catalog identifier (e.g., {@code "otel-rules"}, {@code "log-mal-rules"}, + * {@code "lal"}). + * @param name rule name (file path under catalog root, without extension; may include + * {@code /} for nested layouts). + * @param content raw disk bytes — decoded as UTF-8 and stored as a String for parity with + * how DB rows store rule bodies. + */ + public void record(final String catalog, final String name, final byte[] content) { + if (catalog == null || name == null || content == null) { + return; + } + staticContent.put(key(catalog, name), new String(content, StandardCharsets.UTF_8)); + } + + /** + * @return the raw disk content for the given {@code (catalog, name)}, or + * {@link Optional#empty()} if no static file was recorded for it. 
+ */ + public Optional find(final String catalog, final String name) { + if (catalog == null || name == null) { + return Optional.empty(); + } + return Optional.ofNullable(staticContent.get(key(catalog, name))); + } + + /** + * Read-only view of every {@code catalog:name} → content pair currently recorded. Used by + * the runtime-rule reconciler to seed synthetic applied-state entries at boot (so tick + * idempotency works for rules that live only on disk) and to rehydrate after an operator + * {@code /delete} removes the runtime tombstone covering a shipped static rule. + * + *

The map is a live read-through view of the registry's backing store; iteration order + * is unspecified. Callers must not mutate the returned map. + */ + public Map entries() { + return Collections.unmodifiableMap(staticContent); + } + + /** + * Every {@code (name, content)} pair recorded under {@code catalog}, sorted by name. + * Used by {@code GET /runtime/rule/bundled} to render the static-rule view that UIs + * merge with the runtime-overrides view from {@code GET /runtime/rule/list}. + */ + public List findByCatalog(final String catalog) { + if (catalog == null) { + return Collections.emptyList(); + } + final String prefix = catalog + ":"; + final List matches = new ArrayList<>(); + for (final Map.Entry e : staticContent.entrySet()) { + if (e.getKey().startsWith(prefix)) { + final String name = e.getKey().substring(prefix.length()); + matches.add(new NamedRule(name, e.getValue())); + } + } + matches.sort((a, b) -> a.name.compareTo(b.name)); + return matches; + } + + /** + * Pair of (rule name, raw YAML content) returned by {@link #findByCatalog(String)}. + * Public so the REST handler can iterate the result directly without a tuple type. + */ + public static final class NamedRule { + private final String name; + private final String content; + + public NamedRule(final String name, final String content) { + this.name = name; + this.content = content; + } + + public String getName() { + return name; + } + + public String getContent() { + return content; + } + } + + /** + * Split the registry's {@code catalog:name} key back into its components. Centralised + * here so callers don't hardcode the separator. + */ + public static String[] splitKey(final String key) { + if (key == null) { + return null; + } + final int colon = key.indexOf(':'); + if (colon <= 0 || colon == key.length() - 1) { + return null; + } + return new String[] {key.substring(0, colon), key.substring(colon + 1)}; + } + + /** + * Test hook — drops every recorded entry. 
Intentionally package-private so only tests in + * the same package can reach it; production code must not clear a populated registry. + */ + void clear() { + staticContent.clear(); + } + + private static String key(final String catalog, final String name) { + return catalog + ":" + name; + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/source/DefaultScopeDefine.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/source/DefaultScopeDefine.java index 0916fe4fdd59..f768aac32eb6 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/source/DefaultScopeDefine.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/source/DefaultScopeDefine.java @@ -158,6 +158,7 @@ public class DefaultScopeDefine { public static final int ALARM_RECOVERY = 95; public static final int GEN_AI_PROVIDER_ACCESS = 96; public static final int GEN_AI_MODEL_ACCESS = 97; + public static final int RUNTIME_RULE = 98; /** * Catalog of scope, the metrics processor could use this to group all generated metrics by oal rt. 
diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/StorageModule.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/StorageModule.java index cd711dedeaa7..4a7ae9c14276 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/StorageModule.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/StorageModule.java @@ -19,8 +19,10 @@ package org.apache.skywalking.oap.server.core.storage; import org.apache.skywalking.oap.server.core.storage.cache.INetworkAddressAliasDAO; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; import org.apache.skywalking.oap.server.core.storage.management.UIMenuManagementDAO; import org.apache.skywalking.oap.server.core.storage.management.UITemplateManagementDAO; +import org.apache.skywalking.oap.server.core.storage.model.ModelInstaller; import org.apache.skywalking.oap.server.core.storage.profiling.asyncprofiler.IAsyncProfilerTaskLogQueryDAO; import org.apache.skywalking.oap.server.core.storage.profiling.asyncprofiler.IAsyncProfilerTaskQueryDAO; import org.apache.skywalking.oap.server.core.storage.profiling.asyncprofiler.IJFRDataQueryDAO; @@ -102,7 +104,15 @@ public Class[] services() { IPprofTaskQueryDAO.class, IPprofTaskLogQueryDAO.class, IPprofDataQueryDAO.class, - StorageTTLStatusQuery.class + StorageTTLStatusQuery.class, + // Exposed for the runtime-rule reconciler — it calls ModelInstaller.isExists + // after a hot-apply to verify DDL landed (BanyanDB swallows ALREADY_EXISTS on + // shape-changing re-creates, so post-verify is the only way to detect silent + // schema divergence). + ModelInstaller.class, + // Exposed so the runtime-rule receiver can persist / list / delete RuntimeRule + // rows independent of the generic IManagementDAO path. 
+ RuntimeRuleManagementDAO.class, }; } } diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/annotation/ValueColumnMetadata.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/annotation/ValueColumnMetadata.java index 58648a3bb4b3..de62c44738fe 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/annotation/ValueColumnMetadata.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/annotation/ValueColumnMetadata.java @@ -59,6 +59,16 @@ public void overrideColumnName(String oldName, String newName) { columnNameOverrideRule.put(oldName, newName); } + /** + * Drop the metadata entry for the given model name. Used by the runtime-rule + * teardown path so a subsequent re-register under a different scope (e.g. a + * SHAPE-BREAK from SERVICE → SERVICE_INSTANCE) is not silently ignored by + * {@link #putIfAbsent}. No-op when the entry is absent. + */ + public void remove(String modelName) { + mapping.remove(modelName); + } + /** * Fetch the value column name of the given metrics name. */ diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/management/RuntimeRuleManagementDAO.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/management/RuntimeRuleManagementDAO.java new file mode 100644 index 000000000000..bba9f68d0731 --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/management/RuntimeRuleManagementDAO.java @@ -0,0 +1,106 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.core.storage.management; + +import java.io.IOException; +import java.util.List; +import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; +import org.apache.skywalking.oap.server.core.storage.DAO; + +/** + * Per-backend read / write / delete DAO for runtime-managed MAL/LAL rule files. The generic + * {@link org.apache.skywalking.oap.server.core.storage.IManagementDAO#insert} path is not + * used: BanyanDB's generic impl never persists, and ES/JDBC short-circuit when the row + * already exists, which silently breaks {@code /addOrUpdate} and {@code /inactivate} (every + * call after the first becomes a no-op). Each backend implements upsert semantics directly + * here so the persist-is-commit invariant holds across all three. + */ +public interface RuntimeRuleManagementDAO extends DAO { + + /** + * @return every runtime-rule file, ACTIVE and INACTIVE alike. The reconciler is the sole + * caller today; it diffs the full set against its in-memory snapshot on every tick. + */ + List getAll() throws IOException; + + /** + * Upsert by composite key (catalog, name). Replaces {@code content}, {@code status} and + * {@code updateTime} when the row already exists; inserts a new row otherwise. 
Both + * {@code /addOrUpdate} (every call after the first) and {@code /inactivate} (every call) + * depend on overwrite semantics — without it the operator sees 200 / structural_applied + * / inactivated while the backing row stays unchanged. + * + * @throws IOException when the underlying storage write fails. Callers translate this + * into a 5xx response so the operator does not get a false success. + */ + void save(RuntimeRule rule) throws IOException; + + /** + * Hard delete a runtime-rule file by composite key. Idempotent: if no record matches, + * implementations must return silently rather than throw. A successful return does not + * imply the backend physically reclaimed the storage (e.g. Elasticsearch may mark it + * deleted pending a merge); it only implies the file will not be returned by + * {@link #getAll()}. + */ + void delete(String catalog, String name) throws IOException; + + /** + * Logical representation of one runtime-managed rule file (YAML content + status + + * update time, keyed by catalog + name). Lives in server-core to avoid a module + * dependency from storage plugins onto the runtime-rule receiver plugin. Each storage + * impl populates this from its own result set. Named for the operator-facing concept + * ("a rule file") rather than the persistence shape ("a storage row"). 
+ */ + class RuntimeRuleFile { + private final String catalog; + private final String name; + private final String content; + private final String status; + private final long updateTime; + + public RuntimeRuleFile(final String catalog, final String name, final String content, + final String status, final long updateTime) { + this.catalog = catalog; + this.name = name; + this.content = content; + this.status = status; + this.updateTime = updateTime; + } + + public String getCatalog() { + return catalog; + } + + public String getName() { + return name; + } + + public String getContent() { + return content; + } + + public String getStatus() { + return status; + } + + public long getUpdateTime() { + return updateTime; + } + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java index 016b90b3256d..721bd8bf0d9a 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java @@ -27,21 +27,74 @@ import org.apache.skywalking.oap.server.core.storage.StorageException; import org.apache.skywalking.oap.server.library.client.Client; import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.library.module.Service; /** - * The core module installation controller. + * The core module installation controller — subscribed to {@link ModelRegistry} events so + * every registered {@link Model} triggers either an install (on {@code whenCreating}) or a + * drop (on {@code whenRemoving}) on the active backend. + * + *

Exposed as a {@link Service} so cross-module callers (today: the runtime-rule reconciler) + * can retrieve the active backend's installer via {@code StorageModule.provider().getService( + * ModelInstaller.class)} and invoke {@link #isExists(Model, StorageManipulationOpt)} for + * post-apply DDL verification. The storage providers register their concrete subclass as the + * {@code ModelInstaller} service implementation; the abstract type is the SPI lookup key. */ @RequiredArgsConstructor @Slf4j -public abstract class ModelInstaller implements ModelCreator.CreatingListener { +public abstract class ModelInstaller implements ModelRegistry.CreatingListener, Service { protected final Client client; protected final ModuleManager moduleManager; @Override - public void whenCreating(Model model) throws StorageException { + public void whenCreating(Model model, StorageManipulationOpt opt) throws StorageException { + final StorageManipulationOpt.Flags flags = opt.getFlags(); + + // Zero server RPCs — peer-side ticks. The earlier order called isExists first + // (which on ES/JDBC fires a backend read) and only then checked the policy, + // which made the contract a half-truth. Gate ahead of isExists so a peer apply + // is genuinely zero-RPC. + if (!flags.isInspectBackend()) { + opt.recordOutcome("table", model.getName(), + StorageManipulationOpt.Outcome.SKIPPED_NOT_ALLOWED, + "local-cache-only mode; main-node is expected to have installed this resource"); + log.debug( + "install: model [{}] not installed; local-cache-only mode — skipping (no isExists probe)", + model.getName() + ); + return; + } + + // Strict verify path — run the read-only existence/shape inspection and surface + // missing or mismatched resources as fatal so module bootstrap exits (k8s pod + // backloop). Operator must align with the init OAP first. Distinct from the + // legacy non-init poll loop further down: that loop waits forever; this path + // fails fast. 
+ if (flags.isFailOnAbsence() || flags.isFailOnShapeMismatch()) { + InstallInfo info = isExists(model, opt); + if (flags.isFailOnShapeMismatch() && opt.hasShapeMismatch()) { + final StorageManipulationOpt.ResourceOutcome o = opt.firstShapeMismatch(); + throw new StorageException( + "local-cache-verify boot: backend resource '" + (o == null ? model.getName() : o.getResourceName()) + + "' shape diverges from declared model — refusing to start. " + + "Reconcile via the init OAP's /runtime/rule/addOrUpdate first. diff: " + + (o == null ? "n/a" : o.getDiff())); + } + if (flags.isFailOnAbsence() && !info.isAllExist()) { + throw new StorageException( + "local-cache-verify boot: backend resources for model '" + model.getName() + + "' are not all present — refusing to start. Wait for the init OAP to " + + "create them or push the runtime rule. " + info.buildInstallInfoMsg()); + } + return; + } + + // Legacy poll loop for non-init OAPs that did not opt into the strict verify + // mode. Static models (boot-time) still take this path; runtime-rule reconciler + // explicitly chooses verify so this loop is bypassed. if (RunningMode.isNoInitMode()) { while (true) { - InstallInfo info = isExists(model); + InstallInfo info = isExists(model, opt); if (!info.isAllExist()) { try { log.info( @@ -56,16 +109,43 @@ public void whenCreating(Model model) throws StorageException { break; } } - } else { - InstallInfo info = isExists(model); - if (!info.isAllExist()) { - log.info( - "install info: {}. table for model: [{}] not all required resources exist, creating or updating...", - info.buildInstallInfoMsg(), model.getName() - ); - createTable(model); - } + return; + } + + InstallInfo info = isExists(model, opt); + if (info.isAllExist()) { + return; + } + if (!flags.isCreateMissing()) { + // Inspect-but-don't-create: caller wants existence reported as outcome but + // explicitly forbids DDL. Today no canonical mode hits this branch, but the + // flag combination is valid (e.g. 
dry-run reporting) and falling through to + // createTable would silently violate the contract. + opt.recordOutcome("table", model.getName(), + StorageManipulationOpt.Outcome.MISSING, + "missing on backend; createMissing flag is off — skipping DDL"); + return; + } + log.info( + "install info: {}. table for model: [{}] not all required resources exist, creating or updating...", + info.buildInstallInfoMsg(), model.getName() + ); + createTable(model, opt); + opt.recordOutcome("table", model.getName(), + StorageManipulationOpt.Outcome.CREATED, null); + } + + @Override + public void whenRemoving(Model model, StorageManipulationOpt opt) throws StorageException { + if (!opt.getFlags().isDropOnRemoval()) { + opt.recordOutcome("table", model.getName(), + StorageManipulationOpt.Outcome.SKIPPED_NOT_ALLOWED, + "dropOnRemoval flag is off; server drop is main-node responsibility (or boot path that never drops)"); + return; } + dropTable(model, opt); + opt.recordOutcome("table", model.getName(), + StorageManipulationOpt.Outcome.DROPPED, null); } public void start() { @@ -83,15 +163,58 @@ protected final void overrideColumnName(String columnName, String newName) { } /** - * Check whether the storage entity exists. Need to implement based on the real storage. + * Check whether the storage entity exists, reporting per-resource outcomes on + * {@code opt}. Backends with in-isExists side effects (BanyanDB's auto-update of + * {@code Measure}/{@code IndexRule}/{@code IndexRuleBinding}) honour + * {@link StorageManipulationOpt#isLocalCacheOnly()} to suppress server writes when the + * caller is a peer node. */ - public abstract InstallInfo isExists(Model model) throws StorageException; + public abstract InstallInfo isExists(Model model, StorageManipulationOpt opt) throws StorageException; /** - * Create the storage entity. All creations should be after the {@link #isExists(Model)} check. + * Create the storage entity. 
All creations should be after the + * {@link #isExists(Model, StorageManipulationOpt)} check. + * + *

Default implementation delegates to {@link #createTable(Model)} for source + * compatibility with backends that don't yet need the opt; subclasses that want + * to capture per-call state (e.g. BanyanDB's etcd {@code mod_revision} via + * {@link StorageManipulationOpt#recordModRevision(long)} for a post-install + * fence) override this overload. + */ + public void createTable(Model model, StorageManipulationOpt opt) throws StorageException { + createTable(model); + } + + /** + * Legacy create — superseded by {@link #createTable(Model, StorageManipulationOpt)}. + * Subclasses that don't need opt access keep overriding this method; the default + * orchestrator path goes through the opt-aware overload. */ public abstract void createTable(Model model) throws StorageException; + /** + * Drop the storage entity for a runtime-removed model. Default is a no-op — only backends whose physical + * schema is per-logical-model (BanyanDB Measure/Stream) should override to perform the actual drop. + * JDBC and Elasticsearch are append-only by design and keep the underlying tables/indices intact even when + * a model is removed from the in-memory registry; their implementations should leave this as a no-op. + * + *

Invoked by {@link ModelRegistry.CreatingListener#whenRemoving(Model, StorageManipulationOpt)} which is fired from + * {@link ModelRegistry#remove(Class, StorageManipulationOpt)} during runtime-rule hot-remove (MAL/LAL). + * Not invoked on startup. + */ + public void dropTable(Model model) throws StorageException { + } + + /** + * Opt-aware drop variant. Backends that need post-drop bookkeeping on the opt + * (e.g. BanyanDB capturing the tombstone {@code mod_revision} for a + * {@code SchemaBarrierService.AwaitSchemaDeleted} fence) override this overload; + * default delegates to the no-arg {@link #dropTable(Model)}. + */ + public void dropTable(Model model, StorageManipulationOpt opt) throws StorageException { + dropTable(model); + } + @Getter @Setter public abstract static class InstallInfo { diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelRegistry.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelRegistry.java new file mode 100644 index 000000000000..b05dd785a349 --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelRegistry.java @@ -0,0 +1,85 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.core.storage.model; + +import java.util.List; +import org.apache.skywalking.oap.server.core.storage.StorageException; +import org.apache.skywalking.oap.server.core.storage.annotation.Storage; +import org.apache.skywalking.oap.server.library.module.Service; + +/** + * Registry for every {@link Model} the OAP process knows about, plus the event bus + * ({@link CreatingListener}) that storage installers subscribe to. Every mutation carries + * a {@link StorageManipulationOpt} so callers can express policy (full-install vs. + * local-cache-only) and installers can report per-resource outcomes back through the same + * object. + */ +public interface ModelRegistry extends Service { + /** + * Add a new model with a caller-specified {@link StorageManipulationOpt policy}. If a model + * with the same {@code storage#getModelName()} and {@code storage#getDownsampling()} already + * exists, the call is treated as idempotent and the existing model is returned without firing + * {@link CreatingListener#whenCreating(Model, StorageManipulationOpt)} again. + * + *

The {@code opt} is mutable: installers record per-resource outcomes on it as they run. + * Callers may inspect {@link StorageManipulationOpt#getOutcomes()} after return. + * + * @return the created or pre-existing model + */ + Model add(Class aClass, int scopeId, Storage storage, StorageManipulationOpt opt) + throws StorageException; + + /** + * Remove an existing model by its stream class with a caller-specified policy. All models + * registered through {@link #add(Class, int, Storage, StorageManipulationOpt)} with the given + * stream class (across any downsampling variants) are removed from the registry, and every + * registered {@link CreatingListener#whenRemoving(Model, StorageManipulationOpt)} is fired for + * each. Used by runtime rule hot-update (MAL/LAL hot-remove); not intended to be called during + * the startup path. + * + *

Peer-node callers pass {@link StorageManipulationOpt#localCacheOnly()} so installers + * skip the server-side drop and record {@link StorageManipulationOpt.Outcome#SKIPPED_NOT_ALLOWED} + * against the affected resources. + * + * @return the list of models that were removed, empty if none matched + */ + List remove(Class streamClass, StorageManipulationOpt opt) throws StorageException; + + void addModelListener(CreatingListener listener) throws StorageException; + + interface CreatingListener { + /** + * Invoked when a model is registered via {@link ModelRegistry#add}. Listeners receive + * the {@link StorageManipulationOpt} the caller threaded through the registry — skip + * server-side DDL when {@link StorageManipulationOpt#isLocalCacheOnly()}, and record + * per-resource outcomes on the opt for the caller to inspect. + */ + void whenCreating(Model model, StorageManipulationOpt opt) throws StorageException; + + /** + * Invoked when a model is removed via {@link ModelRegistry#remove}. Default is a no-op + * so listeners that don't own server-side resources (e.g., pure schema caches) compile + * without boilerplate. Storage installers that own physical schema (BanyanDB measures) + * override this and skip the server-side drop when + * {@link StorageManipulationOpt#isLocalCacheOnly()}. + */ + default void whenRemoving(Model model, StorageManipulationOpt opt) throws StorageException { + } + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java new file mode 100644 index 000000000000..935788772112 --- /dev/null +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java @@ -0,0 +1,485 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.core.storage.model; + +import java.util.Collections; +import java.util.List; +import java.util.concurrent.CopyOnWriteArrayList; +import java.util.concurrent.atomic.AtomicLong; +import lombok.Builder; +import lombok.Getter; + +/** + * Per-call policy + outcome for a storage model manipulation — threaded through the + * {@link ModelRegistry} → {@link ModelRegistry.CreatingListener} → {@link ModelInstaller} call + * chain. The {@link Mode} is set by the caller up-front; outcome entries are appended by the + * installer as it examines each underlying storage resource (table, index, measure, index + * rule, binding, template, etc.). + * + *

Canonical profiles — always use a named factory

+ * Four modes, each matching one distinct caller scenario. Use the factories; the + * constructor is private. If a future scenario genuinely needs a fifth mode, add it to + * {@link Mode} here so every caller keeps picking from a known set. + * + *

{@link #fullInstall()} — {@link Mode#FULL_INSTALL} (predicate: {@link #isFullInstall()})

+ *

Callers: + *

    + *
  • Main-node REST apply ({@code /addOrUpdate}, {@code /delete}) — operator-driven, + * structural changes explicitly intended (recovery pushes use the same + * {@code /addOrUpdate} route with {@code allowStorageChange=true} + + * {@code force=true})
  • + *
  • Main-node reconciler tick for files that haven't yet converged via REST + * (rare — REST usually wins the race)
  • + *
+ *

Note: {@code /inactivate} is a soft-pause that goes through + * {@link Mode#LOCAL_CACHE_ONLY} — backend schema and data are preserved; only + * OAP-internal state (compiled bundles, dispatch, prototypes) is torn down so + * cheap re-activation works on the next {@code /addOrUpdate}. + *

Backend behaviour: full DDL — create missing tables / measures, drop retired ones, + * auto-update BanyanDB {@code Measure} / {@code IndexRule} / {@code IndexRuleBinding} on + * shape mismatch, and create / update index rules + bindings. Reshaping is treated as + * intended because the caller came in through an on-demand operator request. + * + *

{@link #createIfAbsent()} — {@link Mode#CREATE_IF_ABSENT} (predicate: {@link #isCreateIfAbsent()})

+ *

Callers: + *

    + *
  • Startup-time model registration (every OAP, via stream processors — static MAL / + * LAL files on disk)
  • + *
+ *

Backend behaviour: create resources that are absent; when a resource is present with + * a shape that differs from what the model declares, record + * {@link Outcome#SKIPPED_SHAPE_MISMATCH SKIPPED_SHAPE_MISMATCH} and do not + * call update/reshape. Silent acceptance on reboot used to happen on BanyanDB + * ({@code ALREADY_EXISTS} swallow) and JDBC (column-type changes undetected); explicit + * skip surfaces the mismatch to the operator, who must reshape via the on-demand + * runtime-rule REST endpoint (the only workflow that may change backend schema). + * + *

{@link #localCacheVerify()} — {@link Mode#LOCAL_CACHE_VERIFY} (predicate: {@link #isLocalCacheVerify()})

+ *

Callers: + *

    + *
  • Boot-time reconciler pass on a non-init OAP — the operator declared + * {@code init=false}, so this OAP must not perform DDL but must refuse to start if + * the backend isn't already in the shape the persisted runtime-rule catalog + * declares.
  • + *
+ *

Backend behaviour: read-only inspection. The installer issues the same metadata + * read RPCs as {@link Mode#CREATE_IF_ABSENT} but never invokes create / update / drop. On + * a missing resource or a shape mismatch the installer throws; the exception propagates up + * through the module bootstrap and causes the OAP process to exit, which under k8s + * results in a pod crash loop until either the init OAP has caught up or the operator has + * fixed the rule files. This matches general OAP boot semantics for static models in + * non-init mode: the OAP will not silently start with a backend that disagrees with + * what's declared. Local {@code MetadataRegistry} is populated only when the live shape + * matches the declared shape. + * + *

{@link #localCacheOnly()} — {@link Mode#LOCAL_CACHE_ONLY} (predicate: {@link #isLocalCacheOnly()})

+ *

Callers: + *

    + *
  • Peer-node reconciler tick (peer is not the hash-selected main for this file — + * main owns server-side DDL)
  • + *
  • Main-node REST {@code /inactivate} — soft-pause: backend schema + data are + * preserved, only OAP-internal state (compiled bundles, dispatch, prototypes) is + * torn down so re-activation on the next {@code /addOrUpdate} is cheap
  • + *
+ *

Backend behaviour: zero server RPCs. {@code BanyanDBIndexInstaller.isExists} + * short-circuits: {@code parseMetadata} + populate {@code MetadataRegistry} + return + * {@code allExist=true}. {@code whenCreating} / {@code whenRemoving} record + * {@link Outcome#SKIPPED_NOT_ALLOWED SKIPPED_NOT_ALLOWED} outcomes instead of firing + * {@code createTable} / {@code dropTable}. Peer's local MeterSystem still compiles + * Metrics classes and populates {@code meterPrototypes} — that's pure in-JVM work the + * opt doesn't (and shouldn't) gate. Differs from {@link Mode#LOCAL_CACHE_VERIFY} in two + * ways: no server RPCs (cache populates from local model), and missing / mismatched + * resources are not a fatal error (the next tick will retry, or the + * main will catch up). + * + *
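The four profiles above can be read as a small decision table. The sketch below is a simplified illustration under stated assumptions, not the installer code from this patch: `ModeSketch`, `decide`, and the two boolean inputs are hypothetical names, and outcomes are plain strings that borrow the `Outcome` labels described in this Javadoc.

```java
// Hypothetical, simplified model of the four profiles; not the classes from this patch.
final class ModeSketch {
    enum Mode { FULL_INSTALL, CREATE_IF_ABSENT, LOCAL_CACHE_VERIFY, LOCAL_CACHE_ONLY }

    /**
     * Returns the outcome label one backend resource would receive.
     * 'exists' and 'shapeMatches' stand in for the result of a backend inspection.
     */
    static String decide(Mode mode, boolean exists, boolean shapeMatches) {
        switch (mode) {
            case LOCAL_CACHE_ONLY:
                // Zero server RPCs: the inspection inputs are never consulted.
                return "SKIPPED_NOT_ALLOWED";
            case LOCAL_CACHE_VERIFY:
                // Read-only inspection; absence or mismatch is fatal at boot.
                if (!exists || !shapeMatches) {
                    throw new IllegalStateException("backend not ready; refusing to start");
                }
                return "EXISTING_MATCHED";
            case CREATE_IF_ABSENT:
                // Create what is missing, but never reshape an existing resource.
                if (!exists) {
                    return "CREATED";
                }
                return shapeMatches ? "EXISTING_MATCHED" : "SKIPPED_SHAPE_MISMATCH";
            case FULL_INSTALL:
                // Operator-driven path: creation and reshaping are both allowed.
                if (!exists) {
                    return "CREATED";
                }
                return shapeMatches ? "EXISTING_MATCHED" : "UPDATED";
            default:
                throw new AssertionError(mode);
        }
    }

    public static void main(String[] args) {
        // Same backend state (present, wrong shape) seen through each profile.
        for (Mode m : Mode.values()) {
            try {
                System.out.println(m + " -> " + decide(m, true, false));
            } catch (IllegalStateException e) {
                System.out.println(m + " -> throws: " + e.getMessage());
            }
        }
    }
}
```

Running `main` feeds one backend state (resource present, shape diverged) through all four profiles, which makes the difference between skip, fail, and reshape visible side by side.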

Why it's a single mutable object instead of separate policy/result

+ * Installers can nest many resource operations per Model (BanyanDB Measure + N index rules + * + binding + optional TopN; ES template + current index). The call chain passes one + * object; the installer appends outcomes as each resource is examined. The caller reads + * {@link #getOutcomes()} after the call returns to log or report. + */ +public final class StorageManipulationOpt { + + /** + * Storage-manipulation mode. The installer branches once on this value to decide whether + * server-side DDL (create / drop / update) is allowed. See the class Javadoc for the + * scenario each mode covers. + */ + public enum Mode { + /** + * Main-node on-demand path. Installer performs full DDL: create absent resources, + * detect shape mismatch and apply the additive subset each backend supports + * online ({@code client.update} for BanyanDB, add-column for JDBC, mapping append + * for ES). Reshape is treated as intended because the caller explicitly asked + * for it via the operator REST endpoint. + */ + FULL_INSTALL(Flags.builder() + .inspectBackend(true) + .createMissing(true) + .updateOnMismatch(true) + .dropOnRemoval(true) + .escalateToCaller(true) + .build()), + /** + * Static boot path on an init-mode OAP. Installer creates absent resources, but + * if a resource already exists with a shape that diverges from the declared + * model it records {@link Outcome#SKIPPED_SHAPE_MISMATCH} and does not + * call update / reshape. Operator must reconcile via the runtime-rule REST + * endpoint — boot is not allowed to silently mutate backend shape. + */ + CREATE_IF_ABSENT(Flags.builder() + .inspectBackend(true) + .createMissing(true) + .build()), + /** + * Boot path on a non-init OAP. Installer issues the same read-only inspection + * RPCs as {@link #CREATE_IF_ABSENT} but never creates / updates / drops. On + * resource missing or shape mismatch the installer throws; the + * exception propagates up through module bootstrap and exits the process. 
+ * Under k8s this causes a pod crash loop until the init OAP has caught up or the + * operator has aligned rule files with the backend. Local {@code MetadataRegistry} + * is populated only when the live shape matches the declared shape. + */ + LOCAL_CACHE_VERIFY(Flags.builder() + .inspectBackend(true) + .failOnAbsence(true) + .failOnShapeMismatch(true) + .build()), + /** + * Peer-node reconciler tick path, also used by the main node for {@code /inactivate}. + * Zero server RPCs: local caches populate from + * the declared model and the main is trusted to own backend DDL. Missing or + * mismatched resources are not an error: the next tick will retry, and the main + * will eventually converge. Distinct from {@link #LOCAL_CACHE_VERIFY} in that + * verification is skipped entirely, not run-and-fail. + */ + LOCAL_CACHE_ONLY(Flags.builder().build()); + + @Getter + private final Flags flags; + + Mode(final Flags flags) { + this.flags = flags; + } + } + + /** + * Per-mode behavioural flags. Each control point in the install / remove pipeline + * checks one flag instead of branching on the {@link Mode} value, so adding a new + * mode is a matter of choosing flag values rather than auditing every {@code if + * (opt.isXxx())} site. Flags are immutable and shared across all opts of the same + * mode. + * + *

Each flag describes a distinct privilege the installer is granted by the + * caller. They are independently composable on paper, but the canonical + * combinations live on {@link Mode} — call sites should never construct a + * {@code Flags} directly.
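The flags-on-the-enum arrangement described here can be sketched as follows. This is a trimmed-down illustration, assuming a hand-written constructor in place of the Lombok builder and showing only three of the seven flags; `FlagsSketch` and `onRemove` are hypothetical names, while the flag values per mode follow the `Mode` constants in this file.

```java
// Hypothetical, trimmed-down mirror of the Mode/Flags pairing; only three flags are shown.
final class FlagsSketch {
    /** Immutable privilege set; one shared instance per mode. */
    static final class Flags {
        final boolean inspectBackend;
        final boolean createMissing;
        final boolean dropOnRemoval;

        Flags(boolean inspectBackend, boolean createMissing, boolean dropOnRemoval) {
            this.inspectBackend = inspectBackend;
            this.createMissing = createMissing;
            this.dropOnRemoval = dropOnRemoval;
        }
    }

    enum Mode {
        FULL_INSTALL(new Flags(true, true, true)),
        CREATE_IF_ABSENT(new Flags(true, true, false)),
        LOCAL_CACHE_VERIFY(new Flags(true, false, false)),
        LOCAL_CACHE_ONLY(new Flags(false, false, false));

        final Flags flags;

        Mode(Flags flags) {
            this.flags = flags;
        }
    }

    /** A control point reads one flag instead of switching on the mode value. */
    static String onRemove(Mode mode) {
        return mode.flags.dropOnRemoval ? "DROPPED" : "SKIPPED_NOT_ALLOWED";
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            System.out.println(m + " remove -> " + onRemove(m));
        }
    }
}
```

With this shape, a hypothetical fifth mode only needs a new enum constant with its own flag values; `onRemove` and the other control points stay untouched.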

+ */ + @Builder + @Getter + public static final class Flags { + /** + * Issue read RPCs to the backend (existence + shape compare). False on + * {@link Mode#LOCAL_CACHE_ONLY} where the contract is "zero server RPCs". When + * false the installer must populate local caches from the declared model and + * return early without inspecting the backend. + */ + private final boolean inspectBackend; + /** + * Call backend create primitives ({@code client.define}, JDBC {@code CREATE + * TABLE}, ES {@code createIndex}, BanyanDB {@code defineIndexRule} / + * {@code defineIndexRuleBinding}) when a resource is absent. + */ + private final boolean createMissing; + /** + * Call backend update primitives ({@code client.update}, JDBC {@code ALTER + * TABLE}, ES mapping append) when a present resource's live shape diverges from + * the declared shape. Only {@link Mode#FULL_INSTALL} (the operator-driven path) + * permits this — boot must never silently reshape backend storage. + * + *

Note: BanyanDB's index-rule / index-rule-binding update path is gated by + * {@link #failOnShapeMismatch} instead of this flag, preserving the long-standing + * behaviour that init-mode OAPs reconcile index rules even under + * {@link Mode#CREATE_IF_ABSENT}.

+ */ + private final boolean updateOnMismatch; + /** + * Call backend drop primitives ({@code client.dropMeasure} / {@code dropStream} + * / etc.) from {@link ModelRegistry.CreatingListener#whenRemoving}. Only + * {@link Mode#FULL_INSTALL} (operator-driven runtime-rule deletion) permits + * this; peers under {@link Mode#LOCAL_CACHE_ONLY} short-circuit with + * {@link Outcome#SKIPPED_NOT_ALLOWED}. + */ + private final boolean dropOnRemoval; + /** + * Throw a {@link org.apache.skywalking.oap.server.core.storage.StorageException} + * when a resource is absent on the backend after inspection. Used by + * {@link Mode#LOCAL_CACHE_VERIFY} to fail boot rather than silently start + * against an unprepared backend. + */ + private final boolean failOnAbsence; + /** + * Throw a {@link org.apache.skywalking.oap.server.core.storage.StorageException} + * when a present resource's live shape diverges from the declared shape. Used + * by {@link Mode#LOCAL_CACHE_VERIFY} so boot does not silently start against a + * backend whose schema disagrees with the rule file. + */ + private final boolean failOnShapeMismatch; + /** + * Re-throw cascaded backend errors to the caller (REST handler, operator + * tooling) instead of swallowing them. Set on {@link Mode#FULL_INSTALL}; other + * modes log and continue so a peer-side bookkeeping glitch doesn't take down + * the node. + */ + private final boolean escalateToCaller; + } + + @Getter + private final Mode mode; + + /** Per-resource outcomes appended as the installer examines each underlying resource. + * Read-only externally; copy-on-write so concurrent readers (e.g., metrics scrapers) + * never see a torn list. */ + private final List<ResourceOutcome> outcomes = new CopyOnWriteArrayList<>(); + + /** + * Behavioural flags for this opt. Convenience accessor, equivalent to + * {@code getMode().getFlags()}. Call sites read individual flags (e.g. + * {@code opt.getFlags().isCreateMissing()}) instead of pattern-matching on the + * {@link Mode}. 
+ */ + public Flags getFlags() { + return mode.getFlags(); + } + + public static StorageManipulationOpt fullInstall() { + return new StorageManipulationOpt(Mode.FULL_INSTALL); + } + + public static StorageManipulationOpt createIfAbsent() { + return new StorageManipulationOpt(Mode.CREATE_IF_ABSENT); + } + + public static StorageManipulationOpt localCacheVerify() { + return new StorageManipulationOpt(Mode.LOCAL_CACHE_VERIFY); + } + + public static StorageManipulationOpt localCacheOnly() { + return new StorageManipulationOpt(Mode.LOCAL_CACHE_ONLY); + } + + /** + * True for {@link Mode#FULL_INSTALL}. The on-demand operator workflow — drops, + * updates, and reshapes are permitted because the caller explicitly asked for them. + */ + public boolean isFullInstall() { + return mode == Mode.FULL_INSTALL; + } + + /** + * True for {@link Mode#CREATE_IF_ABSENT}. The static boot workflow — create absent + * resources, skip + record {@link Outcome#SKIPPED_SHAPE_MISMATCH} on a resource that + * already exists with a different shape. Never update or drop. + */ + public boolean isCreateIfAbsent() { + return mode == Mode.CREATE_IF_ABSENT; + } + + /** + * True for {@link Mode#LOCAL_CACHE_VERIFY}. Boot-time strict verification on a + * non-init OAP — installer issues read-only inspection RPCs and throws on missing or + * shape-mismatched resources. No DDL. + */ + public boolean isLocalCacheVerify() { + return mode == Mode.LOCAL_CACHE_VERIFY; + } + + /** + * True for {@link Mode#LOCAL_CACHE_ONLY}. The {@code BanyanDBIndexInstaller.isExists} + * short-circuit reads this to skip every server RPC and populate + * {@code MetadataRegistry} only. + */ + public boolean isLocalCacheOnly() { + return mode == Mode.LOCAL_CACHE_ONLY; + } + + private StorageManipulationOpt(final Mode mode) { + this.mode = mode; + } + + /** + * Highest etcd {@code mod_revision} returned by any registry write performed + * during this opt's lifetime. 
Backends that expose a global revision (BanyanDB + * via the schema-barrier service) accumulate per-write revisions here so the + * post-install fence can wait on a single value. Backends without a revision + * concept leave it at {@link #DEFAULT_MOD_REVISION} (0) and the fence is a + * no-op. + */ + private final AtomicLong maxModRevision = new AtomicLong(0L); + + /** Sentinel returned by {@link #getMaxModRevision()} when no DDL was performed. */ + public static final long DEFAULT_MOD_REVISION = 0L; + + /** + * Record an etcd mod_revision returned by a registry write. The opt keeps the + * maximum so the caller can fence on a single revision after the install pass. + */ + public void recordModRevision(final long rev) { + if (rev <= 0L) { + return; + } + maxModRevision.accumulateAndGet(rev, Math::max); + } + + /** + * Highest mod_revision recorded so far, or {@link #DEFAULT_MOD_REVISION} if no + * write produced one. Callers that need to fence subsequent data writes / + * queries against the new schema pass this to + * {@code SchemaWatcher.awaitRevisionApplied}. + */ + public long getMaxModRevision() { + return maxModRevision.get(); + } + + /** + * Append a per-resource outcome. Called by the installer as it examines each + * underlying storage resource. + */ + public void recordOutcome(final String resourceType, final String resourceName, + final Outcome status, final String diff) { + outcomes.add(new ResourceOutcome(resourceType, resourceName, status, diff)); + } + + /** Read-only view of outcomes recorded so far, in the order the installer visited them. */ + public List<ResourceOutcome> getOutcomes() { + return Collections.unmodifiableList(outcomes); + } + + /** + * True when every recorded outcome is benign: the resource matched or was created / + * updated / dropped per policy. False when any outcome is {@link Outcome#MISSING}, + * {@link Outcome#EXISTING_MISMATCH}, {@link Outcome#SKIPPED_NOT_ALLOWED}, or + * {@link Outcome#SKIPPED_SHAPE_MISMATCH}. 
+ */ + public boolean isAllOk() { + for (final ResourceOutcome o : outcomes) { + switch (o.getStatus()) { + case MISSING: + case EXISTING_MISMATCH: + case SKIPPED_NOT_ALLOWED: + case SKIPPED_SHAPE_MISMATCH: + return false; + default: + break; + } + } + return true; + } + + /** + * True when at least one recorded outcome is {@link Outcome#SKIPPED_SHAPE_MISMATCH}. + * Callers (notably {@code MeterSystem.create} / {@code StorageModels.add}) read this + * after firing the {@code whenCreating} chain to decide whether to proceed with local + * registration or roll it back — a shape-mismatched metric must not be registered + * because its backend-declared schema disagrees with what's declared in the rule file. + */ + public boolean hasShapeMismatch() { + for (final ResourceOutcome o : outcomes) { + if (o.getStatus() == Outcome.SKIPPED_SHAPE_MISMATCH) { + return true; + } + } + return false; + } + + /** + * First {@link Outcome#SKIPPED_SHAPE_MISMATCH} outcome recorded, or {@code null} if + * none. Used to surface the diff on {@code /runtime/rule/list} and in error responses. + */ + public ResourceOutcome firstShapeMismatch() { + for (final ResourceOutcome o : outcomes) { + if (o.getStatus() == Outcome.SKIPPED_SHAPE_MISMATCH) { + return o; + } + } + return null; + } + + public enum Outcome { + /** The resource was not present on storage and creation was either not attempted + * (policy) or deferred to a later step in the chain. */ + MISSING, + /** Resource present and matches the intended shape. No action taken. */ + EXISTING_MATCHED, + /** Resource present but live shape differs from intended; update was NOT applied + * because the caller is in {@link Mode#LOCAL_CACHE_ONLY}. Caller may re-push with + * {@link #fullInstall()} to reconcile. {@link ResourceOutcome#getDiff()} carries + * a short description of the difference. */ + EXISTING_MISMATCH, + /** Installer ran {@code createTable} (or equivalent) and the resource now exists. 
*/ + CREATED, + /** Installer ran {@code client.update} (BanyanDB) or mapping-append (ES) to + * reconcile live shape with intended. {@link ResourceOutcome#getDiff()} carries + * a short description of what was updated. */ + UPDATED, + /** Installer ran {@code dropTable} and the resource is no longer present. */ + DROPPED, + /** Installer intended to act (create, drop, update) but was blocked by policy. + * {@link ResourceOutcome#getDiff()} carries the reason. */ + SKIPPED_NOT_ALLOWED, + /** Boot-time shape mismatch: the backend already holds a resource with the same name + * but a different shape than the declared model. The installer did NOT drop, did + * NOT update, and did NOT register; the operator must reconcile explicitly via + * the on-demand runtime-rule {@code /addOrUpdate} endpoint (the only workflow that + * may change backend schema). {@link ResourceOutcome#getDiff()} carries the + * declared-vs-backend diff for operator inspection. */ + SKIPPED_SHAPE_MISMATCH + } + + @Getter + public static final class ResourceOutcome { + /** Short label for the underlying resource kind, e.g. "measure", "stream", + * "property", "indexRule", "indexRuleBinding", "topN", "template", "index", + * "table", "additionalTable". Operator-facing; kept lower camel case. */ + private final String resourceType; + /** Fully-qualified resource name as the backend sees it (group + name for + * BanyanDB; index name for ES; table name for JDBC). */ + private final String resourceName; + private final Outcome status; + /** Non-null on {@link Outcome#EXISTING_MISMATCH}, {@link Outcome#UPDATED}, + * {@link Outcome#SKIPPED_NOT_ALLOWED}, and {@link Outcome#SKIPPED_SHAPE_MISMATCH}; may also + * carry a reason on {@link Outcome#MISSING}. Null otherwise. 
*/ + private final String diff; + + public ResourceOutcome(final String resourceType, final String resourceName, + final Outcome status, final String diff) { + this.resourceType = resourceType; + this.resourceName = resourceName; + this.status = status; + this.diff = diff; + } + + @Override + public String toString() { + final StringBuilder sb = new StringBuilder(); + sb.append(resourceType).append('(').append(resourceName).append(")=").append(status); + if (diff != null) { + sb.append("[").append(diff).append("]"); + } + return sb.toString(); + } + } +} \ No newline at end of file diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageModels.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageModels.java index a5f1d2c7226c..49dcc226b52f 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageModels.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageModels.java @@ -34,19 +34,28 @@ import java.util.ArrayList; import java.util.Collections; import java.util.HashMap; +import java.util.Iterator; import java.util.List; import java.util.Map; import java.util.Objects; +import java.util.concurrent.locks.ReentrantLock; import org.apache.skywalking.oap.server.library.util.StringUtil; /** * StorageModels manages all models detected by the core. + * + *
<p>
Concurrency: the {@code models} and {@code listeners} lists are guarded by {@link #lock}. Mutations + * (add/remove) acquire the lock, snapshot the listener list, release the lock, and then invoke listener + * callbacks. Callbacks may do I/O (DDL) and must not block unrelated {@code add} calls coming from other + * threads (e.g. a late-loading module's startup). Read-only API ({@link #allModels()}) returns an + * unmodifiable snapshot taken under the lock. */ @Slf4j -public class StorageModels implements IModelManager, ModelCreator, ModelManipulator { +public class StorageModels implements IModelManager, ModelRegistry, ModelManipulator { private final List models; private final HashMap columnNameOverrideRule; private final List listeners; + private final ReentrantLock lock = new ReentrantLock(); public StorageModels() { this.models = new ArrayList<>(); @@ -55,7 +64,7 @@ public StorageModels() { } @Override - public Model add(Class aClass, int scopeId, Storage storage) throws StorageException { + public Model add(Class aClass, int scopeId, Storage storage, StorageManipulationOpt opt) throws StorageException { // Check this scope id is valid. DefaultScopeDefine.nameOf(scopeId); @@ -174,12 +183,139 @@ public Model add(Class aClass, int scopeId, Storage storage) throws StorageEx ); this.followColumnNameRules(model); - models.add(model); - for (final CreatingListener listener : listeners) { - listener.whenCreating(model); + final List listenersSnapshot; + final Model finalModel; + lock.lock(); + try { + // Dedup by (model name, downsampling). Two registrations with the same logical identity are idempotent — + // the first caller wins, the second receives the existing model reference and no listener fires again. + // Required for runtime-rule hot-update, where a remove followed by an add of the same metric name would + // otherwise append a duplicate entry to the internal list. 
+ Model existing = null; + for (Model m : models) { + if (m.getName().equals(model.getName()) && m.getDownsampling() == model.getDownsampling()) { + existing = m; + break; + } + } + if (existing != null) { + return existing; + } + models.add(model); + listenersSnapshot = new ArrayList<>(listeners); + finalModel = model; + } finally { + lock.unlock(); + } + + // If a listener (e.g. the BanyanDB / ES installer) throws while creating the + // backing measure / index / table, roll the model out of the registry before + // letting the exception propagate. Without this, the model stays in `models` and + // the dedup check above short-circuits future retries: the listener never fires + // again for this model, and the storage stays half-built. The model is added to + // the registry before the listeners fire and stays there only if all of them succeed. + boolean committed = false; + try { + for (final CreatingListener listener : listenersSnapshot) { + listener.whenCreating(finalModel, opt); + } + committed = true; + } finally { + if (!committed) { + lock.lock(); + try { + models.remove(finalModel); + } finally { + lock.unlock(); + } + } + } + return finalModel; + } + + /** + * Remove every model registered through {@link #add(Class, int, Storage, StorageManipulationOpt)} whose stream + * class equals {@code streamClass}. Cascades across downsampling variants (Hour / Day / + * Minute). Backend drop operations (BanyanDB delete-measure) run inside listener + * {@link CreatingListener#whenRemoving} BEFORE the model is removed from the registry, + * so a transient backend-drop failure leaves the model in place; the caller (typically + * runtime-rule {@code /inactivate} or the reconciler tick) sees the throw and can + * re-attempt the whole tear-down. Removing the model first would leave the registry + * out of sync with the backend (model gone, measure still present) and the next + * {@code remove} call's iteration would find nothing to drop.
+ * + * @return the list of models that were removed, in the order they were discovered. Empty + * if no model matched. If a listener throws, that listener's drop is left in an + * unknown state on the corresponding backend; this method propagates the first + * such error after attempting every listener × model pair (so partial successes + * on one backend don't block successes on another). + */ + @Override + public List remove(Class streamClass, StorageManipulationOpt opt) throws StorageException { + // Snapshot matching models without yet removing them. Backend cascade first, registry + // mutation only after every listener succeeds. + final List matching; + final List listenersSnapshot; + lock.lock(); + try { + matching = new ArrayList<>(); + for (final Model m : models) { + if (Objects.equals(m.getStreamClass(), streamClass)) { + matching.add(m); + } + } + listenersSnapshot = new ArrayList<>(listeners); + } finally { + lock.unlock(); + } + + StorageException firstError = null; + for (final Model m : matching) { + for (final CreatingListener listener : listenersSnapshot) { + try { + listener.whenRemoving(m, opt); + } catch (StorageException e) { + log.error("Listener {} failed to handle whenRemoving({})", listener.getClass().getName(), m.getName(), e); + if (firstError == null) { + firstError = e; + } + } + } + } + if (firstError != null) { + // Leave models in the registry — backend state is uncertain on at least one + // listener, so the next retry needs to find them and re-fire whenRemoving. + // Listeners are required to be idempotent on the drop path (BanyanDB's + // delete-measure on a non-existent measure is a no-op; ES / JDBC dropTable for + // management data is a documented no-op). + throw firstError; } - return model; + // All listeners succeeded for every matching model — drop them from the registry. 
+ // Identity-based removal: avoids the {@code @EqualsAndHashCode} on {@link Model} + // matching a concurrent fresh add of the same stream class with different field + // combinations. The matching list was captured under the lock; identity stays stable. + // Also drop the corresponding ValueColumnMetadata entry so a subsequent re-register + // under a different scope (runtime-rule SHAPE-BREAK: SERVICE → SERVICE_INSTANCE) is + // not silently ignored by {@code ValueColumnMetadata.putIfAbsent} — without this the + // metric catalog (used by listMetrics / MQE entity resolution) would keep the old + // scope and queries would target the wrong entity_id. + lock.lock(); + try { + for (final Model target : matching) { + final Iterator it = models.iterator(); + while (it.hasNext()) { + if (it.next() == target) { + it.remove(); + break; + } + } + ValueColumnMetadata.INSTANCE.remove(target.getName()); + } + } finally { + lock.unlock(); + } + return matching; } private boolean isSuperDatasetModel(Class aClass) { @@ -187,14 +323,26 @@ private boolean isSuperDatasetModel(Class aClass) { } /** - * CreatingListener listener could react when {@link ModelCreator#add(Class, int, Storage)} model happens. Also, the + * CreatingListener listener could react when {@link ModelRegistry#add(Class, int, Storage, StorageManipulationOpt)} model happens. Also, the * added models are being notified in this add operation. */ @Override public void addModelListener(final CreatingListener listener) throws StorageException { - listeners.add(listener); - for (Model model : models) { - listener.whenCreating(model); + final List modelsSnapshot; + lock.lock(); + try { + listeners.add(listener); + modelsSnapshot = new ArrayList<>(models); + } finally { + lock.unlock(); + } + // A late-registering listener catches up on every previously-added model. 
These + // models were added with their original caller's policy; the listener now receives + // them under createIfAbsent() because this catch-up is boot-time model registration, + // not an on-demand operator reshape — we want the same "create-if-absent + report + // shape mismatch" semantics, never auto-reshape. + for (Model model : modelsSnapshot) { + listener.whenCreating(model, StorageManipulationOpt.createIfAbsent()); } } @@ -340,8 +488,15 @@ private void retrieval(final Class clazz, @Override public void overrideColumnName(String columnName, String newName) { - columnNameOverrideRule.put(columnName, newName); - models.forEach(this::followColumnNameRules); + final List modelsSnapshot; + lock.lock(); + try { + columnNameOverrideRule.put(columnName, newName); + modelsSnapshot = new ArrayList<>(models); + } finally { + lock.unlock(); + } + modelsSnapshot.forEach(this::followColumnNameRules); ValueColumnMetadata.INSTANCE.overrideColumnName(columnName, newName); } @@ -373,7 +528,12 @@ private boolean addExtraColumn4AdditionalEntity(SQLDatabaseModelExtension sqlDBM @Override public List allModels() { - return models; + lock.lock(); + try { + return Collections.unmodifiableList(new ArrayList<>(models)); + } finally { + lock.unlock(); + } } private TraceIndexRule createTraceIndexRule(Class aClass, BanyanDB.Trace.IndexRule indexRuleColumns) { diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/worker/IWorkerInstanceSetter.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/worker/IWorkerInstanceSetter.java index 88b84578befd..babf96b7ed2e 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/worker/IWorkerInstanceSetter.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/worker/IWorkerInstanceSetter.java @@ -35,4 +35,13 @@ public interface IWorkerInstanceSetter extends Service { */ void put(String remoteReceiverWorkName, AbstractWorker 
instance, MetricStreamKind kind, Class streamDataClass); + + /** + * Remove the registration for {@code remoteReceiverWorkName}. Idempotent — a no-op when + * no such key exists. Required for the runtime-rule hot-remove path: without it, a + * subsequent {@link #put} with the same name throws "Duplicate worker name" and blocks + * any metric re-registration (shape-break remove+apply, operator recovery push via + * {@code /addOrUpdate?force=true}, etc.). + */ + void remove(String remoteReceiverWorkName); } diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/worker/WorkerInstancesService.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/worker/WorkerInstancesService.java index e0347aa0bc3d..19b9cf188b8a 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/worker/WorkerInstancesService.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/worker/WorkerInstancesService.java @@ -52,4 +52,12 @@ public void put(String remoteReceiverWorkName, AbstractWorker instance, instances.put(remoteReceiverWorkName, new RemoteHandleWorker(instance, kind, streamDataClass)); LOGGER.debug("Worker {} has been registered as {}", instance.toString(), remoteReceiverWorkName); } + + @Override + public void remove(String remoteReceiverWorkName) { + final RemoteHandleWorker removed = instances.remove(remoteReceiverWorkName); + if (removed != null) { + LOGGER.debug("Worker {} has been deregistered", remoteReceiverWorkName); + } + } } diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystemTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystemTest.java index 7055511c2a43..1e5bc6d4e93f 100644 --- a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystemTest.java +++ 
b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystemTest.java @@ -52,7 +52,10 @@ public void setup() throws Exception { processorMock = Mockito.mock(MetricsStreamProcessor.class); mockedProcessor = Mockito.mockStatic(MetricsStreamProcessor.class); mockedProcessor.when(MetricsStreamProcessor::getInstance).thenReturn(processorMock); - doNothing().when(processorMock).create(any(), (StreamDefinition) any(), any()); + // MetricsStreamProcessor.create now takes a StorageManipulationOpt on every path — + // MeterSystem.createInternal threads the opt through so the shape-mismatch gate at + // the installer level can surface to the metric-registration path. + doNothing().when(processorMock).create(any(), (StreamDefinition) any(), any(), any()); } @AfterEach diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementPersistentWorkerTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementPersistentWorkerTest.java new file mode 100644 index 000000000000..eb72b0d042f2 --- /dev/null +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementPersistentWorkerTest.java @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.core.analysis.worker; + +import java.io.IOException; +import org.apache.skywalking.oap.server.core.analysis.management.ManagementData; +import org.apache.skywalking.oap.server.core.storage.IManagementDAO; +import org.apache.skywalking.oap.server.core.storage.model.Model; +import org.apache.skywalking.oap.server.library.module.ModuleDefineHolder; +import org.junit.jupiter.api.Test; + +import static org.mockito.ArgumentMatchers.any; +import static org.mockito.Mockito.doThrow; +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.verify; + +class ManagementPersistentWorkerTest { + + @Test + void inSwallowsIoExceptionForAsyncCallers() throws IOException { + // Async path — in() catches IOException and only logs. UITemplate / UIMenu callers + // accept this fire-and-forget contract; their DAOs swallow duplicate-row writes + // anyway. Runtime-rule moved off this path because it requires persist-is-commit + // semantics; that contract now lives on RuntimeRuleManagementDAO.save instead. + final ModuleDefineHolder holder = mock(ModuleDefineHolder.class); + final IManagementDAO dao = mock(IManagementDAO.class); + final Model model = mock(Model.class); + final ManagementPersistentWorker worker = + new ManagementPersistentWorker(holder, model, dao); + final ManagementData data = mock(ManagementData.class); + doThrow(new IOException("db unavailable")).when(dao).insert(any(), any()); + + // Should NOT throw — caller that uses in() explicitly accepts fire-and-forget + // semantics; assertion is just "does not propagate the IOException". + worker.in(data); + + // Verify the DAO call did happen, so we know we exercised the swallow path rather + // than short-circuiting before insert. 
+ verify(dao).insert(model, data); + } +} diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessorSuspendTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessorSuspendTest.java new file mode 100644 index 000000000000..7b4ba357f899 --- /dev/null +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessorSuspendTest.java @@ -0,0 +1,191 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.core.analysis.worker; + +import java.lang.reflect.Field; +import java.util.Map; +import org.apache.skywalking.oap.server.core.analysis.metrics.Metrics; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertNotNull; +import static org.junit.jupiter.api.Assertions.assertNull; +import static org.junit.jupiter.api.Assertions.assertSame; +import static org.junit.jupiter.api.Assertions.assertTrue; +import static org.mockito.Mockito.mock; + +/** + * Targeted unit coverage for the Suspend/Resume primitives on {@link MetricsStreamProcessor} + * that the runtime-rule hot-update path depends on. The method bodies are bare map + * operations, so we seed the processor's internal {@code entryWorkers} map via reflection + * — avoids standing up a full OAP module graph just to exercise two put/get calls. + * + *

<p>Regression targets:
+ * <ul>
+ *   <li>{@code suspendDispatch} moves the entry worker from the live map to the parked map
+ *       atomically; {@code resumeDispatch} inverts. The same {@code MetricsAggregateWorker}
+ *       instance round-trips: its buffered state (merge map, lastSendTime) is preserved
+ *       across the pause, which is the whole reason Suspend exists.</li>
+ *   <li>{@code removeMetric} on a currently-parked class still drains: previously the parked
+ *       worker was discarded on removal without running through the drain path, orphaning
+ *       L1/L2 state. The fix pulls the worker from {@code suspendedWorkers} as a fallback
+ *       and feeds the same drain-and-deregister sequence.</li>
+ *   <li>{@code isDispatchSuspended} correctly reflects the parked state.</li>
+ * </ul>
+ *
+ * <p>
Why reflection rather than constructing the processor normally: the public + * {@code create} path requires a full {@code ModuleDefineHolder} with Core + Storage + + * Telemetry services, a Stream annotation, a Storage-builder factory, and a live + * {@code IMetricsDAO}. The primitive-level behaviour we care about here is independent of + * any of that — it's {@code Map} bookkeeping plus the existing drain path, + * which we stub by leaving the worker as a Mockito mock so the drain calls resolve to + * no-op defaults without needing real L1/L2 infrastructure. + */ +class MetricsStreamProcessorSuspendTest { + + private MetricsStreamProcessor processor; + + @BeforeEach + void setUp() { + processor = MetricsStreamProcessor.getInstance(); + // Reset both maps between tests — the processor is a JVM singleton and prior tests + // (or production code paths invoked during class loading) may have left entries. + clearMap("entryWorkers"); + clearMap("suspendedWorkers"); + } + + @Test + void suspendMovesWorkerToParkedMap() throws Exception { + final MetricsAggregateWorker worker = mock(MetricsAggregateWorker.class); + seedEntryWorker(TestMetricsA.class, worker); + + final boolean suspended = processor.suspendDispatch(TestMetricsA.class); + + assertTrue(suspended, "suspendDispatch should return true when an entry worker was parked"); + assertTrue(processor.isDispatchSuspended(TestMetricsA.class)); + // Worker no longer in entryWorkers, present in suspendedWorkers. 
+ assertNull(readMap("entryWorkers").get(TestMetricsA.class), + "entry worker must be cleared from entryWorkers after suspend"); + assertSame(worker, readMap("suspendedWorkers").get(TestMetricsA.class), + "suspend must park the same worker instance — its internal state (merge buffer, " + + "lastSendTime) is what we need to preserve across the pause"); + } + + @Test + void suspendReturnsFalseWhenNotRegistered() { + final boolean suspended = processor.suspendDispatch(TestMetricsA.class); + + assertFalse(suspended, "suspendDispatch on a never-registered class is a no-op"); + assertFalse(processor.isDispatchSuspended(TestMetricsA.class)); + } + + @Test + void resumeRestoresWorkerToEntryMap() throws Exception { + final MetricsAggregateWorker worker = mock(MetricsAggregateWorker.class); + seedEntryWorker(TestMetricsA.class, worker); + processor.suspendDispatch(TestMetricsA.class); + + final boolean resumed = processor.resumeDispatch(TestMetricsA.class); + + assertTrue(resumed); + assertFalse(processor.isDispatchSuspended(TestMetricsA.class)); + assertSame(worker, readMap("entryWorkers").get(TestMetricsA.class), + "resume must re-install the same worker instance into entryWorkers"); + assertNull(readMap("suspendedWorkers").get(TestMetricsA.class)); + } + + @Test + void resumeWithoutSuspendIsNoOp() { + assertFalse(processor.resumeDispatch(TestMetricsA.class), + "resumeDispatch on a class that was never suspended returns false"); + } + + @Test + void suspendResumeRoundTripPreservesWorkerIdentity() throws Exception { + // Multi-cycle: pause, resume, pause again. Same worker instance throughout — an + // operator doing two back-to-back structural applies that each briefly suspend the + // same bundle must not lose the L1 merge buffer. 
+ final MetricsAggregateWorker worker = mock(MetricsAggregateWorker.class); + seedEntryWorker(TestMetricsA.class, worker); + + assertTrue(processor.suspendDispatch(TestMetricsA.class)); + assertTrue(processor.resumeDispatch(TestMetricsA.class)); + assertTrue(processor.suspendDispatch(TestMetricsA.class)); + + assertSame(worker, readMap("suspendedWorkers").get(TestMetricsA.class)); + } + + @Test + void differentClassesAreIndependent() throws Exception { + // Suspending one metric class must not affect another class's dispatch. + final MetricsAggregateWorker workerA = mock(MetricsAggregateWorker.class); + final MetricsAggregateWorker workerB = mock(MetricsAggregateWorker.class); + seedEntryWorker(TestMetricsA.class, workerA); + seedEntryWorker(TestMetricsB.class, workerB); + + processor.suspendDispatch(TestMetricsA.class); + + assertTrue(processor.isDispatchSuspended(TestMetricsA.class)); + assertFalse(processor.isDispatchSuspended(TestMetricsB.class)); + assertSame(workerB, readMap("entryWorkers").get(TestMetricsB.class), + "B's entry worker must still be live"); + } + + // ---- helpers -------------------------------------------------------------------------- + + @SuppressWarnings("unchecked") + private Map<Class<? extends Metrics>, MetricsAggregateWorker> readMap(final String fieldName) { + try { + final Field f = MetricsStreamProcessor.class.getDeclaredField(fieldName); + f.setAccessible(true); + return (Map<Class<? extends Metrics>, MetricsAggregateWorker>) f.get(processor); + } catch (final ReflectiveOperationException e) { + throw new AssertionError("unable to access " + fieldName, e); + } + } + + private void seedEntryWorker(final Class<? extends Metrics> metricsClass, + final MetricsAggregateWorker worker) { + final Map<Class<? extends Metrics>, MetricsAggregateWorker> entryWorkers = readMap("entryWorkers"); + assertNotNull(entryWorkers); + entryWorkers.put(metricsClass, worker); + } + + private void clearMap(final String fieldName) { + readMap(fieldName).clear(); + } + + // Two distinct Metrics subclasses for the independence test.
Bodies are irrelevant — + // we use them only as map keys. + private abstract static class TestMetricsA extends Metrics { + } + + private abstract static class TestMetricsB extends Metrics { + } + + @SuppressWarnings("unused") + private static int useAssertEquals() { + // Anchors the assertEquals import for any future assertion that wants it without + // triggering an import-removal on my next edit. + assertEquals(1, 1); + return 0; + } +} diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManagerTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManagerTest.java new file mode 100644 index 000000000000..6c07e4e27b25 --- /dev/null +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManagerTest.java @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.core.classloader; + +import java.util.Optional; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertSame; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Singleton lifecycle tests for {@link DSLClassLoaderManager}. Each test uses a unique rule + * name so concurrent test execution doesn't collide on the shared singleton state. + */ +class DSLClassLoaderManagerTest { + + @Test + void newBuilderDoesNotInstallUntilCommit() { + // The split between newBuilder (mint) and commit (promote-to-active) exists so a + // failed compile cannot displace the live loader. Confirm the contract: after + // newBuilder alone, active() is still empty for this key. + final String rule = "build-no-install-" + System.nanoTime(); + final RuleClassLoader fresh = DSLClassLoaderManager.INSTANCE.newBuilder( + Catalog.OTEL_RULES, rule, DSLClassLoaderManager.Kind.RUNTIME, "h1"); + assertFalse(DSLClassLoaderManager.INSTANCE.active(Catalog.OTEL_RULES, rule).isPresent(), + "newBuilder must not install the loader as active"); + DSLClassLoaderManager.INSTANCE.commit(fresh); + assertSame(fresh, DSLClassLoaderManager.INSTANCE.active(Catalog.OTEL_RULES, rule).get()); + + DSLClassLoaderManager.INSTANCE.dropRuntime(Catalog.OTEL_RULES, rule); + } + + @Test + void commitReplacesPriorAndReturnsIt() { + final String rule = "commit-replace-" + System.nanoTime(); + final RuleClassLoader first = DSLClassLoaderManager.INSTANCE.newBuilder( + Catalog.OTEL_RULES, rule, DSLClassLoaderManager.Kind.RUNTIME, "h1"); + DSLClassLoaderManager.INSTANCE.commit(first); + + final RuleClassLoader second = DSLClassLoaderManager.INSTANCE.newBuilder( + Catalog.OTEL_RULES, rule, DSLClassLoaderManager.Kind.RUNTIME, "h2"); + final Optional displaced = DSLClassLoaderManager.INSTANCE.commit(second); + 
assertTrue(displaced.isPresent()); + assertSame(first, displaced.get(), "commit must return the prior loader for retire decisions"); + assertSame(second, DSLClassLoaderManager.INSTANCE.active(Catalog.OTEL_RULES, rule).get()); + + DSLClassLoaderManager.INSTANCE.dropRuntime(Catalog.OTEL_RULES, rule); + } + + @Test + void dropRuntimeReturnsActiveAndClearsEntry() { + final String rule = "drop-runtime-" + System.nanoTime(); + final RuleClassLoader loader = DSLClassLoaderManager.INSTANCE.newBuilder( + Catalog.LAL, rule, DSLClassLoaderManager.Kind.RUNTIME, "h"); + DSLClassLoaderManager.INSTANCE.commit(loader); + + final Optional dropped = DSLClassLoaderManager.INSTANCE.dropRuntime( + Catalog.LAL, rule); + assertTrue(dropped.isPresent()); + assertFalse(DSLClassLoaderManager.INSTANCE.active(Catalog.LAL, rule).isPresent()); + } + + @Test + void dropRuntimeOnAbsentKeyReturnsEmpty() { + final String rule = "drop-absent-" + System.nanoTime(); + assertFalse(DSLClassLoaderManager.INSTANCE.dropRuntime(Catalog.LAL, rule).isPresent()); + } + + @Test + void retireGraveyardsAnExternallyHeldLoader() { + final String rule = "retire-external-" + System.nanoTime(); + final RuleClassLoader loader = DSLClassLoaderManager.INSTANCE.newBuilder( + Catalog.LOG_MAL_RULES, rule, DSLClassLoaderManager.Kind.RUNTIME, "h"); + final int before = DSLClassLoaderManager.INSTANCE.pendingCount(); + DSLClassLoaderManager.INSTANCE.retire(loader); + assertEquals(before + 1, DSLClassLoaderManager.INSTANCE.pendingCount(), + "retire should move the loader into the graveyard's pending set"); + + // Strong ref retained for the duration of the test so phantom can't enqueue — + // pendingCount must stay elevated relative to the pre-test reading. 
+ assertSame(Catalog.LOG_MAL_RULES, loader.getCatalog()); + } + + @Test + void loaderNameKindPrefixIsConsistentWithBuildKind() { + final String rule = "kind-prefix-" + System.nanoTime(); + final RuleClassLoader runtimeLoader = DSLClassLoaderManager.INSTANCE.newBuilder( + Catalog.LAL, rule, DSLClassLoaderManager.Kind.RUNTIME, "h"); + assertTrue(runtimeLoader.getName().startsWith("runtime-rule:lal/" + rule)); + + final RuleClassLoader staticLoader = DSLClassLoaderManager.INSTANCE.newBuilder( + Catalog.LAL, rule, DSLClassLoaderManager.Kind.STATIC, "h"); + assertTrue(staticLoader.getName().startsWith("static:lal/" + rule)); + } +} diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoaderTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoaderTest.java new file mode 100644 index 000000000000..ab0df81e3678 --- /dev/null +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoaderTest.java @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.core.classloader; + +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertNotNull; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class RuleClassLoaderTest { + + @Test + void fieldsAreExposedForGraveyardAccounting() { + // The graveyard captures (kind, catalog, rule, contentHash) at retire() time — the + // loader must surface them exactly as constructed so operators can map phantom + // enqueues back to the YAML file that produced the classes. + final ClassLoader parent = Thread.currentThread().getContextClassLoader(); + final RuleClassLoader loader = new RuleClassLoader( + DSLClassLoaderManager.Kind.RUNTIME, Catalog.OTEL_RULES, "vm.yaml", + "deadbeef01234567", parent); + assertEquals(DSLClassLoaderManager.Kind.RUNTIME, loader.getKind()); + assertEquals(Catalog.OTEL_RULES, loader.getCatalog()); + assertEquals("vm.yaml", loader.getRule()); + assertEquals("deadbeef01234567", loader.getContentHash()); + } + + @Test + void runtimeKindLoaderNameHasRuntimeRulePrefix() { + // Loader-name format is observable on every log line that prints the loader; the + // prefix must distinguish a runtime override from a static fall-over at a glance. 
+ final RuleClassLoader loader = new RuleClassLoader( + DSLClassLoaderManager.Kind.RUNTIME, Catalog.LAL, "default", "h", + Thread.currentThread().getContextClassLoader()); + assertTrue(loader.getName().startsWith("runtime-rule:lal/default@"), + "expected runtime-rule prefix, got: " + loader.getName()); + } + + @Test + void staticKindLoaderNameHasStaticPrefix() { + final RuleClassLoader loader = new RuleClassLoader( + DSLClassLoaderManager.Kind.STATIC, Catalog.LOG_MAL_RULES, "service-resp", "h", + Thread.currentThread().getContextClassLoader()); + assertTrue(loader.getName().startsWith("static:log-mal-rules/service-resp@"), + "expected static prefix, got: " + loader.getName()); + } + + @Test + void nullHashIsAcceptedWithoutNpe() { + final RuleClassLoader loader = new RuleClassLoader( + DSLClassLoaderManager.Kind.RUNTIME, Catalog.OTEL_RULES, "bad.yaml", null, + Thread.currentThread().getContextClassLoader()); + assertEquals(Catalog.OTEL_RULES, loader.getCatalog()); + assertEquals("bad.yaml", loader.getRule()); + org.junit.jupiter.api.Assertions.assertNull(loader.getContentHash()); + } + + @Test + void parentDelegationResolvesParentClasses() throws Exception { + // The loader is parented to the app loader so shipped classes resolve via parent-first + // lookup. A concrete OAP class that's always on the classpath confirms that contract. 
+ final ClassLoader parent = Thread.currentThread().getContextClassLoader(); + final RuleClassLoader loader = new RuleClassLoader( + DSLClassLoaderManager.Kind.RUNTIME, Catalog.LAL, "vm", "h", parent); + final Class<?> k = loader.loadClass("org.apache.skywalking.oap.server.core.CoreModule"); + assertNotNull(k); + } +} diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/rule/ext/StaticRuleRegistryTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/rule/ext/StaticRuleRegistryTest.java new file mode 100644 index 000000000000..abeb72c9016c --- /dev/null +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/rule/ext/StaticRuleRegistryTest.java @@ -0,0 +1,100 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ * + */ + +package org.apache.skywalking.oap.server.core.rule.ext; + +import java.nio.charset.StandardCharsets; +import java.util.Optional; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class StaticRuleRegistryTest { + + private StaticRuleRegistry registry; + + @BeforeEach + void setUp() { + registry = StaticRuleRegistry.active(); + registry.clear(); + } + + @AfterEach + void tearDown() { + registry.clear(); + } + + @Test + void recordThenFindRoundtripsBytes() { + // The runtime-rule REST handler's priorContent fallback depends on this: what the + // boot extension recorded is what the handler must see back on a later lookup. + final byte[] content = "metricPrefix: foo\n".getBytes(StandardCharsets.UTF_8); + + registry.record("otel-rules", "vm", content); + + final Optional<String> out = registry.find("otel-rules", "vm"); + assertTrue(out.isPresent(), "expected recorded content to round-trip through find()"); + assertEquals("metricPrefix: foo\n", out.get()); + } + + @Test + void findReturnsEmptyForUnknownKey() { + // Classifier fallback treats Optional.empty() as "no static version exists" — this + // must be the response when we haven't recorded anything for this (catalog, name). + registry.record("otel-rules", "vm", "x".getBytes(StandardCharsets.UTF_8)); + + assertFalse(registry.find("otel-rules", "other").isPresent()); + assertFalse(registry.find("other-catalog", "vm").isPresent()); + } + + @Test + void recordIsIdempotentOnRepeatedKey() { + // Boot may read the same file twice in some test topologies (or during re-derive + // pathways). The last write wins — there is no append semantics.
+ registry.record("otel-rules", "vm", "first".getBytes(StandardCharsets.UTF_8)); + registry.record("otel-rules", "vm", "second".getBytes(StandardCharsets.UTF_8)); + + assertEquals("second", registry.find("otel-rules", "vm").orElse(null)); + } + + @Test + void nullArgsAreIgnored() { + // Defensive contract: the extension chain may invoke with a null catalog or name in + // edge cases (test doubles, mis-wired loaders). Don't NPE, and don't poison the map + // with a spurious "null:null" key that a later find() could return. + registry.record(null, "vm", "x".getBytes(StandardCharsets.UTF_8)); + registry.record("otel-rules", null, "x".getBytes(StandardCharsets.UTF_8)); + registry.record("otel-rules", "vm", null); + + assertFalse(registry.find("otel-rules", "vm").isPresent()); + assertFalse(registry.find(null, "vm").isPresent()); + assertFalse(registry.find("otel-rules", null).isPresent()); + } + + @Test + void singletonReturnsSameInstance() { + // Process-wide singleton — extension + REST handler must read from the same map. + // If this ever returns a new instance per call, the REST handler would see an + // empty registry in production. + assertTrue(StaticRuleRegistry.active() == StaticRuleRegistry.active()); + } +} diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageModelsTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageModelsTest.java index fa4c055aaa3e..8aee0e2d86b4 100644 --- a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageModelsTest.java +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageModelsTest.java @@ -61,11 +61,64 @@ public static void tearDown() { DEFAULT_SCOPE_DEFINE_MOCKED_STATIC.close(); } + @Test + public void rolledBackOnListenerFailure() throws StorageException { + // A CreatingListener throw must NOT leave the model in `models`. 
Future retries + // would otherwise hit the dedup short-circuit and skip listeners entirely, leaving + // the storage stack permanently half-built. + StorageModels models = new StorageModels(); + models.addModelListener((model, opt) -> { + throw new StorageException("simulated DDL failure"); + }); + Assertions.assertThrows(StorageException.class, () -> models.add(TestModel.class, -1, + new Storage("StorageModelsRollbackTest", false, DownSampling.Hour), + StorageManipulationOpt.fullInstall())); + // Registry must not retain the model — a retry would otherwise dedup-skip the + // listener instead of attempting the DDL again. + assertEquals(0, models.allModels().size()); + } + + @Test + public void removeKeepsModelOnListenerFailure() throws StorageException { + // remove() must keep the model in `models` if any whenRemoving listener throws — + // otherwise the registry diverges from the backend (model gone, BanyanDB measure + // still alive) and there's nothing for the retry path to find. Listeners are + // required to be idempotent on the drop, so re-firing them on retry is safe. + StorageModels models = new StorageModels(); + models.add(TestModel.class, -1, + new Storage("StorageModelsRemoveRetryTest", false, DownSampling.Hour), + StorageManipulationOpt.fullInstall()); + assertEquals(1, models.allModels().size()); + + // Listener that throws on remove (simulating BanyanDB delete-measure transient failure). + // Note: addModelListener fires whenCreating for already-added models, but our listener's + // whenCreating is an explicit no-op, so the catch-up call does nothing.
+ models.addModelListener(new ModelRegistry.CreatingListener() { + @Override + public void whenCreating(final Model model, final StorageManipulationOpt opt) { + // already-created catch-up — fine to no-op for this test + } + + @Override + public void whenRemoving(final Model model, final StorageManipulationOpt opt) throws StorageException { + throw new StorageException("simulated dropTable failure"); + } + }); + + Assertions.assertThrows(StorageException.class, + () -> models.remove(TestModel.class, StorageManipulationOpt.fullInstall())); + // Model must still be in the registry — the next retry needs to find and drive + // dropTable again. Otherwise the operator's /inactivate succeeds locally but the + // backend measure stays orphaned forever. + assertEquals(1, models.allModels().size()); + } + @Test public void testStorageModels() throws StorageException { StorageModels models = new StorageModels(); models.add(TestModel.class, -1, - new Storage("StorageModelsTest", false, DownSampling.Hour) + new Storage("StorageModelsTest", false, DownSampling.Hour), + StorageManipulationOpt.fullInstall() ); final List allModules = models.allModels(); diff --git a/oap-server/server-fetcher-plugin/fetcher-proto/pom.xml b/oap-server/server-fetcher-plugin/fetcher-proto/pom.xml index 6acf6137d9e5..d276daa9f450 100644 --- a/oap-server/server-fetcher-plugin/fetcher-proto/pom.xml +++ b/oap-server/server-fetcher-plugin/fetcher-proto/pom.xml @@ -67,11 +67,11 @@ protobuf-java version that grpc depends on. 
--> - com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/oap-server/server-library/library-banyandb-client/pom.xml b/oap-server/server-library/library-banyandb-client/pom.xml index 933e9ee4e775..5d80c319bb39 100644 --- a/oap-server/server-library/library-banyandb-client/pom.xml +++ b/oap-server/server-library/library-banyandb-client/pom.xml @@ -104,11 +104,11 @@ protobuf-java version that grpc depends on. --> - com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} ${project.basedir}/src/main/proto/proto diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/AbstractWrite.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/AbstractWrite.java index 413545d9b598..73cd0f11bdc5 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/AbstractWrite.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/AbstractWrite.java @@ -22,7 +22,7 @@ import lombok.Getter; import org.apache.skywalking.banyandb.common.v1.BanyandbCommon; -public abstract class AbstractWrite

{ +public abstract class AbstractWrite

{ /** * Timestamp represents the time of the current data point, in milliseconds. *

diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/BanyanDBClient.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/BanyanDBClient.java index 0d0f82e77fba..7cc4d92c2ba1 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/BanyanDBClient.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/BanyanDBClient.java @@ -47,6 +47,7 @@ import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.IndexRule; import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.IndexRuleBinding; import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.Measure; +import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.Property; import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.Stream; import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.Subject; import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.TopNAggregation; @@ -66,6 +67,7 @@ import org.apache.skywalking.library.banyandb.v1.client.metadata.GroupMetadataRegistry; import org.apache.skywalking.library.banyandb.v1.client.metadata.IndexRuleBindingMetadataRegistry; import org.apache.skywalking.library.banyandb.v1.client.metadata.IndexRuleMetadataRegistry; +import org.apache.skywalking.library.banyandb.v1.client.grpc.MetadataClient; import org.apache.skywalking.library.banyandb.v1.client.metadata.MeasureMetadataRegistry; import org.apache.skywalking.library.banyandb.v1.client.metadata.PropertyMetadataRegistry; import org.apache.skywalking.library.banyandb.v1.client.metadata.ResourceExist; @@ -103,6 +105,12 @@ public class BanyanDBClient implements Closeable { */ @Getter private volatile Channel channel; + /** + * Lazy-initialised wrapper over {@code SchemaBarrierService}. 
First access after + * the channel is wired creates the watcher; nullable until then so that callers + * which never need a schema fence don't pay the construction cost. + */ + private volatile SchemaWatcher schemaWatcher; /** * gRPC client stub */ @@ -365,64 +373,57 @@ public Group define(Group group) throws BanyanDBException { } /** - * Define a new stream - * - * @param stream the stream to be created + * Define a new stream and return the etcd {@code mod_revision} server-stamped on + * the registry write. Callers that need a schema-watch fence (see + * {@link SchemaWatcher#awaitRevisionApplied}) capture this value; legacy callers + * that don't need the fence may ignore the return. */ - public void define(Stream stream) throws BanyanDBException { + public long define(Stream stream) throws BanyanDBException { StreamMetadataRegistry streamRegistry = new StreamMetadataRegistry(checkNotNull(this.channel)); - long modRevision = streamRegistry.create(stream); - stream = stream.toBuilder().setMetadata(stream.getMetadata().toBuilder().setModRevision(modRevision)).build(); + return streamRegistry.create(stream); } /** - * Define a new stream with index rules, - * @param stream the stream to be created - * @param indexRules the index rules to be created + * Define a new stream with index rules. Returns the highest {@code mod_revision} + * across the stream + every index rule + the binding write so callers can fence + * on a single revision. */ - public void define(Stream stream, List indexRules) throws BanyanDBException { - define(stream); - defineIndexRules(stream, indexRules); + public long define(Stream stream, List indexRules) throws BanyanDBException { + long maxRev = define(stream); + return Math.max(maxRev, defineIndexRules(stream, indexRules)); } /** - * Define a new measure - * - * @param measure the measure to be created + * Define a new measure. See {@link #define(Stream)} for the mod_revision contract. 
*/ - public void define(Measure measure) throws BanyanDBException { + public long define(Measure measure) throws BanyanDBException { MeasureMetadataRegistry measureRegistry = new MeasureMetadataRegistry(checkNotNull(this.channel)); - long modRevision = measureRegistry.create(measure); - measure = measure.toBuilder().setMetadata(measure.getMetadata().toBuilder().setModRevision(modRevision)).build(); + return measureRegistry.create(measure); } /** - * Define a new measure with index rules - * @param measure the measure to be created - * @param indexRules the index rules to be created + * Define a new measure with index rules. Returns the highest mod_revision of + * any registry write performed during the call. */ - public void define(Measure measure, List indexRules) throws BanyanDBException { - define(measure); - defineIndexRules(measure, indexRules); + public long define(Measure measure, List indexRules) throws BanyanDBException { + long maxRev = define(measure); + return Math.max(maxRev, defineIndexRules(measure, indexRules)); } /** - * Define a new TopNAggregation - * - * @param topNAggregation the topN rule to be created + * Define a new TopNAggregation. Returns the etcd mod_revision of the write. */ - public void define(TopNAggregation topNAggregation) throws BanyanDBException { + public long define(TopNAggregation topNAggregation) throws BanyanDBException { TopNAggregationMetadataRegistry registry = new TopNAggregationMetadataRegistry(checkNotNull(this.channel)); - registry.create(topNAggregation); + return registry.create(topNAggregation); } /** - * Define a new IndexRule - * @param indexRule the index rule to be created + * Define a new IndexRule. Returns the etcd mod_revision of the write. 
*/ - public void define(IndexRule indexRule) throws BanyanDBException { + public long define(IndexRule indexRule) throws BanyanDBException { IndexRuleMetadataRegistry registry = new IndexRuleMetadataRegistry(checkNotNull(this.channel)); - registry.create(indexRule); + return registry.create(indexRule); } /** @@ -430,40 +431,37 @@ public void define(IndexRule indexRule) throws BanyanDBException { * The default value of beginAt is the current time, and the default value of expireAt is 2099-01-01 00:00:00 UTC. * @param indexRuleBinding the index rule binding to be created */ - public void define(IndexRuleBinding indexRuleBinding) throws BanyanDBException { + public long define(IndexRuleBinding indexRuleBinding) throws BanyanDBException { ZonedDateTime beginAt = indexRuleBinding.getBeginAt() == Timestamp.getDefaultInstance() ? ZonedDateTime.now() : TimeUtils.parseTimestamp(indexRuleBinding.getBeginAt()); ZonedDateTime expireAt = indexRuleBinding.getExpireAt() == Timestamp.getDefaultInstance() ? DEFAULT_EXPIRE_AT : TimeUtils.parseTimestamp(indexRuleBinding.getExpireAt()); - this.define(indexRuleBinding, beginAt, expireAt); + return this.define(indexRuleBinding, beginAt, expireAt); } /** - * Define a new IndexRuleBinding - * @param indexRuleBinding the index rule binding to be created - * @param beginAt the beginning time of the index rule binding - * @param expireAt the expiry time of the index rule binding + * Define a new IndexRuleBinding. Returns the etcd mod_revision of the write. 
*/ - public void define(IndexRuleBinding indexRuleBinding, ZonedDateTime beginAt, ZonedDateTime expireAt) throws BanyanDBException { + public long define(IndexRuleBinding indexRuleBinding, ZonedDateTime beginAt, ZonedDateTime expireAt) throws BanyanDBException { IndexRuleBindingMetadataRegistry registry = new IndexRuleBindingMetadataRegistry(checkNotNull(this.channel)); indexRuleBinding = indexRuleBinding.toBuilder() .setBeginAt(TimeUtils.buildTimestamp(beginAt)) .setExpireAt(TimeUtils.buildTimestamp(expireAt)) .build(); - registry.create(indexRuleBinding); + return registry.create(indexRuleBinding); } /** - * Bind index rule to the stream - * By default, the index rule binding will be active from now, and it will never be expired. - * @param stream the subject of index rule binding - * @param indexRules rules to be bounded + * Bind index rule to the stream. Returns the highest mod_revision of any registry + * write performed during the call. Per-rule {@code ALREADY_EXISTS} responses are + * swallowed (idempotent); a swallowed conflict contributes 0 to the max. 
*/ - public void defineIndexRules(Stream stream, List indexRules) throws BanyanDBException { + public long defineIndexRules(Stream stream, List indexRules) throws BanyanDBException { Preconditions.checkArgument(stream != null, "stream cannot be null"); IndexRuleMetadataRegistry irRegistry = new IndexRuleMetadataRegistry(checkNotNull(this.channel)); + long maxRev = MetadataClient.DEFAULT_MOD_REVISION; for (final IndexRule ir : indexRules) { try { - irRegistry.create(ir); + maxRev = Math.max(maxRev, irRegistry.create(ir)); } catch (BanyanDBException ex) { if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { continue; @@ -472,7 +470,7 @@ public void defineIndexRules(Stream stream, List indexRules) throws B } } if (indexRules.isEmpty()) { - return; + return maxRev; } List indexRuleNames = indexRules.stream() @@ -491,23 +489,21 @@ public void defineIndexRules(Stream stream, List indexRules) throws B .setCatalog( BanyandbCommon.Catalog.CATALOG_STREAM)) .addAllRules(indexRuleNames).build(); - this.define(binding); + return Math.max(maxRev, this.define(binding)); } /** - * Bind index rule to the measure. - * By default, the index rule binding will be active from now, and it will never be expired. - * - * @param measure the subject of index rule binding - * @param indexRules rules to be bounded + * Bind index rule to the measure. See {@link #defineIndexRules(Stream, List)} for + * the mod_revision contract. 
*/ - public void defineIndexRules(Measure measure, List indexRules) throws BanyanDBException { + public long defineIndexRules(Measure measure, List indexRules) throws BanyanDBException { Preconditions.checkArgument(measure != null, "measure cannot be null"); IndexRuleMetadataRegistry irRegistry = new IndexRuleMetadataRegistry(checkNotNull(this.channel)); + long maxRev = MetadataClient.DEFAULT_MOD_REVISION; for (final IndexRule ir : indexRules) { try { - irRegistry.create(ir); + maxRev = Math.max(maxRev, irRegistry.create(ir)); } catch (BanyanDBException ex) { // multiple entity can share a single index rule if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { @@ -517,7 +513,7 @@ public void defineIndexRules(Measure measure, List indexRules) throws } } if (indexRules.isEmpty()) { - return; + return maxRev; } List indexRuleNames = indexRules.stream().map(indexRule -> indexRule.getMetadata().getName()).collect(Collectors.toList()); @@ -534,63 +530,43 @@ public void defineIndexRules(Measure measure, List indexRules) throws .setCatalog( BanyandbCommon.Catalog.CATALOG_MEASURE)) .addAllRules(indexRuleNames).build(); - this.define(binding); + return Math.max(maxRev, this.define(binding)); } - /** - * Update the group - * - * @param group the group to be updated - */ - public void update(Group group) throws BanyanDBException { + /** Update the group. Returns the etcd mod_revision of the write. */ + public long update(Group group) throws BanyanDBException { GroupMetadataRegistry registry = new GroupMetadataRegistry(checkNotNull(this.channel)); - registry.update(group); + return registry.updateWithRevision(group); } - /** - * Update the stream - * @param stream the stream to be updated - */ - public void update(Stream stream) throws BanyanDBException { + /** Update the stream. Returns the etcd mod_revision of the write. 
*/ + public long update(Stream stream) throws BanyanDBException { StreamMetadataRegistry streamRegistry = new StreamMetadataRegistry(checkNotNull(this.channel)); - streamRegistry.update(stream); + return streamRegistry.updateWithRevision(stream); } - /** - * Update the measure - * - * @param measure the measure to be updated - */ - public void update(Measure measure) throws BanyanDBException { + /** Update the measure. Returns the etcd mod_revision of the write. */ + public long update(Measure measure) throws BanyanDBException { MeasureMetadataRegistry measureRegistry = new MeasureMetadataRegistry(checkNotNull(this.channel)); - measureRegistry.update(measure); + return measureRegistry.updateWithRevision(measure); } - /** - * Update the TopNAggregation - * @param topNAggregation the topN rule to be updated - */ - public void update(TopNAggregation topNAggregation) throws BanyanDBException { + /** Update the TopNAggregation. Returns the etcd mod_revision of the write. */ + public long update(TopNAggregation topNAggregation) throws BanyanDBException { TopNAggregationMetadataRegistry registry = new TopNAggregationMetadataRegistry(checkNotNull(this.channel)); - registry.update(topNAggregation); + return registry.updateWithRevision(topNAggregation); } - /** - * Update the IndexRule - * @param indexRule the index rule to be updated - */ - public void update(IndexRule indexRule) throws BanyanDBException { + /** Update the IndexRule. Returns the etcd mod_revision of the write. */ + public long update(IndexRule indexRule) throws BanyanDBException { IndexRuleMetadataRegistry registry = new IndexRuleMetadataRegistry(checkNotNull(this.channel)); - registry.update(indexRule); + return registry.updateWithRevision(indexRule); } - /** - * Update the IndexRuleBinding - * @param indexRuleBinding the index rule binding to be updated - */ - public void update(IndexRuleBinding indexRuleBinding) throws BanyanDBException { + /** Update the IndexRuleBinding. 
Returns the etcd mod_revision of the write. */ + public long update(IndexRuleBinding indexRuleBinding) throws BanyanDBException { IndexRuleBindingMetadataRegistry registry = new IndexRuleBindingMetadataRegistry(checkNotNull(this.channel)); - registry.update(indexRuleBinding); + return registry.updateWithRevision(indexRuleBinding); } /** @@ -668,6 +644,65 @@ public boolean deleteIndexRuleBinding(String group, String name) throws BanyanDB return registry.delete(group, name); } + /** + * Variant of {@link #deleteStream(String, String)} that returns the etcd + * {@code mod_revision} of the tombstone. Returns 0 when the server did not + * record one — callers needing a delete-fence then fall back to + * {@link SchemaWatcher#awaitSchemaDeleted}. + */ + public long deleteStreamWithRevision(String group, String name) throws BanyanDBException { + Preconditions.checkArgument(!Strings.isNullOrEmpty(group)); + Preconditions.checkArgument(!Strings.isNullOrEmpty(name)); + return new StreamMetadataRegistry(checkNotNull(this.channel)).deleteWithRevision(group, name); + } + + /** See {@link #deleteStreamWithRevision}. */ + public long deleteMeasureWithRevision(String group, String name) throws BanyanDBException { + Preconditions.checkArgument(!Strings.isNullOrEmpty(group)); + Preconditions.checkArgument(!Strings.isNullOrEmpty(name)); + return new MeasureMetadataRegistry(checkNotNull(this.channel)).deleteWithRevision(group, name); + } + + /** See {@link #deleteStreamWithRevision}. */ + public long deleteTopNAggregationWithRevision(String group, String name) throws BanyanDBException { + Preconditions.checkArgument(!Strings.isNullOrEmpty(group)); + Preconditions.checkArgument(!Strings.isNullOrEmpty(name)); + return new TopNAggregationMetadataRegistry(checkNotNull(this.channel)).deleteWithRevision(group, name); + } + + /** See {@link #deleteStreamWithRevision}. 
*/ + public long deleteIndexRuleWithRevision(String group, String name) throws BanyanDBException { + Preconditions.checkArgument(!Strings.isNullOrEmpty(group)); + Preconditions.checkArgument(!Strings.isNullOrEmpty(name)); + return new IndexRuleMetadataRegistry(checkNotNull(this.channel)).deleteWithRevision(group, name); + } + + /** See {@link #deleteStreamWithRevision}. */ + public long deleteIndexRuleBindingWithRevision(String group, String name) throws BanyanDBException { + Preconditions.checkArgument(!Strings.isNullOrEmpty(group)); + Preconditions.checkArgument(!Strings.isNullOrEmpty(name)); + return new IndexRuleBindingMetadataRegistry(checkNotNull(this.channel)).deleteWithRevision(group, name); + } + + /** + * Lazy accessor for the schema-watcher wrapper. Use to fence subsequent + * data writes / queries against a target {@code mod_revision}, or to wait for + * a delete tombstone to fan out across data nodes. + */ + public SchemaWatcher getSchemaWatcher() { + SchemaWatcher local = this.schemaWatcher; + if (local == null) { + synchronized (this) { + local = this.schemaWatcher; + if (local == null) { + local = new SchemaWatcher(checkNotNull(this.channel)); + this.schemaWatcher = local; + } + } + } + return local; + } + /** * Find the IndexRule * @param group the group name of the index rule @@ -736,20 +771,17 @@ public List findIndexRuleBindings(String group) throws BanyanD * @param property the property to be stored in the BanyanBD * @throws BanyanDBException if the property is invalid */ - public void define(org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.Property property) throws BanyanDBException { + public long define(Property property) throws BanyanDBException { PropertyMetadataRegistry registry = new PropertyMetadataRegistry(checkNotNull(this.channel)); - registry.create(property); + return registry.create(property); } /** - * Update the property. 
- * - * @param property the property to be stored in the BanyanBD - * @throws BanyanDBException if the property is invalid + * Update the property. Returns the etcd mod_revision of the write. */ - public void update(org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.Property property) throws BanyanDBException { + public long update(Property property) throws BanyanDBException { PropertyMetadataRegistry registry = new PropertyMetadataRegistry(checkNotNull(this.channel)); - registry.update(property); + return registry.updateWithRevision(property); } /** @@ -759,7 +791,7 @@ public void update(org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.P * @param name name of the metadata * @return the property found in BanyanDB. Otherwise, null is returned. */ - public org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.Property findPropertyDefinition(String group, String name) throws BanyanDBException { + public Property findPropertyDefinition(String group, String name) throws BanyanDBException { PropertyMetadataRegistry registry = new PropertyMetadataRegistry(checkNotNull(this.channel)); return registry.get(group, name); } @@ -770,7 +802,7 @@ public org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.Property find * @param group group of the metadata * @return the properties found in BanyanDB */ - public List findPropertiesDefinition(String group) throws BanyanDBException { + public List findPropertiesDefinition(String group) throws BanyanDBException { PropertyMetadataRegistry registry = new PropertyMetadataRegistry(checkNotNull(this.channel)); return registry.list(group); } @@ -804,20 +836,17 @@ public BanyandbProperty.QueryResponse query(BanyandbProperty.QueryRequest reques * @param trace the trace to be stored in the BanyanDB * @throws BanyanDBException if the trace is invalid */ - public void define(Trace trace) throws BanyanDBException { + public long define(Trace trace) throws BanyanDBException { TraceMetadataRegistry registry = new 
TraceMetadataRegistry(checkNotNull(this.channel)); - registry.create(trace); + return registry.create(trace); } /** - * Update the trace. - * - * @param trace the trace to be stored in the BanyanDB - * @throws BanyanDBException if the trace is invalid + * Update the trace. Returns the etcd mod_revision of the write. */ - public void update(Trace trace) throws BanyanDBException { + public long update(Trace trace) throws BanyanDBException { TraceMetadataRegistry registry = new TraceMetadataRegistry(checkNotNull(this.channel)); - registry.update(trace); + return registry.updateWithRevision(trace); } /** diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/SchemaWatcher.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/SchemaWatcher.java new file mode 100644 index 000000000000..a2c23ea23a8d --- /dev/null +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/SchemaWatcher.java @@ -0,0 +1,146 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.library.banyandb.v1.client; + +import com.google.protobuf.Duration; +import io.grpc.Channel; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import lombok.Getter; +import lombok.RequiredArgsConstructor; +import org.apache.skywalking.banyandb.schema.v1.BanyandbSchema; +import org.apache.skywalking.banyandb.schema.v1.SchemaBarrierServiceGrpc; +import org.apache.skywalking.library.banyandb.v1.client.grpc.HandleExceptionsWith; +import org.apache.skywalking.library.banyandb.v1.client.grpc.exception.BanyanDBException; + +/** + * Client wrapper around BanyanDB's {@code SchemaBarrierService}. Replaces the legacy + * "poll {@code findX} until you can read your own write" idiom with an authoritative + * fence: the server-side watcher blocks until every data node has observed the target + * schema state. + * + *

<p>Three RPCs: + * <ul> + *
<li>{@link #awaitRevisionApplied}: block until every data node's local schema + * cache has caught up to a target {@code mod_revision}. Use this after a + * Create / Update where the response gave a non-zero revision; the etcd + * revision is global, so a single fence covers all schema mutations done + * under it. + *
<li>{@link #awaitSchemaApplied}: block until specific keys are present at or + * above per-key revisions. Useful when the caller wants a precise key list + * confirmed (e.g., index rules + binding for a single measure). + *
<li>{@link #awaitSchemaDeleted}: block until specific keys have disappeared + * from every data node's cache. Use after a Delete that returned + * {@code mod_revision == 0} (server did not record a tombstone); the + * revision-based fence won't observe the deletion, so the caller falls back + * to a key-based wait. + *
</ul> + * + *
<p>
The legacy {@code findX}-poll path used to time out after about 5 s before + * declaring "schema not applied"; the server-side watcher RPCs themselves take a + * timeout so the caller no longer needs an external retry loop. + */ +public final class SchemaWatcher { + + private final SchemaBarrierServiceGrpc.SchemaBarrierServiceBlockingStub stub; + + public SchemaWatcher(final Channel channel) { + this.stub = SchemaBarrierServiceGrpc.newBlockingStub(channel); + } + + /** + * Block until every data node has observed at least {@code minRevision}, or until + * the timeout elapses. {@code minRevision <= 0} returns immediately (no fence + * needed). Returns the server's response so callers can inspect laggards on a + * non-applied result. + */ + public Result awaitRevisionApplied(final long minRevision, + final java.time.Duration timeout) throws BanyanDBException { + if (minRevision <= 0L) { + return Result.applied(); + } + final BanyandbSchema.AwaitRevisionAppliedResponse resp = HandleExceptionsWith.callAndTranslateApiException(() -> + stub.awaitRevisionApplied(BanyandbSchema.AwaitRevisionAppliedRequest.newBuilder() + .setMinRevision(minRevision) + .setTimeout(toProto(timeout)) + .build())); + return new Result(resp.getApplied(), resp.getLaggardsList()); + } + + /** + * Block until every data node reports the listed keys present at or above their + * per-key revisions. {@code minRevisions} entries pair positionally with + * {@code keys}; pass {@code 0} for "any revision will do".
+ */ + public Result awaitSchemaApplied(final List<BanyandbSchema.SchemaKey> keys, + final List<Long> minRevisions, + final java.time.Duration timeout) throws BanyanDBException { + final BanyandbSchema.AwaitSchemaAppliedRequest.Builder req = BanyandbSchema.AwaitSchemaAppliedRequest.newBuilder() + .addAllKeys(keys) + .addAllMinRevisions(minRevisions) + .setTimeout(toProto(timeout)); + final BanyandbSchema.AwaitSchemaAppliedResponse resp = HandleExceptionsWith.callAndTranslateApiException(() -> + stub.awaitSchemaApplied(req.build())); + return new Result(resp.getApplied(), resp.getLaggardsList()); + } + + /** + * Block until every data node has removed the listed keys from its cache. + * Use after a Delete that returned {@code mod_revision == 0}; the + * revision-based fence cannot observe a deletion that didn't get a tombstone. + */ + public Result awaitSchemaDeleted(final List<BanyandbSchema.SchemaKey> keys, + final java.time.Duration timeout) throws BanyanDBException { + final BanyandbSchema.AwaitSchemaDeletedResponse resp = HandleExceptionsWith.callAndTranslateApiException(() -> + stub.awaitSchemaDeleted(BanyandbSchema.AwaitSchemaDeletedRequest.newBuilder() + .addAllKeys(keys) + .setTimeout(toProto(timeout)) + .build())); + return new Result(resp.getApplied(), resp.getLaggardsList()); + } + + /** Convenience for the common single-key delete-wait. */ + public Result awaitSchemaDeleted(final BanyandbSchema.SchemaKey key, + final java.time.Duration timeout) throws BanyanDBException { + return awaitSchemaDeleted(Arrays.asList(key), timeout); + } + + private static Duration toProto(final java.time.Duration d) { + return Duration.newBuilder() + .setSeconds(d.getSeconds()) + .setNanos(d.getNano()) + .build(); + } + + /** + * Result of a watcher call. {@link #applied} is true iff every data node has + * caught up; {@link #laggards} carries per-node detail when not.
+ */ + @Getter + @RequiredArgsConstructor + public static final class Result { + private final boolean applied; + private final List laggards; + + public static Result applied() { + return new Result(true, Collections.emptyList()); + } + } +} diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/grpc/MetadataClient.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/grpc/MetadataClient.java index f7e2529ba698..bb9ae2614b81 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/grpc/MetadataClient.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/grpc/MetadataClient.java @@ -18,7 +18,7 @@ package org.apache.skywalking.library.banyandb.v1.client.grpc; -import com.google.protobuf.GeneratedMessageV3; +import com.google.protobuf.GeneratedMessage; import io.grpc.stub.AbstractBlockingStub; import java.util.List; import org.apache.skywalking.library.banyandb.v1.client.grpc.exception.BanyanDBException; @@ -29,7 +29,7 @@ * * @param

<P> ProtoBuf: schema defined in ProtoBuf format */ -public abstract class MetadataClient<STUB extends AbstractBlockingStub<STUB>, P extends GeneratedMessageV3> { +public abstract class MetadataClient<STUB extends AbstractBlockingStub<STUB>, P extends GeneratedMessage> { public static final long DEFAULT_MOD_REVISION = 0; protected final STUB stub; @@ -55,6 +55,24 @@ protected MetadataClient(STUB stub) { */ public abstract void update(P payload) throws BanyanDBException; + /** + * Update the schema and return the etcd {@code mod_revision} stamped on the + * server-side write. Callers that need to fence subsequent data writes / queries + * against the new shape (via {@code SchemaBarrierService.AwaitRevisionApplied}) + * use this overload to capture the revision; callers that don't need the fence + * can still use {@link #update(GeneratedMessage)}. + * + *

<p>Default implementation calls {@link #update(GeneratedMessage)} and returns + * {@link #DEFAULT_MOD_REVISION} (0); registries that do not yet expose + * {@code mod_revision} on their Update response keep the no-fence behaviour. + * Concrete subclasses override to read {@code mod_revision} off the typed + * response. + */ + public long updateWithRevision(P payload) throws BanyanDBException { + update(payload); + return DEFAULT_MOD_REVISION; + } + /** * Delete a schema * @@ -65,6 +83,21 @@ protected MetadataClient(STUB stub) { */ public abstract boolean delete(String group, String name) throws BanyanDBException; + /** + * Delete a schema and return the etcd {@code mod_revision} of the tombstone. + * Returns {@link #DEFAULT_MOD_REVISION} (0) when the server did not record a + * tombstone; callers that need a delete-fence then fall back to + * {@code SchemaBarrierService.AwaitSchemaDeleted} keyed on the resource. + * + *
<p>Default implementation calls {@link #delete(String, String)} and returns + * {@link #DEFAULT_MOD_REVISION}; concrete subclasses override to read + * {@code mod_revision} off the typed response. + */ + public long deleteWithRevision(String group, String name) throws BanyanDBException { + delete(group, name); + return DEFAULT_MOD_REVISION; + } + /** * Get a schema with name * diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/GroupMetadataRegistry.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/GroupMetadataRegistry.java index 66243d020b95..995c47d16c16 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/GroupMetadataRegistry.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/GroupMetadataRegistry.java @@ -35,17 +35,25 @@ public GroupMetadataRegistry(Channel channel) { @Override public long create(Group payload) throws BanyanDBException { - execute(() -> stub.create(BanyandbDatabase.GroupRegistryServiceCreateRequest.newBuilder() - .setGroup(payload) - .build())); - return DEFAULT_MOD_REVISION; + BanyandbDatabase.GroupRegistryServiceCreateResponse resp = execute(() -> + stub.create(BanyandbDatabase.GroupRegistryServiceCreateRequest.newBuilder() + .setGroup(payload) + .build())); + return resp == null ?
DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override public void update(Group payload) throws BanyanDBException { - execute(() -> stub.update(BanyandbDatabase.GroupRegistryServiceUpdateRequest.newBuilder() - .setGroup(payload) - .build())); + updateWithRevision(payload); + } + + @Override + public long updateWithRevision(Group payload) throws BanyanDBException { + BanyandbDatabase.GroupRegistryServiceUpdateResponse resp = execute(() -> + stub.update(BanyandbDatabase.GroupRegistryServiceUpdateRequest.newBuilder() + .setGroup(payload) + .build())); + return resp == null ? DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override @@ -54,7 +62,12 @@ public boolean delete(String group, String name) throws BanyanDBException { stub.delete(BanyandbDatabase.GroupRegistryServiceDeleteRequest.newBuilder() .setGroup(name) .build())); - return resp != null && resp.getDeleted(); + // Schema-consistency Phase 1+ proto removed the explicit `bool deleted` field; + // a non-null response means the server accepted the delete. mod_revision is + // the new authoritative signal for "tombstone recorded" — callers needing + // delete-fence semantics should use AwaitSchemaDeleted via SchemaBarrier when + // mod_revision is 0 (no tombstone). 
+ return resp != null; } @Override diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/IndexRuleBindingMetadataRegistry.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/IndexRuleBindingMetadataRegistry.java index d5bbffa2ab9a..135d3d8af4ad 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/IndexRuleBindingMetadataRegistry.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/IndexRuleBindingMetadataRegistry.java @@ -36,26 +36,39 @@ public IndexRuleBindingMetadataRegistry(Channel channel) { @Override public long create(IndexRuleBinding payload) throws BanyanDBException { - execute(() -> stub.create(BanyandbDatabase.IndexRuleBindingRegistryServiceCreateRequest.newBuilder() - .setIndexRuleBinding(payload) - .build())); - return DEFAULT_MOD_REVISION; + BanyandbDatabase.IndexRuleBindingRegistryServiceCreateResponse resp = execute(() -> + stub.create(BanyandbDatabase.IndexRuleBindingRegistryServiceCreateRequest.newBuilder() + .setIndexRuleBinding(payload) + .build())); + return resp == null ? DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override public void update(IndexRuleBinding payload) throws BanyanDBException { - execute(() -> stub.update(BanyandbDatabase.IndexRuleBindingRegistryServiceUpdateRequest.newBuilder() - .setIndexRuleBinding(payload) - .build())); + updateWithRevision(payload); + } + + @Override + public long updateWithRevision(IndexRuleBinding payload) throws BanyanDBException { + BanyandbDatabase.IndexRuleBindingRegistryServiceUpdateResponse resp = execute(() -> + stub.update(BanyandbDatabase.IndexRuleBindingRegistryServiceUpdateRequest.newBuilder() + .setIndexRuleBinding(payload) + .build())); + return resp == null ? 
DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override public boolean delete(String group, String name) throws BanyanDBException { + return deleteWithRevision(group, name) >= 0; + } + + @Override + public long deleteWithRevision(String group, String name) throws BanyanDBException { BanyandbDatabase.IndexRuleBindingRegistryServiceDeleteResponse resp = execute(() -> stub.delete(BanyandbDatabase.IndexRuleBindingRegistryServiceDeleteRequest.newBuilder() .setMetadata(BanyandbCommon.Metadata.newBuilder().setGroup(group).setName(name).build()) .build())); - return resp != null && resp.getDeleted(); + return resp == null ? DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/IndexRuleMetadataRegistry.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/IndexRuleMetadataRegistry.java index 27cdcf61e04b..86e3aa073782 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/IndexRuleMetadataRegistry.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/IndexRuleMetadataRegistry.java @@ -35,28 +35,39 @@ public IndexRuleMetadataRegistry(Channel channel) { @Override public long create(IndexRule payload) throws BanyanDBException { - execute(() -> + BanyandbDatabase.IndexRuleRegistryServiceCreateResponse resp = execute(() -> stub.create(BanyandbDatabase.IndexRuleRegistryServiceCreateRequest.newBuilder() .setIndexRule(payload) .build())); - return DEFAULT_MOD_REVISION; + return resp == null ? 
DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override public void update(IndexRule payload) throws BanyanDBException { - execute(() -> + updateWithRevision(payload); + } + + @Override + public long updateWithRevision(IndexRule payload) throws BanyanDBException { + BanyandbDatabase.IndexRuleRegistryServiceUpdateResponse resp = execute(() -> stub.update(BanyandbDatabase.IndexRuleRegistryServiceUpdateRequest.newBuilder() .setIndexRule(payload) .build())); + return resp == null ? DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override public boolean delete(String group, String name) throws BanyanDBException { + return deleteWithRevision(group, name) >= 0; + } + + @Override + public long deleteWithRevision(String group, String name) throws BanyanDBException { BanyandbDatabase.IndexRuleRegistryServiceDeleteResponse resp = execute(() -> stub.delete(BanyandbDatabase.IndexRuleRegistryServiceDeleteRequest.newBuilder() .setMetadata(BanyandbCommon.Metadata.newBuilder().setGroup(group).setName(name).build()) .build())); - return resp != null && resp.getDeleted(); + return resp == null ? 
DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/MeasureMetadataRegistry.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/MeasureMetadataRegistry.java index 1e3be1b421e0..7cc51596b379 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/MeasureMetadataRegistry.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/MeasureMetadataRegistry.java @@ -45,19 +45,30 @@ public long create(final Measure payload) throws BanyanDBException { @Override public void update(final Measure payload) throws BanyanDBException { - execute(() -> + updateWithRevision(payload); + } + + @Override + public long updateWithRevision(final Measure payload) throws BanyanDBException { + BanyandbDatabase.MeasureRegistryServiceUpdateResponse resp = execute(() -> stub.update(BanyandbDatabase.MeasureRegistryServiceUpdateRequest.newBuilder() .setMeasure(payload) .build())); + return resp.getModRevision(); } @Override public boolean delete(final String group, final String name) throws BanyanDBException { + return deleteWithRevision(group, name) >= 0; + } + + @Override + public long deleteWithRevision(final String group, final String name) throws BanyanDBException { BanyandbDatabase.MeasureRegistryServiceDeleteResponse resp = execute(() -> stub.delete(BanyandbDatabase.MeasureRegistryServiceDeleteRequest.newBuilder() .setMetadata(BanyandbCommon.Metadata.newBuilder().setGroup(group).setName(name).build()) .build())); - return resp != null && resp.getDeleted(); + return resp == null ? 
DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/PropertyMetadataRegistry.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/PropertyMetadataRegistry.java index ae543bc6aa59..4621d5e1e696 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/PropertyMetadataRegistry.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/PropertyMetadataRegistry.java @@ -45,19 +45,30 @@ public long create(final Property payload) throws BanyanDBException { @Override public void update(final Property payload) throws BanyanDBException { - execute(() -> + updateWithRevision(payload); + } + + @Override + public long updateWithRevision(final Property payload) throws BanyanDBException { + BanyandbDatabase.PropertyRegistryServiceUpdateResponse resp = execute(() -> stub.update(BanyandbDatabase.PropertyRegistryServiceUpdateRequest.newBuilder() .setProperty(payload) .build())); + return resp.getModRevision(); } @Override public boolean delete(final String group, final String name) throws BanyanDBException { + return deleteWithRevision(group, name) >= 0; + } + + @Override + public long deleteWithRevision(final String group, final String name) throws BanyanDBException { BanyandbDatabase.PropertyRegistryServiceDeleteResponse resp = execute(() -> stub.delete(BanyandbDatabase.PropertyRegistryServiceDeleteRequest.newBuilder() .setMetadata(BanyandbCommon.Metadata.newBuilder().setGroup(group).setName(name).build()) .build())); - return resp != null && resp.getDeleted(); + return resp == null ? 
DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/Serializable.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/Serializable.java index d3a4856fbb9e..666c85c78195 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/Serializable.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/Serializable.java @@ -24,7 +24,7 @@ * * @param

<P> the produced class must be in Protobuf message type. */ -public interface Serializable<P extends GeneratedMessageV3> { +public interface Serializable<P extends GeneratedMessage>
{ /** * Serialize the object to the protobuf format * diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/StreamMetadataRegistry.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/StreamMetadataRegistry.java index efcb66370715..002f6b09a585 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/StreamMetadataRegistry.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/StreamMetadataRegistry.java @@ -45,19 +45,30 @@ public long create(Stream payload) throws BanyanDBException { @Override public void update(Stream payload) throws BanyanDBException { - execute(() -> + updateWithRevision(payload); + } + + @Override + public long updateWithRevision(Stream payload) throws BanyanDBException { + BanyandbDatabase.StreamRegistryServiceUpdateResponse resp = execute(() -> stub.update(BanyandbDatabase.StreamRegistryServiceUpdateRequest.newBuilder() .setStream(payload) .build())); + return resp.getModRevision(); } @Override public boolean delete(String group, String name) throws BanyanDBException { + return deleteWithRevision(group, name) >= 0; + } + + @Override + public long deleteWithRevision(String group, String name) throws BanyanDBException { BanyandbDatabase.StreamRegistryServiceDeleteResponse resp = execute(() -> stub.delete(BanyandbDatabase.StreamRegistryServiceDeleteRequest.newBuilder() .setMetadata(BanyandbCommon.Metadata.newBuilder().setGroup(group).setName(name).build()) .build())); - return resp != null && resp.getDeleted(); + return resp == null ? 
DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/TopNAggregationMetadataRegistry.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/TopNAggregationMetadataRegistry.java index f5813e1caccf..49cdd69fc054 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/TopNAggregationMetadataRegistry.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/TopNAggregationMetadataRegistry.java @@ -35,28 +35,39 @@ public TopNAggregationMetadataRegistry(Channel channel) { @Override public long create(TopNAggregation payload) throws BanyanDBException { - execute(() -> + BanyandbDatabase.TopNAggregationRegistryServiceCreateResponse resp = execute(() -> stub.create(BanyandbDatabase.TopNAggregationRegistryServiceCreateRequest.newBuilder() .setTopNAggregation(payload) .build())); - return DEFAULT_MOD_REVISION; + return resp == null ? DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override public void update(TopNAggregation payload) throws BanyanDBException { - execute(() -> + updateWithRevision(payload); + } + + @Override + public long updateWithRevision(TopNAggregation payload) throws BanyanDBException { + BanyandbDatabase.TopNAggregationRegistryServiceUpdateResponse resp = execute(() -> stub.update(BanyandbDatabase.TopNAggregationRegistryServiceUpdateRequest.newBuilder() .setTopNAggregation(payload) .build())); + return resp == null ? 
DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override public boolean delete(String group, String name) throws BanyanDBException { + return deleteWithRevision(group, name) >= 0; + } + + @Override + public long deleteWithRevision(String group, String name) throws BanyanDBException { BanyandbDatabase.TopNAggregationRegistryServiceDeleteResponse resp = execute(() -> stub.delete(BanyandbDatabase.TopNAggregationRegistryServiceDeleteRequest.newBuilder() .setMetadata(BanyandbCommon.Metadata.newBuilder().setGroup(group).setName(name).build()) .build())); - return resp != null && resp.getDeleted(); + return resp == null ? DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override diff --git a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/TraceMetadataRegistry.java b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/TraceMetadataRegistry.java index 64fc61afe008..ed05c449187d 100644 --- a/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/TraceMetadataRegistry.java +++ b/oap-server/server-library/library-banyandb-client/src/main/java/org/apache/skywalking/library/banyandb/v1/client/metadata/TraceMetadataRegistry.java @@ -45,19 +45,30 @@ public long create(Trace payload) throws BanyanDBException { @Override public void update(Trace payload) throws BanyanDBException { - execute(() -> + updateWithRevision(payload); + } + + @Override + public long updateWithRevision(Trace payload) throws BanyanDBException { + BanyandbDatabase.TraceRegistryServiceUpdateResponse resp = execute(() -> stub.update(BanyandbDatabase.TraceRegistryServiceUpdateRequest.newBuilder() .setTrace(payload) .build())); + return resp.getModRevision(); } @Override public boolean delete(String group, String name) throws BanyanDBException { + return deleteWithRevision(group, name) >= 0; + } + + 
@Override + public long deleteWithRevision(String group, String name) throws BanyanDBException { BanyandbDatabase.TraceRegistryServiceDeleteResponse resp = execute(() -> stub.delete(BanyandbDatabase.TraceRegistryServiceDeleteRequest.newBuilder() .setMetadata(BanyandbCommon.Metadata.newBuilder().setGroup(group).setName(name).build()) .build())); - return resp != null && resp.getDeleted(); + return resp == null ? DEFAULT_MOD_REVISION : resp.getModRevision(); } @Override diff --git a/oap-server/server-library/library-banyandb-client/src/main/proto b/oap-server/server-library/library-banyandb-client/src/main/proto index b1c87663e577..7d0b0a568481 160000 --- a/oap-server/server-library/library-banyandb-client/src/main/proto +++ b/oap-server/server-library/library-banyandb-client/src/main/proto @@ -1 +1 @@ -Subproject commit b1c87663e57796402736a5f52dd3b4ca8d981e89 +Subproject commit 7d0b0a5684812523325a30b2318a026870b45c8f diff --git a/oap-server/server-library/library-banyandb-client/src/test/java/org/apache/skywalking/library/banyandb/v1/client/BanyanDBClientTestCI.java b/oap-server/server-library/library-banyandb-client/src/test/java/org/apache/skywalking/library/banyandb/v1/client/BanyanDBClientTestCI.java index fdd826ca0192..08bd9f98b55d 100644 --- a/oap-server/server-library/library-banyandb-client/src/test/java/org/apache/skywalking/library/banyandb/v1/client/BanyanDBClientTestCI.java +++ b/oap-server/server-library/library-banyandb-client/src/test/java/org/apache/skywalking/library/banyandb/v1/client/BanyanDBClientTestCI.java @@ -20,7 +20,7 @@ import lombok.extern.slf4j.Slf4j; import org.apache.skywalking.banyandb.common.v1.BanyandbCommon; -import org.apache.skywalking.oap.server.library.it.ITVersions; +import org.apache.skywalking.oap.server.library.it.BanyanDBTestContainer; import org.testcontainers.containers.GenericContainer; import org.testcontainers.containers.wait.strategy.Wait; import org.testcontainers.junit.jupiter.Container; @@ -31,22 +31,15 @@ @Slf4j 
@Testcontainers public class BanyanDBClientTestCI { - private static final String REGISTRY = "ghcr.io"; - private static final String IMAGE_NAME = "apache/skywalking-banyandb"; - private static final String TAG = ITVersions.get("SW_BANYANDB_COMMIT"); - - private static final String IMAGE = REGISTRY + "/" + IMAGE_NAME + ":" + TAG; - - protected static final int GRPC_PORT = 17912; - protected static final int HTTP_PORT = 17913; + protected static final int GRPC_PORT = BanyanDBTestContainer.GRPC_PORT; + protected static final int HTTP_PORT = BanyanDBTestContainer.HTTP_PORT; @Container public GenericContainer banyanDB = new GenericContainer<>( - DockerImageName.parse(IMAGE)) - .withCommand("standalone", "--stream-root-path", "/tmp/banyandb-stream-data", - "--measure-root-path", "/tmp/banyand-measure-data") + DockerImageName.parse(BanyanDBTestContainer.image())) + .withCommand(BanyanDBTestContainer.standaloneCommand()) .withExposedPorts(GRPC_PORT, HTTP_PORT) - .waitingFor(Wait.forHttp("/api/healthz").forPort(HTTP_PORT)); + .waitingFor(Wait.forHttp(BanyanDBTestContainer.HEALTH_ENDPOINT).forPort(HTTP_PORT)); protected BanyanDBClient client; diff --git a/oap-server/server-library/library-batch-queue/src/main/java/org/apache/skywalking/oap/server/library/batchqueue/BatchQueue.java b/oap-server/server-library/library-batch-queue/src/main/java/org/apache/skywalking/oap/server/library/batchqueue/BatchQueue.java index 597ca4d6ba8f..1890bd0d2bfe 100644 --- a/oap-server/server-library/library-batch-queue/src/main/java/org/apache/skywalking/oap/server/library/batchqueue/BatchQueue.java +++ b/oap-server/server-library/library-batch-queue/src/main/java/org/apache/skywalking/oap/server/library/batchqueue/BatchQueue.java @@ -128,6 +128,13 @@ public class BatchQueue { */ private final ConcurrentHashMap, HandlerConsumer> handlerMap; + /** + * Per-type registration weight, populated by {@link #addHandler(Class, HandlerConsumer, double)}. 
+ * Used by {@link #removeHandler(Class)} so the running {@link #weightedHandlerCount} can be + * decremented symmetrically when a handler is unregistered (runtime rule hot-remove). + */ + private final ConcurrentHashMap, Double> handlerWeights; + /** * Running weighted sum of registered handlers, used by adaptive partition policy. * Each handler contributes its weight (default 1.0) when registered via @@ -301,6 +308,7 @@ public class BatchQueue { this.config = config; this.partitionSelector = config.getPartitionSelector(); this.handlerMap = new ConcurrentHashMap<>(); + this.handlerWeights = new ConcurrentHashMap<>(); this.warnedUnregisteredTypes = ConcurrentHashMap.newKeySet(); int threadCount = config.getThreads().resolve(); @@ -422,7 +430,13 @@ public void addHandler(final Class type, final HandlerConsumer h throw new IllegalArgumentException("Handler weight must be > 0, got: " + weight); } handlerMap.put(type, handler); - weightedHandlerCount += weight; + final Double previous = handlerWeights.put(type, weight); + if (previous != null) { + // Re-register of an existing type: replace weight symmetrically. + weightedHandlerCount += weight - previous; + } else { + weightedHandlerCount += weight; + } final int newPartitionCount = config.getPartitions() .resolve(resolvedThreadCount, weightedHandlerCount); @@ -460,6 +474,38 @@ public void addHandler(final Class type, final HandlerConsumer h } } + /** + * Remove a previously-registered type handler. Intended for runtime rule hot-remove (MAL/LAL). + * Not safe to call concurrently with {@link #addHandler(Class, HandlerConsumer, double)} for + * the same or another type — the caller must serialize registrations (the runtime-rule module + * holds a per-file lock around its {@code create}/{@code removeMetric} sequence; OAP startup + * registrations run single-threaded). + * + *

<p>Side effects: + *
<ul> + *
<li>{@link #handlerMap} entry for {@code type} is removed atomically. Subsequent drained + * items of that class hit the "no handler" path and are logged + dropped at most once per type.</li> + *
<li>{@link #weightedHandlerCount} is decremented by the weight recorded at registration.</li> + *
<li>Partition array is NOT shrunk — slots allocated by adaptive growth stay for the lifetime + * of the process. This is a bounded leak proportional to cumulative distinct types seen, + * accepted by design (the alternative requires quiescing producers cluster-wide).</li> + *
<li>The {@link #warnedUnregisteredTypes} memo is cleared for this type so a later re-registration + * followed by accidental sample arrival produces a fresh warning.</li> + *
</ul>
+ * + * @param type the class whose handler should be removed + * @return {@code true} if a handler was present and removed, {@code false} if no handler was registered + */ + public boolean removeHandler(final Class type) { + final HandlerConsumer removed = handlerMap.remove(type); + final Double weight = handlerWeights.remove(type); + if (weight != null) { + weightedHandlerCount -= weight; + } + warnedUnregisteredTypes.remove(type); + return removed != null; + } + /** * Initialize rebalancing infrastructure and schedule periodic rebalance task. * Called from constructor when {@code .balancer(DrainBalancer, intervalMs)} is diff --git a/oap-server/server-library/library-integration-test/src/main/java/org/apache/skywalking/oap/server/library/it/BanyanDBTestContainer.java b/oap-server/server-library/library-integration-test/src/main/java/org/apache/skywalking/oap/server/library/it/BanyanDBTestContainer.java new file mode 100644 index 000000000000..c47e56d6278b --- /dev/null +++ b/oap-server/server-library/library-integration-test/src/main/java/org/apache/skywalking/oap/server/library/it/BanyanDBTestContainer.java @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.library.it; + +/** + * Single source of truth for the BanyanDB container image + port configuration used by + * integration tests across the repository. + * + *

<p>The tag comes from {@code test/e2e-v2/script/env}'s {@code SW_BANYANDB_COMMIT}, read + * via {@link ITVersions}. Every IT that wants a BanyanDB instance should consult this class + * so a version bump lands in one place — the env file — and every IT picks it up on the + * next rebuild. Hardcoding a tag in a test file drifts silently, which is why the {@code + * docker/.env} standalone file is pinned to the same commit the env file carries. + * + *
<p>
This class deliberately does NOT return a {@code GenericContainer} directly. Testcontainers + * is a test-scope dependency throughout the build, and returning a Testcontainers type from a + * {@code library-integration-test} main-compile helper would force every downstream test module + * that just wants the image name to also pull Testcontainers at compile scope. Instead, tests + * call {@link #image()} and {@link #standaloneCommand()} and wire their own + * {@code GenericContainer} — keeping Testcontainers scope as-is. + */ +public final class BanyanDBTestContainer { + + public static final int GRPC_PORT = 17912; + public static final int HTTP_PORT = 17913; + + /** HTTP path Testcontainers' Wait strategies can probe to know the server is ready. */ + public static final String HEALTH_ENDPOINT = "/api/healthz"; + + private static final String REGISTRY = "ghcr.io"; + private static final String IMAGE_NAME = "apache/skywalking-banyandb"; + + private BanyanDBTestContainer() { + } + + /** + * Fully-qualified image reference pinned to the repo's currently-declared BanyanDB commit. + * Tests use this with {@code DockerImageName.parse(...)} to construct their + * {@code GenericContainer}. + */ + public static String image() { + final String tag = ITVersions.get("SW_BANYANDB_COMMIT"); + if (tag == null || tag.isEmpty()) { + throw new IllegalStateException( + "SW_BANYANDB_COMMIT missing from test/e2e-v2/script/env — " + + "cannot determine BanyanDB image tag for integration tests"); + } + return REGISTRY + "/" + IMAGE_NAME + ":" + tag; + } + + /** + * Standalone-mode arguments matching what every BanyanDB IT in the repo has used + * historically. Tests pass the returned array through + * {@code GenericContainer.withCommand(...)} so stream and measure data land on a + * predictable in-container path for any follow-up diagnostics. 
+ */ + public static String[] standaloneCommand() { + return new String[] { + "standalone", + "--stream-root-path", "/tmp/banyandb-stream-data", + "--measure-root-path", "/tmp/banyand-measure-data", + // Drive the testing-only metadata cache wait flags down to 1s so a + // delete-measure + define cycle isn't silently absorbed by stale resolver + // cache (otherwise post-recreate writes target the old internal measure id + // and disappear). + "--measure-metadata-cache-wait-duration", "1s", + "--stream-metadata-cache-wait-duration", "1s", + "--trace-metadata-cache-wait-duration", "1s", + // Flush every 100ms so awaitDataPoints sees writes promptly. Production + // defaults are 5s for measure / property / schema-server and 1s for stream + // / trace, tuned for throughput; tests prefer end-to-end latency. + "--measure-flush-timeout", "100ms", + "--stream-flush-timeout", "100ms", + "--trace-flush-timeout", "100ms", + "--property-flush-timeout", "100ms", + }; + } +} diff --git a/oap-server/server-library/library-pprof-parser/pom.xml b/oap-server/server-library/library-pprof-parser/pom.xml index 456b93c6b33f..4c4938a6c4e8 100755 --- a/oap-server/server-library/library-pprof-parser/pom.xml +++ b/oap-server/server-library/library-pprof-parser/pom.xml @@ -38,18 +38,10 @@ com.google.protobuf protobuf-java - 3.25.5 com.google.code.gson gson - 2.10.1 - - - org.projectlombok - lombok - 1.18.30 - provided @@ -60,7 +52,7 @@ protobuf-maven-plugin 0.6.1 - com.google.protobuf:protoc:3.25.5:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} diff --git a/oap-server/server-query-plugin/traceql-plugin/pom.xml b/oap-server/server-query-plugin/traceql-plugin/pom.xml index d15dc47eadfe..74e801df2958 100644 --- a/oap-server/server-query-plugin/traceql-plugin/pom.xml +++ b/oap-server/server-query-plugin/traceql-plugin/pom.xml @@ -62,11 +62,11 @@ ${protobuf-maven-plugin.version} - 
com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/oap-server/server-receiver-plugin/aws-firehose-receiver/pom.xml b/oap-server/server-receiver-plugin/aws-firehose-receiver/pom.xml index 53672aec96a1..d6881f82b5cc 100644 --- a/oap-server/server-receiver-plugin/aws-firehose-receiver/pom.xml +++ b/oap-server/server-receiver-plugin/aws-firehose-receiver/pom.xml @@ -49,10 +49,10 @@ protobuf-java directly, you will be transitively depending on the protobuf-java version that grpc depends on. --> - com.google.protobuf:protoc:3.19.2:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:1.42.1:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/oap-server/server-receiver-plugin/envoy-metrics-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/envoy/EnvoyMetricReceiverProvider.java b/oap-server/server-receiver-plugin/envoy-metrics-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/envoy/EnvoyMetricReceiverProvider.java index 4f7b54928f81..f9887d651b91 100644 --- a/oap-server/server-receiver-plugin/envoy-metrics-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/envoy/EnvoyMetricReceiverProvider.java +++ b/oap-server/server-receiver-plugin/envoy-metrics-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/envoy/EnvoyMetricReceiverProvider.java @@ -25,6 +25,7 @@ import org.apache.skywalking.oap.server.core.oal.rt.OALEngineLoaderService; import org.apache.skywalking.oap.server.core.server.GRPCHandlerRegister; 
import org.apache.skywalking.oap.server.core.server.GRPCHandlerRegisterImpl; +import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.core.watermark.WatermarkGRPCInterceptor; import org.apache.skywalking.oap.server.library.module.ModuleDefine; import org.apache.skywalking.oap.server.library.module.ModuleProvider; @@ -156,11 +157,17 @@ public void notifyAfterCompleted() throws ServiceNotProvidedException, ModuleSta @Override public String[] requiredModules() { + // StorageModule is declared so Storage.start() runs before this provider — the + // envoy metrics rules are read via Rules.loadRules, which consults the + // RuntimeRuleOverrideResolver chain. Without this dep the DB-backed resolver would + // silently no-op at boot and DB overrides would only take effect on the next + // reconciler tick. return new String[] { TelemetryModule.NAME, CoreModule.NAME, SharingServerModule.NAME, - MeshReceiverModule.NAME + MeshReceiverModule.NAME, + StorageModule.NAME }; } } diff --git a/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/OtelMetricReceiverModule.java b/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/OtelMetricReceiverModule.java index 782d75940862..717c809a71c8 100644 --- a/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/OtelMetricReceiverModule.java +++ b/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/OtelMetricReceiverModule.java @@ -18,6 +18,7 @@ package org.apache.skywalking.oap.server.receiver.otel; +import org.apache.skywalking.oap.meter.analyzer.v2.MalConverterRegistry; import org.apache.skywalking.oap.server.library.module.ModuleDefine; import org.apache.skywalking.oap.server.receiver.otel.otlp.OpenTelemetryMetricRequestProcessor; @@ 
-30,6 +31,6 @@ public OtelMetricReceiverModule() { @Override public Class[] services() { - return new Class[] {OpenTelemetryMetricRequestProcessor.class}; + return new Class[] {OpenTelemetryMetricRequestProcessor.class, MalConverterRegistry.class}; } } diff --git a/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/OtelMetricReceiverProvider.java b/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/OtelMetricReceiverProvider.java index 30e86556de07..c4331099c02c 100644 --- a/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/OtelMetricReceiverProvider.java +++ b/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/OtelMetricReceiverProvider.java @@ -22,6 +22,8 @@ import org.apache.skywalking.oap.server.library.module.ModuleProvider; import org.apache.skywalking.oap.server.library.module.ModuleStartException; import org.apache.skywalking.oap.server.library.module.ServiceNotProvidedException; +import org.apache.skywalking.oap.meter.analyzer.v2.MalConverterRegistry; +import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.receiver.otel.otlp.OpenTelemetryMetricRequestProcessor; import org.apache.skywalking.oap.server.receiver.sharing.server.SharingServerModule; @@ -68,6 +70,10 @@ public void prepare() throws ServiceNotProvidedException, ModuleStartException { metricRequestProcessor = new OpenTelemetryMetricRequestProcessor( getManager(), config); registerServiceImplementation(OpenTelemetryMetricRequestProcessor.class, metricRequestProcessor); + // Expose the same instance under the MalConverterRegistry contract so the runtime-rule + // plugin can push / drop otel-rules converters without depending on otel-receiver's + // concrete processor class. 
+ registerServiceImplementation(MalConverterRegistry.class, metricRequestProcessor); final List enabledHandlers = config.getEnabledHandlers(); final var handlers = new ArrayList(); @@ -94,6 +100,13 @@ public void notifyAfterCompleted() throws ServiceNotProvidedException, ModuleSta @Override public String[] requiredModules() { - return new String[] {SharingServerModule.NAME}; + // StorageModule is declared so Storage.start() (and its catch-up whenCreating + // fan-out that creates the runtime_rule management table) runs before this + // provider's start(), guaranteeing the RuntimeRuleOverrideResolver's DB-backed + // resolver can load during static rule registration. Without this dep the + // module-system sort could place OTEL ahead of Storage, the resolver would + // silently no-op at boot, and DB overrides would only take effect on the + // reconciler's next tick. + return new String[] {SharingServerModule.NAME, StorageModule.NAME}; } } diff --git a/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessor.java b/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessor.java index 4c6505a12d86..45cd6129d6ad 100644 --- a/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessor.java +++ b/oap-server/server-receiver-plugin/otel-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessor.java @@ -29,6 +29,7 @@ import lombok.Getter; import lombok.RequiredArgsConstructor; import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.meter.analyzer.v2.MalConverterRegistry; import org.apache.skywalking.oap.meter.analyzer.v2.MetricConvert; import org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily; import 
org.apache.skywalking.oap.meter.analyzer.v2.prometheus.PrometheusMetricConverter; @@ -51,7 +52,9 @@ import org.apache.skywalking.oap.server.telemetry.api.MetricsTag; import java.io.IOException; +import java.util.Collections; import java.util.HashMap; +import java.util.LinkedHashMap; import java.util.List; import java.util.Map; import java.util.function.Function; @@ -59,12 +62,11 @@ import static io.opentelemetry.proto.metrics.v1.AggregationTemporality.AGGREGATION_TEMPORALITY_DELTA; import static io.opentelemetry.proto.metrics.v1.AggregationTemporality.AGGREGATION_TEMPORALITY_UNSPECIFIED; -import static java.util.stream.Collectors.toList; import static java.util.stream.Collectors.toMap; @RequiredArgsConstructor @Slf4j -public class OpenTelemetryMetricRequestProcessor implements Service { +public class OpenTelemetryMetricRequestProcessor implements Service, MalConverterRegistry { private final ModuleManager manager; @@ -99,11 +101,26 @@ public class OpenTelemetryMetricRequestProcessor implements Service { // in resource attributes (e.g., Envoy AI Gateway), it takes precedence via putIfAbsent. .put("service.name", "job_name") .build(); - // Initialized to an empty list so that {@link #processMetricsRequest} and - // {@link #toMeter} are safe no-ops when no MAL rules are enabled, instead of - // throwing NPE in {@code processMetricsRequest} (which unconditionally does - // {@code converters.forEach(...)}). - private List converters = new java.util.ArrayList<>(); + /** + * Active MAL converters, keyed by {@code ":"} so boot-time entries and + * runtime-rule entries share one namespace. A runtime {@code /addOrUpdate} for a rule that + * already has a static version replaces the boot entry in place, avoiding double-dispatch + * on ingest samples; a runtime {@code /inactivate} teardown removes the entry cleanly. + * + *
<p>
Volatile + copy-on-write: readers in {@link #processMetricsRequest} and {@link #toMeter} + * observe a consistent snapshot without taking a lock; writers replace the reference under + * {@link #convertersWriteLock}. Iteration order is preserved by {@link LinkedHashMap} so + * the behaviour matches the pre-refactor {@code List} ordering for static rules. + */ + private volatile Map converters = Collections.emptyMap(); + private final Object convertersWriteLock = new Object(); + + /** + * Catalog identifier for the {@link OpenTelemetryMetricRequestProcessor}'s MAL rules. + * Matches the on-disk directory name and the runtime-rule catalog — the REST handler + * rejects requests under any other catalog so the key namespace stays aligned. + */ + private static final String OTEL_CATALOG = "otel-rules"; @Getter(lazy = true) private final MetricsCreator metricsCreator = manager.find(TelemetryModule.NAME).provider().getService(MetricsCreator.class); @@ -148,7 +165,7 @@ public void processMetricsRequest(final ExportMetricsServiceRequest requests) { .flatMap(tryIt -> MetricConvert.log(tryIt, "Convert OTEL metric to prometheus metric")) ) ); - converters.forEach(convert -> convert.toMeter(sampleFamilies)); + converters.values().forEach(convert -> convert.toMeter(sampleFamilies)); }); } } @@ -160,7 +177,41 @@ public void processMetricsRequest(final ExportMetricsServiceRequest requests) { * MAL converters configured via enabledOtelMetricsRules. */ public void toMeter(final ImmutableMap sampleFamilies) { - converters.forEach(convert -> convert.toMeter(sampleFamilies)); + converters.values().forEach(convert -> convert.toMeter(sampleFamilies)); + } + + /** + * Install or replace a single MAL converter identified by {@code key}. Thread-safe against + * concurrent readers and other writers; readers observe either the pre-call snapshot or the + * post-call snapshot, never a torn intermediate state. 
Called by the runtime-rule plugin + * when an operator's {@code /addOrUpdate} commits a new MAL bundle under the + * {@code otel-rules} catalog; boot-time loading also uses this method so there is exactly + * one installation path. + */ + @Override + public void addOrReplaceConverter(final String key, final MetricConvert convert) { + synchronized (convertersWriteLock) { + final Map copy = new LinkedHashMap<>(converters); + copy.put(key, convert); + converters = Collections.unmodifiableMap(copy); + } + } + + /** + * Drop the MAL converter previously installed under {@code key}. No-op if the key is not + * present — {@code /delete} on a runtime rule that already tore down on this node shouldn't + * surface an error. + */ + @Override + public void removeConverter(final String key) { + synchronized (convertersWriteLock) { + if (!converters.containsKey(key)) { + return; + } + final Map copy = new LinkedHashMap<>(converters); + copy.remove(key); + converters = Collections.unmodifiableMap(copy); + } } public void start() throws ModuleStartException { @@ -181,10 +232,9 @@ public void start() throws ModuleStartException { } final MeterSystem meterSystem = manager.find(CoreModule.NAME).provider().getService(MeterSystem.class); - converters = rules - .stream() - .map(r -> new MetricConvert(r, meterSystem)) - .collect(toList()); + for (final Rule rule : rules) { + addOrReplaceConverter(OTEL_CATALOG + ":" + rule.getName(), new MetricConvert(rule, meterSystem)); + } } private static Map buildLabels(List kvs) { diff --git a/oap-server/server-receiver-plugin/otel-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessorConverterRegistryTest.java b/oap-server/server-receiver-plugin/otel-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessorConverterRegistryTest.java new file mode 100644 index 000000000000..6e3d66489c52 --- /dev/null +++ 
b/oap-server/server-receiver-plugin/otel-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/otel/otlp/OpenTelemetryMetricRequestProcessorConverterRegistryTest.java @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.skywalking.oap.server.receiver.otel.otlp; + +import org.apache.skywalking.oap.meter.analyzer.v2.MetricConvert; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.otel.OtelMetricReceiverConfig; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; +import static org.mockito.Mockito.mock; + +/** + * Tests the {@link org.apache.skywalking.oap.meter.analyzer.v2.MalConverterRegistry} + * behaviour that {@link OpenTelemetryMetricRequestProcessor} exposes. We go through the + * public interface only — add/replace/remove + toMeter are the operator-visible contract. + * + *
<p>
We don't exercise the processMetricsRequest OTLP path here because that pulls in + * proto-message fixtures the existing integration tests cover. What we target is the + * hot-update concurrency contract: registry mutations don't NPE ingest, and + * removeConverter on an absent key is a clean no-op (so {@code /delete} never errors + * out on an already-torn-down peer). + */ +class OpenTelemetryMetricRequestProcessorConverterRegistryTest { + + @Test + void addOrReplaceThenRemoveRoundTrips() { + final OpenTelemetryMetricRequestProcessor proc = newProcessor(); + final MetricConvert convert = mock(MetricConvert.class); + + // Add, replace with a different converter, then remove. Each call must be side-effect + // free beyond its intended mutation — addOrReplaceConverter never throws, even when + // the same key is rebound back-to-back (runtime-rule FILTER_ONLY swap re-uses the + // existing key, so this is the hot path). + proc.addOrReplaceConverter("otel-rules:vm", convert); + proc.addOrReplaceConverter("otel-rules:vm", mock(MetricConvert.class)); + proc.removeConverter("otel-rules:vm"); + } + + @Test + void removeConverterOnAbsentKeyIsIdempotent() { + // /delete on a bundle this peer already tore down (out-of-order tick firing) must + // not raise. The reconciler calls dropRuntimeMalConverter defensively; if the key is + // missing that's "already converged", not a failure. + final OpenTelemetryMetricRequestProcessor proc = newProcessor(); + + assertDoesNotThrow(() -> proc.removeConverter("otel-rules:nonexistent")); + } + + @Test + void toMeterDoesNotThrowWithNoConverters() { + // Fresh processor — converters map is empty. Samples arriving here must be silently + // dropped (not NPE) so SpanListener code that feeds MetricKit samples via toMeter + // doesn't have to guard for "runtime-rule not wired yet". 
+ final OpenTelemetryMetricRequestProcessor proc = newProcessor(); + + assertDoesNotThrow(() -> proc.toMeter(com.google.common.collect.ImmutableMap.of())); + } + + private static OpenTelemetryMetricRequestProcessor newProcessor() { + return new OpenTelemetryMetricRequestProcessor(mock(ModuleManager.class), + mock(OtelMetricReceiverConfig.class)); + } +} diff --git a/oap-server/server-receiver-plugin/pom.xml b/oap-server/server-receiver-plugin/pom.xml index 068a86167c62..56c5fca9395f 100644 --- a/oap-server/server-receiver-plugin/pom.xml +++ b/oap-server/server-receiver-plugin/pom.xml @@ -50,6 +50,7 @@ aws-firehose-receiver skywalking-async-profiler-receiver-plugin skywalking-pprof-receiver-plugin + skywalking-runtime-rule-receiver-plugin diff --git a/oap-server/server-receiver-plugin/receiver-proto/pom.xml b/oap-server/server-receiver-plugin/receiver-proto/pom.xml index aaff44d8c39a..fd83c06c87c8 100644 --- a/oap-server/server-receiver-plugin/receiver-proto/pom.xml +++ b/oap-server/server-receiver-plugin/receiver-proto/pom.xml @@ -98,11 +98,11 @@ protobuf-java version that grpc depends on. 
--> - com.google.protobuf:protoc:${com.google.protobuf.protoc.version}:exe:${os.detected.classifier} + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:${protoc-gen-grpc-java.plugin.version}:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/pom.xml b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/pom.xml new file mode 100644 index 000000000000..da9ae62e02b8 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/pom.xml @@ -0,0 +1,161 @@ + + + + + + server-receiver-plugin + org.apache.skywalking + ${revision} + + 4.0.0 + + skywalking-runtime-rule-receiver-plugin + jar + + + + org.apache.skywalking + server-core + ${project.version} + + + org.apache.skywalking + library-server + ${project.version} + + + + org.apache.skywalking + meter-analyzer + ${project.version} + + + + org.apache.skywalking + log-analyzer + ${project.version} + + + + org.apache.commons + commons-compress + + + + io.grpc + grpc-protobuf + ${grpc.version} + + + io.grpc + grpc-stub + ${grpc.version} + + + + org.apache.skywalking + library-integration-test + ${project.version} + test + + + org.apache.skywalking + library-banyandb-client + ${project.version} + test + + + + org.apache.skywalking + storage-banyandb-plugin + ${project.version} + test + + + org.testcontainers + testcontainers + test + + + org.testcontainers + junit-jupiter + test + + + + io.grpc + grpc-testing + test + + + + + + + org.xolstice.maven.plugins + protobuf-maven-plugin + ${protobuf-maven-plugin.version} + + + com.google.protobuf:protoc:${protobuf-java.version}:exe:${os.detected.classifier} + + grpc-java + + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} + + + + + + compile + compile-custom + + + + + + + diff --git 
a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DSLDelta.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DSLDelta.java new file mode 100644 index 000000000000..40938aa1299f --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DSLDelta.java @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.apply; + +import java.util.Collections; +import java.util.HashSet; +import java.util.Set; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.Classification; + +/** + * Structured result of classifying a new rule bundle against the currently-running one on this + * node. Carries both the coarse classification and the fine-grained per-metric delta needed + * by the MAL apply path (per-metric-name shape diff) and by {@code AlarmKernelService.reset}. + * + *
<p>
Immutable. The classifier builds one of these; the apply pipeline consumes it. + */ +public final class DSLDelta { + + private final Classification classification; + /** Metric names added in the new bundle vs the old one. */ + private final Set addedMetrics; + /** Metric names present in the old bundle but not the new. */ + private final Set removedMetrics; + /** + * Metric names present in both with a shape change (function or scope moved). Trigger the + * per-metric remove+add sequence in {@link org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem}. + */ + private final Set shapeBreakMetrics; + /** Human-readable explanation for log lines and HTTP response bodies. */ + private final String reason; + + public DSLDelta(final Classification classification, + final Set addedMetrics, + final Set removedMetrics, + final Set shapeBreakMetrics, + final String reason) { + this.classification = classification; + this.addedMetrics = safe(addedMetrics); + this.removedMetrics = safe(removedMetrics); + this.shapeBreakMetrics = safe(shapeBreakMetrics); + this.reason = reason == null ? 
"" : reason; + } + + public static DSLDelta noChange() { + return new DSLDelta(Classification.NO_CHANGE, + Collections.emptySet(), Collections.emptySet(), Collections.emptySet(), + "content byte-identical"); + } + + public static DSLDelta newRule(final Set metrics) { + return new DSLDelta(Classification.NEW, + metrics, Collections.emptySet(), Collections.emptySet(), + "new (catalog, name) on this node"); + } + + public static DSLDelta filterOnly(final String reason) { + return new DSLDelta(Classification.FILTER_ONLY, + Collections.emptySet(), Collections.emptySet(), Collections.emptySet(), reason); + } + + public static DSLDelta structural(final Set added, + final Set removed, + final Set shapeBreak, + final String reason) { + return new DSLDelta(Classification.STRUCTURAL, added, removed, shapeBreak, reason); + } + + public Classification classification() { + return classification; + } + + public Set addedMetrics() { + return addedMetrics; + } + + public Set removedMetrics() { + return removedMetrics; + } + + public Set shapeBreakMetrics() { + return shapeBreakMetrics; + } + + public String reason() { + return reason; + } + + /** + * Set of metric names whose semantics moved and whose alarm windows should therefore be + * reset at the tail of a successful apply. Empty for FILTER_ONLY and NO_CHANGE; union of + * {added + removed + shapeBreak} for STRUCTURAL; empty for NEW because no prior windows + * exist. + */ + public Set alarmResetSet() { + if (classification != Classification.STRUCTURAL) { + return Collections.emptySet(); + } + final HashSet union = new HashSet<>(); + union.addAll(addedMetrics); + union.addAll(removedMetrics); + union.addAll(shapeBreakMetrics); + return Collections.unmodifiableSet(union); + } + + private static Set safe(final Set src) { + return src == null || src.isEmpty() + ? 
Collections.emptySet() + : Collections.unmodifiableSet(new HashSet<>(src)); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DeltaClassifier.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DeltaClassifier.java new file mode 100644 index 000000000000..5548bfba9240 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DeltaClassifier.java @@ -0,0 +1,351 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.apply; + +import java.io.StringReader; +import java.util.Collections; +import java.util.HashSet; +import java.util.LinkedHashMap; +import java.util.LinkedHashSet; +import java.util.Map; +import java.util.Objects; +import java.util.Set; +import org.apache.skywalking.oap.log.analyzer.v2.provider.LALConfig; +import org.apache.skywalking.oap.log.analyzer.v2.provider.LALConfigs; +import org.apache.skywalking.oap.meter.analyzer.v2.prometheus.rule.MetricsRule; +import org.apache.skywalking.oap.meter.analyzer.v2.prometheus.rule.Rule; +import org.yaml.snakeyaml.Yaml; + +/** + * Two-path classifier. Given a rule file's old and new content (byte strings), produces a + * {@link DSLDelta} that drives the dslManager's apply strategy and the alarm-window + * reset set. + * + *

<p>Return contract by case: + * <ul> + *   <li>{@code oldContent == newContent} (byte-identical) → {@link DSLDelta#noChange}. The + * dslManager short-circuits — no compile, no swap, no DB touch.</li> + *   <li>{@code oldContent == null} (first-time on this node) → {@link DSLDelta#newRule}. The + * metric-name set surfaces as {@code addedMetrics}; {@code alarmResetSet()} is empty + * because no prior windows exist.</li> + *   <li>Any other change → {@link DSLDelta#filterOnly} when every metric keeps its name and + * shape, otherwise {@link DSLDelta#structural}. For STRUCTURAL, the added / removed / + * shape-break sets drive per-metric remove+add in the applier and the alarm-reset target + * set in {@code AlarmKernelService.reset}.</li> + * </ul> + * + * <p>FILTER_ONLY is emitted when every metric name is present in both bundles with identical + * shape (same {@code (functionName, scopeType)} tuple per {@link MalShapeExtractor}). The + * dslManager's fast path then skips DDL, alarm reset, and L1/L2 drain. + * + * <p>STRUCTURAL is emitted when any metric's shape differs, or when metric names were added / + * removed between old and new. The {@code shapeBreak} set in {@link DSLDelta} carries the + * precise set of metrics whose shape moved — driving {@code alarmResetSet()} to the minimal + * correct target instead of a blanket reset, and feeding the {@code allowStorageChange} + * guardrail on the REST handler that rejects shape-breaking edits unless explicitly opted in. + * + *

Shape extraction failure — an unparseable MAL expression on either side — falls back to + * the safe super-set: STRUCTURAL with every common metric in shape-break. Alarms reset more + * often than strictly required; one evaluation period is enough to self-heal. + */ +public final class DeltaClassifier { + + private DeltaClassifier() { + } + + /** + * Classify a MAL rule-file delta. Parses both YAMLs, enumerates metric names by the same + * {@code metricPrefix + "_" + ruleName} rule {@code MetricConvert} uses at apply time, and + * returns the {@link DSLDelta} the dslManager/REST handler should act on. + * + *

Null {@code newContent} is treated as "removing the bundle" — STRUCTURAL with every + * previously-owned name in {@code removedMetrics}. Null {@code oldContent} is NEW. If both + * YAMLs parse but either has no {@code metricsRules}, that side enumerates to the empty set + * and the classification reflects the delta: STRUCTURAL with every name on the other side + * added or removed, or FILTER_ONLY when both sides are empty (a degenerate but legal state, + * e.g. an operator writing a valid-but-empty rules list). + * + * @throws IllegalArgumentException when either YAML is malformed; the dslManager catches + * this and surfaces it as an apply error rather than losing bundle state. + */ + public static DSLDelta classifyMal(final String oldContent, final String newContent) { + if (newContent == null) { + final Set<String> removed = safeEnumerateMalNames(oldContent); + return DSLDelta.structural( + Collections.emptySet(), removed, Collections.emptySet(), + "bundle removed"); + } + if (oldContent != null && oldContent.equals(newContent)) { + return DSLDelta.noChange(); + } + final Set<String> newMetrics = enumerateMalNames(newContent); + if (oldContent == null) { + return DSLDelta.newRule(newMetrics); + } + final Set<String> oldMetrics = enumerateMalNames(oldContent); + final Set<String> added = minus(newMetrics, oldMetrics); + final Set<String> removed = minus(oldMetrics, newMetrics); + final Set<String> commonMetrics = intersect(oldMetrics, newMetrics); + + // Per-metric shape diff. For every metric present in both the old and + // new bundle, extract (functionName, scopeType) from its MAL expression and compare. + // A shape diff on even one metric => STRUCTURAL with that metric in shape-break; no + // shape diff across all commons plus no adds/removes => FILTER_ONLY (body tweaks, + // no storage move). Shape-extract failures (unparseable expression on either side) + // fall back conservatively to shape-break for that metric. 
+ final Map oldShapes; + final Map newShapes; + try { + oldShapes = MalShapeExtractor.extract(oldContent); + newShapes = MalShapeExtractor.extract(newContent); + } catch (final RuntimeException se) { + // Extraction threw — fall back to conservative STRUCTURAL with every common + // metric in shape-break. Safe superset; never reports FILTER_ONLY for a bundle + // we couldn't fully analyse. + return DSLDelta.structural(added, removed, commonMetrics, + "shape extract failed: " + se.getMessage()); + } + + final Set shapeBreak = new LinkedHashSet<>(); + for (final String name : commonMetrics) { + final MalShapeExtractor.MalShape oldShape = oldShapes.get(name); + final MalShapeExtractor.MalShape newShape = newShapes.get(name); + if (oldShape == null || newShape == null || !oldShape.equals(newShape)) { + // Missing shape on either side = unknown = treat as shape-break (conservative). + // Mismatch = true shape break. + shapeBreak.add(name); + } + } + + if (added.isEmpty() && removed.isEmpty() && shapeBreak.isEmpty()) { + // Same metric-name set, same shape for every one. Safe to skip DDL, alarm reset, + // and L1/L2 drain — the fast path. + return DSLDelta.filterOnly("body/filter/tag edits only (shapes unchanged)"); + } + + final Set shapeBreakFrozen = shapeBreak.isEmpty() + ? Collections.emptySet() + : Collections.unmodifiableSet(shapeBreak); + final String reason = reasonFor(added, removed, shapeBreakFrozen); + return DSLDelta.structural(added, removed, shapeBreakFrozen, reason); + } + + /** + * Classify a LAL rule-file delta. LAL has no direct metric-name target for alarm windows + * (rule keys are {@code (layer, ruleName)} pairs, not metric names), so the added/removed/ + * shape-break sets are left empty here. When inline-MAL extraction lands in a follow-up + * (LAL→MAL chain), those synthetic metric names will flow into the shape-break set and + * drive {@link DSLDelta#alarmResetSet}. + * + *

For now this just distinguishes NO_CHANGE vs NEW vs STRUCTURAL based on content + * identity; the dslManager uses the classification to log the intended path and to avoid + * re-applying byte-identical LAL content. + */ + public static DSLDelta classifyLal(final String oldContent, final String newContent) { + if (newContent == null) { + return DSLDelta.structural( + Collections.emptySet(), Collections.emptySet(), Collections.emptySet(), + "bundle removed"); + } + if (oldContent != null && oldContent.equals(newContent)) { + return DSLDelta.noChange(); + } + if (oldContent == null) { + // The Set parameter documents *claimed* rule keys, not metric names; we + // keep it empty so alarmResetSet() is empty on NEW (matches MAL behaviour — no + // prior windows existed). + return DSLDelta.newRule(Collections.emptySet()); + } + return DSLDelta.structural( + Collections.emptySet(), Collections.emptySet(), Collections.emptySet(), + "LAL content changed"); + } + + /** + * Detect whether a LAL update moves any rule's {@code outputType} — the FQCN of the + * {@code AbstractLog} subclass the sink dispatches to. A change here reroutes log records + * to a different storage-backed subclass, which on BanyanDB means a different measure + * (and on all backends means any previously-indexed rows for the old type are orphaned). + * The REST handler's {@code allowStorageChange} guardrail treats a non-empty return as a + * storage-level edit. + * + * <p>Also treats rule additions / removals as "storage-affecting" because the inline-MAL + * metrics a new/removed rule declares would flow into {@code MeterSystem.removeMetric} + * and trigger a measure drop on BanyanDB. + * + * @return set of {@code layer:name} rule keys whose outputType changed, plus keys added or + * removed between old and new. Empty when the rule key set is identical and no + * shared rule's outputType changed (the safe path). + */ + public static Set<String> lalStorageAffectingChanges(final String oldContent, final String newContent) { + if (oldContent == null || newContent == null) { + // Either side absent means the whole bundle is being added or removed — the + // caller already treats this as a major event; a non-empty return here just + // confirms it at the fine-grained level. + return oldContent == null && newContent == null + ? Collections.emptySet() + : enumerateLalRuleKeys(newContent == null ? oldContent : newContent); + } + final Map<String, String> oldOut = lalRuleOutputTypes(oldContent); + final Map<String, String> newOut = lalRuleOutputTypes(newContent); + final Set<String> out = new LinkedHashSet<>(); + // Added rules (new side has a key the old side doesn't). + for (final String key : newOut.keySet()) { + if (!oldOut.containsKey(key)) { + out.add(key); + } + } + // Removed rules. + for (final String key : oldOut.keySet()) { + if (!newOut.containsKey(key)) { + out.add(key); + } + } + // Changed outputType on a shared rule (a dropped outputType counts as a change). + for (final Map.Entry<String, String> e : oldOut.entrySet()) { + final String newVal = newOut.get(e.getKey()); + if (newOut.containsKey(e.getKey()) && !Objects.equals(nullToEmpty(e.getValue()), nullToEmpty(newVal))) { + out.add(e.getKey()); + } + } + return out.isEmpty() ? Collections.emptySet() : Collections.unmodifiableSet(out); + } + + private static Map<String, String> lalRuleOutputTypes(final String content) { + if (content == null || content.isEmpty()) { + return Collections.emptyMap(); + } + try (StringReader r = new StringReader(content)) { + final LALConfigs configs = new Yaml().loadAs(r, LALConfigs.class); + if (configs == null || configs.getRules() == null) { + return Collections.emptyMap(); + } + final Map<String, String> out = new LinkedHashMap<>(); + for (final LALConfig c : configs.getRules()) { + final String layer = c.getLayer() == null || c.getLayer().isEmpty() + ? "auto" : c.getLayer(); + out.put(layer + ":" + c.getName(), c.getOutputType()); + } + return Collections.unmodifiableMap(out); + } catch (final Throwable t) { + throw new IllegalArgumentException( + "LAL YAML parse failure while extracting outputType: " + t.getMessage(), t); + } + } + + private static String nullToEmpty(final String s) { + return s == null ? "" : s; + } + + /** + * Extract the set of {@code (layer, ruleName)} keys a LAL file declares. Surfaced here so + * the dslManager can run its cross-file collision check without re-parsing — + * {@link LalFileApplier#planKeys} does the same work; this is a lower-dep alternative when + * we only need the enumerated set, not an applier round-trip. + */ + public static Set<String> enumerateLalRuleKeys(final String content) { + final Set<String> out = new LinkedHashSet<>(); + if (content == null || content.isEmpty()) { + return Collections.unmodifiableSet(out); + } + try (StringReader r = new StringReader(content)) { + final LALConfigs configs = new Yaml().loadAs(r, LALConfigs.class); + if (configs == null || configs.getRules() == null) { + return Collections.unmodifiableSet(out); + } + for (final LALConfig c : configs.getRules()) { + final String layer = c.getLayer() == null || c.getLayer().isEmpty() + ? 
"auto" : c.getLayer(); + out.add(layer + ":" + c.getName()); + } + } catch (final Throwable t) { + throw new IllegalArgumentException( + "LAL YAML parse failure while enumerating rule keys: " + t.getMessage(), t); + } + return Collections.unmodifiableSet(out); + } + + private static Set safeEnumerateMalNames(final String content) { + if (content == null || content.isEmpty()) { + return Collections.emptySet(); + } + try { + return enumerateMalNames(content); + } catch (final RuntimeException e) { + // A malformed old-side shouldn't block removal — the bundle we're deleting was + // parseable once and has a known metric set in the applier's Applied record; the + // dslManager uses that as ground truth, so a parse failure here is diagnostic only. + return Collections.emptySet(); + } + } + + private static Set enumerateMalNames(final String content) { + try (StringReader reader = new StringReader(content)) { + final Rule rule = new Yaml().loadAs(reader, Rule.class); + if (rule == null || rule.getMetricsRules() == null) { + return Collections.emptySet(); + } + final Set out = new LinkedHashSet<>(); + final String prefix = rule.getMetricPrefix(); + if (prefix == null) { + return Collections.emptySet(); + } + for (final MetricsRule r : rule.getMetricsRules()) { + if (r.getName() != null) { + out.add(prefix + "_" + r.getName()); + } + } + return Collections.unmodifiableSet(out); + } catch (final Throwable t) { + throw new IllegalArgumentException( + "MAL YAML parse failure: " + t.getMessage(), t); + } + } + + private static Set minus(final Set a, final Set b) { + final Set r = new LinkedHashSet<>(a); + r.removeAll(b); + return r.isEmpty() ? Collections.emptySet() : Collections.unmodifiableSet(r); + } + + private static Set intersect(final Set a, final Set b) { + final Set r = new HashSet<>(a); + r.retainAll(b); + return r.isEmpty() ? 
Collections.emptySet() : Collections.unmodifiableSet(r); + } + + private static String reasonFor(final Set added, final Set removed, + final Set shapeBreak) { + final StringBuilder sb = new StringBuilder(); + if (!added.isEmpty()) { + sb.append("added=").append(added.size()); + } + if (!removed.isEmpty()) { + if (sb.length() > 0) { + sb.append(", "); + } + sb.append("removed=").append(removed.size()); + } + if (!shapeBreak.isEmpty()) { + if (sb.length() > 0) { + sb.append(", "); + } + sb.append("possibly-shape-changed=").append(shapeBreak.size()); + } + return sb.length() == 0 ? "content changed" : sb.toString(); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplier.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplier.java new file mode 100644 index 000000000000..5345312bdd20 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplier.java @@ -0,0 +1,396 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.apply; + +import java.io.StringReader; +import java.util.ArrayList; +import java.util.Collections; +import java.util.LinkedHashSet; +import java.util.List; +import java.util.Set; +import javassist.ClassPool; +import javassist.LoaderClassPath; +import lombok.Getter; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.log.analyzer.v2.module.LogAnalyzerModule; +import org.apache.skywalking.oap.log.analyzer.v2.provider.LALConfig; +import org.apache.skywalking.oap.server.core.analysis.Layer; +import org.apache.skywalking.oap.log.analyzer.v2.provider.LALConfigs; +import org.apache.skywalking.oap.log.analyzer.v2.provider.log.listener.LogFilterListener; +import org.apache.skywalking.oap.server.core.classloader.Catalog; +import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; +import org.apache.skywalking.oap.server.core.classloader.RuleClassLoader; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.EngineApplied; +import org.yaml.snakeyaml.Yaml; + +/** + * Turns a runtime-rule LAL file (the {@code lal} catalog) into live DSL entries on this OAP + * node. Uses {@link LogFilterListener.Factory}'s compile helper + addOrReplace / remove API so + * the runtime path hits exactly the same DSL registry the log-analysis pipeline consults — + * there is no duplicate LAL wiring. + * + *

A LAL YAML file holds a list of rules, each with its own {@code name + layer + dsl + + * inputType + outputType}. Apply compiles every rule in the file and registers it; remove + * unwinds exactly the set of (layer, ruleName) pairs the earlier apply put in place. + */ +@Slf4j +public class LalFileApplier { + + private final LogFilterListener.Factory factory; + + public LalFileApplier(final LogFilterListener.Factory factory) { + this.factory = factory; + } + + /** + * Parse the YAML and return the list of {@code (layer, ruleName)} keys the file will claim + * if applied. Lets the dslManager detect cross-file collisions before any compile work — + * the file's rules are rejected before Factory.addOrReplace sees them. + * + *

This is a read-only inspection; no compile, no side effects. Throws + * {@link ApplyException} only on YAML parse failure (not on DSL content — that is caught + * later inside {@link #apply}). + */ + public List planKeys(final String yamlContent, final String sourceName) throws ApplyException { + final List configs = parse(yamlContent, sourceName); + final List keys = new ArrayList<>(configs.size()); + for (final LALConfig c : configs) { + final boolean isAuto = LALConfig.LAYER_AUTO.equalsIgnoreCase(c.getLayer()); + final Layer layer = isAuto + ? null + : Layer.nameOf(c.getLayer()); + keys.add(new RegisteredRule(layer, c.getName())); + } + return keys; + } + + /** + * Parse + compile + register every rule declared in the YAML content. + * + * @param yamlContent raw YAML bytes of the rule file (byte-identical to the POSTed body) + * @param sourceName informational identifier used in Javassist's SourceFile attribute so + * generated bytecode shows the originating rule file in stack traces. + * Convention: {@code catalog + "/" + name}. + * @return {@link Applied} bundle with the list of registered rule keys so the next + * update/delete can unwind via {@link #remove(Applied)}. + * @throws ApplyException on YAML parse error or DSL compile error. Partial registration + * is rolled back by the caller via {@link #remove(Applied)} with the partially- + * populated {@code partial} list. + */ + public Applied apply(final String yamlContent, final String sourceName) throws ApplyException { + return apply(yamlContent, sourceName, ""); + } + + /** + * Parse + compile + register, with the content hash threaded through so the per-file + * {@link RuleClassLoader} that owns every generated {@code LalExpression} for this apply + * carries a traceable identity through the {@code ClassLoaderGc}. 
+ */ + public Applied apply(final String yamlContent, final String sourceName, + final String contentHash) throws ApplyException { + return apply(yamlContent, sourceName, contentHash, DSLClassLoaderManager.Kind.RUNTIME); + } + + /** + * Origin-tagged overload: {@link DSLClassLoaderManager.Kind#STATIC} mints a {@code static:} + * loader so the static fall-over path (bundled rule serving again after the runtime + * override is removed) is distinguishable from the runtime path in logs and diagnostics. + */ + public Applied apply(final String yamlContent, final String sourceName, + final String contentHash, + final DSLClassLoaderManager.Kind kind) throws ApplyException { + final List configs = parse(yamlContent, sourceName); + + // One per-file RuleClassLoader for the whole file — every rule inside shares it, so all + // generated LalExpression classes (one per rule) drop together on unregister when the + // manager retires the loader. The pool is parented to the default pool so shipped + // classes (LalExpression interface, FilterSpec, ExecutionContext, LogBuilder, layer + // SPI output types) resolve via parent-first lookup; LoaderClassPath ensures Javassist + // can write new subclasses back into this loader via defineClass. + final int firstSlash = sourceName.indexOf('/'); + final Catalog catalog = firstSlash > 0 + ? Catalog.of(sourceName.substring(0, firstSlash)) + : Catalog.LAL; + final String ruleName = firstSlash > 0 + ? sourceName.substring(firstSlash + 1) + : sourceName; + final RuleClassLoader ruleLoader = DSLClassLoaderManager.INSTANCE.newBuilder( + catalog, ruleName, kind, contentHash); + final ClassPool pool = new ClassPool(ClassPool.getDefault()); + pool.appendClassPath(new LoaderClassPath(ruleLoader)); + + // Two-phase apply at file granularity, mirroring the MAL restructure: + // + // Phase 1 — compile ALL rules under the per-file loader. No factory.addOrReplace, no + // registry mutation. 
If any rule's DSL fails to parse, the whole file apply aborts + // with an empty partial list — nothing was ever registered, nothing to roll back. + // The per-file loader's compiled classes die with the (throwaway) loader on the + // exception propagation. + // + // Phase 2 — atomically swap into the factory registry. factory.addOrReplace is a + // volatile map write; a partial-failure window here is theoretical (Map.put + // doesn't throw). We still track progress per-rule so if somehow the JVM throws + // during phase 2, the caller's rollback list is accurate. + final List<LogFilterListener.Factory.CompiledLAL> compiled = new ArrayList<>(configs.size()); + for (final LALConfig c : configs) { + c.setSourceName(sourceName); + try { + compiled.add(factory.compile(c, pool, ruleLoader)); + } catch (final Throwable t) { + // Compile-phase failure: zero registrations landed, so partial is empty. + throw new ApplyException( + "LAL compile failed for rule '" + c.getName() + "' in " + sourceName, + t, Collections.emptyList()); + } + } + + final List<RegisteredRule> registered = new ArrayList<>(); + for (final LogFilterListener.Factory.CompiledLAL x : compiled) { + try { + // No cross-file collision guard runs here: the dslManager rejects a file whose + // (layer, ruleName) keys collide with another LAL file up front, using planKeys, + // so any entry addOrReplace displaces at this point is this file's own prior + // version. That self-replace is safe because Phase 1 already succeeded and + // addOrReplace is the intended atomic takeover. + factory.addOrReplace(x); + registered.add(new RegisteredRule(x.layer, x.ruleName)); + } catch (final Throwable t) { + throw new ApplyException( + "LAL register failed for rule '" + x.ruleName + "' in " + sourceName, + t, Collections.unmodifiableList(new ArrayList<>(registered))); + } + } + return new Applied(sourceName, Collections.unmodifiableList(registered), ruleLoader); + } + + /** + * Reverse of {@link #apply}: drop every (layer, ruleName) the previous apply registered. 
+ * Safe to call with a partially-populated {@link Applied} (e.g. from {@link ApplyException}). + */ + public void remove(final Applied applied) { + if (applied == null || applied.getRegistered().isEmpty()) { + return; + } + for (final RegisteredRule r : applied.getRegistered()) { + try { + factory.remove(r.getLayer(), r.getRuleName()); + } catch (final Throwable t) { + log.warn("runtime-rule LAL remove: failed to remove (layer={}, rule={})", + r.getLayer(), r.getRuleName(), t); + } + } + } + + /** + * Parse raw YAML and return the {@link RegisteredRule} keys the LAL file would own. + * Static-only variant — does not compile, does not register, does not construct a + * per-file classloader. Used by the dslManager's teardown path when it needs to drop + * boot-registered LAL rules for a static-only bundle but has no {@link Applied} in + * {@code appliedLal} to consult (first operator {@code /inactivate} of a shipped LAL + * file). + * + *

Returns an empty list on any parse failure — teardown is best-effort, and a + * malformed static rule cannot own live handlers anyway. {@code layer:auto} rules are + * surfaced as entries with {@code layer == null} so the caller can route them through + * the factory's auto-rule removal path. + */ + public static List parseRuleKeys(final String yamlContent, final String sourceName) { + if (yamlContent == null || yamlContent.isEmpty()) { + return Collections.emptyList(); + } + try (StringReader reader = new StringReader(yamlContent)) { + final LALConfigs configs = new Yaml().loadAs(reader, LALConfigs.class); + if (configs == null || configs.getRules() == null) { + return Collections.emptyList(); + } + final List out = new ArrayList<>(configs.getRules().size()); + for (final LALConfig c : configs.getRules()) { + if (c.getName() == null || c.getName().isEmpty()) { + continue; + } + final Layer layer; + if (c.getLayer() == null || LALConfig.LAYER_AUTO.equalsIgnoreCase(c.getLayer())) { + layer = null; + } else { + try { + layer = Layer.valueOf(c.getLayer()); + } catch (final IllegalArgumentException bad) { + // Unknown layer string in the YAML — skip this rule rather than abort; + // teardown is best-effort. 
+ log.warn("runtime-rule: LAL static rule '{}' has unknown layer '{}' in {}; " + + "skipping from teardown enumeration", c.getName(), c.getLayer(), sourceName); + continue; + } + } + out.add(new RegisteredRule(layer, c.getName())); + } + return Collections.unmodifiableList(out); + } catch (final Throwable t) { + log.warn("runtime-rule: failed to parse static LAL content for {} — no rule keys " + + "enumerated for teardown", sourceName, t); + return Collections.emptyList(); + } + } + + private List<LALConfig> parse(final String yamlContent, final String sourceName) throws ApplyException { + try (StringReader reader = new StringReader(yamlContent)) { + final LALConfigs configs = new Yaml().loadAs(reader, LALConfigs.class); + if (configs == null || configs.getRules() == null || configs.getRules().isEmpty()) { + throw new ApplyException( + "LAL YAML parsed to empty/malformed — no rules list in " + sourceName, + null, Collections.emptyList()); + } + return configs.getRules(); + } catch (final ApplyException e) { + throw e; + } catch (final Throwable t) { + throw new ApplyException("LAL YAML parse failure for " + sourceName, t, + Collections.emptyList()); + } + } + + /** Result of a successful {@link #apply} — retained so the next update/delete can unwind. */ + public static final class Applied implements EngineApplied { + @Getter + private final String sourceName; + @Getter + private final List<RegisteredRule> registered; + /** + * Per-file loader that owns every generated {@code LalExpression} class for this apply. + * Retained as a strong reference so the classes stay live while the bundle is ACTIVE; + * the dslManager retires it through {@code ClassLoaderGc} on unregister so GC is + * observable. Null only when built through the 2-arg constructor; the legacy 2-arg + * {@link #apply(String, String)} entry point, kept for tests, still mints a loader. 
+ */ + @Getter + private final RuleClassLoader ruleClassLoader; + + public Applied(final String sourceName, final List registered) { + this(sourceName, registered, null); + } + + public Applied(final String sourceName, final List registered, + final RuleClassLoader ruleClassLoader) { + this.sourceName = sourceName; + this.registered = registered; + this.ruleClassLoader = ruleClassLoader; + } + + @Override + public int suspendDispatch(final ModuleManager moduleManager) { + if (registered == null || registered.isEmpty()) { + return 0; + } + try { + final LogFilterListener.Factory f = moduleManager.find(LogAnalyzerModule.NAME) + .provider() + .getService(LogFilterListener.Factory.class); + final List keys = ruleKeys(); + f.suspend(keys); + return keys.size(); + } catch (final Throwable t) { + log.warn("runtime-rule LAL Applied: suspendDispatch lookup failed; " + + "next tick retries.", t); + return 0; + } + } + + @Override + public int resumeDispatch(final ModuleManager moduleManager) { + if (registered == null || registered.isEmpty()) { + return 0; + } + try { + final LogFilterListener.Factory f = moduleManager.find(LogAnalyzerModule.NAME) + .provider() + .getService(LogFilterListener.Factory.class); + final List keys = ruleKeys(); + f.resume(keys); + return keys.size(); + } catch (final Throwable t) { + log.warn("runtime-rule LAL Applied: resumeDispatch lookup failed; " + + "next tick retries.", t); + return 0; + } + } + + /** Cross-file ownership uses {@code (layer, ruleName)} keys: another active LAL + * bundle declaring the same key would overwrite this one's handler. 
*/ + @Override + public Set claimedKeys() { + if (registered == null || registered.isEmpty()) { + return Collections.emptySet(); + } + final LinkedHashSet out = new LinkedHashSet<>(); + for (final RegisteredRule r : registered) { + out.add(LogFilterListener.Factory.ruleKey(r.getLayer(), r.getRuleName())); + } + return Collections.unmodifiableSet(out); + } + + @Override + public Object classLoader() { + return ruleClassLoader; + } + + /** LAL has no alarm semantics — alarm windows key off metric names, not log rules. */ + @Override + public Set alarmResetTargets() { + return Collections.emptySet(); + } + + private List ruleKeys() { + final List keys = new ArrayList<>(registered.size()); + for (final RegisteredRule r : registered) { + keys.add(LogFilterListener.Factory.ruleKey(r.getLayer(), r.getRuleName())); + } + return keys; + } + } + + /** One registered (layer, ruleName) pair. {@code layer} is null for auto-layer rules. */ + public static final class RegisteredRule { + @Getter + private final Layer layer; + @Getter + private final String ruleName; + + public RegisteredRule(final Layer layer, + final String ruleName) { + this.layer = layer; + this.ruleName = ruleName; + } + } + + /** + * Uniform error type with the {@code partial} registration list so the caller can roll + * back whatever made it through before the failure via {@link #remove(Applied)}. + */ + public static final class ApplyException extends Exception { + @Getter + private final List partial; + + public ApplyException(final String message, final Throwable cause, + final List partial) { + super(message, cause); + this.partial = partial == null ? 
Collections.emptyList() : partial; + } + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java new file mode 100644 index 000000000000..d6a96e905864 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java @@ -0,0 +1,420 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.apply; + +import java.io.StringReader; +import java.util.Collections; +import java.util.LinkedHashMap; +import java.util.LinkedHashSet; +import java.util.Map; +import java.util.Set; +import javassist.ClassPool; +import javassist.LoaderClassPath; +import lombok.Getter; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.meter.analyzer.v2.MetricConvert; +import org.apache.skywalking.oap.meter.analyzer.v2.prometheus.rule.MetricsRule; +import org.apache.skywalking.oap.meter.analyzer.v2.prometheus.rule.Rule; +import org.apache.skywalking.oap.server.core.CoreModule; +import org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem; +import org.apache.skywalking.oap.server.core.classloader.Catalog; +import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; +import org.apache.skywalking.oap.server.core.classloader.RuleClassLoader; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.EngineApplied; +import org.yaml.snakeyaml.Yaml; + +/** + * Turns a runtime-rule MAL file (one of the {@code otel-rules} / {@code log-mal-rules} + * catalogs) into a live {@link MetricConvert} on this OAP node. + * + *

This is the MAL half of the apply pipeline: parse the stored YAML, construct a + * {@link MetricConvert}, and let the existing meter-analyzer path register each declared + * metric name via {@link MeterSystem#create}. Removal reverses the registrations through + * {@link MeterSystem#removeMetric}, which drains L1/L2 handlers and drops the BanyanDB + * measure per the work done in bundle A. + * + *

Class isolation: every MAL-generated class for this file — storage-side {@code Metrics} + * subclasses (×3 downsampling variants per declared metric), the {@code MalExpression} + * implementation class per rule, each closure companion, and the {@code MalFilter} if the + * YAML declares a top-level {@code filter:} — is defined in the per-file + * {@link RuleClassLoader} created here. Retiring the loader through the dslManager's + * {@code ClassLoaderGc} on unregister makes the whole bundle GC-eligible at once, + * observable via the loader's phantom-reference queue. + */ +@Slf4j +public class MalFileApplier { + + private final MeterSystem meterSystem; + + public MalFileApplier(final MeterSystem meterSystem) { + this.meterSystem = meterSystem; + } + + /** + * Parse + compile + register every rule declared in the YAML content. + * + * @param yamlContent raw YAML bytes of the rule file (byte-identical to the POSTed body) + * @param sourceName informational identifier used in Javassist's {@code SourceFile} + * attribute so generated bytecode shows the originating rule file in + * stack traces. Convention: {@code catalog + "/" + name}. + * @return {@link Applied} bundle holding the set of metric names this file registered + * (used on the next update/delete to call {@link MeterSystem#removeMetric} on + * each) plus the {@link MetricConvert}, retained so sample dispatch can reach it. + * @throws ApplyException on YAML parse error or rule validation failure. MAL expression + * compile errors may surface at {@link MetricConvert} construction — those are + * also wrapped here so the caller's error handling is uniform. 
+ */ + public Applied apply(final String yamlContent, final String sourceName, + final String contentHash, + final StorageManipulationOpt storageOpt) throws ApplyException { + return apply(yamlContent, sourceName, contentHash, storageOpt, + DSLClassLoaderManager.Kind.RUNTIME); + } + + /** + * Origin-tagged overload: {@link DSLClassLoaderManager.Kind#STATIC} mints a {@code static:} + * loader so the static fall-over path (bundled rule serving again after the runtime + * override is removed) is distinguishable from the runtime path in logs and diagnostics. + */ + public Applied apply(final String yamlContent, final String sourceName, + final String contentHash, + final StorageManipulationOpt storageOpt, + final DSLClassLoaderManager.Kind kind) throws ApplyException { + final Rule rule = parse(yamlContent, sourceName); + final Set metricNames = enumerateMetricNames(rule); + + // Per-file RuleClassLoader + Javassist ClassPool. The pool is parented to the default + // pool so shipped ancestor classes (SumFunction, HistogramFunction, ...) resolve via + // parent-first lookup; LoaderClassPath ensures Javassist's generator can write the + // new Metrics subclass into THIS loader. Dropping the bundle drops the loader, and + // every class the loader defined becomes GC-eligible — observable via the manager's + // internal phantom-reference queue. + final int firstSlash = sourceName.indexOf('/'); + final Catalog catalog = firstSlash > 0 + ? Catalog.of(sourceName.substring(0, firstSlash)) + : Catalog.OTEL_RULES; + final String ruleName = firstSlash > 0 + ? 
sourceName.substring(firstSlash + 1) + : sourceName; + final RuleClassLoader ruleLoader = DSLClassLoaderManager.INSTANCE.newBuilder( + catalog, ruleName, kind, contentHash); + final ClassPool pool = new ClassPool(ClassPool.getDefault()); + pool.appendClassPath(new LoaderClassPath(ruleLoader)); + + final MetricConvert convert; + try { + convert = new MetricConvert(rule, meterSystem, pool, ruleLoader, storageOpt); + } catch (final MetricConvert.PartialRegistrationException pre) { + // Phase-2 register threw partway. Carry ONLY the subset that actually landed in + // MeterSystem — the caller uses this set for rollback. Passing the full enumerated + // set here would remove metrics the old bundle still owns (disastrous on + // FILTER_ONLY edits, where by definition every metric name is also in the old + // bundle). + throw new ApplyException( + "MAL register failed for " + sourceName + " (partial)", + pre.getCause() == null ? pre : pre.getCause(), + pre.getRegisteredBeforeFailure()); + } catch (final Throwable t) { + // Phase-1 compile failure or other pre-register throw. Nothing was registered with + // MeterSystem, so rollback set is empty — passing a non-empty set would cause the + // caller to unregister metrics the old bundle owns and this apply never touched. + throw new ApplyException("MAL compile failed for " + sourceName, t, Collections.emptySet()); + } + return new Applied(rule, convert, metricNames, ruleLoader); + } + + /** + * Back-compat overload: callers that haven't yet picked a storage policy pass + * {@link StorageManipulationOpt#fullInstall()}. Main-node apply path. + */ + public Applied apply(final String yamlContent, final String sourceName, + final String contentHash) throws ApplyException { + return apply(yamlContent, sourceName, contentHash, StorageManipulationOpt.fullInstall()); + } + + /** + * Back-compat overload for callers that haven't yet threaded the content hash through. 
+ * Uses an empty hash — the per-file loader still works, it's just less traceable in the + * {@code ClassLoaderGc} output. + */ + public Applied apply(final String yamlContent, final String sourceName) throws ApplyException { + return apply(yamlContent, sourceName, "", StorageManipulationOpt.fullInstall()); + } + + /** + * Reverse of {@link #apply}: drop every metric name the previous apply registered under + * the given {@link StorageManipulationOpt storage policy}. Main-node callers pass + * {@link StorageManipulationOpt#fullInstall()} so {@code BanyanDBIndexInstaller.dropTable} + * actually deletes the server-side measure. Peer-node callers pass + * {@link StorageManipulationOpt#localCacheOnly()} so local teardown (L1/L2 drain, + * {@code meterPrototypes} eviction, CtClass detach) still runs but the server-side drop + * is suppressed — main owns server-side state. + * + *

Failure handling is best-effort-with-surfacing: the loop does NOT abort on the first + * failing metric (the caller's desired end-state is "all gone", and one stubborn metric + * should not block the teardown of its siblings), but collected failures throw + * {@link RemoveException} after the loop. Callers on the REST sync path propagate that + * throw so the operator sees 500 {@code teardown_deferred} / {@code commit_deferred} + * instead of a misleading 200 {@code inactivated} / {@code structural_applied}; the + * dslManager tick wraps the call site so one failing rule doesn't block the whole tick. + * + * @throws RemoveException when one or more metrics failed to remove fully. The caller's + * local state (this applier's view) has still advanced (MeterSystem's + * meterPrototypes entry was dropped on best-effort), but the backend drop (or + * worker drain, or CtClass detach) did not fully succeed for the listed names. + */ + public void remove(final Set<String> metricNames, final StorageManipulationOpt storageOpt) { + if (metricNames == null || metricNames.isEmpty()) { + return; + } + Map<String, Throwable> failures = null; + for (final String name : metricNames) { + try { + meterSystem.removeMetric(name, storageOpt); + } catch (final Throwable t) { + log.warn("runtime-rule MAL remove: failed to remove metric {}", name, t); + if (failures == null) { + failures = new LinkedHashMap<>(); + } + failures.put(name, t); + } + } + if (failures != null) { + throw new RemoveException(failures); + } + } + + /** Back-compat overload: full-install policy (server-side drop fires). 
*/ + public void remove(final Set metricNames) { + remove(metricNames, StorageManipulationOpt.fullInstall()); + } + + private Rule parse(final String yamlContent, final String sourceName) throws ApplyException { + try (StringReader reader = new StringReader(yamlContent)) { + final Rule rule = new Yaml().loadAs(reader, Rule.class); + if (rule == null) { + throw new ApplyException("YAML parsed to null — empty or malformed rule file: " + + sourceName, null, Collections.emptySet()); + } + if (rule.getName() == null || rule.getName().isEmpty()) { + rule.setName(sourceName); + } + return rule; + } catch (final ApplyException e) { + throw e; + } catch (final Throwable t) { + throw new ApplyException("YAML parse failure for " + sourceName, t, Collections.emptySet()); + } + } + + /** + * Parse raw YAML and return the set of metric names the rule file would register. + * Static-only variant — does not compile, does not register, does not construct a + * per-file classloader. Used by the dslManager's teardown path when it needs to know + * which metrics a boot-loaded static rule owns but has no {@link Applied} in + * {@code appliedMal} to consult (first operator {@code /inactivate} or first + * structural {@code /addOrUpdate} of a rule that only had a static version). + * + *

Returns an empty set on any parse failure — teardown is best-effort, and a + * malformed static rule cannot own live metrics anyway. + */ + public static Set parseMetricNames(final String yamlContent, final String sourceName) { + if (yamlContent == null || yamlContent.isEmpty()) { + return Collections.emptySet(); + } + try (StringReader reader = new StringReader(yamlContent)) { + final Rule rule = new Yaml().loadAs(reader, Rule.class); + if (rule == null) { + return Collections.emptySet(); + } + if (rule.getName() == null || rule.getName().isEmpty()) { + rule.setName(sourceName); + } + return enumerateMetricNames(rule); + } catch (final Throwable t) { + log.warn("runtime-rule: failed to parse static MAL content for {} — no metric " + + "names enumerated for teardown", sourceName, t); + return Collections.emptySet(); + } + } + + /** + * Compute the full set of registered metric names for a parsed Rule, mirroring + * {@code MetricConvert.formatMetricName} ({@code metricPrefix + "_" + ruleName}). + */ + private static Set enumerateMetricNames(final Rule rule) { + final Set out = new LinkedHashSet<>(); + if (rule.getMetricsRules() == null) { + return Collections.unmodifiableSet(out); + } + for (final MetricsRule r : rule.getMetricsRules()) { + if (r.getName() == null) { + continue; + } + out.add(rule.getMetricPrefix() + "_" + r.getName()); + } + return Collections.unmodifiableSet(out); + } + + /** Result of a successful {@link #apply} — retained so the next update/delete can unwind. */ + public static final class Applied implements EngineApplied { + @Getter + private final Rule rule; + @Getter + private final MetricConvert metricConvert; + @Getter + private final Set registeredMetricNames; + /** + * Per-file loader that owns every generated Metrics class for this apply. Retained + * here as a strong reference so the classes stay live while the bundle is ACTIVE; + * the dslManager retires it through {@code ClassLoaderGc} on unregister so GC is + * observable. 
+ */ + @Getter + private final RuleClassLoader ruleClassLoader; + + public Applied(final Rule rule, final MetricConvert metricConvert, + final Set registeredMetricNames, + final RuleClassLoader ruleClassLoader) { + this.rule = rule; + this.metricConvert = metricConvert; + this.registeredMetricNames = registeredMetricNames; + this.ruleClassLoader = ruleClassLoader; + } + + @Override + public int suspendDispatch(final ModuleManager moduleManager) { + if (registeredMetricNames == null || registeredMetricNames.isEmpty()) { + return 0; + } + try { + final MeterSystem ms = moduleManager.find(CoreModule.NAME).provider() + .getService(MeterSystem.class); + return ms.suspendDispatch(registeredMetricNames); + } catch (final Throwable t) { + log.warn("runtime-rule MAL Applied: suspendDispatch lookup failed; " + + "next tick retries.", t); + return 0; + } + } + + @Override + public int resumeDispatch(final ModuleManager moduleManager) { + if (registeredMetricNames == null || registeredMetricNames.isEmpty()) { + return 0; + } + try { + final MeterSystem ms = moduleManager.find(CoreModule.NAME).provider() + .getService(MeterSystem.class); + return ms.resumeDispatch(registeredMetricNames); + } catch (final Throwable t) { + log.warn("runtime-rule MAL Applied: resumeDispatch lookup failed; " + + "next tick retries.", t); + return 0; + } + } + + /** Cross-file ownership uses metric names: another active MAL bundle declaring the + * same {@code metricPrefix_metricName} is a config collision the operator must + * resolve before either side can apply. */ + @Override + public Set claimedKeys() { + return registeredMetricNames == null + ? Collections.emptySet() + : registeredMetricNames; + } + + @Override + public Object classLoader() { + return ruleClassLoader; + } + + @Override + public Set alarmResetTargets() { + return registeredMetricNames == null + ? Collections.emptySet() + : registeredMetricNames; + } + } + + /** + * Uniform error type raised by {@link #apply}. 
Carries the {@code partiallyRegistered} set + * so the caller can invoke {@link #remove} to roll back whatever made it through before + * the failure. + */ + public static final class ApplyException extends Exception { + @Getter + private final Set<String> partiallyRegistered; + + public ApplyException(final String message, final Throwable cause, + final Set<String> partiallyRegistered) { + super(message, cause); + this.partiallyRegistered = partiallyRegistered == null + ? Collections.emptySet() + : partiallyRegistered; + } + } + + /** + * Raised by {@link #remove(Set, StorageManipulationOpt)} when one or more metric + * teardowns failed. Unchecked so REST-sync callers can let it propagate without changing + * method signatures across the call chain; the REST layer already catches + * {@code Throwable} and surfaces it as 500 {@code teardown_deferred}. Cause is the first + * underlying failure; the full map is retained on the exception for diagnostic logging. + */ + public static final class RemoveException extends RuntimeException { + @Getter + private final Map<String, Throwable> failures; + + public RemoveException(final Map<String, Throwable> failures) { + super(buildMessage(failures), firstCause(failures)); + this.failures = failures == null + ? Collections.emptyMap() + : Collections.unmodifiableMap(new LinkedHashMap<>(failures)); + } + + private static String buildMessage(final Map<String, Throwable> failures) { + if (failures == null || failures.isEmpty()) { + return "runtime-rule MAL remove failed (no failures recorded)"; + } + final StringBuilder sb = new StringBuilder( + "runtime-rule MAL remove failed for ").append(failures.size()).append(" metric(s): "); + boolean first = true; + for (final Map.Entry<String, Throwable> e : failures.entrySet()) { + if (!first) { + sb.append(", "); + } + sb.append(e.getKey()).append(" (") + .append(e.getValue() == null ? 
"null" : e.getValue().getClass().getSimpleName()) + .append(")"); + first = false; + } + return sb.toString(); + } + + private static Throwable firstCause(final Map failures) { + if (failures == null || failures.isEmpty()) { + return null; + } + return failures.values().iterator().next(); + } + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalShapeExtractor.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalShapeExtractor.java new file mode 100644 index 000000000000..c6c42228dde3 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalShapeExtractor.java @@ -0,0 +1,210 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.apply; + +import com.google.common.base.Strings; +import java.io.StringReader; +import java.util.Collections; +import java.util.LinkedHashMap; +import java.util.Map; +import java.util.Objects; +import org.apache.commons.lang3.StringUtils; +import org.apache.commons.text.CaseUtils; +import org.apache.skywalking.oap.meter.analyzer.v2.dsl.DSL; +import org.apache.skywalking.oap.meter.analyzer.v2.dsl.ExpressionMetadata; +import org.apache.skywalking.oap.meter.analyzer.v2.prometheus.rule.MetricsRule; +import org.apache.skywalking.oap.meter.analyzer.v2.prometheus.rule.Rule; +import org.apache.skywalking.oap.server.core.analysis.meter.ScopeType; +import org.yaml.snakeyaml.Yaml; + +/** + * Extracts the per-metric storage shape {@code (functionName, scopeType)} from a MAL rule file + * without running Javassist codegen. The classifier uses this to decide whether an update is + * truly STRUCTURAL (shape moved for at least one metric) or can ride the FILTER_ONLY fast path + * (all shapes identical, only expression bodies / filters / tags changed). + * + *

Algorithm mirrors what {@code MetricConvert} + {@code Analyzer.init} do at apply time: + *

    + *
  1. Apply {@code expPrefix} / {@code expSuffix} to the rule's {@code exp} field exactly + * as {@code MetricConvert.formatExp} does, producing the final expression string. + *
  2. Parse that string via {@link DSL#extractMetadata(String)} (AST walk only, no Javassist + * bytecode generation; cheap enough to run on every classify call). + *
  3. Derive the storage function name by the same formula {@code Analyzer.init} uses: + * {@code CaseUtils.toCamelCase(downsampling.lowercase) + capitalize(dataType)}, where + * dataType is picked from {@code isHistogram} + {@code percentiles} + {@code labels} + * exactly as {@code Analyzer.MetricType} does. + *
  4. Pair with the scope type directly from the metadata. + *
+ * + *

Same-shape = same storage-side class = no {@code MeterSystem.removeMetric} + no BanyanDB + * {@code deleteMeasure}. Different shape = the design's "shape-break" case — every shipped + * backend treats the Metrics subclass identity as the measure/table identity, and swapping + * function or scope moves that identity. The runtime-rule {@code allowStorageChange} guardrail + * uses this set to flag when an operator is about to drop an existing measure's data. + */ +public final class MalShapeExtractor { + + private MalShapeExtractor() { + } + + /** + * Per-metric storage shape. Equality / hash code deliberately over both fields so the + * classifier can diff shape maps with a straight {@code Map.equals}-style comparison. + */ + public static final class MalShape { + private final String functionName; + private final ScopeType scopeType; + + public MalShape(final String functionName, final ScopeType scopeType) { + this.functionName = functionName; + this.scopeType = scopeType; + } + + public String getFunctionName() { + return functionName; + } + + public ScopeType getScopeType() { + return scopeType; + } + + @Override + public boolean equals(final Object o) { + if (this == o) { + return true; + } + if (!(o instanceof MalShape)) { + return false; + } + final MalShape other = (MalShape) o; + return Objects.equals(functionName, other.functionName) + && scopeType == other.scopeType; + } + + @Override + public int hashCode() { + return Objects.hash(functionName, scopeType); + } + + @Override + public String toString() { + return "(" + functionName + "," + scopeType + ")"; + } + } + + /** + * Parse a MAL YAML file and return a map {@code metricName → shape}, where metric names + * follow the same {@code metricPrefix + "_" + ruleName} formula {@code MetricConvert} uses. + * + *

Returns an empty map when the YAML is null/empty or declares no {@code metricsRules} or + * {@code metricPrefix}; malformed YAML throws {@link IllegalArgumentException}. Any rule + * whose expression fails to parse is dropped from the result, and the classifier treats the + * "missing shape" conservatively (STRUCTURAL-with-over-approximation fallback). + */ + public static Map<String, MalShape> extract(final String yamlContent) { + if (yamlContent == null || yamlContent.isEmpty()) { + return Collections.emptyMap(); + } + final Rule rule; + try (StringReader r = new StringReader(yamlContent)) { + rule = new Yaml().loadAs(r, Rule.class); + } catch (final Throwable t) { + throw new IllegalArgumentException("MAL YAML parse failure: " + t.getMessage(), t); + } + if (rule == null || rule.getMetricsRules() == null || rule.getMetricPrefix() == null) { + return Collections.emptyMap(); + } + final Map<String, MalShape> out = new LinkedHashMap<>(); + for (final MetricsRule mr : rule.getMetricsRules()) { + if (mr.getName() == null) { + continue; + } + final String metricName = rule.getMetricPrefix() + "_" + mr.getName(); + final String fullExpr = formatExp(rule.getExpPrefix(), rule.getExpSuffix(), mr.getExp()); + final MalShape shape = extractShape(fullExpr); + if (shape != null) { + out.put(metricName, shape); + } + } + return Collections.unmodifiableMap(out); + } + + /** + * Extract shape from a single pre-assembled MAL expression string. Returns {@code null} + * when the parser fails; the caller treats that as "unknown shape" and falls back to the + * conservative classifier behaviour. Swallowing the parse error here is deliberate: + * classification is advisory metadata layered on top of the actual apply, which will + * raise its own compile-error if the YAML is broken. 
+ */ + public static MalShape extractShape(final String fullExpression) { + if (Strings.isNullOrEmpty(fullExpression)) { + return null; + } + try { + final ExpressionMetadata md = DSL.extractMetadata(fullExpression); + final String dataType = chooseDataType(md); + final String downSamplingStr = + CaseUtils.toCamelCase(md.getDownsampling().toString().toLowerCase(), false, '_'); + final String functionName = String.format("%s%s", + downSamplingStr, StringUtils.capitalize(dataType)); + return new MalShape(functionName, md.getScopeType()); + } catch (final Throwable t) { + return null; + } + } + + /** + * Replicates {@code MetricConvert.formatExp(expPrefix, expSuffix, exp)} so classifier-time + * shape extraction sees the exact same expression string {@code Analyzer.build} would + * compile. Keeping the logic here — rather than exposing it from {@code MetricConvert} — + * avoids coupling the two modules' APIs around a single five-line string operation. + */ + private static String formatExp(final String expPrefix, final String expSuffix, final String exp) { + String ret = exp; + if (!Strings.isNullOrEmpty(expPrefix)) { + ret = String.format("(%s.%s)", StringUtils.substringBefore(exp, "."), expPrefix); + final String after = StringUtils.substringAfter(exp, "."); + if (!Strings.isNullOrEmpty(after)) { + ret = String.format("(%s.%s)", ret, after); + } + } + if (!Strings.isNullOrEmpty(expSuffix)) { + ret = String.format("(%s).%s", ret, expSuffix); + } + return ret; + } + + /** + * Mirror of {@code Analyzer.init}'s MetricType resolution: + * histogram → "histogram" (unless percentiles are specified, then "histogramPercentile"); + * labels present → "labeled"; otherwise → "" (single). 
+ */ + private static String chooseDataType(final ExpressionMetadata md) { + if (md.isHistogram()) { + if (md.getPercentiles() != null && md.getPercentiles().length > 0) { + return "histogramPercentile"; + } + return "histogram"; + } + if (!md.getLabels().isEmpty()) { + return "labeled"; + } + return ""; + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/MainRouter.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/MainRouter.java new file mode 100644 index 000000000000..1ebc3ab264ac --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/MainRouter.java @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.cluster; + +import java.util.List; +import org.apache.skywalking.oap.server.core.remote.client.Address; +import org.apache.skywalking.oap.server.core.remote.client.RemoteClient; +import org.apache.skywalking.oap.server.core.remote.client.RemoteClientManager; + +/** + * Cluster-wide selector for the single "runtime-rule main" OAP. The main is the first entry + * in {@link RemoteClientManager#getRemoteClient()} — which {@code RemoteClientManager} keeps + * sorted by {@link Address} natural ordering (host:port). Every OAP sees the same sorted + * list → every OAP agrees on the same main. The main changes only when cluster topology + * changes (the lexicographically-first node joins or leaves). + * + *

This matches the {@link org.apache.skywalking.oap.server.core.remote.selector.ForeverFirstSelector} + * strategy used by {@code RemoteSenderService} for other always-first routing needs; we don't + * go through that service because we don't need the selector-cache plumbing, just the + * "who's first" answer. + * + *

Why single-main (not per-file hash): at runtime-rule scale — dozens of files, a handful + * of operator pushes per day — the simplicity of "one writer at a time" beats the throughput + * gain of distributing writes across nodes. Single-main also makes the + * {@link org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState.SuspendOrigin BOTH} + * origin a hard impossibility under correct routing, so its appearance immediately signals + * split-brain without needing per-file analysis. + * + *

Routing is advisory on the REST side: a non-main OAP that receives a write forwards the + * request to the main via the cluster bus (see {@code Forward} RPC). The fail-safe path, where a + * non-main receives a forwarded request from a node that also thought IT wasn't main, + * short-circuits with HTTP 421 to bound cluster ping-pong at one hop. + */ +public final class MainRouter { + + private MainRouter() { + } + + /** + * First client in the sorted peer list; that's the main. Null when the cluster is empty + * (single-node embedded topology / early boot). Callers treat null as "self is the only + * node, so self is main". + */ + public static RemoteClient mainClient(final RemoteClientManager rcm) { + if (rcm == null) { + return null; + } + final List<RemoteClient> peers = rcm.getRemoteClient(); + if (peers == null || peers.isEmpty()) { + return null; + } + return peers.get(0); + } + + /** Main's address. Null when the cluster is empty. */ + public static Address mainAddress(final RemoteClientManager rcm) { + final RemoteClient main = mainClient(rcm); + return main == null ? null : main.getAddress(); + } + + /** + * True if this node is the main, i.e. the first-sorted peer is self (or the cluster is + * empty, in which case self is trivially the only valid main). Callers use this as the + * gate before acquiring per-file locks + running the write workflow; non-main requests + * get forwarded via the cluster bus to the main. 
+ */ + public static boolean isSelfMain(final RemoteClientManager rcm) { + final Address main = mainAddress(rcm); + return main == null || main.isSelf(); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java new file mode 100644 index 000000000000..985f6149c718 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.cluster; + +import com.google.protobuf.ByteString; +import io.grpc.ManagedChannel; +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.TimeUnit; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.remote.client.Address; +import org.apache.skywalking.oap.server.core.remote.client.RemoteClient; +import org.apache.skywalking.oap.server.core.remote.client.RemoteClientManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardResponse; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeAck; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.RuntimeRuleClusterServiceGrpc; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendAck; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendRequest; + +/** + * Client-side broadcast of the Suspend and Resume RPCs during a STRUCTURAL apply. Reuses the + * established inter-node {@link ManagedChannel}s owned by {@link RemoteClientManager} — no + * duplicate channel caching, no duplicate TLS config. Peer discovery is delegated to the + * cluster module via {@code RemoteClientManager#getRemoteClient()}; self is filtered out by + * address. + * + *
<p>
Sequential fan-out with a per-call deadline. Unreachable peers are logged and skipped — + * the main node does not abort on a single peer failure. For Suspend, unreachable peers + * recover via the dslManager's self-heal sweep when the DB content eventually changes or the + * self-heal threshold elapses. For Resume, unreachable peers remain SUSPENDED until the + * self-heal threshold elapses or the DB changes on the next main-node retry. + */ +@Slf4j +public final class RuntimeRuleClusterClient { + + private final RemoteClientManager remoteClientManager; + private final String selfNodeId; + private final long perCallDeadlineMs; + + public RuntimeRuleClusterClient(final RemoteClientManager remoteClientManager, + final String selfNodeId, + final long perCallDeadlineMs) { + this.remoteClientManager = remoteClientManager; + this.selfNodeId = selfNodeId; + this.perCallDeadlineMs = perCallDeadlineMs; + } + + /** + * Fan out Suspend to every non-self peer sequentially. Sequential rather than parallel + * because (a) peer count is typically small (2–10 in practice), (b) the blocking stubs + * already carry their own deadlines so worst-case fan-out time is bounded to + * {@code peers * perCallDeadlineMs}, (c) sequential matches the existing cluster-bus code + * style and avoids introducing yet another executor for a short-lived operation. + * + * @return aggregated ack list in iteration order. Entries for unreachable peers are null. + * Main node workflow proceeds regardless. 
+ */ + public List<SuspendAck> broadcastSuspend(final String catalog, final String name, final String reason) { + final List<RemoteClient> peers = remoteClientManager.getRemoteClient(); + final List<SuspendAck> acks = new ArrayList<>(peers.size()); + for (final RemoteClient peer : peers) { + if (peer.getAddress() != null && peer.getAddress().isSelf()) { + continue; + } + acks.add(suspendOne(peer, catalog, name, reason)); + } + return acks; + } + + private SuspendAck suspendOne(final RemoteClient peer, final String catalog, + final String name, final String reason) { + final ManagedChannel channel = peer.getChannel(); + if (channel == null) { + log.warn("runtime-rule Suspend skipped for peer {}: channel not yet established", + peer.getAddress()); + return null; + } + final RuntimeRuleClusterServiceGrpc.RuntimeRuleClusterServiceBlockingStub stub = + RuntimeRuleClusterServiceGrpc.newBlockingStub(channel) + .withDeadlineAfter(perCallDeadlineMs, TimeUnit.MILLISECONDS); + try { + return stub.suspend(SuspendRequest.newBuilder() + .setCatalog(catalog) + .setName(name) + .setReason(reason == null ? "" : reason) + .setSenderNodeId(selfNodeId) + .setIssuedAtMs(System.currentTimeMillis()) + .build()); + } catch (final Throwable t) { + log.warn("runtime-rule Suspend to peer {} failed for {}/{}: {}", + peer.getAddress(), catalog, name, t.getMessage()); + return null; + } + } + + /** + * Fan out Resume to every non-self peer. Same transport, same sequential-with-deadline + * policy as {@link #broadcastSuspend}. Called by the REST handler's failure branches so + * peers flip back to RUNNING within an RPC round-trip instead of waiting for the 60 s + * self-heal threshold in the 99% case. Unreachable peers fall through to self-heal.
+ */ + public List<ResumeAck> broadcastResume(final String catalog, final String name, + final String reason) { + final List<RemoteClient> peers = remoteClientManager.getRemoteClient(); + final List<ResumeAck> acks = new ArrayList<>(peers.size()); + for (final RemoteClient peer : peers) { + if (peer.getAddress() != null && peer.getAddress().isSelf()) { + continue; + } + acks.add(resumeOne(peer, catalog, name, reason)); + } + return acks; + } + + private ResumeAck resumeOne(final RemoteClient peer, final String catalog, + final String name, final String reason) { + final ManagedChannel channel = peer.getChannel(); + if (channel == null) { + log.warn("runtime-rule Resume skipped for peer {}: channel not yet established", + peer.getAddress()); + return null; + } + final RuntimeRuleClusterServiceGrpc.RuntimeRuleClusterServiceBlockingStub stub = + RuntimeRuleClusterServiceGrpc.newBlockingStub(channel) + .withDeadlineAfter(perCallDeadlineMs, TimeUnit.MILLISECONDS); + try { + return stub.resume(ResumeRequest.newBuilder() + .setCatalog(catalog) + .setName(name) + .setReason(reason == null ? "" : reason) + .setSenderNodeId(selfNodeId) + .setIssuedAtMs(System.currentTimeMillis()) + .build()); + } catch (final Throwable t) { + log.warn("runtime-rule Resume to peer {} failed for {}/{}: {}", + peer.getAddress(), catalog, name, t.getMessage()); + return null; + } + } + + /** + * Forward a write request to the main node for {@code (catalog, name)}. The caller has + * already computed the main via {@link MainRouter#mainClient}. Locates the + * {@link RemoteClient} whose address matches {@code mainAddr} and issues the RPC. + * + *
<p>
Uses a longer deadline than Suspend / Resume because the forwarded workflow on the + * main can include compile + DDL + persist which is orders of magnitude slower than a + * bookkeeping broadcast. Caller supplies the deadline in ms so admin operations can + * tune it independently of cluster-control fan-outs. + * + * @return the main's response. Never null on success; throws on transport failure so + * the caller can surface a clear diagnostic to the operator. + */ + public ForwardResponse forwardToMain(final Address mainAddr, + final String operation, + final String catalog, final String name, + final byte[] body, + final boolean allowStorageChange, + final boolean forceReapply, + final long deadlineMs) { + final ManagedChannel channel = findChannelForAddress(mainAddr); + if (channel == null) { + throw new IllegalStateException( + "no cluster channel to forward-target " + mainAddr + " (peer list out of sync?)"); + } + final RuntimeRuleClusterServiceGrpc.RuntimeRuleClusterServiceBlockingStub stub = + RuntimeRuleClusterServiceGrpc.newBlockingStub(channel) + .withDeadlineAfter(deadlineMs, TimeUnit.MILLISECONDS); + return stub.forward(ForwardRequest.newBuilder() + .setOperation(operation == null ? "" : operation) + .setCatalog(catalog == null ? "" : catalog) + .setName(name == null ? "" : name) + .setBody(body == null ? ByteString.EMPTY : ByteString.copyFrom(body)) + .setAllowStorageChange(allowStorageChange) + .setForceReapply(forceReapply) + .setSenderNodeId(selfNodeId) + .setIssuedAtMs(System.currentTimeMillis()) + .build()); + } + + /** + * Walk the active peer list and return the channel whose address equals {@code target}. + * Null when no match (peer list was refreshed mid-request, or the target left the + * cluster between hash-selection and forward). Caller treats null as a transport error. 
+ */ + private ManagedChannel findChannelForAddress(final Address target) { + if (target == null) { + return null; + } + for (final RemoteClient peer : remoteClientManager.getRemoteClient()) { + final Address peerAddr = peer.getAddress(); + if (peerAddr != null && peerAddr.equals(target)) { + return peer.getChannel(); + } + } + return null; + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java new file mode 100644 index 000000000000..7b01b6f5d8ea --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java @@ -0,0 +1,369 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.cluster; + +import com.google.gson.Gson; +import com.google.gson.JsonObject; +import io.grpc.stub.StreamObserver; +import java.nio.charset.StandardCharsets; +import java.util.Objects; +import lombok.Setter; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardResponse; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeAck; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.RuntimeRuleClusterServiceGrpc; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendAck; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendState; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.SuspendResumeCoordinator; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLScriptKey; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.SuspendResult; +import org.apache.skywalking.oap.server.receiver.runtimerule.rest.RuntimeRuleService; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; + +/** + * Server-side handler for the three cluster-internal runtime-rule RPCs — see + * {@link RuntimeRuleClusterClient} for the client side: + *

<ul> + *
<li>Suspend / Resume — STRUCTURAL-apply pause-and-resume bracket. The selected + * main broadcasts Suspend at the start of a structural cutover, then Resume on + * success or rollback. Peers flip {@code DSLRuntimeState.suspended} via + * {@link SuspendResumeCoordinator#peerSuspend} / {@link SuspendResumeCoordinator#peerResume} + * so dispatch is paused on every node while the schema is moving.</li> + *
<li>Forward — single-main routing for {@code addOrUpdate}, {@code inactivate}, + * {@code delete}. Non-main OAPs receive an operator's REST call, forward it via + * this RPC to the cluster's main, and relay the response back to the operator. + * The handler dispatches by operation string into {@link RuntimeRuleService}'s + * {@code execute*} entry points, which run the same workflow direct HTTP callers + * run. Unknown operations return {@code 400 forward_unknown_operation}.</li> + *
</ul> + * + *
<p>
Suspend records {@link DSLRuntimeState.SuspendOrigin#PEER} so the state flip is atomic + * w.r.t. concurrent local work and distinct from a SELF-origin suspend that would be set if + * this node were itself the main. Resume clears only the PEER origin — SELF-origin suspends + * (an in-flight local apply on this node) are never cleared by peer Resume. + * + *
<p>
Both RPCs are idempotent: repeated Suspend with the same origin returns + * {@code ALREADY_SUSPENDED}; Resume with PEER already cleared returns + * {@code NOT_SUSPENDED_BY_SENDER}. Self-broadcast is suppressed by comparing + * {@code sender_node_id} against this node's own instance id. + * + *
<p>
Origin-conflict rejection: if Suspend arrives while SELF origin is already set on this + * node (routing misfire — two OAPs think they're main), the handler returns + * {@code REJECTED}. Main-side caller logs and drops the conflicting apply. + * + *
<p>
The receiver does NOT wait for the main's DDL to complete; peers pick up new content on + * their next dslManager tick via {@link + * org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO}. The + * 60 s self-heal in the dslManager is the backstop for the narrow case where the main + * crashes after Suspend but before Resume. + */ +@Slf4j +public class RuntimeRuleClusterServiceImpl + extends RuntimeRuleClusterServiceGrpc.RuntimeRuleClusterServiceImplBase { + + private final DSLManager dslManager; + /** This OAP instance's cluster identifier. Used to suppress self-broadcast loops. */ + private final String selfNodeId; + /** + * Bridge to the REST handler's workflow. Late-bound via {@code @Setter} because the + * cluster service registers with the gRPC server during module {@code start()} before + * the REST handler / service is constructed (the handler transitively references this + * cluster service via {@code RuntimeRuleClusterClient}). Null-guarded in {@link #forward} + * for boot-time safety; a forward that arrives before the service is wired returns 503 + * so operators see a clear "not ready" signal instead of an NPE. + */ + @Setter + private volatile RuntimeRuleService runtimeRuleService; + + public RuntimeRuleClusterServiceImpl(final DSLManager dslManager, final String selfNodeId) { + this.dslManager = dslManager; + this.selfNodeId = selfNodeId; + } + + @Override + public void suspend(final SuspendRequest request, + final StreamObserver responseObserver) { + final String catalog = request.getCatalog(); + final String name = request.getName(); + + // Suppress accidental self-loop: a broadcast that comes back to the sender must not + // drain the sender's bundle twice. The fan-out side filters self out, but belt-and- + // suspenders here because cluster peer lists can include self under some provider impls. 
+ if (Objects.equals(selfNodeId, request.getSenderNodeId())) { + responseObserver.onNext(SuspendAck.newBuilder() + .setNodeId(selfNodeId) + .setState(SuspendState.ALREADY_SUSPENDED) + .setDetail("self-broadcast suppressed") + .build()); + responseObserver.onCompleted(); + return; + } + + final SuspendResult result; + try { + result = dslManager.getSuspendCoord().peerSuspend(catalog, name); + } catch (final Throwable t) { + log.error("runtime-rule Suspend handler failed for {}/{}: {}", + catalog, name, t.getMessage(), t); + responseObserver.onNext(SuspendAck.newBuilder() + .setNodeId(selfNodeId) + .setState(SuspendState.SUSPEND_STATE_UNSPECIFIED) + .setDetail("peer suspend failed: " + t.getMessage()) + .build()); + responseObserver.onCompleted(); + return; + } + + final SuspendAck ack; + switch (result) { + case SUSPENDED: + log.info("runtime-rule Suspend accepted for {}/{} (sender={}, reason={})", + catalog, name, request.getSenderNodeId(), request.getReason()); + ack = SuspendAck.newBuilder() + .setNodeId(selfNodeId) + .setState(SuspendState.SUSPENDED) + .setDetail("entry dispatch parked (PEER origin); measure and L2 handlers remain live") + .build(); + break; + case ALREADY_SUSPENDED: + ack = SuspendAck.newBuilder() + .setNodeId(selfNodeId) + .setState(SuspendState.ALREADY_SUSPENDED) + .setDetail("idempotent replay; PEER origin already held") + .build(); + break; + case NOT_PRESENT: + log.debug("runtime-rule Suspend received for {}/{} but bundle is NOT_PRESENT", catalog, name); + ack = SuspendAck.newBuilder() + .setNodeId(selfNodeId) + .setState(SuspendState.NOT_PRESENT) + .setDetail("no local bundle for this (catalog, name)") + .build(); + break; + case REJECTED_ORIGIN_CONFLICT: + default: + // This node is itself mid-apply (SELF origin held). Refusing avoids the BOTH + // state that correct single-main routing never produces. Main-side caller + // inspects the REJECTED ack and surfaces it to the operator. 
+ ack = SuspendAck.newBuilder() + .setNodeId(selfNodeId) + .setState(SuspendState.REJECTED) + .setDetail("origin conflict: local apply in flight (SELF origin held); " + + "routing misfire — only one main per (catalog, name) is permitted") + .build(); + break; + } + responseObserver.onNext(ack); + responseObserver.onCompleted(); + } + + @Override + public void resume(final ResumeRequest request, + final StreamObserver responseObserver) { + final String catalog = request.getCatalog(); + final String name = request.getName(); + + if (Objects.equals(selfNodeId, request.getSenderNodeId())) { + responseObserver.onNext(ResumeAck.newBuilder() + .setNodeId(selfNodeId) + .setState(ResumeState.NOT_SUSPENDED_BY_SENDER) + .setDetail("self-broadcast suppressed") + .build()); + responseObserver.onCompleted(); + return; + } + + // Snapshot the pre-resume state so we can distinguish BOTH → SELF (PARTIALLY_RESUMED) + // from PEER → NONE (RESUMED) after the origin mutation. + final String key = DSLScriptKey.key(catalog, name); + final AppliedRuleScript beforeScript = dslManager.getRules().get(key); + final DSLRuntimeState before = beforeScript == null ? 
null : beforeScript.getState(); + if (before == null) { + responseObserver.onNext(ResumeAck.newBuilder() + .setNodeId(selfNodeId) + .setState(ResumeState.RESUME_NOT_PRESENT) + .setDetail("no local bundle for this (catalog, name)") + .build()); + responseObserver.onCompleted(); + return; + } + final DSLRuntimeState.SuspendOrigin originBefore = before.getSuspendOrigin(); + + try { + dslManager.getSuspendCoord().peerResume(catalog, name); + } catch (final Throwable t) { + log.error("runtime-rule Resume handler failed for {}/{}: {}", + catalog, name, t.getMessage(), t); + responseObserver.onNext(ResumeAck.newBuilder() + .setNodeId(selfNodeId) + .setState(ResumeState.RESUME_STATE_UNSPECIFIED) + .setDetail("peer resume failed: " + t.getMessage()) + .build()); + responseObserver.onCompleted(); + return; + } + + final ResumeAck ack; + if (originBefore == DSLRuntimeState.SuspendOrigin.NONE + || originBefore == DSLRuntimeState.SuspendOrigin.SELF) { + // PEER was never set, or Resume already replayed. Idempotent no-op. + ack = ResumeAck.newBuilder() + .setNodeId(selfNodeId) + .setState(ResumeState.NOT_SUSPENDED_BY_SENDER) + .setDetail("PEER origin was not set; idempotent no-op") + .build(); + } else if (originBefore == DSLRuntimeState.SuspendOrigin.BOTH) { + log.info("runtime-rule Resume for {}/{} cleared PEER; SELF still held — " + + "bundle remains SUSPENDED until local apply completes", catalog, name); + ack = ResumeAck.newBuilder() + .setNodeId(selfNodeId) + .setState(ResumeState.PARTIALLY_RESUMED) + .setDetail("PEER origin cleared; SELF origin still held (local apply in flight)") + .build(); + } else { + // originBefore == PEER → cleared to NONE → RUNNING. 
+ log.info("runtime-rule Resume accepted for {}/{} (sender={}, reason={})", + catalog, name, request.getSenderNodeId(), request.getReason()); + ack = ResumeAck.newBuilder() + .setNodeId(selfNodeId) + .setState(ResumeState.RESUMED) + .setDetail("entry dispatch resumed; bundle back to RUNNING") + .build(); + } + responseObserver.onNext(ack); + responseObserver.onCompleted(); + } + + /** + * Run a forwarded HTTP write on this node. Sender was told by its local {@code MainRouter} + * that this OAP is the hash-selected main for {@code (catalog, name)}. The handler + * dispatches to {@link RuntimeRuleService}'s {@code execute*} entry points, which run the + * same workflow (suspend / apply / persist / resume-on-failure) that a direct HTTP + * caller would hit, with the internal {@code forwarded=true} flag so the REST handler + * skips its own MainRouter check (otherwise double-checking could infinite-loop on + * cluster-view divergence) and instead uses the plain {@link DSLManager} lock + apply + * path — or, if this node also doesn't consider itself main (cluster views disagree), + * returns HTTP 421 to the sender so it can surface a clear "cluster routing misfire" + * signal to the operator. + */ + @Override + public void forward(final ForwardRequest request, + final StreamObserver responseObserver) { + final RuntimeRuleService service = runtimeRuleService; + if (service == null) { + // Module still booting, or REST handler was never wired. 503 tells sender to + // retry; self-heal isn't applicable for a Forward request. 
+ responseObserver.onNext(ForwardResponse.newBuilder() + .setNodeId(selfNodeId) + .setHttpStatus(503) + .setBody(forwardErrorBody("forward_target_unavailable", + "forward target not yet wired on this OAP")) + .build()); + responseObserver.onCompleted(); + return; + } + + if (Objects.equals(selfNodeId, request.getSenderNodeId())) { + // A forward that loops back to the sender is always a bug (either the sender's + // mainFor mapped to itself, or cluster membership flapped). Refuse to execute + // so the operator sees the anomaly instead of the loop silently completing. + log.warn("runtime-rule Forward received from self for {}/{} — refusing to execute", + request.getCatalog(), request.getName()); + responseObserver.onNext(ForwardResponse.newBuilder() + .setNodeId(selfNodeId) + .setHttpStatus(400) + .setBody(forwardErrorBody("forward_self_loop", + "forward arrived from self; check cluster peer list")) + .build()); + responseObserver.onCompleted(); + return; + } + + final String catalog = request.getCatalog(); + final String name = request.getName(); + final String operation = request.getOperation(); + log.info("runtime-rule Forward received: op={} {}/{} (sender={})", + operation, catalog, name, request.getSenderNodeId()); + + final RuntimeRuleService.ForwardResult result; + try { + switch (operation == null ? "" : operation) { + case "addOrUpdate": + result = service.executeAddOrUpdate(catalog, name, + request.getBody().toByteArray(), + request.getAllowStorageChange(), + request.getForceReapply()); + break; + case "inactivate": + result = service.executeInactivate(catalog, name); + break; + case "delete": + // /delete carries an optional mode (e.g. "revertToBundled") in the + // request body so the main can mirror the originator's intent. + final byte[] deleteBody = request.getBody().toByteArray(); + final String deleteMode = deleteBody.length == 0 + ? 
"" + : new String(deleteBody, StandardCharsets.UTF_8); + result = service.executeDelete(catalog, name, deleteMode); + break; + default: + responseObserver.onNext(ForwardResponse.newBuilder() + .setNodeId(selfNodeId) + .setHttpStatus(400) + .setBody(forwardErrorBody("forward_unknown_operation", + "unknown operation: " + operation)) + .build()); + responseObserver.onCompleted(); + return; + } + } catch (final Throwable t) { + log.error("runtime-rule Forward execution failed for {}/{}: {}", + catalog, name, t.getMessage(), t); + responseObserver.onNext(ForwardResponse.newBuilder() + .setNodeId(selfNodeId) + .setHttpStatus(500) + .setBody(forwardErrorBody("forward_execution_failed", t.getMessage())) + .build()); + responseObserver.onCompleted(); + return; + } + + responseObserver.onNext(ForwardResponse.newBuilder() + .setNodeId(selfNodeId) + .setHttpStatus(result.getHttpStatus()) + .setBody(result.getJsonBody()) + .build()); + responseObserver.onCompleted(); + } + + private static String forwardErrorBody(final String applyStatus, final String message) { + final JsonObject body = new JsonObject(); + body.addProperty("applyStatus", applyStatus); + body.addProperty("message", message == null ? 
"" : message); + return GSON.toJson(body); + } + + private static final Gson GSON = new Gson(); +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/ApplyContext.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/ApplyContext.java new file mode 100644 index 000000000000..2f9b49575008 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/ApplyContext.java @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine; + +import java.util.Map; +import java.util.Set; +import java.util.function.Consumer; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; + +/** + * Shared base contract for the per-call state object every {@link RuleEngine} phase receives. + * Holds only DSL-agnostic services + the unified per-rule map; DSL-specific state + * (per-key MAL applied artifacts, per-key LAL applied artifacts, etc.) lives on engine-specific + * subtypes — {@code engine.mal.MalApplyContext}, {@code engine.lal.LalApplyContext}, future + * {@code engine.oal.OalApplyContext} — that each engine constructs via its own + * {@link RuleEngine#newApplyContext} factory. + * + *
<p>
This is a context object, not a service. The scheduler builds an + * {@link ApplyInputs} record once per apply / unregister call and passes it to the engine; the + * engine narrows it into its own context subtype, plugging in any DSL-specific state map + * references it holds. Engines never hold long-lived references to a context — every phase + * method takes one as a parameter and uses it transactionally. + * + *
<p>
Classloader retire / install is NOT exposed on the context. Engines reach the + * {@link org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager#INSTANCE} + * singleton directly when they need to mint or drop a per-file loader; threading the manager + * through every context would add coupling without value (lifetime is process-wide, not + * per-call). + */ +public interface ApplyContext { + /** For looking up ModelInstaller / MeterSystem during verify + cross-file ownership reads. */ + ModuleManager getModuleManager(); + + /** Install policy for THIS apply / unregister call. */ + StorageManipulationOpt getStorageOpt(); + + /** Best-effort alarm-window reset. Scheduler-owned; engines invoke at commit / unregister + * with the affected metric name set. The orchestrator may swap in a no-op resetter for + * update-path teardowns where the caller drives the alarm reset itself. */ + Consumer<Set<String>> getAlarmResetter(); + + /** Unified per-key rule script map: content + {@link + * org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState} bundled into + * one {@link AppliedRuleScript}. Engines read prior content via {@code rules.get(key) + * != null ? rules.get(key).getContent() : null} and write on commit via + * {@code rules.compute(key, ...)}. The orchestrator owns state transitions; engines own + * content writes on commit. Both go through the same map under the per-file lock.
*/ + Map<String, AppliedRuleScript> getRules(); +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/ApplyInputs.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/ApplyInputs.java new file mode 100644 index 000000000000..b77de03a2ed0 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/ApplyInputs.java @@ -0,0 +1,47 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine; + +import java.util.Map; +import java.util.Set; +import java.util.function.Consumer; +import lombok.Getter; +import lombok.RequiredArgsConstructor; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; + +/** + * Shared scheduler inputs the orchestrator hands to {@link RuleEngine#newApplyContext} on + * every apply / unregister call. The engine reads what it needs and folds it into its own + * {@link ApplyContext} subtype together with whatever DSL-specific state the engine holds + * internally (e.g. the MAL engine's {@code appliedMal} map). + * + *
<p>
Why a separate POJO instead of letting {@code RuleEngine.newApplyContext} take loose + * parameters: signature stability. Adding a future shared service (a tracing context, a + * feature-flag bag, etc.) is one field on this record without touching every engine. + */ +@Getter +@RequiredArgsConstructor +public final class ApplyInputs { + private final ModuleManager moduleManager; + private final StorageManipulationOpt storageOpt; + private final Consumer<Set<String>> alarmResetter; + private final Map<String, AppliedRuleScript> rules; +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/Classification.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/Classification.java new file mode 100644 index 000000000000..b0e6684cf3e8 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/Classification.java @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine; + +/** + * Outcome of {@link RuleEngine#classify}. The scheduler reads this to drive the phase + * pipeline: + *

    + *
  • {@link #NO_CHANGE} — content byte-identical to the prior bundle. Scheduler + * short-circuits unless the caller forces a re-apply (e.g. + * {@code /addOrUpdate?force=true}, or the cold-boot ddl-debt promotion path).
  • + *
  • {@link #NEW} — no prior bundle for this key (or the prior bundle was an INACTIVE + * tombstone that's now being reactivated). Scheduler runs the full pipeline: + * compile → fireSchemaChanges (create) → verify → commit.
  • + *
  • {@link #FILTER_ONLY} — DSL body / filter / tag-assignments changed but every + * metric / rule key kept the same shape. Scheduler skips fireSchemaChanges + verify + * (no DDL needed) and goes straight to commit, swapping the in-memory bundle so the + * new body takes effect.
  • + *
  • {@link #STRUCTURAL} — at least one metric / rule key changed shape, or metrics / + * keys were added or removed. Scheduler runs the full pipeline.
  • + *
  • {@link #INACTIVE} — DB row status flipped to INACTIVE. Scheduler routes to + * {@link RuleEngine#unregister}; no compile or DDL fire needed for an apply.
  • + *
+ * + *

Engines compute richer delta info (added / removed / shape-break sets for MAL; + * planned rule keys for LAL) and carry it on their own {@link CompiledDSL} subclass — + * the scheduler doesn't need to see it; only the producing engine consumes it on later + * phases. + */ +public enum Classification { + NO_CHANGE, + NEW, + FILTER_ONLY, + STRUCTURAL, + INACTIVE +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/CompiledDSL.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/CompiledDSL.java new file mode 100644 index 000000000000..7f98dd947b8e --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/CompiledDSL.java @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine; + +/** + * Marker for the artifact that flows phase-to-phase through a {@link RuleEngine}'s + * lifecycle. 
{@link RuleEngine#compile} produces a CompiledDSL; the scheduler holds + * the reference and passes it to {@code fireSchemaChanges} → {@code verify} → + * {@code commit} (or {@code rollback}). The scheduler treats it as opaque — only the + * producing engine reads its DSL-specific contents (added / removed / shape-break sets, + * Applied artifact, per-file classloader, etc.). + * + *

The contract on this interface is just enough metadata for the scheduler to make + * routing + bookkeeping decisions; everything the engine itself needs lives on its own + * subclass. + */ +public interface CompiledDSL { + String getCatalog(); + + String getName(); + + /** SHA-256 hex of the new content this bundle was compiled from. */ + String getContentHash(); + + /** Outcome of {@link RuleEngine#classify} that produced this bundle. */ + Classification getClassification(); +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/EngineCompileException.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/EngineCompileException.java new file mode 100644 index 000000000000..45331512687d --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/EngineCompileException.java @@ -0,0 +1,42 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine; + +/** + * Thrown by {@link RuleEngine#compile} when the engine could not produce a valid + * {@link CompiledDSL}. The engine is responsible for cleaning its own partial state before + * throwing — by the time this exception reaches the orchestrator, no engine-side bookkeeping + * remains from the failed attempt and the prior bundle is still serving. + * + *

The orchestrator catches this, stamps the snapshot's {@code applyError} with + * {@link Throwable#getMessage()} (which carries the underlying applier's diagnostics), and + * surfaces the failure to the REST caller / lets the next dslManager tick retry. + */ +public final class EngineCompileException extends RuntimeException { + private static final long serialVersionUID = 1L; + + public EngineCompileException(final Throwable cause) { + super(cause.getMessage() + + (cause.getCause() == null ? "" : " — " + cause.getCause().getMessage()), cause); + } + + public EngineCompileException(final String message, final Throwable cause) { + super(message, cause); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngine.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngine.java new file mode 100644 index 000000000000..f2f5bc3ef76d --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngine.java @@ -0,0 +1,362 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine; + +import java.util.Map; +import java.util.Set; +import java.util.function.Consumer; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.library.module.ModuleManager; + +/** + * Per-DSL phase pipeline. The scheduler (DSLManager) holds one engine per DSL via + * {@link RuleEngineRegistry} and drives every engine through three orchestrators: + * {@code DSLRuntimeApply} (apply pipeline), {@code DSLRuntimeUnregister} (tear-down), + * {@code DSLRuntimeDelete} (destructive). The scheduler's apply driver is unified + * ({@code DSLManager.handleApply}) — there is no per-DSL branching. Engines own all + * DSL-specific work: Javassist generation, applier registration, backend listener chain, + * classloader retire, alarm-reset target sets, backend service lookup. + * + *

How each method is scheduled

+ * + *

1. Read-only inputs the scheduler queries before/around the pipeline + *

+ *   supportedCatalogs() — read once at module start by {@link RuleEngineRegistry#register}.
+ *                         Lets the registry build a catalog → engine index with O(1) lookup.
+ *
+ *   classify(old, new, inactive) — called by {@code DSLManager.handleApply} on every apply
+ *                         attempt. Drives routing:
+ *                           INACTIVE → {@code DSLRuntimeUnregister} + tombstone snapshot
+ *                           NO_CHANGE → snapshot hash refresh, no engine work
+ *                           NEW / FILTER_ONLY / STRUCTURAL → continue to ownership guard
+ *                                                            then compile
+ *
+ *   claimedKeys(content, source) — called by the cross-file ownership guard. Engine
+ *                         returns its file's claim set; scheduler intersects against
+ *                         {@link #activeClaimsExcluding} of every other live bundle plus
+ *                         INACTIVE-row claims read from the DAO.
+ *
+ *   activeClaimsExcluding(selfKey) — called by the cross-file ownership guard. Engine
+ *                         walks its internal applied-state map and returns owner-keyed
+ *                         claim sets so the guard reports which file holds a colliding
+ *                         claim in its error message.
+ *
+ *   storageImpactKeys(prior, new) — called by the REST {@code allowStorageChange}
+ *                         guardrail. Engine returns the per-DSL set of changes that
+ *                         mutate cluster-shared backend schema (MAL: shape-break metrics
+ *                         plus added / removed names; LAL: outputType renames + rule key
+ *                         add/remove). Empty for body-only edits, which the guardrail
+ *                         allows unconditionally.
+ * 
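Illustrative sketch, not part of this patch: the cross-file ownership guard described above can be pictured as a per-owner set intersection. The names `OwnershipGuardSketch` and `collisions` are hypothetical. Only the input shapes follow the Javadoc (a planned claim set, and a `"catalog:name"` to claimed-keys map), and the sketch assumes the caller has already merged any INACTIVE-row claims read from the DAO into that map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-alone sketch; not the guard implemented by the patch. */
final class OwnershipGuardSketch {
    private OwnershipGuardSketch() {
    }

    /**
     * Returns owner key to colliding names. An empty map means the planned claims are free.
     *
     * @param planned      claim set of the incoming file (what claimedKeys would return)
     * @param activeClaims "catalog:name" to claims of every other live bundle
     */
    static Map<String, Set<String>> collisions(final Set<String> planned,
                                               final Map<String, Set<String>> activeClaims) {
        final Map<String, Set<String>> result = new LinkedHashMap<>();
        for (final Map.Entry<String, Set<String>> owner : activeClaims.entrySet()) {
            // Copy before retainAll so the owner's claim set is never mutated.
            final Set<String> overlap = new TreeSet<>(owner.getValue());
            overlap.retainAll(planned);
            if (!overlap.isEmpty()) {
                result.put(owner.getKey(), overlap);
            }
        }
        return result;
    }

    public static void main(final String[] args) {
        final Map<String, Set<String>> active = new LinkedHashMap<>();
        active.put("otel-rules:vm", Set.of("cpu_usage", "mem_usage"));
        active.put("otel-rules:k8s", Set.of("pod_count"));
        // The result names the owning file for each colliding key.
        System.out.println(collisions(Set.of("cpu_usage", "disk_io"), active));
    }
}
```

Keeping the result keyed by owner, rather than flattening it into one union, is what lets an error message name the file that already holds a colliding key.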
+ * + *

2. Apply pipeline — every classification flows through the same path + *

+ *   The scheduler always invokes {@code DSLRuntimeApply#compileAndVerify} and only then
+ *   decides to stash (deferCommit, REST 2-PC path) vs commitInline (tick / sync paths).
+ *   FILTER_ONLY and STRUCTURAL share this flow — engine.commit dispatches on classification
+ *   internally for the bits that differ (FILTER_ONLY skips classloader retire + alarm
+ *   reset + removedMetrics drop because shapes are identical).
+ *
+ *   newApplyContext(inputs)
+ *     ↓                        engine narrows the shared {@link ApplyInputs} into its own
+ *                              context subtype, folding in DSL-specific state.
+ *
+ *   compile(file, classification, ctx)
+ *     ↓                        Generate classes + register handlers + (for MAL) fire the
+ *                              backend listener chain. The engine internally rolls back
+ *                              its partial state on failure before throwing
+ *                              {@link EngineCompileException} — by the time the throw
+ *                              reaches the orchestrator, no engine-side leftovers remain.
+ *
+ *   fireSchemaChanges(compiled, ctx)
+ *     ↓                        SPI hook for engines whose listener chain isn't fused with
+ *                              compile. MAL: no-op (fired inside compile). LAL: no-op (no
+ *                              backend schema). Future engines may use this.
+ *
+ *   verify(compiled, ctx)
+ *     ↓                        Post-DDL probe. MAL: isExists round-trip per Model. LAL:
+ *                              no-op (returns null). Returns null on success or an error
+ *                              string the orchestrator stamps on the snapshot.
+ *
+ *     ┌─ verify-failed ──→ rollback(compiled, ctx) — engine drops just-registered metrics;
+ *     │                                              old applied state still serves.
+ *     └─ verify-OK    ──→ outcome.status = READY_TO_COMMIT, returned to scheduler:
+ *                          ├─ deferCommit → commitCoord.stash (REST 2-PC; drained on
+ *                          │                                    persist outcome)
+ *                          └─ inline      → commitCoord.commitInline → engine.commit
+ *
+ *   commit(compiled, ctx)      Drop removedMetrics from the dispatcher, swap the
+ *                              engine-applied artefacts + appliedContent, push the freshly-
+ *                              compiled converter to the owning receiver, retire the
+ *                              displaced classloader (non-FILTER_ONLY only), fire alarm
+ *                              reset for affected metric names. Idempotent at the in-memory
+ *                              level.
+ *
+ *   rollback(compiled, ctx)    Drop registrations from THIS attempt only — the just-
+ *                              registered added + shape-break metrics. Old applied state
+ *                              is intact (commit hasn't run), so unchanged metrics keep
+ *                              serving.
+ * 
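Illustrative sketch, not part of this patch: the inline branch of the apply pipeline above, reduced to its control flow. Everything here is hypothetical (`ApplyPipelineSketch`, the trimmed `Phases` interface, the `Status` values, and the use of a plain `String` in place of `CompiledDSL`). The deferred 2-PC stash path is left out, and reporting a schema failure as its own status is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the inline apply path; names are illustrative only. */
final class ApplyPipelineSketch {
    private ApplyPipelineSketch() {
    }

    /** Trimmed-down stand-in for the phase methods of the engine SPI. */
    interface Phases {
        String compile(String content);           // may throw on compile failure
        void fireSchemaChanges(String compiled);  // may throw on backend failure
        String verify(String compiled);           // null on success, error text otherwise
        void commit(String compiled);
        void rollback(String compiled);
    }

    enum Status { COMMITTED, COMPILE_FAILED, SCHEMA_FAILED, VERIFY_FAILED }

    static Status apply(final Phases engine, final String content) {
        final String compiled;
        try {
            compiled = engine.compile(content);
        } catch (final RuntimeException e) {
            // The engine is expected to have cleaned up its own partial state already.
            return Status.COMPILE_FAILED;
        }
        try {
            engine.fireSchemaChanges(compiled);
        } catch (final RuntimeException e) {
            engine.rollback(compiled);
            return Status.SCHEMA_FAILED;
        }
        if (engine.verify(compiled) != null) {
            // Drop only what this attempt registered; the prior bundle keeps serving.
            engine.rollback(compiled);
            return Status.VERIFY_FAILED;
        }
        engine.commit(compiled);
        return Status.COMMITTED;
    }

    /** Test double that records the order of phase calls. */
    static final class RecordingPhases implements Phases {
        final List<String> calls = new ArrayList<>();
        private final String verifyError;

        RecordingPhases(final String verifyError) {
            this.verifyError = verifyError;
        }

        @Override
        public String compile(final String content) {
            calls.add("compile");
            return content;
        }

        @Override
        public void fireSchemaChanges(final String compiled) {
            calls.add("fireSchemaChanges");
        }

        @Override
        public String verify(final String compiled) {
            calls.add("verify");
            return verifyError;
        }

        @Override
        public void commit(final String compiled) {
            calls.add("commit");
        }

        @Override
        public void rollback(final String compiled) {
            calls.add("rollback");
        }
    }

    public static void main(final String[] args) {
        final RecordingPhases engine = new RecordingPhases(null);
        System.out.println(apply(engine, "rule-body") + " via " + engine.calls);
    }
}
```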
+ * + *

3. Tear-down (driven by {@code DSLRuntimeUnregister}) + *

+ *   unregister(catalog, name, ctx) — called for INACTIVE classification, the tick's gone-
+ *                         keys cleanup, and any other path that needs to drop a bundle.
+ *                         Engine clears its applied-state entry, drops registered
+ *                         dispatcher handlers, retires the classloader, fires alarm reset
+ *                         for the prior metric set. Storage opt determines whether
+ *                         backend schema is dropped (fullInstall) or preserved
+ *                         (localCacheOnly — the {@code /inactivate} contract).
+ * 
+ * + *

4. Destructive {@code /delete} (driven by {@code DSLRuntimeDelete}) + *

+ *   dropBackend(catalog, name, runtimeContent, bundledContent, ctx) — called by REST {@code /delete}
+ *                         after {@code /inactivate} has already cleared the engine's
+ *                         applied state. Engines with backend schema (MAL) re-register
+ *                         prototypes locally then tear down under fullInstall so the
+ *                         listener chain runs the destructive cascade. Engines without
+ *                         backend (LAL) implement as no-op — the DAO row deletion alone
+ *                         discharges the rule.
+ * 
+ * + *

5. Boot / recovery (driven by {@code StaticRuleLoader}) + *

+ *   loadStaticRuleFile(catalog, name, content) — called once at boot for every static rule
+ *                         the catalog loaders compiled at module start, and again on each
+ *                         tick for any static rule whose DB row got {@code /delete}d while
+ *                         the disk content remained. Engine seeds a synthetic applied
+ *                         entry with its per-DSL claim set so the next {@code /inactivate}
+ *                         / {@code /addOrUpdate} / Suspend lookup finds the bundle.
+ * 
+ * + *

Boundary contract

+ * + *

Engines own everything DSL-specific: delta classifier, compiler, dispatcher + * (MeterSystem / LogFilterListener / ...) registration, backend service lookup, applied- + * state map, classloader handling, alarm-reset target derivation, the {@link CompiledDSL} + * subclass that carries per-call state, and the {@link ApplyContext} subtype that carries + * the scheduler-provided + engine-internal state per phase. + * + *

The scheduler owns everything DSL-agnostic: lock acquisition, cluster Suspend/Resume + * RPCs, persistence (DAO upsert), ddl-debt marker bookkeeping, cross-file ownership + * enforcement (parameterised by {@link #claimedKeys} + {@link #activeClaimsExcluding}), + * self-heal, tick scheduling, classloader graveyard, alarm-reset dispatch, snapshot + * transitions, and the 2-PC stash for deferred commits. It interacts with engines only + * through this SPI + the three orchestrators above + {@code StaticRuleLoader}. + * + *

Adding a new DSL

+ * + *

Implement {@code RuleEngine} and the SPI methods, declare your + * catalogs in {@link #supportedCatalogs}, build a concrete {@code MyApplyContext} subtype + * carrying any extra DSL state, register the engine with {@link RuleEngineRegistry} at + * module start. No scheduler edit is required — the unified {@code handleApply} routes + * via the registry. The boundary holds for telegraf-rules (already MAL syntax, so just an + * additional entry in {@code MalRuleEngine.supportedCatalogs}) and OAL (would be its own + * engine + context). + * + * @param the concrete {@link ApplyContext} subtype this engine consumes; bound at the + * class level so the orchestrators' dispatch helpers are type-safe end-to-end. + */ +public interface RuleEngine { + /** + * Catalogs this engine handles, e.g. {@code {"otel-rules", "log-mal-rules", + * "telegraf-rules"}} for the MAL engine, {@code {"lal"}} for the LAL engine. + * {@link RuleEngineRegistry} reads this once at registration time. + */ + Set supportedCatalogs(); + + /** + * Pure function. Compares {@code newContent} against the previous successfully-applied + * content for the same key (or {@code null} if no prior bundle) plus the row status, + * and returns the {@link Classification} the scheduler uses to drive the rest of the + * pipeline. The {@code isInactive} flag short-circuits to {@link Classification#INACTIVE}. + */ + Classification classify(String oldContent, String newContent, boolean isInactive); + + /** + * Pure function. Returns the names this content claims for cross-file ownership + * comparison: metric names for MAL, {@code ":"} encoded keys for LAL. + * The scheduler runs the comparison itself (active appliedX entries plus INACTIVE rows + * from the DAO); the engine just produces its file's claim set. + */ + Set claimedKeys(String content, String sourceName); + + /** + * Storage-affecting subset of a content change. 
The REST {@code allowStorageChange} + * guardrail uses this to refuse edits that would mutate cluster-shared backend schema + * (BanyanDB measure shape, ES index mapping, JDBC table) unless the operator explicitly + * opted in. + * + *

MAL: shape-break metric names (function or scope changed) plus added / removed + * names — those reach the listener chain on the next apply. + * + *

LAL: rule-name additions / removals plus {@code outputType} renames — those reroute + * log records to a different storage-backed subclass. + * + *

Empty result when the change is body-only (filter / tag / output-field tweaks) — + * those don't touch storage and the guardrail allows them through unconditionally. + * + *

Throws {@link IllegalArgumentException} if either content is unparseable; the REST + * handler turns the throw into a 400 {@code compile_failed} response. + */ + Set storageImpactKeys(String priorContent, String newContent); + + /** + * Active claims by every other live bundle this engine knows about, excluding + * {@code selfKey}. Returned as a map of {@code "catalog:name"} → claimed keys for that + * bundle. The orchestrator's cross-file ownership guard intersects each entry's value + * against the planned key set to detect collisions; surfacing per-owner sets (rather + * than a flat union) lets the guard report which file holds the colliding name in its + * error message. + * + *

Engines read their own internal {@code appliedX} map (the one their context + * exposes); the orchestrator does not need access to it directly. + */ + Map> activeClaimsExcluding(String selfKey); + + /** + * Load a static-shipped rule file into the engine's internal applied state. The engine + * builds whatever lightweight Applied artifact its unregister path needs (metric-name + * set for MAL, registered-rule list for LAL) and stores it under {@code "catalog:name"} + * key. Returns {@code true} when an entry was loaded, {@code false} when no claims were + * enumerable from {@code content} (empty rule file) or the engine already has an entry + * for this key. + * + *

Called by {@code StaticRuleLoader} at boot and on tick-time orphan-recovery; lets + * the loader stay DSL-agnostic — it doesn't need to know whether the engine's applied + * state is keyed on metric names or {@code (layer, ruleName)} tuples. + */ + boolean loadStaticRuleFile(String catalog, String name, String content); + + /** + * Build the engine's concrete {@link ApplyContext} subtype from the shared + * {@link ApplyInputs} the scheduler hands every call. The engine plugs in any + * DSL-specific state it carries internally (e.g. the per-key applied artifact map). + */ + C newApplyContext(ApplyInputs inputs); + + /** + * Phase: compile. Produces a {@link CompiledDSL} that carries the engine's per-file + * generated classes + per-file classloader + delta info. NO backend DDL fired here, NO + * scheduler-cache mutation. Throws {@link RuntimeException} on compile failure; + * scheduler stamps {@code applyError} on the snapshot and surfaces to the caller. + */ + CompiledDSL compile(RuntimeRuleManagementDAO.RuntimeRuleFile file, + Classification classification, + C ctx); + + /** + * Phase: schema changes. Drive the listener chain (BanyanDB define / drop, ES index + * mapping, JDBC table, etc.) for the deltas this CompiledDSL represents. The + * {@code StorageManipulationOpt} on the context controls whether the listeners actually + * fire (full / localCacheOnly / localCacheVerify). LAL impl is a no-op (no backend + * schema). Throws on backend failure; scheduler invokes {@link #rollback}. + */ + void fireSchemaChanges(CompiledDSL compiled, C ctx); + + /** + * Phase: verify. Post-DDL backend probe. Returns {@code null} on success, or an error + * string the scheduler stamps on the snapshot's {@code applyError}. MAL: real + * {@code isExists} round-trip per Model. LAL: no-op (returns {@code null}). + */ + String verify(CompiledDSL compiled, C ctx); + + /** + * Phase: commit. 
Swap the in-memory cache (engine-owned applied state + appliedContent + * for this key), promote the freshly-built classloader via {@link + * org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager#commit} and + * retire any displaced prior loader through the manager, fire alarm-reset for the + * affected metric name set via the context's alarmResetter callback. From this call + * onward the bundle is live; up to this call all phases can be rolled back cleanly. + */ + void commit(CompiledDSL compiled, C ctx); + + /** + * Phase: rollback. Drop registrations from THIS attempt — the just-registered metrics + * (MAL) or rule keys (LAL). Old applied state stays; scheduler hasn't swapped the + * cache yet, so dispatch keeps serving the prior bundle. Idempotent. + */ + void rollback(CompiledDSL compiled, C ctx); + + /** + * Tear down a previously-applied bundle (or a static-only bundle). Driven by + * {@code /inactivate} (with {@code localCacheOnly} so backend stays), {@code /delete} + * (with {@code fullInstall} so backend drops), and the tick's gone-keys cleanup on main. + * Engine clears its own dispatcher state + per-key applied entry. Shared post-cleanup + * (content-cache clear) is the orchestrator's concern after this call returns. + */ + void unregister(String catalog, String name, C ctx); + + /** + * Discharge backend schema for {@code /delete}. By the time the REST handler invokes + * {@code /delete}, {@code /inactivate} has already cleared the engine's applied state + * — a naive {@link #unregister} call would no-op the destructive cascade and the + * backend resource would orphan once the DAO row is deleted. Engines that own backend + * schema (MAL) re-register prototypes locally then tear down under fullInstall so the + * listener chain runs the destructive cascade on the existing resource. Engines without + * backend (LAL) implement this as a no-op — the row deletion alone discharges the rule. + * + *

{@code bundledContent} controls the destructiveness: + *

    + *
  • {@code null} — destructive: drop all backend resources the runtime row + * claimed. The rule is being permanently removed (no bundled twin on disk to + * fall back to).
  • + *
  • non-null — delta: drop only metrics that {@code runtimeContent} claims but + * {@code bundledContent} does not, plus metrics in both at different shape. + * Bundled-shared metrics at matching shape are preserved (no data loss for the + * measures bundled will reuse on its synchronous reload). Used when {@code + * /delete} reverts to a bundled twin.
  • + *
+ * + *
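Illustrative sketch, not part of this patch: the two `bundledContent` modes above amount to a set computation over claimed metrics. The names `DropDeltaSketch` and `metricsToDrop` are hypothetical, and representing a metric's shape as a plain string (for example `"avg/service"`) is an assumption made only to keep the example small.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the drop-set computation; not the MAL engine's code. */
final class DropDeltaSketch {
    private DropDeltaSketch() {
    }

    /**
     * @param runtime metric name to shape claimed by the runtime row
     * @param bundled metric name to shape claimed by the bundled twin, or null when no twin exists
     * @return names whose backend resources would be dropped
     */
    static Set<String> metricsToDrop(final Map<String, String> runtime,
                                     final Map<String, String> bundled) {
        final Set<String> drop = new TreeSet<>();
        for (final Map.Entry<String, String> claim : runtime.entrySet()) {
            // Preserved only when the bundled twin claims the same name at the same shape.
            final boolean preserved = bundled != null
                && claim.getValue().equals(bundled.get(claim.getKey()));
            if (!preserved) {
                drop.add(claim.getKey());
            }
        }
        return drop;
    }

    public static void main(final String[] args) {
        final Map<String, String> runtime = Map.of(
            "cpu", "avg/service", "mem", "avg/service", "disk", "sum/instance");
        final Map<String, String> bundled = Map.of(
            "cpu", "avg/service", "mem", "max/service");
        System.out.println("no twin: " + metricsToDrop(runtime, null));
        System.out.println("with twin: " + metricsToDrop(runtime, bundled));
    }
}
```

With no twin every runtime claim is dropped. With a twin, `cpu` survives because name and shape match, while `mem` (different shape) and `disk` (not claimed by bundled) are dropped.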

Throws {@link IllegalStateException} when a prerequisite fails (e.g., MeterSystem + * unavailable, parse error in either content); the caller (the {@code DSLRuntimeDelete} + * orchestrator) propagates the throw so the REST handler aborts the row deletion — + * refusing to delete the row is the correct failure mode (an orphaned backend resource + * with no DAO row to drive a retry is worse). + */ + void dropBackend(String catalog, String name, String runtimeContent, + String bundledContent, C ctx); + + /** + * After a runtime override has been removed for {@code (catalog, name)}, reload the + * bundled rule from {@link + * org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry} (if any) and bring + * it back into service via a fresh {@code static:} loader from + * {@link org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager}. + * + *

Returns {@code true} when a bundled rule was found and reinstalled; {@code false} + * when no bundled rule exists for this key (the rule is genuinely gone) or the engine + * doesn't participate in static fall-over (e.g. its catalog has no {@code StaticRuleRegistry} + * entries). + * + *

Errors during reload propagate as {@link RuntimeException}s the orchestrator logs + * but does not surface to the operator; the next dslManager tick will retry through the + * normal classify/apply path against whatever DB state then exists. + * + * @param alarmResetter alarm-window reset callback for affected metric names. The + * orchestrator picks the same callback it used for {@link + * #unregister} so an "update path" tear-down (where the caller + * drives reset itself) doesn't double-reset. + * @param moduleManager scheduler-supplied module manager so the engine can resolve its + * backend dispatcher (MeterSystem / LogFilterListener.Factory). + */ + boolean reloadStatic(String catalog, String name, Consumer> alarmResetter, + ModuleManager moduleManager); +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngineRegistry.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngineRegistry.java new file mode 100644 index 000000000000..25f233e3c1a3 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngineRegistry.java @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine; + +import java.util.Collection; +import java.util.HashMap; +import java.util.Map; + +/** + * Catalog → {@link RuleEngine} lookup. Built once at module start, read on every + * apply / unregister call. Engines self-declare their catalogs via {@link + * RuleEngine#supportedCatalogs()}; the registry indexes them so the scheduler can + * route a {@link org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO.RuntimeRuleFile} + * to the right engine in O(1). + * + *

Adding a new DSL is one line in {@link + * org.apache.skywalking.oap.server.receiver.runtimerule.module.RuntimeRuleModuleProvider}: + * register the engine instance with this registry. No scheduler edit required. + */ +public final class RuleEngineRegistry { + private final Map> byCatalog = new HashMap<>(); + + /** + * Register {@code engine} for every catalog it claims. Throws {@link IllegalStateException} + * on duplicate catalog: two engines competing for the same catalog is a configuration error + * worth failing module start over rather than silently dropping one. + */ + public void register(final RuleEngine engine) { + for (final String catalog : engine.supportedCatalogs()) { + final RuleEngine prior = byCatalog.put(catalog, engine); + if (prior != null && prior != engine) { + throw new IllegalStateException( + "Duplicate RuleEngine registration for catalog '" + catalog + + "': " + prior.getClass().getName() + " vs " + engine.getClass().getName()); + } + } + } + + /** + * @return the engine registered for {@code catalog}, or {@code null} if none. The scheduler + * treats {@code null} as a hard error (catalog should never be loaded if no engine claims + * it — the static rule registry filters by supported catalog at boot). + */ + public RuleEngine forCatalog(final String catalog) { + return byCatalog.get(catalog); + } + + /** All distinct engines, for module-start logging and lifecycle wiring. 
*/ + public Collection> engines() { + return byCatalog.values(); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/CompiledLalDSL.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/CompiledLalDSL.java new file mode 100644 index 000000000000..d85c44dec05f --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/CompiledLalDSL.java @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine.lal; + +import lombok.Getter; +import lombok.RequiredArgsConstructor; +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.LalFileApplier; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.Classification; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.CompiledDSL; + +/** + * LAL-specific {@link CompiledDSL} carrying the output of {@link LalRuleEngine#compile} + * through {@link LalRuleEngine#commit} (or rollback). LAL has no backend schema, so the + * fire / verify phases are no-ops; the only artifacts that flow phase-to-phase are the new + * {@link LalFileApplier.Applied} (for the in-memory swap) and the prior one (so commit can + * compute truly-gone keys + retire the displaced loader). + */ +@Getter +@RequiredArgsConstructor +public final class CompiledLalDSL implements CompiledDSL { + private final String catalog; + private final String name; + private final String contentHash; + private final Classification classification; + /** Raw YAML the bundle was compiled from, written into {@code appliedContent[key]} on + * commit so the next classify call has the prior content to diff against. */ + private final String content; + /** Prior bundle, {@code null} on first apply. */ + private final LalFileApplier.Applied oldApplied; + /** Freshly-compiled bundle. Live in {@code LogFilterListener.Factory} from the moment + * compile returned via {@code addOrReplace} — rollback re-uses the partial registration + * set on the apply exception. 
*/ + private final LalFileApplier.Applied newApplied; +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalApplyContext.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalApplyContext.java new file mode 100644 index 000000000000..23af22684275 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalApplyContext.java @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine.lal; + +import java.util.Map; +import java.util.Set; +import java.util.function.Consumer; +import lombok.Getter; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyContext; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyInputs; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; + +/** + * LAL-specific {@link ApplyContext} marker. The engine's {@code Applied} artefact lives on + * {@link AppliedRuleScript#getApplied} (cast to {@code LalFileApplier.Applied}), so this + * context no longer needs a parallel applied map. + */ +@Getter +public final class LalApplyContext implements ApplyContext { + private final ModuleManager moduleManager; + private final StorageManipulationOpt storageOpt; + private final Consumer<Set<String>> alarmResetter; + private final Map<String, AppliedRuleScript> rules; + + public LalApplyContext(final ApplyInputs inputs) { + this.moduleManager = inputs.getModuleManager(); + this.storageOpt = inputs.getStorageOpt(); + this.alarmResetter = inputs.getAlarmResetter(); + this.rules = inputs.getRules(); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalRuleEngine.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalRuleEngine.java new file mode 100644 index 000000000000..6fd1de48ab64 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalRuleEngine.java @@ -0,0 +1,500 @@ +/* + * Licensed to the Apache Software Foundation
(ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine.lal; + +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.locks.ReentrantLock; +import java.util.function.Consumer; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.log.analyzer.v2.module.LogAnalyzerModule; +import org.apache.skywalking.oap.log.analyzer.v2.provider.log.listener.LogFilterListener; +import org.apache.skywalking.oap.server.core.classloader.Catalog; +import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; +import org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.DeltaClassifier; +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.LalFileApplier; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyInputs; +import 
org.apache.skywalking.oap.server.receiver.runtimerule.engine.Classification; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.CompiledDSL; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.EngineCompileException; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLScriptKey; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; +import org.apache.skywalking.oap.server.receiver.runtimerule.util.ContentHash; + +/** + * LAL implementation of {@link RuleEngine}. Owns the {@code (layer, ruleName)} lifecycle for + * the {@code lal} catalog: parse / classify / compile / register / commit / unregister. There + * is no backend schema for LAL bundles, so {@link #fireSchemaChanges} and {@link #verify} are + * no-ops once wired. + * + *
<p>
Holds a stable reference to the scheduler's unified {@code rules} map at construction. + * Each rule's LAL-applied artifact lives on {@link AppliedRuleScript#getApplied} (an + * {@link org.apache.skywalking.oap.server.receiver.runtimerule.state.EngineApplied} cast to + * {@link LalFileApplier.Applied}), so the engine no longer keeps a parallel + * {@code appliedLal} map. + * + *
<p>
Phase coverage today: {@link #unregister} is wired; the apply phases still throw and the + * scheduler routes around them via the legacy {@code DSLManager.applyOneRuleFile} path until + * the per-phase migration completes. + */ +@Slf4j +public final class LalRuleEngine implements RuleEngine { + private static final Set CATALOGS = Set.of("lal"); + + private final Map rules; + private final ModuleManager moduleManager; + /** Lazy-resolved + memoised. {@code LogAnalyzerModule} may not be installed on this OAP; + * resolve on first use and degrade to {@code null} on absence. */ + private volatile LalFileApplier lalFileApplier; + + public LalRuleEngine(final Map rules, + final ModuleManager moduleManager) { + this.rules = rules; + this.moduleManager = moduleManager; + } + + /** Read this engine's typed Applied artefact for a key, or {@code null} when there is no + * entry / no engine artefact / the entry's artefact belongs to a different engine. */ + private static LalFileApplier.Applied appliedFor(final Map rules, + final String key) { + final AppliedRuleScript script = rules.get(key); + if (script == null) { + return null; + } + final org.apache.skywalking.oap.server.receiver.runtimerule.state.EngineApplied a = script.getApplied(); + return a instanceof LalFileApplier.Applied ? (LalFileApplier.Applied) a : null; + } + + /** Resolve the engine's {@link LalFileApplier}. Returns {@code null} when the + * {@code LogAnalyzerModule} isn't installed — LAL rules are then no-op and the tick + * logs the absence at debug level. 
*/ + private LalFileApplier resolveApplier() { + LalFileApplier local = lalFileApplier; + if (local != null) { + return local; + } + try { + final LogFilterListener.Factory factory = moduleManager.find(LogAnalyzerModule.NAME) + .provider().getService(LogFilterListener.Factory.class); + local = new LalFileApplier(factory); + lalFileApplier = local; + return local; + } catch (final Throwable t) { + return null; + } + } + + @Override + public Set supportedCatalogs() { + return CATALOGS; + } + + /** + * Wraps {@link DeltaClassifier#classifyLal} and folds the {@code isInactive} short-circuit + * in. LAL's classifier currently only distinguishes NO_CHANGE / NEW / STRUCTURAL — a + * filter-only path would require a finer parse of expression bodies vs rule keys; falling + * conservatively to STRUCTURAL is correct (one extra alarm-window reset, no correctness + * loss). + */ + @Override + public Classification classify(final String oldContent, final String newContent, final boolean isInactive) { + if (isInactive) { + return Classification.INACTIVE; + } + return DeltaClassifier.classifyLal(oldContent, newContent).classification(); + } + + /** + * The {@code (layer, ruleName)} keys this content claims, encoded as + * {@code "layer:ruleName"} (auto-layer rules use the literal {@code "auto"}). Used by the + * cross-file ownership guard. + */ + @Override + public Set claimedKeys(final String content, final String sourceName) { + return DeltaClassifier.enumerateLalRuleKeys(content); + } + + @Override + public Set storageImpactKeys(final String priorContent, final String newContent) { + if (priorContent == null || priorContent.isEmpty()) { + return Collections.emptySet(); + } + // LAL: outputType renames + rule add/remove are storage-affecting (they reroute log + // records to a different storage-backed subclass). DeltaClassifier already enumerates + // these via lalStorageAffectingChanges. 
+ return DeltaClassifier.lalStorageAffectingChanges(priorContent, newContent); + } + + @Override + public Map> activeClaimsExcluding(final String selfKey) { + final Map> out = new HashMap<>(); + for (final Map.Entry e : rules.entrySet()) { + if (selfKey.equals(e.getKey())) { + continue; + } + final LalFileApplier.Applied applied = appliedFor(rules, e.getKey()); + if (applied == null) { + continue; + } + final Set claimed = new HashSet<>(); + for (final LalFileApplier.RegisteredRule r : applied.getRegistered()) { + claimed.add(DSLScriptKey.lalRuleKey(r)); + } + out.put(e.getKey(), claimed); + } + return out; + } + + @Override + public boolean loadStaticRuleFile(final String catalog, final String name, final String content) { + final String key = DSLScriptKey.key(catalog, name); + if (appliedFor(rules, key) != null) { + return false; + } + final List staticKeys = + LalFileApplier.parseRuleKeys(content, catalog + "/" + name); + if (staticKeys.isEmpty()) { + return false; + } + final LalFileApplier.Applied synthetic = new LalFileApplier.Applied( + catalog + "/" + name, staticKeys); + rules.compute(key, (k, prev) -> prev == null + ? new AppliedRuleScript(catalog, name, null, null).withApplied(synthetic) + : prev.withApplied(synthetic)); + return true; + } + + @Override + public LalApplyContext newApplyContext(final ApplyInputs inputs) { + return new LalApplyContext(inputs); + } + + /** + * Compile + register the LAL bundle in one call. {@link LalFileApplier#apply} fuses + * Javassist class generation with the {@code factory.addOrReplace} dispatcher swap, + * so by the time compile returns the new (layer, ruleName) keys are live and the old + * bundle's keys it overwrote are gone — non-overlapping old keys keep serving until + * commit removes them. The orchestrator runs the cross-file ownership guard before + * calling this; the engine assumes the planned key set is conflict-free. + * + *
<p>
Throws {@link RuntimeException} wrapping {@link LalFileApplier.ApplyException} on + * compile / register failure; the orchestrator catches and routes to {@link #rollback}. + */ + @Override + public CompiledDSL compile(final RuntimeRuleManagementDAO.RuntimeRuleFile file, + final Classification classification, + final LalApplyContext ctx) { + final String key = DSLScriptKey.key(file.getCatalog(), file.getName()); + final String sourceName = file.getCatalog() + "/" + file.getName(); + final String newHash = ContentHash + .sha256Hex(file.getContent()); + final LalFileApplier lalApplier = resolveApplier(); + if (lalApplier == null) { + throw new IllegalStateException( + "LogAnalyzerModule Factory unavailable for LAL compile of " + sourceName); + } + final LalFileApplier.Applied oldApplied = appliedFor(ctx.getRules(), key); + try { + final LalFileApplier.Applied newApplied = lalApplier.apply( + file.getContent(), sourceName, newHash); + return new CompiledLalDSL(file.getCatalog(), file.getName(), newHash, classification, + file.getContent(), oldApplied, newApplied); + } catch (final LalFileApplier.ApplyException ae) { + // Engine-internal partial rollback for the rare case where Phase 2 of + // LalFileApplier.apply (the addOrReplace loop) threw after at least one rule was + // already swapped. Drop those partial entries so the Factory doesn't carry the + // half-applied set forward. The orchestrator never sees a CompiledLalDSL for this + // path (we throw EngineCompileException instead of returning), so the orchestrator's + // rollback() never runs — meaning the old DSL for any overlap key is NOT restored + // by this catch. The state map still points at the old content, so the next + // reconciler scan (NO_CHANGE → re-apply on disagreement check) will recover by + // recompiling the persisted content. Phase 1 failures arrive here with an empty + // partial set and are no-ops on the Factory. 
+ if (!ae.getPartial().isEmpty()) { + lalApplier.remove(new LalFileApplier.Applied(sourceName, ae.getPartial())); + } + throw new EngineCompileException(ae); + } + } + + /** No-op: LAL has no backend schema. */ + @Override + public void fireSchemaChanges(final CompiledDSL compiled, final LalApplyContext ctx) { + // Intentionally no-op. See class-level Javadoc. + } + + /** No-op: LAL has no backend probe. */ + @Override + public String verify(final CompiledDSL compiled, final LalApplyContext ctx) { + return null; + } + + /** + * Atomic in-memory swap: compute truly-gone keys (old keys not present in new), drop + * those from the dispatcher, install the new {@code Applied} in {@code appliedLal[key]}, + * and retire the displaced classloader. {@code addOrReplace} already overwrote + * overlapping keys at compile time, so commit only needs to clean up keys the new + * bundle dropped entirely. + */ + @Override + public void commit(final CompiledDSL compiled, final LalApplyContext ctx) { + final CompiledLalDSL c = (CompiledLalDSL) compiled; + final String key = DSLScriptKey.key(c.getCatalog(), c.getName()); + final String sourceName = c.getCatalog() + "/" + c.getName(); + final LalFileApplier lalApplier = resolveApplier(); + + if (c.getOldApplied() != null && lalApplier != null) { + final Set newKeys = new HashSet<>(); + for (final LalFileApplier.RegisteredRule r : c.getNewApplied().getRegistered()) { + newKeys.add(DSLScriptKey.lalRuleKey(r)); + } + final List trulyGone = new ArrayList<>(); + for (final LalFileApplier.RegisteredRule r : c.getOldApplied().getRegistered()) { + if (!newKeys.contains(DSLScriptKey.lalRuleKey(r))) { + trulyGone.add(r); + } + } + if (!trulyGone.isEmpty()) { + lalApplier.remove(new LalFileApplier.Applied(sourceName, trulyGone)); + } + } + // Promote the freshly-compiled loader to active. 
The new loader was minted by + // applier.apply but never installed in the manager's active map (newBuilder only + // mints), so a compile failure earlier would have left the prior loader untouched. + // commit() returns the displaced prior — retire it so the graveyard observes its + // collection. factory.addOrReplace already swapped the DSL out at compile time and + // truly-gone keys were just removed above, so the prior is genuinely dead. + if (c.getNewApplied().getRuleClassLoader() != null) { + DSLClassLoaderManager.INSTANCE.commit(c.getNewApplied().getRuleClassLoader()) + .filter(prior -> prior != c.getNewApplied().getRuleClassLoader()) + .ifPresent(DSLClassLoaderManager.INSTANCE::retire); + } + ctx.getRules().compute(key, (k, prev) -> prev == null + ? new AppliedRuleScript(c.getCatalog(), c.getName(), null, null) + .withContentAndApplied(c.getContent(), c.getNewApplied()) + : prev.withContentAndApplied(c.getContent(), c.getNewApplied())); + log.info("runtime-rule LAL engine: commit OK for {}/{} — {} rule(s) registered", + c.getCatalog(), c.getName(), c.getNewApplied().getRegistered().size()); + } + + /** + * Restore the prior live DSL after a failed apply attempt. LAL is unusual among the + * runtime-rule engines because {@code compile} mutates the global + * {@link org.apache.skywalking.oap.log.analyzer.v2.provider.log.listener.LogFilterListener.Factory} + * via {@code addOrReplace} — the swap is destructive at compile time, not at commit. So + * after a compile-or-later failure the Factory holds the new DSL while the persistence + * state-map still claims the old content is the running one. A naive {@code remove(new)} + * leaves the key empty, the state-map points at the now-evaporated old applied, and the + * next reconciler scan sees content unchanged → NO_CHANGE → never repairs. + * + *
<p>
The fix is to recompile the prior YAML (read from {@code ctx.getRules()[key]}) and + * re-register so {@code Factory[key]} ends up with the old DSL again. The prior content + * was already valid (it's been serving), so recompile is expected to succeed. If somehow + * it doesn't, log loudly and leave the key empty — the next persistence-state reconcile + * tick will attempt apply against the persisted content and recover from there. + */ + @Override + public void rollback(final CompiledDSL compiled, final LalApplyContext ctx) { + final CompiledLalDSL c = (CompiledLalDSL) compiled; + if (c.getNewApplied() == null || c.getNewApplied().getRegistered().isEmpty()) { + return; + } + final LalFileApplier lalApplier = resolveApplier(); + if (lalApplier == null) { + log.warn("runtime-rule LAL engine: Log Factory unavailable on rollback for {}/{}; " + + "skipping (next tick retries)", c.getCatalog(), c.getName()); + return; + } + final String key = DSLScriptKey.key(c.getCatalog(), c.getName()); + final String sourceName = c.getCatalog() + "/" + c.getName(); + + // Step 1: drop the partial new entries. Without this, recompiling the old DSL would + // hit the cross-file collision guard if any key overlaps. + lalApplier.remove(new LalFileApplier.Applied(sourceName, c.getNewApplied().getRegistered())); + + // Step 2: restore the old DSL. We need both an old applied (proves there WAS a prior + // running rule, so the contract is "preserve" not "leave empty") and the YAML content + // to recompile from (state map). If either is missing, the rollback degenerates to + // "remove only" and the next reconciler scan picks up. 
+ if (c.getOldApplied() == null) { + log.info("runtime-rule LAL engine: rollback OK for {}/{} — {} partial registration(s) removed (no prior DSL to restore)", + c.getCatalog(), c.getName(), c.getNewApplied().getRegistered().size()); + return; + } + final AppliedRuleScript prior = ctx.getRules().get(key); + if (prior == null || prior.getContent() == null) { + log.warn("runtime-rule LAL engine: rollback for {}/{} — old applied present but state-map content missing; key left empty, next reconciler scan will retry", + c.getCatalog(), c.getName()); + return; + } + try { + final String oldHash = ContentHash + .sha256Hex(prior.getContent()); + lalApplier.apply(prior.getContent(), sourceName, oldHash); + log.info("runtime-rule LAL engine: rollback OK for {}/{} — {} partial registration(s) removed and prior DSL restored", + c.getCatalog(), c.getName(), c.getNewApplied().getRegistered().size()); + } catch (final LalFileApplier.ApplyException e) { + // Pathological: the previously-running content fails to recompile. Could happen + // if the runtime classpath changed (e.g. SPI provider added/removed). Leave the + // key empty rather than throw further; the persistence-state retry path is the + // recovery mechanism. + log.error("runtime-rule LAL engine: rollback for {}/{} could not restore prior DSL — key left empty; persistence-state retry will reapply", + c.getCatalog(), c.getName(), e); + } + } + + + /** + * Tear down a previously-applied (or static) LAL bundle for {@code (catalog, name)}. + * Removes the registered rule keys from the LogFilterListener.Factory and retires the + * per-file classloader. {@code storageOpt} is irrelevant — LAL has no backend. + * + *
<p>
Static-rule fallback. When {@code priorLal} is {@code null}, parses + * {@link StaticRuleRegistry} content for the rule keys and removes those — see the MAL + * counterpart's class-level Javadoc for the rationale. + * + *
<p>
No alarm reset for LAL — alarm windows are keyed off metric names, not log rules. + */ + @Override + public void unregister(final String catalog, final String name, final LalApplyContext ctx) { + final String key = DSLScriptKey.key(catalog, name); + final String sourceName = catalog + "/" + name; + + final LalFileApplier.Applied priorLal = appliedFor(ctx.getRules(), key); + if (priorLal != null) { + ctx.getRules().computeIfPresent(key, (k, prev) -> prev.withApplied(null)); + final LalFileApplier lalApplier = resolveApplier(); + if (lalApplier == null) { + log.warn("runtime-rule dslManager: Log Factory unavailable; cannot unregister " + + "{} LAL rule(s) for {}/{}", + priorLal.getRegistered().size(), catalog, name); + } else { + lalApplier.remove(priorLal); + log.info("runtime-rule dslManager: unregistered {} LAL rule(s) for {}/{}", + priorLal.getRegistered().size(), catalog, name); + } + DSLClassLoaderManager.INSTANCE.dropRuntime(Catalog.LAL, name); + return; + } + + // Static-rule fallback. + final String staticContent = StaticRuleRegistry.active().find(catalog, name).orElse(null); + if (staticContent == null) { + return; + } + final List staticKeys = + LalFileApplier.parseRuleKeys(staticContent, sourceName); + if (staticKeys.isEmpty()) { + return; + } + final LalFileApplier.Applied synthetic = new LalFileApplier.Applied(sourceName, staticKeys); + final LalFileApplier lalApplier = resolveApplier(); + if (lalApplier == null) { + log.warn("runtime-rule dslManager: Log Factory unavailable; cannot unregister " + + "{} boot-registered LAL rule(s) for {}/{}", + staticKeys.size(), catalog, name); + return; + } + lalApplier.remove(synthetic); + log.info("runtime-rule dslManager: unregistered {} boot-registered LAL rule(s) for " + + "static rule {}/{}", + staticKeys.size(), catalog, name); + } + + /** No-op: LAL has no backend schema. {@code /delete}'s row deletion alone discharges + * the rule — no destructive cascade or delta-drop needed. 
*/ + @Override + public void dropBackend(final String catalog, final String name, + final String runtimeContent, final String bundledContent, + final LalApplyContext ctx) { + // Intentionally no-op. + } + + /** + * Re-install the bundled LAL rule for {@code (catalog, name)} from {@link StaticRuleRegistry}. + * The runtime override that masked it has already been removed; without this fall-over, + * the bundled rule's compiled classes would be gone and operators would have to restart + * the OAP to get the bundled DSL serving again. + * + *
<p>
Compiles via {@code lalApplier.apply(..., Kind.STATIC)} so the per-file loader is + * minted with the {@code static:} prefix — diagnostics can tell at a glance whether a + * key is being served by a runtime override or a static fall-over. + */ + @Override + public boolean reloadStatic(final String catalog, final String name, + final Consumer> alarmResetter, + final ModuleManager moduleManager) { + if (!CATALOGS.contains(catalog)) { + return false; + } + final String staticContent = StaticRuleRegistry.active().find(catalog, name).orElse(null); + if (staticContent == null || staticContent.isEmpty()) { + return false; + } + final LalFileApplier lalApplier = resolveApplier(); + if (lalApplier == null) { + log.warn("runtime-rule LAL engine: Log Factory unavailable; cannot reload static " + + "rule {}/{} after override removal", catalog, name); + return false; + } + final String sourceName = catalog + "/" + name; + final String hash = ContentHash.sha256Hex(staticContent); + try { + final LalFileApplier.Applied fresh = lalApplier.apply( + staticContent, sourceName, hash, DSLClassLoaderManager.Kind.STATIC); + // Promote the new static: loader. The displaced prior, if any, is retired — + // typically null here (we're called immediately after unregister, which already + // dropRuntime'd the old runtime loader). + if (fresh.getRuleClassLoader() != null) { + DSLClassLoaderManager.INSTANCE.commit(fresh.getRuleClassLoader()) + .filter(prior -> prior != fresh.getRuleClassLoader()) + .ifPresent(DSLClassLoaderManager.INSTANCE::retire); + } + // Reset the entry to look like a fresh boot-seeded one: content = bundled YAML + // (so future classify sees the right priorContent), state = null (so the next + // gone-keys cleanup correctly skips this as an untouched bundled-only entry), + // applied = the freshly compiled bundled rule. 
Without this reset the entry's + // post-/inactivate state would still be INACTIVE and the next tick's gone-keys + // path would re-fire teardown + reload in a loop. + final String key = DSLScriptKey.key(catalog, name); + rules.compute(key, (k, prev) -> { + final ReentrantLock lock = prev != null ? prev.getLock() : new ReentrantLock(); + return new AppliedRuleScript(catalog, name, staticContent, null, lock, fresh); + }); + log.info("runtime-rule LAL engine: static fall-over OK for {}/{} — {} rule(s) " + + "registered from bundled YAML", catalog, name, + fresh.getRegistered().size()); + return true; + } catch (final LalFileApplier.ApplyException ae) { + log.warn("runtime-rule LAL engine: static fall-over for {}/{} failed to compile " + + "the bundled YAML; bundled rule will stay dark until next /addOrUpdate " + + "or restart", catalog, name, ae); + return false; + } + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/CompiledMalDSL.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/CompiledMalDSL.java new file mode 100644 index 000000000000..a7327ca3357c --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/CompiledMalDSL.java @@ -0,0 +1,62 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine.mal; + +import java.util.Set; +import lombok.Getter; +import lombok.RequiredArgsConstructor; +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.DSLDelta; +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.MalFileApplier; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.Classification; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.CompiledDSL; + +/** + * MAL-specific {@link CompiledDSL} carrying the output of {@link MalRuleEngine#compile} all + * the way through {@link MalRuleEngine#commit} (or rollback). Holds only what the engine + * itself knows about the bundle — the orchestrator owns scheduler-side state (snapshot + * transitions, persistence, suspend coordination) and reads CompiledMalDSL purely as the + * engine's compile output. + * + *
<p>
Filter-only: {@code delta} is {@code null}, {@code addedPlusShapeBreak} is empty — + * the path skips alarm reset and classloader retire intentionally (see + * {@link MalRuleEngine#commit}). + */ +@Getter +@RequiredArgsConstructor +public final class CompiledMalDSL implements CompiledDSL { + private final String catalog; + private final String name; + private final String contentHash; + private final Classification classification; + /** Raw YAML the bundle was compiled from, written into {@code appliedContent[key]} on + * commit so the next classify call has the prior content to diff against. */ + private final String content; + /** Prior bundle for this key, or {@code null} on first apply. Held for classloader retire + * on commit. */ + private final MalFileApplier.Applied oldApplied; + /** Freshly-compiled bundle. Live in MeterSystem from the moment compile returned — + * rollback uses {@link #addedPlusShapeBreak} to know what to undo. */ + private final MalFileApplier.Applied newApplied; + /** Classifier verdict + delta sets ({@code added}, {@code removed}, {@code shapeBreak}, + * {@code alarmResetSet}). {@code null} on FILTER_ONLY (compile path doesn't compute + * per-metric deltas there). */ + private final DSLDelta delta; + /** Pre-merged {@code added ∪ shapeBreak} — the canonical rollback / verify target set. 
*/ + private final Set addedPlusShapeBreak; +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalApplyContext.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalApplyContext.java new file mode 100644 index 000000000000..e85c5f9b85ab --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalApplyContext.java @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine.mal; + +import java.util.Map; +import java.util.Set; +import java.util.function.Consumer; +import lombok.Getter; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyContext; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyInputs; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; + +/** + * MAL-specific {@link ApplyContext} marker. Carries the shared scheduler services resolved + * out of {@link ApplyInputs}; the engine's {@code Applied} artefact lives on + * {@link AppliedRuleScript#getApplied} now (cast to {@code MalFileApplier.Applied}), so this + * context no longer needs a parallel applied map. Kept as a subtype rather than collapsed to + * {@code ApplyContext} so the {@code RuleEngine} type parameter still reads + * as engine-tagged at every call site. + * + *
<p>
Constructed by {@link MalRuleEngine#newApplyContext(ApplyInputs)} on every apply / + * unregister call. + */ +@Getter +public final class MalApplyContext implements ApplyContext { + private final ModuleManager moduleManager; + private final StorageManipulationOpt storageOpt; + private final Consumer<Set<String>> alarmResetter; + private final Map<String, AppliedRuleScript> rules; + + public MalApplyContext(final ApplyInputs inputs) { + this.moduleManager = inputs.getModuleManager(); + this.storageOpt = inputs.getStorageOpt(); + this.alarmResetter = inputs.getAlarmResetter(); + this.rules = inputs.getRules(); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalRuleEngine.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalRuleEngine.java new file mode 100644 index 000000000000..d4af4d8d9bae --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalRuleEngine.java @@ -0,0 +1,813 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.engine.mal; + +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Optional; +import java.util.Set; +import java.util.concurrent.locks.ReentrantLock; +import java.util.function.Consumer; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.log.analyzer.v2.module.LogAnalyzerModule; +import org.apache.skywalking.oap.meter.analyzer.v2.MalConverterRegistry; +import org.apache.skywalking.oap.meter.analyzer.v2.MetricConvert; +import org.apache.skywalking.oap.server.core.CoreModule; +import org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem; +import org.apache.skywalking.oap.server.core.classloader.Catalog; +import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; +import org.apache.skywalking.oap.server.core.classloader.RuleClassLoader; +import org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry; +import org.apache.skywalking.oap.server.core.storage.StorageModule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.core.storage.model.IModelManager; +import org.apache.skywalking.oap.server.core.storage.model.Model; +import org.apache.skywalking.oap.server.core.storage.model.ModelInstaller; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.DSLDelta; +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.DeltaClassifier; +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.MalFileApplier; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyInputs; +import 
org.apache.skywalking.oap.server.receiver.runtimerule.engine.Classification; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.CompiledDSL; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.EngineCompileException; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLScriptKey; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; +import org.apache.skywalking.oap.server.receiver.runtimerule.util.ContentHash; +import org.apache.skywalking.oap.server.library.module.ModuleManager; + +/** + * MAL implementation of {@link RuleEngine}. Owns the metric-name lifecycle: parse / classify / + * compile / register / verify / commit / unregister for {@code otel-rules}, + * {@code log-mal-rules}, and {@code telegraf-rules}. All three catalogs share the same MAL + * syntax, so one engine handles all three — the catalog name only routes which dispatcher the + * MAL converter writes into (MeterSystem for otel-rules / telegraf-rules; LAL-extracted MAL + * for log-mal-rules). + * + *
<p>
Holds a stable reference to the scheduler's unified {@code rules} map at construction. + * Each rule's MAL-applied artifact lives on {@link AppliedRuleScript#getApplied} (an + * {@link org.apache.skywalking.oap.server.receiver.runtimerule.state.EngineApplied} cast to + * {@link MalFileApplier.Applied}), so the engine no longer keeps a parallel + * {@code appliedMal} map. Each phase call receives a {@link MalApplyContext} that exposes + * the shared services and the same rules map identity in one cohesive object. + * + *
<p>
Phase model. {@link MalFileApplier#apply} fuses compile, register, and listener- + * chain (BanyanDB define / ES mapping / JDBC table) into one call because the generated + * Javassist classes register synchronously with the storage listeners. The SPI's {@code + * fireSchemaChanges} is therefore a no-op for MAL: schema fires inside {@link #compile}. + * {@link #verify} runs the post-DDL {@code isExists} probe and returns an error string the + * orchestrator surfaces on the snapshot, or {@code null} on success. + */ +@Slf4j +public final class MalRuleEngine implements RuleEngine<MalApplyContext> { + private static final Set<String> CATALOGS = Set.of("otel-rules", "log-mal-rules", "telegraf-rules"); + + private final Map<String, AppliedRuleScript> rules; + private final ModuleManager moduleManager; + /** Lazy-resolved + memoised. {@link MeterSystem} comes from {@code CoreModule}, which + * may not be ready when this engine is constructed; resolve on first use. */ + private volatile MalFileApplier malFileApplier; + + public MalRuleEngine(final Map<String, AppliedRuleScript> rules, + final ModuleManager moduleManager) { + this.rules = rules; + this.moduleManager = moduleManager; + } + + /** Read this engine's typed Applied artefact for a key, or {@code null} when there is no + * entry / no engine artefact / the entry's artefact belongs to a different engine. */ + private static MalFileApplier.Applied appliedFor(final Map<String, AppliedRuleScript> rules, + final String key) { + final AppliedRuleScript script = rules.get(key); + if (script == null) { + return null; + } + final org.apache.skywalking.oap.server.receiver.runtimerule.state.EngineApplied a = script.getApplied(); + return a instanceof MalFileApplier.Applied ? (MalFileApplier.Applied) a : null; + } + + /** Exposed for {@link org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.StructuralCommitCoordinator} + * which needs the same {@code MeterSystem} lookup the engine uses internally. 
*/ + public ModuleManager getModuleManager() { + return moduleManager; + } + + /** Resolve the engine's {@link MalFileApplier}. Returns {@code null} when {@code CoreModule} + * hasn't started yet — caller treats this as a transient pre-boot state. */ + private MalFileApplier resolveApplier() { + MalFileApplier local = malFileApplier; + if (local != null) { + return local; + } + try { + final MeterSystem meterSystem = moduleManager.find(CoreModule.NAME).provider() + .getService(MeterSystem.class); + local = new MalFileApplier(meterSystem); + malFileApplier = local; + return local; + } catch (final Throwable t) { + return null; + } + } + + /** Resolve the {@link MalConverterRegistry} for a MAL catalog. {@code null} when the + * owning receiver module isn't installed in this OAP — runtime-rule may run without + * one of the MAL-consuming receivers, and missing-module degrades to "no push". */ + private MalConverterRegistry resolveConverterRegistry(final String catalog) { + final String moduleName; + switch (catalog) { + case "otel-rules": + // String literal keeps otel-receiver-plugin out of runtime-rule's pom. + moduleName = "receiver-otel"; + break; + case "log-mal-rules": + moduleName = LogAnalyzerModule.NAME; + break; + case "telegraf-rules": + // String literal keeps telegraf-receiver-plugin out of runtime-rule's pom. + moduleName = "receiver-telegraf"; + break; + default: + return null; + } + try { + return moduleManager.find(moduleName).provider().getService(MalConverterRegistry.class); + } catch (final Throwable t) { + log.debug("runtime-rule MAL engine: MalConverterRegistry for catalog {} (module {}) " + + "not available: {}", catalog, moduleName, t.getMessage()); + return null; + } + } + + /** Install / replace this bundle's {@link MetricConvert} in the owning receiver's + * registry so ingest samples reach the freshly-compiled converter. No-op when the + * convert is null or the owning receiver module isn't installed on this OAP. 
*/ + private void pushRuntimeConverter(final String catalog, final String name, + final MetricConvert convert) { + if (convert == null) { + return; + } + final MalConverterRegistry registry = resolveConverterRegistry(catalog); + if (registry == null) { + return; + } + registry.addOrReplaceConverter(runtimeConverterKey(catalog, name), convert); + } + + /** Reverse of {@link #pushRuntimeConverter}. Invoked from unregister / commit-overwrite + * paths so the receiver no longer sees the removed converter. */ + private void dropRuntimeConverter(final String catalog, final String name) { + final MalConverterRegistry registry = resolveConverterRegistry(catalog); + if (registry == null) { + return; + } + registry.removeConverter(runtimeConverterKey(catalog, name)); + } + + private static String runtimeConverterKey(final String catalog, final String name) { + return catalog + ":" + name; + } + + @Override + public Set supportedCatalogs() { + return CATALOGS; + } + + /** + * Wraps {@link DeltaClassifier#classifyMal} and folds the {@code isInactive} short-circuit + * in. The richer {@link DSLDelta} (added / removed / shape-break sets, alarm-reset set) + * is recomputed during {@link #compile} when needed; keeping the SPI return type to + * {@link Classification} keeps the scheduler boundary lean. + */ + @Override + public Classification classify(final String oldContent, final String newContent, final boolean isInactive) { + if (isInactive) { + return Classification.INACTIVE; + } + return DeltaClassifier.classifyMal(oldContent, newContent).classification(); + } + + /** + * Metric names this content claims, used by the cross-file ownership guard. Mirrors the + * same enumeration MAL apply uses ({@code metricPrefix + "_" + ruleName}). YAML parse + * failure surfaces as {@link IllegalArgumentException} — the scheduler stamps the apply + * error and aborts the cross-file check. 
+ */ + @Override + public Set<String> claimedKeys(final String content, final String sourceName) { + return MalFileApplier.parseMetricNames(content, sourceName); + } + + @Override + public Set<String> storageImpactKeys(final String priorContent, final String newContent) { + // First-time bundle: no prior storage identity to break. + if (priorContent == null || priorContent.isEmpty()) { + return Collections.emptySet(); + } + final DSLDelta delta = DeltaClassifier.classifyMal(priorContent, newContent); + // Only shape-break is guarded: that's the case where existing data on the BanyanDB + // measure becomes incompatible with the new shape. Add / remove are intentional ops + // the operator clearly knows about; FILTER_ONLY / NEW don't move any shape. + if (delta.classification() != Classification.STRUCTURAL) { + return Collections.emptySet(); + } + return delta.shapeBreakMetrics(); + } + + @Override + public Map<String, Set<String>> activeClaimsExcluding(final String selfKey) { + final Map<String, Set<String>> out = new HashMap<>(); + for (final Map.Entry<String, AppliedRuleScript> e : rules.entrySet()) { + if (selfKey.equals(e.getKey())) { + continue; + } + final MalFileApplier.Applied applied = appliedFor(rules, e.getKey()); + if (applied == null) { + continue; + } + out.put(e.getKey(), applied.getRegisteredMetricNames()); + } + return out; + } + + @Override + public boolean loadStaticRuleFile(final String catalog, final String name, final String content) { + final String key = DSLScriptKey.key(catalog, name); + if (appliedFor(rules, key) != null) { + return false; + } + final Set<String> staticMetricNames = + MalFileApplier.parseMetricNames(content, catalog + "/" + name); + if (staticMetricNames.isEmpty()) { + return false; + } + // Synthetic Applied: only the metric-name set is needed for unregister-side cascade. + // ruleClassLoader is null because static classes live in the default loader; rule and + // metricConvert are null because this entry tracks boot state, not a runtime apply. 
+ final MalFileApplier.Applied synthetic = new MalFileApplier.Applied( + null, null, staticMetricNames, null); + rules.compute(key, (k, prev) -> prev == null + ? new AppliedRuleScript(catalog, name, null, null).withApplied(synthetic) + : prev.withApplied(synthetic)); + return true; + } + + @Override + public MalApplyContext newApplyContext(final ApplyInputs inputs) { + return new MalApplyContext(inputs); + } + + /** + * Compile + register + fire schema in one call (the underlying {@link MalFileApplier#apply} + * is fused — Javassist class generation registers synchronously with the listener chain). + * Drops shape-break metrics first so {@code MeterSystem.create} can re-register at the new + * shape. Returns a {@link CompiledMalDSL} carrying the deltas, prior Applied, and the + * freshly-registered Applied for the rest of the pipeline. + * + *
<p>
Throws {@link MalFileApplier.ApplyException} (wrapped in {@link RuntimeException} for + * SPI compatibility) on compile / register failure; the orchestrator catches and routes + * to {@link #rollback}. + */ + @Override + public CompiledDSL compile(final RuntimeRuleManagementDAO.RuntimeRuleFile file, + final Classification classification, + final MalApplyContext ctx) { + final String key = DSLScriptKey.key(file.getCatalog(), file.getName()); + final String sourceName = file.getCatalog() + "/" + file.getName(); + final String newHash = ContentHash + .sha256Hex(file.getContent()); + final MalFileApplier applier = resolveApplier(); + if (applier == null) { + throw new IllegalStateException("MeterSystem unavailable for MAL compile of " + + sourceName); + } + final MalFileApplier.Applied oldApplied = appliedFor(ctx.getRules(), key); + + // FILTER_ONLY fast path: no shape-break drop, no DDL move, no alarm reset, no + // classloader retire. Just produce the freshly-compiled Applied and let commit do + // the in-memory swap. The classifier already ran — engines don't re-classify here. + if (classification == Classification.FILTER_ONLY) { + final MalFileApplier.Applied fresh; + try { + fresh = applier.apply( + file.getContent(), sourceName, newHash, ctx.getStorageOpt()); + } catch (final MalFileApplier.ApplyException ae) { + // Engine-internal partial rollback: undo whatever this attempt managed to + // register before the throw. Old appliedMal[key] is untouched — it's still + // serving — so removing the partial set is the only mutation needed. + applier.remove(ae.getPartiallyRegistered(), ctx.getStorageOpt()); + throw new EngineCompileException(ae); + } + return new CompiledMalDSL(file.getCatalog(), file.getName(), newHash, classification, + file.getContent(), oldApplied, fresh, /* delta */ null, Collections.emptySet()); + } + + // STRUCTURAL / NEW: re-derive the precise delta (added / removed / shape-break) from + // the prior content. 
The scheduler's classify() call handed us the verdict but not + // the delta sets — recomputing here keeps the SPI lean and ensures the delta the + // engine acts on is internally consistent with the content it's compiling. + final AppliedRuleScript priorScript = ctx.getRules().get(key); + final String priorContent = priorScript == null ? null : priorScript.getContent(); + final DSLDelta delta = DeltaClassifier.classifyMal(priorContent, file.getContent()); + + // Shape-break metrics MUST be dropped before applier.apply re-registers them at the + // new shape — MeterSystem.create rejects re-register at a different (function, scope) + // with an IllegalArgumentException. This IS the destructive shape-break contract: + // the REST handler's allowStorageChange guardrail has already gated it, and the + // design accepts that a verify-failure after this point loses shape-break data. + if (!delta.shapeBreakMetrics().isEmpty()) { + log.info("runtime-rule MAL engine: {}/{} dropping {} shape-break metric(s) before " + + "re-create: {}", file.getCatalog(), file.getName(), + delta.shapeBreakMetrics().size(), delta.shapeBreakMetrics()); + applier.remove(delta.shapeBreakMetrics(), ctx.getStorageOpt()); + } + + final MalFileApplier.Applied newApplied; + try { + newApplied = applier.apply( + file.getContent(), sourceName, newHash, ctx.getStorageOpt()); + } catch (final MalFileApplier.ApplyException ae) { + // Engine-internal partial rollback: undo only the metrics this attempt would + // have created or re-shaped (added ∪ shape-break). Unchanged metrics short- + // circuit on MeterSystem idempotency and were never re-registered, so removing + // them would wipe BanyanDB measure data the apply never actually touched. + // Shape-break metrics: we removed the old class pre-apply; the new one may or + // may not have registered before the throw — remove is idempotent either way. + // That's the documented allowStorageChange cost. 
+ final Set rollbackTargets = new HashSet<>(); + rollbackTargets.addAll(delta.addedMetrics()); + rollbackTargets.addAll(delta.shapeBreakMetrics()); + applier.remove(rollbackTargets, ctx.getStorageOpt()); + throw new EngineCompileException(ae); + } + + final Set addedPlusShapeBreak = new HashSet<>(); + addedPlusShapeBreak.addAll(delta.addedMetrics()); + addedPlusShapeBreak.addAll(delta.shapeBreakMetrics()); + + return new CompiledMalDSL(file.getCatalog(), file.getName(), newHash, classification, + file.getContent(), oldApplied, newApplied, delta, + Collections.unmodifiableSet(addedPlusShapeBreak)); + } + + /** + * No-op for MAL — schema changes fire inside {@link #compile} via {@link + * MalFileApplier#apply}, which drives the listener chain synchronously with class + * registration. Kept as an SPI hook for future engines (e.g. OAL) where compile and + * fire-schema can genuinely be separated. + */ + @Override + public void fireSchemaChanges(final CompiledDSL compiled, final MalApplyContext ctx) { + // Intentionally no-op. See class-level Javadoc. + } + + /** + * Post-DDL {@code isExists} probe. Returns {@code null} on success, an error string on + * mismatch the orchestrator stamps on the snapshot's {@code applyError}. Only verifies + * added + shape-break metrics — verifying unchanged metrics would duplicate the startup- + * time isExists check and burn gRPC round-trips for no benefit. + * + *
<p>
Gracefully degrades when storage / model services aren't present (early boot, some + * embedded test topologies): logs DEBUG and returns {@code null}. + */ + @Override + public String verify(final CompiledDSL compiled, final MalApplyContext ctx) { + final CompiledMalDSL c = (CompiledMalDSL) compiled; + if (c.getClassification() == Classification.FILTER_ONLY) { + return null; + } + final Set targets = c.getAddedPlusShapeBreak(); + if (targets.isEmpty()) { + return null; + } + final ModelInstaller installer; + final IModelManager modelManager; + try { + installer = ctx.getModuleManager().find(StorageModule.NAME).provider() + .getService(ModelInstaller.class); + modelManager = ctx.getModuleManager().find(CoreModule.NAME).provider() + .getService(IModelManager.class); + } catch (final Throwable t) { + log.debug("runtime-rule MAL engine: post-apply verify skipped for {}/{} " + + "(storage/model services unavailable: {})", + c.getCatalog(), c.getName(), t.getMessage()); + return null; + } + final List failures = new ArrayList<>(); + for (final Model m : modelManager.allModels()) { + if (!targets.contains(m.getName())) { + continue; + } + try { + final ModelInstaller.InstallInfo info = installer.isExists(m, ctx.getStorageOpt()); + if (!info.isAllExist()) { + failures.add(info.buildInstallInfoMsg()); + } + } catch (final Throwable t) { + failures.add(m.getName() + " (" + m.getDownsampling() + "): " + t.getMessage()); + } + } + if (failures.isEmpty()) { + if (log.isDebugEnabled()) { + log.debug("runtime-rule MAL engine: post-apply verify OK for {}/{} ({} metric(s))", + c.getCatalog(), c.getName(), targets.size()); + } + return null; + } + final String msg = "post-apply isExists verify FAILED: " + String.join("; ", failures); + log.error("runtime-rule MAL engine CRITICAL: {}/{} {} — orchestrator will roll back to " + + "prior bundle. 
Fix the DSL/storage mismatch and re-push.", + c.getCatalog(), c.getName(), msg); + return msg; + } + + /** + * Atomic in-memory swap: install the new {@code Applied} in {@code appliedMal[key]}, + * publish the freshly-compiled {@link org.apache.skywalking.oap.meter.analyzer.v2.MetricConvert} + * to the owning receiver's converter registry, retire the displaced classloader (STRUCTURAL + * / NEW only — FILTER_ONLY's old loader's Metrics classes are still the live storage + * target so it stays), and drive alarm-window reset for affected metric names. + * + *
<p>
Idempotent at the in-memory level: re-applying the same Applied is a no-op except for + * a redundant alarm reset. The orchestrator owns the snapshot transition ( + * {@link org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState}) and + * persistence — engine commit only mutates engine-owned state. + */ + @Override + public void commit(final CompiledDSL compiled, final MalApplyContext ctx) { + final CompiledMalDSL c = (CompiledMalDSL) compiled; + // Drop metrics this bundle no longer claims (STRUCTURAL/NEW only — FILTER_ONLY has + // identical metric sets). Honours the caller's storage opt: a peer-driven tick uses + // localCacheOnly here so the cluster-shared backend isn't touched; the main's REST + // path uses fullInstall to fire dropTable through the listener chain. Must run + // BEFORE the swap so the about-to-be-displaced applier still owns the prototypes. + if (c.getClassification() != Classification.FILTER_ONLY + && c.getDelta() != null + && !c.getDelta().removedMetrics().isEmpty()) { + final MalFileApplier applier = resolveApplier(); + if (applier != null) { + applier.remove(c.getDelta().removedMetrics(), ctx.getStorageOpt()); + } + } + final String commitKey = DSLScriptKey.key(c.getCatalog(), c.getName()); + ctx.getRules().compute(commitKey, (k, prev) -> prev == null + ? new AppliedRuleScript(c.getCatalog(), c.getName(), null, null) + .withContentAndApplied(c.getContent(), c.getNewApplied()) + : prev.withContentAndApplied(c.getContent(), c.getNewApplied())); + pushRuntimeConverter( + c.getCatalog(), c.getName(), c.getNewApplied().getMetricConvert()); + // Promote the freshly-compiled loader to active. newBuilder only mints; commit() is + // the only path that registers in the manager's active map, so a compile failure + // earlier would have left the prior loader untouched. STRUCTURAL / NEW commit retires + // the displaced prior — its Metrics classes have been replaced via MeterSystem + // re-create, so it's truly dead. 
FILTER_ONLY does NOT retire — its Metrics classes + // are still the storage target via MeterSystem.meterPrototypes, and the new loader's + // MalExpression bridges to the same prototypes. Both old and new loaders coexist; + // the manager's active slot points at the newest, the older is held strong via + // meterPrototypes and naturally GC'd if a later STRUCTURAL ever displaces it. + if (c.getNewApplied().getRuleClassLoader() != null) { + final Optional prior = + DSLClassLoaderManager.INSTANCE.commit(c.getNewApplied().getRuleClassLoader()); + if (c.getClassification() != Classification.FILTER_ONLY) { + prior.filter(p -> p != c.getNewApplied().getRuleClassLoader()) + .ifPresent(DSLClassLoaderManager.INSTANCE::retire); + } + } + if (c.getClassification() != Classification.FILTER_ONLY) { + // Alarm reset for metrics whose semantics changed (added / removed / shape-break). + if (c.getDelta() != null && !c.getDelta().alarmResetSet().isEmpty()) { + ctx.getAlarmResetter().accept(c.getDelta().alarmResetSet()); + } + } + log.info("runtime-rule MAL engine: commit OK for {}/{} — {} metric(s) registered ({})", + c.getCatalog(), c.getName(), + c.getNewApplied().getRegisteredMetricNames().size(), c.getClassification()); + } + + /** + * Drop registrations from THIS attempt — the just-attempted added + shape-break metrics. + * Old Applied stays in {@code appliedMal[key]} so unchanged metrics keep serving (the + * orchestrator hasn't called commit yet, so the swap hasn't happened). Idempotent. + * + *
<p>
Shape-break cost. Pre-compile we removed the old shape-break classes; if the + * new ones never registered (compile threw before reaching them), the metrics are gone + * for this evaluation period — documented cost of the {@code allowStorageChange=true} + * guardrail. Once the operator pushes a fixed version, the next apply re-registers. + */ + @Override + public void rollback(final CompiledDSL compiled, final MalApplyContext ctx) { + final CompiledMalDSL c = (CompiledMalDSL) compiled; + if (c.getAddedPlusShapeBreak().isEmpty()) { + return; + } + final MalFileApplier applier = resolveApplier(); + if (applier == null) { + log.warn("runtime-rule MAL engine: MeterSystem unavailable on rollback for {}/{}; " + + "skipping (next tick retries)", c.getCatalog(), c.getName()); + return; + } + applier.remove(c.getAddedPlusShapeBreak(), ctx.getStorageOpt()); + log.info("runtime-rule MAL engine: rollback OK for {}/{} — {} metric(s) removed", + c.getCatalog(), c.getName(), c.getAddedPlusShapeBreak().size()); + } + + /** + * Tear down a previously-applied (or static) MAL bundle for {@code (catalog, name)}. + *
<p>
The {@link MalApplyContext#getStorageOpt()} parameter decides whether the listener + * chain reaches the backend: + *

<ul>
+ *   <li>{@code localCacheOnly}: soft-pause path. Local state is cleared (meterPrototypes, + * Models from registry, appliedMal entry, classloader retired) but the listener's + * {@code dropTable} is skipped, so the BanyanDB measure / ES index / JDBC table stays + * intact. This is the {@code /inactivate} contract.</li>
+ *   <li>{@code fullInstall}: destructive path. Same local cleanup PLUS the listener fires + * {@code dropTable} so the backend resource is removed. This is the {@code /delete} + * contract and the tick's gone-keys cleanup on main.</li>
+ * </ul>
+ * + *
<p>
Cascade-first ordering. {@code applier.remove} runs before {@code + * appliedMal.remove(key)}. A backend-drop throw therefore leaves {@code appliedMal[key]} + * populated so the next tick (or operator retry) re-enters this method and re-fires the + * cascade. Listeners are required to be idempotent on the drop ({@code BanyanDB + * delete-measure} on a non-existent measure is a no-op). + * + *
<p>
Static-rule fallback. When {@code priorMal} is {@code null} (this rule never + * had a runtime apply on this node — e.g., a static-only rule receiving its first + * {@code /inactivate}, or a static-shadow tombstone reaching a fresh main) the method + * parses {@link StaticRuleRegistry} content for the metric names and removes those + * directly. {@code MeterSystem.removeMetric} short-circuits when the prototype is already + * gone, so this fallback is correct under both opt modes. + * + *
<p>
Alarm reset. The orchestrator decides whether to invoke the alarm kernel by + * supplying either the real {@link MalApplyContext#getAlarmResetter()} (full tear-down) + * or a no-op (update path where the caller will drive the reset itself with the precise + * delta). + */ + @Override + public void unregister(final String catalog, final String name, final MalApplyContext ctx) { + final String key = DSLScriptKey.key(catalog, name); + final String sourceName = catalog + "/" + name; + + // Always drop the MalConverterRegistry entry for this (catalog, name). The key + // namespace is shared between boot-time and runtime converters, so this single call + // covers both cases; no-op for absent keys. Outside the priorMal guard so the first + // /inactivate of a static-only rule successfully drops the boot-time converter even + // though no Applied entry ever existed. + dropRuntimeConverter(catalog, name); + + final MalFileApplier.Applied priorMal = appliedFor(ctx.getRules(), key); + if (priorMal != null) { + final MalFileApplier applier = resolveApplier(); + if (applier == null) { + log.warn("runtime-rule MAL engine: MeterSystem unavailable; cannot unregister " + + "{} metric(s) for {}/{}", + priorMal.getRegisteredMetricNames().size(), catalog, name); + return; + } + applier.remove(priorMal.getRegisteredMetricNames(), ctx.getStorageOpt()); + ctx.getRules().computeIfPresent(key, (k, prev) -> prev.withApplied(null)); + log.info("runtime-rule MAL engine: unregistered {} metric(s) for {}/{}", + priorMal.getRegisteredMetricNames().size(), catalog, name); + DSLClassLoaderManager.INSTANCE.dropRuntime(Catalog.of(catalog), name); + ctx.getAlarmResetter().accept(priorMal.getRegisteredMetricNames()); + return; + } + + // Static-rule fallback. 
+ final String staticContent = StaticRuleRegistry.active().find(catalog, name).orElse(null); + if (staticContent == null) { + return; + } + final Set staticMetricNames = MalFileApplier.parseMetricNames(staticContent, sourceName); + if (staticMetricNames.isEmpty()) { + return; + } + final MalFileApplier applier = resolveApplier(); + if (applier == null) { + log.warn("runtime-rule MAL engine: MeterSystem unavailable; cannot unregister " + + "{} boot-registered metric(s) for {}/{}", + staticMetricNames.size(), catalog, name); + return; + } + applier.remove(staticMetricNames, ctx.getStorageOpt()); + log.info("runtime-rule MAL engine: unregistered {} boot-registered metric(s) for " + + "static rule {}/{}", + staticMetricNames.size(), catalog, name); + ctx.getAlarmResetter().accept(staticMetricNames); + } + + /** + * Discharge backend schema for {@code /delete}. {@code bundledContent} controls the + * destructiveness: + * + *

<ul>
+ *   <li>{@code null} (destructive): re-register prototypes locally under + * {@code localCacheOnly} (so the listener chain doesn't re-create the measure + * we're about to drop) and then tear down via {@link #unregister} under + * {@code fullInstall}. The two-step dance is needed because {@code /inactivate} + * has already cleared {@code appliedMal[key]}; without re-register, unregister + * would no-op the cascade and the backend would orphan.</li>
+ *   <li>non-null (delta): classify {@code runtimeContent} → {@code bundledContent} + * and drop only metrics the runtime row claims that bundled does NOT claim, plus + * metrics in both at different shape. Bundled-shared metrics at matching shape + * are preserved (no data loss for the measures bundled will reuse on its + * synchronous reload). The drop runs under {@code fullInstall} so the listener + * cascade fires.</li>
+ * </ul>
+ * + *

Throws {@link IllegalStateException} on MeterSystem unavailability or re-register + * failure; the caller propagates so the REST handler aborts {@code dao.delete}. + */ + @Override + public void dropBackend(final String catalog, final String name, + final String runtimeContent, final String bundledContent, + final MalApplyContext ctx) { + final MalFileApplier applier = resolveApplier(); + if (applier == null) { + throw new IllegalStateException( + "MeterSystem unavailable; cannot drop backend measure for " + catalog + "/" + + name + " — refusing to delete the row and orphan the measure. Retry " + + "when MeterSystem is up."); + } + if (bundledContent != null) { + dropBackendDelta(catalog, name, runtimeContent, bundledContent, applier); + return; + } + dropBackendDestructive(catalog, name, runtimeContent, applier, ctx); + } + + private void dropBackendDelta(final String catalog, final String name, + final String runtimeContent, final String bundledContent, + final MalFileApplier applier) { + final DSLDelta delta = DeltaClassifier.classifyMal(runtimeContent, bundledContent); + final Set toDrop = new HashSet<>(); + toDrop.addAll(delta.removedMetrics()); + toDrop.addAll(delta.shapeBreakMetrics()); + if (toDrop.isEmpty()) { + log.info("runtime-rule MAL engine: /delete bundled-twin delta empty for {}/{} — " + + "nothing to drop, bundled will reuse all existing measures", + catalog, name); + return; + } + log.info("runtime-rule MAL engine: /delete bundled-twin delta for {}/{} — dropping {} " + + "runtime-only / shape-break metric(s): {}", + catalog, name, toDrop.size(), toDrop); + applier.remove(toDrop, StorageManipulationOpt.fullInstall()); + } + + private void dropBackendDestructive(final String catalog, final String name, + final String runtimeContent, final MalFileApplier applier, + final MalApplyContext ctx) { + final String key = DSLScriptKey.key(catalog, name); + final String sourceName = catalog + "/" + name; + final String hash = 
ContentHash.sha256Hex(runtimeContent); + + // Re-register prototypes locally so unregister has Models + meterPrototypes to walk. + // localCacheOnly suppresses listener-side backend define — we don't want to recreate + // the measure we're about to drop. + try { + final MalFileApplier.Applied applied = applier.apply( + runtimeContent, sourceName, hash, StorageManipulationOpt.localCacheOnly()); + if (applied.getRuleClassLoader() != null) { + DSLClassLoaderManager.INSTANCE.commit(applied.getRuleClassLoader()) + .filter(prior -> prior != applied.getRuleClassLoader()) + .ifPresent(DSLClassLoaderManager.INSTANCE::retire); + } + ctx.getRules().compute(key, (k, prev) -> prev == null + ? new AppliedRuleScript(catalog, name, null, null) + .withContentAndApplied(runtimeContent, applied) + : prev.withContentAndApplied(runtimeContent, applied)); + } catch (final MalFileApplier.ApplyException ae) { + // Roll back any partial state that DID land before the throw — every other apply + // path does the same. localCacheOnly matches the apply: backend was untouched. + if (ae.getPartiallyRegistered() != null && !ae.getPartiallyRegistered().isEmpty()) { + try { + applier.remove(ae.getPartiallyRegistered(), + StorageManipulationOpt.localCacheOnly()); + } catch (final Throwable rollbackErr) { + log.warn("runtime-rule /delete: rollback of partial re-register also " + + "failed for {}/{}; {} prototype(s) may persist locally until OAP " + + "restart.", catalog, name, ae.getPartiallyRegistered().size(), + rollbackErr); + } + } + throw new IllegalStateException( + "re-register for backend drop failed for " + catalog + "/" + name + + "; refusing to delete the row to avoid orphaning the measure. " + + "Cause: " + ae.getMessage(), ae); + } + + // Tear down with fullInstall: drops backend (listener whenRemoving fires dropTable + // for each downsampling variant) and clears the re-registered local state. We need + // to swap the storage opt for this call — clone the context with fullInstall. 
+ final MalApplyContext fullInstallCtx = withStorageOpt(ctx, StorageManipulationOpt.fullInstall()); + unregister(catalog, name, fullInstallCtx); + } + + /** Clone {@code ctx} with the given storage opt. Used by the destructive + * {@link #dropBackend} path to flip from {@code localCacheOnly} (re-register) to + * {@code fullInstall} (destructive teardown). */ + private static MalApplyContext withStorageOpt(final MalApplyContext ctx, + final StorageManipulationOpt opt) { + final ApplyInputs inputs = new ApplyInputs( + ctx.getModuleManager(), opt, + ctx.getAlarmResetter(), ctx.getRules()); + return new MalApplyContext(inputs); + } + + /** + * Re-install the bundled MAL rule for {@code (catalog, name)} from {@link StaticRuleRegistry}. + * The runtime override that masked it has already been removed; without this fall-over, + * the bundled rule's MeterSystem prototypes / Metrics classes / MetricConvert would be + * gone and operators would have to restart the OAP to get the bundled metrics flowing + * again. + * + *

Compiles via {@code applier.apply(..., Kind.STATIC)} so the per-file loader is minted + * with the {@code static:} prefix; diagnostics can tell at a glance whether a key is + * being served by a runtime override or a static fall-over. The apply runs under + * {@code createIfAbsent}: bundled measures that pre-date the override are left as-is, + * and measures a preceding {@code /delete} dropped are recreated. Local prototypes are + * re-registered and the MetricConvert is re-published. + * + *

Alarm reset is forwarded for the bundled metric set so dispatch resumes against a + * clean window. + */ + @Override + public boolean reloadStatic(final String catalog, final String name, + final Consumer<Set<String>> alarmResetter, + final ModuleManager moduleManager) { + if (!CATALOGS.contains(catalog)) { + return false; + } + final String staticContent = StaticRuleRegistry.active().find(catalog, name).orElse(null); + if (staticContent == null || staticContent.isEmpty()) { + return false; + } + final MalFileApplier applier = resolveApplier(); + if (applier == null) { + log.warn("runtime-rule MAL engine: MeterSystem unavailable; cannot reload static " + + "rule {}/{} after override removal", catalog, name); + return false; + } + final String sourceName = catalog + "/" + name; + final String hash = ContentHash.sha256Hex(staticContent); + try { + // createIfAbsent rather than localCacheOnly: when reload follows a /delete that + // dropped runtime-only / shape-break measures (via dropBackendDelta), some + // bundled-claimed measures may be missing in the backend. createIfAbsent recreates + // them without affecting backends that already match. + final MalFileApplier.Applied fresh = applier.apply( + staticContent, sourceName, hash, + StorageManipulationOpt.createIfAbsent(), + DSLClassLoaderManager.Kind.STATIC); + // Promote the new static: loader. Any prior loader (typically null, because unregister + // already dropRuntime'd it) is retired so the graveyard observes its collection.
+ if (fresh.getRuleClassLoader() != null) { + DSLClassLoaderManager.INSTANCE.commit(fresh.getRuleClassLoader()) + .filter(prior -> prior != fresh.getRuleClassLoader()) + .ifPresent(DSLClassLoaderManager.INSTANCE::retire); + } + // Reset the entry to look like a fresh boot-seeded one: state = null so the + // next gone-keys pass correctly skips this as an untouched bundled-only entry + // (otherwise the post-/inactivate INACTIVE state would leak across and the + // tick would re-fire teardown + reload in a loop). + final String key = DSLScriptKey.key(catalog, name); + rules.compute(key, (k, prev) -> { + final ReentrantLock lock = prev != null ? prev.getLock() : new ReentrantLock(); + return new AppliedRuleScript(catalog, name, staticContent, null, lock, fresh); + }); + pushRuntimeConverter(catalog, name, fresh.getMetricConvert()); + alarmResetter.accept(fresh.getRegisteredMetricNames()); + log.info("runtime-rule MAL engine: static fall-over OK for {}/{} — {} metric(s) " + + "re-registered from bundled YAML", catalog, name, + fresh.getRegisteredMetricNames().size()); + return true; + } catch (final MalFileApplier.ApplyException ae) { + log.warn("runtime-rule MAL engine: static fall-over for {}/{} failed to compile " + + "the bundled YAML; bundled metrics will stay dark until next /addOrUpdate " + + "or restart", catalog, name, ae); + return false; + } + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/extension/DbOverrideRuntimeRuleResolver.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/extension/DbOverrideRuntimeRuleResolver.java new file mode 100644 index 000000000000..0e5eb6b3e519 --- /dev/null +++ 
b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/extension/DbOverrideRuntimeRuleResolver.java @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.extension; + +import java.io.IOException; +import java.nio.charset.StandardCharsets; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.rule.ext.RuntimeRuleOverrideResolver; +import org.apache.skywalking.oap.server.core.storage.StorageModule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.library.module.ModuleManager; + +/** + * Serves operator-supplied rule overrides (rows in the {@code runtime_rule} management table) + * to MAL / LAL static-file loaders at boot, so a reboot never regresses to the pre-override + * rule body on disk. 
Discovered via {@link java.util.ServiceLoader} (declared in + * {@code META-INF/services/org.apache.skywalking.oap.server.core.rule.ext.RuntimeRuleOverrideResolver}). + * + *

Priority

+ * Returns {@code 100} so this resolver wins over any priority-0 default resolver but loses to + * higher-priority sources operators might add later (GitOps, k8s ConfigMap, etc.). Pick a + * higher number on a more authoritative resolver to override DB content at boot. + * + *

Caching

+ * On the first {@link #loadAll} call per catalog, this implementation pulls every + * {@link RuntimeRuleManagementDAO.RuntimeRuleFile} once from storage and caches the catalog + * result in memory. Subsequent calls are pure in-memory lookups. The cache is per-process + * lifetime; new runtime changes after boot are driven by the runtime-rule reconciler tick + * via its own apply path, not through this static-load hook. + * + *

Behaviour per row

+ *
<ul>
+ *   <li>{@code ACTIVE} (or any other non-{@code INACTIVE} status) → {@link RuntimeRuleOverrideResolver.Resolution#active(byte[])}: + * the merger replaces the disk content with the DB content.</li>
+ *   <li>{@code INACTIVE} → {@link RuntimeRuleOverrideResolver.Resolution#inactive()}: + * the merger removes the entry, even if disk has a file for the same key.</li>
+ * </ul>
+ * + *

Failure modes

+ *
<ul>
+ *   <li>{@code manager} is {@code null}: the caller has no module context (tests). Returns an + * empty map; the merger leaves the disk baseline untouched.</li>
+ *   <li>Storage module not installed / DAO service not exposed: log INFO once, cache an empty + * map. Boot proceeds with pure static content.</li>
+ *   <li>Storage read throws {@code IOException}: log WARN, cache nothing for this catalog so the next call retries + * (handles boot-time transient failures such as the runtime_rule table not yet created, gRPC pool + * warming, or an ES template still being applied).</li>
+ * </ul>
+ */ +@Slf4j +public final class DbOverrideRuntimeRuleResolver implements RuntimeRuleOverrideResolver { + + /** + * Per-catalog cache of resolutions. Keyed by catalog name (e.g. {@code "otel-rules"}); + * value is the resolver's full opinion for that catalog. Populated on first + * {@link #loadAll} per catalog. A transient failure leaves the entry absent so the next + * call retries; a permanent absence (no DAO service) caches an empty map. + */ + private final Map<String, Map<String, Resolution>> cache = new HashMap<>(); + + public DbOverrideRuntimeRuleResolver() { + // Required public no-arg constructor for ServiceLoader instantiation. + } + + @Override + public int priority() { + // Baseline runtime-rule priority. Higher-priority sources (e.g. GitOps watchers) can + // override these by registering their own resolver with a larger priority value. + return 100; + } + + @Override + public Map<String, Resolution> loadAll(final String catalog, final ModuleManager manager) { + if (manager == null) { + // Test path or static-loader call without module context; nothing we can do. + return Collections.emptyMap(); + } + synchronized (this) { + final Map<String, Resolution> cached = cache.get(catalog); + if (cached != null) { + return cached; + } + final Map<String, Resolution> loaded = load(catalog, manager); + if (loaded != null) { + // Permanent answer (success OR DAO-unavailable cached empty); promote to cache. + cache.put(catalog, loaded); + return loaded; + } + // Transient failure: don't cache, return empty so this call's static file loads + // from disk; the next call will retry the DAO read. + return Collections.emptyMap(); + } + } + + /** + * One-shot DAO load + classification for a catalog. + * + * @return populated map (possibly empty) on success or permanent absence; {@code null} + * on a transient failure that the caller should not cache.
+ */ + private Map<String, Resolution> load(final String catalog, final ModuleManager manager) { + final RuntimeRuleManagementDAO dao; + try { + dao = manager.find(StorageModule.NAME).provider().getService(RuntimeRuleManagementDAO.class); + } catch (final Throwable t) { + // Permanent absence: runtime-rule plugin disabled, storage module shape mismatch, + // etc. Cache empty so we don't keep retrying for the rest of the process lifetime. + log.info("RuntimeRuleManagementDAO unavailable ({}); runtime-rule overrides will not " + + "apply to static boot load this run.", t.getMessage()); + return Collections.emptyMap(); + } + final List<RuntimeRuleManagementDAO.RuntimeRuleFile> rows; + try { + rows = dao.getAll(); + } catch (final IOException ioe) { + log.warn("Failed to read runtime_rule rows at boot for catalog {}; will retry on the " + + "next loadAll call. Static files load from disk for this pass.", catalog, ioe); + return null; + } + final Map<String, Resolution> result = new HashMap<>(); + for (final RuntimeRuleManagementDAO.RuntimeRuleFile row : rows) { + if (!catalog.equals(row.getCatalog())) { + continue; + } + if ("INACTIVE".equalsIgnoreCase(row.getStatus())) { + result.put(row.getName(), Resolution.inactive()); + } else { + // STATUS_ACTIVE (default) or any non-INACTIVE status: treat as active substitution. + final byte[] bytes = row.getContent() == null + ?
new byte[0] + : row.getContent().getBytes(StandardCharsets.UTF_8); + result.put(row.getName(), Resolution.active(bytes)); + } + } + log.info("Runtime-rule boot resolver loaded {} override(s) for catalog {}", result.size(), catalog); + return result; + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/metrics/LockMetrics.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/metrics/LockMetrics.java new file mode 100644 index 000000000000..0c7f7e65c996 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/metrics/LockMetrics.java @@ -0,0 +1,194 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.metrics; + +import java.util.concurrent.TimeUnit; +import java.util.concurrent.locks.ReentrantLock; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.telemetry.TelemetryModule; +import org.apache.skywalking.oap.server.telemetry.api.CounterMetrics; +import org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics; +import org.apache.skywalking.oap.server.telemetry.api.MetricsCreator; +import org.apache.skywalking.oap.server.telemetry.api.MetricsTag; + +/** + * Observability wrapper around per-file lock acquires. Exposes two helpers: + *
<ul>
+ *   <li>{@link #acquireForRest}: REST path. Calls {@code tryLock(timeout)}, records the wait time in a histogram, + * logs a WARN when the wait exceeds the threshold, and increments the timeout counter on failure.</li>
+ *   <li>{@link #tryAcquireForSyncTimer}: dslManager sync-timer path. Calls the non-blocking {@code tryLock()} and increments the skip counter on + * failure.</li>
+ * </ul>
+ * + *

Metric names (Prometheus-style, already scraped by the standard + * {@link TelemetryModule} telemetry pipeline): + *

<ul>
+ *   <li>{@code runtime_rule_lock_wait_seconds}: histogram of lock-wait duration on the + * REST path ({@code path=rest}). Sync-timer acquires are non-blocking (no wait time + * to observe), so there is no {@code path=sync-timer} variant.</li>
+ *   <li>{@code runtime_rule_lock_hold_seconds}: histogram of lock-hold duration per path + * ({@code path=rest|sync-timer}). Record it with the {@link HistogramMetrics.Timer} returned by + * {@link #startRestHoldTimer} or {@link #startSyncTimerHoldTimer}.</li>
+ *   <li>{@code runtime_rule_lock_contention_total}: counter of timed-out REST acquires and + * skipped sync-timer acquires, labeled by {@code path,outcome}.</li>
+ * </ul>
+ * + *

Graceful degradation: if the telemetry module isn't wired (embedded test topology), + * resolution returns null and all wrappers fall back to plain {@code ReentrantLock} calls + * without recording anything. Tests can therefore construct these without wiring a full + * telemetry pipeline. + */ +@Slf4j +public final class LockMetrics { + + private static final long REST_WAIT_WARN_THRESHOLD_MS = 1_000L; + + // Sync-timer path is non-blocking (tryLock() with no wait), so there is no wait + // histogram for it — the only contention signal we surface for ticks is the skip counter. + private final HistogramMetrics restWaitHistogram; + private final HistogramMetrics restHoldHistogram; + private final HistogramMetrics syncTimerHoldHistogram; + private final CounterMetrics restTimeoutCounter; + private final CounterMetrics syncTimerSkipCounter; + + public LockMetrics(final ModuleManager moduleManager) { + final MetricsCreator mc = resolve(moduleManager); + if (mc == null) { + this.restWaitHistogram = null; + this.restHoldHistogram = null; + this.syncTimerHoldHistogram = null; + this.restTimeoutCounter = null; + this.syncTimerSkipCounter = null; + log.info("runtime-rule lock metrics disabled — MetricsCreator not resolvable"); + return; + } + final MetricsTag.Keys pathKey = new MetricsTag.Keys("path"); + this.restWaitHistogram = mc.createHistogramMetric( + "runtime_rule_lock_wait_seconds", + "Per-file lock acquisition wait time on the REST path", + pathKey, new MetricsTag.Values("rest")); + this.restHoldHistogram = mc.createHistogramMetric( + "runtime_rule_lock_hold_seconds", + "Per-file lock hold duration on the REST path (full workflow)", + pathKey, new MetricsTag.Values("rest")); + this.syncTimerHoldHistogram = mc.createHistogramMetric( + "runtime_rule_lock_hold_seconds", + "Per-file lock hold duration on the dslManager sync-timer path (single apply)", + pathKey, new MetricsTag.Values("sync-timer")); + this.restTimeoutCounter = mc.createCounter( + 
"runtime_rule_lock_contention_total", + "Per-file lock contention events — REST timeouts + sync-timer skips", + new MetricsTag.Keys("path", "outcome"), + new MetricsTag.Values("rest", "timeout")); + this.syncTimerSkipCounter = mc.createCounter( + "runtime_rule_lock_contention_total", + "Per-file lock contention events — REST timeouts + sync-timer skips", + new MetricsTag.Keys("path", "outcome"), + new MetricsTag.Values("sync-timer", "skipped")); + } + + private static MetricsCreator resolve(final ModuleManager moduleManager) { + if (moduleManager == null) { + return null; + } + try { + return moduleManager.find(TelemetryModule.NAME).provider().getService(MetricsCreator.class); + } catch (final Throwable t) { + return null; + } + } + + /** + * REST-path acquisition. Blocks up to {@code timeoutMs} using + * {@link ReentrantLock#tryLock(long, TimeUnit)}. Returns true on acquire, false on + * timeout. Records wait histogram for both outcomes; increments the timeout counter on + * false; emits a WARN log line when an acquire took longer than + * {@link #REST_WAIT_WARN_THRESHOLD_MS} — catches pathological waits even without + * operators looking at the dashboard. 
+ */ + public boolean acquireForRest(final ReentrantLock lock, final long timeoutMs, + final String catalog, final String name) { + final long t0 = System.nanoTime(); + final boolean acquired; + try { + acquired = lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS); + } catch (final InterruptedException ie) { + Thread.currentThread().interrupt(); + return false; + } + final long waitMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0); + if (restWaitHistogram != null) { + restWaitHistogram.observe(waitMs / 1000.0d); + } + if (!acquired) { + if (restTimeoutCounter != null) { + restTimeoutCounter.inc(); + } + log.info("runtime-rule lock TIMEOUT on REST path {}/{} after {} ms", + catalog, name, waitMs); + return false; + } + if (waitMs >= REST_WAIT_WARN_THRESHOLD_MS) { + log.warn("runtime-rule lock SLOW on REST path {}/{} — waited {} ms before acquiring", + catalog, name, waitMs); + } + return true; + } + + /** + * Sync-timer-path acquisition. Non-blocking {@link ReentrantLock#tryLock()}. Returns + * true on acquire. On failure, increments the skip counter and returns false; caller + * continues to the next file without waiting. + */ + public boolean tryAcquireForSyncTimer(final ReentrantLock lock, + final String catalog, final String name) { + final boolean acquired = lock.tryLock(); + if (!acquired) { + if (syncTimerSkipCounter != null) { + syncTimerSkipCounter.inc(); + } + log.debug("runtime-rule lock skipped on sync-timer path {}/{} — busy", + catalog, name); + } + return acquired; + } + + /** Start a timer that records lock-hold duration for the REST path when closed. */ + public HistogramMetrics.Timer startRestHoldTimer() { + return restHoldHistogram == null ? NO_OP_TIMER : restHoldHistogram.createTimer(); + } + + /** Start a timer that records lock-hold duration for the sync-timer path when closed. */ + public HistogramMetrics.Timer startSyncTimerHoldTimer() { + return syncTimerHoldHistogram == null ? 
NO_OP_TIMER : syncTimerHoldHistogram.createTimer(); + } + + /** + * Null-object for the test / no-telemetry case. {@link HistogramMetrics.Timer} is an + * AutoCloseable; closing the null-object just does nothing. Avoids having to null-check + * at every call site. + */ + private static final HistogramMetrics.Timer NO_OP_TIMER = new HistogramMetrics.Timer(null) { + @Override + public void close() { + // no-op + } + }; +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelCreator.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModule.java similarity index 53% rename from oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelCreator.java rename to oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModule.java index f8d46e8ad144..4cf800ae9c22 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelCreator.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModule.java @@ -16,26 +16,26 @@ * */ -package org.apache.skywalking.oap.server.core.storage.model; +package org.apache.skywalking.oap.server.receiver.runtimerule.module; -import org.apache.skywalking.oap.server.core.storage.StorageException; -import org.apache.skywalking.oap.server.core.storage.annotation.Storage; -import org.apache.skywalking.oap.server.library.module.Service; +import org.apache.skywalking.oap.server.library.module.ModuleDefine; /** - * INewModel implementation supports creating a new module. + * Module contract for the runtime rule receiver. + * + *

Exposes an HTTP admin surface (default port 17128, disabled by default) for hot-updating + * MAL / LAL rule files without OAP restart. Skeleton only in this bundle — HTTP handlers return + * 501 until the apply pipeline lands in a later bundle. */ -public interface ModelCreator extends Service { - /** - * Add a new model - * - * @return the created new model - */ - Model add(Class aClass, int scopeId, Storage storage) throws StorageException; +public class RuntimeRuleModule extends ModuleDefine { + public static final String NAME = "receiver-runtime-rule"; - void addModelListener(CreatingListener listener) throws StorageException; + public RuntimeRuleModule() { + super(NAME); + } - interface CreatingListener { - void whenCreating(Model model) throws StorageException; + @Override + public Class[] services() { + return new Class[0]; } } diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java new file mode 100644 index 000000000000..46f9b7ebfe1c --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.module; + +import lombok.Getter; +import lombok.Setter; +import org.apache.skywalking.oap.server.library.module.ModuleConfig; + +@Getter +@Setter +public class RuntimeRuleModuleConfig extends ModuleConfig { + /** + * Default {@code 0.0.0.0}. Once the module is enabled via its selector, it binds on this + * address. Expose only behind a gateway / IP allow-list — never to the public internet. + */ + private String restHost = "0.0.0.0"; + /** Default {@code 17128}. Runtime-rule admin HTTP endpoint. */ + private int restPort = 17128; + private String restContextPath = "/"; + private int restIdleTimeOut = 30_000; + private int restAcceptQueueSize = 0; + private int httpMaxRequestHeaderSize = 8192; + /** DSLManager tick interval in seconds. 30 s is the documented convergence bound. */ + private long reconcilerIntervalSeconds = 30; + /** + * SUSPENDED state self-heal threshold in seconds. Must exceed dslManager tick + ES + * refresh + storage replica lag + RPC jitter. Default 60 s is conservative. 
+ */ + private long selfHealThresholdSeconds = 60; +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java new file mode 100644 index 000000000000..db226847aac3 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java @@ -0,0 +1,445 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.module; + +import com.linecorp.armeria.common.HttpMethod; +import java.util.Arrays; +import java.util.concurrent.Executors; +import java.util.concurrent.ScheduledExecutorService; +import java.util.concurrent.TimeUnit; +import lombok.extern.slf4j.Slf4j; +import org.apache.logging.log4j.util.Strings; +import org.apache.skywalking.oap.log.analyzer.v2.module.LogAnalyzerModule; +import org.apache.skywalking.oap.server.core.CoreModule; +import org.apache.skywalking.oap.server.core.RunningMode; +import org.apache.skywalking.oap.server.core.alarm.AlarmModule; +import org.apache.skywalking.oap.server.core.remote.client.RemoteClientManager; +import org.apache.skywalking.oap.server.core.server.GRPCHandlerRegister; +import org.apache.skywalking.oap.server.core.server.HTTPHandlerRegister; +import org.apache.skywalking.oap.server.core.server.HTTPHandlerRegisterImpl; +import org.apache.skywalking.oap.server.core.storage.StorageModule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.library.module.ModuleDefine; +import org.apache.skywalking.oap.server.library.module.ModuleProvider; +import org.apache.skywalking.oap.server.library.module.ModuleStartException; +import org.apache.skywalking.oap.server.library.module.ServiceNotProvidedException; +import org.apache.skywalking.oap.server.library.server.http.HTTPServer; +import org.apache.skywalking.oap.server.library.server.http.HTTPServerConfig; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterClient; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterServiceImpl; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLManager; +import 
org.apache.skywalking.oap.server.receiver.runtimerule.rest.RuntimeRuleRestHandler; +import org.apache.skywalking.oap.server.telemetry.TelemetryModule; +import org.apache.skywalking.oap.server.telemetry.api.TelemetryRelatedContext; + +/** + * Boots the runtime-rule admin surface and the components that converge MAL / LAL rule + * changes across an OAP cluster. Disabled by default — the provider is loaded only when + * an operator enables it (selector {@code default} or env var + * {@code SW_RECEIVER_RUNTIME_RULE=default}). Until then the admin port never opens, no + * scheduled tick fires, no cluster RPC is registered. + * + *

What this provider wires

+ *
    + *
  • {@link HTTPServer} on port 17128 — the admin REST surface + * ({@code /runtime/rule/addOrUpdate}, {@code /inactivate}, {@code /delete}, + * {@code /list}, {@code /dump}, single-rule fetch, bundled catalogue).
  • + *
  • {@link DSLManager} + a single-thread scheduled executor — local-state convergence + * on the periodic tick (default 30 s) plus a synchronous first tick at boot.
  • + *
  • {@link RuntimeRuleClusterServiceImpl} on the cluster gRPC bus — receives Suspend + * / Resume / Forward RPCs from peers.
  • + *
  • {@link RuntimeRuleClusterClient} — outbound counterpart for broadcasts and + * forward-to-main writes.
  • + *
  • {@link RuntimeRuleManagementDAO} resolved through the active storage module — + * per-backend upsert / read / delete on the rule rows.
  • + *
+ * + *

Architecture: scheduler · orchestrators · engines

+ * Three layers, with one boundary between each: + *
    + *
  • Scheduler ({@link DSLManager} + REST handler). DSL-agnostic. Owns lock + * acquisition, cluster Suspend/Resume RPC fan-out, persistence (DAO upsert), + * cross-file ownership enforcement, tick cadence, self-heal, classloader + * graveyard lifecycle, alarm-reset dispatch. Holds + * references to the engines via {@code RuleEngineRegistry} and drives the two + * orchestrators below.
  • + *
  • Orchestrators. Two of them, one per pipeline: + *
      + *
    • {@link org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLRuntimeApply} + * — apply pipeline for NEW / FILTER_ONLY / STRUCTURAL classifications. Routes to + * the right engine via the registry, drives compile → fireSchemaChanges → verify → + * commit | rollback. Returns an {@code Outcome} the scheduler reads to update + * snapshot + persistence.
    • + *
    • {@link org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLRuntimeUnregister} + * — tear-down pipeline for INACTIVE / {@code /delete} / gone-keys cleanup. Routes + * to {@code engine.unregister}.
    • + *
    + * The orchestrators are DSL-agnostic — they only know the SPI surface, not which engine + * is registered behind a given catalog.
  • + *
  • Engines ({@code MalRuleEngine}, {@code LalRuleEngine}, future + * {@code OalRuleEngine}). DSL-specific. Each implements + * {@link org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine}: classify, + * claimedKeys, compile, fireSchemaChanges, verify, commit, rollback, unregister. + * Engines own everything that depends on the DSL — Javassist class generation, applier + * registration, post-DDL probe semantics, classloader retire, alarm-reset target sets. + * Adding a new DSL is one SPI implementation + a {@code RuleEngineRegistry.register} + * call; no scheduler or orchestrator edit needed.
  • + *
+ * + *
+ *   ┌─────────────────────────  scheduler  ──────────────────────────┐
+ *   │ RuntimeRuleRestHandler  →  DSLManager.applyOneRuleFileInternal │
+ *   │                                       │                        │
+ *   │   • catalog routing (engineRegistry.forCatalog)                 │
+ *   │   • main / peer routing (MainRouter)                            │
+ *   │   • per-file lock acquisition                                   │
+ *   │   • Suspend/Resume RPC fan-out                                  │
+ *   │   • cross-file ownership guard (DAO + appliedX)                 │
+ *   │   • storage-opt selection (fullInstall / localCacheOnly /       │
+ *   │     localCacheVerify) — gates whether DDL fires                 │
+ *   │   • persistence (RuntimeRuleManagementDAO.save) + 2-PC stash    │
+ *   │     for STRUCTURAL via StructuralCommitCoordinator              │
+ *   │   • DSLRuntimeUnregister orchestrator routes teardown to engine │
+ *   └────────────┬────────────────────────────────┬──────────────────┘
+ *                │                                │
+ *                ▼                                ▼
+ *   ┌──────  MalRuleEngine  ──────┐    ┌──────  LalRuleEngine  ──────┐
+ *   │  catalogs: otel-rules,      │    │  catalogs: lal              │
+ *   │            log-mal-rules,   │    │                             │
+ *   │            telegraf-rules   │    │                             │
+ *   │                             │    │                             │
+ *   │  classify(old, new, ina)    │    │  classify(old, new, ina)    │
+ *   │  claimedKeys(content, src)  │    │  claimedKeys(content, src)  │
+ *   │  compile → CompiledMalDSL   │    │  compile → CompiledLalDSL   │
+ *   │  fireSchemaChanges (no-op)  │    │  fireSchemaChanges (no-op)  │
+ *   │  verify → null | error str  │    │  verify (no-op, null)       │
+ *   │  commit                     │    │  commit                     │
+ *   │  rollback                   │    │  rollback                   │
+ *   │  unregister                 │    │  unregister                 │
+ *   └─────────────────────────────┘    └─────────────────────────────┘
+ * 
+ * + *

Phase pipeline (per-file)

+ *
+ *   classify  ─►  NO_CHANGE   →  scheduler skips (unless forced)
+ *             ─►  INACTIVE    →  scheduler routes to engine.unregister
+ *             ─►  NEW / FILTER_ONLY / STRUCTURAL → continue:
+ *
+ *   claimedKeys                    (scheduler runs cross-file guard on this set)
+ *   engine.newApplyContext(inputs) (engine narrows shared inputs into its own context)
+ *   engine.compile                 (compile classes + register handlers; NO commit yet)
+ *   engine.fireSchemaChanges       (drive listener chain; no-op for MAL since fused into
+ *                                   compile, no-op for LAL since no backend schema)
+ *   engine.verify                  (post-DDL probe; MAL: isExists; LAL: no-op)
+ *           │
+ *           ├─ verify failed →  engine.rollback (drop just-registered)
+ *           └─ verify ok      →  engine.commit  (atomic in-memory swap, retire CL,
+ *                                                fire alarm reset)
+ * 
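The phase order in the diagram above can be sketched in plain Java. This is only an illustration under stated assumptions: `SketchEngine`, `PhasePipelineSketch` and `RecordingEngine` are hypothetical names, the method signatures are simplified stand-ins for the real `RuleEngine` SPI (which works against a per-engine apply context), and the persist / 2-PC stash step for STRUCTURAL changes is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the RuleEngine SPI; signatures are assumed.
interface SketchEngine {
    String classify(String oldContent, String newContent, boolean inactive);
    void compile(String content);
    void fireSchemaChanges();
    String verify();   // null on success, error text on failure
    void commit();
    void rollback();
}

final class PhasePipelineSketch {
    // Walks one rule file through the phases and returns an outcome label.
    static String apply(SketchEngine engine, String oldContent, String newContent, boolean inactive) {
        String classification = engine.classify(oldContent, newContent, inactive);
        if ("NO_CHANGE".equals(classification)) {
            return "SKIPPED";                  // scheduler skips unless forced
        }
        if ("INACTIVE".equals(classification)) {
            return "UNREGISTER";               // scheduler would route to engine.unregister
        }
        engine.compile(newContent);            // compiled and registered, not committed yet
        engine.fireSchemaChanges();
        String error = engine.verify();
        if (error != null) {
            engine.rollback();                 // drop what compile just registered
            return "ROLLED_BACK: " + error;
        }
        engine.commit();                       // in-memory swap happens only here
        return "COMMITTED";
    }
}

// Test double that records which phases ran.
final class RecordingEngine implements SketchEngine {
    final List<String> calls = new ArrayList<>();
    private final String verifyError;

    RecordingEngine(String verifyError) {
        this.verifyError = verifyError;
    }

    @Override public String classify(String oldContent, String newContent, boolean inactive) {
        if (inactive) {
            return "INACTIVE";
        }
        if (newContent.equals(oldContent)) {
            return "NO_CHANGE";
        }
        return oldContent == null ? "NEW" : "STRUCTURAL";
    }

    @Override public void compile(String content) { calls.add("compile"); }
    @Override public void fireSchemaChanges() { calls.add("fireSchemaChanges"); }
    @Override public String verify() { calls.add("verify"); return verifyError; }
    @Override public void commit() { calls.add("commit"); }
    @Override public void rollback() { calls.add("rollback"); }
}
```

Calling `PhasePipelineSketch.apply(new RecordingEngine(null), null, "rule-yaml", false)` runs compile, fireSchemaChanges, verify and commit in that order; passing a non-null verify error ends in rollback and never reaches commit.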
+ * + *

Per-file lifecycle on shared mechanism

+ *
+ *   POST /runtime/rule/{addOrUpdate|inactivate|delete}
+ *        │
+ *        ├─ scheduler:  validate catalog (registry-driven), find main, forward if peer
+ *        ├─ scheduler:  acquire per-file lock; broadcast Suspend (peers park dispatch)
+ *        ├─ engines:    classify → claimedKeys → compile → fire → verify → commit | rollback
+ *        ├─ scheduler:  persist row (DAO.save) — STRUCTURAL stashes commit until persist OK
+ *        ├─ scheduler:  finalize commit  (success)  → drop removedMetrics, snapshot RUNNING,
+ *        │              broadcastResume; or
+ *        │              discard commit   (failure)  → engine.rollback, broadcastResume
+ *        └─ scheduler:  release lock; return HTTP status (200 / 409 / 421 / 500 / 503)
+ * 
+ * + *

Peers converge on the next dslManager tick by reading the persisted row and re-running + * the same engines under {@link StorageManipulationOpt#localCacheOnly} — peers register + * local handlers + prototypes but skip backend DDL since main has already done the writes. + * {@code /inactivate} is soft-pause (localCacheOnly — backend preserved, OAP-internal state + * torn down); {@code /delete} is destructive (fullInstall so the listener chain fires + * {@code dropTable}). Both ride the same {@link + * org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLRuntimeUnregister} + * orchestrator that dispatches to {@code engine.unregister}. + * + *

Catalog → engine routing

+ * Catalog membership is data-driven through {@code RuleEngineRegistry}: a catalog is "MAL" + * iff a registered engine is {@code MalRuleEngine}. Adding {@code telegraf-rules} support is + * one entry in {@code MalRuleEngine.supportedCatalogs} — REST validation, scheduler routing, + * and tick enumeration pick it up automatically. (Full telegraf apply additionally requires + * the telegraf receiver module to expose a {@code MalConverterRegistry} service.) + * + *
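A minimal sketch of data-driven catalog routing, assuming each engine exposes the set of catalogs it serves. `CatalogRegistrySketch` and its nested `Engine` interface are hypothetical; only the names `register`, `forCatalog` and `supportedCatalogs` come from the text above, and their real signatures in `RuleEngineRegistry` may differ.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class CatalogRegistrySketch {
    // Hypothetical minimal engine view: a name plus the catalogs it serves.
    interface Engine {
        String name();
        Set<String> supportedCatalogs();
    }

    private final Map<String, Engine> byCatalog = new ConcurrentHashMap<>();

    // Indexes the engine under every catalog it declares; a catalog has one owner.
    void register(Engine engine) {
        for (String catalog : engine.supportedCatalogs()) {
            Engine prior = byCatalog.putIfAbsent(catalog, engine);
            if (prior != null && prior != engine) {
                throw new IllegalStateException(
                    "catalog " + catalog + " already served by " + prior.name());
            }
        }
    }

    // REST validation and scheduler routing would both go through this lookup.
    Optional<Engine> forCatalog(String catalog) {
        return catalog == null ? Optional.empty() : Optional.ofNullable(byCatalog.get(catalog));
    }

    // Convenience factory for the sketch only.
    static Engine engine(final String name, final String... catalogs) {
        final Set<String> set = new LinkedHashSet<>(Arrays.asList(catalogs));
        return new Engine() {
            @Override public String name() { return name; }
            @Override public Set<String> supportedCatalogs() { return set; }
        };
    }
}
```

With this shape, supporting an additional catalog is a change to one engine's declared set; callers that validate or route by catalog need no edit because they only ever ask the registry.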

The full architecture (single-main routing, lock acquisition policy, marker-debt + * invariant for cold-boot / topology-shift, cross-file ownership semantics, soft-pause / + * delete split, self-heal backstop) lives in the design doc: + * {@code docs/en/concepts-and-designs/runtime-rule-hot-update.md}. + */ +@Slf4j +public class RuntimeRuleModuleProvider extends ModuleProvider { + + /** + * Per-peer Suspend / Resume RPC deadline. 2 s: enough for a healthy cluster round-trip, + * short enough that a single unreachable peer doesn't stall an /addOrUpdate request. + */ + private static final long SUSPEND_RPC_DEADLINE_MS = 2_000L; + /** + * Forward-to-main RPC deadline. Longer than Suspend / Resume because the forwarded + * workflow includes compile + DDL + persist on the main. 30 s covers the typical upper + * bound for a BanyanDB-backed apply with a handful of added metrics; larger rule files + * may need tuning via the module config in a future change. + */ + private static final long FORWARD_RPC_DEADLINE_MS = 30_000L; + + /** + * Initial delay before the scheduled dslManager's first tick. 2 seconds, just past + * {@code RemoteClientManager}'s 1 s initial refresh, so the peer list is almost always + * populated by the time we run. This closes the cold-boot gap for runtime-only DB rows + * when the synchronous first tick in {@link #notifyAfterCompleted} failed, or ran under + * {@code localCacheOnly} because the peer list was not yet populated at that moment; + * without this, a restart could leave persisted MAL/LAL overrides absent (or applied + * without backend DDL) for a full {@code reconcilerIntervalSeconds} window + * (default 30 s) while ingest runs against static-shape workers. + * + *

Deliberately NOT read from {@code reconcilerIntervalSeconds}: that value controls + * steady-state convergence cadence, not the one-shot catch-up that must happen as soon + * as the peer list is ready. Tick is idempotent, so running at 2 s and again at 2 s + + * {@code reconcilerIntervalSeconds} is cheap — the hash-match short-circuit skips + * unchanged bundles. + */ + private static final long SCHEDULER_INITIAL_DELAY_SECONDS = 2L; + + private RuntimeRuleModuleConfig moduleConfig; + private HTTPServer httpServer; + private ScheduledExecutorService reconcilerExecutor; + private DSLManager dslManager; + + @Override + public String name() { + return "default"; + } + + @Override + public Class module() { + return RuntimeRuleModule.class; + } + + @Override + public ConfigCreator newConfigCreator() { + return new ConfigCreator() { + @Override + public Class type() { + return RuntimeRuleModuleConfig.class; + } + + @Override + public void onInitialized(final RuntimeRuleModuleConfig initialized) { + moduleConfig = initialized; + } + }; + } + + @Override + public void prepare() throws ServiceNotProvidedException { + if (moduleConfig.getRestPort() <= 0) { + throw new ServiceNotProvidedException( + "runtime-rule: restPort must be > 0 when the module is enabled, got " + moduleConfig.getRestPort()); + } + final HTTPServerConfig httpServerConfig = + HTTPServerConfig.builder() + .host(Strings.isBlank(moduleConfig.getRestHost()) ? 
"0.0.0.0" + : moduleConfig.getRestHost()) + .port(moduleConfig.getRestPort()) + .contextPath(moduleConfig.getRestContextPath()) + .acceptQueueSize(moduleConfig.getRestAcceptQueueSize()) + .idleTimeOut(moduleConfig.getRestIdleTimeOut()) + .maxRequestHeaderSize(moduleConfig.getHttpMaxRequestHeaderSize()) + .build(); + httpServer = new HTTPServer(httpServerConfig); + httpServer.setBlockingTaskName("runtime-rule-http"); + httpServer.initialize(); + } + + @Override + public void start() throws ServiceNotProvidedException, ModuleStartException { + // DSLManager is constructed first so both the HTTP handler and the cluster Suspend + // service can reference it. The scheduled executor is started in notifyAfterCompleted + // after all other modules have finished their boot. The DSLManager builds its own + // RuleEngineRegistry from the per-DSL state maps it owns. + dslManager = new DSLManager( + getManager(), + moduleConfig.getSelfHealThresholdSeconds() * 1000L + ); + + // Cluster-facing Suspend client: fans out to every non-self peer on the OAP cluster bus + // during an addOrUpdate / delete / inactivate so peers stop serving the old bundle + // before the main node commits the row change. Uses the RemoteClient's established + // ManagedChannel — no duplicate channel caching. + final String selfNodeId = TelemetryRelatedContext.INSTANCE.getId(); + final RemoteClientManager remoteClientManager = getManager().find(CoreModule.NAME) + .provider() + .getService(RemoteClientManager.class); + final RuntimeRuleClusterClient clusterClient = new RuntimeRuleClusterClient( + remoteClientManager, selfNodeId, SUSPEND_RPC_DEADLINE_MS); + + // The runtime-rule surface runs on its own HTTPServer bound to a distinct admin port; + // it intentionally does not share the sharing-server HTTPHandlerRegister so the admin + // port stays isolated from the public receiver port. 
+ final RuntimeRuleRestHandler restHandler = new RuntimeRuleRestHandler( + getManager(), dslManager, clusterClient, FORWARD_RPC_DEADLINE_MS); + final HTTPHandlerRegister adminRegister = new HTTPHandlerRegisterImpl(httpServer); + adminRegister.addHandler( + restHandler, + Arrays.asList(HttpMethod.POST, HttpMethod.GET) + ); + + // Register the cluster-internal Suspend / Resume / Forward RPCs on the OAP cluster-bus + // gRPC server (the same server that hosts RemoteService and HealthCheck). Every OAP + // node in the cluster exposes these endpoints so: (a) the main can fan out Suspend / + // Resume to peers during a STRUCTURAL apply, and (b) a non-main OAP that receives an + // HTTP write can transparently forward the work to the main via Forward. The service + // needs a late-bound REST-handler reference for the Forward dispatch target — wired + // right after construction so the first incoming Forward RPC has a workflow to call. + final GRPCHandlerRegister clusterGrpc = getManager().find(CoreModule.NAME) + .provider() + .getService(GRPCHandlerRegister.class); + final RuntimeRuleClusterServiceImpl clusterService = + new RuntimeRuleClusterServiceImpl(dslManager, selfNodeId); + clusterService.setRuntimeRuleService(restHandler.getService()); + clusterGrpc.addHandler(clusterService); + log.info( + "Runtime rule Suspend / Resume / Forward RPCs registered on cluster gRPC server " + + "(selfNodeId={}).", selfNodeId + ); + } + + @Override + public void notifyAfterCompleted() throws ModuleStartException { + // Seed synthetic applied-state entries from the static rules the catalog loaders + // already registered (MeterProcessService, OpenTelemetryMetricRequestProcessor, + // LogFilterListener.Factory). 
With the seed in place, the dslManager's first tick + // knows those bundles are live — rehydrate won't double-apply — and a later + // /inactivate can cleanly tear down the boot-registered handlers via unregisterBundle + // (which now consults StaticRuleRegistry when appliedMal / appliedLal has no entry). + try { + dslManager.getStaticRuleLoader().loadAll(); + } catch (final Throwable t) { + log.warn("Runtime rule dslManager: static-bundle seeding failed — static rules " + + "will still serve, but the first /inactivate against a shipped rule may " + + "need a restart to fully converge.", t); + } + + // Run one tick before receivers open to close the boot gap for runtime-only rows + // (no static file substitute). Unconditional — no peer-list-readiness gate. The + // earlier gate consulted {@code RemoteClientManager.getRemoteClient().isEmpty()}, + // but that signal is "list is non-empty right now", not "membership has stabilised". + // In a k8s rollout the list flips to non-empty as soon as self joins it, then keeps + // changing for tens of seconds as more pods boot. Gating on it neither guaranteed + // membership stability nor saved a wasteful first apply. If this tick runs under + // {@code localCacheOnly} because peer list is empty, the next scheduled tick (2 s + // later) re-evaluates with whatever {@code RemoteClientManager} now shows and re- + // applies under {@code fullInstall} if this node resolves as main. Backend DDL is + // idempotent so the re-apply costs nothing. + try { + // atBoot=true so a no-init OAP picks localCacheVerify and refuses to + // start with a missing or shape-mismatched backend (k8s pod backloop) + // instead of silently registering local workers against schema that + // doesn't exist. Init / default-mode OAPs are unaffected — their boot + // opt mirrors the standard tick choice for those modes. 
+ dslManager.tick(true); + log.info("Runtime rule dslManager: synchronous first tick completed " + + "(runtime-only DB rows are now applied locally)."); + } catch (final RuntimeException re) { + // Boot pass under localCacheVerify re-throws missing/mismatch as a + // RuntimeException so module bootstrap aborts. Translate to + // ModuleStartException so the OAP exit message points the operator at + // the right place. + throw new ModuleStartException( + "Runtime rule dslManager boot pass failed under localCacheVerify; " + + "the backend schema is missing or diverges from the declared rule. " + + "Bring up the init OAP first or align rule files with the backend, " + + "then restart this node.", + re); + } catch (final Throwable t) { + log.warn("Runtime rule dslManager: synchronous first tick failed — " + + "runtime-only DB rows will be picked up on the scheduled tick.", t); + } + + if (httpServer != null && !RunningMode.isInitMode()) { + httpServer.start(); + log.info( + "Runtime rule admin HTTP server listening on {}:{} (disabled-by-default " + + "module is now active — gateway-protect or restrict to localhost).", + moduleConfig.getRestHost(), moduleConfig.getRestPort() + ); + } + + // DSLManager runs on its own single-thread executor so the tick body cannot starve any + // other OAP scheduler. Tick interval is configurable; default 30s. The DSLManager + // instance itself was constructed in start() so the cluster Suspend service could + // reference it — we just schedule its tick here. + reconcilerExecutor = Executors.newSingleThreadScheduledExecutor(r -> { + final Thread t = new Thread(r, "runtime-rule-dslManager"); + t.setDaemon(true); + return t; + }); + final long intervalSeconds = moduleConfig.getReconcilerIntervalSeconds(); + // Initial delay is fixed at SCHEDULER_INITIAL_DELAY_SECONDS (2 s), not intervalSeconds. 
+ // The synchronous tick in notifyAfterCompleted is unconditional, but it may have failed, + // or run under localCacheOnly because the peer list was still empty; running the first + // scheduled tick only after intervalSeconds (30 s by default) would leave persisted + // runtime-only rules dark, or applied without backend DDL, for that whole window. By + // firing the first scheduled tick ~2 s in, RemoteClientManager's 1 s initial refresh + // has almost certainly populated the peer list, and tickStorageOpt can make a stable + // main/peer decision. Tick is idempotent, so firing at 2 s and then at 2 s + + // intervalSeconds is cheap: unchanged bundles short-circuit on hash. + reconcilerExecutor.scheduleWithFixedDelay( + dslManager::tick, + SCHEDULER_INITIAL_DELAY_SECONDS, intervalSeconds, TimeUnit.SECONDS + ); + log.info("Runtime rule dslManager scheduled: first tick in {} s, then every {} s.", + SCHEDULER_INITIAL_DELAY_SECONDS, intervalSeconds); + } + + @Override + public String[] requiredModules() { + return new String[] { + // CoreModule: required for RemoteClientManager (cluster peer list for routing + + // broadcast), MeterSystem + IModelManager (apply pipeline), and GRPCHandlerRegister + // (exposing Suspend / Resume / Forward RPCs on the cluster bus). + CoreModule.NAME, + // StorageModule: RuntimeRuleManagementDAO + ManagementStreamProcessor target live + // here; without it, /list, /delete, and dslManager reads have no backend. + StorageModule.NAME, + // LogAnalyzerModule: exposes the LogFilterListener.Factory service the dslManager's + // LAL apply path drives. Always declared so module boot fails fast rather than + // masking a broken deployment behind the runtime-rule's "LAL Factory unavailable" + // surface. + LogAnalyzerModule.NAME, + // AlarmModule: the dslManager fires AlarmKernelService.reset after STRUCTURAL and + // unregister paths. Declared so module boot fails fast when deployments accidentally + // drop the alarm module. 
The DSLManager still wraps the lookup in try/catch for + // defensive handling of transient provider outages, not as an "optional module" + // signal. + AlarmModule.NAME, + // TelemetryModule — exposes MetricsCreator for the lock-observability histograms + + // counters (runtime_rule_lock_*). Declared so the module refuses to start on a + // deployment where internal metrics wouldn't surface; LockMetrics itself still + // null-guards the resolve call so test topologies without telemetry can instantiate + // the handler without recording metrics. + TelemetryModule.NAME, + }; + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java new file mode 100644 index 000000000000..317128476fef --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java @@ -0,0 +1,767 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Objects; +import java.util.Set; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.locks.ReentrantLock; +import lombok.Getter; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.receiver.runtimerule.metrics.LockMetrics; +import org.apache.skywalking.oap.server.core.CoreModule; +import org.apache.skywalking.oap.server.core.alarm.AlarmKernelService; +import org.apache.skywalking.oap.server.core.alarm.AlarmModule; +import org.apache.skywalking.oap.server.core.storage.StorageModule; +import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.core.RunningMode; +import org.apache.skywalking.oap.server.core.remote.client.RemoteClientManager; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.MainRouter; +import org.apache.skywalking.oap.server.library.module.ModuleManager; + +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.LalFileApplier; +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.MalFileApplier; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.Classification; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngineRegistry; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.lal.LalRuleEngine; +import 
org.apache.skywalking.oap.server.receiver.runtimerule.engine.mal.MalRuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.util.ContentHash; + +/** + * Local per-node state owner + periodic convergence driver for runtime MAL / LAL rule + * bundles. The full architecture (workflows, REST sequence, tick / self-heal diagrams, + * storage-policy split, lock acquisition policy, failure model) lives in the design doc: + * {@code docs/en/concepts-and-designs/runtime-rule-hot-update.md}. This Javadoc covers + * only what's needed to read the code in this class. + * + *

State owned

+ *

All keyed by {@code "catalog:name"}: + *

    + *
  • {@link #rules} — unified per-key {@link AppliedRuleScript}: catalog + name + last + * successfully-applied raw YAML + authoritative {@link DSLRuntimeState} (returned by + * {@code /list}, carries {@link DSLRuntimeState.SuspendOrigin SuspendOrigin}). One map + * under one per-file lock; all per-file operations read or replace one entry instead + * of coordinating across parallel maps.
  • + *
  • Engine-applied artefacts — live on {@link AppliedRuleScript#getApplied} as an + * {@link org.apache.skywalking.oap.server.receiver.runtimerule.state.EngineApplied}. + * Engines write on commit, read on compile/unregister; cross-DSL code (Suspend/Resume + * coordinator, ownership guard) drives the polymorphic interface without switching + * on MAL vs LAL.
  • + *
  • {@link StructuralCommitCoordinator} pending-commits stash — structural MAL commits + * verified but waiting on row-persist; drained by + * {@link StructuralCommitCoordinator#finalizeCommit} / + * {@link StructuralCommitCoordinator#discardCommit}.
  • + *
+ * + *

Locking

+ *

The per-file {@link ReentrantLock} on each {@link AppliedRuleScript} is the outermost + * ordering primitive. Every public entry point that mutates a {@code (catalog, name)} + * acquires it; the REST handler wraps the whole workflow in the same lock. Cross-file + * edits run concurrently. MeterSystem and the LAL factory handle their own internal + * locking, so this lock only needs to protect this class's state plus the structural- + * commit stash. Acquisition policy: REST → {@code tryLock(REST_LOCK_TIMEOUT_MS)}, 409 + * on timeout. Tick → {@code tryLock()} no-wait, skip-and-retry-next-tick on contention. + * Internal coordinator methods → blocking {@code lock()} (operations are short). + * + *
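The three acquisition policies described above can be sketched with a plain `ReentrantLock`. This is an illustration, not the class's actual code: `LockPolicySketch` and its method names are hypothetical, the 100 ms value for `REST_LOCK_TIMEOUT_MS` is an assumed placeholder (the real constant's value is not shown here), and mapping an interrupted wait to 409 is also an assumption.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class LockPolicySketch {
    // Assumed placeholder; the real REST_LOCK_TIMEOUT_MS is defined elsewhere.
    static final long REST_LOCK_TIMEOUT_MS = 100L;

    // REST path: bounded wait, then report a conflict instead of blocking the caller.
    static int restPath(ReentrantLock lock, Runnable work) {
        boolean acquired;
        try {
            acquired = lock.tryLock(REST_LOCK_TIMEOUT_MS, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return 409;   // assumption: an interrupted wait is reported like a timeout
        }
        if (!acquired) {
            return 409;
        }
        try {
            work.run();
            return 200;
        } finally {
            lock.unlock();
        }
    }

    // Tick path: never waits; a busy file is skipped and retried on the next tick.
    static boolean tickPath(ReentrantLock lock, Runnable work) {
        if (!lock.tryLock()) {
            return false;
        }
        try {
            work.run();
            return true;
        } finally {
            lock.unlock();
        }
    }

    // Internal coordinator path: blocking acquire, acceptable because the work is short.
    static void internalPath(ReentrantLock lock, Runnable work) {
        lock.lock();
        try {
            work.run();
        } finally {
            lock.unlock();
        }
    }
}
```

The asymmetry is the point of the design: an operator request gets a bounded wait and an explicit conflict status, while the periodic tick gives way immediately because it will run again anyway.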

Source of truth

+ *

The persisted runtime-rule entry (BanyanDB Property / ES Document / JDBC Row) is + * authoritative cluster-wide. This class's in-memory maps are the local projection + * converged toward that source on every {@link #tick}. The reconcile is best-effort: + * the scheduled {@code tick()} never lets an exception escape; per-file failures are logged + * and retried on the next iteration. + * + *

Single-main routing

+ *

Each {@code (catalog, name)} routes to a single deterministic main (see + * {@code MainRouter}). The main is the only node that self-suspends, applies, persists, + * finalizes / discards. Peers receive {@code Suspend} / {@code Resume} RPCs. Routing + * conflicts ({@link SuspendResult#REJECTED_ORIGIN_CONFLICT}) surface as 409 to the + * operator. The REST handler forwards non-main requests to the resolved main; 421 fires + * only when a forwarded request arrives at a node that itself doesn't believe it's main + * (split cluster view). The tick picks its storage opt via {@link #tickStorageOpt(boolean)}; + * the per-endpoint REST opts are routed by the handler ({@code /addOrUpdate} → + * {@code fullInstall}, {@code /inactivate} → {@code localCacheOnly}, {@code /delete} → + * dedicated {@link DSLRuntimeDelete} path). + */ +@Slf4j +public final class DSLManager { + + /** + * Unified per-rule map: each entry is a single {@link AppliedRuleScript} carrying the + * raw YAML last successfully applied + the authoritative {@link DSLRuntimeState}. + * {@code AppliedRuleScript} is immutable; updates produce a new instance via the + * {@code with*} builders, so a {@link java.util.concurrent.ConcurrentMap#compute compute} + * call on this map gives atomic per-key transitions without an external lock. + * + *

Replaces the historical pair of parallel maps ({@code snapshot}, {@code appliedContent}): + * every per-file operation (classify, apply, unregister, suspend, resume, persist, + * {@code /list}) now reads or replaces one entry on this single map instead of + * coordinating across two. The per-file lock orders writes that must be atomic w.r.t. + * each other (e.g. an engine's commit content-write + the orchestrator's snapshot + * state-write); inside that lock, individual {@code rules.compute} calls are themselves + * already atomic. + */ + @Getter + private final Map<String, AppliedRuleScript> rules = new ConcurrentHashMap<>(); + + private final ModuleManager moduleManager; + /** SELF / PEER / BOTH origin transitions + dispatch park/unpark + self-heal sweep. + * Exposed via {@code @Getter} so callers (REST handler, cluster RPC handler) can + * reach Suspend/Resume directly without DSLManager carrying pass-through wrappers. */ + @Getter + private final SuspendResumeCoordinator suspendCoord; + /** REST 2-PC coordinator: stash / finalize / discard pending commits + the + * destructive commit tail shared by tick + REST. Exposed via {@code @Getter}. */ + @Getter + private final StructuralCommitCoordinator commitCoord; + + /** + * Elapsed time a bundle can stay in {@link DSLRuntimeState.LocalState#SUSPENDED} with the DB + * content unchanged before the dslManager unsuspends it to its retained old content. Passed + * in by the module provider from {@code selfHealThresholdSeconds} (default 60 s); the 60 s + * budget exceeds dslManager tick + ES refresh + storage replica lag + RPC jitter. + */ + @Getter + private final long selfHealThresholdMs; + + /** Lock-observability wrapper. Owned by the DSLManager; the REST handler borrows via + * {@link #getLockMetrics()} so every lock acquire path reports to the same histograms. */ + @Getter + private final LockMetrics lockMetrics; + + /** Bundle teardown primitive; see {@link DSLRuntimeUnregister}'s class Javadoc. 
*/ + private final DSLRuntimeUnregister dslRuntimeUnregister; + + /** Apply orchestrator — symmetric to {@link DSLRuntimeUnregister}. Drives the engine + * phase pipeline (compile → fireSchemaChanges → verify → commit | rollback) for every + * classify result that warrants applying. */ + private final DSLRuntimeApply dslRuntimeApply; + + /** Destructive {@code /delete} pipeline. Re-registers prototypes locally then tears down + * under fullInstall so the backend cascade fires before the DAO row is deleted. + * Exposed via {@code @Getter}. */ + @Getter + private final DSLRuntimeDelete dslRuntimeDelete; + + /** Boot-time seed + tick-time rehydrate of static rules. Exposed via {@code @Getter} + * so the module provider can drive the boot-time load directly. */ + @Getter + private final StaticRuleLoader staticRuleLoader; + + /** One-tick body — DB diff + apply + gone-keys cleanup + static rehydrate. */ + private final RuleSync ruleSync; + + /** Catalog → engine lookup. Built once here from the per-DSL maps the scheduler owns; + * every apply / unregister path routes through this registry to the right engine. 
*/ + @Getter + private final RuleEngineRegistry engineRegistry; + + public DSLManager(final ModuleManager moduleManager, + final long selfHealThresholdMs) { + this.moduleManager = Objects.requireNonNull(moduleManager, "moduleManager"); + this.engineRegistry = new RuleEngineRegistry(); + this.engineRegistry.register(new MalRuleEngine(this.rules, this.moduleManager)); + this.engineRegistry.register(new LalRuleEngine(this.rules, this.moduleManager)); + this.selfHealThresholdMs = selfHealThresholdMs; + this.lockMetrics = + new LockMetrics(moduleManager); + this.suspendCoord = new SuspendResumeCoordinator( + this.rules, this.moduleManager, this.selfHealThresholdMs, + this::readCurrentDbRules + ); + this.dslRuntimeApply = new DSLRuntimeApply(this.engineRegistry); + this.commitCoord = new StructuralCommitCoordinator( + this.rules, this.dslRuntimeApply, this.suspendCoord + ); + this.dslRuntimeUnregister = new DSLRuntimeUnregister( + this.rules, this.moduleManager, + this::invokeAlarmReset, this.engineRegistry + ); + this.dslRuntimeDelete = new DSLRuntimeDelete( + this.engineRegistry, this.moduleManager, + this.rules, this::invokeAlarmReset + ); + this.staticRuleLoader = new StaticRuleLoader( + this.engineRegistry, this.rules, + this.lockMetrics, this::applyOneRuleFile + ); + this.ruleSync = new RuleSync( + this.moduleManager, this.lockMetrics, this.rules, + this.staticRuleLoader, + this::applyOneRuleFile, this.dslRuntimeUnregister::unregister, this::tickStorageOpt + ); + } + + /** + * Runs on the single-threaded dslManager executor scheduled by {@code RuntimeRuleModuleProvider}. + * Never throws — the scheduler swallows the exception anyway, so errors are logged and the + * next tick proceeds from whatever state the last one left. + */ + public void tick() { + tick(false); + } + + /** + * Variant invoked once at boot from {@code RuntimeRuleModuleProvider.notifyAfterCompleted} + * with {@code atBoot=true}. 
The boot pass on a no-init OAP picks + * {@link StorageManipulationOpt#localCacheVerify()} so a missing or shape-mismatched + * backend schema fails the bootstrap (on k8s the pod crash-loops) instead of silently + * proceeding. The scheduled executor calls the no-arg overload so subsequent ticks + * stay on the lenient {@code localCacheOnly} retry path. + * + *

Boot semantics are scoped to no-init mode only — init-mode OAPs continue to + * pick {@link StorageManipulationOpt#createIfAbsent()} (boot creates), and + * default-mode OAPs continue to pick by cluster main-ness. + */ + public void tick(final boolean atBoot) { + try { + sweepSuspendedForSelfHeal(); + applyDeltasFromDatabase(atBoot); + } catch (final Throwable t) { + if (atBoot) { + // Re-throw so the bootstrap aborts; pod backloops on k8s, operator sees + // the failure instead of silently starting against an unprepared backend. + throw new RuntimeException("runtime-rule dslManager boot pass failed", t); + } + log.error("runtime-rule dslManager tick failed; will retry on next interval", t); + } + } + + // Boot-time static-rule seeding lives on {@link StaticRuleLoader#loadAll}; callers reach + // it via {@link #getStaticRuleLoader()}. + + /** + * Recover bundles stuck in {@link DSLRuntimeState.LocalState#SUSPENDED} by a peer-origin + * Suspend whose main crashed before sending Resume. Only acts on PEER-only origins — + * SELF origin is the local REST apply's own bookkeeping, and BOTH origin indicates a + * SELF apply is in flight alongside a PEER broadcast (the local apply's finalize / + * discard path is the recovery, not self-heal). + * + *

Bundles whose DB content HAS advanced since the suspend are left for + * {@link #applyDeltasFromDatabase(boolean)} to pick up via the normal content-hash diff — those + * are the "main node succeeded, we're catching up" path. We deliberately do not flip + * those back to RUNNING here: the correct handlers for the new content haven't been + * installed yet, so a premature flip would resume dispatch against a bundle whose schema + * may already have moved. + * + *

Most main-side failures now clear peer-side SUSPENDED within an RPC round-trip via + * the Resume broadcast, so this sweep is a backstop for the narrow case where the main + * crashes after Suspend but before Resume. Self-heal threshold can be tuned via + * {@link #selfHealThresholdMs}. + * + *

Time arithmetic uses {@link System#nanoTime()} via + * {@link DSLRuntimeState#getEnteredCurrentStateAtNanos()} rather than wall clock. An NTP + * jump or backwards wall-clock tick on the host would otherwise either delay a + * legitimate self-heal indefinitely or fire one prematurely. Wall-clock stamps stay on + * DSLRuntimeState for operator readability on {@code /list}; threshold math reads the + * monotonic side. + */ + void sweepSuspendedForSelfHeal() { + suspendCoord.sweepSuspendedForSelfHeal(); + } + + /** + * One DAO fetch per tick. Returns a map keyed by {@code "catalog:name"} of every persisted + * runtime rule (BanyanDB property / ES document / JDBC row — same logical entity, three + * shapes), or {@code null} when the DAO isn't resolvable (early boot, some embedded test + * topologies). The caller treats {@code null} as "skip self-heal this tick" — a correct- + * but-conservative default when we can't observe the persisted state. + * + *

The full {@link RuntimeRuleManagementDAO.RuntimeRuleFile} is captured (not just the + * content hash) so self-heal can distinguish "content unchanged + ACTIVE" (the pre- + * suspend bundle is still authoritative → resume) from "content unchanged + INACTIVE" + * (the operator deliberately inactivated → leave SUSPENDED for the delta-apply path to + * tear down). + */ + private Map readCurrentDbRules() { + final RuntimeRuleManagementDAO dao; + try { + dao = moduleManager.find(StorageModule.NAME).provider() + .getService(RuntimeRuleManagementDAO.class); + } catch (final Throwable t) { + return null; + } + final Map rules = new HashMap<>(); + try { + for (final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile : dao.getAll()) { + rules.put(DSLScriptKey.key(ruleFile.getCatalog(), ruleFile.getName()), ruleFile); + } + } catch (final IOException ioe) { + log.debug("runtime-rule self-heal: DAO fetch failed this tick ({})", ioe.getMessage()); + return null; + } + return rules; + } + + /** Run one tick — DB diff + apply + gone-keys cleanup + static rehydrate. Delegates to + * {@link RuleSync}. */ + private void applyDeltasFromDatabase(final boolean atBoot) { + ruleSync.runOnce(atBoot); + } + + /** + * Synchronously apply one rule file on this node. Used by the REST handler's sync path so + * {@code /addOrUpdate} can return a precise {@code structural_applied} / {@code + * ddl_verify_failed} response instead of always 202. Acquires the per-file lock, runs + * the same {@link #applyOneRuleFile} path the 30-second tick uses, and reports the resulting + * {@link DSLRuntimeState} so the caller can distinguish success from + * {@code applyError}-annotated degradation. + * + *

Thread-safe with the dslManager tick: the per-file lock from + * {@link AppliedRuleScript#lockFor} serializes both paths on the same + * {@code (catalog, name)}. Other files' ticks run concurrently because the tick acquires + * per-file locks the same way. + */ + public DSLRuntimeState applyNowForRuleFile(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile) { + return applyNowForRuleFile(ruleFile, false); + } + + /** + * Synchronous apply overload that supports deferred commit. When + * {@code deferCommit=true} and the apply path reaches a successful MAL STRUCTURAL/NEW + * commit point, the destructive tail (drop removedMetrics, swap the engine-applied + * artefacts, retire old loader, alarm reset, advance snapshot) is stashed in + * {@link StructuralCommitCoordinator}'s pending-commits map rather than applied + * inline. The caller must then invoke {@link StructuralCommitCoordinator#finalizeCommit} + * or {@link StructuralCommitCoordinator#discardCommit} to drain the stash. + * + *

Used by the REST handler's STRUCTURAL path so row-persist failure can revert + * to the pre-apply state — including restoring metrics that would otherwise have been + * dropped by the commit — instead of leaving the node diverged from cluster state. + * Other call sites (the dslManager tick) pass {@code deferCommit=false} and get the + * inline commit they've always had. + */ + public DSLRuntimeState applyNowForRuleFile(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, + final boolean deferCommit) { + return applyNowForRuleFile(ruleFile, deferCommit, StorageManipulationOpt.fullInstall()); + } + + /** + * Storage-opt overload of {@link #applyNowForRuleFile(RuntimeRuleManagementDAO.RuntimeRuleFile, boolean)}. + * + *

The REST {@code /inactivate} path passes {@link StorageManipulationOpt#localCacheOnly()} + * here so the OAP-internal teardown — MeterSystem prototypes, MetricsStreamProcessor + * entry / persistent workers, BatchQueue handlers, retired RuleClassLoader — runs to + * completion while the backend's measure / table / index, and the data already stored + * under the pre-inactivate metric, are left intact. {@code /delete} (and STRUCTURAL + * {@code /addOrUpdate} that drops shape-broken metrics) keeps {@code fullInstall()} so + * the destructive cascade reaches the backend as before. + * + *

Other call sites should keep using the no-opt overload above so the documented + * "REST path = fullInstall, peer tick = localCacheOnly" routing rule is unchanged. + */ + public DSLRuntimeState applyNowForRuleFile(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, + final boolean deferCommit, + final StorageManipulationOpt storageOpt) { + final String key = DSLScriptKey.key(ruleFile.getCatalog(), ruleFile.getName()); + final String newHash = ContentHash.sha256Hex(ruleFile.getContent()); + final AppliedRuleScript prevScript = rules.get(key); + final DSLRuntimeState prev = prevScript == null ? null : prevScript.getState(); + final long nowMs = System.currentTimeMillis(); + final ReentrantLock perFile = AppliedRuleScript.lockFor(rules, + ruleFile.getCatalog(), ruleFile.getName()); + perFile.lock(); + try { + applyOneRuleFile(ruleFile, newHash, prev, nowMs, key, deferCommit, storageOpt); + final AppliedRuleScript after = rules.get(key); + return after == null ? null : after.getState(); + } finally { + perFile.unlock(); + } + } + + /** + * Apply one rule file's state to this node under the per-file lock already held by the + * caller. Dispatches on catalog: MAL catalogs ({@code otel-rules}, {@code log-mal-rules}) + * parse + register via {@link MalFileApplier}; LAL goes through {@link LalFileApplier} + * with the same classify → compile → swap structure. INACTIVE status routes to + * {@code dslRuntimeUnregister.unregister}. Both paths drive structural commits via the + * {@link PendingApplyCommit} stash so a persist failure can roll back cleanly. 
+ */ + private void applyOneRuleFile(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, + final String newHash, final DSLRuntimeState prev, + final long nowMs, final String key, + final boolean deferCommit, + final StorageManipulationOpt storageOpt) { + final boolean wasSuspended = prev != null + && prev.getLocalState() == DSLRuntimeState.LocalState.SUSPENDED; + final boolean isInactive = "INACTIVE".equals(ruleFile.getStatus()); + + if (engineRegistry.forCatalog(ruleFile.getCatalog()) == null) { + // Catalog has no engine registered with the runtime-rule receiver. Surface a warn + // so the operator sees the misroute and skip. The seed loop already drops these — + // this branch protects the tick / on-demand paths from a row that somehow reached + // them, e.g. via an explicit DAO write that named a catalog the runtime-rule + // receiver does not own. + log.warn("runtime-rule dslManager: ignoring rule {}/{} — catalog '{}' has no " + + "engine registered (recognised: {})", + ruleFile.getCatalog(), ruleFile.getName(), ruleFile.getCatalog(), + this.engineRegistry.engines()); + return; + } + + // DSL-agnostic apply driver. The scheduler does classify routing, ownership guard, + // snapshot transitions, and 2-PC stash bookkeeping; everything DSL-specific lives + // behind the engine SPI (engine.classify, engine.claimedKeys, engine.activeClaimsExcluding, + // engine.compile/verify/commit/rollback via DSLRuntimeApply). + handleApply(ruleFile, key, prev, wasSuspended, isInactive, newHash, nowMs, + deferCommit, storageOpt); + } + + /** + * DSL-agnostic apply driver. Routes classify outcomes, runs the cross-file ownership + * guard, drives the engine pipeline through {@link DSLRuntimeApply}, and stashes / + * commits via {@link StructuralCommitCoordinator}. Adding a new DSL needs zero edits + * here — register an engine with {@link RuleEngineRegistry} and the driver picks it up. + * + *

+     *   classify
+     *     ├─ INACTIVE   → unregisterBundle  + tombstone snapshot
+     *     ├─ NO_CHANGE  → snapshot hash refresh
+     *     └─ NEW / FILTER_ONLY / STRUCTURAL → continue:
+     *
+     *   ownership guard (engine.claimedKeys / engine.activeClaimsExcluding + DAO INACTIVE)
+     *     └─ conflict → snapshot error stamp
+     *
+     *   dslRuntimeApply.compileAndVerify
+     *     ├─ COMPILE_FAILED → snapshot error stamp (engine self-rolled-back)
+     *     ├─ VERIFY_FAILED  → snapshot error stamp (engine self-rolled-back)
+     *     └─ READY_TO_COMMIT → wrap in PendingApplyCommit:
+     *         ├─ deferCommit → commitCoord.stash    (REST 2-PC, drained on persist outcome)
+     *         └─ inline      → commitCoord.commitInline (tick / FILTER_ONLY)
+     * 
+ */ + private void handleApply(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, + final String key, final DSLRuntimeState prev, + final boolean wasSuspended, final boolean isInactive, + final String newHash, final long nowMs, + final boolean deferCommit, + final StorageManipulationOpt storageOpt) { + final RuleEngine engine = engineRegistry.forCatalog(ruleFile.getCatalog()); + final AppliedRuleScript priorScript = rules.get(key); + final String priorContent = priorScript == null ? null : priorScript.getContent(); + + // 1. Classify (folds isInactive in). + final Classification cl; + try { + cl = engine.classify(priorContent, ruleFile.getContent(), isInactive); + } catch (final RuntimeException ce) { + log.error("runtime-rule dslManager: classify FAILED for {}/{}: {}", + ruleFile.getCatalog(), ruleFile.getName(), ce.getMessage(), ce); + stampClassifyError(ruleFile, key, prev, nowMs, ce.getMessage()); + return; + } + if (log.isInfoEnabled()) { + log.info("runtime-rule dslManager: classification for {}/{} = {}", + ruleFile.getCatalog(), ruleFile.getName(), cl); + } + + // 2. INACTIVE — full tear-down via DSLRuntimeUnregister + tombstone state. Static + // fall-over does NOT fire here: the operator's /inactivate intent is "off" and + // bringing the bundled twin back instantly would defeat soft-pause. To restore + // bundled, the operator runs /delete (drops the row, gone-keys path reloads). + if (cl == Classification.INACTIVE) { + dslRuntimeUnregister.unregister( + ruleFile.getCatalog(), ruleFile.getName(), true, storageOpt); + log.info("runtime-rule dslManager: {}/{} INACTIVE — unregistered", + ruleFile.getCatalog(), ruleFile.getName()); + // Clear lastApplyError explicitly: withContentHash no-ops when the hash hasn't + // moved (the usual /inactivate case where content stays, status flips), so a + // stale error from a prior failed apply would otherwise leak via /list. + final DSLRuntimeState newState = prev == null + ? 
new DSLRuntimeState(ruleFile.getCatalog(), ruleFile.getName(), newHash, + DSLRuntimeState.LocalState.NOT_LOADED, DSLRuntimeState.LoaderGc.LIVE, + null, nowMs, nowMs) + : prev.withContentHash(newHash, nowMs) + .withLocalState(DSLRuntimeState.LocalState.NOT_LOADED, nowMs) + .withApplyError(null, nowMs); + rules.compute(key, (k, existing) -> existing == null + ? new AppliedRuleScript(ruleFile.getCatalog(), ruleFile.getName(), null, newState) + : existing.withState(newState)); + return; + } + + // 3. NO_CHANGE — content byte-identical and still ACTIVE. The caller's hash short- + // circuit usually catches this; if we're here a status flip, recovery state, or a + // REST {@code force=true} re-post brought us through. When the bundle is SELF- + // suspended on entry (REST main self-suspended before calling apply), there's no + // commit to stash and no commit-tail to drain — the resume side won't fire by + // itself. {@code localResume} clears SELF only, so a peer tick reaching this branch + // on a PEER-suspended bundle correctly leaves the PEER origin alone (the main's + // Resume broadcast or self-heal owns that side). + if (cl == Classification.NO_CHANGE) { + log.debug("runtime-rule dslManager: {}/{} no content change, skipping", + ruleFile.getCatalog(), ruleFile.getName()); + if (wasSuspended) { + suspendCoord.localResume(ruleFile.getCatalog(), ruleFile.getName()); + } + // localResume already updated the entry on a SELF-clear; re-read so the hash + // refresh below stamps the right base. + final AppliedRuleScript curScript = rules.get(key); + final DSLRuntimeState cur = curScript == null ? null : curScript.getState(); + if (cur != null) { + final DSLRuntimeState refreshed = cur.withContentHash(newHash, nowMs); + rules.compute(key, (k, existing) -> existing == null + ? new AppliedRuleScript(ruleFile.getCatalog(), ruleFile.getName(), + null, refreshed) + : existing.withState(refreshed)); + } + return; + } + + // 4. Cross-file ownership guard. 
+ final List conflicts = checkOwnershipConflicts(engine, ruleFile, key); + if (!conflicts.isEmpty()) { + final String msg = "rule-name collision with other active files: " + conflicts; + log.error("runtime-rule dslManager CRITICAL: apply REJECTED for {}/{}: {}", + ruleFile.getCatalog(), ruleFile.getName(), msg); + stampApplyError(ruleFile, key, prev, nowMs, msg, true); + return; + } + + // 5. Engine pipeline — compile + fireSchemaChanges + verify. + final DSLRuntimeApply.Outcome outcome = dslRuntimeApply.compileAndVerify( + ruleFile, cl, buildApplyInputs(storageOpt)); + if (outcome.status == DSLRuntimeApply.Outcome.Status.COMPILE_FAILED) { + // Engine has already rolled back partial registrations. + log.error("runtime-rule dslManager CRITICAL: apply COMPILE_FAILED for {}/{}: {}", + ruleFile.getCatalog(), ruleFile.getName(), outcome.error); + stampApplyError(ruleFile, key, prev, nowMs, outcome.error, true); + return; + } + if (outcome.status == DSLRuntimeApply.Outcome.Status.VERIFY_FAILED) { + // Engine.rollback already ran. Stamp verify error. + log.error("runtime-rule dslManager CRITICAL: apply VERIFY_FAILED for {}/{}: {}", + ruleFile.getCatalog(), ruleFile.getName(), outcome.error); + final AppliedRuleScript currentScript = rules.get(key); + final DSLRuntimeState current = currentScript == null ? null : currentScript.getState(); + if (current != null) { + rules.put(key, currentScript.withState(current.withApplyError(outcome.error, nowMs))); + } + return; + } + + // 6. READY_TO_COMMIT — wrap and stash (REST 2-PC) or commit inline (tick / sync). 
+ log.info("runtime-rule dslManager: apply OK for {}/{}", + ruleFile.getCatalog(), ruleFile.getName()); + final PendingApplyCommit pending = new PendingApplyCommit(outcome, prev, wasSuspended, nowMs); + if (deferCommit) { + commitCoord.stash(pending); + return; + } + commitCoord.commitInline(pending); + } + + private org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyInputs buildApplyInputs( + final StorageManipulationOpt storageOpt) { + return new org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyInputs( + moduleManager, storageOpt, + this::invokeAlarmReset, rules); + } + + /** + * Cross-file ownership guard. DSL-agnostic: routes through {@code engine.claimedKeys} for + * the planned set and {@code engine.activeClaimsExcluding} for ACTIVE peers' claims, plus + * the DAO for INACTIVE-row claims. Returns the list of human-readable conflict + * descriptions (empty when the planned key set is conflict-free). + * + *

Two ownership sources are checked: + *

    + *
  1. Active appliedX entries on this engine, covering runtime files this node has + * applied plus boot-seeded static rules. + *
  2. INACTIVE rows in the DAO. {@code /inactivate} clears appliedX but the row's + * content + status remain. Per the soft-pause contract, an inactive rule still + * HOLDS its claimed keys: the operator's recourse is to update or {@code /delete} + * that rule before reusing its keys in another file. + *
 + */ + private List<String> checkOwnershipConflicts( + final RuleEngine<?> engine, + final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, + final String selfKey) { + final Set<String> planned = engine.claimedKeys( + ruleFile.getContent(), ruleFile.getCatalog() + "/" + ruleFile.getName()); + final List<String> conflicts = new ArrayList<>(); + for (final Map.Entry<String, Set<String>> other : engine.activeClaimsExcluding(selfKey).entrySet()) { + for (final String pk : planned) { + if (other.getValue().contains(pk)) { + conflicts.add(pk + " owned by " + other.getKey()); + } + } + } + try { + final RuntimeRuleManagementDAO dao = moduleManager.find(StorageModule.NAME) + .provider().getService(RuntimeRuleManagementDAO.class); + if (dao != null) { + for (final RuntimeRuleManagementDAO.RuntimeRuleFile other : dao.getAll()) { + if (!engine.supportedCatalogs().contains(other.getCatalog())) { + continue; + } + final String otherKey = DSLScriptKey.key(other.getCatalog(), other.getName()); + if (selfKey.equals(otherKey)) { + continue; + } + if (!RuntimeRule.STATUS_INACTIVE.equals(other.getStatus())) { + continue; + } + final Set<String> claimedByInactive = engine.claimedKeys( + other.getContent(), other.getCatalog() + "/" + other.getName()); + for (final String pk : planned) { + if (claimedByInactive.contains(pk)) { + conflicts.add(pk + " held by inactive " + otherKey + + " (update or /delete that rule first)"); + } + } + } + } + } catch (final Throwable t) { + log.warn("runtime-rule: inactive-claim check failed for {}/{}; relying on " + + "active-only result", ruleFile.getCatalog(), ruleFile.getName(), t); + } + return conflicts; + } + + /** State-transition helper: stamp classify-failure error without advancing + * contentHash (so the next tick retries). */ + private void stampClassifyError(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, + final String key, final DSLRuntimeState prev, + final long nowMs, final String message) { + final DSLRuntimeState newState = prev == null + ? 
DSLRuntimeState.failedFirstApply(ruleFile.getCatalog(), ruleFile.getName(), nowMs) + .withApplyError("classify failed: " + message, nowMs) + : prev.withApplyError("classify failed: " + message, nowMs); + rules.compute(key, (k, existing) -> existing == null + ? new AppliedRuleScript(ruleFile.getCatalog(), ruleFile.getName(), null, newState) + : existing.withState(newState)); + } + + /** State-transition helper for apply failures: stamp the error, optionally flip + * SUSPENDED → RUNNING so dispatch isn't left parked. Does NOT advance contentHash — + * the next tick re-classifies and retries on whatever content the operator pushes. */ + private void stampApplyError(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, + final String key, final DSLRuntimeState prev, + final long nowMs, final String message, + final boolean resumeIfSuspended) { + if (resumeIfSuspended && prev != null + && prev.getLocalState() == DSLRuntimeState.LocalState.SUSPENDED) { + suspendCoord.resumeDispatchForBundle(key); + } + final DSLRuntimeState newState = prev == null + ? DSLRuntimeState.failedFirstApply(ruleFile.getCatalog(), ruleFile.getName(), nowMs) + .withApplyError(message, nowMs) + : prev.withApplyError(message, nowMs) + .withLocalState(DSLRuntimeState.LocalState.RUNNING, nowMs); + rules.compute(key, (k, existing) -> existing == null + ? new AppliedRuleScript(ruleFile.getCatalog(), ruleFile.getName(), null, newState) + : existing.withState(newState)); + } + + /** + * Best-effort dispatch to the alarm-kernel service. If the alarm module is not loaded in + * this OAP deployment (some embedded / test topologies), the lookup fails and the reset + * is silently skipped — alarm windows self-heal within one evaluation period anyway. 
+ */ + private void invokeAlarmReset(final Set affectedMetricNames) { + if (affectedMetricNames == null || affectedMetricNames.isEmpty()) { + return; + } + try { + final AlarmKernelService kernel = moduleManager.find(AlarmModule.NAME).provider() + .getService(AlarmKernelService.class); + kernel.reset(affectedMetricNames); + } catch (final Throwable t) { + log.debug("runtime-rule dslManager: alarm-kernel reset skipped ({}); alarm windows " + + "will self-heal within one evaluation period", t.getMessage()); + } + } + + /** + * Pick the {@link StorageManipulationOpt} for a tick-driven apply. + * + *

Two axes: + * + *

RunningMode (boot/init context). + *

    + *
  • {@code init} mode — OAP is the dedicated initialiser; install schema if + * absent. {@link StorageManipulationOpt#createIfAbsent()} matches what the + * rest of the static-rule install path does in init mode (idempotent against + * backends that already hold the table). + *
  • {@code no-init} mode — this OAP must NOT touch the backend; the init OAP + * owns schema. The opt depends on whether this is the synchronous boot pass + * or a scheduled tick: + *
      + *
    • Boot pass ({@code atBoot=true}) → + * {@link StorageManipulationOpt#localCacheVerify()}. Strict: backend + * resources must already exist with the declared shape. A missing or + * mismatched schema fails the bootstrap (on k8s the pod crash-loops), so the operator must + * bring up the init OAP first, or align rule files with the backend. + *
    • Scheduled tick ({@code atBoot=false}) → + * {@link StorageManipulationOpt#localCacheOnly()}. Lenient: the timer + * retries forever without raising errors so transient absence (init OAP + * still catching up between ticks) self-heals. + *
    + *
  • default mode (regular running OAP) — branch on cluster main-ness, see below. + *
+ * + *

Cluster main-ness (default mode only). + *

    + *
  • Self is main → {@link StorageManipulationOpt#fullInstall()}. The REST path + * has the same shape; tick rarely runs on main because REST usually + * converges the main's state first. + *
  • Peer (someone else is main) → {@link StorageManipulationOpt#localCacheOnly()}. + * Local MeterSystem + MetadataRegistry populate so the peer dispatches samples + * correctly, but no server-side DDL fires. + *
+ * + *

When the cluster module isn't wired (embedded test topology), {@link + * MainRouter#isSelfMain} returns {@code true} and the default-mode branch falls + * through to {@code fullInstall} — single-process deployments are always main. + * + * @param atBoot true for the synchronous one-shot pass invoked from + * {@code RuntimeRuleModuleProvider.notifyAfterCompleted}; false for + * scheduled-executor ticks. + */ + private StorageManipulationOpt tickStorageOpt(final boolean atBoot) { + if (RunningMode.isInitMode()) { + return StorageManipulationOpt.createIfAbsent(); + } + if (RunningMode.isNoInitMode()) { + return atBoot + ? StorageManipulationOpt.localCacheVerify() + : StorageManipulationOpt.localCacheOnly(); + } + try { + final RemoteClientManager rcm = moduleManager.find(CoreModule.NAME).provider() + .getService(RemoteClientManager.class); + return MainRouter.isSelfMain(rcm) + ? StorageManipulationOpt.fullInstall() + : StorageManipulationOpt.localCacheOnly(); + } catch (final Throwable t) { + return StorageManipulationOpt.fullInstall(); + } + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeApply.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeApply.java new file mode 100644 index 000000000000..e4b1277b28a2 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeApply.java @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; + +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyContext; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyInputs; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.Classification; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.CompiledDSL; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.EngineCompileException; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngineRegistry; + +/** + * DSL-agnostic apply orchestrator. Symmetric counterpart to {@link DSLRuntimeUnregister}: where + * {@code DSLRuntimeUnregister} runs the engine's tear-down phase, {@code DSLRuntimeApply} runs + * the engine's compile-through-commit phase pipeline. The scheduler calls into this class for + * every classify result that warrants applying (NEW / FILTER_ONLY / STRUCTURAL); INACTIVE + * routes to {@link DSLRuntimeUnregister}; NO_CHANGE never reaches here. + * + *

What this class owns: + *

    + *
  • Engine lookup via {@link RuleEngineRegistry} and the per-engine context construction + * ({@link RuleEngine#newApplyContext}). + *
  • Phase pipeline: {@link RuleEngine#compile} → {@link RuleEngine#fireSchemaChanges} → + * {@link RuleEngine#verify} → {@link RuleEngine#rollback} (on verify failure) or + * {@link RuleEngine#commit}. + *
  • Reporting outcomes via {@link Outcome} so the scheduler can drive snapshot transitions + * + persistence + suspend coordination uniformly across engines. + *
+ * + *

What this class does NOT own — the scheduler keeps these because they cross engines + * and depend on snapshot / cluster state: + *

    + *
  • Cross-file ownership guard (queries DAO + appliedX across all engines). + *
  • Snapshot transitions ({@link + * org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState} mutations). + *
  • Suspend / Resume coordinator interactions. + *
  • STRUCTURAL deferred-commit stash via {@link StructuralCommitCoordinator}. + *
+ * + *

Deferred commit. The scheduler can invoke {@link #compileAndVerify} (no commit) + * for the STRUCTURAL REST 2-PC path, then drive {@link #commit} or {@link #rollback} + * separately after row-persist resolves. The simpler {@link #applyInline} variant does + * compile + verify + commit in one call for the tick path and FILTER_ONLY REST path. + */ +@Slf4j +public final class DSLRuntimeApply { + + private final RuleEngineRegistry engineRegistry; + + public DSLRuntimeApply(final RuleEngineRegistry engineRegistry) { + this.engineRegistry = engineRegistry; + } + + /** + * Run compile → fireSchemaChanges → verify → commit (or rollback on verify failure) inline. + * Used by the tick path and the FILTER_ONLY REST path where there is no row-persist gate + * to wait on. + */ + public Outcome applyInline(final RuntimeRuleManagementDAO.RuntimeRuleFile file, + final Classification classification, + final ApplyInputs inputs) { + final RuleEngine engine = engineRegistry.forCatalog(file.getCatalog()); + if (engine == null) { + return Outcome.compileFailed( + "no engine registered for catalog '" + file.getCatalog() + "'", null); + } + return applyInlineTyped(engine, file, classification, inputs); + } + + /** + * Run compile → fireSchemaChanges → verify only. Returns an outcome the caller can hold; + * the caller drives {@link #commit} after its own external precondition resolves (row- + * persist for the REST STRUCTURAL path) or {@link #rollback} on the precondition failing. 
+ */ + public Outcome compileAndVerify(final RuntimeRuleManagementDAO.RuntimeRuleFile file, + final Classification classification, + final ApplyInputs inputs) { + final RuleEngine engine = engineRegistry.forCatalog(file.getCatalog()); + if (engine == null) { + return Outcome.compileFailed( + "no engine registered for catalog '" + file.getCatalog() + "'", null); + } + return compileAndVerifyTyped(engine, file, classification, inputs); + } + + /** Drive {@code engine.commit} on a previously {@link #compileAndVerify}-produced outcome. */ + public void commit(final Outcome outcome) { + if (outcome.compiled == null || outcome.engine == null || outcome.ctx == null) { + throw new IllegalStateException( + "DSLRuntimeApply.commit called on an outcome without compiled state: " + outcome.status); + } + commitTyped(outcome); + } + + /** Drive {@code engine.rollback} on a previously {@link #compileAndVerify}-produced + * outcome. Used when the orchestrator's row-persist (or any other post-verify external + * precondition) fails. 
*/ + public void rollback(final Outcome outcome) { + if (outcome.compiled == null || outcome.engine == null || outcome.ctx == null) { + return; // nothing to roll back + } + rollbackTyped(outcome); + } + + private static <C extends ApplyContext> Outcome applyInlineTyped( + final RuleEngine<C> engine, + final RuntimeRuleManagementDAO.RuntimeRuleFile file, + final Classification classification, + final ApplyInputs inputs) { + final Outcome step = compileAndVerifyTypedHelper(engine, file, classification, inputs); + if (step.status != Outcome.Status.READY_TO_COMMIT) { + return step; + } + @SuppressWarnings("unchecked") + final C ctx = (C) step.ctx; + engine.commit(step.compiled, ctx); + return Outcome.committed(engine, step.compiled, ctx); + } + + private static <C extends ApplyContext> Outcome compileAndVerifyTyped( + final RuleEngine<C> engine, + final RuntimeRuleManagementDAO.RuntimeRuleFile file, + final Classification classification, + final ApplyInputs inputs) { + return compileAndVerifyTypedHelper(engine, file, classification, inputs); + } + + private static <C extends ApplyContext> Outcome compileAndVerifyTypedHelper( + final RuleEngine<C> engine, + final RuntimeRuleManagementDAO.RuntimeRuleFile file, + final Classification classification, + final ApplyInputs inputs) { + final C ctx = engine.newApplyContext(inputs); + final CompiledDSL compiled; + try { + compiled = engine.compile(file, classification, ctx); + } catch (final EngineCompileException ece) { + log.error("runtime-rule apply: compile FAILED for {}/{}: {}", + file.getCatalog(), file.getName(), ece.getMessage(), ece); + return Outcome.compileFailed(ece.getMessage(), null); + } catch (final RuntimeException re) { + log.error("runtime-rule apply: compile threw unexpectedly for {}/{}: {}", + file.getCatalog(), file.getName(), re.getMessage(), re); + return Outcome.compileFailed(re.getMessage(), null); + } + // fireSchemaChanges is a no-op for both engines today; left as an SPI hook for future + // engines whose listener chain isn't fused with compile. 
+ engine.fireSchemaChanges(compiled, ctx); + final String verifyError = engine.verify(compiled, ctx); + if (verifyError != null) { + engine.rollback(compiled, ctx); + return Outcome.verifyFailed(verifyError, engine, compiled, ctx); + } + return Outcome.readyToCommit(engine, compiled, ctx); + } + + @SuppressWarnings("unchecked") + private static <C extends ApplyContext> void commitTyped(final Outcome outcome) { + final RuleEngine<C> engine = (RuleEngine<C>) outcome.engine; + engine.commit(outcome.compiled, (C) outcome.ctx); + } + + @SuppressWarnings("unchecked") + private static <C extends ApplyContext> void rollbackTyped(final Outcome outcome) { + final RuleEngine<C> engine = (RuleEngine<C>) outcome.engine; + engine.rollback(outcome.compiled, (C) outcome.ctx); + } + + /** + * Result of an apply attempt. The scheduler reads {@link #status} to drive its own + * snapshot transition + persistence: + *

  • {@link Status#COMMITTED} — engine committed; scheduler advances snapshot to RUNNING with the new content hash.
  • {@link Status#READY_TO_COMMIT} — engine compiled + verified, awaiting external precondition (row-persist) before {@link DSLRuntimeApply#commit} is invoked.
  • {@link Status#COMPILE_FAILED} — engine threw on compile; engine has already rolled back its partial state. Scheduler stamps {@code applyError} on the snapshot without advancing the content hash so the next tick retries.
  • {@link Status#VERIFY_FAILED} — compile succeeded, verify rejected; engine rollback has already run. Scheduler stamps {@code applyError}.
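The status-driven flow in the list above can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: `TwoPhaseApplySketch`, `MiniEngine`, the `rowPersisted` flag, and the returned action strings are illustrative stand-ins, not types or values from this patch. It only shows the shape of "compile and verify first, commit or roll back once the external precondition resolves".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a deferred-commit flow; not the patch's real classes. */
final class TwoPhaseApplySketch {

    enum Status { READY_TO_COMMIT, COMPILE_FAILED, VERIFY_FAILED }

    /** Stand-in engine that records each phase so the flow is observable. */
    static final class MiniEngine {
        final List<String> calls = new ArrayList<>();

        Status compileAndVerify(final String content) {
            if (content.isEmpty()) {
                return Status.COMPILE_FAILED; // nothing was built, nothing to undo
            }
            calls.add("compile");
            if (content.contains("bad")) {
                calls.add("rollback"); // verify rejected: undo partial state right away
                return Status.VERIFY_FAILED;
            }
            calls.add("verify");
            return Status.READY_TO_COMMIT;
        }

        void commit() {
            calls.add("commit");
        }

        void rollback() {
            calls.add("rollback");
        }
    }

    /**
     * Returns the (made-up) snapshot action a scheduler might take:
     * "RUNNING" after commit, "applyError" on compile/verify failure,
     * "rolledBack" when the external precondition (row-persist) failed.
     */
    static String applyDeferred(final MiniEngine engine, final String content,
                                final boolean rowPersisted) {
        final Status step = engine.compileAndVerify(content);
        if (step != Status.READY_TO_COMMIT) {
            return "applyError"; // content hash is not advanced, so a later tick retries
        }
        if (rowPersisted) {
            engine.commit();
            return "RUNNING";
        }
        engine.rollback();
        return "rolledBack";
    }
}
```

Keeping commit separate from compile and verify means a failed row-persist never leaves half-installed engine state behind: the only thing to undo is what the held outcome describes.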
+ */ + public static final class Outcome { + public enum Status { COMMITTED, READY_TO_COMMIT, COMPILE_FAILED, VERIFY_FAILED } + + public final Status status; + public final String error; + public final CompiledDSL compiled; + final RuleEngine engine; + final ApplyContext ctx; + + private Outcome(final Status status, final String error, final CompiledDSL compiled, + final RuleEngine engine, final ApplyContext ctx) { + this.status = status; + this.error = error; + this.compiled = compiled; + this.engine = engine; + this.ctx = ctx; + } + + static Outcome committed(final RuleEngine engine, final CompiledDSL compiled, + final ApplyContext ctx) { + return new Outcome(Status.COMMITTED, null, compiled, engine, ctx); + } + + static Outcome readyToCommit(final RuleEngine engine, final CompiledDSL compiled, + final ApplyContext ctx) { + return new Outcome(Status.READY_TO_COMMIT, null, compiled, engine, ctx); + } + + static Outcome compileFailed(final String error, final CompiledDSL compiled) { + return new Outcome(Status.COMPILE_FAILED, error, compiled, null, null); + } + + static Outcome verifyFailed(final String error, final RuleEngine engine, + final CompiledDSL compiled, final ApplyContext ctx) { + return new Outcome(Status.VERIFY_FAILED, error, compiled, engine, ctx); + } + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java new file mode 100644 index 000000000000..3bf4b259c8ea --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java @@ -0,0 +1,184 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + 
* contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.locks.ReentrantLock; +import java.util.function.Consumer; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyContext; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyInputs; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngineRegistry; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; + +/** + * Destructive {@code /delete} pipeline. Third orchestrator alongside {@link DSLRuntimeApply} + * (NEW / FILTER_ONLY / STRUCTURAL apply) and {@link DSLRuntimeUnregister} (INACTIVE / gone-keys + * tear-down). 
{@code /delete} is the one endpoint that physically drops backend schema — + * {@code /inactivate} preserves it for cheap re-activation. + * + *

This orchestrator is a thin dispatcher: it acquires the per-file lock, runs the cross- + * file ownership guard (defence-in-depth — {@code /addOrUpdate} should have caught it + * already), and routes to {@link RuleEngine#dropBackend}. Engines that own backend + * schema (MAL) execute the re-register-then-drop dance there; engines without backend (LAL) + * implement the SPI method as a no-op. + * + *

The caller (REST {@code /delete}) holds the per-file lock; this orchestrator re-acquires + * it (lock is reentrant) so the implementation is correct whether called inline or from a + * background path. + */ +@Slf4j +public class DSLRuntimeDelete { + + private final RuleEngineRegistry engineRegistry; + private final ModuleManager moduleManager; + private final Map<String, AppliedRuleScript> rules; + private final Consumer<Set<String>> alarmResetter; + + public DSLRuntimeDelete(final RuleEngineRegistry engineRegistry, + final ModuleManager moduleManager, + final Map<String, AppliedRuleScript> rules, + final Consumer<Set<String>> alarmResetter) { + this.engineRegistry = engineRegistry; + this.moduleManager = moduleManager; + this.rules = rules; + this.alarmResetter = alarmResetter; + } + + /** + * Discharge backend debt for the {@code (catalog, name)} bundle the REST handler is about + * to {@code /delete}. Routes to {@link RuleEngine#dropBackend} — engines that own + * backend schema do the re-register-then-drop dance; engines without backend no-op. + * + * @throws IllegalStateException if a cross-file ownership conflict is detected, or the + * engine cannot discharge its backend debt (MeterSystem unavailable, parse error in + * the inactive content). The caller (REST handler) aborts {@code dao.delete} on this + * throw — refusing to delete the row is the correct failure mode. + */ + public void dropBackendForDelete(final String catalog, final String name, final String content) { + final RuleEngine engine = engineRegistry.forCatalog(catalog); + if (engine == null) { + log.warn("runtime-rule dslManager: no engine registered for catalog '{}' on " + + "/delete of {}/{}; skipping", catalog, catalog, name); + return; + } + final ReentrantLock perFile = AppliedRuleScript.lockFor(rules, catalog, name); + perFile.lock(); + try { + // Defence-in-depth ownership guard. 
/addOrUpdate's check should have prevented + // this — if a race or DAO blip slipped one through, dropping the backend resource + // here would tear down a metric another active file is still using. + final List<String> activeConflicts = checkOwnershipConflicts(engine, catalog, name, content); + if (!activeConflicts.isEmpty()) { + throw new IllegalStateException( + "/delete refused for " + catalog + "/" + name + ": claim(s) " + + activeConflicts + " are now owned by another active bundle. " + + "The /addOrUpdate cross-file ownership check should have caught " + + "this; this is a safety net. Update or /inactivate the conflicting " + + "bundle(s) first."); + } + // The engine's dropBackend handles both modes via bundledContent: + // * null → destructive cascade (drop everything runtime claimed) + // * non-null → delta drop (only runtime-only + shape-break metrics; bundled- + // shared at matching shape is preserved for bundled to reuse on + // its synchronous reload below). + final String bundledContent = + StaticRuleRegistry.active().find(catalog, name).orElse(null); + if (bundledContent != null) { + log.info("runtime-rule /delete: bundled twin exists for {}/{} — running " + + "delta-aware cleanup (drop runtime-only / shape-break, keep bundled-shared)", + catalog, name); + } + dropBackend(engine, catalog, name, content, bundledContent); + } finally { + perFile.unlock(); + } + } + + private List<String> checkOwnershipConflicts(final RuleEngine engine, final String catalog, + final String name, final String content) { + final String selfKey = DSLScriptKey.key(catalog, name); + final Set<String> planned = engine.claimedKeys(content, catalog + "/" + name); + final List<String> conflicts = new ArrayList<>(); + for (final Map.Entry<String, Set<String>> other : engine.activeClaimsExcluding(selfKey).entrySet()) { + for (final String pk : planned) { + if (other.getValue().contains(pk)) { + conflicts.add(pk + " owned by " + other.getKey()); + } + } + } + return conflicts; + } + + /** + * Synchronously reload the bundled rule 
into a fresh {@code static:} loader after a + * {@code /delete} of a row whose {@code (catalog, name)} has a bundled YAML on disk. + * The REST handler calls this so the operator's response reflects the post-delete + * reality (bundled is already serving) rather than waiting for the next tick. + * + * @return {@code true} when a bundled rule was reloaded; {@code false} when no bundled + * twin exists or the engine doesn't participate in static fall-over for this + * catalog. Errors are logged at WARN and surfaced as {@code false}. + */ + public boolean reloadBundledIfPresent(final String catalog, final String name) { + final RuleEngine engine = engineRegistry.forCatalog(catalog); + if (engine == null) { + return false; + } + if (!StaticRuleRegistry.active().find(catalog, name).isPresent()) { + return false; + } + final ReentrantLock perFile = AppliedRuleScript.lockFor(rules, catalog, name); + perFile.lock(); + try { + return engine.reloadStatic(catalog, name, alarmResetter, moduleManager); + } catch (final Throwable t) { + log.warn("runtime-rule /delete: bundled fall-over reload failed for {}/{}; " + + "peer tick will retry via gone-keys path", catalog, name, t); + return false; + } finally { + perFile.unlock(); + } + } + + /** + * Wildcard-capture helper. Threads {@code bundledContent} through to {@link + * RuleEngine#dropBackend}: a null value triggers the destructive cascade (drop + * everything runtime had); a non-null value triggers the delta drop (drop only + * metrics runtime had that bundled doesn't claim, preserve bundled-shared at + * matching shape). fullInstall makes the listener chain run. 
+ */ + private <C extends ApplyContext> void dropBackend( + final RuleEngine<C> engine, final String catalog, final String name, + final String runtimeContent, final String bundledContent) { + final ApplyInputs inputs = new ApplyInputs( + moduleManager, StorageManipulationOpt.fullInstall(), + alarmResetter, rules + ); + final C ctx = engine.newApplyContext(inputs); + engine.dropBackend(catalog, name, runtimeContent, bundledContent, ctx); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeUnregister.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeUnregister.java new file mode 100644 index 000000000000..50177efac1ee --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeUnregister.java @@ -0,0 +1,151 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; + +import java.util.Map; +import java.util.Set; +import java.util.function.Consumer; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyContext; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyInputs; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngineRegistry; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; + +/** + * DSL-agnostic teardown orchestrator. The scheduler calls + * {@link #unregister(String, String, boolean, StorageManipulationOpt)} from every shared-pipeline + * step that removes registrations: the tick's INACTIVE branch and gone-keys cleanup, the apply + * path's {@code isInactive} short-circuit, and the destructive {@code /delete} dropper. + * + *

Routing: the orchestrator looks up the {@link RuleEngine} for the file's catalog via + * {@link RuleEngineRegistry}, asks it to build its own {@link ApplyContext} subtype from the + * shared {@link ApplyInputs}, and dispatches the engine's {@code unregister}. The engine owns + * everything DSL-specific (backend cascade, applied-entry removal, classloader retire, + * static-rule fallback, alarm reset target). The orchestrator owns the cross-DSL bookkeeping + * (clearing the content side of {@link AppliedRuleScript} on success). + * + *

After a successful teardown, the engine's {@code reloadStatic} hook is invoked so any + * bundled-static rule that the now-removed runtime override was masking gets brought back into + * service via a fresh {@code static:} loader from + * {@link org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager}. + * + *

{@code invokeAlarmOnRemove}. Two legitimate call modes: + *

  • Full tear-down ({@code status→INACTIVE}, {@code /delete}, gone-keys cleanup): pass {@code true}. The engine's prior-bundle metric set is the authoritative reset target — no new bundle is coming.
  • Update path (the caller is about to re-register): pass {@code false}. The orchestrator hands the engine a no-op alarm resetter so the engine's existing reset call is neutralised; the caller drives the reset itself using the classifier's precise delta.
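A minimal sketch of the two call modes listed above, under stated assumptions: `AlarmResetterSketch`, `pick`, `teardown`, and `run` are hypothetical names, and `Consumer<Set<String>>` (a set of metric names) is an assumed element type. The point is only that the engine-side reset call stays unconditional while the orchestrator decides whether it has any effect.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Consumer;

/** Hypothetical illustration of neutralising an engine's alarm reset on the update path. */
final class AlarmResetterSketch {

    /** Swallows the reset so the engine's own call has no effect. */
    static final Consumer<Set<String>> NO_OP = metrics -> { };

    /** Picks the real resetter for a full tear-down, the no-op one for an update. */
    static Consumer<Set<String>> pick(final boolean invokeAlarmOnRemove,
                                      final Consumer<Set<String>> real) {
        return invokeAlarmOnRemove ? real : NO_OP;
    }

    /** Stand-in for the engine: always resets against the prior bundle's metric set. */
    static void teardown(final Set<String> priorMetrics,
                         final Consumer<Set<String>> resetter) {
        resetter.accept(priorMetrics);
    }

    /** Returns the metric names whose alarm windows ended up being reset. */
    static Set<String> run(final boolean fullTeardown, final Set<String> priorMetrics,
                           final Set<String> removedDelta) {
        final Set<String> reset = new TreeSet<>();
        final Consumer<Set<String>> real = reset::addAll;
        teardown(priorMetrics, pick(fullTeardown, real));
        if (!fullTeardown) {
            // Update path: the caller resets only what the new bundle no longer defines.
            real.accept(removedDelta);
        }
        return reset;
    }
}
```

With prior metrics `{a, b}` and a delta of `{a}`, a full tear-down resets both names, while the update path resets only `a`.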
+ */ +@Slf4j +public final class DSLRuntimeUnregister { + + private static final Consumer<Set<String>> NO_OP_ALARM_RESETTER = s -> { + }; + + private final Map<String, AppliedRuleScript> rules; + private final ModuleManager moduleManager; + private final Consumer<Set<String>> alarmResetter; + private final RuleEngineRegistry engineRegistry; + + public DSLRuntimeUnregister(final Map<String, AppliedRuleScript> rules, + final ModuleManager moduleManager, + final Consumer<Set<String>> alarmResetter, + final RuleEngineRegistry engineRegistry) { + this.rules = rules; + this.moduleManager = moduleManager; + this.alarmResetter = alarmResetter; + this.engineRegistry = engineRegistry; + } + + public boolean unregister(final String catalog, final String name, + final boolean invokeAlarmOnRemove, + final StorageManipulationOpt storageOpt) { + return unregister(catalog, name, invokeAlarmOnRemove, storageOpt, false); + } + + /** + * Tear down a bundle's local registrations. {@code reloadStaticAfter} controls whether + * the bundled rule (if any) is reinstalled after the unregister: + * + *
  • {@code false} — used by {@code /inactivate} and the apply path's INACTIVE classification. The operator deliberately turned the rule OFF; bringing the bundled twin back instantly would defeat the soft-pause contract. The local state is left at {@code NOT_LOADED}.
  • {@code true} — used by the row-gone reconcile path (a {@code /delete} cleared the row, peer ticks observe the absence). The runtime override no longer exists, so the bundled YAML (if any) should serve again — engines reload via {@link RuleEngine#reloadStatic} into a fresh {@code static:} loader.
+ * + * @return {@code true} when a bundled fall-over was actually installed (caller may want + * to retain the entry in the unified rules map rather than removing it); + * {@code false} otherwise (no engine, no bundled twin, reload failed, or + * {@code reloadStaticAfter=false}). + */ + public boolean unregister(final String catalog, final String name, + final boolean invokeAlarmOnRemove, + final StorageManipulationOpt storageOpt, + final boolean reloadStaticAfter) { + final RuleEngine engine = engineRegistry.forCatalog(catalog); + if (engine == null) { + log.warn("runtime-rule dslManager: no engine registered for catalog '{}' on " + + "unregister of {}/{}; skipping", catalog, catalog, name); + return false; + } + final Consumer<Set<String>> resetter = + invokeAlarmOnRemove ? alarmResetter : NO_OP_ALARM_RESETTER; + final ApplyInputs inputs = new ApplyInputs(moduleManager, storageOpt, resetter, rules); + runEngineUnregister(engine, catalog, name, inputs); + + // Cross-DSL bookkeeping: clear the cached raw content so the next classify call sees + // "no prior bundle". Engines deliberately don't touch this — it's shared between + // catalogs and the orchestrator owns the lifecycle. State is preserved (set + // elsewhere — INACTIVE tombstone, NOT_LOADED, or reset by reloadStatic below). + rules.computeIfPresent(DSLScriptKey.key(catalog, name), + (k, prev) -> prev.withContent(null)); + + if (!reloadStaticAfter) { + return false; + } + try { + return engine.reloadStatic(catalog, name, resetter, moduleManager); + } catch (final Throwable t) { + log.warn("runtime-rule dslManager: static fall-over reload failed for {}/{}; " + + "bundled rule may stay dark until a successful re-apply or restart", + catalog, name, t); + return false; + } + } + + /** Wildcard-capture helper that lets {@code engine.unregister} be called against a + * {@code RuleEngine<?>} without an unchecked cast. 
*/ + private static <C extends ApplyContext> void runEngineUnregister( + final RuleEngine<C> engine, final String catalog, final String name, + final ApplyInputs inputs) { + final C ctx = engine.newApplyContext(inputs); + engine.unregister(catalog, name, ctx); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLScriptKey.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLScriptKey.java new file mode 100644 index 000000000000..1f8c59084d33 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLScriptKey.java @@ -0,0 +1,88 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; + +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.LalFileApplier; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngineRegistry; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.lal.LalRuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.mal.MalRuleEngine; + +/** + * Pure key-format helpers shared across the runtime-rule dslManager, REST handler, and + * cluster service. Lives outside {@link DSLManager} so consumers can reference these + * without pulling in the orchestrator type. + */ +public final class DSLScriptKey { + + private DSLScriptKey() { + } + + /** + * Snapshot key for a (catalog, name) pair. Same format every node uses, so cluster + * Suspend / Resume / Forward RPCs and REST {@code /list} resolve to the same entry + * the dslManager owns. + */ + public static String key(final String catalog, final String name) { + return catalog + ":" + name; + } + + /** + * Stringify a {@link LalFileApplier.RegisteredRule} into a {@code "layer:ruleName"} + * key used by the LAL apply-path diff to identify which old rule keys are + * truly-gone (not taken over via {@code factory.addOrReplace}) and therefore need + * explicit removal. Auto-layer rules serialize as the literal string "auto" to + * match how the rest of the LAL path represents them. + */ + public static String lalRuleKey(final LalFileApplier.RegisteredRule r) { + final String layer = r.getLayer() == null ? "auto" : r.getLayer().name(); + return layer + ":" + r.getRuleName(); + } + + /** + * First eight characters of a SHA-256 hex string, or {@code "none"} when the input + * is null. Used in log breadcrumbs where the full digest would be noise but + * operators still want enough discriminator to match an apply log line to its + * stored row. 
+ */ + public static String shortHash(final String hash) { + if (hash == null || hash.length() <= 8) { + return hash == null ? "none" : hash; + } + return hash.substring(0, 8); + } + + /** + * True for catalogs whose rule files parse as MAL. Routes through the engine registry + * rather than a hardcoded string set so a catalog added to {@link MalRuleEngine#supportedCatalogs} + * (e.g. {@code telegraf-rules}) is automatically recognised by every {@code isMalCatalog} + * caller — no parallel string list to keep in sync. + */ + public static boolean isMalCatalog(final RuleEngineRegistry registry, final String catalog) { + final RuleEngine engine = registry.forCatalog(catalog); + return engine instanceof MalRuleEngine; + } + + /** True for catalogs whose rule files parse as LAL. Same registry-driven routing as + * {@link #isMalCatalog}. */ + public static boolean isLalCatalog(final RuleEngineRegistry registry, final String catalog) { + final RuleEngine engine = registry.forCatalog(catalog); + return engine instanceof LalRuleEngine; + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/PendingApplyCommit.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/PendingApplyCommit.java new file mode 100644 index 000000000000..b18a20bd7abe --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/PendingApplyCommit.java @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; + +import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; + +/** + * Pre-compiled-and-verified engine output the scheduler holds while the REST handler waits + * on row-persist. The opaque {@link DSLRuntimeApply.Outcome} wraps the engine's + * {@code CompiledDSL} + {@code ApplyContext} + the type-safe commit/rollback dispatch + * helpers; we hold it as-is so {@link StructuralCommitCoordinator} can finalize via + * {@code dslRuntimeApply.commit(outcome)} or discard via {@code dslRuntimeApply.rollback} + * without re-implementing engine work. + * + *

Scheduler-side state ({@link #prevSnapshot}, {@link #wasSuspended}, {@link #commitNowMs}) + * is what drives the snapshot transition + suspend resume after the engine commits. These + * don't belong on the engine's CompiledDSL because they're bookkeeping the scheduler owns. + */ +public final class PendingApplyCommit { + + final DSLRuntimeApply.Outcome outcome; + final DSLRuntimeState prevSnapshot; + final boolean wasSuspended; + final long commitNowMs; + + public PendingApplyCommit(final DSLRuntimeApply.Outcome outcome, + final DSLRuntimeState prevSnapshot, + final boolean wasSuspended, + final long commitNowMs) { + this.outcome = outcome; + this.prevSnapshot = prevSnapshot; + this.wasSuspended = wasSuspended; + this.commitNowMs = commitNowMs; + } + + public String catalog() { + return outcome.compiled.getCatalog(); + } + + public String name() { + return outcome.compiled.getName(); + } + + public String newContentHash() { + return outcome.compiled.getContentHash(); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/RuleSync.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/RuleSync.java new file mode 100644 index 000000000000..0cba691a911c --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/RuleSync.java @@ -0,0 +1,264 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Comparator; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Objects; +import java.util.Set; +import java.util.concurrent.locks.ReentrantLock; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.storage.StorageModule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.metrics.LockMetrics; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.util.ContentHash; +import org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics; + +/** + * Periodic DB → local-state sync. Reads the current DAO state, diffs against the in-memory + * snapshot, and drives apply / unregister / rehydrate for each (catalog, name) pair through + * the orchestrators ({@link DSLRuntimeApply}, {@link DSLRuntimeUnregister}). 
DSL-agnostic — + * does NOT hold engine-specific applied state ({@code appliedMal} / {@code appliedLal}); + * those live behind the engine boundary and the orchestrators consult them on the timer's + * behalf. + * + *

The timer is the resilience boundary — per-rule failures are caught and logged so a + * single bad bundle can't stall the rest of the convergence pass. The timer is also + * idempotent: a skipped or partially-applied file gets re-attempted on the next interval. + * + *

Three sub-phases run in order: + *

    + *
  1. Apply loop — iterate every DB row, classify and route through {@link + * DSLManager#applyOneRuleFile} (which delegates to {@link DSLRuntimeApply} or {@link + * DSLRuntimeUnregister} via the per-DSL drivers). Honours the per-tick + * {@link StorageManipulationOpt} and the marker-debt promotion (peer that was + * localCacheOnly is now main → re-fire under fullInstall).
+ *
  2. Gone-keys cleanup — anything in the snapshot that's not in the DB and not + * static-shadowed gets {@link DSLRuntimeUnregister}'d. Snapshot removal is deferred + * past unregister so a transient teardown failure doesn't lose the retry.
+ *
  3. Static rehydrate — {@link StaticRuleLoader#loadIfMissing} brings any + * {@code /delete}d static rule back online from disk content.
+ *
+ * + *

Why this class only reads {@code snapshot}. The {@code snapshot} map carries the + * scheduler-side metadata of every apply attempt: last contentHash, localState + * (RUNNING/SUSPENDED/NOT_LOADED), suspendOrigin, applyError, timestamps. This is enough to + * decide: + *

    + *
  • Pre-compile short-circuit — if {@code prev.contentHash == newHash} and the + * active/inactive status matches, skip the file entirely.
+ *
  • Gone-keys — bundles in {@code snapshot} but absent from the DB.
+ *
+ * "Is this currently registered?" is an engine question; the orchestrators ask the engine + * (via {@code engine.activeClaimsExcluding} / engine-internal applied maps) when they need it. + */ +@Slf4j +public final class RuleSync { + + private final ModuleManager moduleManager; + private final LockMetrics lockMetrics; + private final Map<String, AppliedRuleScript> rules; + private final StaticRuleLoader staticRuleLoader; + private final ApplyOneRuleFile applyOne; + private final Unregister unregister; + private final TickStorageOptPicker storageOptPicker; + + public RuleSync(final ModuleManager moduleManager, + final LockMetrics lockMetrics, + final Map<String, AppliedRuleScript> rules, + final StaticRuleLoader staticRuleLoader, + final ApplyOneRuleFile applyOne, + final Unregister unregister, + final TickStorageOptPicker storageOptPicker) { + this.moduleManager = moduleManager; + this.lockMetrics = lockMetrics; + this.rules = rules; + this.staticRuleLoader = staticRuleLoader; + this.applyOne = applyOne; + this.unregister = unregister; + this.storageOptPicker = storageOptPicker; + } + + /** + * Run the full tick body once. {@code atBoot=true} on the synchronous first tick from + * {@code RuntimeRuleModuleProvider.notifyAfterCompleted}; the storage-opt picker uses this + * to choose {@code localCacheVerify} on no-init OAPs (fail boot if backend is not in shape). 
+ */ + public void runOnce(final boolean atBoot) { + final RuntimeRuleManagementDAO dao; + try { + dao = moduleManager.find(StorageModule.NAME).provider() + .getService(RuntimeRuleManagementDAO.class); + } catch (final Throwable t) { + log.warn("RuntimeRuleManagementDAO not available from the active storage module; " + + "skipping tick", t); + return; + } + final List<RuntimeRuleManagementDAO.RuntimeRuleFile> ruleFiles; + try { + ruleFiles = dao.getAll(); + } catch (final IOException e) { + log.warn("failed to read runtime_rule files; next tick will retry", e); + return; + } + ruleFiles.sort(Comparator + .comparing(RuntimeRuleManagementDAO.RuntimeRuleFile::getCatalog) + .thenComparing(RuntimeRuleManagementDAO.RuntimeRuleFile::getName)); + + // Capture the storage policy ONCE for the whole tick. Re-querying mid-tick is no + // more authoritative than the first read. + final StorageManipulationOpt tickOpt = storageOptPicker.pick(atBoot); + final Set<String> seenKeys = new HashSet<>(); + final long nowMs = System.currentTimeMillis(); + + for (final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile : ruleFiles) { + applyOneFromDb(ruleFile, nowMs, tickOpt, seenKeys); + } + + cleanupGoneKeys(seenKeys, tickOpt); + + staticRuleLoader.loadIfMissing(seenKeys, nowMs, tickOpt); + } + + /** Per-row apply. Short-circuits when {@code dbActive == localActive} and the content + * hash matches; otherwise drives the per-file apply path. */ + private void applyOneFromDb(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, + final long nowMs, final StorageManipulationOpt tickOpt, + final Set<String> seenKeys) { + final String key = DSLScriptKey.key(ruleFile.getCatalog(), ruleFile.getName()); + seenKeys.add(key); + final String newHash = ContentHash.sha256Hex(ruleFile.getContent()); + final AppliedRuleScript prevScript = rules.get(key); + final DSLRuntimeState prev = prevScript == null ? 
null : prevScript.getState(); + final boolean dbActive = !"INACTIVE".equals(ruleFile.getStatus()); + final boolean localEffectivelyActive = prev != null + && prev.getLocalState() != DSLRuntimeState.LocalState.NOT_LOADED; + if (prev != null + && dbActive == localEffectivelyActive + && Objects.equals(prev.getContentHash(), newHash)) { + return; + } + final ReentrantLock perFile = AppliedRuleScript.lockFor(rules, + ruleFile.getCatalog(), ruleFile.getName()); + if (!lockMetrics.tryAcquireForSyncTimer(perFile, ruleFile.getCatalog(), ruleFile.getName())) { + return; + } + try (HistogramMetrics.Timer ignored = lockMetrics.startSyncTimerHoldTimer()) { + try { + applyOne.applyOneRuleFile(ruleFile, newHash, prev, nowMs, key, false, tickOpt); + } catch (final Throwable t) { + // Per-rule isolation: one failing apply must not abort the tick. + log.warn("runtime-rule dslManager: apply path threw for {}/{}; tick continues " + + "with other rules, next tick will retry", + ruleFile.getCatalog(), ruleFile.getName(), t); + } + } finally { + perFile.unlock(); + } + } + + /** Tear down bundles whose DB row is gone. */ + private void cleanupGoneKeys(final Set<String> seenKeys, final StorageManipulationOpt tickOpt) { + final List<String> removedKeys = new ArrayList<>(); + for (final String existing : rules.keySet()) { + if (seenKeys.contains(existing)) { + continue; + } + // Skip boot-seeded bundled-only entries — DSLRuntimeState is null when the + // entry was created by the StaticRuleLoader and the operator hasn't touched it + // (no /addOrUpdate, no /inactivate). For those entries the DB never carried a + // row, so its absence is not a "removed" signal. Operator-touched entries + // (state != null) get torn down + bundled fall-over reload below. 
+ final AppliedRuleScript script = rules.get(existing); + if (script != null && script.getState() == null) { + continue; + } + removedKeys.add(existing); + } + for (final String gone : removedKeys) { + final String[] parts = gone.split(":", 2); + if (parts.length != 2) { + continue; + } + final ReentrantLock perFile = AppliedRuleScript.lockFor(rules, parts[0], parts[1]); + if (!lockMetrics.tryAcquireForSyncTimer(perFile, parts[0], parts[1])) { + continue; + } + try (HistogramMetrics.Timer ignored = lockMetrics.startSyncTimerHoldTimer()) { + final AppliedRuleScript prevScript = rules.get(gone); + final DSLRuntimeState prev = prevScript == null ? null : prevScript.getState(); + log.info("runtime-rule dslManager: rule file deleted {} (last hash={})", + gone, prev == null ? "?" : DSLScriptKey.shortHash(prev.getContentHash())); + if (prevScript == null) { + rules.remove(gone); + continue; + } + // Map removal deferred to AFTER unregister succeeds. If unregister throws, + // the entry stays so the next tick retries via the same removedKeys path. + try { + // unregisterBundle with reloadStaticAfter=true: tear down the removed + // runtime registrations, then if the rule has a bundled twin install + // it fresh via a static: loader. Returns true when a bundled fall-over + // landed — in that case we KEEP the rules entry (reloadStatic re-seeded + // it as a bundled-served entry, equivalent to a boot-seeded one). + // Otherwise the entry is fully gone and we remove it. + final boolean staticReloaded = unregister.unregisterBundle( + parts[0], parts[1], true, tickOpt, true); + if (!staticReloaded) { + rules.remove(gone); + } + } catch (final Throwable t) { + log.warn("runtime-rule dslManager: teardown threw for removed rule {}; " + + "rule entry retained — next tick will retry", gone, t); + } + } finally { + perFile.unlock(); + } + } + } + + /** Functional handle for per-file apply — supplied by DSLManager. 
*/ + @FunctionalInterface + public interface ApplyOneRuleFile { + void applyOneRuleFile(RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, String newHash, + DSLRuntimeState prev, long nowMs, String key, boolean deferCommit, + StorageManipulationOpt storageOpt); + } + + /** Functional handle for unregister-bundle. */ + @FunctionalInterface + public interface Unregister { + boolean unregisterBundle(String catalog, String name, boolean invokeAlarmOnRemove, + StorageManipulationOpt storageOpt, boolean reloadStaticAfter); + } + + /** Functional handle for per-tick storage-opt picking (init / no-init / main vs peer). */ + @FunctionalInterface + public interface TickStorageOptPicker { + StorageManipulationOpt pick(boolean atBoot); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StaticRuleLoader.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StaticRuleLoader.java new file mode 100644 index 000000000000..3c112196f940 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StaticRuleLoader.java @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; + +import java.util.Map; +import java.util.Set; +import java.util.concurrent.locks.ReentrantLock; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngineRegistry; +import org.apache.skywalking.oap.server.receiver.runtimerule.metrics.LockMetrics; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.util.ContentHash; +import org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics; + +/** + * Loads static rule files (the on-disk rules the catalog loaders compiled at module start) + * into the runtime-rule dslManager's view of the world. Two entry points: + * + *
    + *
  • {@link #loadAll} — boot-time load. Asks every engine to load its catalog's static + * rules into the engine's internal applied state (via + * {@link RuleEngine#loadStaticRuleFile}), then seeds the shared + * {@code appliedContent} + {@code snapshot} maps so the first {@code /addOrUpdate} + * classifier and the first Suspend lookup see the bundle.
+ *
  • {@link #loadIfMissing} — tick-time load. Re-loads any static rule whose DB row got + * {@code /delete}d while leaving the disk content intact; without this the rule would + * stay dormant until the next OAP restart, contradicting {@code /delete}'s + * "rule reverts to disk version" promise.
+ *
+ * + *

Why this class exists. {@code MeterProcessService}, + * {@code OpenTelemetryMetricRequestProcessor}, and the LAL {@code Factory} compile and + * register static rules during their own module {@code start()} — by the time the runtime- + * rule receiver boots, MeterSystem has prototypes, MetricsStreamProcessor has workers, and + * the LAL factory has handlers, all live. But the runtime-rule dslManager doesn't see any + * of it because those registrations happened outside its pipeline. Without loading: + *

    + *
  • The first {@code /inactivate} against a shipped static rule no-ops (engine sees no + * prior applied entry); handlers keep serving the rule the operator just paused.
+ *
  • The first {@code /addOrUpdate} classifies against {@code priorContent == null} and + * returns {@code NEW} even on a filter-only edit, mis-computing the shape-break / + * removed-metrics sets.
+ *
  • Cluster Suspend RPCs return {@code NOT_PRESENT} (snapshot misses), bypassing the + * suspend window for that bundle.
+ *
+ * + *

DSL-agnostic. The actual per-DSL load — building a synthetic Applied artifact + * with metric names (MAL) or registered-rule list (LAL) — happens behind + * {@link RuleEngine#loadStaticRuleFile}. This class only iterates {@code StaticRuleRegistry}, + * routes each entry to the matching engine, and updates the shared scheduler-side state on + * success. + */ +@Slf4j +public final class StaticRuleLoader { + + private final RuleEngineRegistry engineRegistry; + private final Map<String, AppliedRuleScript> rules; + private final LockMetrics lockMetrics; + /** Tick-time per-file apply driver for {@link #loadIfMissing}. Bound at construction so + * this class doesn't depend on DSLManager directly. */ + private final ApplyOne applyOne; + + public StaticRuleLoader(final RuleEngineRegistry engineRegistry, + final Map<String, AppliedRuleScript> rules, + final LockMetrics lockMetrics, + final ApplyOne applyOne) { + this.engineRegistry = engineRegistry; + this.rules = rules; + this.lockMetrics = lockMetrics; + this.applyOne = applyOne; + } + + /** + * Boot-time load: for every {@code (catalog, name)} in {@link StaticRuleRegistry}, ask + * the matching engine to load it. On success, also seed the shared {@code appliedContent} + * + {@code snapshot} maps so the first {@code /addOrUpdate} classifier and the first + * Suspend lookup see the bundle. 
+ */ + public void loadAll() { + final Map<String, String> entries = StaticRuleRegistry.active().entries(); + if (entries.isEmpty()) { + return; + } + int loaded = 0; + final long nowMs = System.currentTimeMillis(); + for (final Map.Entry<String, String> e : entries.entrySet()) { + final String[] parts = StaticRuleRegistry.splitKey(e.getKey()); + if (parts == null) { + continue; + } + final String catalog = parts[0]; + final String name = parts[1]; + final RuleEngine engine = engineRegistry.forCatalog(catalog); + if (engine == null) { + continue; + } + final String content = e.getValue(); + if (!engine.loadStaticRuleFile(catalog, name, content)) { + continue; + } + final String key = DSLScriptKey.key(catalog, name); + final String contentHash = ContentHash.sha256Hex(content); + // Without these the first REST /addOrUpdate would classify against null prior + // content and return NEW even on a filter-only edit; the first Suspend RPC + // would lookup-miss. + rules.putIfAbsent(key, new AppliedRuleScript(catalog, name, content, + DSLRuntimeState.running(catalog, name, contentHash, nowMs))); + loaded++; + } + if (loaded > 0) { + log.info("runtime-rule dslManager: loaded {} static rule file(s) from " + + "StaticRuleRegistry — /inactivate, /addOrUpdate classify, and Suspend " + + "broadcast now cover shipped static rules.", loaded); + } + } + + /** + * Tick-time load: re-applies static rules whose DB row got {@code /delete}d while leaving + * disk content intact. Skips rules with a DB row this tick (operator state wins) and + * rules already tracked in {@code snapshot} (boot load or prior tick covered them). + * Uses tryLock so a racing REST workflow defers to the next tick. 
+ */ + public void loadIfMissing(final Set<String> seenKeys, final long nowMs, + final StorageManipulationOpt tickOpt) { + final Map<String, String> entries = StaticRuleRegistry.active().entries(); + if (entries.isEmpty()) { + return; + } + for (final Map.Entry<String, String> e : entries.entrySet()) { + final String[] parts = StaticRuleRegistry.splitKey(e.getKey()); + if (parts == null) { + continue; + } + final String catalog = parts[0]; + final String name = parts[1]; + final String key = DSLScriptKey.key(catalog, name); + if (seenKeys.contains(key)) { + continue; + } + // Snapshot presence is the scheduler's "is this bundle tracked?" signal — engine + // ownership lives behind loadStaticRuleFile. If snapshot has the key, either the + // engine has it loaded or a runtime apply did it; either way nothing to redo. + if (rules.containsKey(key)) { + continue; + } + final ReentrantLock perFile = AppliedRuleScript.lockFor(rules, catalog, name); + if (!lockMetrics.tryAcquireForSyncTimer(perFile, catalog, name)) { + continue; + } + try (HistogramMetrics.Timer ignored = lockMetrics.startSyncTimerHoldTimer()) { + if (rules.containsKey(key)) { + continue; + } + final String content = e.getValue(); + final String hash = ContentHash.sha256Hex(content); + final RuntimeRuleManagementDAO.RuntimeRuleFile synthetic = + new RuntimeRuleManagementDAO.RuntimeRuleFile( + catalog, name, content, "ACTIVE", nowMs); + log.info("runtime-rule dslManager: re-loading static rule {}/{} from " + + "StaticRuleRegistry (no DB row, no applied state)", catalog, name); + applyOne.applyOneRuleFile(synthetic, hash, null, nowMs, key, false, tickOpt); + } finally { + perFile.unlock(); + } + } + } + + /** Per-file apply handle, supplied by DSLManager. 
*/ + @FunctionalInterface + public interface ApplyOne { + void applyOneRuleFile(RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, + String newHash, DSLRuntimeState prev, long nowMs, String key, + boolean deferCommit, StorageManipulationOpt storageOpt); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StructuralCommitCoordinator.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StructuralCommitCoordinator.java new file mode 100644 index 000000000000..24b7d3196296 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StructuralCommitCoordinator.java @@ -0,0 +1,166 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; + +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.locks.ReentrantLock; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; + +/** + * REST two-phase commit stash for STRUCTURAL apply. Bridges the gap between + * {@link DSLRuntimeApply#compileAndVerify} (engine compile + verify succeeded — apply still + * reversible) and the REST handler's row-persist (where the apply becomes durable). + * + *

Three entry points: + *

    + *
  • {@link #stash} — apply pipeline parks a pending commit when the REST caller wants + * the destructive tail deferred until persist succeeds.
+ *
  • {@link #finalizeCommit} — REST handler invokes after row-persist succeeds; routes + * the stashed outcome through {@link DSLRuntimeApply#commit} (the engine swaps + * appliedX, drops removedMetrics, retires the displaced loader, fires alarm reset), + * then runs the scheduler-side snapshot transition.
+ *
  • {@link #discardCommit} — REST handler invokes after row-persist fails; routes the + * stashed outcome through {@link DSLRuntimeApply#rollback} (the engine drops just- + * registered metrics), then resumes dispatch + flips snapshot back to RUNNING.
+ *
+ * + *

The {@link #commitInline} variant runs the same commit tail without the stash — + * used by the tick path where there is no row-persist gate to wait on. + * + *

What this class owns vs delegates. The 2-PC stash + scheduler-side state + * transitions (snapshot.put, suspendCoord.resumeDispatchForBundle) live here. The engine + * pipeline (commit / rollback) lives behind {@link DSLRuntimeApply}; this coordinator + * doesn't know how MAL's commit body works, only when to invoke it. + */ +@Slf4j +public class StructuralCommitCoordinator { + + private final Map<String, PendingApplyCommit> pendingCommits = new ConcurrentHashMap<>(); + + private final Map<String, AppliedRuleScript> rules; + private final DSLRuntimeApply dslRuntimeApply; + private final SuspendResumeCoordinator suspendCoord; + + public StructuralCommitCoordinator(final Map<String, AppliedRuleScript> rules, + final DSLRuntimeApply dslRuntimeApply, + final SuspendResumeCoordinator suspendCoord) { + this.rules = rules; + this.dslRuntimeApply = dslRuntimeApply; + this.suspendCoord = suspendCoord; + } + + /** + * Park a pending commit until the REST handler's row-persist resolves. Caller must + * already hold the per-file lock (the apply pipeline does). + */ + public void stash(final PendingApplyCommit p) { + pendingCommits.put(DSLScriptKey.key(p.catalog(), p.name()), p); + } + + /** + * Drain the pending commit after the REST handler's row-persist succeeded. Acquires + * the per-file lock so the commit tail is consistent with concurrent applies on the + * same file. Returns {@code true} when a commit was actually drained, {@code false} + * when no pending commit existed (typical for {@code force=true} re-applies on byte- + * identical content — the engine classified as NO_CHANGE so nothing was stashed). The + * REST handler uses the return to decide whether peers still need a Resume broadcast. 
+ */ + public boolean finalizeCommit(final String catalog, final String name) { + final ReentrantLock perFile = AppliedRuleScript.lockFor(rules, catalog, name); + perFile.lock(); + try { + final PendingApplyCommit p = pendingCommits.remove(DSLScriptKey.key(catalog, name)); + if (p == null) { + return false; + } + commitInline(p); + return true; + } finally { + perFile.unlock(); + } + } + + /** + * Drain the pending commit after the REST handler's row-persist failed. The engine's + * rollback drops just-registered metrics; snapshot stays at the pre-apply value so + * the local node re-aligns with cluster state on the next tick. + */ + public void discardCommit(final String catalog, final String name) { + final ReentrantLock perFile = AppliedRuleScript.lockFor(rules, catalog, name); + perFile.lock(); + try { + final PendingApplyCommit p = pendingCommits.remove(DSLScriptKey.key(catalog, name)); + if (p == null) { + return; + } + // Engine drops the just-registered added + shape-break metrics. Old applied + // state is still intact (commit never ran), so unchanged metrics keep serving. + dslRuntimeApply.rollback(p.outcome); + // If this node came in SUSPENDED (peer broadcast or self-suspend), flip back + // to RUNNING + resume dispatch so samples for unchanged metrics flow again. + if (p.wasSuspended) { + final String pKey = DSLScriptKey.key(catalog, name); + suspendCoord.resumeDispatchForBundle(pKey); + final AppliedRuleScript curScript = rules.get(pKey); + final DSLRuntimeState cur = curScript == null ? 
null : curScript.getState(); + if (cur != null && cur.getLocalState() == DSLRuntimeState.LocalState.SUSPENDED) { + rules.put(pKey, curScript.withState( + cur.withLocalState(DSLRuntimeState.LocalState.RUNNING, System.currentTimeMillis()))); + } + } + } finally { + perFile.unlock(); + } + } + + /** + * Drain a {@link PendingApplyCommit} by routing through the engine's commit (which + * drops removedMetrics + swaps appliedMal/appliedContent + pushes the converter + + * retires the old loader + fires alarm reset), then runs the scheduler-side snapshot + * transition + suspend resume. + * + *

Called from the tick path directly (inline commit) and from {@link #finalizeCommit} + * (REST path, after row-persist succeeds). Both paths hold the per-file lock already. + */ + public void commitInline(final PendingApplyCommit p) { + // Engine does the full commit body. This call is the only place outside DSLRuntimeApply + // that drives engine.commit; the apply-inline path goes through dslRuntimeApply.applyInline. + dslRuntimeApply.commit(p.outcome); + + final String pKey = DSLScriptKey.key(p.catalog(), p.name()); + // Resume dispatch for unchanged metrics that were parked during Suspend. + if (p.wasSuspended) { + suspendCoord.resumeDispatchForBundle(pKey); + } + // Snapshot transition: advance contentHash to the newly-committed bundle, flip to + // RUNNING when the bundle came in SUSPENDED. + final DSLRuntimeState base = p.prevSnapshot == null + ? DSLRuntimeState.running(p.catalog(), p.name(), p.newContentHash(), p.commitNowMs) + : p.prevSnapshot.withContentHash(p.newContentHash(), p.commitNowMs); + final DSLRuntimeState newState = p.wasSuspended + ? base.withLocalState(DSLRuntimeState.LocalState.RUNNING, p.commitNowMs) + : base; + rules.compute(pKey, (k, prev) -> prev == null + ? 
new AppliedRuleScript(p.catalog(), p.name(), null, newState) + : prev.withState(newState)); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/SuspendResult.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/SuspendResult.java new file mode 100644 index 000000000000..2086cd822643 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/SuspendResult.java @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; + +/** + * Outcome of a Suspend request through {@link SuspendResumeCoordinator#localSuspend} + * or {@link SuspendResumeCoordinator#peerSuspend}. 
Its four values serve + * two consumers: the cluster RPC handlers map the result to the wire protocol's + * ack states, and the REST handler distinguishes "this node was already + * suspended by me" from "rejected because the other origin already + * holds it". + */ +public enum SuspendResult { + /** Bundle transitioned RUNNING → SUSPENDED; dispatch parked. */ + SUSPENDED, + /** Bundle was already SUSPENDED with this origin — idempotent replay, no state change. */ + ALREADY_SUSPENDED, + /** Bundle does not exist on this node. */ + NOT_PRESENT, + /** + * Request refused: the OTHER origin already holds this bundle SUSPENDED. Rejecting + * instead of merging to BOTH because correct routing never produces cross-origin + * concurrency — reaching this branch signals a misrouted request or split-brain. + * Caller logs WARN and propagates rejection. + */ + REJECTED_ORIGIN_CONFLICT +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/SuspendResumeCoordinator.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/SuspendResumeCoordinator.java new file mode 100644 index 000000000000..cc555bf7b118 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/SuspendResumeCoordinator.java @@ -0,0 +1,328 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; + +import java.util.Map; +import java.util.Objects; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.locks.ReentrantLock; +import java.util.function.Supplier; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.EngineApplied; +import org.apache.skywalking.oap.server.receiver.runtimerule.util.ContentHash; + +/** + * Suspend/Resume state machine on top of {@link DSLRuntimeState#getSuspendOrigin()}. Owns + * the SELF / PEER / BOTH origin transitions, the dispatch-side park/unpark fan-out across + * engines (driven through each rule's {@link EngineApplied} so this class never switches + * on MAL vs LAL), and the self-heal sweep that recovers from a peer main crashing between + * Suspend and Resume. + * + *
<p>Lifecycle is driven by three callers: + *
<ul> + *
<li>REST {@code /addOrUpdate} / {@code /inactivate} / {@code /delete} — the + * local main calls {@link #localSuspend} before its DDL workflow and + * {@link #localResume} on its own rollback / discard path. + *
<li>Cluster {@code Suspend} / {@code Resume} RPCs — incoming peer broadcasts + * call {@link #peerSuspend} / {@link #peerResume}. + *
<li>DSLManager tick — calls {@link #sweepSuspendedForSelfHeal} once per tick + * to recover any PEER-origin entries whose main died between Suspend and Resume. + *
</ul> + * + *
<p>
Lock contract: every state transition acquires the per-file {@link ReentrantLock} that + * lives on each {@link AppliedRuleScript} (lazy-created via + * {@link AppliedRuleScript#lockFor}) — the same mutex the apply pipeline uses, so suspend + * bookkeeping written here is consistent with apply-pipeline writes without a separate lock. + */ +@Slf4j +public class SuspendResumeCoordinator { + + /** + * Bound on how long an inbound Suspend will wait for the per-file lock before + * giving up. Short — longer than normal tick contention (which uses its own + * tryLock and defers within milliseconds), shorter than a typical apply + * workflow's hold on the lock. If we can't acquire within this window, the safe + * interpretation is "another apply owns this file locally" and the correct + * response is split-brain rejection. + */ + private static final long SUSPEND_LOCK_TIMEOUT_MS = 500L; + + private final Map<String, AppliedRuleScript> rules; + private final ModuleManager moduleManager; + private final long selfHealThresholdMs; + private final Supplier<Map<String, RuntimeRuleManagementDAO.RuntimeRuleFile>> dbRulesReader; + + public SuspendResumeCoordinator(final Map<String, AppliedRuleScript> rules, + final ModuleManager moduleManager, + final long selfHealThresholdMs, + final Supplier<Map<String, RuntimeRuleManagementDAO.RuntimeRuleFile>> dbRulesReader) { + this.rules = rules; + this.moduleManager = moduleManager; + this.selfHealThresholdMs = selfHealThresholdMs; + this.dbRulesReader = dbRulesReader; + } + + /** + * Local suspend: the local REST apply workflow is about to fire DDL on the main + * node, so dispatch must park and prior handlers must stop accepting samples. + * Records {@link DSLRuntimeState.SuspendOrigin#SELF} on the snapshot entry. Idempotent + * on SELF replay. + * + *
<p>
REJECTS with {@link SuspendResult#REJECTED_ORIGIN_CONFLICT} if PEER is + * already set — that means another OAP thinks it's the main for this file at the + * same time (routing failure or split-brain). The REST handler propagates the + * rejection to the operator with HTTP 409; correct routing never triggers this + * branch. + */ + public SuspendResult localSuspend(final String catalog, final String name) { + return applySuspend(catalog, name, DSLRuntimeState.SuspendOrigin.SELF); + } + + /** + * Peer-suspend: an inbound {@code Suspend} RPC from a peer main node. Records + * {@link DSLRuntimeState.SuspendOrigin#PEER}. Idempotent on PEER replay. + * + *
<p>
REJECTS with {@link SuspendResult#REJECTED_ORIGIN_CONFLICT} if SELF is + * already set — this node is itself mid-apply for the same file, so another node + * claiming to be main is a routing conflict. + */ + public SuspendResult peerSuspend(final String catalog, final String name) { + return applySuspend(catalog, name, DSLRuntimeState.SuspendOrigin.PEER); + } + + /** + * Clear SELF origin. Called by the REST handler on its own rollback / exception / + * discard path. If PEER is also set (BOTH), origin transitions BOTH → PEER and + * the bundle stays SUSPENDED waiting for the peer's Resume or self-heal. If SELF + * was the only origin, the bundle flips back to RUNNING and dispatch resumes. + */ + public int localResume(final String catalog, final String name) { + return applyResume(catalog, name, DSLRuntimeState.SuspendOrigin.SELF); + } + + /** + * Clear PEER origin. Called by the inbound {@code Resume} RPC handler and by the + * self-heal sweep when the peer main that issued Suspend never sent Resume. + */ + public int peerResume(final String catalog, final String name) { + return applyResume(catalog, name, DSLRuntimeState.SuspendOrigin.PEER); + } + + private SuspendResult applySuspend(final String catalog, final String name, + final DSLRuntimeState.SuspendOrigin incoming) { + final String key = DSLScriptKey.key(catalog, name); + // tryLock with a bounded deadline instead of blocking indefinitely. In the + // split-brain scenario two nodes both enter the per-file workflow, each holds + // its own per-file lock across its broadcastSuspend call, and each peer's + // Suspend handler would block on the other's lock until the gRPC deadline + // fires — the client then converts the timeout to an unreachable/null ack + // and BOTH sides proceed, racing on persist. Short timeout here turns that + // race into an immediate REJECTED_ORIGIN_CONFLICT. 
+ final ReentrantLock lock = AppliedRuleScript.lockFor(rules, catalog, name); + final boolean acquired; + try { + acquired = lock.tryLock(SUSPEND_LOCK_TIMEOUT_MS, TimeUnit.MILLISECONDS); + } catch (final InterruptedException ie) { + Thread.currentThread().interrupt(); + log.warn("runtime-rule Suspend interrupted while waiting for per-file lock on " + + "{}/{}; surfacing as split-brain rejection.", catalog, name); + return SuspendResult.REJECTED_ORIGIN_CONFLICT; + } + if (!acquired) { + log.warn("runtime-rule Suspend could not acquire per-file lock on {}/{} within " + + "{} ms — another apply workflow is in flight locally; treating as " + + "split-brain (the local workflow already owns SELF origin).", + catalog, name, SUSPEND_LOCK_TIMEOUT_MS); + return SuspendResult.REJECTED_ORIGIN_CONFLICT; + } + try { + final AppliedRuleScript existingScript = rules.get(key); + final DSLRuntimeState existing = existingScript == null ? null : existingScript.getState(); + if (existing == null) { + return SuspendResult.NOT_PRESENT; + } + final DSLRuntimeState.SuspendOrigin current = existing.getSuspendOrigin(); + if (current == incoming) { + return SuspendResult.ALREADY_SUSPENDED; + } + if (current == DSLRuntimeState.SuspendOrigin.SELF + || current == DSLRuntimeState.SuspendOrigin.PEER + || current == DSLRuntimeState.SuspendOrigin.BOTH) { + log.warn("runtime-rule ORIGIN CONFLICT: {}/{} already suspended by {}; " + + "refusing {} suspend. Likely cause: cluster routing misfire or split-brain — " + + "two nodes think they own the main role for this file.", + catalog, name, current, incoming); + return SuspendResult.REJECTED_ORIGIN_CONFLICT; + } + // current == NONE: bundle was RUNNING. Park dispatch and flip to SUSPENDED. 
+ suspendDispatchForBundle(key); + final long nowMs = System.currentTimeMillis(); + rules.put(key, existingScript.withState(existing.withSuspendOrigin(incoming, nowMs))); + return SuspendResult.SUSPENDED; + } finally { + lock.unlock(); + } + } + + private int applyResume(final String catalog, final String name, + final DSLRuntimeState.SuspendOrigin clearing) { + final String key = DSLScriptKey.key(catalog, name); + final ReentrantLock lock = AppliedRuleScript.lockFor(rules, catalog, name); + lock.lock(); + try { + final AppliedRuleScript existingScript = rules.get(key); + final DSLRuntimeState existing = existingScript == null ? null : existingScript.getState(); + if (existing == null + || existing.getLocalState() != DSLRuntimeState.LocalState.SUSPENDED) { + return 0; + } + final DSLRuntimeState.SuspendOrigin newOrigin = existing.getSuspendOrigin().remove(clearing); + if (newOrigin == existing.getSuspendOrigin()) { + return 0; + } + final long nowMs = System.currentTimeMillis(); + if (newOrigin == DSLRuntimeState.SuspendOrigin.NONE) { + final int resumed = resumeDispatchForBundle(key); + rules.put(key, existingScript.withState(existing.withSuspendOrigin(newOrigin, nowMs))); + return resumed; + } + rules.put(key, existingScript.withState(existing.withSuspendOrigin(newOrigin, nowMs))); + return 0; + } finally { + lock.unlock(); + } + } + + /** + * Park dispatch for whatever the bundle has applied — engine-agnostic via + * {@link EngineApplied#suspendDispatch}. Returns the count of dispatch primitives + * paused; {@code 0} when the bundle hasn't been committed yet (no applied artefact) + * or the engine's runtime services aren't resolvable. 
+ */ + private int suspendDispatchForBundle(final String key) { + final AppliedRuleScript script = rules.get(key); + if (script == null) { + return 0; + } + final EngineApplied applied = script.getApplied(); + if (applied == null) { + return 0; + } + return applied.suspendDispatch(moduleManager); + } + + /** + * Inverse of {@link #suspendDispatchForBundle}. Public so the apply pipeline can drive + * the post-apply resume after a successful structural commit / shape-match without + * taking a second cluster RPC round-trip. + */ + public int resumeDispatchForBundle(final String key) { + final AppliedRuleScript script = rules.get(key); + if (script == null) { + return 0; + } + final EngineApplied applied = script.getApplied(); + if (applied == null) { + return 0; + } + return applied.resumeDispatch(moduleManager); + } + + /** + * Recover bundles stuck in {@link DSLRuntimeState.LocalState#SUSPENDED} by a + * peer-origin Suspend whose main crashed before sending Resume. Only acts on + * PEER-only origins — SELF origin is the local REST apply's own bookkeeping, and + * BOTH origin indicates a SELF apply is in flight alongside a PEER broadcast (the + * local apply's finalize / discard path is the recovery, not self-heal). + * + *
<p>
Bundles whose DB content has advanced since the suspend are left for the + * apply pipeline to pick up via the normal content-hash diff — those are the + * "main node succeeded, we're catching up" path. We deliberately do not flip + * those back to RUNNING here: the correct handlers for the new content haven't + * been installed yet. + * + *
<p>
Most main-side failures now clear peer-side SUSPENDED within an RPC + * round-trip via the Resume broadcast, so this sweep is a backstop for the + * narrow case where the main crashes after Suspend but before Resume. Self-heal + * threshold can be tuned via the constructor parameter. + */ + public void sweepSuspendedForSelfHeal() { + final long nowNanos = System.nanoTime(); + final long thresholdNanos = TimeUnit.MILLISECONDS.toNanos(selfHealThresholdMs); + + final Map<String, RuntimeRuleManagementDAO.RuntimeRuleFile> dbRules = dbRulesReader.get(); + if (dbRules == null) { + log.debug("runtime-rule self-heal: storage DAO unavailable, skipping sweep"); + return; + } + + for (final AppliedRuleScript script : rules.values()) { + final DSLRuntimeState current = script.getState(); + if (current == null + || current.getLocalState() != DSLRuntimeState.LocalState.SUSPENDED) { + continue; + } + if (current.getSuspendOrigin() != DSLRuntimeState.SuspendOrigin.PEER) { + continue; + } + final long ageNanos = nowNanos - current.getEnteredCurrentStateAtNanos(); + if (ageNanos < thresholdNanos) { + continue; + } + + final String key = DSLScriptKey.key(current.getCatalog(), current.getName()); + final RuntimeRuleManagementDAO.RuntimeRuleFile currentDbRule = dbRules.get(key); + + if (currentDbRule == null) { + log.debug("runtime-rule self-heal: bundle {}/{} DB rule gone; delta-apply will drop", + current.getCatalog(), current.getName()); + continue; + } + if (RuntimeRule.STATUS_INACTIVE.equals(currentDbRule.getStatus())) { + log.debug("runtime-rule self-heal: bundle {}/{} DB rule INACTIVE; delta-apply " + + "will tear down — not resuming", current.getCatalog(), current.getName()); + continue; + } + final String currentDbHash = ContentHash.sha256Hex(currentDbRule.getContent()); + if (!Objects.equals(currentDbHash, current.getContentHash())) { + log.debug("runtime-rule self-heal: bundle {}/{} DB advanced ({} → {}); " + + "leaving delta-apply to handle", + current.getCatalog(), current.getName(), 
DSLScriptKey.shortHash(current.getContentHash()), + DSLScriptKey.shortHash(currentDbHash)); + continue; + } + + final long ageMs = TimeUnit.NANOSECONDS.toMillis(ageNanos); + log.warn("runtime-rule self-heal: bundle {}/{} has been PEER-suspended for {} ms " + + "(threshold {} ms) and DB content unchanged at hash {} — clearing PEER origin. " + + "Likely cause: the main node that issued Suspend crashed before sending " + + "Resume (Resume broadcast is the primary recovery path; this is the backstop).", + current.getCatalog(), current.getName(), ageMs, selfHealThresholdMs, + DSLScriptKey.shortHash(current.getContentHash())); + peerResume(current.getCatalog(), current.getName()); + } + } + +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/DeleteMode.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/DeleteMode.java new file mode 100644 index 000000000000..2af590618dce --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/DeleteMode.java @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.rest; + +import lombok.Getter; + +/** + * The {@code ?mode=} parameter on {@code POST /runtime/rule/delete}. Parsed from the wire + * string at the REST boundary so the rest of the codebase deals with a typed value instead + * of free-form strings. + */ +public enum DeleteMode { + /** No mode flag — apply the default {@code /delete} behaviour. If the rule has a + * bundled YAML on disk for {@code (catalog, name)}, the row is removed and bundled is + * reinstalled into a {@code static:} loader; backend resources are preserved. If no + * bundled twin exists, the destructive cascade fires (drops the backend resource + + * removes the row). */ + DEFAULT(""), + /** Operator explicitly asked to revert this rule to its bundled YAML. Identical to + * {@link #DEFAULT} when a bundled twin exists; returns {@code 400 no_bundled_twin} + * when one does not (vs {@link #DEFAULT}, which would still drop the runtime row). */ + REVERT_TO_BUNDLED("revertToBundled"); + + @Getter + private final String wireValue; + + DeleteMode(final String wireValue) { + this.wireValue = wireValue; + } + + /** + * Parse the wire value (query-string form) to its enum. {@code null} or empty returns + * {@link #DEFAULT}; {@code "revertToBundled"} (case-insensitive) returns + * {@link #REVERT_TO_BUNDLED}; anything else throws {@link IllegalArgumentException}. 
+ */ + public static DeleteMode of(final String wireValue) { + if (wireValue == null || wireValue.isEmpty()) { + return DEFAULT; + } + for (final DeleteMode m : values()) { + if (m.wireValue.equalsIgnoreCase(wireValue)) { + return m; + } + } + throw new IllegalArgumentException("Unknown delete mode: " + wireValue); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java new file mode 100644 index 000000000000..77084f614fcb --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.rest; + +import com.linecorp.armeria.common.HttpData; +import com.linecorp.armeria.common.HttpResponse; +import com.linecorp.armeria.server.annotation.Blocking; +import com.linecorp.armeria.server.annotation.Default; +import com.linecorp.armeria.server.annotation.Get; +import com.linecorp.armeria.server.annotation.Header; +import com.linecorp.armeria.server.annotation.Param; +import com.linecorp.armeria.server.annotation.Post; +import lombok.Getter; +import org.apache.skywalking.oap.server.core.classloader.Catalog; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterClient; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLManager; + +/** + * Armeria-annotated HTTP transport for the runtime-rule admin endpoints. All workflow + * logic lives behind {@link RuntimeRuleService}; this class only carries route bindings + * and parameter parsing. The cluster Forward RPC handler reaches the same workflow + * directly via {@code RuntimeRuleService.execute*} methods — there is no separate + * forward-target indirection. + * + *
<p>
The {@code catalog} query parameter is the one place untyped {@link String} arrives + * from the wire. We parse it to {@link Catalog} at the boundary and propagate the typed + * enum into {@link RuntimeRuleService}; an unknown catalog returns {@code 400 + * invalid_catalog} via {@link RuntimeRuleService#invalidCatalog} without touching any + * workflow code. + * + *
<p>
Operator and endpoint reference: + * {@code docs/en/setup/backend/backend-runtime-rule-api.md}. + */ +@Blocking +public class RuntimeRuleRestHandler { + + /** Exposed so the module provider can wire the same {@link RuntimeRuleService} + * instance into {@code RuntimeRuleClusterServiceImpl} for cluster-forward dispatch. */ + @Getter + private final RuntimeRuleService service; + + public RuntimeRuleRestHandler(final ModuleManager moduleManager, + final DSLManager dslManager, + final RuntimeRuleClusterClient clusterClient, + final long forwardRpcDeadlineMs) { + this.service = new RuntimeRuleService( + moduleManager, dslManager, clusterClient, forwardRpcDeadlineMs); + } + + // ---- Canonical routes ---- + + @Post("/runtime/rule/addOrUpdate") + public HttpResponse addOrUpdate(@Param("catalog") final String catalog, + @Param("name") final String name, + @Param("allowStorageChange") @Default("false") final String allowStorageChange, + @Param("force") @Default("false") final String force, + final HttpData body) { + final Catalog parsed = parseCatalogOrNull(catalog); + if (parsed == null) { + return service.invalidCatalog(catalog, name); + } + return service.addOrUpdate(parsed, name, allowStorageChange, force, body); + } + + @Post("/runtime/rule/inactivate") + public HttpResponse inactivate(@Param("catalog") final String catalog, + @Param("name") final String name) { + final Catalog parsed = parseCatalogOrNull(catalog); + if (parsed == null) { + return service.invalidCatalog(catalog, name); + } + return service.inactivate(parsed, name); + } + + @Post("/runtime/rule/delete") + public HttpResponse delete(@Param("catalog") final String catalog, + @Param("name") final String name, + @Param("mode") @Default("") final String mode) { + final Catalog parsedCatalog = parseCatalogOrNull(catalog); + if (parsedCatalog == null) { + return service.invalidCatalog(catalog, name); + } + final DeleteMode parsedMode; + try { + parsedMode = DeleteMode.of(mode); + } catch (final 
IllegalArgumentException badMode) { + return service.invalidDeleteMode(catalog, name, mode); + } + return service.delete(parsedCatalog, name, parsedMode); + } + + @Get("/runtime/rule/list") + public HttpResponse list(@Param("catalog") @Default("") final String catalog) { + // /list's catalog is a filter — empty means "all catalogs". Validation lives inside + // the service so the empty-string branch is handled in one place. + return service.list(catalog); + } + + @Get("/runtime/rule") + public HttpResponse get(@Param("catalog") final String catalog, + @Param("name") final String name, + @Param("source") @Default("") final String source, + @Header("Accept") @Default("") final String accept, + @Header("If-None-Match") @Default("") final String ifNoneMatch) { + final Catalog parsed = parseCatalogOrNull(catalog); + if (parsed == null) { + return service.invalidCatalog(catalog, name); + } + return service.get(parsed, name, source, accept, ifNoneMatch); + } + + @Get("/runtime/rule/bundled") + public HttpResponse listBundled(@Param("catalog") final String catalog, + @Param("withContent") @Default("true") final String withContent) { + final Catalog parsed = parseCatalogOrNull(catalog); + if (parsed == null) { + return service.invalidCatalog(catalog, null); + } + return service.listBundled(parsed, withContent); + } + + @Get("/runtime/rule/dump") + public HttpResponse dump() { + return service.dump(); + } + + @Get("/runtime/rule/dump/{catalog}") + public HttpResponse dumpCatalog(@Param("catalog") final String catalog) { + final Catalog parsed = parseCatalogOrNull(catalog); + if (parsed == null) { + return service.invalidCatalog(catalog, null); + } + return service.dumpCatalog(parsed); + } + + // ---- Shortcut routes — fixed catalog + name only ---- + // + // These hard-code the Catalog constant locally so the wire-string-to-enum conversion + // is compile-time, not request-time. 
+ + @Post("/runtime/mal/otel/addOrUpdate") + public HttpResponse malOtelAddOrUpdate(@Param("name") final String name, + @Param("allowStorageChange") @Default("false") final String allowStorageChange, + @Param("force") @Default("false") final String force, + final HttpData body) { + return service.addOrUpdate(Catalog.OTEL_RULES, name, allowStorageChange, force, body); + } + + @Post("/runtime/mal/otel/inactivate") + public HttpResponse malOtelInactivate(@Param("name") final String name) { + return service.inactivate(Catalog.OTEL_RULES, name); + } + + @Post("/runtime/mal/otel/delete") + public HttpResponse malOtelDelete(@Param("name") final String name) { + return service.delete(Catalog.OTEL_RULES, name, DeleteMode.DEFAULT); + } + + @Post("/runtime/mal/log/addOrUpdate") + public HttpResponse malLogAddOrUpdate(@Param("name") final String name, + @Param("allowStorageChange") @Default("false") final String allowStorageChange, + @Param("force") @Default("false") final String force, + final HttpData body) { + return service.addOrUpdate(Catalog.LOG_MAL_RULES, name, allowStorageChange, force, body); + } + + @Post("/runtime/mal/log/inactivate") + public HttpResponse malLogInactivate(@Param("name") final String name) { + return service.inactivate(Catalog.LOG_MAL_RULES, name); + } + + @Post("/runtime/mal/log/delete") + public HttpResponse malLogDelete(@Param("name") final String name) { + return service.delete(Catalog.LOG_MAL_RULES, name, DeleteMode.DEFAULT); + } + + @Post("/runtime/lal/addOrUpdate") + public HttpResponse lalAddOrUpdate(@Param("name") final String name, + @Param("allowStorageChange") @Default("false") final String allowStorageChange, + @Param("force") @Default("false") final String force, + final HttpData body) { + return service.addOrUpdate(Catalog.LAL, name, allowStorageChange, force, body); + } + + @Post("/runtime/lal/inactivate") + public HttpResponse lalInactivate(@Param("name") final String name) { + return service.inactivate(Catalog.LAL, name); + } + + 
@Post("/runtime/lal/delete") + public HttpResponse lalDelete(@Param("name") final String name) { + return service.delete(Catalog.LAL, name, DeleteMode.DEFAULT); + } + + /** Parse the catalog query parameter. Returns {@code null} when the value is unknown + * so the handler can route through {@link RuntimeRuleService#invalidCatalog} for a + * uniform 400 response. */ + private static Catalog parseCatalogOrNull(final String wireValue) { + if (wireValue == null || wireValue.isEmpty()) { + return null; + } + try { + return Catalog.of(wireValue); + } catch (final IllegalArgumentException unknown) { + return null; + } + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java new file mode 100644 index 000000000000..d7c60749ca83 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java @@ -0,0 +1,1968 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.rest; + +import com.google.gson.Gson; +import com.google.gson.JsonArray; +import com.google.gson.JsonObject; +import com.linecorp.armeria.common.AggregatedHttpResponse; +import com.linecorp.armeria.common.HttpData; +import com.linecorp.armeria.common.HttpHeaderNames; +import com.linecorp.armeria.common.HttpResponse; +import com.linecorp.armeria.common.HttpStatus; +import com.linecorp.armeria.common.MediaType; +import com.linecorp.armeria.common.ResponseHeaders; +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.time.Instant; +import java.time.format.DateTimeFormatter; +import java.util.Collections; +import java.util.HashMap; +import java.util.HashSet; +import java.util.LinkedHashSet; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; +import java.util.concurrent.locks.ReentrantLock; +import java.util.regex.Pattern; +import java.util.zip.GZIPOutputStream; +import org.apache.commons.compress.archivers.tar.TarArchiveEntry; +import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream; +import lombok.extern.slf4j.Slf4j; +import java.nio.charset.StandardCharsets; +import org.apache.skywalking.oap.server.core.CoreModule; +import org.apache.skywalking.oap.server.core.classloader.Catalog; +import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; +import org.apache.skywalking.oap.server.core.classloader.RuleClassLoader; +import org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics; +import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; +import org.apache.skywalking.oap.server.core.remote.client.Address; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import 
org.apache.skywalking.oap.server.core.remote.client.RemoteClientManager; +import org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry; +import org.apache.skywalking.oap.server.core.storage.StorageModule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.DSLDelta; +import org.apache.skywalking.oap.server.receiver.runtimerule.apply.DeltaClassifier; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.Classification; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.MainRouter; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterClient; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardResponse; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendAck; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendState; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLRuntimeDelete; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLScriptKey; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.SuspendResult; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.util.ContentHash; + +/** + * Armeria HTTP handler for the runtime rule admin surface. + * + *

Endpoints: + *

    + *
  • {@code /addOrUpdate} — raw body in, validate, compile-check via + * {@link DeltaClassifier}, reject shape-breaking edits without + * {@code allowStorageChange=true}, upsert via {@link RuntimeRuleManagementDAO#save} + * (per-backend explicit upsert; the generic ManagementDAO insert path was removed + * because it never persisted on BanyanDB and silently no-op'd updates on ES/JDBC), + * then drive the per-file apply inline via + * {@link DSLManager#applyNowForRuleFile}. Returns the resolved status (structural_applied, + * filter_only_applied, ddl_verify_failed, compile_failed, no_change, + * storage_change_requires_explicit_approval).
  • + *
  • {@code /inactivate} — the soft-pause path. Broadcasts Suspend, flips the row to + * INACTIVE, runs the OAP-internal teardown under + * {@link StorageManipulationOpt#localCacheOnly} via + * {@link DSLManager#applyNowForRuleFile}: dispatch handlers unregistered, prototypes + * and Models cleared, alarm windows reset. The BanyanDB measure and its data are + * explicitly preserved so reactivation via {@code /addOrUpdate} on the INACTIVE row + * is cheap and lossless. Peers observe the INACTIVE row on their next tick and run + * the same OAP-internal teardown. The inactive rule still HOLDS its metric / rule + * names per the soft-pause contract — another file claiming any of those names is + * rejected by the cross-file ownership guard.
  • + *
  • {@code /delete} — the destructive path. Requires the rule to already be INACTIVE + * (returns HTTP 409 {@code requires_inactivate_first} otherwise) — the two-step + * {@code /inactivate → /delete} workflow is enforced. {@code /delete} drives + * {@link DSLRuntimeDelete}: re-registers prototypes locally under + * {@code localCacheOnly} so the cascade has Models to walk, then runs the unregister + * path under {@code fullInstall} so the listener chain fires BanyanDB delete-measure + * on the live measure. Backend-drop failure aborts the row + * removal — an orphaned measure with no row left to retry is never possible. After + * the row is gone, if a static version exists on disk the rule reverts to that on + * the next dslManager tick.
  • + *
  • {@code /list} returns a single JSON envelope ({@code generatedAt}, {@code loaderStats}, + * {@code rules}) in which every row is merged with the dslManager's per-node + * {@link DSLRuntimeState}. {@code /dump} streams a tar.gz of every row plus a + * manifest so the entire admin surface can be backed up and restored.
  • + *
  • {@code GET /runtime/rule?catalog=&name=} fetches a single rule's YAML with DAO + * row → static fallback → 404 lookup. Default raw YAML; JSON envelope on + * {@code Accept: application/json}. {@code ETag} / {@code If-None-Match} → 304. + * {@code GET /runtime/rule/bundled?catalog=} lists every static rule for the + * catalog with an {@code overridden} flag joined from runtime rows.
  • + *
+ * + *
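As a rough illustration of how a client might address the write endpoints listed above, the sketch below only builds request URIs. The helper class, the host name, and the rule name are hypothetical; the paths, the admin port 17128, and the `catalog` / `name` / `allowStorageChange` / `force` query parameters are taken from this patch's description. No HTTP method is assumed here because the patch text shown does not state one.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side helper; not part of the patch.
final class RuntimeRuleUris {
    private final String base; // e.g. "http://localhost:17128" (host is an assumption)

    RuntimeRuleUris(final String base) {
        this.base = base;
    }

    URI addOrUpdate(final String catalog, final String name,
                    final boolean allowStorageChange, final boolean force) {
        final StringBuilder sb =
            new StringBuilder(withKey("/runtime/rule/addOrUpdate", catalog, name));
        if (allowStorageChange) {
            sb.append("&allowStorageChange=true"); // opt in to shape-breaking edits
        }
        if (force) {
            sb.append("&force=true"); // recovery re-post of known-good content
        }
        return URI.create(sb.toString());
    }

    URI inactivate(final String catalog, final String name) {
        return URI.create(withKey("/runtime/rule/inactivate", catalog, name));
    }

    URI delete(final String catalog, final String name) {
        return URI.create(withKey("/runtime/rule/delete", catalog, name));
    }

    private String withKey(final String path, final String catalog, final String name) {
        return base + path + "?catalog=" + enc(catalog) + "&name=" + enc(name);
    }

    private static String enc(final String value) {
        // Rule names may contain '/', which must be percent-encoded in a query value.
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(final String[] args) {
        final RuntimeRuleUris uris = new RuntimeRuleUris("http://localhost:17128");
        // Two-step removal as described above: inactivate first, then delete.
        System.out.println(uris.inactivate("otel-rules", "vm"));
        System.out.println(uris.delete("otel-rules", "vm"));
    }
}
```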

Catalog shortcut routes ({@code /runtime/mal/otel/...}, {@code /runtime/mal/log/...}, + * {@code /runtime/lal/...}) normalize into the canonical handler methods so every entry shape + * reuses the same validation + persistence path. + */ +@Slf4j +public class RuntimeRuleService { + + /** Catalog membership is data-driven through the engine registry — a catalog is valid + * iff some registered engine claims it via {@link RuleEngine#supportedCatalogs}. */ + private boolean isValidCatalog(final String catalog) { + return dslManager.getEngineRegistry().forCatalog(catalog) != null; + } + + /** Set of catalogs accepted by REST, computed from the engine registry. Used only in + * error messages where the operator wants to see the recognised list. */ + private Set validCatalogs() { + final Set out = new LinkedHashSet<>(); + for (final RuleEngine engine : dslManager.getEngineRegistry().engines()) { + out.addAll(engine.supportedCatalogs()); + } + return out; + } + + /** + * Per-file lock acquisition timeout on the REST path. 35 s — covers the typical upper + * bound of a full STRUCTURAL workflow (classify + compile + DDL + persist on BanyanDB) + * with a margin. Requests that exceed this are backed up beyond what normal operator + * flow should produce; the handler returns 409 instead of parking the Armeria thread. + */ + private static final long REST_LOCK_TIMEOUT_MS = 35_000L; + + /** + * Name segments are {@code [A-Za-z0-9._-]+}, separated by {@code /}. No leading slash, no + * {@code ..}, no empty segments, no backslash. Matches what the filesystem loader tolerates + * and blocks path-traversal attempts on the dump tar + DB key. + */ + private static final Pattern VALID_NAME = Pattern.compile("^[A-Za-z0-9._-]+(/[A-Za-z0-9._-]+)*$"); + + private final ModuleManager moduleManager; + private final DSLManager dslManager; + private final RuntimeRuleClusterClient clusterClient; + /** + * Deadline for the forward-to-main RPC. 
Longer than the Suspend / Resume deadlines + * because the forwarded workflow includes compile + DDL + persist. Tunable via the + * module config. + */ + private final long forwardRpcDeadlineMs; + /** + * Resolved lazily on first routing decision. Null means this OAP has no cluster wiring + * (embedded topology, early boot), in which case self always handles the write. + */ + private volatile RemoteClientManager remoteClientManager; + + public RuntimeRuleService(final ModuleManager moduleManager, + final DSLManager dslManager, + final RuntimeRuleClusterClient clusterClient, + final long forwardRpcDeadlineMs) { + this.moduleManager = moduleManager; + this.dslManager = Objects.requireNonNull(dslManager, + "dslManager — runtime-rule REST handler cannot operate without it"); + this.clusterClient = clusterClient; + this.forwardRpcDeadlineMs = forwardRpcDeadlineMs; + } + + /** + * Route a write request. Three possible outcomes: + *

    + *
  • Self is main (or cluster empty) → returns null; caller runs the local workflow.
  • + *
  • Self is not main AND this request was NOT already forwarded → forward via gRPC + * to the main and relay the response. The operator sees a transparent result.
  • + *
  • Self is not main AND this request WAS forwarded (the incoming sender's cluster + * view disagreed with ours) → return HTTP 421 to bound ping-pong at one hop. + * Operator-facing signal that the cluster view is split.
  • + *
+ * + * @param forwarded true when the request arrived via the cluster Forward RPC, false + * for a direct HTTP caller. See {@link #executeAddOrUpdate} for where + * {@code forwarded} is set. + * @return non-null HttpResponse when routing decided the outcome (either forwarded or + * fail-safe 421); null when the caller should proceed with the local workflow. + */ + private HttpResponse routeOrNull(final String catalog, final String name, + final String operation, final byte[] body, + final boolean allowStorageChange, + final boolean forceReapply, + final boolean forwarded) { + final RemoteClientManager rcm = resolveRemoteClientManager(); + // {@link RemoteClientManager} reflects the cluster's current view; an empty peer + // list means either there's no cluster module wired (single-process) or the + // refresh momentarily returned no entries. Either way the local node is the + // operator's authority for this rule, so we proceed with the local workflow — + // {@link MainRouter#isSelfMain} treats empty as self-main, mirroring null-rcm. + if (rcm == null || MainRouter.isSelfMain(rcm)) { + return null; // self is main (or cluster empty) — run local workflow + } + final Address main = MainRouter.mainAddress(rcm); + if (forwarded) { + // Fail-safe: we got a forwarded request but WE also don't consider ourselves + // main. Two cluster views disagree. Refuse instead of re-forwarding; operator + // sees 421 and can investigate. + final String mainAddr = main == null ? "unknown" : main.toString(); + log.warn("runtime-rule routing conflict: forwarded request {}/{} arrived but " + + "self is not main (local main={}); refusing to re-forward", catalog, name, mainAddr); + return HttpResponse.of(HttpStatus.MISDIRECTED_REQUEST, MediaType.JSON_UTF_8, + routingErrorBody("cluster_view_split", catalog, name, mainAddr, + "forwarded request but self is not main under local cluster view; " + + "routing misfire or split-brain")); + } + // Normal case: forward to the main via gRPC. 
+ return forwardToMain(main, operation, catalog, name, body, + allowStorageChange, forceReapply); + } + + private HttpResponse forwardToMain(final Address mainAddr, + final String operation, + final String catalog, final String name, + final byte[] body, + final boolean allowStorageChange, + final boolean forceReapply) { + if (clusterClient == null) { + // Tests may construct a bare handler without a cluster client. Fall back to + // running locally so the workflow is still exercised. + log.debug("runtime-rule: no cluster client wired; running {} {}/{} locally", + operation, catalog, name); + return null; + } + try { + log.info("runtime-rule routing: forwarding {} {}/{} to main {}", + operation, catalog, name, mainAddr); + final ForwardResponse response = clusterClient.forwardToMain( + mainAddr, operation, catalog, name, body, + allowStorageChange, forceReapply, forwardRpcDeadlineMs); + final HttpStatus status = HttpStatus.valueOf(response.getHttpStatus()); + return HttpResponse.of(status, MediaType.JSON_UTF_8, response.getBody()); + } catch (final Throwable t) { + log.error("runtime-rule routing: forward to main {} failed for {} {}/{}: {}", + mainAddr, operation, catalog, name, t.getMessage(), t); + return HttpResponse.of(HttpStatus.BAD_GATEWAY, MediaType.JSON_UTF_8, + routingErrorBody("forward_failed", catalog, name, + mainAddr == null ? "unknown" : mainAddr.toString(), + t.getMessage() == null ? 
t.getClass().getSimpleName() : t.getMessage())); + } + } + + private RemoteClientManager resolveRemoteClientManager() { + RemoteClientManager local = remoteClientManager; + if (local != null) { + return local; + } + if (moduleManager == null) { + return null; + } + try { + local = moduleManager.find(CoreModule.NAME).provider() + .getService(RemoteClientManager.class); + remoteClientManager = local; + return local; + } catch (final Throwable t) { + return null; + } + } + + // ---- Cluster-forward dispatch — invoked by RuntimeRuleClusterServiceImpl ---- + + /** Mirror of an HTTP response handed back across the cluster-forward RPC. Immutable; + * construction is cheap. The cluster service packs it into a {@code ForwardResponse} + * for the originating peer. */ + public static final class ForwardResult { + @lombok.Getter + private final int httpStatus; + @lombok.Getter + private final String jsonBody; + + public ForwardResult(final int httpStatus, final String jsonBody) { + this.httpStatus = httpStatus; + this.jsonBody = jsonBody == null ? "" : jsonBody; + } + } + + public ForwardResult executeAddOrUpdate(final String catalog, final String name, + final byte[] body, + final boolean allowStorageChange, + final boolean forceReapply) { + final HttpResponse resp = doAddOrUpdate(catalog, name, + body == null ? HttpData.empty() : HttpData.copyOf(body), + allowStorageChange, forceReapply, /* forwarded */ true); + return toResult(resp); + } + + public ForwardResult executeInactivate(final String catalog, final String name) { + return toResult(doInactivate(catalog, name, /* forwarded */ true)); + } + + public ForwardResult executeDelete(final String catalog, final String name, + final String mode) { + // Cluster forward arrives with the wire string the originator sent. 
Re-parse here so + // the typed flow inside the service is uniform; an invalid value at this stage is + // an internal bug in the originator (the REST handler validates first), so the + // throw → 500 is appropriate. + return toResult(doDelete(catalog, name, DeleteMode.of(mode), /* forwarded */ true)); + } + + /** + * Drain an Armeria {@link HttpResponse} into a {@link ForwardResult}. Blocks on + * aggregation; safe here because the Forward RPC handler runs on a blocking executor. + */ + private static ForwardResult toResult(final HttpResponse resp) { + final AggregatedHttpResponse agg = resp.aggregate().join(); + return new ForwardResult(agg.status().code(), agg.contentUtf8()); + } + + // ----- Canonical routes (raw body + catalog + name query params) ------------------------- + + /** + * Apply or recover a rule. Two control flags layer on top of the raw body: + *
    + *
  • {@code allowStorageChange=true} — accept shape-breaking edits that would otherwise + * be rejected with 409. Required for any update that drops or re-shapes a backing + * measure / storage schema, since the destructive cascade implies data loss for the + * affected metric. Routine pushes leave this {@code false}.
  • + *
  • {@code force=true} — recovery flag. Bypasses the byte-identical no_change HTTP + * short-circuit so re-posting known-good content (typically extracted from a prior + * {@code /runtime/rule/dump} tarball) is treated as a fresh apply request: the + * persisted row is re-written and any peers stuck mid-Suspend are re-Resumed. + * Engine state (compiled DSL, dispatch handlers, schema) is content-keyed, so a + * true no-op against a healthy node remains a no-op even with this flag. Use after + * a previous push failed and {@code /list} shows a {@code lastApplyError}, or to + * break a stuck SUSPENDED state. Combine with {@code allowStorageChange=true} + * when the recovery target re-shapes the measure.
  • + *
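The interaction between {@code force} and the byte-identical short-circuit can be sketched as a single predicate. This is a simplified stand-alone illustration with a hypothetical class name, not the handler itself: a request is answered with no_change only when the caller did not force, the prior row is ACTIVE, and the content is byte-identical.

```java
// Hypothetical illustration of the no_change decision; names are not from the patch.
final class NoChangeShortCircuit {

    /**
     * @param priorContent content of the current row (or bundled rule), null if none
     * @param priorActive  true when the prior rule is ACTIVE (not soft-paused)
     * @param content      content posted by the caller
     * @param force        value of the force flag on the request
     * @return true when the request can be answered with no_change without applying
     */
    static boolean isNoChange(final String priorContent, final boolean priorActive,
                              final String content, final boolean force) {
        return !force
            && priorActive
            && priorContent != null
            && priorContent.equals(content);
    }

    public static void main(final String[] args) {
        // Same bytes on an ACTIVE row: no-op.
        System.out.println(isNoChange("rule", true, "rule", false));
        // Same bytes with force=true: goes through the apply pipeline.
        System.out.println(isNoChange("rule", true, "rule", true));
        // Same bytes on an INACTIVE row: treated as a reactivation, not a no-op.
        System.out.println(isNoChange("rule", false, "rule", false));
    }
}
```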
+ */ + public HttpResponse addOrUpdate(final Catalog catalog, + final String name, + final String allowStorageChange, + final String force, + final HttpData body) { + return doAddOrUpdate(catalog.getWireName(), name, body, + parseFlag(allowStorageChange), parseFlag(force)); + } + + public HttpResponse inactivate(final Catalog catalog, + final String name) { + return doInactivate(catalog.getWireName(), name); + } + + public HttpResponse delete(final Catalog catalog, + final String name, + final DeleteMode mode) { + return doDelete(catalog.getWireName(), name, mode); + } + + /** Surface a 400 for an unrecognised {@code mode=} query value. The REST handler + * catches the parse failure and routes here so the response shape matches the rest + * of the validation 400s. */ + public HttpResponse invalidDeleteMode(final String catalog, final String name, + final String rawMode) { + return badRequest("invalid_mode", catalog, name, + "mode must be omitted or one of " + DeleteMode.REVERT_TO_BUNDLED.getWireValue() + + "; received '" + (rawMode == null ? "" : rawMode) + "'"); + } + + /** Surface a 400 for an unrecognised {@code catalog=} query value. The REST handler + * parses the catalog string into a {@link Catalog} enum at the boundary; an unknown + * value lands here so the response is uniform with the other validation 400s. */ + public HttpResponse invalidCatalog(final String rawCatalog, final String name) { + return badRequest("invalid_catalog", rawCatalog, name, + "catalog must be one of " + validCatalogs()); + } + + public HttpResponse list(final String catalogFilter) { + // Merged per-node view of what the dslManager has seen, joined with what is actually + // in storage. Returns a single JSON envelope: + // {generatedAt, loaderStats:{active,pending}, rules:[ ... ]} + // so a UI consumer can JSON.parse() once and an operator can `jq '.rules[]'`. 
+ // + // catalogFilter (empty / null => no filter) narrows the output to one catalog — + // useful when scripting against a single catalog. Validated against the same set as + // the write endpoints; an unknown non-empty catalog returns 400 instead of an empty + // body so the operator gets a clear "you typed it wrong" signal. + final String filter = catalogFilter == null ? "" : catalogFilter.trim(); + if (!filter.isEmpty() && !isValidCatalog(filter)) { + return badRequest("invalid_catalog", filter, null, + "catalog must be one of " + validCatalogs()); + } + final RuntimeRuleManagementDAO dao = resolveDao(); + if (dao == null) { + return serverError("dao_unavailable", null, null, + "RuntimeRuleManagementDAO not resolvable — storage module may not be active"); + } + final List ruleFiles; + try { + ruleFiles = dao.getAll(); + } catch (final IOException e) { + return serverError("list_failed", null, null, e.getMessage()); + } + + // Index local dslManager state by (catalog, name) for O(1) join. + final Map localByKey = new HashMap<>(); + for (final Map.Entry e : dslManager.getRules().entrySet()) { + final DSLRuntimeState s = e.getValue().getState(); + if (s != null) { + localByKey.put(e.getKey(), s); + } + } + + final JsonArray rows = new JsonArray(); + for (final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile : ruleFiles) { + if (!filter.isEmpty() && !filter.equals(ruleFile.getCatalog())) { + continue; + } + final String key = DSLScriptKey.key(ruleFile.getCatalog(), ruleFile.getName()); + final DSLRuntimeState local = localByKey.remove(key); + rows.add(renderListEntry(ruleFile, local)); + } + // Snapshot entries with no DAO row fall into two buckets: + // 1. Bundled-only — shipped rule on disk, never operator-overridden. The dslManager + // seeded the snapshot from StaticRuleRegistry at boot, and /inactivate + tick + // rehydrate keep it in sync. These are healthy — status=BUNDLED. + // 2. 
True orphans — runtime row was just deleted, the dslManager hasn't swept yet. + // Transient; the next tick clears them. Surface for operator visibility. + for (final Map.Entry entry : localByKey.entrySet()) { + final DSLRuntimeState local = entry.getValue(); + if (!filter.isEmpty() && !filter.equals(local.getCatalog())) { + continue; + } + final boolean isBundled = StaticRuleRegistry.active() + .find(local.getCatalog(), local.getName()) + .isPresent(); + rows.add(isBundled ? renderBundledEntry(local) : renderOrphanEntry(local)); + } + + final JsonObject loaderStats = new JsonObject(); + loaderStats.addProperty("active", DSLClassLoaderManager.INSTANCE.activeCount()); + loaderStats.addProperty("pending", DSLClassLoaderManager.INSTANCE.pendingCount()); + final JsonObject envelope = new JsonObject(); + envelope.addProperty("generatedAt", System.currentTimeMillis()); + envelope.add("loaderStats", loaderStats); + envelope.add("rules", rows); + return HttpResponse.of(HttpStatus.OK, MediaType.JSON_UTF_8, GSON.toJson(envelope)); + } + + private static final Gson GSON = new Gson(); + + /** + * Single-rule fetch. Studio's catalog → row click and the editor both need the YAML + * source, which {@code /list} intentionally omits. Lookup order: + *
    + *
  1. DAO row for {@code (catalog, name)} regardless of status — INACTIVE rules keep + * their content under the soft-pause contract so the editor can re-edit.
  2. + *
  3. {@link StaticRuleRegistry} fallback: bundled rules that have never been + * overridden by the operator. Returned with synthetic status {@code BUNDLED} and + * source {@code bundled}.
  4. + *
  5. Otherwise 404 {@code not_found}.
  6. + *
+ * + *
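The lookup order above can be sketched with two in-memory maps standing in for the DAO and the static registry. All class and method names here are hypothetical; the {@code runtime} / {@code bundled} source strings and the {@code BUNDLED} status mirror the values the handler in this patch passes when rendering a response.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the DAO-then-bundled lookup; not the patch's implementation.
final class RuleLookup {

    /** A persisted runtime row: content plus its stored status (ACTIVE / INACTIVE). */
    static final class Row {
        final String content;
        final String status;

        Row(final String content, final String status) {
            this.content = content;
            this.status = status;
        }
    }

    /** What the fetch resolved to, including which side served it. */
    static final class Resolved {
        final String content;
        final String status;
        final String source;

        Resolved(final String content, final String status, final String source) {
            this.content = content;
            this.status = status;
            this.source = source;
        }
    }

    static Optional<Resolved> resolve(final Map<String, Row> runtimeRows,
                                      final Map<String, String> bundledRules,
                                      final String catalog, final String name) {
        final String key = catalog + "/" + name;
        // 1. Runtime row wins regardless of status, so INACTIVE content stays editable.
        final Row row = runtimeRows.get(key);
        if (row != null) {
            return Optional.of(new Resolved(row.content, row.status, "runtime"));
        }
        // 2. Fall back to the bundled rule shipped on disk.
        final String bundled = bundledRules.get(key);
        if (bundled != null) {
            return Optional.of(new Resolved(bundled, "BUNDLED", "bundled"));
        }
        // 3. Nothing found; the caller maps this to 404 not_found.
        return Optional.empty();
    }

    public static void main(final String[] args) {
        final Map<String, Row> rows = Map.of("otel-rules/vm", new Row("runtime-yaml", "INACTIVE"));
        final Map<String, String> bundled = Map.of("otel-rules/vm", "bundled-yaml");
        System.out.println(resolve(rows, bundled, "otel-rules", "vm").get().source);
    }
}
```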

Default response is raw YAML ({@code Content-Type: application/x-yaml; charset=utf-8}) + * so a round-trip through {@code /addOrUpdate} is byte-exact. With {@code Accept: + * application/json} the response is the envelope {@code {catalog, name, status, source, + * contentHash, updateTime, content}} where {@code content} is a standard JSON-escaped + * UTF-8 string (no base64). Either mode emits the same metadata as response headers + * ({@code X-Sw-Content-Hash}, {@code X-Sw-Status}, {@code X-Sw-Source}, + * {@code X-Sw-Update-Time}) and an {@code ETag} based on the content hash, so an editor + * reload with {@code If-None-Match} gets a cheap 304. + * + *
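A minimal sketch of the conditional-fetch check described above, assuming the content hash is a lowercase hex SHA-256 over the UTF-8 bytes (the patch's own {@code ContentHash.sha256Hex} may differ in detail). The class and method names are hypothetical; the quoted-hash {@code ETag} and the trimmed {@code If-None-Match} comparison follow the handler code in this patch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-alone version of the ETag / If-None-Match decision.
final class EtagCheck {

    /** Lowercase hex SHA-256 of the UTF-8 bytes (assumed hashing scheme). */
    static String sha256Hex(final String content) {
        try {
            final MessageDigest md = MessageDigest.getInstance("SHA-256");
            final byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            final StringBuilder sb = new StringBuilder(digest.length * 2);
            for (final byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (final NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    /** Strong ETag: the content hash wrapped in double quotes. */
    static String etagFor(final String content) {
        return "\"" + sha256Hex(content) + "\"";
    }

    /** 304 when the client's validator matches the current content, else 200. */
    static int statusFor(final String content, final String ifNoneMatch) {
        final String presented = ifNoneMatch == null ? "" : ifNoneMatch.trim();
        return etagFor(content).equals(presented) ? 304 : 200;
    }

    public static void main(final String[] args) {
        final String yaml = "metricPrefix: demo\n"; // illustrative content only
        final String etag = etagFor(yaml);
        System.out.println(statusFor(yaml, null)); // first load, no validator
        System.out.println(statusFor(yaml, etag)); // reload with If-None-Match
    }
}
```

Because the validator is derived only from the content, an editor that stores the {@code ETag} from its last load can reload cheaply until the rule's bytes actually change.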

No cluster routing — reads are stateless and any node can serve from its local + * DAO + {@link StaticRuleRegistry}. + */ + public HttpResponse get(final Catalog catalog, + final String name, + final String source, + final String accept, + final String ifNoneMatch) { + return doGet(catalog.getWireName(), name, source, accept, ifNoneMatch); + } + + private HttpResponse doGet(final String catalog, + final String name, + final String source, + final String accept, + final String ifNoneMatch) { + final HttpResponse validationError = validate(catalog, name); + if (validationError != null) { + return validationError; + } + final boolean forceBundled = "bundled".equalsIgnoreCase(source); + if (!forceBundled && source != null && !source.isEmpty() + && !"runtime".equalsIgnoreCase(source)) { + return badRequest("invalid_source", catalog, name, + "source must be 'runtime' (default) or 'bundled'"); + } + + final RuntimeRuleManagementDAO dao = resolveDao(); + if (dao == null && !forceBundled) { + return serverError("dao_unavailable", catalog, name, + "RuntimeRuleManagementDAO not resolvable — storage module may not be active"); + } + + // 1. DAO row — only when source != bundled. The bundled-source path is the operator's + // explicit "show me what's on disk" request and must NEVER fall through to the DAO, + // even when both copies exist. + if (!forceBundled) { + final RuntimeRuleManagementDAO.RuntimeRuleFile row; + try { + row = findRule(dao, catalog, name); + } catch (final IOException ioe) { + log.warn("runtime-rule /get: DAO lookup failed for {}/{}", catalog, name, ioe); + return HttpResponse.of(HttpStatus.SERVICE_UNAVAILABLE, MediaType.JSON_UTF_8, + jsonBody("storage_unavailable", catalog, name, + "DAO lookup failed: " + ioe.getMessage())); + } + if (row != null) { + return renderGetResponse(catalog, name, row.getContent(), row.getStatus(), + "runtime", row.getUpdateTime(), accept, ifNoneMatch); + } + } + + // 2. Bundled (StaticRuleRegistry). 
Primary path when source=bundled, fallback otherwise. + final String staticContent = StaticRuleRegistry.active().find(catalog, name).orElse(null); + if (staticContent != null) { + // updateTime=0 — capturing the actual file mtime would require threading it through + // StaticRuleRegistry.record. Editor doesn't need precise; "0" is honest about it. + return renderGetResponse(catalog, name, staticContent, "BUNDLED", + "bundled", 0L, accept, ifNoneMatch); + } + + // 3. 404 — message reflects which mode the operator asked for. + return HttpResponse.of(HttpStatus.NOT_FOUND, MediaType.JSON_UTF_8, + jsonBody("not_found", catalog, name, + forceBundled + ? "no bundled rule for this (catalog, name); source=bundled was requested" + : "no runtime rule and no bundled rule for this (catalog, name)")); + } + + /** + * Build the {@code GET /runtime/rule} response. Honours {@code Accept: application/json} + * for the JSON envelope; defaults to raw YAML otherwise. Always emits the metadata + * headers and {@code ETag} so the raw and JSON modes are equally introspectable. + * Returns {@code 304 Not Modified} when the client's {@code If-None-Match} matches the + * current content hash. + */ + private static HttpResponse renderGetResponse(final String catalog, final String name, + final String content, final String status, + final String source, final long updateTime, + final String accept, final String ifNoneMatch) { + final String contentHash = ContentHash.sha256Hex(content); + final String eTag = "\"" + contentHash + "\""; + if (eTag.equals(ifNoneMatch == null ? 
"" : ifNoneMatch.trim())) { + return HttpResponse.of( + ResponseHeaders.builder(HttpStatus.NOT_MODIFIED) + .add("X-Sw-Content-Hash", contentHash) + .add("X-Sw-Status", status) + .add("X-Sw-Source", source) + .add("X-Sw-Update-Time", Long.toString(updateTime)) + .add(HttpHeaderNames.ETAG, eTag) + .build()); + } + final boolean json = accept != null + && accept.toLowerCase(Locale.ROOT).contains("application/json"); + if (json) { + final JsonObject env = new JsonObject(); + env.addProperty("catalog", catalog); + env.addProperty("name", name); + env.addProperty("status", status); + env.addProperty("source", source); + env.addProperty("contentHash", contentHash); + env.addProperty("updateTime", updateTime); + env.addProperty("content", content); + final String body = GSON.toJson(env); + return HttpResponse.of( + ResponseHeaders.builder(HttpStatus.OK) + .contentType(MediaType.JSON_UTF_8) + .add("X-Sw-Content-Hash", contentHash) + .add("X-Sw-Status", status) + .add("X-Sw-Source", source) + .add("X-Sw-Update-Time", Long.toString(updateTime)) + .add(HttpHeaderNames.ETAG, eTag) + .build(), + HttpData.ofUtf8(body)); + } + return HttpResponse.of( + ResponseHeaders.builder(HttpStatus.OK) + .contentType(MediaType.create("application", "x-yaml").withCharset(StandardCharsets.UTF_8)) + .add("X-Sw-Content-Hash", contentHash) + .add("X-Sw-Status", status) + .add("X-Sw-Source", source) + .add("X-Sw-Update-Time", Long.toString(updateTime)) + .add(HttpHeaderNames.ETAG, eTag) + .build(), + HttpData.ofUtf8(content == null ? "" : content)); + } + + /** + * Read-only view of every static rule shipped with OAP for the given catalog. Studio's + * catalogue browser merges this with {@code /list} (runtime overrides) for a unified + * "available rules" view; the {@code overridden} flag on each entry tells the UI which + * static rules currently have an operator override in place so they can be rendered + * with the right state. + * + *

Always JSON: the body is an array of {@code {name, kind, contentHash, content?, + * overridden}} objects. {@code content} is included by default and elided when + * {@code withContent=false} so a catalogue browse can stay small (per-rule content can + * then be fetched lazily via {@code GET /runtime/rule}). + * + *

Catalog scope: {@code otel-rules}, {@code log-mal-rules}, {@code telegraf-rules}, + * {@code lal} — the same + * allowlist the write paths use. {@code .oal} files are not exposed here; they live + * outside the runtime-rule plugin's scope today. + */ + public HttpResponse listBundled(final Catalog catalog, + final String withContentRaw) { + return doListBundled(catalog.getWireName(), withContentRaw); + } + + private HttpResponse doListBundled(final String catalog, final String withContentRaw) { + final boolean withContent = parseFlag(withContentRaw) + || "true".equalsIgnoreCase(withContentRaw == null ? "true" : withContentRaw.trim()); + // Cross-join with the DAO so each entry's `overridden` flag reflects current state. + // Failure to read the DAO is non-fatal — we still return the bundled view; just mark + // every entry overridden=false (best-effort) and log so operators can see the gap. + final Set overriddenNames = new HashSet<>(); + final RuntimeRuleManagementDAO dao = resolveDao(); + if (dao != null) { + try { + for (final RuntimeRuleManagementDAO.RuntimeRuleFile rule : dao.getAll()) { + if (catalog.equals(rule.getCatalog())) { + overriddenNames.add(rule.getName()); + } + } + } catch (final IOException ioe) { + log.warn("runtime-rule /bundled: DAO read failed for catalog={}; " + + "overridden flags will all be false this call", catalog, ioe); + } + } + final List rules = + StaticRuleRegistry.active().findByCatalog(catalog); + final JsonArray out = new JsonArray(); + for (final StaticRuleRegistry.NamedRule rule : rules) { + final JsonObject row = new JsonObject(); + row.addProperty("name", rule.getName()); + row.addProperty("kind", "bundled"); + row.addProperty("contentHash", ContentHash.sha256Hex(rule.getContent())); + row.addProperty("overridden", overriddenNames.contains(rule.getName())); + if (withContent) { + row.addProperty("content", rule.getContent()); + } + out.add(row); + } + return HttpResponse.of(HttpStatus.OK, MediaType.JSON_UTF_8, 
GSON.toJson(out)); + } + + public HttpResponse dump() { + return doDump(null); + } + + public HttpResponse dumpCatalog(final Catalog catalog) { + return doDump(catalog.getWireName()); + } + // ----- Shared handlers --------------------------------------------------------------------- + + private HttpResponse doAddOrUpdate(final String catalog, final String name, final HttpData body, + final boolean allowStorageChange) { + return doAddOrUpdate(catalog, name, body, allowStorageChange, false, false); + } + + private HttpResponse doAddOrUpdate(final String catalog, final String name, final HttpData body, + final boolean allowStorageChange, final boolean forceReapply) { + return doAddOrUpdate(catalog, name, body, allowStorageChange, forceReapply, false); + } + + /** + * @param forceReapply when true, bypass the byte-identical no_change short-circuit so a + * re-post of known-good content is not silently eaten. The request + * enters the structural pipeline (Suspend broadcast + persist + + * Resume), but if the engine sees no delta the apply itself is a + * NO_CHANGE — the explicit Resume broadcast at commit-tail is what + * unsticks peers that were left SUSPENDED by a prior failed push. + * Set by {@code /addOrUpdate?force=true}; the default false keeps + * CI idempotency working as designed. + * @param forwarded true when the request arrived via the cluster Forward RPC (one + * of the {@code execute*} entry points); false for direct HTTP + * callers. + * Controls the routing path: direct callers forward to the main + * when self isn't main; forwarded callers hit the fail-safe 421 + * instead of re-forwarding. + */ + private HttpResponse doAddOrUpdate(final String catalog, final String name, final HttpData body, + final boolean allowStorageChange, final boolean forceReapply, + final boolean forwarded) { + final HttpResponse validationError = validate(catalog, name); + if (validationError != null) { + return validationError; + } + final String content = body == null ? 
"" : body.toStringUtf8(); + if (content.isEmpty()) { + return badRequest("empty_body", catalog, name, "request body must be the raw rule content"); + } + // Single-main routing. Self is main → null → run local workflow. Non-main + not + // forwarded → forward to main and relay response. Non-main + forwarded → fail-safe + // 421 (cluster view split; refuse to re-forward). + // + // The operation string MUST match exactly one of the cases the cluster receiver's + // switch in RuntimeRuleClusterServiceImpl handles ("addOrUpdate", "inactivate", + // "delete"); anything else returns 400 forward_unknown_operation. The forceReapply + // flag rides on the protobuf body's own field, not in the operation string — early + // versions encoded it as "addOrUpdate?force=true" which the receiver never decoded. + final HttpResponse routed = routeOrNull(catalog, name, + "addOrUpdate", + content.getBytes(StandardCharsets.UTF_8), + allowStorageChange, forceReapply, forwarded); + if (routed != null) { + return routed; + } + + // Hold the per-file lock across the ENTIRE workflow (prior-file lookup, classify, + // guardrail, Suspend, apply, persist, finalize/discard, Resume). The lock is + // reentrant, so the dslManager's internal acquires nest safely. This serializes + // concurrent REST requests for the same (catalog, name) on this OAP — otherwise the + // pendingCommits stash between apply and persist could be overwritten by a racing + // second request, and the first request's finalize would drain the wrong content. + // Different files do not contend (per-file lock cache). 
+ // + // Uses LockMetrics.acquireForRest which wraps tryLock with: + // - bounded timeout (REST_LOCK_TIMEOUT_MS) — returns false on timeout instead of + // parking the Armeria thread for an unbounded time + // - runtime_rule_lock_wait_seconds histogram (path=rest) for every attempt + // - runtime_rule_lock_contention_total counter (path=rest,outcome=timeout) on false + // - WARN log line when an acquire took > 1s (catches pathological waits even + // without operators looking at the dashboard) + final ReentrantLock perFile = AppliedRuleScript.lockFor(dslManager.getRules(), catalog, name); + if (!dslManager.getLockMetrics().acquireForRest(perFile, REST_LOCK_TIMEOUT_MS, catalog, name)) { + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("update_in_progress", catalog, name, + "another update for this rule file is in progress on this OAP; retry")); + } + try (HistogramMetrics.Timer ignored = + dslManager.getLockMetrics().startRestHoldTimer()) { + return doAddOrUpdateLocked(catalog, name, content, allowStorageChange, forceReapply); + } finally { + perFile.unlock(); + } + } + + /** Full workflow with the per-file lock held. See {@link #doAddOrUpdate} for rationale. */ + private HttpResponse doAddOrUpdateLocked(final String catalog, final String name, + final String content, + final boolean allowStorageChange, + final boolean forceReapply) { + // Full prior-file lookup (not just content): the no_change short-circuit must + // distinguish ACTIVE-same-content (true no-op) from INACTIVE-same-content + // (reactivation request — must persist + apply so the handlers come back). Feeds + // the compile_failed check, the no_change short-circuit, and the allowStorageChange + // guardrail. Lookup failure is surfaced as 503 — silently treating it as "no prior + // row" would let storageChangeGuardrail wave a destructive STRUCTURAL change through + // because priorContent==null reads as first-time create. 
+ final RuntimeRuleManagementDAO.RuntimeRuleFile priorRuleFile; + try { + priorRuleFile = currentRuleFile(catalog, name); + } catch (final IOException ioe) { + log.warn("runtime-rule: prior-row lookup failed for {}/{}", catalog, name, ioe); + return HttpResponse.of(HttpStatus.SERVICE_UNAVAILABLE, MediaType.JSON_UTF_8, + jsonBody("storage_unavailable", catalog, name, + "prior-row lookup failed: " + ioe.getMessage())); + } + // When there is no DB row yet, fall back to the static content captured by the + // runtime-rule extension at boot. Without this fallback the delta classifier would see + // null and treat the first /addOrUpdate against a shipped static rule as "new rule" — + // masking shape-breaking edits the storage-change guardrail should have caught. + final String priorContent = priorRuleFile != null + ? priorRuleFile.getContent() + : StaticRuleRegistry.active().find(catalog, name).orElse(null); + // A static-only rule (no DB row but static content exists) is implicitly ACTIVE — + // RuleSetMerger recorded the on-disk bytes at boot via StaticRuleRegistry. Treat + // it as ACTIVE for the no_change short-circuit so a re-post of the static bytes is + // a cheap no-op. + final boolean priorActive = priorRuleFile != null + ? !RuntimeRule.STATUS_INACTIVE.equals(priorRuleFile.getStatus()) + : priorContent != null; + + // Byte-identical short-circuit. Only fires for ACTIVE-and-same-content AND when the + // caller didn't force a re-apply. A re-post of the same bytes on an INACTIVE row is + // an explicit reactivation and must run through the full apply pipeline. + // /addOrUpdate?force=true sets forceReapply=true so a same-content recovery push + // isn't silently eaten. 
+ if (!forceReapply + && priorActive + && priorContent != null + && priorContent.equals(content)) { + return ok(HttpStatus.OK, "no_change", catalog, name, + "content byte-identical to current ACTIVE row; no-op"); + } + + // Classify the delta to emit the right response shape and to drive the guardrail + // below. Parse failures (malformed YAML on the new side, or a MAL expression that + // can't even AST-parse) surface as 400 compile_failed. The classifier is cheap + // (AST walk, no Javassist codegen) so doing this synchronously on the HTTP thread + // is fine. + final DSLDelta delta; + try { + // Engine-driven classification — routes via RuleEngineRegistry so a catalog + // declared on MalRuleEngine.supportedCatalogs (e.g., telegraf-rules) classifies + // as MAL automatically, no parallel string list to maintain. + delta = DSLScriptKey.isMalCatalog(dslManager.getEngineRegistry(), catalog) + ? DeltaClassifier.classifyMal(priorContent, content) + : DeltaClassifier.classifyLal(priorContent, content); + } catch (final RuntimeException pe) { + log.warn("runtime-rule: compile_failed during classify for {}/{}: {}", + catalog, name, pe.getMessage()); + return badRequest("compile_failed", catalog, name, pe.getMessage()); + } + + // Destructive-edit guardrail fires BEFORE any Suspend / persist work — rejected + // requests must not drain peers or touch the row. Narrow by design: only MAL + // shape-break (scope type or explicit downsampling moved) and LAL outputType / + // rule-key changes. FILTER_ONLY body tweaks never trigger. + final String storageChangeRejection = storageChangeGuardrail( + catalog, name, priorContent, content, allowStorageChange); + if (storageChangeRejection != null) { + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("storage_change_requires_explicit_approval", catalog, name, + storageChangeRejection)); + } + + // FILTER_ONLY fast path: pure local body/filter swap. No Suspend broadcast, no DDL, + // no alarm reset. 
Apply locally; on success, persist; on failure, return 500 without + // persist. Peers observe the row on their next tick and run their own fast path. + if (delta.classification() == Classification.FILTER_ONLY) { + return applyFilterOnly(catalog, name, content, delta); + } + + // STRUCTURAL / NEW path. Order: + // 1. local self-suspend (park entry dispatch for the prior metrics; SELF origin) + // 2. peer Suspend broadcast (bounded per-peer deadline; unreachable peers self-heal) + // 3. local apply — compile, register, DDL through CreatingListeners, isExists verify + // 4. If apply fails → local Resume + broadcast Resume + return 500 without persist. + // Peers flip back to RUNNING within an RPC round-trip. + // 5. If apply succeeds → persist. On persist success: finalize commit. On persist + // failure: discard commit + broadcast Resume + return 500. + return applyStructural(catalog, name, content, delta); + } + + /** + * Fast-path apply for body/filter edits that do not move metric shape. Persist the row + * first — the design's commit point — then swap the compiled body in locally. No + * Suspend broadcast is sent because no storage identity is moving. + * + *

Persist-first preserves the persist-as-commit invariant: if the DB write fails, no + * local state advances and the operator's 500 {@code persist_failed} response is + * honest. Previously this path applied locally first and, on persist failure, returned + * 500 while the local node kept serving the new bundle — a one-node divergence window + * that closed only on the next dslManager tick replaying the old DB content. FILTER_ONLY + * has no DDL to roll back, so the worst case after "persist succeeded, local apply + * failed" is a brief local-old-vs-DB-new gap that the next tick converges (same + * semantics peers already observe when they catch up by tick). STRUCTURAL still + * apply-first + stash-and-commit because its DDL cannot be undone by a simple row + * revert. + */ + private HttpResponse applyFilterOnly(final String catalog, final String name, + final String content, final DSLDelta delta) { + final long updateTime = System.currentTimeMillis(); + // 1. Persist first — commit point. Nothing local has changed yet; a persist failure + // here leaves the node serving the pre-edit bundle, which matches the response. + final HttpResponse persistError = persistRuleSync(catalog, name, content, updateTime); + if (persistError != null) { + return persistError; + } + // 2. Apply locally. An unexpected compile/register failure after persist would leave + // the DB row ahead of local state for up to one tick interval; the next tick + // re-reads the DB and retries the apply. Peers already converge via the same + // tick-driven path (FILTER_ONLY never broadcasts), so this failure mode is + // indistinguishable from the existing "peer catches up on its next tick" path — + // no new divergence semantics to document. 
+ final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile = new RuntimeRuleManagementDAO.RuntimeRuleFile( + catalog, name, content, RuntimeRule.STATUS_ACTIVE, updateTime); + final DSLRuntimeState postApply; + try { + postApply = dslManager.applyNowForRuleFile(ruleFile); + } catch (final Throwable t) { + log.error("runtime-rule FILTER_ONLY apply failed after persist for {}/{} — DB " + + "reflects the new content; this node will converge on the next dslManager " + + "tick (same path peers use).", catalog, name, t); + return ok(HttpStatus.OK, "filter_only_persisted", catalog, name, + "row persisted; local apply deferred to next tick: " + t.getMessage()); + } + if (postApply != null && postApply.getLastApplyError() != null) { + log.warn("runtime-rule FILTER_ONLY apply recorded an error after persist for " + + "{}/{}: {}. Next tick will retry.", + catalog, name, postApply.getLastApplyError()); + return ok(HttpStatus.OK, "filter_only_persisted", catalog, name, + "row persisted; local apply deferred to next tick: " + + postApply.getLastApplyError()); + } + return ok(HttpStatus.OK, "filter_only_applied", catalog, name, + "body/filter edits applied; no DDL, no alarm reset"); + } + + /** + * STRUCTURAL / NEW apply: local Suspend → peer Suspend broadcast → local compile + DDL + + * verify → (on success) persist + resume; (on failure) local resume without persist. + * The in-memory state machine is the source of truth during the apply window; the row + * write is the commit point that lets peers converge. + */ + private HttpResponse applyStructural(final String catalog, final String name, + final String content, final DSLDelta delta) { + // Local self-suspend first so this node stops serving the old bundle before anyone + // else learns of the new content. The suspend records SuspendOrigin.SELF so a racing + // peer Suspend (should not happen under correct single-main routing) is rejected with + // HTTP 409 rather than merged into BOTH. 
+ final SuspendResult local = dslManager.getSuspendCoord().localSuspend(catalog, name); + if (local == SuspendResult.REJECTED_ORIGIN_CONFLICT) { + // Another OAP thinks it's the main for this file. Reject the operator's request; + // correct routing never hits this branch. Do NOT broadcast Suspend (peer state + // already reflects the other main's activity). + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("origin_conflict", catalog, name, + "peer origin already holds this bundle — cluster routing misfire or " + + "split-brain; refusing to run a second main-node apply concurrently")); + } + log.info("runtime-rule STRUCTURAL apply for {}/{}: local suspend result = {}", + catalog, name, local); + + // Peer broadcast. Bounded deadline per peer; unreachable peers recover via the + // dslManager's self-heal sweep when Resume is later broadcast or after + // selfHealThresholdMs if the main crashes before sending Resume. Inspect the acks for + // REJECTED — a peer rejects Suspend when IT holds SELF origin (mid-apply), which + // means two OAPs both think they are the main for this file. Abort here rather than + // double-applying; the local self-suspend is reverted and the caller gets a 409. + final List suspendAcks = broadcastSuspend(catalog, name, "addOrUpdate"); + final SuspendAck rejected = firstRejected(suspendAcks); + if (rejected != null) { + dslManager.getSuspendCoord().localResume(catalog, name); + // Resume the peers that DID accept (SUSPENDED / ALREADY_SUSPENDED entries) so they + // flip back to RUNNING within one RPC round-trip. The rejecting peer ignores the + // Resume because it never transitioned to PEER-suspend under our sender id. + broadcastResume(catalog, name, "split_brain_detected"); + log.error("runtime-rule STRUCTURAL apply ABORTED for {}/{} — peer {} already " + + "holds SELF origin: {}. 
Cluster routing misfire; refusing to double-apply.", + catalog, name, rejected.getNodeId(), rejected.getDetail()); + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("split_brain_detected", catalog, name, + "peer " + rejected.getNodeId() + " reports a concurrent apply in flight " + + "(origin conflict); only one main per (catalog, name) is permitted. " + + "Re-run once cluster membership stabilizes.")); + } + + // Try the local apply with deferCommit=true. applyNowForRuleFile internally calls + // MalFileApplier.apply which runs the CreatingListener chain (DDL + isExists verify + // in the dslManager's verifyPostApply). Verify failure lands in + // DSLRuntimeState.lastApplyError. On success the commit's destructive tail (drop + // removedMetrics, swap appliedMal/appliedContent, retire old loader, alarm reset, + // advance snapshot) is stashed in the dslManager — we drain it below once persist + // resolves. + final long updateTime = System.currentTimeMillis(); + final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile = new RuntimeRuleManagementDAO.RuntimeRuleFile( + catalog, name, content, RuntimeRule.STATUS_ACTIVE, updateTime); + final DSLRuntimeState postApply; + try { + postApply = dslManager.applyNowForRuleFile(ruleFile, true); + } catch (final Throwable t) { + log.error("runtime-rule STRUCTURAL apply threw for {}/{}", catalog, name, t); + dslManager.getSuspendCoord().localResume(catalog, name); + // Peers went SUSPENDED on our earlier broadcast; let them know the apply + // aborted so they flip back to RUNNING within an RPC round-trip. + broadcastResume(catalog, name, "apply_threw"); + return serverError("apply_failed", catalog, name, t.getMessage()); + } + if (postApply != null && postApply.getLastApplyError() != null) { + // Apply failed (DDL verify mismatch, compile surprise, applier exception). Row + // is NOT yet persisted. 
applyOneRuleFile already rolled back its own partial + // registration on the exception path; the pendingCommits stash is only + // populated after verifyPostApply passes, so no pending drain to do here. + // Resume the retained pre-suspend bundle locally so this node goes back to + // serving samples, then broadcast Resume so peers recover immediately instead + // of waiting on the 60 s self-heal window. + dslManager.getSuspendCoord().localResume(catalog, name); + broadcastResume(catalog, name, "apply_failed"); + final String err = postApply.getLastApplyError(); + if (err.contains("isExists verify FAILED") + || err.contains("ddl") + || err.contains("install")) { + return serverError("ddl_verify_failed", catalog, name, err); + } + // Cross-file ownership conflict — the operator's rule names a metric + // already claimed by another active file. Operator-fixable, not a server + // error: surface as 409 so callers (and the e2e) can treat it the same as + // the other apply rejections (allowStorageChange, /delete-on-ACTIVE, ...). + if (err.contains("rule-name collision")) { + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("ownership_conflict", catalog, name, err)); + } + return serverError("apply_failed", catalog, name, err); + } + + // Apply succeeded + verified. Commit the row — the design's commit point. Retry a + // couple of times on transient failures before giving up; the per-backend + // RuntimeRuleManagementDAO.save can throw on a brief storage outage. A narrow retry + // here avoids turning a blip into a cluster-divergence event. + HttpResponse persistError = persistRuleSync(catalog, name, content, updateTime); + if (persistError != null) { + try { + Thread.sleep(100L); + } catch (final InterruptedException ie) { + Thread.currentThread().interrupt(); + } + persistError = persistRuleSync(catalog, name, content, updateTime); + } + if (persistError != null) { + // Persist still failing. 
The local node has registered added + shape-break + // metrics in MeterSystem (DDL fired, isExists verified) while the DB and peers + // remain on the old content. Discard drains the pending commit by removing only + // the added + shape-break metrics — it does NOT drop removedMetrics (the commit + // was stashed before that step, so those are still alive) and does NOT swap + // appliedMal/appliedContent (still on the pre-apply bundle). Net outcome: + // local node converges back to the pre-apply bundle exactly, no divergence from + // what the DB still says is current. + try { + dslManager.getCommitCoord().discardCommit(catalog, name); + } catch (final Throwable rt) { + log.error("runtime-rule CRITICAL: persist-failure discard itself failed for " + + "{}/{}; state is inconsistent and requires operator intervention", + catalog, name, rt); + } + // Peers are still SUSPENDED on our earlier broadcast. The DB didn't advance, + // so self-heal would eventually flip them back, but broadcasting Resume now + // cuts the dispatch gap from 60 s to a single RPC round-trip. + broadcastResume(catalog, name, "persist_failed"); + log.error("runtime-rule CRITICAL: STRUCTURAL persist FAILED after successful apply " + + "for {}/{} — discarded pending commit; local node re-aligned with old " + + "content. Operator action: re-push via /addOrUpdate once storage is healthy.", + catalog, name); + return persistError; + } + + // Persist succeeded — drain the pending commit now that the DB reflects the new + // content. commitCoord.finalizeCommit drops removedMetrics, swaps the applied + // pointers, retires the old loader, fires alarm reset, and advances the snapshot. + // + // Commit-tail failure handling: the DB row is durable (persist already succeeded), + // so peers converge from the DB — but on THIS node the local drop+recreate may + // not have fully landed. Return 500 commit_deferred so the operator sees a clear + // "DB row flipped, local commit threw" signal and can retry. 
Returning 200 would + // tell the operator "done" while the backend schema on this node may still be + // stale — that's the failure mode the review flagged. + Throwable commitFailure = null; + boolean drained = false; + try { + drained = dslManager.getCommitCoord().finalizeCommit(catalog, name); + } catch (final Throwable t) { + commitFailure = t; + log.error("runtime-rule CRITICAL: finalize commit FAILED for {}/{} after persist " + + "succeeded — DB is authoritative, peers will converge. Operator action: " + + "inspect log for the underlying cause.", catalog, name, t); + } + if (commitFailure != null) { + return serverError("commit_deferred", catalog, name, + "DB row persisted, but local commit-tail threw — backend shape on this " + + "node may not have fully landed. Peers converge from DB; this node " + + "will retry on the next dslManager tick. Cause: " + + commitFailure.getMessage()); + } + + // No commit was drained — typical for {@code force=true} re-applies on byte- + // identical content (engine returned NO_CHANGE so nothing was stashed). Peers are + // still PEER-suspended from our earlier broadcast and would only converge via the + // 60 s self-heal window without an explicit Resume. Send the Resume now so peers + // recover within an RPC round-trip. + if (!drained) { + broadcastResume(catalog, name, "force_no_change"); + } + + return ok(HttpStatus.OK, "structural_applied", catalog, name, + "structural apply succeeded" + describeDelta(delta)); + } + + /** + * Write the row through {@link RuntimeRuleManagementDAO#save} so a DAO failure is surfaced + * to the caller instead of silently swallowed. The earlier ManagementStreamProcessor path + * routed through the generic {@code IManagementDAO.insert}, which BanyanDB never persisted + * (just logged) and ES/JDBC short-circuited on duplicate row — both broke the + * persist-is-commit invariant for {@code /addOrUpdate} updates and {@code /inactivate} + * status flips. 
The DAO contract is now an explicit upsert per backend. + * + * @return {@code null} when the row is durable in storage; a 500 {@link HttpResponse} + * otherwise. Callers chain on null so the happy-path stays readable. + */ + private HttpResponse persistRuleSync(final String catalog, final String name, + final String content, final long updateTime) { + final RuntimeRuleManagementDAO dao = resolveDao(); + if (dao == null) { + return serverError("persist_failed", catalog, name, + "RuntimeRuleManagementDAO unavailable"); + } + final RuntimeRule rule = new RuntimeRule(); + rule.setCatalog(catalog); + rule.setName(name); + rule.setContent(content); + rule.setStatus(RuntimeRule.STATUS_ACTIVE); + rule.setUpdateTime(updateTime); + try { + dao.save(rule); + return null; + } catch (final Throwable t) { + log.error("failed to persist runtime rule {}/{}", catalog, name, t); + return serverError("persist_failed", catalog, name, t.getMessage()); + } + } + + /** + * Degraded-mode fallback used when the dslManager isn't wired (early boot, embedded test + * topologies). Persist the row so storage is durable and the dslManager can catch up on + * its own tick when it comes online; return 202. 
+ */ + private HttpResponse persistRowAndReturnPending(final String catalog, final String name, + final String content, final DSLDelta delta) { + final long updateTime = System.currentTimeMillis(); + final HttpResponse persistError = persistRuleSync(catalog, name, content, updateTime); + if (persistError != null) { + return persistError; + } + return ok(HttpStatus.ACCEPTED, "persisted_apply_pending", catalog, name, + "row written; classification=" + delta.classification().name() + + "; dslManager will apply within the tick interval"); + } + + private static String describeDelta(final DSLDelta delta) { + final StringBuilder sb = new StringBuilder(); + if (!delta.addedMetrics().isEmpty()) { + sb.append("; added=").append(delta.addedMetrics().size()); + } + if (!delta.removedMetrics().isEmpty()) { + sb.append("; removed=").append(delta.removedMetrics().size()); + } + if (!delta.shapeBreakMetrics().isEmpty()) { + sb.append("; shape-break=").append(delta.shapeBreakMetrics().size()); + } + return sb.toString(); + } + + /** + * Returns a rejection message when the edit is storage-affecting and the guardrail flag + * is not set; null when the edit is safe or the flag permits it. "Storage-affecting" is + * narrow by design: only shape-breaking MAL edits (scope type or explicit downsampling + * moved), or LAL edits that change a rule's outputType / add-remove rule keys. FILTER_ONLY + * body tweaks never trigger the guardrail — those don't touch storage. 
+ */ + private String storageChangeGuardrail(final String catalog, final String name, + final String priorContent, final String newContent, + final boolean allowStorageChange) { + if (allowStorageChange) { + return null; + } + final org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine engine = + dslManager.getEngineRegistry().forCatalog(catalog); + if (engine == null) { + return null; + } + final Set storageAffected; + try { + storageAffected = engine.storageImpactKeys(priorContent, newContent); + } catch (final RuntimeException e) { + return "classify failed (cannot evaluate storage impact): " + e.getMessage(); + } + if (storageAffected.isEmpty()) { + return null; + } + return "update would trigger a storage-level change for " + catalog + "/" + name + + " affecting " + storageAffected + + "; retry with allowStorageChange=true to accept data loss (measure drop + " + + "downsampling re-class on BanyanDB, orphaned rows on JDBC/ES)"; + } + + /** + * Full prior-row lookup. Returns null when the DAO is unavailable (early boot, some + * embedded test topologies) or when no row exists for {@code (catalog, name)}. The + * caller reads the row's content + status fields: the no_change short-circuit needs + * status to distinguish ACTIVE from INACTIVE (a re-post of the same content on an + * INACTIVE row reactivates rather than becoming a no-op), and the delta classifier + * reads content directly. Uses {@link RuntimeRuleManagementDAO#getAll()} + in-memory + * filter because the rule count is small (dozens per cluster in practice) and adding + * a per-row getter would be a cross-module API change for a handful of callers. + * + *

Storage read failures propagate to the caller as {@link IOException} — they MUST + * NOT be silently swallowed and surfaced as "no prior row". The + * {@link #storageChangeGuardrail} treats a null priorContent as a first-time create and + * skips the check; if a transient DAO blip turned a real STRUCTURAL update into an + * apparent first-time create, the guardrail would let a destructive change through. + * Callers translate the IOException into a 503 so the operator can retry. + */ + private RuntimeRuleManagementDAO.RuntimeRuleFile currentRuleFile(final String catalog, final String name) + throws IOException { + final RuntimeRuleManagementDAO dao = resolveDao(); + if (dao == null) { + return null; + } + for (final RuntimeRuleManagementDAO.RuntimeRuleFile r : dao.getAll()) { + if (catalog.equals(r.getCatalog()) && name.equals(r.getName())) { + return r; + } + } + return null; + } + + /** + * Accept common truthy forms: "true"/"1"/"yes" (case-insensitive) → true. Everything else, + * including null and empty → false. Query-string bool is notoriously inconsistent across + * clients (curl, browsers, scripts), so we normalize here rather than trusting any single + * form. + */ + private static boolean parseFlag(final String raw) { + if (raw == null || raw.isEmpty()) { + return false; + } + final String v = raw.trim().toLowerCase(); + return "true".equals(v) || "1".equals(v) || "yes".equals(v); + } + + /** + * Fire Suspend to every non-self peer on the OAP cluster bus. Returns the aggregated ack + * list — null entries are unreachable peers. Unreachable peers log and self-heal via the + * dslManager's 60s rule. The caller inspects the list for {@link SuspendState#REJECTED} + * entries before proceeding: a REJECTED ack means another OAP is concurrently mid-apply + * for the same (catalog, name) under its own SELF origin (routing misfire / split-brain). + * Ignoring it would let both OAPs apply-and-persist for the same file. 
+ */ + private List<SuspendAck> broadcastSuspend(final String catalog, final String name, final String reason) { + if (clusterClient == null) { + return Collections.emptyList(); // Not expected in production; guard for tests. + } + try { + return clusterClient.broadcastSuspend(catalog, name, reason); + } catch (final Throwable t) { + log.warn("runtime-rule Suspend broadcast failed for {}/{}; peers will self-heal " + + "via dslManager next tick", catalog, name, t); + return Collections.emptyList(); + } + } + + /** + * Inspect Suspend acks for the split-brain guard: if any peer responded with REJECTED + * (origin conflict — it believes it is the main and is mid-apply), surface that to the + * caller so we can abort before persisting. Unreachable peers (null entries) are ignored + * here — they recover via self-heal. Returns null when no peer rejected. + */ + private static SuspendAck firstRejected(final List<SuspendAck> acks) { + if (acks == null) { + return null; + } + for (final SuspendAck ack : acks) { + if (ack != null && ack.getState() == SuspendState.REJECTED) { + return ack; + } + } + return null; + } + + /** + * Fire Resume to every non-self peer. Called on every failure branch of the main-node's + * STRUCTURAL apply so peers flip back to RUNNING within an RPC round-trip instead of + * waiting for the 60 s self-heal threshold. In the 99% case (compile / verify / persist + * fails on the main and the main is alive to broadcast), peers resume immediately. In the + * 1% case (main crashes between Suspend and Resume), self-heal remains the backstop. 
+ */ + private void broadcastResume(final String catalog, final String name, final String reason) { + if (clusterClient == null) { + return; + } + try { + clusterClient.broadcastResume(catalog, name, reason); + } catch (final Throwable t) { + log.warn("runtime-rule Resume broadcast failed for {}/{} (reason={}); peers will " + + "self-heal via dslManager after selfHealThresholdMs", + catalog, name, reason, t); + } + } + + private HttpResponse doInactivate(final String catalog, final String name) { + return doInactivate(catalog, name, false); + } + + private HttpResponse doInactivate(final String catalog, final String name, + final boolean forwarded) { + final HttpResponse validationError = validate(catalog, name); + if (validationError != null) { + return validationError; + } + final HttpResponse routed = routeOrNull(catalog, name, "inactivate", + new byte[0], false, false, forwarded); + if (routed != null) { + return routed; + } + final RuntimeRuleManagementDAO dao = resolveDao(); + if (dao == null) { + return serverError("dao_unavailable", catalog, name, + "RuntimeRuleManagementDAO not resolvable — storage module may not be active"); + } + // Hold the same per-file lock /addOrUpdate holds so the Suspend → row flip → (peer) + // tick pipeline serializes with a racing /addOrUpdate on the same file. Without this + // a concurrent update could land its pending commit between our broadcastSuspend and + // the status UPSERT, producing a bundle that's live-with-content on peers and INACTIVE + // in the DB. 
+ final ReentrantLock perFile = AppliedRuleScript.lockFor(dslManager.getRules(), catalog, name); + if (!dslManager.getLockMetrics().acquireForRest(perFile, REST_LOCK_TIMEOUT_MS, catalog, name)) { + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("update_in_progress", catalog, name, + "another update for this rule file is in progress on this OAP; retry")); + } + try (HistogramMetrics.Timer ignored = + dslManager.getLockMetrics().startRestHoldTimer()) { + return doInactivateLocked(catalog, name, dao); + } finally { + perFile.unlock(); + } + } + + private HttpResponse doInactivateLocked(final String catalog, final String name, + final RuntimeRuleManagementDAO dao) { + final RuntimeRuleManagementDAO.RuntimeRuleFile existing; + try { + existing = findRuleFile(dao, catalog, name); + } catch (final IOException ioe) { + log.error("failed to look up runtime rule {}/{} for inactivate", catalog, name, ioe); + return serverError("inactivate_failed", catalog, name, ioe.getMessage()); + } + if (existing == null) { + // No DB row — fall back to the static content captured by the runtime-rule + // extension at boot. If a static version of this rule exists on disk, the + // operator is asking to silence it: persist an INACTIVE tombstone carrying the + // static content (so /dump and re-activation both have the authoritative body) + // and proceed with the destructive pipeline below. If neither row nor static + // exists, there is genuinely nothing to inactivate. 
+ final String staticContent = StaticRuleRegistry.active().find(catalog, name).orElse(null); + if (staticContent == null) { + return ok(HttpStatus.OK, "not_found", catalog, name, + "no runtime-rule row and no static version on disk; nothing to inactivate"); + } + return doInactivateStaticTombstone(catalog, name, staticContent); + } + if (RuntimeRule.STATUS_INACTIVE.equals(existing.getStatus())) { + return ok(HttpStatus.OK, "already_inactive", catalog, name, + "rule is already INACTIVE"); + } + return runInactivePipeline(catalog, name, existing.getContent(), false); + } + + /** + * Insert an {@code INACTIVE} tombstone row carrying the static content and drive the + * same destructive pipeline an existing-row /inactivate would run. Used when an operator + * inactivates a rule that only exists on disk (static file, no DB row yet) — the + * tombstone row becomes the source of truth so a reboot skips the static load and every + * peer converges on "not running" via the dslManager tick. + */ + private HttpResponse doInactivateStaticTombstone(final String catalog, final String name, + final String staticContent) { + return runInactivePipeline(catalog, name, staticContent, true); + } + + /** + * Shared pipeline for {@code /inactivate}: + *

+ *   1. Local self-suspend — main stops dispatching before peers learn of the removal.
+ *   2. Broadcast {@code Suspend} — peers park dispatch; origin conflict → abort with {@code Resume} + 409.
+ *   3. Persist {@code INACTIVE} row synchronously; failure → {@code Resume} + 500 (rollback point, cluster state never diverges).
+ *   4. Drive local teardown inline so main doesn't keep serving the bundle for up to one tick interval after the status flip.
+ * + * @param staticTombstone {@code true} when the row didn't previously exist and we're + * creating it from static content; changes the applyStatus string + * returned to the operator so /list + dashboards can distinguish + * the two cases. + */ + private HttpResponse runInactivePipeline(final String catalog, final String name, + final String content, final boolean staticTombstone) { + // Local self-suspend first so main stops serving the old bundle before the peer + // broadcast. Without this the main kept serving while peers were told to Suspend — + // peers stop first, cluster state diverges until the main's next tick drives teardown. + final SuspendResult local = dslManager.getSuspendCoord().localSuspend(catalog, name); + if (local == SuspendResult.REJECTED_ORIGIN_CONFLICT) { + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("origin_conflict", catalog, name, + "peer origin already holds this bundle — cluster routing misfire; " + + "refusing to inactivate while a peer reports concurrent apply")); + } + // Inactivate removes every metric the bundle owned across the cluster. Suspend peer + // dispatch before the status flip propagates so samples arriving between the UPSERT + // and the peer dslManager tick don't land in the soon-to-be-dropped bundle. 
+ final List suspendAcks = broadcastSuspend(catalog, name, "inactivate"); + final SuspendAck rejected = firstRejected(suspendAcks); + if (rejected != null) { + dslManager.getSuspendCoord().localResume(catalog, name); + broadcastResume(catalog, name, "split_brain_detected"); + log.error("runtime-rule inactivate ABORTED for {}/{} — peer {} already holds " + + "SELF origin: {}", catalog, name, rejected.getNodeId(), rejected.getDetail()); + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("split_brain_detected", catalog, name, + "peer " + rejected.getNodeId() + " reports a concurrent apply in flight")); + } + + final RuntimeRuleManagementDAO inactivateDao = resolveDao(); + if (inactivateDao == null) { + dslManager.getSuspendCoord().localResume(catalog, name); + broadcastResume(catalog, name, "inactivate_persist_failed"); + return serverError("inactivate_failed", catalog, name, + "RuntimeRuleManagementDAO unavailable"); + } + final RuntimeRule rule = new RuntimeRule(); + rule.setCatalog(catalog); + rule.setName(name); + rule.setContent(content); + rule.setStatus(RuntimeRule.STATUS_INACTIVE); + rule.setUpdateTime(System.currentTimeMillis()); + try { + inactivateDao.save(rule); + } catch (final Throwable t) { + // Suspend is already in-flight; if we don't Resume, peers sit suspended for + // selfHealThresholdMs. Send Resume now so they recover within one RPC round-trip. + dslManager.getSuspendCoord().localResume(catalog, name); + broadcastResume(catalog, name, "inactivate_persist_failed"); + log.error("failed to inactivate runtime rule {}/{}", catalog, name, t); + return serverError("inactivate_failed", catalog, name, t.getMessage()); + } + + // Drive local teardown immediately now that the DB row reflects INACTIVE — main owns + // the write, so the dslManager's tick would eventually do this, but waiting means the + // main keeps serving the removed bundle for up to one tick interval (30 s by default). 
+ // applyNowForRuleFile is idempotent; if the tick fires first, the second call is a + // fast no-op on the matching hash. + // + // SOFT-PAUSE semantics: pass {@link StorageManipulationOpt#localCacheOnly()} so the + // teardown unregisters every OAP-internal artefact (MeterSystem prototypes, + // MetricsStreamProcessor entry / persistent workers, BatchQueue handlers, retired + // RuleClassLoader) without firing the backend dropTable cascade. The measure / table + // / index and any data already persisted under the pre-inactivate metric stay + // intact — operators reactivate via {@code /addOrUpdate} and the existing data + // remains queryable through the new bundle. {@code /delete} is the only path that + // drops the backend schema. + // + // Teardown failure handling: surface as 500 teardown_deferred rather than 200 + // inactivated. The DB row IS INACTIVE (persist already succeeded above) so peers + // converge from the DB — but on THIS node the OAP-internal teardown may not have + // completed (MalFileApplier swallowed per-metric failures, MetricsStreamProcessor + // worker drain threw, etc.). Returning 200 would tell the operator "done" while + // dispatch is still live; 500 + "teardown_deferred" accurately signals retriable + // state — the next dslManager tick re-runs the same localCacheOnly teardown. + final RuntimeRuleManagementDAO.RuntimeRuleFile inactiveFile = + new RuntimeRuleManagementDAO.RuntimeRuleFile( + catalog, name, content, + RuntimeRule.STATUS_INACTIVE, rule.getUpdateTime()); + try { + dslManager.applyNowForRuleFile(inactiveFile, false, + StorageManipulationOpt.localCacheOnly()); + } catch (final Throwable t) { + log.warn("runtime-rule inactivate: local teardown deferred to tick for {}/{}", + catalog, name, t); + return serverError("teardown_deferred", catalog, name, + "DB row flipped to INACTIVE, but local teardown threw — OAP-internal " + + "register cleanup on this node may not have completed. Tick will " + + "retry. 
Cause: " + t.getMessage()); + } + return ok(HttpStatus.OK, staticTombstone ? "static_tombstoned" : "inactivated", + catalog, name, + staticTombstone + ? "static rule tombstoned with INACTIVE row; local handlers unregistered; " + + "peers converge on next tick" + : "status set to INACTIVE; local handlers unregistered; peers converge on next tick"); + } + + private HttpResponse doDelete(final String catalog, final String name, final DeleteMode mode) { + return doDelete(catalog, name, mode, false); + } + + private HttpResponse doDelete(final String catalog, final String name, + final DeleteMode mode, + final boolean forwarded) { + final HttpResponse validationError = validate(catalog, name); + if (validationError != null) { + return validationError; + } + // Forward to main with the mode's wire value preserved as request body bytes — the + // receiver unpacks it via executeDelete(..., new String(body, UTF_8)) and re-parses. + // Empty body for DEFAULT. + final byte[] modeBody = mode == DeleteMode.DEFAULT + ? new byte[0] + : mode.getWireValue().getBytes(StandardCharsets.UTF_8); + final HttpResponse routed = routeOrNull(catalog, name, "delete", + modeBody, false, false, forwarded); + if (routed != null) { + return routed; + } + final RuntimeRuleManagementDAO dao = resolveDao(); + if (dao == null) { + return serverError("dao_unavailable", catalog, name, + "RuntimeRuleManagementDAO not resolvable — storage module may not be active"); + } + // Same locking contract as /inactivate. 
+ final ReentrantLock perFile = AppliedRuleScript.lockFor(dslManager.getRules(), catalog, name); + if (!dslManager.getLockMetrics().acquireForRest(perFile, REST_LOCK_TIMEOUT_MS, catalog, name)) { + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("update_in_progress", catalog, name, + "another update for this rule file is in progress on this OAP; retry")); + } + try (HistogramMetrics.Timer ignored = + dslManager.getLockMetrics().startRestHoldTimer()) { + return doDeleteLocked(catalog, name, mode, dao); + } finally { + perFile.unlock(); + } + } + + private HttpResponse doDeleteLocked(final String catalog, final String name, + final DeleteMode mode, + final RuntimeRuleManagementDAO dao) { + // /delete is the one destructive endpoint. /inactivate is a soft-pause that runs the + // OAP-internal teardown under localCacheOnly, deliberately preserving the BanyanDB + // measure + its data so a re-activation via /addOrUpdate is cheap and lossless. + // /delete drops the backend measure first, then removes the tombstone row. + // + // The two-step workflow (/inactivate → /delete) is enforced by the INACTIVE-status + // check below: an ACTIVE rule cannot be deleted in one shot. This separation makes + // the destructive moment explicit and lets operators reverse the soft-pause for a + // bounded window before committing to data loss. + final RuntimeRuleManagementDAO.RuntimeRuleFile prior; + try { + prior = findRule(dao, catalog, name); + } catch (final IOException ioe) { + log.error("runtime-rule delete: prior-row lookup failed for {}/{}", catalog, name, ioe); + return serverError("dao_unavailable", catalog, name, + "prior-row lookup failed: " + ioe.getMessage()); + } + if (prior == null) { + // Idempotent: the desired end state (no row) is already achieved. Return 200 with + // an explicit applyStatus so operators can distinguish a no-op from a real delete. 
+ return ok(HttpStatus.OK, "not_found", catalog, name, + "no row present for this rule; nothing to delete"); + } + if (!RuntimeRule.STATUS_INACTIVE.equals(prior.getStatus())) { + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("requires_inactivate_first", catalog, name, + "rule is ACTIVE; POST /runtime/rule/inactivate first, then /runtime/rule/delete. " + + "Inactivate runs the soft-pause (handlers stop dispatching; backend " + + "measure preserved); delete drops the backend measure and removes the row.")); + } + + final boolean bundledTwinExists = + StaticRuleRegistry.active().find(catalog, name).isPresent(); + if (mode == DeleteMode.REVERT_TO_BUNDLED && !bundledTwinExists) { + // Operator scripted the revert mode for a rule that has no bundled twin — + // /delete cannot revert to anything. Surface a 400 with a clear error so the + // script knows to either drop the mode flag or the operator's assumption was + // wrong about which rules exist on disk. + return badRequest("no_bundled_twin", catalog, name, + "mode=revertToBundled requires a bundled YAML on disk for this " + + "(catalog, name); none was found"); + } + + // Backend drop. /inactivate preserved the BanyanDB measure under localCacheOnly; + // discharge that debt now via the dslManager before the row goes away. The + // orchestrator skips the destructive cascade when a bundled twin exists (bundled + // will reuse the backend resource on the synchronous reload below). LAL has no + // backend schema so the call is a no-op for the lal catalog. A throw here aborts + // the row deletion — we do NOT proceed with dao.delete on backend-drop failure: + // that would orphan the measure with no way to find it again. + try { + dslManager.getDslRuntimeDelete().dropBackendForDelete(catalog, name, prior.getContent()); + } catch (final IllegalStateException refused) { + // Cross-file ownership conflict /addOrUpdate's guard didn't catch. 
Surface as + // 409 so the operator sees a clear "fix and retry" signal rather than 500. + log.warn("runtime-rule /delete refused for {}/{}: {}", catalog, name, refused.getMessage()); + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("delete_refused", catalog, name, refused.getMessage())); + } catch (final Throwable t) { + log.error("runtime-rule /delete: backend drop threw for {}/{}", catalog, name, t); + return serverError("delete_backend_drop_failed", catalog, name, t.getMessage()); + } + try { + dao.delete(catalog, name); + } catch (final IOException e) { + log.error("failed to delete runtime rule {}/{}", catalog, name, e); + return serverError("delete_failed", catalog, name, e.getMessage()); + } + + // Synchronously reload the bundled rule (if any) so the operator's response + // reflects the post-delete reality — bundled is already serving via a static: + // loader on this node. Peer nodes converge via the gone-keys reconcile path on + // their next tick. A reload failure is logged and surfaced as a partial-success + // response (200 with applyStatus=reverted_to_bundled_partial) — the row is gone, + // the operator's intent landed, but bundled didn't compile cleanly on this node. + if (bundledTwinExists) { + final boolean reloaded = dslManager.getDslRuntimeDelete() + .reloadBundledIfPresent(catalog, name); + return ok(HttpStatus.OK, + reloaded ? "reverted_to_bundled" : "reverted_to_bundled_partial", + catalog, name, + reloaded + ? 
"runtime row removed; bundled rule reinstalled into a static: loader " + + "on this node; peers converge on next tick" + : "runtime row removed; bundled reload deferred (compile failed or " + + "engine unavailable); peers will retry via the gone-keys " + + "reconcile on their next tick"); + } + return ok(HttpStatus.OK, "deleted", catalog, name, + "backend measure dropped, runtime row removed from storage; rule is fully gone"); + } + + /** + * Look up the current rule file for {@code (catalog, name)} via the DAO. Returns + * {@code null} when no such rule exists; propagates {@link IOException} so callers that + * need a definitive answer (notably {@link #doDeleteLocked}) can fail loud instead of + * treating a DAO blip as "rule is absent". + */ + private RuntimeRuleManagementDAO.RuntimeRuleFile findRule(final RuntimeRuleManagementDAO dao, + final String catalog, + final String name) throws IOException { + for (final RuntimeRuleManagementDAO.RuntimeRuleFile r : dao.getAll()) { + if (catalog.equals(r.getCatalog()) && name.equals(r.getName())) { + return r; + } + } + return null; + } + + private HttpResponse validate(final String catalog, final String name) { + if (catalog == null || catalog.isEmpty()) { + return badRequest("missing_catalog", catalog, name, "catalog query parameter is required"); + } + if (!isValidCatalog(catalog)) { + return badRequest("invalid_catalog", catalog, name, + "catalog must be one of " + validCatalogs()); + } + if (name == null || name.isEmpty()) { + return badRequest("missing_name", catalog, name, "name query parameter is required"); + } + if (name.startsWith("/") || name.contains("..") || name.contains("\\") + || name.contains("\u0000") || !VALID_NAME.matcher(name).matches()) { + return badRequest("invalid_name", catalog, name, + "name must match segments [A-Za-z0-9._-]+ joined by '/' with no leading slash, " + + "no '..', no empty segments, no backslash"); + } + return null; + } + + private RuntimeRuleManagementDAO resolveDao() { + try { 
+ return moduleManager.find(StorageModule.NAME).provider() + .getService(RuntimeRuleManagementDAO.class); + } catch (final Throwable t) { + log.error("RuntimeRuleManagementDAO lookup failed", t); + return null; + } + } + + private RuntimeRuleManagementDAO.RuntimeRuleFile findRuleFile( + final RuntimeRuleManagementDAO dao, final String catalog, final String name) throws IOException { + for (final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile : dao.getAll()) { + if (catalog.equals(ruleFile.getCatalog()) && name.equals(ruleFile.getName())) { + return ruleFile; + } + } + return null; + } + + // ----- Response helpers -------------------------------------------------------------------- + + private static HttpResponse ok(final HttpStatus status, final String applyStatus, + final String catalog, final String name, final String message) { + return HttpResponse.of(status, MediaType.JSON_UTF_8, jsonBody(applyStatus, catalog, name, message)); + } + + private static HttpResponse badRequest(final String applyStatus, final String catalog, + final String name, final String message) { + return HttpResponse.of(HttpStatus.BAD_REQUEST, MediaType.JSON_UTF_8, + jsonBody(applyStatus, catalog, name, message)); + } + + private static HttpResponse serverError(final String applyStatus, final String catalog, + final String name, final String message) { + return HttpResponse.of(HttpStatus.INTERNAL_SERVER_ERROR, MediaType.JSON_UTF_8, + jsonBody(applyStatus, catalog, name, message)); + } + + private static HttpResponse notImplemented(final String op, final String catalog, final String name) { + final JsonObject body = new JsonObject(); + body.addProperty("applyStatus", "not_implemented"); + body.addProperty("op", op); + body.addProperty("catalog", catalog == null ? "" : catalog); + body.addProperty("name", name == null ? 
"" : name); + return HttpResponse.of(HttpStatus.NOT_IMPLEMENTED, MediaType.JSON_UTF_8, GSON.toJson(body)); + } + + private static String jsonBody(final String applyStatus, final String catalog, + final String name, final String message) { + final JsonObject body = new JsonObject(); + body.addProperty("applyStatus", applyStatus); + body.addProperty("catalog", catalog == null ? "" : catalog); + body.addProperty("name", name == null ? "" : name); + body.addProperty("message", message == null ? "" : message); + return GSON.toJson(body); + } + + /** {@link #jsonBody} variant that also carries the resolved cluster-main address — used + * by the routing-failure responses ({@code cluster_view_split}, {@code forward_failed}) + * so the operator sees which peer was attempted. */ + private static String routingErrorBody(final String applyStatus, final String catalog, + final String name, final String mainNode, + final String message) { + final JsonObject body = new JsonObject(); + body.addProperty("applyStatus", applyStatus); + body.addProperty("catalog", catalog == null ? "" : catalog); + body.addProperty("name", name == null ? "" : name); + body.addProperty("mainNode", mainNode == null ? "" : mainNode); + body.addProperty("message", message == null ? "" : message); + return GSON.toJson(body); + } + + /** + * RFC 8259 §7 JSON-string escape. Handles {@code "}, {@code \}, control chars + * (newlines, tabs, etc. — the case that the original two-char replace did NOT cover), + * and other characters below {@code U+0020}. Non-ASCII printable characters pass through + * unchanged — the response is {@code application/json; charset=utf-8} so multi-byte + * UTF-8 (Chinese comments, emoji in tags) is carried as-is. 
+ */ + private static String escape(final String s) { + if (s == null) { + return ""; + } + final StringBuilder sb = new StringBuilder(s.length() + 8); + for (int i = 0; i < s.length(); i++) { + final char c = s.charAt(i); + switch (c) { + case '"': + sb.append("\\\""); + break; + case '\\': + sb.append("\\\\"); + break; + case '\n': + sb.append("\\n"); + break; + case '\r': + sb.append("\\r"); + break; + case '\t': + sb.append("\\t"); + break; + case '\b': + sb.append("\\b"); + break; + case '\f': + sb.append("\\f"); + break; + default: + if (c < 0x20) { + sb.append(String.format("\\u%04x", (int) c)); + } else { + sb.append(c); + } + break; + } + } + return sb.toString(); + } + + /** + * Render one /list line for a rule that exists in storage. {@code local} is the dslManager's + * DSLRuntimeState (may be null if the dslManager hasn't observed this row yet — first-tick + * window). The merged line carries both the persisted and transient per-node pieces so an + * operator sees everything needed to diagnose convergence gaps. + */ + private static JsonObject renderListEntry(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, + final DSLRuntimeState local) { + final JsonObject row = new JsonObject(); + row.addProperty("catalog", ruleFile.getCatalog()); + row.addProperty("name", ruleFile.getName()); + row.addProperty("status", ruleFile.getStatus()); + row.addProperty("localState", local == null + ? DSLRuntimeState.LocalState.NOT_LOADED.name() + : local.getLocalState().name()); + row.addProperty("suspendOrigin", local == null + ? DSLRuntimeState.SuspendOrigin.NONE.name() + : local.getSuspendOrigin().name()); + row.addProperty("loaderGc", local == null + ? 
DSLRuntimeState.LoaderGc.LIVE.name() + : local.getLoaderGc().name()); + addLoaderFields(row, ruleFile.getCatalog(), ruleFile.getName()); + row.addProperty("contentHash", ContentHash.sha256Hex(ruleFile.getContent())); + addBundledFields(row, ruleFile.getCatalog(), ruleFile.getName()); + row.addProperty("updateTime", ruleFile.getUpdateTime()); + row.addProperty("lastApplyError", + local == null || local.getLastApplyError() == null ? "" : local.getLastApplyError()); + return row; + } + + /** Look up the per-rule loader the manager has installed for {@code (catalog, name)} and + * add {@code loaderKind} / {@code loaderName} to {@code row}. {@code loaderKind=NONE} + * with empty {@code loaderName} when no per-file loader exists for this key (typical + * for bundled-only rules served from the shared default loader). */ + private static void addLoaderFields(final JsonObject row, final String catalog, + final String name) { + final Catalog c; + try { + c = Catalog.of(catalog); + } catch (final IllegalArgumentException unknown) { + row.addProperty("loaderKind", "NONE"); + row.addProperty("loaderName", ""); + return; + } + final Optional<RuleClassLoader> loader = + DSLClassLoaderManager.INSTANCE.active(c, name); + row.addProperty("loaderKind", + loader.map(l -> l.getKind().name()).orElse("NONE")); + row.addProperty("loaderName", loader.map(RuleClassLoader::getName).orElse("")); + } + + /** Add {@code bundled} (boolean) and {@code bundledContentHash} (string, omitted when + * no bundled twin exists) so the UI can render an "Override" / "Modified from bundled" + * badge without a second roundtrip to {@code /bundled}.
*/ + private static void addBundledFields(final JsonObject row, final String catalog, + final String name) { + final Optional<String> bundled = StaticRuleRegistry.active().find(catalog, name); + row.addProperty("bundled", bundled.isPresent()); + bundled.ifPresent(content -> + row.addProperty("bundledContentHash", ContentHash.sha256Hex(content))); + } + + /** + * Assemble a tar.gz of rule rows. Nested directory layout mirrors the on-disk static + * catalog tree so the archive is re-POSTable through {@code addOrUpdate} for DR restore. + * ACTIVE rows go under {@code <catalog>/<name>.yaml}; INACTIVE rows go under + * {@code inactive/<catalog>/<name>.yaml} so restore can reproduce tombstones explicitly. + * A top-level {@code manifest.yaml} records per-row metadata for audit / integrity. + */ + private HttpResponse doDump(final String catalogFilter) { + final RuntimeRuleManagementDAO dao = resolveDao(); + if (dao == null) { + return serverError("dao_unavailable", catalogFilter, null, + "RuntimeRuleManagementDAO not resolvable — storage module may not be active"); + } + final List<RuntimeRuleManagementDAO.RuntimeRuleFile> ruleFiles; + try { + ruleFiles = dao.getAll(); + } catch (final IOException e) { + return serverError("dump_failed", catalogFilter, null, e.getMessage()); + } + + final ByteArrayOutputStream buffer = new ByteArrayOutputStream(); + final DateTimeFormatter iso = DateTimeFormatter.ISO_INSTANT; + final String dumpedAt = iso.format(Instant.now()); + try (GZIPOutputStream gzip = new GZIPOutputStream(buffer); + TarArchiveOutputStream tar = new TarArchiveOutputStream(gzip)) { + tar.setLongFileMode(TarArchiveOutputStream.LONGFILE_POSIX); + + final StringBuilder manifest = new StringBuilder(); + manifest.append("dumpedAt: \"").append(dumpedAt).append("\"\n"); + manifest.append("catalogFilter: \"") + .append(catalogFilter == null ?
"" : escape(catalogFilter)) + .append("\"\n"); + manifest.append("entries:\n"); + + int emitted = 0; + for (final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile : ruleFiles) { + if (catalogFilter != null && !catalogFilter.equals(ruleFile.getCatalog())) { + continue; + } + emitted++; + final boolean inactive = RuntimeRule.STATUS_INACTIVE.equals(ruleFile.getStatus()); + final String prefix = inactive ? "runtime-rule-dump/inactive/" : "runtime-rule-dump/"; + final String entryPath = prefix + ruleFile.getCatalog() + "/" + ruleFile.getName() + ".yaml"; + writeTarEntry(tar, entryPath, ruleFile.getContent()); + + final String sha = ContentHash.sha256Hex(ruleFile.getContent()); + manifest.append(" - catalog: ").append(ruleFile.getCatalog()).append("\n"); + manifest.append(" name: \"").append(escape(ruleFile.getName())).append("\"\n"); + manifest.append(" status: ").append(ruleFile.getStatus()).append("\n"); + manifest.append(" updateTime: ").append(ruleFile.getUpdateTime()).append("\n"); + manifest.append(" sha256: \"").append(sha).append("\"\n"); + } + writeTarEntry(tar, "runtime-rule-dump/manifest.yaml", manifest.toString()); + log.info("runtime-rule dump assembled: {} row(s) {}", + emitted, catalogFilter == null ? "(all catalogs)" : "(catalog=" + catalogFilter + ")"); + } catch (final IOException e) { + return serverError("dump_failed", catalogFilter, null, e.getMessage()); + } + + final String filename = "runtime-rule-dump-" + dumpedAt.replace(":", "-") + ".tar.gz"; + return HttpResponse.of( + ResponseHeaders.builder(HttpStatus.OK) + .contentType(MediaType.OCTET_STREAM) + .add(HttpHeaderNames.CONTENT_DISPOSITION, + "attachment; filename=\"" + filename + "\"") + .build(), + HttpData.copyOf(buffer.toByteArray())); + } + + private static void writeTarEntry( + final TarArchiveOutputStream tar, + final String path, final String body) throws IOException { + final byte[] bytes = body == null ? 
new byte[0] : body.getBytes(StandardCharsets.UTF_8); + final TarArchiveEntry entry = new TarArchiveEntry(path); + entry.setSize(bytes.length); + tar.putArchiveEntry(entry); + tar.write(bytes); + tar.closeArchiveEntry(); + } + + /** + * Render a /list row for an in-memory bundle whose runtime row has been deleted but the + * dslManager hasn't yet swept — transient, typically gone within one tick. Surfacing it + * helps operators watching a delete propagate. + */ + private static JsonObject renderOrphanEntry(final DSLRuntimeState local) { + final JsonObject row = new JsonObject(); + row.addProperty("catalog", local.getCatalog()); + row.addProperty("name", local.getName()); + row.addProperty("status", "n/a"); + row.addProperty("localState", local.getLocalState().name()); + row.addProperty("loaderGc", local.getLoaderGc().name()); + addLoaderFields(row, local.getCatalog(), local.getName()); + row.addProperty("contentHash", local.getContentHash()); + addBundledFields(row, local.getCatalog(), local.getName()); + row.addProperty("pendingUnregister", true); + return row; + } + + /** + * Render a /list row for a bundled-only rule — shipped on disk, no runtime override. + * Status is reported as {@code BUNDLED} so operators can distinguish it from an + * {@code ACTIVE} runtime override and from the transient orphan state. 
+ */ + private static JsonObject renderBundledEntry(final DSLRuntimeState local) { + final JsonObject row = new JsonObject(); + row.addProperty("catalog", local.getCatalog()); + row.addProperty("name", local.getName()); + row.addProperty("status", "BUNDLED"); + row.addProperty("localState", local.getLocalState().name()); + row.addProperty("loaderGc", local.getLoaderGc().name()); + addLoaderFields(row, local.getCatalog(), local.getName()); + row.addProperty("contentHash", local.getContentHash()); + addBundledFields(row, local.getCatalog(), local.getName()); + row.addProperty("pendingUnregister", false); + return row; + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/AppliedRuleScript.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/AppliedRuleScript.java new file mode 100644 index 000000000000..9c7112ff1742 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/AppliedRuleScript.java @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.state; + +import java.util.Map; +import java.util.concurrent.locks.ReentrantLock; +import lombok.Getter; + +/** + * One DSL rule script as the dslManager currently holds it on this node — every per-file + * piece of state in one immutable record. Updates produce a new instance via the + * {@code with*} builders, so a {@link java.util.concurrent.ConcurrentMap#compute compute} on + * the dslManager's {@code rules} map gives atomic per-key transitions without an external + * lock. + * + *

<p>Fields:
+ * <ul>
+ *   <li>{@link #catalog}, {@link #name} — identity.</li>
+ *   <li>{@link #content} — the YAML last successfully applied. The engine's {@code classify}
+ *       reads this as the "old side" of the next delta; {@code /list} surfaces it (or its
+ *       hash) to the operator. {@code null} until the first successful commit, cleared back
+ *       to {@code null} on unregister.</li>
+ *   <li>{@link #state} — operator-facing per-key view ({@link DSLRuntimeState}: RUNNING /
+ *       SUSPENDED / NOT_LOADED, {@code suspendOrigin}, {@code lastApplyError}, timestamps).
+ *       Returned verbatim on {@code /list}.</li>
+ *   <li>{@link #lock} — per-file outermost {@link ReentrantLock}. Identity is stable across
+ *       {@code with*} builders so consecutive transitions on the same rule serialize on the
+ *       same mutex. Cluster Suspend RPCs, REST workflows, the dslManager tick, and inline
+ *       sync apply paths all acquire this lock by going through
+ *       {@code rules.computeIfAbsent(key, k -> empty(catalog, name)).getLock()} — the entry
+ *       is lazy-created on first lock so callers never have to ask "does the rule exist
+ *       yet?" before locking.</li>
+ *   <li>{@link #applied} — engine-opaque artefact the engine wrote on its last successful
+ *       commit. {@code null} until the first commit, cleared on unregister. The
+ *       {@link EngineApplied} interface lets cross-DSL code (Suspend/Resume coordinator,
+ *       cross-file ownership guard, classloader graveyard hand-off) drive dispatch and
+ *       claim queries polymorphically without switching on MAL vs LAL; engines cast to
+ *       their richer subtype when they need the full Applied.</li>
+ * </ul>
+ *
+ *

This class consolidates what used to be four parallel per-key maps on the dslManager + * (snapshot {@code DSLRuntimeState}, {@code appliedContent} YAML, {@code PerFileLockMap} + * locks, {@code appliedMal}/{@code appliedLal} engine artefacts). Per-rule operations — + * classify, apply, unregister, suspend, resume, persist, /list — read or replace one + * {@code AppliedRuleScript} instead of coordinating across maps. + */ +@Getter +public final class AppliedRuleScript { + + private final String catalog; + private final String name; + private final String content; + private final DSLRuntimeState state; + private final ReentrantLock lock; + private final EngineApplied applied; + + /** + * Construct a fresh entry with a brand-new {@link ReentrantLock} and no applied artefact. + * Used on the first time a {@code (catalog, name)} pair is seen — either via lazy lock + * acquire on the rules map or an explicit static-rule load. + */ + public AppliedRuleScript(final String catalog, final String name, final String content, + final DSLRuntimeState state) { + this(catalog, name, content, state, new ReentrantLock(), null); + } + + /** + * Internal constructor used by the {@code with*} builders to preserve the lock identity + * + applied artefact. Public so engines / tests can construct a freshly-applied script + * with a specific {@link EngineApplied} when needed. + */ + public AppliedRuleScript(final String catalog, final String name, final String content, + final DSLRuntimeState state, final ReentrantLock lock, + final EngineApplied applied) { + this.catalog = catalog; + this.name = name; + this.content = content; + this.state = state; + this.lock = lock; + this.applied = applied; + } + + /** + * Build a fresh entry for {@code (catalog, name)} with no content, no state, no applied + * artefact, and a fresh {@link ReentrantLock}. Used by the rules map's + * {@code computeIfAbsent} so callers can lock for a {@code (catalog, name)} before the + * rule has any state of its own. 
+ */ + public static AppliedRuleScript empty(final String catalog, final String name) { + return new AppliedRuleScript(catalog, name, null, null); + } + + public AppliedRuleScript withContent(final String newContent) { + return new AppliedRuleScript(catalog, name, newContent, state, lock, applied); + } + + public AppliedRuleScript withState(final DSLRuntimeState newState) { + return new AppliedRuleScript(catalog, name, content, newState, lock, applied); + } + + public AppliedRuleScript withApplied(final EngineApplied newApplied) { + return new AppliedRuleScript(catalog, name, content, state, lock, newApplied); + } + + public AppliedRuleScript withContentAndState(final String newContent, + final DSLRuntimeState newState) { + return new AppliedRuleScript(catalog, name, newContent, newState, lock, applied); + } + + public AppliedRuleScript withContentAndApplied(final String newContent, + final EngineApplied newApplied) { + return new AppliedRuleScript(catalog, name, newContent, state, lock, newApplied); + } + + /** + * Lazy-acquire the per-file {@link ReentrantLock} for {@code (catalog, name)} on + * {@code rules}. Used by every caller that needs to serialise on a {@code (catalog, name)} + * before the rule has any state of its own — the entry is auto-created with + * {@link #empty} on first call and the lock returned has stable identity across + * subsequent {@code with*} replacements of the entry. + * + *

Centralised here so the lazy-create pattern lives in one place instead of being + * duplicated at every dependent's call site. + */ + public static ReentrantLock lockFor(final Map<String, AppliedRuleScript> rules, + final String catalog, final String name) { + return rules.computeIfAbsent(catalog + ":" + name, + k -> empty(catalog, name)).getLock(); + } +} + diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/DSLRuntimeState.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/DSLRuntimeState.java new file mode 100644 index 000000000000..f52fb63d14d8 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/DSLRuntimeState.java @@ -0,0 +1,312 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.state; + +import java.util.Objects; + +/** + * Immutable per-node snapshot of what a runtime rule is currently doing on this OAP instance. + * + *
<p>
Stored in the dslManager's {@code ConcurrentHashMap<(catalog, name), DSLRuntimeState>} so one + * {@code get} returns a fully-populated state; readers never see a half-transitioned entry. + * This is intentionally the opposite of {@code AlarmStatusWatcher.getAlarmRuleContext} which + * field-reads without a snapshot and can surface mixed-generation results — the runtime-rule + * {@code /list} surface avoids that trap from day one by replacing the record wholesale on + * every state transition. + * + *
<p>
Records are not used here because the project must compile on JDK 11. A plain final-field + * class with explicit getters and {@code with*} copy-constructor helpers gives the same + * immutability contract on the supported baseline. + */ +public final class DSLRuntimeState { + + /** Local lifecycle state on this node. Distinct from the DB {@code status} column. */ + public enum LocalState { + /** Handlers registered, samples flowing. */ + RUNNING, + /** Transiently removed from the dispatch map during a structural apply or + * missed-broadcast recovery. Samples for this bundle's metrics are dropped + * for the duration. Always paired with a {@link SuspendOrigin} other than + * {@link SuspendOrigin#NONE} to describe WHY the bundle is suspended. */ + SUSPENDED, + /** No compile has succeeded yet for this (catalog, name) on this node. */ + NOT_LOADED + } + + /** + * Reason this node is SUSPENDED. Distinct origins must be tracked because Resume-from-peer + * must not undo a local self-suspend that's mid-apply. + * + *

<ul>
+ * <li>{@link #NONE} — localState is not SUSPENDED (informational default).</li>
+ * <li>{@link #SELF} — this node entered SUSPENDED itself, right before its own + * STRUCTURAL apply. Only {@code SuspendResumeCoordinator#localResume} + * (called by the REST handler on its own failure / commit tail) clears this.</li>
+ * <li>{@link #PEER} — a peer main node broadcast Suspend to this node. Only the peer's + * subsequent Resume broadcast or the 60 s self-heal rule clears this.</li>
+ * <li>{@link #BOTH} — reserved lattice slot. Under correct single-main routing this + * value is unreachable: {@code applySuspend} rejects a cross-origin incoming + * {@code Suspend} with {@link org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.SuspendResult#REJECTED_ORIGIN_CONFLICT} + * before reaching the lattice merge, so the state cannot transition into BOTH. + * The enum value stays to keep the {@link #add} / {@link #remove} lattice total + * (and to surface unambiguously if a bug ever did let both origins coexist).</li>
+ * </ul>
+ * <p>
The bundle is SUSPENDED iff this origin is anything other than {@link #NONE}. + */ + public enum SuspendOrigin { + NONE, SELF, PEER, BOTH; + + /** + * Merge a new origin into the existing one. {@code add(X, Y)} returns {@code X ∪ Y} + * in the lattice {NONE < SELF/PEER < BOTH}. Idempotent: adding an origin that's + * already included returns the input unchanged. + */ + public SuspendOrigin add(final SuspendOrigin other) { + if (other == null || other == NONE || this == other) { + return this; + } + if (this == NONE) { + return other; + } + // One side is SELF, the other is PEER (they differ and neither is NONE). + return BOTH; + } + + /** + * Remove an origin from the existing one. {@code remove(BOTH, SELF) == PEER}, etc. + * Returns {@link #NONE} when the last origin is removed; caller uses that to flip + * {@link LocalState#SUSPENDED} back to {@link LocalState#RUNNING}. + */ + public SuspendOrigin remove(final SuspendOrigin other) { + if (other == null || other == NONE || this == NONE) { + return this; + } + if (this == BOTH) { + return other == SELF ? PEER : SELF; + } + return this == other ? NONE : this; + } + } + + /** Coarse hint about whether the bundle's {@link + * org.apache.skywalking.oap.server.core.classloader.RuleClassLoader} has been retired + * and, if so, whether the JVM has confirmed collection. */ + public enum LoaderGc { + /** The loader is alive, serving active classes. */ + LIVE, + /** The loader is retired but the manager's graveyard phantom reference has not + * fired yet. Brief window is expected; persistent presence is the leak signal. */ + PENDING, + /** Confirmed collected by the JVM (phantom fired). */ + COLLECTED + } + + private final String catalog; + private final String name; + private final String contentHash; + private final LocalState localState; + /** + * Why the bundle is SUSPENDED. {@link SuspendOrigin#NONE} when {@link #localState} is not + * SUSPENDED; one of SELF / PEER / BOTH otherwise. 
The REST handler's own apply contributes + * SELF; an inbound Suspend RPC from a peer contributes PEER. Split tracking is load-bearing + * for the Resume RPC — Resume must only undo PEER; it must never race the main-node's own + * in-flight apply by clearing a SELF origin. + */ + private final SuspendOrigin suspendOrigin; + private final LoaderGc loaderGc; + private final String lastApplyError; + private final long lastAppliedAtMs; + private final long enteredCurrentStateAtMs; + /** + * Monotonic clock stamp paired with {@link #enteredCurrentStateAtMs}. Wall-clock is kept + * for {@code /list} operator readability; this field is the source used for threshold + * arithmetic (self-heal timeout, stale-loader WARN) so an NTP jump or a backwards wall- + * clock tick can never make a SUSPENDED bundle appear younger or older than it actually + * is. Both stamps are advanced together on every state transition. + */ + private final long enteredCurrentStateAtNanos; + + public DSLRuntimeState(final String catalog, final String name, final String contentHash, + final LocalState localState, final LoaderGc loaderGc, + final String lastApplyError, final long lastAppliedAtMs, + final long enteredCurrentStateAtMs) { + this(catalog, name, contentHash, localState, SuspendOrigin.NONE, loaderGc, + lastApplyError, lastAppliedAtMs, enteredCurrentStateAtMs, System.nanoTime()); + } + + public DSLRuntimeState(final String catalog, final String name, final String contentHash, + final LocalState localState, final LoaderGc loaderGc, + final String lastApplyError, final long lastAppliedAtMs, + final long enteredCurrentStateAtMs, + final long enteredCurrentStateAtNanos) { + this(catalog, name, contentHash, localState, SuspendOrigin.NONE, loaderGc, + lastApplyError, lastAppliedAtMs, enteredCurrentStateAtMs, enteredCurrentStateAtNanos); + } + + public DSLRuntimeState(final String catalog, final String name, final String contentHash, + final LocalState localState, final SuspendOrigin suspendOrigin, 
+ final LoaderGc loaderGc, + final String lastApplyError, final long lastAppliedAtMs, + final long enteredCurrentStateAtMs, + final long enteredCurrentStateAtNanos) { + this.catalog = catalog; + this.name = name; + this.contentHash = contentHash; + this.localState = localState; + this.suspendOrigin = suspendOrigin == null ? SuspendOrigin.NONE : suspendOrigin; + this.loaderGc = loaderGc; + this.lastApplyError = lastApplyError; + this.lastAppliedAtMs = lastAppliedAtMs; + this.enteredCurrentStateAtMs = enteredCurrentStateAtMs; + this.enteredCurrentStateAtNanos = enteredCurrentStateAtNanos; + } + + public static DSLRuntimeState running(final String catalog, final String name, + final String contentHash, final long nowMs) { + return new DSLRuntimeState(catalog, name, contentHash, LocalState.RUNNING, + SuspendOrigin.NONE, LoaderGc.LIVE, null, nowMs, nowMs, System.nanoTime()); + } + + /** + * First-time apply failed before any registration completed — nothing is live locally, + * so {@link LocalState#NOT_LOADED} is the correct state, and {@code contentHash} is + * deliberately left {@code null} so the dslManager tick's short-circuit (which compares + * {@code prev.getContentHash()} against the DB's current hash) does NOT skip the file. + * The next tick will re-classify and retry the apply. + * + *
<p>
Callers chain {@link #withApplyError} to record the diagnostic. + */ + public static DSLRuntimeState failedFirstApply(final String catalog, final String name, final long nowMs) { + return new DSLRuntimeState(catalog, name, /* contentHash */ null, LocalState.NOT_LOADED, + SuspendOrigin.NONE, LoaderGc.LIVE, null, nowMs, nowMs, System.nanoTime()); + } + + public DSLRuntimeState withLocalState(final LocalState newState, final long nowMs) { + if (this.localState == newState) { + return this; + } + // Non-SUSPENDED transitions clear origin — it's only meaningful while SUSPENDED. + // Callers flipping SUSPENDED origin should use withSuspendOrigin which handles the + // paired SUSPENDED↔RUNNING flip when the origin lattice drains. + final SuspendOrigin newOrigin = newState == LocalState.SUSPENDED + ? suspendOrigin : SuspendOrigin.NONE; + return new DSLRuntimeState(catalog, name, contentHash, newState, newOrigin, loaderGc, + lastApplyError, lastAppliedAtMs, nowMs, System.nanoTime()); + } + + /** + * Apply an origin mutation. Transitions: + *

<ul>
+ * <li>RUNNING + non-NONE origin → SUSPENDED with that origin.</li>
+ * <li>SUSPENDED + NONE origin → RUNNING (origin lattice drained).</li>
+ * <li>SUSPENDED + non-NONE origin → SUSPENDED with that origin (e.g., peer cleared + * but self still set → stays SUSPENDED, origin flips from BOTH to SELF).</li>
+ * </ul>
+ * {@code enteredCurrentState*} timestamps advance on every origin transition — not only + * SUSPENDED↔RUNNING flips — because self-heal measures "how long has the bundle been at + * its current effective (state, origin) tuple". In particular, when origin transitions + * BOTH→PEER (local REST apply finishes, peer suspend still in effect), self-heal's + * threshold must count from that moment, not from when the bundle first became + * SUSPENDED with SELF origin. Otherwise self-heal fires prematurely before the PEER-only + * window has actually elapsed. + */ + public DSLRuntimeState withSuspendOrigin(final SuspendOrigin newOrigin, final long nowMs) { + final SuspendOrigin effective = newOrigin == null ? SuspendOrigin.NONE : newOrigin; + final LocalState newLocal = effective == SuspendOrigin.NONE + ? (localState == LocalState.SUSPENDED ? LocalState.RUNNING : localState) + : LocalState.SUSPENDED; + if (newLocal == localState && effective == suspendOrigin) { + return this; + } + return new DSLRuntimeState(catalog, name, contentHash, newLocal, effective, loaderGc, + lastApplyError, lastAppliedAtMs, + nowMs, System.nanoTime()); + } + + public DSLRuntimeState withLoaderGc(final LoaderGc newGc) { + if (this.loaderGc == newGc) { + return this; + } + return new DSLRuntimeState(catalog, name, contentHash, localState, suspendOrigin, newGc, + lastApplyError, lastAppliedAtMs, enteredCurrentStateAtMs, enteredCurrentStateAtNanos); + } + + public DSLRuntimeState withApplyError(final String err, final long nowMs) { + return new DSLRuntimeState(catalog, name, contentHash, localState, suspendOrigin, loaderGc, + err, nowMs, enteredCurrentStateAtMs, enteredCurrentStateAtNanos); + } + + /** + * Advance the bundle's content hash and mark this as a successful apply. 
Clears any + * {@code lastApplyError} carried over from a previous failed attempt (a successful apply + * on the new content means whatever error the old content raised is no longer relevant) + * and stamps {@code lastAppliedAtMs} with the current wall-clock time. + * + *
<p>
No-op when the hash is unchanged — state is already current. + */ + public DSLRuntimeState withContentHash(final String newHash, final long nowMs) { + if (Objects.equals(this.contentHash, newHash)) { + return this; + } + return new DSLRuntimeState(catalog, name, newHash, localState, suspendOrigin, loaderGc, + /* lastApplyError */ null, /* lastAppliedAtMs */ nowMs, + nowMs, System.nanoTime()); + } + + public String getCatalog() { + return catalog; + } + + public String getName() { + return name; + } + + public String getContentHash() { + return contentHash; + } + + public LocalState getLocalState() { + return localState; + } + + public SuspendOrigin getSuspendOrigin() { + return suspendOrigin; + } + + public LoaderGc getLoaderGc() { + return loaderGc; + } + + public String getLastApplyError() { + return lastApplyError; + } + + public long getLastAppliedAtMs() { + return lastAppliedAtMs; + } + + public long getEnteredCurrentStateAtMs() { + return enteredCurrentStateAtMs; + } + + public long getEnteredCurrentStateAtNanos() { + return enteredCurrentStateAtNanos; + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/EngineApplied.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/EngineApplied.java new file mode 100644 index 000000000000..c2fe3f49face --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/EngineApplied.java @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.state; + +import java.util.Set; +import org.apache.skywalking.oap.server.library.module.ModuleManager; + +/** + * Engine-opaque per-rule applied artefact slot on {@link AppliedRuleScript}. Every engine's + * concrete {@code Applied} type ({@code MalFileApplier.Applied}, + * {@code LalFileApplier.Applied}) implements this interface, so cross-DSL code — the + * Suspend/Resume coordinator, cross-file ownership guard, classloader graveyard hand-off — + * can drive dispatch and claim queries polymorphically without switching on MAL vs LAL. + * + *
<p>
Engines cast back to their richer concrete type whenever they need the full applied + * shape (e.g. MAL's compile path needs the {@code Rule} + {@code MetricConvert} that the + * generic interface deliberately doesn't expose). + * + *
<p>
Two design constraints worth flagging: + *

<ul>
+ * <li>No held module references. {@link #suspendDispatch}/{@link #resumeDispatch} + * receive a {@link ModuleManager} on each call so the {@code Applied} stays a plain + * data carrier and survives module reloads / test harness rewires that swap + * MeterSystem or LAL Factory under it.</li>
+ * <li>Empty ≠ unsupported. Engines without alarm semantics (LAL) return + * {@link java.util.Collections#emptySet()} from {@link #alarmResetTargets()}; the + * coordinator interprets empty as "nothing to reset" rather than "this engine doesn't + * support alarms" — both readings drive the same no-op.</li>
+ * </ul>
+ */ +public interface EngineApplied { + + /** + * Park dispatch / mark this bundle as suspended for sample handling. For MAL: route + * {@link org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem#suspendDispatch} + * across the registered metric names. For LAL: drive + * {@code LogFilterListener.Factory.suspend} on the registered rule keys. + * + * @param moduleManager looked up on each call to avoid holding a stale reference + * @return number of dispatch primitives successfully paused (metric names for MAL, + * rule keys for LAL); {@code 0} if the engine's runtime services aren't + * resolvable (early boot, embedded test topology) — caller treats as a no-op. + */ + int suspendDispatch(ModuleManager moduleManager); + + /** Inverse of {@link #suspendDispatch}: resume dispatch for this bundle, return the + * count of primitives un-parked. */ + int resumeDispatch(ModuleManager moduleManager); + + /** + * Cluster-wide unique keys this bundle claims on the active side — metric names (MAL) + * or {@code (layer, ruleName)} keys (LAL). The cross-file ownership guard reads this + * to detect collisions: another active file claiming the same key is a config error + * the operator must resolve via {@code /inactivate} or {@code /delete} on one of them. + * + * @return immutable, possibly empty set of claimed keys + */ + Set claimedKeys(); + + /** + * Per-file classloader that owns generated DSL classes for this bundle, or {@code null} + * for bundles applied without a dedicated loader (boot-seeded static rules; legacy + * 2-arg LAL apply entry point). Cross-DSL teardown reads this to retire the loader + * through {@code ClassLoaderGc} so GC of the generated classes is observable. + * + *
<p>
Returned as {@link Object} so the {@code state} package doesn't need to import the + * concrete {@code RuleClassLoader} type from {@code classloader}; teardown code casts + * before passing to {@code ClassLoaderGc.retire}. + */ + Object classLoader(); + + /** + * Metric names whose alarm window the engine wants reset on tear-down. MAL returns the + * metric set the bundle owned; LAL returns an empty set (alarm windows key off metric + * names, not log rules). + * + * @return immutable, possibly empty set of metric names + */ + Set alarmResetTargets(); +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/util/ContentHash.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/util/ContentHash.java new file mode 100644 index 000000000000..0b4f525c8d18 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/util/ContentHash.java @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.util; + +import java.nio.charset.StandardCharsets; +import java.security.MessageDigest; +import java.security.NoSuchAlgorithmException; + +/** + * SHA-256 hex digest used throughout the dslManager as the byte-identity of a rule file's + * content. Replaces a stored monotonic version column — last-write-wins storage plus + * content-hash comparison on every tick is the design's convergence mechanism. + */ +public final class ContentHash { + + private ContentHash() { + } + + public static String sha256Hex(final String content) { + if (content == null) { + return ""; + } + try { + final MessageDigest md = MessageDigest.getInstance("SHA-256"); + final byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8)); + final StringBuilder sb = new StringBuilder(64); + for (final byte b : digest) { + sb.append(String.format("%02x", b)); + } + return sb.toString(); + } catch (final NoSuchAlgorithmException e) { + // SHA-256 is required by every JVM per the specification — this cannot happen. + throw new IllegalStateException("SHA-256 not available on this JVM", e); + } + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/proto/runtime-rule-cluster.proto b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/proto/runtime-rule-cluster.proto new file mode 100644 index 000000000000..2e3ecd6c233b --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/proto/runtime-rule-cluster.proto @@ -0,0 +1,161 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +syntax = "proto3"; + +package skywalking.v3.runtime_rule; + +option java_multiple_files = true; +option java_package = "org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1"; + +// Cluster-internal RPCs used during a structural runtime-rule update. Updates are routed to +// a single deterministic main node per (catalog, name); the main broadcasts Suspend to every +// peer so they stop running the old bundle while the main applies DDL and verifies it. On +// success the main persists the new row; peers converge on the next dslManager tick. On +// failure (compile, verify, persist) the main broadcasts Resume so peers flip back to RUNNING +// within an RPC round-trip instead of waiting the 60 s self-heal window. +// +// Self-heal in the dslManager is the backstop for the narrow case where the main crashes +// after Suspend but before Resume — peers whose SUSPENDED state has been held by a peer +// origin past the self-heal threshold and whose DB content has not advanced flip themselves +// back to RUNNING on the retained old bundle. +service RuntimeRuleClusterService { + rpc Suspend(SuspendRequest) returns (SuspendAck); + rpc Resume(ResumeRequest) returns (ResumeAck); + // Forward a write request to the hash-selected main node for (catalog, name). 
Any OAP + // that receives an /addOrUpdate / /inactivate / /delete HTTP request for a file it isn't + // the main of wraps the HTTP intent in a ForwardRequest and calls this RPC against the + // main's cluster channel. The main runs the full local workflow and returns the HTTP + // status + JSON body in the ack. Eliminates the "client must resubmit" round-trip from + // the old 421 Misdirected Request behaviour. + rpc Forward(ForwardRequest) returns (ForwardResponse); +} + +message SuspendRequest { + // otel-rules | log-mal-rules | lal + string catalog = 1; + // relative path under the catalog root, no extension, may contain '/' + string name = 2; + // informational; logged on the receiver for diagnostic trails + string reason = 3; + // main node's instance id — receivers compare against their own id to suppress any + // accidental self-broadcast loops that would otherwise drain the local bundle twice + string sender_node_id = 4; + // epoch millis on the main node. Not used for causality decisions (self-heal uses + // monotonic clock on the receiver) but useful for latency diagnostics in logs. + int64 issued_at_ms = 5; +} + +message SuspendAck { + string node_id = 1; + SuspendState state = 2; + // human-readable audit trail (e.g. "drained 7 metric classes, 2 lal rules") + string detail = 3; +} + +enum SuspendState { + SUSPEND_STATE_UNSPECIFIED = 0; + // was ACTIVE on this node, now suspended — new response for this tick + SUSPENDED = 1; + // idempotent replay of a Suspend the receiver has already honored — no state change + ALREADY_SUSPENDED = 2; + // bundle not present locally; main node proceeds, nothing to drain here + NOT_PRESENT = 3; + // receiver refused because of an origin conflict (this node is itself mid-apply as + // SELF origin, meaning two OAPs believe they're main for this file — routing failure + // or split-brain). `detail` carries the diagnostic message. + REJECTED = 4; +} + +// Resume is the inverse of Suspend. 
Sent by the main node right after a local apply failure +// (compile / verify / persist) so peers flip back to RUNNING without waiting for the 60 s +// self-heal threshold. Clears only the PEER origin on the receiver — a receiver that happens +// to be SELF-suspended for its own in-flight apply ignores Resume for the SELF bit. +message ResumeRequest { + string catalog = 1; + string name = 2; + string reason = 3; + string sender_node_id = 4; + int64 issued_at_ms = 5; +} + +message ResumeAck { + string node_id = 1; + ResumeState state = 2; + string detail = 3; +} + +enum ResumeState { + RESUME_STATE_UNSPECIFIED = 0; + // bundle transitioned SUSPENDED → RUNNING (PEER origin cleared; no other origin held) + RESUMED = 1; + // PEER origin was not set on this node (never received Suspend, or already cleared by + // self-heal, or another Resume replay). Idempotent, no state change. + NOT_SUSPENDED_BY_SENDER = 2; + // bundle not present locally — nothing to flip + RESUME_NOT_PRESENT = 3; + // PEER cleared but SELF is still set (rare BOTH state); bundle stays SUSPENDED until + // local apply finishes. `detail` describes the BOTH situation for audit. + PARTIALLY_RESUMED = 4; +} + +// ForwardRequest wraps an incoming HTTP write request so the receiving (non-main) node can +// hand the work off to the hash-selected main. Receiver runs the operation as if it received +// the HTTP directly, then packages the HTTP response into ForwardResponse. Two operator- +// visible behaviours: +// +// 1. Normal case: non-main wraps + forwards; main runs the workflow; operator sees the +// main's response transparently without having to resubmit. +// 2. Routing-misfire fail-safe: if the receiver of a ForwardRequest ALSO isn't the main +// (its own cluster view disagrees with the sender's), it does NOT re-forward — +// responds with http_status=421 and a diagnostic body. This bounds cluster ping-pong +// at one hop and signals an operator that two nodes disagree about cluster membership. 
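The two behaviours above can be sketched as a small routing function. This is a minimal illustration under stated assumptions, not the handler in this patch: `ForwardOneHopSketch`, `selectMain`, `handleHttp`, `handleForward`, and the per-node `views` map are hypothetical names, and the hash-modulo selection only stands in for whatever main-selection the cluster actually uses. The returned integers mirror the `http_status` field of `ForwardResponse`.

```java
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these names come from the patch. */
public final class ForwardOneHopSketch {

    private ForwardOneHopSketch() {
    }

    /** Hypothetical stand-in for the hash-based selection of the main node. */
    public static String selectMain(final List<String> view,
                                    final String catalog, final String name) {
        final int idx = Math.floorMod((catalog + ":" + name).hashCode(), view.size());
        return view.get(idx);
    }

    /**
     * HTTP entry point on node {@code self}. {@code views} maps each node id to that
     * node's own view of cluster membership (an assumption made for this sketch).
     */
    public static int handleHttp(final String self, final Map<String, List<String>> views,
                                 final String catalog, final String name) {
        final String main = selectMain(views.get(self), catalog, name);
        if (main.equals(self)) {
            // This node is the main: run the local workflow. A real handler could
            // also answer 400, 409 or 500 here; the sketch always succeeds.
            return 200;
        }
        // Not the main: wrap the request and forward exactly once.
        return handleForward(main, views, catalog, name);
    }

    /** Receiver side of Forward: never re-forwards, answers 421 when it is not the main. */
    public static int handleForward(final String receiver,
                                    final Map<String, List<String>> views,
                                    final String catalog, final String name) {
        final String main = selectMain(views.get(receiver), catalog, name);
        return main.equals(receiver) ? 200 : 421;
    }
}
```

With a shared membership view, every entry node ends at 200 regardless of which node the hash picks. When two nodes each believe the other is the main, the single forwarded hop ends at 421 instead of bouncing between them.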
+message ForwardRequest { + // "addOrUpdate" | "inactivate" | "delete" + string operation = 1; + string catalog = 2; + string name = 3; + // Raw HTTP body — the rule YAML bytes for addOrUpdate; empty for inactivate / delete. + // Bytes not string so a binary-safe round-trip is guaranteed even though YAML is UTF-8 + // text in practice. + bytes body = 4; + // Maps to the /addOrUpdate?allowStorageChange= query param; ignored for inactivate / + // delete. + bool allow_storage_change = 5; + // Maps to the /addOrUpdate?force= query param. True for recovery pushes — forces the + // bypass of the byte-identical no-change short-circuit so a same-content re-push drives + // the full pipeline. Ignored for inactivate / delete. + bool force_reapply = 6; + // Sender's instance id. Receivers log it for diagnostic trails; the fail-safe also uses + // it to spot pathological self-forward loops before they cause damage. + string sender_node_id = 7; + // Epoch millis on the forwarding node. Diagnostic only; not used for causality. + int64 issued_at_ms = 8; +} + +message ForwardResponse { + // Mirror of the HTTP status the main's REST handler would have returned (200, 400, + // 409, 421, 500, etc.). Forwarder relays this verbatim to the original HTTP caller. + int32 http_status = 1; + // Mirror of the HTTP response body (JSON) the main would have returned. + string body = 2; + // Instance id of the node that actually produced the response — normally the main, + // but on the fail-safe path (receiver also isn't main) it's still the receiver so the + // operator can correlate which node refused. 
+ string node_id = 3; +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.core.rule.ext.RuntimeRuleOverrideResolver b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.core.rule.ext.RuntimeRuleOverrideResolver new file mode 100644 index 000000000000..81670580fc53 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.core.rule.ext.RuntimeRuleOverrideResolver @@ -0,0 +1,18 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +org.apache.skywalking.oap.server.receiver.runtimerule.extension.DbOverrideRuntimeRuleResolver diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.library.module.ModuleDefine b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.library.module.ModuleDefine new file mode 100644 index 000000000000..1aea677bc563 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.library.module.ModuleDefine @@ -0,0 +1,19 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# +# + +org.apache.skywalking.oap.server.receiver.runtimerule.module.RuntimeRuleModule diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.library.module.ModuleProvider b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.library.module.ModuleProvider new file mode 100644 index 000000000000..18706c532f68 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/resources/META-INF/services/org.apache.skywalking.oap.server.library.module.ModuleProvider @@ -0,0 +1,19 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# +# + +org.apache.skywalking.oap.server.receiver.runtimerule.module.RuntimeRuleModuleProvider diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/TestSampleFamily.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/TestSampleFamily.java new file mode 100644 index 000000000000..884a12877323 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/TestSampleFamily.java @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.meter.analyzer.v2.dsl; + +import com.google.common.collect.ImmutableMap; + +/** + * Test-only factory for {@link SampleFamily}. 
Lives in the same package as {@code SampleFamily} + * so it can call the package-private {@code SampleFamily.build(RunningContext, Sample...)} + * factory. The constructor is private, so production callers come through + * {@code SampleFamily.build}; ITs need the same privileged access to feed synthetic samples + * into the real MAL pipeline without going through a receiver. + * + * <p>
Used by the runtime-rule ITs: tests construct a {@code SampleFamily} with controlled + * labels / values / timestamps, hand it to the applied rule's {@code MetricConvert.toMeter} + * and then verify the derived measure lands in BanyanDB. + */ +public final class TestSampleFamily { + + private TestSampleFamily() { + } + + /** + * Build a non-empty {@link SampleFamily} from one or more {@link Sample}s. + * Delegates to {@code SampleFamily.build} which filters {@code NaN} samples and returns + * {@link SampleFamily#EMPTY} if all are filtered — identical semantics to the production + * path that reaches this factory from a receiver. + */ + public static SampleFamily of(final Sample... samples) { + return SampleFamily.build(SampleFamily.RunningContext.EMPTY, samples); + } + + /** + * Build a {@link Sample} — convenience wrapper that hides Lombok's Builder behind a + * call-site-friendly signature. Labels are passed as an {@link ImmutableMap} so tests + * don't import {@code ImmutableMap.Builder} boilerplate. 
+ */ + public static Sample sample(final String metricName, + final long timestampMillis, + final double value, + final ImmutableMap<String, String> labels) { + return Sample.builder() + .name(metricName) + .labels(labels) + .value(value) + .timestamp(timestampMillis) + .build(); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DSLDeltaTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DSLDeltaTest.java new file mode 100644 index 000000000000..f70e5f4db95e --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DSLDeltaTest.java @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.apply; + +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.Classification; + +import java.util.Collections; +import java.util.HashSet; +import java.util.Set; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertThrows; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class DSLDeltaTest { + + @Test + void noChangeCarriesEmptySetsAndReason() { + final DSLDelta d = DSLDelta.noChange(); + assertEquals(Classification.NO_CHANGE, d.classification()); + assertTrue(d.addedMetrics().isEmpty()); + assertTrue(d.removedMetrics().isEmpty()); + assertTrue(d.shapeBreakMetrics().isEmpty()); + // Reason string is surfaced in HTTP 200 response bodies for observability — asserting + // on it pins the contract that operator scripts and dashboards parse. + assertEquals("content byte-identical", d.reason()); + assertTrue(d.alarmResetSet().isEmpty()); + } + + @Test + void newRuleReportsAddedMetricsButEmptyAlarmResetSet() { + // A brand-new bundle has metrics to register but no prior alarm windows to reset — + // AlarmKernelService.reset on nonexistent metrics is wasted work the design wants to + // skip, so alarmResetSet() must be empty for NEW regardless of addedMetrics. 
+ final Set<String> metrics = new HashSet<>(); + metrics.add("meter_vm_cpu"); + metrics.add("meter_vm_mem"); + final DSLDelta d = DSLDelta.newRule(metrics); + assertEquals(Classification.NEW, d.classification()); + assertEquals(metrics, d.addedMetrics()); + assertTrue(d.alarmResetSet().isEmpty()); + } + + @Test + void filterOnlyCarriesCustomReason() { + final DSLDelta d = DSLDelta.filterOnly("filter expression body changed"); + assertEquals(Classification.FILTER_ONLY, d.classification()); + assertEquals("filter expression body changed", d.reason()); + assertTrue(d.alarmResetSet().isEmpty()); + } + + @Test + void structuralAlarmResetIsUnionOfAddedRemovedShapeBreak() { + // Every metric whose semantics moved must have its alarm windows cleared — + // that's metrics we just created, metrics we just dropped, and metrics whose shape + // broke under us. Union of the three sets is the authoritative reset target. + final Set<String> added = setOf("m_new1", "m_new2"); + final Set<String> removed = setOf("m_old"); + final Set<String> shape = setOf("m_shape"); + final DSLDelta d = DSLDelta.structural(added, removed, shape, "cpu scope moved"); + assertEquals(Classification.STRUCTURAL, d.classification()); + + final Set<String> expected = new HashSet<>(); + expected.addAll(added); + expected.addAll(removed); + expected.addAll(shape); + assertEquals(expected, d.alarmResetSet()); + } + + @Test + void structuralAlarmResetDedupsOverlappingMetrics() { + // A metric that's both "shape-broken" and "added" (because we re-generate the class) + // would show up twice if we concatenated blindly. The HashSet-union contract prevents + // duplicate reset calls, which matters for cost (reset walks every running alarm rule). 
+ final Set<String> added = setOf("m_overlap"); + final Set<String> shape = setOf("m_overlap"); + final DSLDelta d = DSLDelta.structural(added, Collections.emptySet(), shape, "x"); + assertEquals(setOf("m_overlap"), d.alarmResetSet()); + } + + @Test + void alarmResetSetIsUnmodifiable() { + // The set is published to AlarmKernelService and iterated by a live thread — if the + // caller could mutate it after the fact, we would get ConcurrentModificationException + // mid-reset. Unmodifiable wrapper is the design's cheapest guard. + final DSLDelta d = DSLDelta.structural( + setOf("a"), Collections.emptySet(), Collections.emptySet(), "x"); + assertThrows(UnsupportedOperationException.class, + () -> d.alarmResetSet().add("intruder")); + } + + @Test + void nullSetsAreNormalizedToEmpty() { + // Defensive: callers building DSLDelta from diff code sometimes pass null when a + // set is absent rather than Collections.emptySet(). Normalizing keeps downstream code + // free of null checks. + final DSLDelta d = new DSLDelta( + Classification.STRUCTURAL, null, null, null, null); + assertTrue(d.addedMetrics().isEmpty()); + assertTrue(d.removedMetrics().isEmpty()); + assertTrue(d.shapeBreakMetrics().isEmpty()); + assertEquals("", d.reason()); + } + + private static Set<String> setOf(final String... s) { + final Set<String> r = new HashSet<>(); + Collections.addAll(r, s); + return r; + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DeltaClassifierTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DeltaClassifierTest.java new file mode 100644 index 000000000000..c0abff9285b3 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/DeltaClassifierTest.java @@ -0,0 +1,285 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.apply; + +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.Classification; + +import java.util.Collections; +import java.util.LinkedHashSet; +import java.util.Set; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertThrows; +import static org.junit.jupiter.api.Assertions.assertTrue; + +class DeltaClassifierTest { + + private static final String MAL_TWO_METRICS = + "metricPrefix: meter_vm\n" + + "expSuffix: service(['host'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + "  - name: cpu\n" + + "    exp: cpu_seconds.sum(['host'])\n" + + "  - name: mem\n" + + "    exp: mem_bytes.sum(['host'])\n"; + + private static final String MAL_TWO_METRICS_BODY_CHANGE = + "metricPrefix: meter_vm\n" + + "expSuffix: service(['host'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + "  - name: cpu\n" + + "    exp: cpu_seconds.sum(['host']).rate('PT1M')\n" + + "  - name: mem\n" + + "    exp: mem_bytes.sum(['host'])\n"; + + private static final String MAL_ONE_METRIC = + "metricPrefix: meter_vm\n" + + "expSuffix: service(['host'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + "  - name: cpu\n" + + "    exp: cpu_seconds.sum(['host'])\n"; + + private static final String MAL_WITH_EXTRA_METRIC = + "metricPrefix: meter_vm\n" + + "expSuffix: service(['host'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + "  - name: cpu\n" + + "    exp: cpu_seconds.sum(['host'])\n" + + "  - name: mem\n" + + "    exp: mem_bytes.sum(['host'])\n" + + "  - name: disk\n" + + "    exp: disk_bytes.sum(['host'])\n"; + + @Test + void byteIdenticalReturnsNoChange() { + // Sanity — the dslManager's hash short-circuit already catches this before calling us, + // but the classifier must return NO_CHANGE for the same case so any direct caller + // (REST handler, tests) gets identical semantics. 
+ final DSLDelta d = DeltaClassifier.classifyMal(MAL_TWO_METRICS, MAL_TWO_METRICS); + assertEquals(Classification.NO_CHANGE, d.classification()); + assertTrue(d.alarmResetSet().isEmpty()); + } + + @Test + void oldNullIsNew() { + // First apply on this node (or after a previous unregister). All new metric names are + // reported as {@code addedMetrics}, but {@code alarmResetSet()} is empty because no + // prior windows existed to reset — matches {@link DSLDelta#newRule} where + // alarmResetSet() is deliberately empty for first-apply cases. + final DSLDelta d = DeltaClassifier.classifyMal(null, MAL_TWO_METRICS); + assertEquals(Classification.NEW, d.classification()); + assertEquals(setOf("meter_vm_cpu", "meter_vm_mem"), d.addedMetrics()); + assertTrue(d.alarmResetSet().isEmpty()); + } + + @Test + void newNullIsStructuralRemoval() { + // The classifier's contract for "bundle is going away": every prior metric name lands + // in removedMetrics, and alarmResetSet() contains them so the kernel clears windows + // whose subjects just vanished. Used by /delete and status→INACTIVE paths. + final DSLDelta d = DeltaClassifier.classifyMal(MAL_TWO_METRICS, null); + assertEquals(Classification.STRUCTURAL, d.classification()); + assertEquals(setOf("meter_vm_cpu", "meter_vm_mem"), d.removedMetrics()); + assertEquals(setOf("meter_vm_cpu", "meter_vm_mem"), d.alarmResetSet()); + } + + @Test + void metricAddedIsStructural() { + // One metric added, two unchanged (identical shape on the common two). Shape diff + // correctly keeps them out of the shape-break set now that per-metric extraction is + // in place — only "disk" lands in added, shape-break is empty. 
+ final DSLDelta d = DeltaClassifier.classifyMal(MAL_TWO_METRICS, MAL_WITH_EXTRA_METRIC); + assertEquals(Classification.STRUCTURAL, d.classification()); + assertEquals(setOf("meter_vm_disk"), d.addedMetrics()); + assertTrue(d.removedMetrics().isEmpty()); + assertTrue(d.shapeBreakMetrics().isEmpty(), + "cpu and mem have unchanged shapes; only the newly-added metric is structural"); + assertTrue(d.alarmResetSet().contains("meter_vm_disk")); + } + + @Test + void metricRemovedIsStructural() { + // One metric dropped ("mem"). removedMetrics carries it — the dslManager / applier + // calls MeterSystem.removeMetric for every name in this set. + final DSLDelta d = DeltaClassifier.classifyMal(MAL_TWO_METRICS, MAL_ONE_METRIC); + assertEquals(Classification.STRUCTURAL, d.classification()); + assertEquals(setOf("meter_vm_mem"), d.removedMetrics()); + assertTrue(d.addedMetrics().isEmpty()); + assertTrue(d.alarmResetSet().contains("meter_vm_mem")); + } + + @Test + void bodyOnlyChangeIsFilterOnly() { + // FILTER_ONLY fast path: same metric names, same (functionName, scopeType) for every + // metric — only the expression body of "cpu" changed (added .rate('PT1M')). Shape + // extraction confirms both metrics still resolve to the same storage class, so the + // dslManager swaps Analyzers without MeterSystem.removeMetric + create round-trip + // and without any alarm window reset. + final DSLDelta d = DeltaClassifier.classifyMal(MAL_TWO_METRICS, MAL_TWO_METRICS_BODY_CHANGE); + assertEquals(Classification.FILTER_ONLY, d.classification()); + assertTrue(d.alarmResetSet().isEmpty(), + "FILTER_ONLY must not drive alarm windows off — shapes unchanged"); + } + + @Test + void scopeChangeIsStructuralWithShapeBreak() { + // Swap the expSuffix from service(...) to instance(...) — same metric names, + // different scope type. Shape diff on every metric → STRUCTURAL with every common + // metric in shape-break → alarm reset targets them all. 
+ final String withInstanceScope = MAL_TWO_METRICS.replace( + "service(['host'], Layer.OS_LINUX)", + "instance(['host'], Layer.OS_LINUX)"); + final DSLDelta d = DeltaClassifier.classifyMal(MAL_TWO_METRICS, withInstanceScope); + assertEquals(Classification.STRUCTURAL, d.classification()); + assertEquals(setOf("meter_vm_cpu", "meter_vm_mem"), d.shapeBreakMetrics()); + assertEquals(setOf("meter_vm_cpu", "meter_vm_mem"), d.alarmResetSet()); + } + + @Test + void downsamplingFunctionChangeIsStructural() { + // Explicit .downsampling(SUM) on cpu — same metric name, different storage-side + // downsampling type. cpu ends up in shape-break; mem (unchanged, default AVG) stays + // out. The storage-level change is exactly what the allowStorageChange guardrail on + // the REST handler keys off to reject dangerous edits unless explicitly approved. + final String withExplicitSum = MAL_TWO_METRICS.replace( + "exp: cpu_seconds.sum(['host'])", + "exp: cpu_seconds.sum(['host']).downsampling(SUM)"); + final DSLDelta d = DeltaClassifier.classifyMal(MAL_TWO_METRICS, withExplicitSum); + assertEquals(Classification.STRUCTURAL, d.classification()); + assertEquals(setOf("meter_vm_cpu"), d.shapeBreakMetrics()); + } + + @Test + void malformedYamlThrowsOnNewSide() { + // Malformed new content is unrecoverable — the caller must surface an apply error. + assertThrows(IllegalArgumentException.class, + () -> DeltaClassifier.classifyMal(MAL_TWO_METRICS, "this: is: not: valid: yaml")); + } + + @Test + void malformedOldContentIsToleratedOnRemoval() { + // If the prior content somehow became unparseable (race, corruption), classifying + // against null newContent still succeeds. The removed set comes up empty — the caller + // falls back to MalFileApplier.Applied.getRegisteredMetricNames for the authoritative + // prior metric list, so this degradation is safe. 
+ final DSLDelta d = DeltaClassifier.classifyMal("garbage: not: valid", null); + assertEquals(Classification.STRUCTURAL, d.classification()); + // Empty because safeEnumerateMalNames swallowed the parse failure on the old side. + assertTrue(d.removedMetrics().isEmpty()); + } + + @Test + void lalByteIdenticalIsNoChange() { + final String lal = "rules:\n  - name: r1\n    layer: MESH\n    dsl: 'filter { sink {} }'\n"; + final DSLDelta d = DeltaClassifier.classifyLal(lal, lal); + assertEquals(Classification.NO_CHANGE, d.classification()); + } + + @Test + void lalOldNullIsNew() { + final String lal = "rules:\n  - name: r1\n    layer: MESH\n    dsl: 'filter { sink {} }'\n"; + final DSLDelta d = DeltaClassifier.classifyLal(null, lal); + assertEquals(Classification.NEW, d.classification()); + // LAL NEW carries an empty set today — alarm reset doesn't target LAL rule keys + // directly, and inline-MAL extraction is a follow-up. + assertTrue(d.alarmResetSet().isEmpty()); + } + + @Test + void lalChangedIsStructural() { + final String a = "rules:\n  - name: r1\n    layer: MESH\n    dsl: 'filter { sink {} }'\n"; + final String b = "rules:\n  - name: r1\n    layer: MESH\n    dsl: 'filter { json {} sink {} }'\n"; + final DSLDelta d = DeltaClassifier.classifyLal(a, b); + assertEquals(Classification.STRUCTURAL, d.classification()); + } + + @Test + void enumerateLalRuleKeysHandlesAutoLayer() { + // The "auto" layer is stored as null in LALConfig.LAYER_AUTO (empty string on disk). + // enumerateLalRuleKeys must canonicalize it to "auto" so the cross-file collision + // check in the dslManager compares auto rules across files correctly. 
+ final String lal = "rules:\n" + + "  - name: r1\n" + + "    layer: MESH\n" + + "    dsl: 'filter { sink {} }'\n" + + "  - name: r2\n" + + "    layer: auto\n" + + "    dsl: 'filter { sink {} }'\n"; + final Set<String> keys = DeltaClassifier.enumerateLalRuleKeys(lal); + assertEquals(setOf("MESH:r1", "auto:r2"), keys); + } + + @Test + void enumerateLalRuleKeysOnEmptyReturnsEmpty() { + assertTrue(DeltaClassifier.enumerateLalRuleKeys("").isEmpty()); + assertTrue(DeltaClassifier.enumerateLalRuleKeys(null).isEmpty()); + } + + @Test + void lalStorageAffectingIdenticalContentIsEmpty() { + final String lal = "rules:\n  - name: r1\n    layer: MESH\n    outputType: org.apache.skywalking.oap.server.core.source.LogBuilder\n    dsl: 'filter { sink {} }'\n"; + assertTrue(DeltaClassifier.lalStorageAffectingChanges(lal, lal).isEmpty()); + } + + @Test + void lalStorageAffectingDetectsOutputTypeChange() { + // Changing outputType on a rule is exactly the "dangerous" case the REST handler + // allowStorageChange guardrail keys off — rerouting logs to a different AbstractLog + // subclass means previously-indexed rows for the old subclass are now orphaned and + // on BanyanDB the new subclass's measure is a separate target. + final String a = "rules:\n  - name: r1\n    layer: MESH\n    outputType: org.example.TypeA\n    dsl: 'filter { sink {} }'\n"; + final String b = "rules:\n  - name: r1\n    layer: MESH\n    outputType: org.example.TypeB\n    dsl: 'filter { sink {} }'\n"; + final Set<String> affected = DeltaClassifier.lalStorageAffectingChanges(a, b); + assertEquals(setOf("MESH:r1"), affected); + } + + @Test + void lalStorageAffectingDetectsRuleAddRemove() { + // Rule added (even with no outputType) still counts as storage-affecting because + // any inline metrics {} it declares would become live MAL metrics, and the mirror + // case — rule removed — drops the corresponding metric via MeterSystem.removeMetric + // and the BanyanDB measure along with it. 
+ final String a = "rules:\n  - name: r1\n    layer: MESH\n    dsl: 'filter { sink {} }'\n"; + final String b = "rules:\n" + + "  - name: r1\n    layer: MESH\n    dsl: 'filter { sink {} }'\n" + + "  - name: r2\n    layer: MESH\n    dsl: 'filter { sink {} }'\n"; + final Set<String> added = DeltaClassifier.lalStorageAffectingChanges(a, b); + assertEquals(setOf("MESH:r2"), added); + final Set<String> removed = DeltaClassifier.lalStorageAffectingChanges(b, a); + assertEquals(setOf("MESH:r2"), removed); + } + + @Test + void lalStorageAffectingDslBodyChangeIsSafe() { + // Body tweak inside the DSL with same rule keys and same outputType — the guardrail + // must not flag this as storage-affecting. Operators frequently edit filter / sink + // bodies to change extraction rules, and blocking those by default would turn the + // guardrail into a nuisance. + final String a = "rules:\n  - name: r1\n    layer: MESH\n    dsl: 'filter { sink {} }'\n"; + final String b = "rules:\n  - name: r1\n    layer: MESH\n    dsl: 'filter { json {} sink {} }'\n"; + assertTrue(DeltaClassifier.lalStorageAffectingChanges(a, b).isEmpty()); + } + + private static Set<String> setOf(final String... s) { + final Set<String> out = new LinkedHashSet<>(); + Collections.addAll(out, s); + return out; + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplierTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplierTest.java new file mode 100644 index 000000000000..2c519f59c489 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplierTest.java @@ -0,0 +1,383 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.apply; + +import java.util.Collections; +import java.util.List; +import javassist.ClassPool; +import org.apache.skywalking.oap.log.analyzer.v2.provider.LALConfig; +import org.apache.skywalking.oap.log.analyzer.v2.provider.log.listener.LogFilterListener; +import org.apache.skywalking.oap.server.core.analysis.Layer; +import org.apache.skywalking.oap.server.library.module.ModuleStartException; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.mockito.Mockito; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertNotNull; +import static org.junit.jupiter.api.Assertions.assertNull; +import static org.junit.jupiter.api.Assertions.assertThrows; +import static org.junit.jupiter.api.Assertions.assertTrue; +import static org.mockito.ArgumentMatchers.any; +import static org.mockito.ArgumentMatchers.eq; +import static org.mockito.Mockito.never; +import static org.mockito.Mockito.verify; +import static org.mockito.Mockito.when; + +/** + * Covers the LAL hot-apply path with a mocked {@link LogFilterListener.Factory}. 
LAL compile / + * register has no direct BanyanDB interaction (the {@code metrics{}} sink defers to + * MeterSystem at log-processing time, not at compile time), so the apply-path contract can be + * exercised entirely at the unit-test layer — parallel to {@link MalFileApplierTest}. + * + *

<p>The contract pinned here: YAML parse failures propagate as {@link LalFileApplier.ApplyException} + * with an empty {@code partial} list; a compile-phase failure propagates with an empty partial + * (Phase 1 aborts before any registry mutation); a register-phase failure propagates with the + * rules registered up to the point of failure so the dslManager can roll them back via + * {@link LalFileApplier#remove(LalFileApplier.Applied)}; and {@code planKeys} is a pure + * read-only inspection with no side effects on the factory. + * + *

<p>LAL DSL shape, for reference when reading the inline YAML fixtures below: + * <pre>
+ * rules:
+ *   - name: default
+ *     layer: GENERAL         # or LAYER_AUTO (=> null Layer at registration time)
+ *     dsl: |
+ *       filter {
+ *         sink {
+ *         }
+ *       }
+ * </pre>
+ * The DSL body here is never actually compiled (the factory is mocked); we only verify + * that the applier passes the right {@link LALConfig} through to {@code factory.compile} + * and that successful compilation results in {@code factory.addOrReplace} being called + * exactly once per rule. + */ +class LalFileApplierTest { + + private LogFilterListener.Factory factory; + private LalFileApplier applier; + + /** + * Baseline single-rule LAL YAML. Layer resolves to {@link Layer#GENERAL} at registration + * time, so after a successful apply the applier's {@code Applied.registered} contains + * exactly one {@code RegisteredRule(GENERAL, "default")}. + * <pre>
+     * rules:
+     *   - name: default
+     *     layer: GENERAL
+     *     dsl: |
+     *       filter { sink { } }
+     * </pre>
+ */ + private static final String VALID_LAL_YAML = + "rules:\n" + + "  - name: default\n" + + "    layer: GENERAL\n" + + "    dsl: |\n" + + "      filter {\n" + + "        sink {\n" + + "        }\n" + + "      }\n"; + + /** + * Two-rule variant of {@link #VALID_LAL_YAML} — used to verify cross-rule bookkeeping + * (all rules compiled before any gets registered; registration happens per rule in order). + * <pre>
+     *  rules:
+     *    - name: default
+     *      layer: GENERAL
+     *      dsl: |
+     *        filter { sink { } }
+     * +  - name: second
+     * +    layer: MESH
+     * +    dsl: |
+     * +      filter { sink { } }
+     * </pre>
+ */ + private static final String TWO_RULE_LAL_YAML = + VALID_LAL_YAML + + "  - name: second\n" + + "    layer: MESH\n" + + "    dsl: |\n" + + "      filter {\n" + + "        sink {\n" + + "        }\n" + + "      }\n"; + + /** + * Auto-layer variant — {@code layer: auto} is the marker used for rules that decide their + * layer at sample-time. Registration-side behaviour: {@code RegisteredRule.layer} is null. + * <pre>
+     *  rules:
+     *    - name: default
+     * -    layer: GENERAL
+     * +    layer: auto
+     *      dsl: |
+     *        filter { sink { } }
+     * </pre>
+ */ + private static final String AUTO_LAYER_LAL_YAML = + "rules:\n" + + "  - name: default\n" + + "    layer: auto\n" + + "    dsl: |\n" + + "      filter {\n" + + "        sink {\n" + + "        }\n" + + "      }\n"; + + @BeforeEach + void setUp() throws Exception { + factory = Mockito.mock(LogFilterListener.Factory.class); + // Default: factory.compile returns a CompiledLAL with the declared (layer, name) and + // a null DSL — LalFileApplier never dereferences the DSL during apply/remove, so + // passing null lets us avoid spinning up a real compiled expression. + when(factory.compile(any(LALConfig.class), any(ClassPool.class), any(ClassLoader.class))) + .thenAnswer(inv -> { + final LALConfig c = inv.getArgument(0); + final Layer layer = LALConfig.LAYER_AUTO.equalsIgnoreCase(c.getLayer()) + ? null : Layer.nameOf(c.getLayer()); + return new LogFilterListener.Factory.CompiledLAL(layer, c.getName(), null); + }); + applier = new LalFileApplier(factory); + } + + @Test + void nullYamlRaisesApplyExceptionWithEmptyPartial() { + // SnakeYAML reads null → parse path explicitly guards. Applier wraps so the caller + // catches one exception type regardless of where in the pipeline the failure landed. + final LalFileApplier.ApplyException ex = assertThrows( + LalFileApplier.ApplyException.class, + () -> applier.apply(null, "lal/default", "h0")); + assertNotNull(ex.getMessage()); + assertTrue(ex.getPartial().isEmpty(), + "nothing registered before parse failure — partial list must be empty"); + } + + @Test + void emptyYamlRaisesApplyException() { + final LalFileApplier.ApplyException ex = assertThrows( + LalFileApplier.ApplyException.class, + () -> applier.apply("", "lal/empty", "h0")); + assertNotNull(ex.getMessage()); + assertTrue(ex.getPartial().isEmpty()); + } + + @Test + void malformedYamlRaisesApplyException() { + // Garbage bytes — SnakeYAML throws during loadAs. Applier wraps into ApplyException + // so callers don't need to know the snakeyaml exception hierarchy. 
+ final LalFileApplier.ApplyException ex = assertThrows( + LalFileApplier.ApplyException.class, + () -> applier.apply("this: is: not: valid: yaml: at all", "lal/bad", "h")); + assertTrue(ex.getPartial().isEmpty()); + } + + @Test + void yamlWithoutRulesListRaisesApplyException() { + // LAL YAML must have a top-level "rules:" list with at least one entry; anything + // else is treated as a parse-level failure so the caller can surface the exact file + // name in the operator-facing error response. + final LalFileApplier.ApplyException ex = assertThrows( + LalFileApplier.ApplyException.class, + () -> applier.apply("notRules: []\n", "lal/wrongShape", "h")); + assertNotNull(ex.getMessage()); + assertTrue(ex.getPartial().isEmpty()); + } + + @Test + void successfulApplyCompilesAndRegistersEveryRule() throws Exception { + final LalFileApplier.Applied applied = applier.apply(TWO_RULE_LAL_YAML, "lal/multi", "h-ok"); + assertNotNull(applied); + assertNotNull(applied.getRuleClassLoader(), + "per-file loader must be retained so the dslManager can retire it through the " + + "graveyard on unregister"); + assertEquals(org.apache.skywalking.oap.server.core.classloader.Catalog.LAL, + applied.getRuleClassLoader().getCatalog()); + assertEquals("multi", applied.getRuleClassLoader().getRule()); + assertEquals("h-ok", applied.getRuleClassLoader().getContentHash()); + + // Both rules compiled; both registered. Order matters — the factory must see rules + // in the same order they appeared in the YAML so layer-keyed replace semantics stay + // deterministic. 
+ assertEquals(2, applied.getRegistered().size()); + assertEquals(Layer.GENERAL, applied.getRegistered().get(0).getLayer()); + assertEquals("default", applied.getRegistered().get(0).getRuleName()); + assertEquals(Layer.MESH, applied.getRegistered().get(1).getLayer()); + assertEquals("second", applied.getRegistered().get(1).getRuleName()); + + verify(factory, Mockito.times(2)) + .compile(any(LALConfig.class), any(ClassPool.class), any(ClassLoader.class)); + verify(factory, Mockito.times(2)) + .addOrReplace(any(LogFilterListener.Factory.CompiledLAL.class)); + } + + @Test + void autoLayerRuleRegistersWithNullLayer() throws Exception { + // layer: auto is a marker for rules that pick their layer at sample-time. At apply + // time the registration-side Layer is null — the factory's addOrReplace uses the + // autoDsls map, keyed on name alone, not (layer, name). + final LalFileApplier.Applied applied = + applier.apply(AUTO_LAYER_LAL_YAML, "lal/auto", "h"); + assertEquals(1, applied.getRegistered().size()); + assertNull(applied.getRegistered().get(0).getLayer(), + "auto-layer rule must have null Layer at registration"); + assertEquals("default", applied.getRegistered().get(0).getRuleName()); + } + + @Test + void compilePhaseFailurePropagatesWithEmptyPartial() throws Exception { + // Phase 1 — factory.compile throws for the first rule. The two-phase apply must NOT + // have registered anything yet (addOrReplace is Phase 2), so the partial list the + // caller gets is empty. Matches the "LAL rollback-safe apply (two-phase compile + + // defer-old-removal)" contract. 
+ when(factory.compile(any(LALConfig.class), any(ClassPool.class), any(ClassLoader.class))) + .thenThrow(new ModuleStartException("synthetic compile failure")); + + final LalFileApplier.ApplyException ex = assertThrows( + LalFileApplier.ApplyException.class, + () -> applier.apply(TWO_RULE_LAL_YAML, "lal/broken", "h")); + assertTrue(ex.getPartial().isEmpty(), + "compile-phase failure means zero registrations landed — partial MUST be empty"); + // The factory must never have been asked to addOrReplace anything since Phase 1 + // aborted. + verify(factory, never()) + .addOrReplace(any(LogFilterListener.Factory.CompiledLAL.class)); + } + + @Test + void registerPhaseFailurePropagatesWithProgressSoFar() throws Exception { + // Phase 2 — compile succeeds for both, but factory.addOrReplace throws on the + // second rule's turn. The partial list MUST include the first rule so the caller + // can roll it back via remove(Applied). This is the rollback-safe contract. + Mockito.doNothing() + .doThrow(new RuntimeException("synthetic register failure")) + .when(factory).addOrReplace(any(LogFilterListener.Factory.CompiledLAL.class)); + + final LalFileApplier.ApplyException ex = assertThrows( + LalFileApplier.ApplyException.class, + () -> applier.apply(TWO_RULE_LAL_YAML, "lal/half", "h")); + final List partial = ex.getPartial(); + assertEquals(1, partial.size(), + "one rule landed in the factory before the second threw — partial must reflect that"); + assertEquals(Layer.GENERAL, partial.get(0).getLayer()); + assertEquals("default", partial.get(0).getRuleName()); + } + + @Test + void removeUnregistersEveryRegisteredRule() throws Exception { + final LalFileApplier.Applied applied = applier.apply(TWO_RULE_LAL_YAML, "lal/multi", "h"); + applier.remove(applied); + + verify(factory).remove(eq(Layer.GENERAL), eq("default")); + verify(factory).remove(eq(Layer.MESH), eq("second")); + } + + @Test + void removeWithNullAppliedIsNoOp() { + applier.remove(null); + verify(factory, 
never()).remove(any(), Mockito.anyString()); + } + + @Test + void removeWithEmptyRegisteredIsNoOp() { + // Empty Applied (e.g. the empty-partial result a caller tries to roll back after a + // compile-phase failure) is a no-op rather than a null dereference. + final LalFileApplier.Applied empty = new LalFileApplier.Applied( + "lal/nothing", Collections.emptyList(), null); + applier.remove(empty); + verify(factory, never()).remove(any(), Mockito.anyString()); + } + + @Test + void removeContinuesAfterIndividualFailure() throws Exception { + // Best-effort removal — one rule throwing must not prevent the others from being + // unregistered. The factory.remove calls are wrapped in try/catch with log.warn in + // LalFileApplier; we verify that all three rules still hit factory.remove. + final String threeRuleYaml = + TWO_RULE_LAL_YAML + + " - name: third\n" + + " layer: K8S_SERVICE\n" + + " dsl: |\n" + + " filter {\n" + + " sink {\n" + + " }\n" + + " }\n"; + final LalFileApplier.Applied applied = applier.apply(threeRuleYaml, "lal/three", "h"); + Mockito.doThrow(new RuntimeException("simulated")) + .when(factory).remove(eq(Layer.MESH), eq("second")); + + applier.remove(applied); + + verify(factory).remove(eq(Layer.GENERAL), eq("default")); + verify(factory).remove(eq(Layer.MESH), eq("second")); + verify(factory).remove(eq(Layer.K8S_SERVICE), eq("third")); + } + + @Test + void planKeysReturnsLayerAndNameWithoutRegistering() throws Exception { + // planKeys is a read-only inspection so the dslManager can detect cross-file + // collisions before any compile work. factory must see ZERO calls.
+ final List keys = applier.planKeys(TWO_RULE_LAL_YAML, "lal/plan"); + assertEquals(2, keys.size()); + assertEquals(Layer.GENERAL, keys.get(0).getLayer()); + assertEquals("default", keys.get(0).getRuleName()); + assertEquals(Layer.MESH, keys.get(1).getLayer()); + assertEquals("second", keys.get(1).getRuleName()); + + verify(factory, never()) + .compile(any(LALConfig.class), any(ClassPool.class), any(ClassLoader.class)); + verify(factory, never()) + .addOrReplace(any(LogFilterListener.Factory.CompiledLAL.class)); + } + + @Test + void planKeysAutoLayerYieldsNullLayer() throws Exception { + final List keys = + applier.planKeys(AUTO_LAYER_LAL_YAML, "lal/auto"); + assertEquals(1, keys.size()); + assertNull(keys.get(0).getLayer()); + assertEquals("default", keys.get(0).getRuleName()); + } + + @Test + void planKeysRaisesOnYamlParseError() { + final LalFileApplier.ApplyException ex = assertThrows( + LalFileApplier.ApplyException.class, + () -> applier.planKeys("this: is: not: valid: yaml", "lal/bad")); + assertNotNull(ex.getMessage()); + assertTrue(ex.getPartial().isEmpty()); + } + + @Test + void legacyTwoArgApplyUsesEmptyHash() throws Exception { + // Back-compat 2-arg entry point — loader identity is still constructed but the hash + // is empty. Production callers always use the 3-arg form; 2-arg remains for tests + // and manual invocation. + final LalFileApplier.Applied applied = applier.apply(VALID_LAL_YAML, "lal/legacy"); + assertNotNull(applied); + // The legacy overload still constructs a per-file loader internally; its contentHash + // is empty string. 
+ assertNotNull(applied.getRuleClassLoader()); + assertEquals("", applied.getRuleClassLoader().getContentHash()); + assertFalse(applied.getRegistered().isEmpty()); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java new file mode 100644 index 000000000000..c07c74f77512 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.apply; + +import java.util.Collections; +import java.util.HashSet; +import java.util.Set; +import javassist.ClassPool; +import org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.mockito.Mockito; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertNotNull; +import static org.junit.jupiter.api.Assertions.assertThrows; +import static org.junit.jupiter.api.Assertions.assertTrue; +import static org.mockito.ArgumentMatchers.any; +import static org.mockito.ArgumentMatchers.anyString; +import static org.mockito.Mockito.times; +import static org.mockito.Mockito.verify; + +/** + * Covers the MAL hot-apply path with a mocked {@link MeterSystem} — real Javassist compile + * (so per-file loader semantics are exercised end-to-end), but the terminal + * {@code meterSystem.create} call is captured rather than registering into a live OAP + * subsystem. These tests pin the behaviour the dslManager relies on: parse failures + * propagate as {@link MalFileApplier.ApplyException} with the metric-name set so callers + * can roll back, removed metrics hit {@code MeterSystem.removeMetric} for each name, and a + * successful apply produces an {@code Applied} with the derived name set and a non-null + * per-file loader. + */ +class MalFileApplierTest { + + private MeterSystem meterSystem; + private MalFileApplier applier; + + @BeforeEach + void setUp() { + // Mock-only — we do not exercise MeterSystem's dynamic class generation. 
The per-file + // RuleClassLoader test coverage lives in RuleClassLoaderTest + ClassLoaderGcTest; here + // we only verify that MalFileApplier itself threads content through correctly and + // raises the right exceptions on bad input. + meterSystem = Mockito.mock(MeterSystem.class); + applier = new MalFileApplier(meterSystem); + } + + @Test + void nullYamlRaisesApplyException() { + // SnakeYAML returns null for a null input stream. The applier must detect that and + // surface a clear message instead of an opaque NullPointerException deeper in the + // pipeline. + final MalFileApplier.ApplyException ex = assertThrows( + MalFileApplier.ApplyException.class, + () -> applier.apply(null, "dummy/vm", "hash-0")); + assertNotNull(ex.getMessage()); + assertTrue(ex.getPartiallyRegistered().isEmpty(), + "no metrics registered yet on parse failure"); + } + + @Test + void emptyYamlRaisesApplyException() { + final MalFileApplier.ApplyException ex = assertThrows( + MalFileApplier.ApplyException.class, + () -> applier.apply("", "dummy/empty", "hash-0")); + assertNotNull(ex.getMessage()); + } + + @Test + void malformedYamlRaisesApplyExceptionWithEmptyPartial() { + // Garbage bytes — SnakeYAML throws. The applier must wrap so the caller has a + // consistent exception type to catch; the partial-registration list is empty because + // we bailed before any MeterSystem.create was invoked. + final MalFileApplier.ApplyException ex = assertThrows( + MalFileApplier.ApplyException.class, + () -> applier.apply("this: is: not: valid: yaml: at all", "dummy/bad", "h")); + assertTrue(ex.getPartiallyRegistered().isEmpty()); + } + + @Test + void successfulApplyRegistersDerivedMetricNames() throws Exception { + // Valid minimal MAL file with two rules. The applier's metric-name enumeration should + // join metricPrefix + "_" + rule.name — this is the same formula MetricConvert uses, + // and callers depend on it for the STRUCTURAL diff's "removedMetrics" set. 
+ final String yaml = + "metricPrefix: meter_vm\n" + + "expSuffix: service(['host'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + " - name: cpu_total_percentage\n" + + " exp: node_cpu_seconds_total.sum(['host']).rate('PT1M')\n" + + " - name: mem_total_used_percentage\n" + + " exp: node_memory_MemTotal_bytes.sum(['host'])\n"; + + final MalFileApplier.Applied applied = + applier.apply(yaml, "otel-rules/vm", "hashA"); + assertNotNull(applied); + assertNotNull(applied.getRuleClassLoader(), + "per-file loader must be retained for graveyard observation"); + assertEquals(org.apache.skywalking.oap.server.core.classloader.Catalog.OTEL_RULES, + applied.getRuleClassLoader().getCatalog()); + assertEquals("vm", applied.getRuleClassLoader().getRule()); + assertEquals("hashA", applied.getRuleClassLoader().getContentHash()); + + assertEquals( + setOf("meter_vm_cpu_total_percentage", "meter_vm_mem_total_used_percentage"), + applied.getRegisteredMetricNames()); + + // MeterSystem.create must have been called once per metric name (6-arg pool + opt + // overload, since we're on the hot-update path). Plain Mockito verify — not strict, + // just a minimum count assertion. + verify(meterSystem, times(2)) + .create(anyString(), anyString(), any(), any(ClassPool.class), any(ClassLoader.class), + any(StorageManipulationOpt.class)); + } + + @Test + void removeCallsMeterSystemPerName() { + // The inverse side of the contract: on unregister every metric name the prior apply + // recorded must flow to MeterSystem.removeMetric. The dslManager relies on this to + // drain L1/L2 handlers + drop the BanyanDB measure. The applier's no-opt overload + // delegates to the opt-aware removeMetric with fullInstall(), which is what we + // verify here. 
+ final Set<String> names = setOf("meter_a", "meter_b", "meter_c"); + applier.remove(names); + for (final String n : names) { + verify(meterSystem).removeMetric(Mockito.eq(n), any(StorageManipulationOpt.class)); + } + } + + @Test + void removeWithNullSetIsNoOp() { + applier.remove(null); + verify(meterSystem, Mockito.never()) + .removeMetric(anyString(), any(StorageManipulationOpt.class)); + } + + @Test + void removeWithEmptySetIsNoOp() { + applier.remove(Collections.emptySet()); + verify(meterSystem, Mockito.never()) + .removeMetric(anyString(), any(StorageManipulationOpt.class)); + } + + @Test + void removeContinuesAfterIndividualFailureThenThrowsSummary() { + // Best-effort-with-surfacing: the dslManager expects "all gone" as the end state, so if + // one metric throws we must still attempt the rest — otherwise an upstream corruption + // would leave half-registered state that the next tick sees and gets confused by. But + // after the loop completes, remove() throws RemoveException so the REST sync path + // surfaces 500 teardown_deferred / commit_deferred to the operator instead of + // misleading them with 200 inactivated / structural_applied. + Mockito.doThrow(new RuntimeException("simulated drain failure")) + .when(meterSystem).removeMetric(Mockito.eq("meter_b"), any(StorageManipulationOpt.class)); + final MalFileApplier.RemoveException thrown = assertThrows( + MalFileApplier.RemoveException.class, + () -> applier.remove(setOf("meter_a", "meter_b", "meter_c"))); + // Every sibling was still attempted — the throw happens at the end, not on first failure. + verify(meterSystem).removeMetric(Mockito.eq("meter_a"), any(StorageManipulationOpt.class)); + verify(meterSystem).removeMetric(Mockito.eq("meter_b"), any(StorageManipulationOpt.class)); + verify(meterSystem).removeMetric(Mockito.eq("meter_c"), any(StorageManipulationOpt.class)); + // Only the failing name is in the failures map; the other two succeeded.
+ assertEquals(1, thrown.getFailures().size()); + assertTrue(thrown.getFailures().containsKey("meter_b")); + } + + @Test + void removeReturnsNormallyWhenAllMetricsSucceed() { + // Sanity check on the happy path — no throw when every removeMetric returns cleanly. + // Protects against accidentally turning remove() into a "throws always" implementation + // in the process of making it surface failures. + applier.remove(setOf("meter_a", "meter_b")); + verify(meterSystem).removeMetric(Mockito.eq("meter_a"), any(StorageManipulationOpt.class)); + verify(meterSystem).removeMetric(Mockito.eq("meter_b"), any(StorageManipulationOpt.class)); + } + + @Test + void ruleNameFallsBackToSourceNameWhenMissing() throws Exception { + // The applier tolerates YAML that doesn't declare a name at the file level — it + // stamps sourceName in so stack traces are still identifiable. Rule.name null is + // handled, but individual metric rules must still have names (else enumeration + // skips them). + final String yaml = + "metricPrefix: meter_x\n" + + "expSuffix: service(['host'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + " - name: one\n" + + " exp: m.sum(['host'])\n"; + final MalFileApplier.Applied applied = applier.apply(yaml, "otel-rules/myfile", "h"); + assertEquals(setOf("meter_x_one"), applied.getRegisteredMetricNames()); + } + + @Test + void legacyTwoArgApplyUsesEmptyHash() throws Exception { + // Back-compat overload — loader identity still constructible, just with a less + // traceable hash. DSLManager callers all use the 3-arg form; the 2-arg form exists + // for tests and manual invocation. 
+ final String yaml = + "metricPrefix: meter_y\n" + + "expSuffix: service(['host'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + " - name: rule1\n" + + " exp: m.sum(['host'])\n"; + final MalFileApplier.Applied applied = applier.apply(yaml, "otel-rules/legacy"); + assertEquals("", applied.getRuleClassLoader().getContentHash()); + assertFalse(applied.getRegisteredMetricNames().isEmpty()); + } + + private static Set<String> setOf(final String... s) { + final Set<String> r = new HashSet<>(); + Collections.addAll(r, s); + return r; + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/MainRouterTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/MainRouterTest.java new file mode 100644 index 000000000000..84e233424052 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/MainRouterTest.java @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.cluster; + +import java.util.Collections; +import org.apache.skywalking.oap.server.core.remote.client.RemoteClientManager; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertTrue; +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.when; + +class MainRouterTest { + + @Test + void selfIsMainWhenPeerListEmpty() { + // Empty peer list reflects either "no cluster module wired" (rcm == null) or a + // refresh window where the manager momentarily has no entries. Either way the + // local node is the operator's authority for runtime-rule writes. The earlier + // {@code isPeerListReady} guard that 503'd writes during cold-boot is gone — + // the cluster routing layer now treats empty list and null rcm symmetrically as + // "self is main", so writes accept without an extra readiness gate. + assertTrue(MainRouter.isSelfMain(null)); + final RemoteClientManager emptyRcm = mock(RemoteClientManager.class); + when(emptyRcm.getRemoteClient()).thenReturn(Collections.emptyList()); + assertTrue(MainRouter.isSelfMain(emptyRcm)); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java new file mode 100644 index 000000000000..7eccc122c234 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java @@ -0,0 +1,354 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.rest; + +import com.linecorp.armeria.common.HttpData; +import com.linecorp.armeria.common.HttpResponse; +import com.linecorp.armeria.common.HttpStatus; +import com.linecorp.armeria.common.ResponseHeaders; +import java.util.Arrays; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.locks.ReentrantLock; +import org.apache.skywalking.oap.server.core.CoreModule; +import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; +import org.apache.skywalking.oap.server.core.storage.StorageModule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.library.module.ModuleProviderHolder; +import org.apache.skywalking.oap.server.library.module.ModuleServiceHolder; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterClient; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngineRegistry; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.lal.LalRuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.mal.MalRuleEngine; +import 
org.apache.skywalking.oap.server.receiver.runtimerule.metrics.LockMetrics; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLRuntimeDelete; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.StructuralCommitCoordinator; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.SuspendResult; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.SuspendResumeCoordinator; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.mockito.Mockito; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; +import static org.mockito.ArgumentMatchers.any; +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.never; +import static org.mockito.Mockito.verify; +import static org.mockito.Mockito.when; + +/** + * Unit-level IT for the {@code allowStorageChange} guardrail + {@code force=true} bypass. + * Complements {@link RuntimeRuleRestHandlerTest}'s path-selection coverage with scenarios + * focused on the shape-break / rule-add gate that the design treats as the "data-loss + * affirmation" UX surface. + * + *

<p>No containers — the assertions are all on the classifier + guardrail path the REST + * handler walks before any persist or apply. {@link DSLManager} stays mocked at + * {@code applyNowForRuleFile}; {@link RuntimeRuleManagementDAO} stays mocked so the prior-row + * lookup returns what each scenario needs. + * + *
<p>Scenarios: + *
<ul> + *
  <li>{@link #malScopeChangeRejectedWithoutAllowStorageChange} — SERVICE→INSTANCE scope + * move via /addOrUpdate (no flag) is the canonical "dangerous push" the guardrail + * was added for. Expect 409, no persist, no apply.</li> + *
  <li>{@link #malScopeChangeAcceptedWithAllowStorageChangeTrue} — same edit via + * /addOrUpdate?allowStorageChange=true passes the guardrail and drives the apply + * pipeline. Combine with {@code force=true} for recovery pushes.</li> + *
  <li>{@link #malBodyOnlyEditNeverTripsGuardrail} — changing expression body but keeping + * (functionName, scopeType) identical is FILTER_ONLY per classifier, guardrail + * stays quiet even without the flag. Common operator workflow; must never be + * blocked.</li> + *
  <li>{@link #malAddedMetricNeverTripsGuardrail} — pure-additive rule-file edit (new + * metric added, unchanged metrics intact) is safe on BanyanDB (new measure created, + * existing ones untouched). Must not require the flag.</li> + *
  <li>{@link #lalOutputTypeChangeRejectedWithoutAllowStorageChange} — LAL outputType + * change reroutes log records to a different AbstractLog subclass; orphans the + * previous type's rows. Guardrail-gated; 409 without flag.</li> + *
  <li>{@link #lalRuleAddedIsRejectedWithoutAllowStorageChange} — LAL rule keys added + * bring inline-MAL metrics that fire DDL; gated.</li> + *
  <li>{@link #lalBodyOnlyEditAccepted} — same rule keys + same outputType + body tweaks + * pass without flag. CI-friendly normal-edit path.</li> + *
</ul>
+ */ +class GuardrailIntegrationTest { + + private ModuleManager moduleManager; + private DSLManager dslManager; + private RuntimeRuleClusterClient clusterClient; + private RuntimeRuleManagementDAO dao; + private RuntimeRuleRestHandler handler; + + @BeforeEach + void setUp() { + moduleManager = mock(ModuleManager.class); + dslManager = mock(DSLManager.class); + clusterClient = mock(RuntimeRuleClusterClient.class); + dao = mock(RuntimeRuleManagementDAO.class); + + final ModuleProviderHolder storagePh = mock(ModuleProviderHolder.class); + final ModuleServiceHolder storageSh = mock(ModuleServiceHolder.class); + when(moduleManager.find(StorageModule.NAME)).thenReturn(storagePh); + when(storagePh.provider()).thenReturn(storageSh); + when(storageSh.getService(RuntimeRuleManagementDAO.class)).thenReturn(dao); + + final ModuleProviderHolder corePh = mock(ModuleProviderHolder.class); + final ModuleServiceHolder coreSh = mock(ModuleServiceHolder.class); + when(moduleManager.find(CoreModule.NAME)).thenReturn(corePh); + when(corePh.provider()).thenReturn(coreSh); + + // DSLManager per-file lock plumbing — the REST handler grabs a reentrant lock + a + // timer before running the workflow. The lock now lives on each AppliedRuleScript; + // a real ConcurrentHashMap stands in so AppliedRuleScript.lockFor lazy-creates an + // entry on first acquire. 
+ when(dslManager.getRules()).thenReturn(new ConcurrentHashMap<>()); + final LockMetrics lockMetrics = mock(LockMetrics.class); + when(dslManager.getLockMetrics()).thenReturn(lockMetrics); + when(lockMetrics.acquireForRest(Mockito.any(ReentrantLock.class), Mockito.anyLong(), + Mockito.anyString(), Mockito.anyString())).thenAnswer(inv -> { + final ReentrantLock l = inv.getArgument(0); + l.lock(); + return true; + }); + when(lockMetrics.startRestHoldTimer()).thenReturn(mock(HistogramMetrics.Timer.class)); + + final RuleEngineRegistry engineRegistry = new RuleEngineRegistry(); + engineRegistry.register(new MalRuleEngine(new ConcurrentHashMap<>(), moduleManager)); + engineRegistry.register(new LalRuleEngine(new ConcurrentHashMap<>(), moduleManager)); + when(dslManager.getEngineRegistry()).thenReturn(engineRegistry); + + final SuspendResumeCoordinator suspendCoord = mock(SuspendResumeCoordinator.class); + when(dslManager.getSuspendCoord()).thenReturn(suspendCoord); + when(suspendCoord.localSuspend(Mockito.anyString(), Mockito.anyString())) + .thenReturn(SuspendResult.SUSPENDED); + when(suspendCoord.localResume(Mockito.anyString(), Mockito.anyString())).thenReturn(0); + when(dslManager.getCommitCoord()) + .thenReturn(mock(StructuralCommitCoordinator.class)); + when(dslManager.getDslRuntimeDelete()) + .thenReturn(mock(DSLRuntimeDelete.class)); + // Stub both overloads — the REST handler calls the single-arg form on the + // FILTER_ONLY path and the two-arg form (deferCommit=true) on STRUCTURAL. 
+ when(dslManager.applyNowForRuleFile(any())).thenAnswer(inv -> { + final Object arg = inv.getArgument(0); + if (arg instanceof RuntimeRuleManagementDAO.RuntimeRuleFile) { + final RuntimeRuleManagementDAO.RuntimeRuleFile file = + (RuntimeRuleManagementDAO.RuntimeRuleFile) arg; + return DSLRuntimeState.running(file.getCatalog(), file.getName(), "h", 0L); + } + return null; + }); + when(dslManager.applyNowForRuleFile(any(), Mockito.anyBoolean())).thenAnswer(inv -> { + final Object arg = inv.getArgument(0); + if (arg instanceof RuntimeRuleManagementDAO.RuntimeRuleFile) { + final RuntimeRuleManagementDAO.RuntimeRuleFile file = + (RuntimeRuleManagementDAO.RuntimeRuleFile) arg; + return DSLRuntimeState.running(file.getCatalog(), file.getName(), "h", 0L); + } + return null; + }); + + handler = new RuntimeRuleRestHandler(moduleManager, dslManager, clusterClient, 30_000L); + } + + // ---- MAL scenarios -------------------------------------------------------------------- + + @Test + void malScopeChangeRejectedWithoutAllowStorageChange() throws Exception { + whenDaoHasRow(CATALOG_MAL, "vm", SERVICE_YAML, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.addOrUpdate(CATALOG_MAL, "vm", "false", "false", + HttpData.ofUtf8(INSTANCE_YAML)); + + assertHttp(resp, HttpStatus.CONFLICT); + // Guardrail runs BEFORE any Suspend broadcast or applyNowForRuleFile — rejection must + // leave the downstream clean. Check both overloads (single-arg FILTER_ONLY path and + // two-arg STRUCTURAL path). 
+ verify(dslManager, never()).applyNowForRuleFile(any()); + verify(dslManager, never()).applyNowForRuleFile(any(), Mockito.anyBoolean()); + verify(clusterClient, never()).broadcastSuspend( + Mockito.anyString(), Mockito.anyString(), Mockito.anyString()); + } + + @Test + void malScopeChangeAcceptedWithAllowStorageChangeTrue() throws Exception { + whenDaoHasRow(CATALOG_MAL, "vm", SERVICE_YAML, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.addOrUpdate(CATALOG_MAL, "vm", "true", "false", + HttpData.ofUtf8(INSTANCE_YAML)); + + assertHttp(resp, HttpStatus.OK); + // STRUCTURAL path uses the two-arg overload (deferCommit=true) so row-persist + // failure can cleanly roll back. + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + } + + @Test + void malScopeChangeAcceptedThroughFixRoute() throws Exception { + // /addOrUpdate?allowStorageChange=true&force=true. Same end-state as + // the allowStorageChange=true case, different audit-log surface. + whenDaoHasRow(CATALOG_MAL, "vm", SERVICE_YAML, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.addOrUpdate(CATALOG_MAL, "vm", "true", "true", + HttpData.ofUtf8(INSTANCE_YAML)); + + assertHttp(resp, HttpStatus.OK); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + } + + @Test + void malBodyOnlyEditNeverTripsGuardrail() throws Exception { + // Same (function, scope) on the single metric — classifier reports FILTER_ONLY. + // No shape-break set, guardrail stays quiet. Must pass without the flag. 
+ final String bodyEdited = SERVICE_YAML.replace( + "throughput_total.sum(['host'])", + "throughput_total.sum(['host']).rate('PT1M')"); + whenDaoHasRow(CATALOG_MAL, "vm", SERVICE_YAML, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.addOrUpdate(CATALOG_MAL, "vm", "false", "false", + HttpData.ofUtf8(bodyEdited)); + + assertHttp(resp, HttpStatus.OK); + // FILTER_ONLY path uses the single-arg overload — no deferred commit needed because + // no destructive tail exists for body-only edits. + verify(dslManager).applyNowForRuleFile(any()); + } + + @Test + void malAddedMetricNeverTripsGuardrail() throws Exception { + // New metric added, existing one unchanged. Pure-additive on BanyanDB (new measure, + // old measure untouched). Guardrail does not flag this — shapeBreakMetrics stays + // empty. + final String addedMetric = SERVICE_YAML + + " - name: latency\n" + + " exp: latency_seconds.sum(['host'])\n"; + whenDaoHasRow(CATALOG_MAL, "vm", SERVICE_YAML, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.addOrUpdate(CATALOG_MAL, "vm", "false", "false", + HttpData.ofUtf8(addedMetric)); + + assertHttp(resp, HttpStatus.OK); + // Non-empty addedMetrics makes this STRUCTURAL (NEW classification on first apply + // or STRUCTURAL on update) — goes through the deferred-commit path. 
+ verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + } + + // ---- LAL scenarios -------------------------------------------------------------------- + + @Test + void lalOutputTypeChangeRejectedWithoutAllowStorageChange() throws Exception { + final String oldLal = "rules:\n" + + " - name: r1\n layer: MESH\n outputType: org.example.TypeA\n" + + " dsl: 'filter { sink {} }'\n"; + final String newLal = oldLal.replace("org.example.TypeA", "org.example.TypeB"); + whenDaoHasRow(CATALOG_LAL, "lal-file", oldLal, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.addOrUpdate(CATALOG_LAL, "lal-file", "false", "false", + HttpData.ofUtf8(newLal)); + + assertHttp(resp, HttpStatus.CONFLICT); + verify(dslManager, never()).applyNowForRuleFile(any()); + verify(dslManager, never()).applyNowForRuleFile(any(), Mockito.anyBoolean()); + } + + @Test + void lalRuleAddedIsRejectedWithoutAllowStorageChange() throws Exception { + final String oneRule = "rules:\n" + + " - name: r1\n layer: MESH\n dsl: 'filter { sink {} }'\n"; + final String twoRules = oneRule + + " - name: r2\n layer: MESH\n dsl: 'filter { json {} sink {} }'\n"; + whenDaoHasRow(CATALOG_LAL, "lal-file", oneRule, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.addOrUpdate(CATALOG_LAL, "lal-file", "false", "false", + HttpData.ofUtf8(twoRules)); + + assertHttp(resp, HttpStatus.CONFLICT); + } + + @Test + void lalBodyOnlyEditAccepted() throws Exception { + // Same rule keys, same outputType (absent both times = default), different DSL body. + // Storage identity unchanged → guardrail passes. 
+ final String bodyA = "rules:\n" + + " - name: r1\n layer: MESH\n dsl: 'filter { sink {} }'\n"; + final String bodyB = "rules:\n" + + " - name: r1\n layer: MESH\n dsl: 'filter { json {} sink {} }'\n"; + whenDaoHasRow(CATALOG_LAL, "lal-file", bodyA, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.addOrUpdate(CATALOG_LAL, "lal-file", "false", "false", + HttpData.ofUtf8(bodyB)); + + assertHttp(resp, HttpStatus.OK); + // LAL always routes through the STRUCTURAL path (classifyLal reports STRUCTURAL on + // every content change), so the two-arg overload fires. + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + } + + @Test + void lalRuleAddedAcceptedWithAllowStorageChangeTrue() throws Exception { + final String oneRule = "rules:\n" + + " - name: r1\n layer: MESH\n dsl: 'filter { sink {} }'\n"; + final String twoRules = oneRule + + " - name: r2\n layer: MESH\n dsl: 'filter { json {} sink {} }'\n"; + whenDaoHasRow(CATALOG_LAL, "lal-file", oneRule, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.addOrUpdate(CATALOG_LAL, "lal-file", "true", "false", + HttpData.ofUtf8(twoRules)); + + assertHttp(resp, HttpStatus.OK); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + } + + // ---- helpers -------------------------------------------------------------------------- + + private void whenDaoHasRow(final String catalog, final String name, + final String content, final String status) throws Exception { + final RuntimeRuleManagementDAO.RuntimeRuleFile row = + new RuntimeRuleManagementDAO.RuntimeRuleFile(catalog, name, content, status, 0L); + when(dao.getAll()).thenReturn(Arrays.asList(row)); + } + + private static void assertHttp(final HttpResponse resp, final HttpStatus expected) { + final ResponseHeaders headers = resp.aggregate().toCompletableFuture().join().headers(); + assertEquals(expected.code(), headers.status().code(), + "unexpected HTTP status (headers: " + headers + ")"); + 
assertTrue(headers.status().isSuccess() || headers.status().isClientError() + || headers.status().isServerError(), "status classified as success/client/server"); + } + + private static final String CATALOG_MAL = "otel-rules"; + private static final String CATALOG_LAL = "lal"; + + private static final String SERVICE_YAML = + "metricPrefix: it_vm\n" + + "expSuffix: service(['host'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + " - name: throughput\n" + + " exp: throughput_total.sum(['host'])\n"; + + private static final String INSTANCE_YAML = + "metricPrefix: it_vm\n" + + "expSuffix: instance(['host','instance'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + " - name: throughput\n" + + " exp: throughput_total.sum(['host','instance'])\n"; +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java new file mode 100644 index 000000000000..c2e61548c56d --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java @@ -0,0 +1,566 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.rest; + +import com.linecorp.armeria.common.AggregatedHttpResponse; +import com.linecorp.armeria.common.HttpData; +import com.linecorp.armeria.common.HttpHeaderNames; +import com.linecorp.armeria.common.HttpResponse; +import com.linecorp.armeria.common.HttpStatus; +import com.linecorp.armeria.common.ResponseHeaders; +import java.lang.reflect.Method; +import java.nio.charset.StandardCharsets; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.locks.ReentrantLock; +import org.apache.skywalking.oap.server.core.CoreModule; +import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; +import org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry; +import org.apache.skywalking.oap.server.core.storage.StorageModule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import org.apache.skywalking.oap.server.library.module.ModuleManager; +import org.apache.skywalking.oap.server.library.module.ModuleProviderHolder; +import org.apache.skywalking.oap.server.library.module.ModuleServiceHolder; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterClient; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngineRegistry; +import 
org.apache.skywalking.oap.server.receiver.runtimerule.engine.lal.LalRuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.mal.MalRuleEngine; +import org.apache.skywalking.oap.server.receiver.runtimerule.metrics.LockMetrics; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLRuntimeDelete; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.StructuralCommitCoordinator; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.SuspendResult; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.SuspendResumeCoordinator; +import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.util.ContentHash; +import org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; +import org.mockito.Mockito; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.mockito.ArgumentMatchers.any; +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.never; +import static org.mockito.Mockito.verify; +import static org.mockito.Mockito.when; + +/** + * Unit tests for the REST handler's path-selection logic: no_change short-circuit status + * awareness, {@code /addOrUpdate?force=true} forceReapply bypass, and the 409 guardrail. + * No infra, no containers; {@link DSLManager} is mocked at the integration seam + * (applyNowForRuleFile). + * + *

<p>The regression targets these specific bugs fixed in 8c96440d27:
+ * <ul>
+ *   <li>/addOrUpdate with byte-identical content on an INACTIVE row no longer returns
+ *       no_change; it runs through the apply pipeline to reactivate.</li>
+ *   <li>/addOrUpdate?force=true always bypasses the no_change short-circuit, even on
+ *       byte-identical content, so operator recovery re-pushes actually drive a
+ *       re-apply.</li>
+ *   <li>/addOrUpdate on byte-identical content on an ACTIVE row still no_change's:
+ *       CI idempotency preserved for the normal push path.</li>
+ * </ul>
+ */ +class RuntimeRuleRestHandlerTest { + + private ModuleManager moduleManager; + private DSLManager dslManager; + private RuntimeRuleClusterClient clusterClient; + private RuntimeRuleManagementDAO dao; + private RuntimeRuleRestHandler handler; + + @BeforeEach + void setUp() { + moduleManager = mock(ModuleManager.class); + dslManager = mock(DSLManager.class); + clusterClient = mock(RuntimeRuleClusterClient.class); + dao = mock(RuntimeRuleManagementDAO.class); + + // Wire StorageModule → DAO resolution so currentRuleFile(...) reaches the mocked dao. + final ModuleProviderHolder storagePh = mock(ModuleProviderHolder.class); + final ModuleServiceHolder storageSh = mock(ModuleServiceHolder.class); + when(moduleManager.find(StorageModule.NAME)).thenReturn(storagePh); + when(storagePh.provider()).thenReturn(storageSh); + when(storageSh.getService(RuntimeRuleManagementDAO.class)).thenReturn(dao); + + // CoreModule stub — some handler paths resolve services from it; empty stub is fine + // for the doAddOrUpdate/doInactivate/doDelete paths these tests exercise. + final ModuleProviderHolder corePh = mock(ModuleProviderHolder.class); + final ModuleServiceHolder coreSh = mock(ModuleServiceHolder.class); + when(moduleManager.find(CoreModule.NAME)).thenReturn(corePh); + when(corePh.provider()).thenReturn(coreSh); + + // DSLManager per-file lock plumbing — the handler grabs a reentrant lock via + // AppliedRuleScript.lockFor(dslManager.getRules(), catalog, name) before running + // the workflow, and times the hold through dslManager.getLockMetrics() + // .startRestHoldTimer(). Mockito returns null for unstubbed object methods, which + // would NPE every test; wire a real ConcurrentHashMap so AppliedRuleScript.lockFor + // lazy-creates an entry on first acquire, plus a minimal LockMetrics mock. 
+ when(dslManager.getRules()).thenReturn(new ConcurrentHashMap<>()); + final LockMetrics lockMetrics = mock(LockMetrics.class); + when(dslManager.getLockMetrics()).thenReturn(lockMetrics); + when(lockMetrics.acquireForRest(Mockito.any(ReentrantLock.class), Mockito.anyLong(), + Mockito.anyString(), Mockito.anyString())).thenAnswer(inv -> { + final ReentrantLock l = inv.getArgument(0); + l.lock(); + return true; + }); + when(lockMetrics.startRestHoldTimer()).thenReturn(mock(HistogramMetrics.Timer.class)); + + // Engine registry — REST handler validates incoming catalogs by asking the registry + // whether some engine claims them. Real Mal+Lal engines are cheap, no module deps. + final RuleEngineRegistry engineRegistry = new RuleEngineRegistry(); + engineRegistry.register(new MalRuleEngine(new ConcurrentHashMap<>(), moduleManager)); + engineRegistry.register(new LalRuleEngine(new ConcurrentHashMap<>(), moduleManager)); + when(dslManager.getEngineRegistry()).thenReturn(engineRegistry); + + // DSLManager subsystem getters — the REST handler reaches Suspend/Resume + 2-PC + // commit + /delete-backend-drop directly via DSLManager.getXxx() now (no + // pass-through wrappers). Wire each subsystem to a mock so every test gets + // no-op behaviour without per-test stubbing; the apply-path tests below add + // happy-path return values on top. + when(dslManager.getSuspendCoord()).thenReturn(mock(SuspendResumeCoordinator.class)); + when(dslManager.getCommitCoord()).thenReturn(mock(StructuralCommitCoordinator.class)); + when(dslManager.getDslRuntimeDelete()).thenReturn(mock(DSLRuntimeDelete.class)); + + // persistRuleSync now calls dao.save(rule); the mocked DAO returns void by default + // so the persist path completes successfully in these unit tests with no extra + // wiring. The earlier ManagementStreamProcessor reflection injection is gone. 
+ + handler = new RuntimeRuleRestHandler(moduleManager, dslManager, clusterClient, 30_000L); + } + + @Test + void addOrUpdateReturnsNoChangeOnByteIdenticalActiveRow() throws Exception { + // Regression for CI idempotency: pushing the same bytes on a currently ACTIVE row + // must still short-circuit to 200 no_change with no side effects. The classifier, + // guardrail, Suspend broadcast, and dslManager apply are all skipped. + final String yaml = minimalMalYaml(); + whenDaoHasRow("otel-rules", "vm", yaml, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.addOrUpdate("otel-rules", "vm", "false", "false", + HttpData.ofUtf8(yaml)); + + assertHttpStatus(resp, HttpStatus.OK); + verify(dslManager, never()).applyNowForRuleFile(any()); + verify(dslManager, never()).applyNowForRuleFile(any(), Mockito.anyBoolean()); + verify(clusterClient, never()).broadcastSuspend( + Mockito.anyString(), Mockito.anyString(), Mockito.anyString()); + } + + @Test + void addOrUpdateBypassesNoChangeOnInactiveRow() throws Exception { + // Reactivation path: same bytes but prior row is INACTIVE. The handler must NOT + // short-circuit — it needs to run the full apply so handlers register and the row + // flips back to ACTIVE in storage. Previously returned no_change and left the node + // serving nothing. + final String yaml = minimalMalYaml(); + whenDaoHasRow("otel-rules", "vm", yaml, RuntimeRule.STATUS_INACTIVE); + whenReconcilerApplySucceeds("otel-rules", "vm"); + + final HttpResponse resp = handler.addOrUpdate("otel-rules", "vm", "false", "false", + HttpData.ofUtf8(yaml)); + + // Reactivation pushes through the STRUCTURAL/NEW path — expect 200 with a status + // other than no_change. We don't assert on the exact applyStatus string here (that + // depends on classifier output); the key assertion is that the two-arg deferred- + // commit form of applyNowForRuleFile ran (STRUCTURAL path signature). 
+ assertHttpStatus(resp, HttpStatus.OK); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + } + + @Test + void fixBypassesNoChangeEvenOnByteIdenticalActiveRow() throws Exception { + // Recovery path: operator re-posts known-good content through + // /addOrUpdate?allowStorageChange=true&force=true to converge from a stuck state. + // Previously the no_change short-circuit ate this; force=true must run the full + // apply pipeline. + final String yaml = minimalMalYaml(); + whenDaoHasRow("otel-rules", "vm", yaml, RuntimeRule.STATUS_ACTIVE); + whenReconcilerApplySucceeds("otel-rules", "vm"); + + final HttpResponse resp = handler.addOrUpdate("otel-rules", "vm", "true", "true", + HttpData.ofUtf8(yaml)); + + assertHttpStatus(resp, HttpStatus.OK); + // /addOrUpdate?force=true with byte-identical content → classifier returns + // NO_CHANGE, handler falls through to applyStructural (not applyFilterOnly since + // NO_CHANGE != FILTER_ONLY) which uses the two-arg deferred-commit form. + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + } + + @Test + void addOrUpdateReturnsCompileFailedOnMalformedYaml() throws Exception { + // compile_failed is the guaranteed pre-persist error: the classifier's AST walk + // catches a bad expression, we return 400 without persisting or broadcasting. This + // test pins that the response is 400 and no side effects fire. 
+ final String garbage = "this: is: not: valid: mal: yaml: at all"; + whenDaoHasRow("otel-rules", "vm", null, null); + + final HttpResponse resp = handler.addOrUpdate("otel-rules", "vm", "false", "false", + HttpData.ofUtf8(garbage)); + + assertHttpStatus(resp, HttpStatus.BAD_REQUEST); + verify(dslManager, never()).applyNowForRuleFile(any()); + verify(dslManager, never()).applyNowForRuleFile(any(), Mockito.anyBoolean()); + verify(clusterClient, never()).broadcastSuspend( + Mockito.anyString(), Mockito.anyString(), Mockito.anyString()); + } + + @Test + void addOrUpdateEmptyBodyRejected() throws Exception { + // Basic input validation — defense-in-depth. Also verifies the empty-body check + // runs before the DAO lookup, so an empty body doesn't trigger DAO IO. + final HttpResponse resp = handler.addOrUpdate("otel-rules", "vm", "false", "false", + HttpData.empty()); + + assertHttpStatus(resp, HttpStatus.BAD_REQUEST); + Mockito.verifyNoInteractions(dao); + } + + @Test + void deleteRejectsActiveRuleWith409() throws Exception { + // /delete now requires the rule to be INACTIVE. Posting /delete against an ACTIVE + // row must return HTTP 409 requires_inactivate_first without touching the DAO's + // delete API — operators have to /inactivate first so the destructive teardown + // (DDL drop, handler unregister, alarm reset) runs under its own endpoint. + final String yaml = minimalMalYaml(); + whenDaoHasRow("otel-rules", "vm", yaml, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.delete("otel-rules", "vm", ""); + + assertHttpStatus(resp, HttpStatus.CONFLICT); + Mockito.verify(dao, never()).delete(Mockito.anyString(), Mockito.anyString()); + } + + @Test + void deleteRemovesInactiveRow() throws Exception { + // /delete on an INACTIVE row is the happy path — the destructive work already ran + // at /inactivate time, so this is just a row removal under the per-file lock. 
The + // DAO's delete is called; no Suspend broadcast fires (no converter work to serialize + // against on peers). + final String yaml = minimalMalYaml(); + whenDaoHasRow("otel-rules", "vm", yaml, RuntimeRule.STATUS_INACTIVE); + + final HttpResponse resp = handler.delete("otel-rules", "vm", ""); + + assertHttpStatus(resp, HttpStatus.OK); + Mockito.verify(dao).delete("otel-rules", "vm"); + verify(clusterClient, never()).broadcastSuspend( + Mockito.anyString(), Mockito.anyString(), Mockito.anyString()); + } + + @Test + void deleteIsIdempotentOnAbsentRow() throws Exception { + // Absent row + /delete → 200 "not_found" (idempotent). The desired end state + // (no row) is already achieved; DAO.delete is not called because there's nothing + // to remove. + whenDaoHasRow("otel-rules", "vm", null, null); + + final HttpResponse resp = handler.delete("otel-rules", "vm", ""); + + assertHttpStatus(resp, HttpStatus.OK); + Mockito.verify(dao, never()).delete(Mockito.anyString(), Mockito.anyString()); + } + + @Test + void inactivateUsesLocalCacheOnlySoBackendSchemaIsPreserved() throws Exception { + // Soft-pause contract: /inactivate must drive the local teardown via the + // applyNowForRuleFile overload that takes a StorageManipulationOpt — and that opt + // must be localCacheOnly(). The localCacheOnly path makes per-backend + // whenRemoving record SKIPPED_NOT_ALLOWED instead of firing dropTable, so the + // BanyanDB measure / JDBC table / ES index plus stored data survive the pause. + // /delete is the only path that drops backend schema (still uses fullInstall()). + final String yaml = minimalMalYaml(); + whenDaoHasRow("otel-rules", "vm", yaml, RuntimeRule.STATUS_ACTIVE); + whenReconcilerApplySucceeds("otel-rules", "vm"); + // The 3-arg overload returns DSLRuntimeState too; mock the same successful state. 
+ final DSLRuntimeState state = DSLRuntimeState.running("otel-rules", "vm", "hash", 0L); + when(dslManager.applyNowForRuleFile(any(), Mockito.anyBoolean(), + any(StorageManipulationOpt.class))) + .thenReturn(state); + + final HttpResponse resp = handler.inactivate("otel-rules", "vm"); + + assertHttpStatus(resp, HttpStatus.OK); + // Verify the soft-pause path was taken: 3-arg overload with deferCommit=false and + // a localCacheOnly opt. The destructive 2-arg overload (which would mean + // fullInstall and a dropTable cascade) must NOT have fired. + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(false), + Mockito.argThat(opt -> opt != null && opt.isLocalCacheOnly())); + verify(dslManager, never()).applyNowForRuleFile(any()); + verify(dslManager, never()).applyNowForRuleFile(any(), Mockito.anyBoolean()); + } + + // ---- GET /runtime/rule and /runtime/rule/bundled --------------------------------------- + + @Test + void getRuleReturnsRowYamlWhenActive() throws Exception { + final String yaml = minimalMalYaml(); + whenDaoHasRow("otel-rules", "vm", yaml, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.get("otel-rules", "vm", "", "", ""); + + assertHttpStatus(resp, HttpStatus.OK); + final AggregatedHttpResponse agg = + resp.aggregate().toCompletableFuture().join(); + // Default mode = raw YAML, byte-identical to /addOrUpdate's input. + assertEquals(yaml, agg.contentUtf8()); + // Metadata headers always present so raw and JSON modes are equally introspectable. + assertEquals("ACTIVE", agg.headers().get("X-Sw-Status")); + assertEquals("runtime", agg.headers().get("X-Sw-Source")); + // ETag matches contentHash so editor reload can do If-None-Match. + assertEquals("\"" + sha256Hex(yaml) + "\"", + agg.headers().get(HttpHeaderNames.ETAG)); + } + + @Test + void getRuleReturnsRowYamlWhenInactive() throws Exception { + // Soft-pause contract: an INACTIVE row keeps its original content so the editor can + // re-edit. 
Status header should reflect the actual DB state, not "ACTIVE". + final String yaml = minimalMalYaml(); + whenDaoHasRow("otel-rules", "vm", yaml, RuntimeRule.STATUS_INACTIVE); + + final HttpResponse resp = handler.get("otel-rules", "vm", "", "", ""); + + assertHttpStatus(resp, HttpStatus.OK); + final AggregatedHttpResponse agg = + resp.aggregate().toCompletableFuture().join(); + assertEquals(yaml, agg.contentUtf8()); + assertEquals("INACTIVE", agg.headers().get("X-Sw-Status")); + } + + @Test + void getRuleFallsBackToStaticWhenNoRow() throws Exception { + // No DB row, but StaticRuleRegistry has a bundled rule. Studio's catalog browser + // displays bundled rules with the same shape as runtime ones; the editor needs to + // be able to fetch their content too. Source header distinguishes the two. + clearStaticRegistry(); + final String yaml = "bundled: true\n"; + StaticRuleRegistry.active() + .record("otel-rules", "bundled-only", yaml.getBytes(StandardCharsets.UTF_8)); + whenDaoHasRow("otel-rules", "bundled-only", null, null); + try { + final HttpResponse resp = handler.get("otel-rules", "bundled-only", "", "", ""); + + assertHttpStatus(resp, HttpStatus.OK); + final AggregatedHttpResponse agg = + resp.aggregate().toCompletableFuture().join(); + assertEquals(yaml, agg.contentUtf8()); + assertEquals("BUNDLED", agg.headers().get("X-Sw-Status")); + assertEquals("bundled", agg.headers().get("X-Sw-Source")); + } finally { + clearStaticRegistry(); + } + } + + @Test + void getRuleReturns404WhenNoRowAndNoStatic() throws Exception { + clearStaticRegistry(); + whenDaoHasRow("otel-rules", "absent", null, null); + + final HttpResponse resp = handler.get("otel-rules", "absent", "", "", ""); + + assertHttpStatus(resp, HttpStatus.NOT_FOUND); + } + + @Test + void getRuleReturnsJsonEnvelopeOnAcceptJson() throws Exception { + // JSON envelope must use standard JSON-string escaping (no base64). Multi-line YAML + // → \n in the content field; a JSON parser yields the original bytes back. 
+ final String yaml = "metricPrefix: vm\nmetricsRules:\n - name: cpu\n"; + whenDaoHasRow("otel-rules", "vm", yaml, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.get("otel-rules", "vm", "", "application/json", ""); + + assertHttpStatus(resp, HttpStatus.OK); + final AggregatedHttpResponse agg = + resp.aggregate().toCompletableFuture().join(); + final String body = agg.contentUtf8(); + assertEquals(true, body.startsWith("{") && body.endsWith("}"), + "expected JSON envelope, got: " + body); + // Newline in the YAML must be JSON-escaped, NOT raw and NOT base64. + assertEquals(true, body.contains("\\n"), + "expected JSON-escaped newline in content field, got: " + body); + assertEquals(true, body.contains("\"source\":\"runtime\""), + "expected source=runtime, got: " + body); + } + + @Test + void getRuleReturns304OnIfNoneMatchHashMatch() throws Exception { + final String yaml = minimalMalYaml(); + whenDaoHasRow("otel-rules", "vm", yaml, RuntimeRule.STATUS_ACTIVE); + final String currentETag = "\"" + sha256Hex(yaml) + "\""; + + final HttpResponse resp = handler.get("otel-rules", "vm", "", "", currentETag); + + assertHttpStatus(resp, HttpStatus.NOT_MODIFIED); + // 304 still emits metadata headers so the editor can refresh its cached state + // without re-downloading the body. 
+ final AggregatedHttpResponse agg = + resp.aggregate().toCompletableFuture().join(); + assertEquals(currentETag, agg.headers().get(HttpHeaderNames.ETAG)); + } + + @Test + void getRuleReturns400OnInvalidCatalog() throws Exception { + final HttpResponse resp = handler.get("not-a-catalog", "vm", "", "", ""); + + assertHttpStatus(resp, HttpStatus.BAD_REQUEST); + } + + @Test + void listBundledReturnsAllForCatalog() throws Exception { + clearStaticRegistry(); + final StaticRuleRegistry registry = + StaticRuleRegistry.active(); + registry.record("otel-rules", "alpha", "alpha\n".getBytes(StandardCharsets.UTF_8)); + registry.record("otel-rules", "beta", "beta\n".getBytes(StandardCharsets.UTF_8)); + // A different catalog's entries must be excluded from this catalog's response. + registry.record("lal", "gamma", "gamma\n".getBytes(StandardCharsets.UTF_8)); + whenDaoHasRow("otel-rules", "absent", null, null); // empty DAO so overridden=false everywhere + try { + final HttpResponse resp = handler.listBundled("otel-rules", "true"); + + assertHttpStatus(resp, HttpStatus.OK); + final String body = resp.aggregate().toCompletableFuture().join().contentUtf8(); + assertEquals(true, body.contains("\"name\":\"alpha\""), + "expected alpha in bundled list, got: " + body); + assertEquals(true, body.contains("\"name\":\"beta\""), + "expected beta in bundled list, got: " + body); + assertEquals(false, body.contains("\"name\":\"gamma\""), + "lal entry leaked into otel-rules list: " + body); + assertEquals(true, body.contains("\"kind\":\"bundled\"")); + } finally { + clearStaticRegistry(); + } + } + + @Test + void listBundledOmitsContentWhenWithContentFalse() throws Exception { + clearStaticRegistry(); + StaticRuleRegistry.active() + .record("otel-rules", "alpha", "alpha\n".getBytes(StandardCharsets.UTF_8)); + whenDaoHasRow("otel-rules", "absent", null, null); + try { + final HttpResponse resp = handler.listBundled("otel-rules", "false"); + + assertHttpStatus(resp, HttpStatus.OK); + final 
String body = resp.aggregate().toCompletableFuture().join().contentUtf8(); + // contentHash must always be present so Studio can decide whether to fetch content + // lazily; content must be absent when withContent=false. + assertEquals(true, body.contains("\"contentHash\""), + "expected contentHash, got: " + body); + assertEquals(false, body.contains("\"content\":\""), + "expected no content field when withContent=false, got: " + body); + } finally { + clearStaticRegistry(); + } + } + + @Test + void listBundledMarksOverriddenWhenRuntimeRowExists() throws Exception { + clearStaticRegistry(); + StaticRuleRegistry.active() + .record("otel-rules", "vm", "static-vm\n".getBytes(StandardCharsets.UTF_8)); + // Operator override exists for "vm" via a runtime row. + whenDaoHasRow("otel-rules", "vm", "override-vm\n", RuntimeRule.STATUS_ACTIVE); + try { + final HttpResponse resp = handler.listBundled("otel-rules", "true"); + + assertHttpStatus(resp, HttpStatus.OK); + final String body = resp.aggregate().toCompletableFuture().join().contentUtf8(); + assertEquals(true, body.contains("\"overridden\":true"), + "expected overridden=true for the bundled rule, got: " + body); + } finally { + clearStaticRegistry(); + } + } + + @Test + void listBundledReturns400OnInvalidCatalog() throws Exception { + final HttpResponse resp = handler.listBundled("not-a-catalog", "true"); + + assertHttpStatus(resp, HttpStatus.BAD_REQUEST); + } + + /** Reflection helper — StaticRuleRegistry.clear() is package-private. 
*/ + private static void clearStaticRegistry() throws Exception { + final Method m = StaticRuleRegistry.class.getDeclaredMethod("clear"); + m.setAccessible(true); + m.invoke(StaticRuleRegistry.active()); + } + + private static String sha256Hex(final String s) { + return ContentHash.sha256Hex(s); + } + + // ---- helpers -------------------------------------------------------------------------- + + private void whenDaoHasRow(final String catalog, final String name, + final String content, final String status) throws Exception { + if (content == null) { + when(dao.getAll()).thenReturn(Collections.emptyList()); + return; + } + final RuntimeRuleManagementDAO.RuntimeRuleFile row = + new RuntimeRuleManagementDAO.RuntimeRuleFile(catalog, name, content, status, 0L); + when(dao.getAll()).thenReturn(Arrays.asList(row)); + } + + private void whenReconcilerApplySucceeds(final String catalog, final String name) { + final DSLRuntimeState state = DSLRuntimeState.running(catalog, name, "hash", 0L); + // Stub both overloads — FILTER_ONLY path uses the single-arg form; STRUCTURAL / + // NEW paths use the two-arg form (deferCommit=true). + when(dslManager.applyNowForRuleFile(any())).thenReturn(state); + when(dslManager.applyNowForRuleFile(any(), Mockito.anyBoolean())).thenReturn(state); + // Apply path needs SUSPENDED on localSuspend so the workflow proceeds. The other + // subsystem getters were stubbed in setUp() with default mocks; here we just + // override localSuspend to return SUSPENDED instead of the default null. 
+ final SuspendResumeCoordinator suspendCoord = dslManager.getSuspendCoord(); + when(suspendCoord.localSuspend(Mockito.anyString(), Mockito.anyString())) + .thenReturn(SuspendResult.SUSPENDED); + } + + private static void assertHttpStatus(final HttpResponse resp, final HttpStatus expected) { + final ResponseHeaders headers = resp.aggregate().toCompletableFuture().join().headers(); + assertEquals(expected.code(), headers.status().code(), + "unexpected HTTP status (full response: " + headers + ")"); + } + + private static String minimalMalYaml() { + return "metricPrefix: meter_vm\n" + + "expSuffix: service(['host'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + " - name: cpu\n" + + " exp: cpu_seconds.sum(['host'])\n"; + } + + @SuppressWarnings("unused") + private static List ignoreListReturn() { + // Surfaces the unused-import check for ResponseHeaders if it ever loses usage — the + // compile fails before the ignore would matter, so this helper just keeps the import + // alive for future assertions on individual headers. + return Collections.emptyList(); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/AppliedRuleScriptLockTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/AppliedRuleScriptLockTest.java new file mode 100644 index 000000000000..5af6e1f602ac --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/AppliedRuleScriptLockTest.java @@ -0,0 +1,111 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.state; + +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; +import java.util.concurrent.CountDownLatch; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.locks.ReentrantLock; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertNotSame; +import static org.junit.jupiter.api.Assertions.assertSame; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Verifies the lazy lock semantics on the unified {@code rules} map via + * {@link AppliedRuleScript#lockFor}. The lock used to live in a dedicated + * {@code PerFileLockMap}; it now lives on each {@link AppliedRuleScript}, lazy-created on + * first {@code lockFor} call. The contract this test enforces is unchanged: same key → + * same lock instance; different keys → independent locks; the lock is a real mutex. + */ +class AppliedRuleScriptLockTest { + + @Test + void sameKeyReturnsSameLock() { + // Apply-path correctness depends on a single mutex per file — if two concurrent + // dslManager ticks for the same rule got different ReentrantLock instances, the + // compile+swap sequence would not actually be serialized. 
+ final Map<String, AppliedRuleScript> rules = new ConcurrentHashMap<>(); + final ReentrantLock a = AppliedRuleScript.lockFor(rules, "mal", "vm.yaml"); + final ReentrantLock b = AppliedRuleScript.lockFor(rules, "mal", "vm.yaml"); + assertSame(a, b); + } + + @Test + void differentFilesGetIndependentLocks() { + final Map<String, AppliedRuleScript> rules = new ConcurrentHashMap<>(); + final ReentrantLock a = AppliedRuleScript.lockFor(rules, "mal", "vm.yaml"); + final ReentrantLock b = AppliedRuleScript.lockFor(rules, "mal", "k8s.yaml"); + assertNotSame(a, b); + } + + @Test + void differentCatalogsWithSameNameGetIndependentLocks() { + // A MAL file and a LAL file named "demo" are distinct bundles and must not share a lock. + final Map<String, AppliedRuleScript> rules = new ConcurrentHashMap<>(); + final ReentrantLock a = AppliedRuleScript.lockFor(rules, "mal", "demo"); + final ReentrantLock b = AppliedRuleScript.lockFor(rules, "lal", "demo"); + assertNotSame(a, b); + } + + @Test + void lockActuallyBlocksAcrossThreads() throws Exception { + // Sanity: the returned ReentrantLock is a real mutex (not a no-op). Caller holds it on + // one thread, second thread's tryLock must observe the locked state. 
+ final Map<String, AppliedRuleScript> rules = new ConcurrentHashMap<>(); + final ReentrantLock lock = AppliedRuleScript.lockFor(rules, "mal", "vm.yaml"); + lock.lock(); + try { + final CountDownLatch done = new CountDownLatch(1); + final boolean[] acquired = {true}; + final Thread t = new Thread(() -> { + try { + acquired[0] = AppliedRuleScript.lockFor(rules, "mal", "vm.yaml") + .tryLock(50, TimeUnit.MILLISECONDS); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + } finally { + done.countDown(); + } + }); + t.start(); + assertTrue(done.await(2, TimeUnit.SECONDS), "probing thread should finish"); + assertFalse(acquired[0], "second thread must not acquire a held lock"); + } finally { + lock.unlock(); + } + } + + @Test + void lockSurvivesWithStateBuilders() { + // A with* builder produces a new AppliedRuleScript; the lock must remain stable so + // a thread that acquired the lock on the prior instance and released the map slot + // mid-update still owns the same mutex when it unlocks. This is the invariant that + // makes consolidating snapshot+content+lock+applied into one AppliedRuleScript safe. 
+ final Map<String, AppliedRuleScript> rules = new ConcurrentHashMap<>(); + final ReentrantLock first = AppliedRuleScript.lockFor(rules, "mal", "vm.yaml"); + rules.compute("mal:vm.yaml", (k, prev) -> prev.withContent("body")); + final ReentrantLock second = AppliedRuleScript.lockFor(rules, "mal", "vm.yaml"); + assertSame(first, second, + "lock identity must survive with* state replacements on the same key"); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/DSLRuntimeStateTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/DSLRuntimeStateTest.java new file mode 100644 index 000000000000..75e0911b655d --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/state/DSLRuntimeStateTest.java @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.state; + +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertNotSame; +import static org.junit.jupiter.api.Assertions.assertNull; +import static org.junit.jupiter.api.Assertions.assertSame; + +class DSLRuntimeStateTest { + + @Test + void runningFactoryProducesLiveState() { + final DSLRuntimeState s = DSLRuntimeState.running("mal", "vm.yaml", "abc123", 1000L); + assertEquals("mal", s.getCatalog()); + assertEquals("vm.yaml", s.getName()); + assertEquals("abc123", s.getContentHash()); + assertEquals(DSLRuntimeState.LocalState.RUNNING, s.getLocalState()); + assertEquals(DSLRuntimeState.LoaderGc.LIVE, s.getLoaderGc()); + assertNull(s.getLastApplyError()); + assertEquals(1000L, s.getLastAppliedAtMs()); + assertEquals(1000L, s.getEnteredCurrentStateAtMs()); + } + + @Test + void withLocalStateReturnsNewInstanceOnChange() { + final DSLRuntimeState s1 = DSLRuntimeState.running("mal", "vm.yaml", "abc", 1000L); + final DSLRuntimeState s2 = s1.withLocalState(DSLRuntimeState.LocalState.SUSPENDED, 2000L); + assertNotSame(s1, s2); + // Original snapshot intact — readers that captured s1 never observe s2's mutation. + assertEquals(DSLRuntimeState.LocalState.RUNNING, s1.getLocalState()); + assertEquals(DSLRuntimeState.LocalState.SUSPENDED, s2.getLocalState()); + // Entering a state stamps the transition time; lastAppliedAtMs unchanged (still the + // last *successful apply*, not the last state transition). + assertEquals(2000L, s2.getEnteredCurrentStateAtMs()); + assertEquals(1000L, s2.getLastAppliedAtMs()); + } + + @Test + void withLocalStateIsIdentityOnSameState() { + // Same-value withers must short-circuit — the dslManager calls these unconditionally on + // every tick and allocating a new DSLRuntimeState per no-op would thrash the state map. 
+ final DSLRuntimeState s1 = DSLRuntimeState.running("mal", "vm.yaml", "abc", 1000L); + final DSLRuntimeState s2 = s1.withLocalState(DSLRuntimeState.LocalState.RUNNING, 9999L); + assertSame(s1, s2); + } + + @Test + void withLoaderGcTransitionsPendingThenCollected() { + final DSLRuntimeState live = DSLRuntimeState.running("mal", "vm.yaml", "abc", 1000L); + final DSLRuntimeState pending = live.withLoaderGc(DSLRuntimeState.LoaderGc.PENDING); + final DSLRuntimeState collected = pending.withLoaderGc(DSLRuntimeState.LoaderGc.COLLECTED); + assertEquals(DSLRuntimeState.LoaderGc.LIVE, live.getLoaderGc()); + assertEquals(DSLRuntimeState.LoaderGc.PENDING, pending.getLoaderGc()); + assertEquals(DSLRuntimeState.LoaderGc.COLLECTED, collected.getLoaderGc()); + } + + @Test + void withApplyErrorStampsTimestampAndMessage() { + final DSLRuntimeState s1 = DSLRuntimeState.running("mal", "vm.yaml", "abc", 1000L); + final DSLRuntimeState s2 = s1.withApplyError("compile failed", 5000L); + assertEquals("compile failed", s2.getLastApplyError()); + assertEquals(5000L, s2.getLastAppliedAtMs()); + // enteredCurrentStateAtMs is not advanced: an error does not change local state, just + // annotates the outcome of the most recent apply attempt. + assertEquals(1000L, s2.getEnteredCurrentStateAtMs()); + } + + @Test + void withContentHashRefreshesAppliedAndEntered() { + final DSLRuntimeState s1 = DSLRuntimeState.running("mal", "vm.yaml", "old", 1000L); + final DSLRuntimeState s2 = s1.withContentHash("new", 7000L); + assertEquals("new", s2.getContentHash()); + // A new content hash is always a successful re-apply, so both timestamps advance. 
+ assertEquals(7000L, s2.getLastAppliedAtMs()); + assertEquals(7000L, s2.getEnteredCurrentStateAtMs()); + } + + @Test + void withContentHashIsIdentityOnSameHash() { + final DSLRuntimeState s1 = DSLRuntimeState.running("mal", "vm.yaml", "abc", 1000L); + final DSLRuntimeState s2 = s1.withContentHash("abc", 9999L); + assertSame(s1, s2); + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/util/ContentHashTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/util/ContentHashTest.java new file mode 100644 index 000000000000..bd17458a1fa6 --- /dev/null +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/util/ContentHashTest.java @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.util; + +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertNotEquals; + +class ContentHashTest { + + @Test + void nullContentReturnsEmptyString() { + // Null handling is deliberate — the dslManager represents "no content" as "" so state + // maps can carry a non-null hash field for bundles that never compiled. + assertEquals("", ContentHash.sha256Hex(null)); + } + + @Test + void emptyContentHasDefinedHash() { + // Known SHA-256 of the empty string, documented in the NIST FIPS 180-4 examples. Locked + // down as a canary — if the algorithm selection or encoding drifts, this catches it. + assertEquals( + "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855", + ContentHash.sha256Hex("")); + } + + @Test + void knownVectorMatches() { + // RFC 6234 test vector for "abc" — cross-implementation reference. + assertEquals( + "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad", + ContentHash.sha256Hex("abc")); + } + + @Test + void sameContentHashesIdentically() { + final String yaml = "metricPrefix: demo\nmetricsRules:\n - name: r\n exp: m.sum([])"; + assertEquals(ContentHash.sha256Hex(yaml), ContentHash.sha256Hex(yaml)); + } + + @Test + void whitespaceAndCaseChangesProduceDifferentHashes() { + // Byte-identity is the whole point — any diff, however trivial, is a different bundle. 
+ assertNotEquals( + ContentHash.sha256Hex("name: x"), + ContentHash.sha256Hex("name:  x")); + assertNotEquals( + ContentHash.sha256Hex("Name: x"), + ContentHash.sha256Hex("name: x")); + } + + @Test + void hashIsAlways64HexChars() { + final String hash = ContentHash.sha256Hex("payload"); + assertEquals(64, hash.length()); + for (int i = 0; i < hash.length(); i++) { + final char c = hash.charAt(i); + final boolean isHex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'); + if (!isHex) { + throw new AssertionError("non-hex char at " + i + ": " + c); + } + } + } +} diff --git a/oap-server/server-receiver-plugin/skywalking-telegraf-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/telegraf/provider/TelegrafReceiverProvider.java b/oap-server/server-receiver-plugin/skywalking-telegraf-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/telegraf/provider/TelegrafReceiverProvider.java index c40edef1bad0..157926b93f8b 100644 --- a/oap-server/server-receiver-plugin/skywalking-telegraf-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/telegraf/provider/TelegrafReceiverProvider.java +++ b/oap-server/server-receiver-plugin/skywalking-telegraf-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/telegraf/provider/TelegrafReceiverProvider.java @@ -25,6 +25,7 @@ import org.apache.skywalking.oap.server.core.CoreModule; import org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem; import org.apache.skywalking.oap.server.core.server.HTTPHandlerRegister; +import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.library.module.ModuleDefine; import org.apache.skywalking.oap.server.library.module.ModuleProvider; import org.apache.skywalking.oap.server.library.module.ModuleStartException; @@ -70,16 +71,20 @@ public void onInitialized(final TelegrafModuleConfig initialized) { @Override public void prepare() throws ServiceNotProvidedException, 
ModuleStartException { + } + + @Override + public void start() throws ServiceNotProvidedException, ModuleStartException { + // Load static telegraf MAL rules in start() (not prepare()) so the runtime-rule + // extension chain can consult Storage — installed by CoreModuleProvider.start() + // and made query-ready by StorageModule.start() which the requiredModules list below + // forces to run first. try { configs = Rules.loadRules(TelegrafModuleConfig.CONFIG_PATH, StringUtil.isEmpty(moduleConfig.getActiveFiles()) ? Collections.emptyList() : Splitter.on(",").splitToList(moduleConfig.getActiveFiles())); } catch (IOException e) { throw new ModuleStartException("Failed to load MAL rules", e); } - } - - @Override - public void start() throws ServiceNotProvidedException, ModuleStartException { if (CollectionUtils.isNotEmpty(configs)) { HTTPHandlerRegister httpHandlerRegister = getManager().find(SharingServerModule.NAME) .provider() @@ -98,9 +103,12 @@ public void notifyAfterCompleted() throws ServiceNotProvidedException { @Override public String[] requiredModules() { + // StorageModule is declared so the runtime-rule override cache is populated before + // Rules.loadRules fires in start() above. 
return new String[] { CoreModule.NAME, - SharingServerModule.NAME + SharingServerModule.NAME, + StorageModule.NAME }; } } diff --git a/oap-server/server-receiver-plugin/skywalking-zabbix-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/zabbix/provider/ZabbixMetricsTest.java b/oap-server/server-receiver-plugin/skywalking-zabbix-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/zabbix/provider/ZabbixMetricsTest.java index 8455baf56810..d637e98643b9 100644 --- a/oap-server/server-receiver-plugin/skywalking-zabbix-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/zabbix/provider/ZabbixMetricsTest.java +++ b/oap-server/server-receiver-plugin/skywalking-zabbix-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/zabbix/provider/ZabbixMetricsTest.java @@ -93,7 +93,10 @@ public void setupMetrics() throws Throwable { meterSystem = Mockito.spy(new MeterSystem(moduleManager)); ReflectUtil.setInternalState(MetricsStreamProcessor.class, "PROCESSOR", Mockito.spy(MetricsStreamProcessor.getInstance())); - doNothing().when(MetricsStreamProcessor.getInstance()).create(any(), (StreamDefinition) any(), any()); + // MetricsStreamProcessor.create now takes a StorageManipulationOpt on every path so + // the shape-mismatch gate at the installer level can surface to stream registration. 
+ doNothing().when(MetricsStreamProcessor.getInstance()) + .create(any(), (StreamDefinition) any(), any(), any()); CoreModule coreModule = Mockito.spy(CoreModule.class); ReflectUtil.setInternalState(coreModule, "loadedProvider", moduleProvider); diff --git a/oap-server/server-starter/pom.xml b/oap-server/server-starter/pom.xml index 5b74ae9f5265..c849a88586bd 100644 --- a/oap-server/server-starter/pom.xml +++ b/oap-server/server-starter/pom.xml @@ -194,6 +194,11 @@ <artifactId>skywalking-pprof-receiver-plugin</artifactId> <version>${project.version}</version> </dependency> + <dependency> + <groupId>org.apache.skywalking</groupId> + <artifactId>skywalking-runtime-rule-receiver-plugin</artifactId> + <version>${project.version}</version> + </dependency> <dependency> <groupId>org.apache.skywalking</groupId> <artifactId>skywalking-telegraf-receiver-plugin</artifactId> diff --git a/oap-server/server-starter/src/main/resources/application.yml b/oap-server/server-starter/src/main/resources/application.yml index cbc604befc64..5f6983d66348 100644 --- a/oap-server/server-starter/src/main/resources/application.yml +++ b/oap-server/server-starter/src/main/resources/application.yml @@ -643,6 +643,23 @@ receiver-telegraf: default: activeFiles: ${SW_RECEIVER_TELEGRAF_ACTIVE_FILES:vm} +# Runtime rule admin surface for hot-update of MAL / LAL rule files. +# DISABLED BY DEFAULT. Empty selector below keeps the provider unloaded. To enable, set +# SW_RECEIVER_RUNTIME_RULE=default (or edit this block). Once enabled, the admin endpoint +# binds on port 17128. SECURITY NOTICE: the endpoint has no authentication this iteration — +# gateway-protect with IP allow-lists and never expose it to the public internet. 
+receiver-runtime-rule: + selector: ${SW_RECEIVER_RUNTIME_RULE:-} + default: + restHost: ${SW_RECEIVER_RUNTIME_RULE_REST_HOST:0.0.0.0} + restPort: ${SW_RECEIVER_RUNTIME_RULE_REST_PORT:17128} + restContextPath: ${SW_RECEIVER_RUNTIME_RULE_REST_CONTEXT_PATH:/} + restIdleTimeOut: ${SW_RECEIVER_RUNTIME_RULE_REST_IDLE_TIMEOUT:30000} + restAcceptQueueSize: ${SW_RECEIVER_RUNTIME_RULE_REST_QUEUE_SIZE:0} + httpMaxRequestHeaderSize: ${SW_RECEIVER_RUNTIME_RULE_HTTP_MAX_REQUEST_HEADER_SIZE:8192} + reconcilerIntervalSeconds: ${SW_RECEIVER_RUNTIME_RULE_RECONCILER_INTERVAL_SECONDS:30} + selfHealThresholdSeconds: ${SW_RECEIVER_RUNTIME_RULE_SELF_HEAL_THRESHOLD_SECONDS:60} + aws-firehose: selector: ${SW_RECEIVER_AWS_FIREHOSE:default} default: diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java index b1e014867a58..7fc63ec649cb 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java @@ -38,8 +38,11 @@ import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.Trace; import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.IndexRule; import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.IndexRuleBinding; +import org.apache.skywalking.banyandb.schema.v1.BanyandbSchema.SchemaKey; import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.TopNAggregation; +import java.time.Duration; import org.apache.skywalking.library.banyandb.v1.client.BanyanDBClient; +import 
org.apache.skywalking.library.banyandb.v1.client.SchemaWatcher; import org.apache.skywalking.library.banyandb.v1.client.grpc.exception.BanyanDBException; import org.apache.skywalking.library.banyandb.v1.client.metadata.ResourceExist; import org.apache.skywalking.oap.server.core.CoreModule; @@ -51,10 +54,45 @@ import org.apache.skywalking.oap.server.core.storage.annotation.BanyanDB; import org.apache.skywalking.oap.server.core.storage.model.Model; import org.apache.skywalking.oap.server.core.storage.model.ModelInstaller; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.library.client.Client; import org.apache.skywalking.oap.server.library.module.ModuleManager; import org.apache.skywalking.oap.server.library.util.CollectionUtils; +/** + * BanyanDB-side {@link ModelInstaller}. Owns the boot-time + runtime path that turns OAP's + * declared {@link Model}s into BanyanDB groups, measures, streams, properties, traces, and + * their index rules / bindings. + * + *
    + *
<ul>
+ *   <li>isExists — read-only inspection. Compares the declared shape against what + * BanyanDB currently holds; records + * {@link StorageManipulationOpt.Outcome#SKIPPED_SHAPE_MISMATCH} on + * {@link StorageManipulationOpt} when the on-disk shape diverges, so the boot loop + * can skip the affected resource and log an ERROR diff instead of silently dropping + * samples.</li>
+ *   <li>createTable — DDL path. Creates the resource if missing, updates index + * rules / bindings if drift is detected, and on time-series resources installs + * per-downsampling siblings.</li>
+ *   <li>dropTable — runtime-rule teardown. Deletes the measure / stream and its + * index rules. Resources backing in-progress writes are first paused on every node + * via the runtime-rule Suspend RPC; the dropTable call here is just the BanyanDB + * side of that cutover.</li>
+ *   <li>Schema-cutover fence — after every Create / Update / Delete that returned + * a non-zero etcd {@code mod_revision} OR that touched a known schema key, this + * installer waits (best-effort, bounded by a 2 s timeout) on + * {@link SchemaWatcher#awaitRevisionApplied} / + * {@link SchemaWatcher#awaitSchemaDeleted} for every BanyanDB data node to apply + * the change before returning to the caller. On laggard timeout it logs a warning + * naming the laggards and continues — see the runtime-rule architecture doc for + * why this is best-effort, not a hard guarantee.</li>
+ *   <li>Peer-mode shortcut — when {@code opt.flags.inspectBackend == false} + * (non-main OAP in cluster mode), {@code isExists} skips every server RPC and + * just populates the local {@link MetadataRegistry} so this peer's DAOs can + * translate the model for sample read / write. The cluster main is the only + * node that actually drives DDL.</li>
+ * </ul>
+ */ @Slf4j public class BanyanDBIndexInstaller extends ModelInstaller { // BanyanDB group setting aligned with the OAP settings @@ -68,7 +106,7 @@ public BanyanDBIndexInstaller(Client client, ModuleManager moduleManager, Banyan } @Override - public InstallInfo isExists(Model model) throws StorageException { + public InstallInfo isExists(Model model, StorageManipulationOpt opt) throws StorageException { InstallInfoBanyanDB installInfo = new InstallInfoBanyanDB(model); installInfo.setDownSampling(model.getDownsampling()); final DownSamplingConfigService downSamplingConfigService = moduleManager.find(CoreModule.NAME) @@ -79,16 +117,38 @@ public InstallInfo isExists(Model model) throws StorageException { installInfo.setTableName(metadata.name()); installInfo.setKind(metadata.getKind()); installInfo.setGroup(metadata.getGroup()); + + // Peer-mode shortcut: when the caller has inspectBackend=false the contract is + // "zero server RPCs". The cluster main has already installed the resource on + // BanyanDB; we just populate the local MetadataRegistry so this peer's DAOs + // can translate this Model for sample read/write. No checkMeasure auto-update, + // no race with main's recent work. 
+ if (!opt.getFlags().isInspectBackend()) { + registerLocallyByKind(model, downSamplingConfigService); + installInfo.setGroupExist(true); + installInfo.setTableExist(true); + installInfo.setAllExist(true); + opt.recordOutcome(metadata.getKind().name().toLowerCase(), metadata.name(), + StorageManipulationOpt.Outcome.EXISTING_MATCHED, + "peer-mode local cache refresh — no server RPC"); + return installInfo; + } + try { final BanyanDBClient c = ((BanyanDBStorageClient) this.client).client; // first check resource existence and create group if necessary - final ResourceExist resourceExist = checkResourceExistence(metadata, c); + final ResourceExist resourceExist = checkResourceExistence(metadata, c, opt); installInfo.setGroupExist(resourceExist.isHasGroup()); installInfo.setTableExist(resourceExist.isHasResource()); if (!resourceExist.isHasResource() && !BanyanDBTrace.MergeTable.class.isAssignableFrom(model.getStreamClass())) { installInfo.setAllExist(false); return installInfo; } else { + // Run shape-compat checks unless we're in the legacy no-init poll loop + // path. failOnAbsence implies the caller wants strict verification even + // in non-init mode (LOCAL_CACHE_VERIFY), so honour that instead of just + // gating on RunningMode. 
+ final boolean runShapeChecks = !RunningMode.isNoInitMode() || opt.getFlags().isFailOnAbsence(); if (model.isTimeSeries()) { // register models only locally(Schema cache) but not remotely if (model.isRecord()) { @@ -99,47 +159,48 @@ public InstallInfo isExists(Model model) throws StorageException { installInfo.setAllExist(true); return installInfo; } - if (!RunningMode.isNoInitMode()) { - checkTrace(traceModel.getTrace(), c); - checkIndexRules(model.getName(), traceModel.getIndexRules(), c); + if (runShapeChecks) { + checkTrace(traceModel.getTrace(), c, opt); + checkIndexRules(model.getName(), traceModel.getIndexRules(), c, opt); checkIndexRuleBinding( traceModel.getIndexRules(), metadata.getGroup(), metadata.name(), - BanyandbCommon.Catalog.CATALOG_TRACE, c + BanyandbCommon.Catalog.CATALOG_TRACE, c, opt ); } } else { // stream StreamModel streamModel = MetadataRegistry.INSTANCE.registerStreamModel( model, config); - if (!RunningMode.isNoInitMode()) { - checkStream(streamModel.getStream(), c); - checkIndexRules(model.getName(), streamModel.getIndexRules(), c); + if (runShapeChecks) { + checkStream(streamModel.getStream(), c, opt); + checkIndexRules(model.getName(), streamModel.getIndexRules(), c, opt); checkIndexRuleBinding( streamModel.getIndexRules(), metadata.getGroup(), metadata.name(), - BanyandbCommon.Catalog.CATALOG_STREAM, c + BanyandbCommon.Catalog.CATALOG_STREAM, c, opt ); // Stream not support server side TopN pre-aggregation } } } else { // measure MeasureModel measureModel = MetadataRegistry.INSTANCE.registerMeasureModel(model, config, downSamplingConfigService); - if (!RunningMode.isNoInitMode()) { - checkMeasure(measureModel.getMeasure(), c); - checkIndexRules(model.getName(), measureModel.getIndexRules(), c); + if (runShapeChecks) { + checkMeasure(measureModel.getMeasure(), c, opt); + checkIndexRules(model.getName(), measureModel.getIndexRules(), c, opt); checkIndexRuleBinding( measureModel.getIndexRules(), metadata.getGroup(), metadata.name(), - 
BanyandbCommon.Catalog.CATALOG_MEASURE, c + BanyandbCommon.Catalog.CATALOG_MEASURE, c, opt ); - checkTopNAggregation(model, c); + checkTopNAggregation(model, c, opt); } } } else { PropertyModel propertyModel = MetadataRegistry.INSTANCE.registerPropertyModel(model, config); - if (!RunningMode.isNoInitMode()) { - checkProperty(propertyModel.getProperty(), c); + if (runShapeChecks) { + checkProperty(propertyModel.getProperty(), c, opt); } } installInfo.setAllExist(true); + fenceOnRevision(c, opt, "isExists:" + model.getName()); return installInfo; } } catch (BanyanDBException ex) { @@ -147,8 +208,51 @@ public InstallInfo isExists(Model model) throws StorageException { } } + /** Schema-watch budget per fence call. Standalone BanyanDB converges within + * microseconds; multi-node clusters within a few hundred ms. The + * runtime-rule REST handler's Armeria request timeout is 10 s and an + * apply may fire several fences (one per downsampling), so a per-fence + * budget of 2 s leaves comfortable headroom. A real stuck data node still + * surfaces — just as a bounded WARN per fence rather than an indefinite + * hang. */ + private static final Duration FENCE_TIMEOUT = Duration.ofSeconds(2); + + /** + * If any registry write performed during the call recorded a non-zero + * mod_revision, fence on it via {@code SchemaBarrierService.AwaitRevisionApplied} + * so subsequent data writes / queries against the new shape are guaranteed to + * land on a backend that has observed the schema. No-op when no revision was + * recorded (peer-side ticks, or unchanged shape). + * + *

A non-applied result (one or more data nodes still lagging at the timeout) + * is logged at WARN; the apply still completes. Operators tail the log to spot a + * stuck node — they don't need a synchronous failure here because the next data + * write that touches the lagging node would surface the issue. + */ + private void fenceOnRevision(final BanyanDBClient client, final StorageManipulationOpt opt, + final String context) throws BanyanDBException { + final long rev = opt.getMaxModRevision(); + if (rev <= 0L) { + return; + } + final SchemaWatcher.Result result = client.getSchemaWatcher().awaitRevisionApplied(rev, FENCE_TIMEOUT); + if (!result.isApplied()) { + log.warn("BanyanDB schema-watch fence did NOT confirm revision {} within {} ms for {}; " + + "proceeding anyway. Laggards: {}", rev, FENCE_TIMEOUT.toMillis(), context, result.getLaggards()); + } else { + log.debug("BanyanDB schema-watch fence confirmed revision {} for {}", rev, context); + } + } + @Override public void createTable(Model model) throws StorageException { + // Legacy entry point preserved for binary compatibility; orchestrator calls + // the opt-aware overload. 
+ createTable(model, StorageManipulationOpt.fullInstall()); + } + + @Override + public void createTable(Model model, StorageManipulationOpt opt) throws StorageException { try { final BanyanDBClient client = ((BanyanDBStorageClient) this.client).client; DownSamplingConfigService configService = moduleManager.find(CoreModule.NAME) @@ -166,15 +270,15 @@ public void createTable(Model model) throws StorageException { if (trace != null) { log.info("install trace schema {}", model.getName()); try { - client.define(trace); + opt.recordModRevision(client.define(trace)); if (CollectionUtils.isNotEmpty(traceModel.getIndexRules())) { for (IndexRule indexRule : traceModel.getIndexRules()) { - defineIndexRule(model.getName(), indexRule, client); + opt.recordModRevision(defineIndexRule(model.getName(), indexRule, client)); } - defineIndexRuleBinding( + opt.recordModRevision(defineIndexRuleBinding( traceModel.getIndexRules(), trace.getMetadata().getGroup(), trace.getMetadata().getName(), BanyandbCommon.Catalog.CATALOG_TRACE, client - ); + )); } } catch (BanyanDBException ex) { if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { @@ -191,16 +295,16 @@ public void createTable(Model model) throws StorageException { if (stream != null) { log.info("install stream schema {}", model.getName()); try { - client.define(stream); + opt.recordModRevision(client.define(stream)); if (CollectionUtils.isNotEmpty(streamModel.getIndexRules())) { for (IndexRule indexRule : streamModel.getIndexRules()) { - defineIndexRule(model.getName(), indexRule, client); + opt.recordModRevision(defineIndexRule(model.getName(), indexRule, client)); } - defineIndexRuleBinding( + opt.recordModRevision(defineIndexRuleBinding( streamModel.getIndexRules(), stream.getMetadata().getGroup(), stream.getMetadata().getName(), BanyandbCommon.Catalog.CATALOG_STREAM, client - ); + )); } } catch (BanyanDBException ex) { if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { @@ -221,15 +325,15 @@ public void 
createTable(Model model) throws StorageException { if (measure != null) { log.info("install measure schema {}", model.getName()); try { - client.define(measure); + opt.recordModRevision(client.define(measure)); if (CollectionUtils.isNotEmpty(measureModel.getIndexRules())) { for (IndexRule indexRule : measureModel.getIndexRules()) { - defineIndexRule(model.getName(), indexRule, client); + opt.recordModRevision(defineIndexRule(model.getName(), indexRule, client)); } - defineIndexRuleBinding( + opt.recordModRevision(defineIndexRuleBinding( measureModel.getIndexRules(), measure.getMetadata().getGroup(), measure.getMetadata().getName(), BanyandbCommon.Catalog.CATALOG_MEASURE, client - ); + )); } } catch (BanyanDBException ex) { if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { @@ -241,7 +345,7 @@ public void createTable(Model model) throws StorageException { } } final MetadataRegistry.Schema schema = MetadataRegistry.INSTANCE.findMetadata(model); - defineTopNAggregation(schema, client); + defineTopNAggregation(schema, client, opt); } } } else { @@ -249,7 +353,7 @@ public void createTable(Model model) throws StorageException { Property property = propertyModel.getProperty(); log.info("install property schema {}", model.getName()); try { - client.define(property); + opt.recordModRevision(client.define(property)); } catch (BanyanDBException ex) { if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { log.info("Property schema {} already created by another OAP node", model.getName()); @@ -258,11 +362,148 @@ public void createTable(Model model) throws StorageException { } } } + // Fence on the highest mod_revision recorded during this createTable + // pass before returning. Subsequent data writes / queries against the new + // shape are guaranteed to land on a backend that has observed the schema. 
+ fenceOnRevision(client, opt, "createTable:" + model.getName()); } catch (BanyanDBException ex) { throw new StorageException("fail to create schema " + model.getName(), ex); } } + /** + * Drop the physical schema backing a runtime-removed model. Invoked by {@link ModelInstaller#whenRemoving(Model, StorageManipulationOpt)} + * during MAL/LAL hot-remove (never on the startup path). Because BanyanDB keeps one physical resource per + * logical model (per Measure / Stream / Trace / Property), dropping here is both safe and necessary — + * without it, a later re-create with a different shape would be silently rejected as ALREADY_EXISTS on the + * server while the old shape lingers. + * + *
<p>
Errors are logged but not re-thrown for NOT_FOUND (target already gone — idempotent from the caller's + * perspective). Any other {@link BanyanDBException} is wrapped as {@link StorageException} so the runtime-rule + * workflow can abort and retry on the next reconciler tick. + */ + @Override + public void dropTable(Model model) throws StorageException { + // Legacy entry point: delegate to opt-aware overload with a default opt so + // existing callers don't need to construct one. + dropTable(model, StorageManipulationOpt.fullInstall()); + } + + @Override + public void dropTable(Model model, StorageManipulationOpt opt) throws StorageException { + try { + final BanyanDBClient client = ((BanyanDBStorageClient) this.client).client; + final DownSamplingConfigService configService = moduleManager.find(CoreModule.NAME) + .provider() + .getService(DownSamplingConfigService.class); + final MetadataRegistry.SchemaMetadata metadata = MetadataRegistry.INSTANCE.parseMetadata( + model, config, configService); + final String group = metadata.getGroup(); + final String name = metadata.name(); + log.info("drop BanyanDB schema kind={} {}:{}", metadata.getKind(), group, name); + switch (metadata.getKind()) { + case MEASURE: + // Drop the TopN aggregations first (if any), then index rule bindings, index rules, then the measure. 
+ try { + opt.recordModRevision(client.deleteTopNAggregationWithRevision(group, name)); + } catch (BanyanDBException ex) { + if (!Status.Code.NOT_FOUND.equals(ex.getStatus())) { + log.warn("drop TopN aggregation {}:{} failed: {}", group, name, ex.getMessage()); + } + } + dropIndexRuleBindingsBestEffort(client, group, name, opt); + opt.recordModRevision(client.deleteMeasureWithRevision(group, name)); + break; + case STREAM: + dropIndexRuleBindingsBestEffort(client, group, name, opt); + opt.recordModRevision(client.deleteStreamWithRevision(group, name)); + break; + case TRACE: + dropIndexRuleBindingsBestEffort(client, group, name, opt); + client.deleteTrace(group, name); + break; + case PROPERTY: + client.deletePropertyDefinition(group, name); + break; + default: + throw new StorageException( + "dropTable unsupported kind=" + metadata.getKind() + " for model " + model.getName()); + } + // Fence: prefer the revision-based wait when the server recorded a tombstone + // mod_revision; otherwise fall back to AwaitSchemaDeleted keyed on the + // primary resource so callers get a hard "removed everywhere" signal. + fenceOnRevisionOrDeletion(client, opt, metadata, "dropTable:" + model.getName()); + } catch (BanyanDBException ex) { + if (Status.Code.NOT_FOUND.equals(ex.getStatus())) { + log.info("BanyanDB schema {} already absent on drop (idempotent)", model.getName()); + return; + } + throw new StorageException("fail to drop schema " + model.getName(), ex); + } + } + + /** + * Prefer {@code AwaitRevisionApplied(maxRev)} when the registry returned a + * non-zero tombstone revision; otherwise fall back to + * {@code AwaitSchemaDeleted(key)} keyed on the primary resource. The fallback + * exists because {@code mod_revision == 0} on a delete response means the server + * did not record a tombstone — the revision-based fence cannot observe a + * deletion that didn't get one. 
+ */ + private void fenceOnRevisionOrDeletion(final BanyanDBClient client, final StorageManipulationOpt opt, + final MetadataRegistry.SchemaMetadata metadata, + final String context) throws BanyanDBException { + final long rev = opt.getMaxModRevision(); + if (rev > 0L) { + fenceOnRevision(client, opt, context); + return; + } + // mod_revision was 0 on every delete — fall back to key-based deletion fence. + final String kind; + switch (metadata.getKind()) { + case MEASURE: + kind = "measure"; + break; + case STREAM: + kind = "stream"; + break; + case TRACE: + kind = "trace"; + break; + case PROPERTY: + kind = "property"; + break; + default: + return; + } + final SchemaKey key = SchemaKey.newBuilder() + .setKind(kind) + .setGroup(metadata.getGroup()) + .setName(metadata.name()) + .build(); + final SchemaWatcher.Result result = client.getSchemaWatcher().awaitSchemaDeleted(key, FENCE_TIMEOUT); + if (!result.isApplied()) { + log.warn("BanyanDB schema-watch deletion fence did NOT confirm removal of {}:{} within {} ms ({}); " + + "proceeding anyway. Laggards: {}", metadata.getGroup(), metadata.name(), + FENCE_TIMEOUT.toMillis(), context, result.getLaggards()); + } else { + log.debug("BanyanDB schema-watch confirmed removal of {}:{} ({})", metadata.getGroup(), metadata.name(), context); + } + } + + private void dropIndexRuleBindingsBestEffort(BanyanDBClient client, String group, String name, + StorageManipulationOpt opt) { + // IndexRuleBindings are named after the resource; a best-effort delete covers both the common + // binding-name pattern and leaves other objects untouched on NOT_FOUND. 
+ try { + opt.recordModRevision(client.deleteIndexRuleBindingWithRevision(group, name)); + } catch (BanyanDBException ex) { + if (!Status.Code.NOT_FOUND.equals(ex.getStatus())) { + log.warn("drop index rule binding {}:{} failed: {}", group, name, ex.getMessage()); + } + } + } + /** * Check if the group settings need to be updated */ @@ -294,7 +535,8 @@ private boolean checkGroup(MetadataRegistry.SchemaMetadata metadata, BanyanDBCli } private ResourceExist checkResourceExistence(MetadataRegistry.SchemaMetadata metadata, - BanyanDBClient client) throws BanyanDBException { + BanyanDBClient client, + StorageManipulationOpt opt) throws BanyanDBException { ResourceExist resourceExist; Group.Builder gBuilder = Group.newBuilder() @@ -405,7 +647,7 @@ private ResourceExist checkResourceExistence(MetadataRegistry.SchemaMetadata met } else { // update the group if necessary if (this.checkGroup(metadata, client)) { - client.update(gBuilder.build()); + opt.recordModRevision(client.update(gBuilder.build())); log.info("group {} updated", metadata.getGroup()); } } @@ -416,7 +658,8 @@ private ResourceExist checkResourceExistence(MetadataRegistry.SchemaMetadata met return resourceExist; } - private void defineTopNAggregation(MetadataRegistry.Schema schema, BanyanDBClient client) throws BanyanDBException { + private void defineTopNAggregation(MetadataRegistry.Schema schema, BanyanDBClient client, + StorageManipulationOpt opt) throws BanyanDBException { if (CollectionUtils.isEmpty(schema.getTopNSpecs())) { if (schema.getMetadata().getKind() == MetadataRegistry.Kind.MEASURE) { log.debug("skip null TopN Schema for [{}]", schema.getMetadata().name()); @@ -425,7 +668,7 @@ private void defineTopNAggregation(MetadataRegistry.Schema schema, BanyanDBClien } for (TopNAggregation topNSpec : schema.getTopNSpecs().values()) { try { - client.define(topNSpec); + opt.recordModRevision(client.define(topNSpec)); log.info("installed TopN schema for measure {}", schema.getMetadata().name()); } catch 
(BanyanDBException ex) { if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { @@ -466,27 +709,31 @@ private boolean checkIndexRuleProcessed(String modelName, IndexRule indexRule) { } /** - * Define the index rule if not exist and no conflict. + * Define the index rule if not exist and no conflict. Returns the etcd + * mod_revision of the write, or 0 when the rule is already processed locally + * or already exists on the server. */ - private void defineIndexRule(String modelName, + private long defineIndexRule(String modelName, IndexRule indexRule, BanyanDBClient client) throws BanyanDBException { if (checkIndexRuleProcessed(modelName, indexRule)) { - return; + return 0L; } try { - client.define(indexRule); + long rev = client.define(indexRule); log.info("new IndexRule created: {}", indexRule.getMetadata().getName()); + return rev; } catch (BanyanDBException ex) { if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { log.info("IndexRule {} already created by another OAP node", indexRule.getMetadata().getName()); + return 0L; } else { throw ex; } } } - private void defineIndexRuleBinding(List<IndexRule> indexRules, + private long defineIndexRuleBinding(List<IndexRule> indexRules, String group, String name, BanyandbCommon.Catalog catalog, @@ -494,7 +741,7 @@ private void defineIndexRuleBinding(List<IndexRule> indexRules, List<String> indexRuleNames = indexRules.stream().map(indexRule -> indexRule.getMetadata().getName()).collect( Collectors.toList()); try { - client.define(IndexRuleBinding.newBuilder() + long rev = client.define(IndexRuleBinding.newBuilder() .setMetadata(BanyandbCommon.Metadata.newBuilder() .setGroup(group) .setName(name)) @@ -504,9 +751,11 @@ private void defineIndexRuleBinding(List<IndexRule> indexRules, .addAllRules(indexRuleNames) .build()); log.info("new IndexRuleBinding created: {}", name); + return rev; } catch (BanyanDBException ex) { if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { log.info("IndexRuleBinding {} already created by another OAP node", name); + return 0L; } else {
throw ex; } @@ -514,87 +763,141 @@ private void defineIndexRuleBinding(List indexRules, } /** - * Check if the measure exists and update it if necessary + * Check if the measure exists and, when the live shape differs from the intended shape, + * either update it (on-demand operator workflow — {@link StorageManipulationOpt#isFullInstall()}) + * or skip the update and record {@link StorageManipulationOpt.Outcome#SKIPPED_SHAPE_MISMATCH} + * (static boot workflow — {@link StorageManipulationOpt#isCreateIfAbsent()}). Boot MUST + * NOT reshape the backend — reshape is an explicit operator action only. */ - private void checkMeasure(Measure measure, BanyanDBClient client) throws BanyanDBException { + private void checkMeasure(Measure measure, BanyanDBClient client, StorageManipulationOpt opt) throws BanyanDBException { Measure hisMeasure = client.findMeasure(measure.getMetadata().getGroup(), measure.getMetadata().getName()); if (hisMeasure == null) { throw new IllegalStateException("Measure: " + measure.getMetadata().getName() + " exist but can't find it from BanyanDB server"); } else { boolean equals = hisMeasure.toBuilder() .clearUpdatedAt() + .clearCreatedAt() .clearMetadata() .build() .equals(measure.toBuilder().clearMetadata().build()); if (!equals) { + if (!opt.getFlags().isUpdateOnMismatch()) { + log.error("BanyanDB measure {} shape mismatch at boot — backend holds a " + + "different shape than the declared rule. SKIPPING metric; operator " + + "must reshape via POST /runtime/rule/addOrUpdate or align the rule " + + "shape with the backend. backend={}, declared={}", + hisMeasure.getMetadata().getName(), hisMeasure, measure); + opt.recordOutcome("measure", hisMeasure.getMetadata().getName(), + StorageManipulationOpt.Outcome.SKIPPED_SHAPE_MISMATCH, + "backend shape differs from declared shape; use /runtime/rule/addOrUpdate to reshape"); + return; + } // banyanDB server can not delete or update Tags. 
- client.update(measure); + opt.recordModRevision(client.update(measure)); log.info("update Measure: {} from: {} to: {}", hisMeasure.getMetadata().getName(), hisMeasure, measure); } } } /** - * Check if the stream exists and update it if necessary + * Check if the stream exists and update (or record shape mismatch) per mode. + * See {@link #checkMeasure} for the create-if-absent vs full-install contract. */ - private void checkStream(Stream stream, BanyanDBClient client) throws BanyanDBException { + private void checkStream(Stream stream, BanyanDBClient client, StorageManipulationOpt opt) throws BanyanDBException { Stream hisStream = client.findStream(stream.getMetadata().getGroup(), stream.getMetadata().getName()); if (hisStream == null) { throw new IllegalStateException("Stream: " + stream.getMetadata().getName() + " exist but can't find it from BanyanDB server"); } else { boolean equals = hisStream.toBuilder() .clearUpdatedAt() + .clearCreatedAt() .clearMetadata() .build() - .equals(stream.toBuilder().clearUpdatedAt().clearMetadata().build()); + .equals(stream.toBuilder().clearUpdatedAt().clearCreatedAt().clearMetadata().build()); if (!equals) { - client.update(stream); + if (!opt.getFlags().isUpdateOnMismatch()) { + log.error("BanyanDB stream {} shape mismatch at boot — backend holds a " + + "different shape than the declared rule. SKIPPING; operator must " + + "reshape via POST /runtime/rule/addOrUpdate. 
backend={}, declared={}", + hisStream.getMetadata().getName(), hisStream, stream); + opt.recordOutcome("stream", hisStream.getMetadata().getName(), + StorageManipulationOpt.Outcome.SKIPPED_SHAPE_MISMATCH, + "backend shape differs from declared shape; use /runtime/rule/addOrUpdate to reshape"); + return; + } + opt.recordModRevision(client.update(stream)); log.info("update Stream: {} from: {} to: {}", hisStream.getMetadata().getName(), hisStream, stream); } } } - private void checkTrace(Trace trace, BanyanDBClient client) throws BanyanDBException { + private void checkTrace(Trace trace, BanyanDBClient client, StorageManipulationOpt opt) throws BanyanDBException { Trace hisTrace = client.findTrace(trace.getMetadata().getGroup(), trace.getMetadata().getName()); if (hisTrace == null) { throw new IllegalStateException("Trace: " + trace.getMetadata().getName() + " exist but can't find it from BanyanDB server"); } else { boolean equals = hisTrace.toBuilder() .clearUpdatedAt() + .clearCreatedAt() .clearMetadata() .build() - .equals(trace.toBuilder().clearUpdatedAt().clearMetadata().build()); + .equals(trace.toBuilder().clearUpdatedAt().clearCreatedAt().clearMetadata().build()); if (!equals) { - client.update(trace); + if (!opt.getFlags().isUpdateOnMismatch()) { + log.error("BanyanDB trace {} shape mismatch at boot — backend holds a " + + "different shape than the declared rule. SKIPPING; operator must " + + "reshape via POST /runtime/rule/addOrUpdate. 
backend={}, declared={}", + hisTrace.getMetadata().getName(), hisTrace, trace); + opt.recordOutcome("trace", hisTrace.getMetadata().getName(), + StorageManipulationOpt.Outcome.SKIPPED_SHAPE_MISMATCH, + "backend shape differs from declared shape; use /runtime/rule/addOrUpdate to reshape"); + return; + } + opt.recordModRevision(client.update(trace)); log.info("update Trace: {} from: {} to: {}", hisTrace.getMetadata().getName(), hisTrace, trace); } } } /** - * Check if the property exists and update it if necessary + * Check if the property exists and update (or record shape mismatch) per mode. + * See {@link #checkMeasure} for the create-if-absent vs full-install contract. */ - private void checkProperty(Property property, BanyanDBClient client) throws BanyanDBException { + private void checkProperty(Property property, BanyanDBClient client, StorageManipulationOpt opt) throws BanyanDBException { Property hisProperty = client.findPropertyDefinition(property.getMetadata().getGroup(), property.getMetadata().getName()); if (hisProperty == null) { throw new IllegalStateException("Property: " + property.getMetadata().getName() + " exist but can't find it from BanyanDB server"); } else { boolean equals = hisProperty.toBuilder() .clearUpdatedAt() + .clearCreatedAt() .clearMetadata() .build() - .equals(property.toBuilder().clearUpdatedAt().clearMetadata().build()); + .equals(property.toBuilder().clearUpdatedAt().clearCreatedAt().clearMetadata().build()); if (!equals) { - client.update(property); + if (!opt.getFlags().isUpdateOnMismatch()) { + log.error("BanyanDB property {} shape mismatch at boot — backend holds a " + + "different shape than the declared rule. SKIPPING; operator must " + + "reshape via POST /runtime/rule/addOrUpdate. 
backend={}, declared={}", + hisProperty.getMetadata().getName(), hisProperty, property); + opt.recordOutcome("property", hisProperty.getMetadata().getName(), + StorageManipulationOpt.Outcome.SKIPPED_SHAPE_MISMATCH, + "backend shape differs from declared shape; use /runtime/rule/addOrUpdate to reshape"); + return; + } + opt.recordModRevision(client.update(property)); log.info("update Property: {} from: {} to: {}", hisProperty.getMetadata().getName(), hisProperty, property); } } } /** - * Check if the index rules exist and update them if necessary + * Check if the index rules exist and update them if necessary. In + * {@link StorageManipulationOpt#isLocalCacheVerify() verify} mode the writes are + * skipped and a {@link StorageManipulationOpt.Outcome#SKIPPED_SHAPE_MISMATCH} is + * recorded instead — the orchestrator promotes that to a fatal boot error. */ - private void checkIndexRules(String modelName, List indexRules, BanyanDBClient client) throws BanyanDBException { + private void checkIndexRules(String modelName, List indexRules, BanyanDBClient client, StorageManipulationOpt opt) throws BanyanDBException { for (IndexRule indexRule : indexRules) { if (checkIndexRuleProcessed(modelName, indexRule)) { return; @@ -602,8 +905,14 @@ private void checkIndexRules(String modelName, List indexRules, Banya IndexRule hisIndexRule = client.findIndexRule( indexRule.getMetadata().getGroup(), indexRule.getMetadata().getName()); if (hisIndexRule == null) { + if (!opt.getFlags().isCreateMissing()) { + opt.recordOutcome("indexRule", indexRule.getMetadata().getName(), + StorageManipulationOpt.Outcome.SKIPPED_SHAPE_MISMATCH, + "IndexRule absent on backend; createMissing flag is off — refusing to define"); + continue; + } try { - client.define(indexRule); + opt.recordModRevision(client.define(indexRule)); log.info("new IndexRule created: {}", indexRule); } catch (BanyanDBException ex) { if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { @@ -615,11 +924,18 @@ private void 
checkIndexRules(String modelName, List indexRules, Banya } else { boolean equals = hisIndexRule.toBuilder() .clearUpdatedAt() + .clearCreatedAt() .clearMetadata() .build() - .equals(indexRule.toBuilder().clearUpdatedAt().clearMetadata().build()); + .equals(indexRule.toBuilder().clearUpdatedAt().clearCreatedAt().clearMetadata().build()); if (!equals) { - client.update(indexRule); + if (opt.getFlags().isFailOnShapeMismatch()) { + opt.recordOutcome("indexRule", indexRule.getMetadata().getName(), + StorageManipulationOpt.Outcome.SKIPPED_SHAPE_MISMATCH, + "IndexRule shape mismatch on backend; failOnShapeMismatch flag is on — refusing to update"); + continue; + } + opt.recordModRevision(client.update(indexRule)); log.info( "update IndexRule: {} from: {} to: {}", hisIndexRule.getMetadata().getName(), hisIndexRule, indexRule @@ -630,13 +946,16 @@ private void checkIndexRules(String modelName, List indexRules, Banya } /** - * Check if the index rule binding exists and update it if necessary. + * Check if the index rule binding exists and update it if necessary. In + * {@link StorageManipulationOpt#isLocalCacheVerify() verify} mode skip the write and + * record {@link StorageManipulationOpt.Outcome#SKIPPED_SHAPE_MISMATCH}. 
*/ private void checkIndexRuleBinding(List indexRules, String group, String name, BanyandbCommon.Catalog catalog, - BanyanDBClient client) throws BanyanDBException { + BanyanDBClient client, + StorageManipulationOpt opt) throws BanyanDBException { if (indexRules.isEmpty()) { return; } @@ -655,8 +974,14 @@ private void checkIndexRuleBinding(List indexRules, .addAllRules(indexRuleNames).build(); IndexRuleBinding hisIndexRuleBinding = client.findIndexRuleBinding(group, name); if (hisIndexRuleBinding == null) { + if (!opt.getFlags().isCreateMissing()) { + opt.recordOutcome("indexRuleBinding", name, + StorageManipulationOpt.Outcome.SKIPPED_SHAPE_MISMATCH, + "IndexRuleBinding absent on backend; createMissing flag is off — refusing to define"); + return; + } try { - client.define(indexRuleBinding); + opt.recordModRevision(client.define(indexRuleBinding)); log.info("new IndexRuleBinding created: {}", indexRuleBinding); } catch (BanyanDBException ex) { if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { @@ -668,17 +993,24 @@ private void checkIndexRuleBinding(List indexRules, } else { boolean equals = hisIndexRuleBinding.toBuilder() .clearUpdatedAt() + .clearCreatedAt() .clearMetadata() .clearBeginAt() .clearExpireAt() .build() - .equals(indexRuleBinding.toBuilder().clearMetadata().build()); + .equals(indexRuleBinding.toBuilder().clearCreatedAt().clearMetadata().build()); if (!equals) { + if (opt.getFlags().isFailOnShapeMismatch()) { + opt.recordOutcome("indexRuleBinding", name, + StorageManipulationOpt.Outcome.SKIPPED_SHAPE_MISMATCH, + "IndexRuleBinding shape mismatch on backend; failOnShapeMismatch flag is on — refusing to update"); + return; + } // update binding and use the same begin expire time - client.update(indexRuleBinding.toBuilder() + opt.recordModRevision(client.update(indexRuleBinding.toBuilder() .setBeginAt(hisIndexRuleBinding.getBeginAt()) .setExpireAt(hisIndexRuleBinding.getExpireAt()) - .build()); + .build())); log.info( "update IndexRuleBinding: {} 
from: {} to: {}", hisIndexRuleBinding.getMetadata().getName(), hisIndexRuleBinding, indexRuleBinding @@ -689,9 +1021,11 @@ private void checkIndexRuleBinding(List indexRules, /** * Check if the TopN aggregation exists and update it if necessary. - * If the TopN rules are not used, will be checked and deleted after install, in the `BanyanDBStorageProvider.notifyAfterCompleted()` + * If the TopN rules are not used, will be checked and deleted after install, in the `BanyanDBStorageProvider.notifyAfterCompleted()`. + * In {@link StorageManipulationOpt#isLocalCacheVerify() verify} mode skip the write + * and record {@link StorageManipulationOpt.Outcome#SKIPPED_SHAPE_MISMATCH}. */ - private void checkTopNAggregation(Model model, BanyanDBClient client) throws BanyanDBException { + private void checkTopNAggregation(Model model, BanyanDBClient client, StorageManipulationOpt opt) throws BanyanDBException { MetadataRegistry.Schema schema = MetadataRegistry.INSTANCE.findMetadata(model); if (schema.getTopNSpecs() == null) { return; @@ -700,8 +1034,14 @@ private void checkTopNAggregation(Model model, BanyanDBClient client) throws Ban String topNName = topNAggregation.getMetadata().getName(); TopNAggregation hisTopNAggregation = client.findTopNAggregation(schema.getMetadata().getGroup(), topNName); if (hisTopNAggregation == null) { + if (!opt.getFlags().isCreateMissing()) { + opt.recordOutcome("topN", topNName, + StorageManipulationOpt.Outcome.SKIPPED_SHAPE_MISMATCH, + "TopNAggregation absent on backend; createMissing flag is off — refusing to define"); + continue; + } try { - client.define(topNAggregation); + opt.recordModRevision(client.define(topNAggregation)); log.info("new TopNAggregation created: {}", topNAggregation); } catch (BanyanDBException ex) { if (ex.getStatus().equals(Status.Code.ALREADY_EXISTS)) { @@ -713,11 +1053,18 @@ private void checkTopNAggregation(Model model, BanyanDBClient client) throws Ban } else { boolean equals = hisTopNAggregation.toBuilder() 
.clearUpdatedAt() + .clearCreatedAt() .clearMetadata() .build() - .equals(topNAggregation.toBuilder().clearMetadata().build()); + .equals(topNAggregation.toBuilder().clearCreatedAt().clearMetadata().build()); if (!equals) { - client.update(topNAggregation); + if (opt.getFlags().isFailOnShapeMismatch()) { + opt.recordOutcome("topN", topNName, + StorageManipulationOpt.Outcome.SKIPPED_SHAPE_MISMATCH, + "TopNAggregation shape mismatch on backend; failOnShapeMismatch flag is on — refusing to update"); + continue; + } + opt.recordModRevision(client.update(topNAggregation)); log.info( "update TopNAggregation: {} from: {} to: {}", hisTopNAggregation.getMetadata().getName(), hisTopNAggregation, topNAggregation @@ -727,6 +1074,35 @@ private void checkTopNAggregation(Model model, BanyanDBClient client) throws Ban } } + /** + * Register the {@link Model} in {@link MetadataRegistry} by its kind, without touching + * the BanyanDB server. Used on the peer-mode short-circuit above — populates the + * schema cache the local DAOs read from so this node can translate Model ↔ BanyanDB + * proto for sample ingest / queries. + */ + private void registerLocallyByKind(final Model model, + final DownSamplingConfigService downSamplingConfigService) { + if (model.isTimeSeries()) { + if (model.isRecord()) { + if (BanyanDB.TraceGroup.NONE != model.getBanyanDBModelExtension().getTraceGroup()) { + MetadataRegistry.INSTANCE.registerTraceModel(model, config); + } else { + MetadataRegistry.INSTANCE.registerStreamModel(model, config); + } + } else { + try { + MetadataRegistry.INSTANCE.registerMeasureModel(model, config, downSamplingConfigService); + } catch (final StorageException ignored) { + // Peer-side registration is idempotent / best-effort; if the registry rejects + // the model (already registered, or config skew) the peer's local DAOs will + // use whatever's already cached. Main owns convergence. 
+ } + } + } else { + MetadataRegistry.INSTANCE.registerPropertyModel(model, config); + } + } + @Getter @Setter private static class InstallInfoBanyanDB extends InstallInfo { diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBRuntimeRuleManagementDAO.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBRuntimeRuleManagementDAO.java new file mode 100644 index 000000000000..e8a3c8456a96 --- /dev/null +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBRuntimeRuleManagementDAO.java @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.storage.plugin.banyandb; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.List; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.banyandb.common.v1.BanyandbCommon; +import org.apache.skywalking.banyandb.model.v1.BanyandbModel; +import org.apache.skywalking.banyandb.property.v1.BanyandbProperty.Property; +import org.apache.skywalking.library.banyandb.v1.client.TagAndValue; +import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.storage.plugin.banyandb.stream.AbstractBanyanDBDAO; + +/** + * BanyanDB read / write / delete for {@link RuntimeRule}. Stored as a BanyanDB + * {@code Property} — consistent with {@link BanyanDBUITemplateManagementDAO}. Writes go + * through {@link #save(RuntimeRule)} and use {@code PropertyStore.apply}, which is upsert + * by id. The generic {@code BanyanDBManagementDAO.insert} path is intentionally not used: + * its body just logs and returns without persisting. 
+ */ +@Slf4j +public class BanyanDBRuntimeRuleManagementDAO extends AbstractBanyanDBDAO implements RuntimeRuleManagementDAO { + + public BanyanDBRuntimeRuleManagementDAO(final BanyanDBStorageClient client) { + super(client); + } + + @Override + public List<RuntimeRuleFile> getAll() throws IOException { + final List<Property> properties = getClient().listProperties(RuntimeRule.INDEX_NAME); + final List<RuntimeRuleFile> files = new ArrayList<>(properties.size()); + for (final Property p : properties) { + files.add(parse(p)); + } + return files; + } + + @Override + public void save(final RuntimeRule rule) throws IOException { + final MetadataRegistry.Schema schema = + MetadataRegistry.INSTANCE.findManagementMetadata(RuntimeRule.INDEX_NAME); + if (schema == null) { + throw new IOException( + "BanyanDB schema for " + RuntimeRule.INDEX_NAME + " not registered yet"); + } + // Property id matches RuntimeRule.id() (catalog + "_" + name) so apply() upserts on + // the same composite key the read path keys off. Default apply strategy is MERGE, + // but every save() call writes all five tags, so MERGE behaves like full replace. + final Property property = Property.newBuilder() + .setMetadata(BanyandbCommon.Metadata.newBuilder() + .setGroup(schema.getMetadata().getGroup()) + .setName(RuntimeRule.INDEX_NAME)) + .setId(rule.id().build()) + .addTags(TagAndValue.newStringTag(RuntimeRule.CATALOG, rule.getCatalog()).build()) + .addTags(TagAndValue.newStringTag(RuntimeRule.NAME, rule.getName()).build()) + .addTags(TagAndValue.newStringTag(RuntimeRule.CONTENT, rule.getContent()).build()) + .addTags(TagAndValue.newStringTag(RuntimeRule.STATUS, rule.getStatus()).build()) + .addTags(TagAndValue.newLongTag(RuntimeRule.UPDATE_TIME, rule.getUpdateTime()).build()) + .build(); + getClient().apply(property); + } + + @Override + public void delete(final String catalog, final String name) throws IOException { + // BanyanDB property id matches the StorageID composite built by RuntimeRule.id().
+ // RuntimeRule.id() appends (CATALOG, NAME); StorageID.build() joins with the "_" separator. + final String id = catalog + "_" + name; + getClient().deleteProperty(RuntimeRule.INDEX_NAME, id); + } + + private RuntimeRuleFile parse(final Property property) { + String catalog = null; + String name = null; + String content = null; + String status = null; + long updateTime = 0L; + for (final BanyandbModel.Tag tag : property.getTagsList()) { + final TagAndValue tv = TagAndValue.fromProtobuf(tag); + final String tagName = tv.getTagName(); + final Object v = tv.getValue(); + if (RuntimeRule.CATALOG.equals(tagName)) { + catalog = asString(v); + } else if (RuntimeRule.NAME.equals(tagName)) { + name = asString(v); + } else if (RuntimeRule.CONTENT.equals(tagName)) { + content = asString(v); + } else if (RuntimeRule.STATUS.equals(tagName)) { + status = asString(v); + } else if (RuntimeRule.UPDATE_TIME.equals(tagName)) { + updateTime = v == null ? 0L : ((Number) v).longValue(); + } + } + return new RuntimeRuleFile(catalog, name, content, status, updateTime); + } + + private static String asString(final Object v) { + return v == null ? 
null : v.toString(); + } +} diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBStorageProvider.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBStorageProvider.java index 134c8f364a56..a85389882cbf 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBStorageProvider.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBStorageProvider.java @@ -34,8 +34,9 @@ import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.core.storage.cache.INetworkAddressAliasDAO; import org.apache.skywalking.oap.server.core.storage.management.UIMenuManagementDAO; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; import org.apache.skywalking.oap.server.core.storage.management.UITemplateManagementDAO; -import org.apache.skywalking.oap.server.core.storage.model.ModelCreator; +import org.apache.skywalking.oap.server.core.storage.model.ModelRegistry; import org.apache.skywalking.oap.server.core.storage.model.ModelInstaller; import org.apache.skywalking.oap.server.core.storage.profiling.asyncprofiler.IAsyncProfilerTaskLogQueryDAO; import org.apache.skywalking.oap.server.core.storage.profiling.asyncprofiler.IAsyncProfilerTaskQueryDAO; @@ -156,6 +157,11 @@ public void prepare() throws ServiceNotProvidedException, ModuleStartException { this.client = new BanyanDBStorageClient(getManager(), config); this.modelInstaller = new BanyanDBIndexInstaller(client, getManager(), this.config); + // Expose the installer so the runtime-rule reconciler can call isExists() after a + // hot-apply to verify that DDL landed as expected. 
Needed + // especially for BanyanDB, where client.define swallows ALREADY_EXISTS on shape- + // changing re-creates; the post-verify catches the silent divergence via describe+diff. + this.registerServiceImplementation(ModelInstaller.class, this.modelInstaller); // Stream this.registerServiceImplementation( @@ -182,6 +188,7 @@ IProfileThreadSnapshotQueryDAO.class, new BanyanDBProfileThreadSnapshotQueryDAO( this.config.getGlobal().getProfileTaskQueryMaxSize() )); this.registerServiceImplementation(UITemplateManagementDAO.class, new BanyanDBUITemplateManagementDAO(client)); + this.registerServiceImplementation(RuntimeRuleManagementDAO.class, new BanyanDBRuntimeRuleManagementDAO(client)); this.registerServiceImplementation(UIMenuManagementDAO.class, new BanyanDBUIMenuManagementDAO(client)); this.registerServiceImplementation(IEventQueryDAO.class, new BanyanDBEventQueryDAO(client)); this.registerServiceImplementation(ITopologyQueryDAO.class, new BanyanDBTopologyQueryDAO(client)); @@ -238,7 +245,7 @@ public void start() throws ServiceNotProvidedException, ModuleStartException { this.client.connect(); this.modelInstaller.start(); - getManager().find(CoreModule.NAME).provider().getService(ModelCreator.class).addModelListener(modelInstaller); + getManager().find(CoreModule.NAME).provider().getService(ModelRegistry.class).addModelListener(modelInstaller); } catch (Exception e) { throw new ModuleStartException(e.getMessage(), e); } diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/bulk/AbstractBulkWriteProcessor.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/bulk/AbstractBulkWriteProcessor.java index 3c85f7781302..91117699933f 100644 --- 
a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/bulk/AbstractBulkWriteProcessor.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/bulk/AbstractBulkWriteProcessor.java @@ -35,7 +35,7 @@ import org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics; @Slf4j -public abstract class AbstractBulkWriteProcessor> implements Runnable, Closeable { private final STUB stub; @@ -208,7 +208,7 @@ private Holder(AbstractWrite writeEntity, CompletableFuture future) { this.future = future; } - public static Holder create(AbstractWrite writeEntity, + public static Holder create(AbstractWrite writeEntity, CompletableFuture future) { future.whenComplete((v, t) -> { if (t != null) { diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/AbstractBanyanDBDAO.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/AbstractBanyanDBDAO.java index 144cbe609bf5..2a47f734fb0e 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/AbstractBanyanDBDAO.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/AbstractBanyanDBDAO.java @@ -396,7 +396,7 @@ protected void apply(MeasureQuery query) { }; } - protected abstract static class QueryBuilder> { + protected abstract static class QueryBuilder> { protected abstract void apply(final T query); protected PairQueryCondition eq(String name, long value) { diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIT.java 
b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIT.java deleted file mode 100644 index a8a0d3fc0e7a..000000000000 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIT.java +++ /dev/null @@ -1,356 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- * - */ - -package org.apache.skywalking.oap.server.storage.plugin.banyandb; - -import com.google.common.collect.ImmutableMap; -import com.google.common.collect.ImmutableSet; -import com.google.common.collect.Lists; -import java.time.Instant; -import java.time.temporal.ChronoUnit; -import java.util.Arrays; -import java.util.concurrent.CompletableFuture; -import java.util.concurrent.TimeUnit; -import lombok.extern.slf4j.Slf4j; -import org.apache.skywalking.banyandb.common.v1.BanyandbCommon; -import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase; -import org.apache.skywalking.banyandb.model.v1.BanyandbModel; -import org.apache.skywalking.library.banyandb.v1.client.MeasureQuery; -import org.apache.skywalking.library.banyandb.v1.client.MeasureQueryResponse; -import org.apache.skywalking.library.banyandb.v1.client.MeasureWrite; -import org.apache.skywalking.library.banyandb.v1.client.TagAndValue; -import org.apache.skywalking.library.banyandb.v1.client.TimestampRange; -import org.apache.skywalking.oap.server.core.CoreModule; -import org.apache.skywalking.oap.server.core.analysis.DownSampling; -import org.apache.skywalking.oap.server.core.analysis.Stream; -import org.apache.skywalking.oap.server.core.analysis.worker.MetricsStreamProcessor; -import org.apache.skywalking.oap.server.core.config.DownSamplingConfigService; -import org.apache.skywalking.oap.server.core.source.DefaultScopeDefine; -import org.apache.skywalking.oap.server.core.storage.StorageData; -import org.apache.skywalking.oap.server.core.storage.annotation.BanyanDB; -import org.apache.skywalking.oap.server.core.storage.annotation.Column; -import org.apache.skywalking.oap.server.core.storage.annotation.Storage; -import org.apache.skywalking.oap.server.core.storage.model.Model; -import org.apache.skywalking.oap.server.core.storage.model.StorageModels; -import org.apache.skywalking.oap.server.core.storage.type.Convert2Entity; -import 
org.apache.skywalking.oap.server.core.storage.type.Convert2Storage; -import org.apache.skywalking.oap.server.core.storage.type.StorageBuilder; -import org.apache.skywalking.oap.server.library.it.ITVersions; -import org.apache.skywalking.oap.server.library.module.ModuleDefine; -import org.apache.skywalking.oap.server.library.module.ModuleManager; -import org.apache.skywalking.oap.server.library.module.ModuleProviderHolder; -import org.apache.skywalking.oap.server.library.module.ModuleServiceHolder; -import org.apache.skywalking.oap.server.storage.plugin.banyandb.bulk.MeasureBulkWriteProcessor; -import org.apache.skywalking.oap.server.telemetry.TelemetryModule; -import org.apache.skywalking.oap.server.telemetry.api.MetricsCreator; -import org.apache.skywalking.oap.server.telemetry.none.MetricsCreatorNoop; -import org.apache.skywalking.oap.server.telemetry.none.NoneTelemetryProvider; -import org.junit.jupiter.api.Assertions; -import org.junit.jupiter.api.BeforeEach; -import org.junit.jupiter.api.Test; -import org.mockito.MockedStatic; -import org.mockito.Mockito; -import org.apache.skywalking.oap.server.testing.util.ReflectUtil; -import org.testcontainers.containers.GenericContainer; -import org.testcontainers.containers.wait.strategy.Wait; -import org.testcontainers.junit.jupiter.Container; -import org.testcontainers.junit.jupiter.Testcontainers; -import org.testcontainers.utility.DockerImageName; - -import static org.junit.jupiter.api.Assertions.assertEquals; -import static org.junit.jupiter.api.Assertions.assertFalse; -import static org.junit.jupiter.api.Assertions.assertNotNull; -import static org.junit.jupiter.api.Assertions.assertTrue; -import static org.mockito.Mockito.mock; -import static org.mockito.Mockito.mockStatic; -import static org.mockito.Mockito.when; -import static org.testcontainers.shaded.org.awaitility.Awaitility.await; - -@Slf4j -@Testcontainers -public class BanyanDBIT { - private static final String REGISTRY = "ghcr.io"; - private static final 
String IMAGE_NAME = "apache/skywalking-banyandb"; - private static final String TAG = ITVersions.get("SW_BANYANDB_COMMIT"); - - private static final String IMAGE = REGISTRY + "/" + IMAGE_NAME + ":" + TAG; - private static MockedStatic DEFAULT_SCOPE_DEFINE_MOCKED_STATIC; - protected static final int GRPC_PORT = 17912; - protected static final int HTTP_PORT = 17913; - - @Container - public GenericContainer banyanDB = new GenericContainer<>( - DockerImageName.parse(IMAGE)) - .withCommand("standalone", "--stream-root-path", "/tmp/banyandb-stream-data", - "--measure-root-path", "/tmp/banyand-measure-data" - ) - .withExposedPorts(GRPC_PORT, HTTP_PORT) - .waitingFor(Wait.forHttp("/api/healthz").forPort(HTTP_PORT)); - - private BanyanDBStorageClient client; - private BanyanDBStorageConfig config; - - protected void setUpConnection() throws Exception { - ModuleManager moduleManager = mock(ModuleManager.class); - ModuleDefine storageModule = mock(ModuleDefine.class); - BanyanDBStorageProvider provider = mock(BanyanDBStorageProvider.class); - Mockito.when(provider.getModule()).thenReturn(storageModule); - - NoneTelemetryProvider telemetryProvider = mock(NoneTelemetryProvider.class); - Mockito.when(telemetryProvider.getService(MetricsCreator.class)) - .thenReturn(new MetricsCreatorNoop()); - TelemetryModule telemetryModule = Mockito.spy(TelemetryModule.class); - ReflectUtil.setInternalState(telemetryModule, "loadedProvider", telemetryProvider); - Mockito.when(moduleManager.find(TelemetryModule.NAME)).thenReturn(telemetryModule); - log.info("create BanyanDB client and try to connect"); - config = new BanyanDBConfigLoader(provider).loadConfig(); - config.getGlobal().setTargets(banyanDB.getHost() + ":" + banyanDB.getMappedPort(GRPC_PORT)); - client = new BanyanDBStorageClient(moduleManager, config); - client.connect(); - } - - private MeasureBulkWriteProcessor processor; - - @BeforeEach - public void setUp() throws Exception { - DEFAULT_SCOPE_DEFINE_MOCKED_STATIC = 
mockStatic(DefaultScopeDefine.class); - DEFAULT_SCOPE_DEFINE_MOCKED_STATIC.when(() -> DefaultScopeDefine.nameOf(1)).thenReturn("any"); - setUpConnection(); - processor = client.createMeasureBulkProcessor(1000, 1, 1); - } - - @Test - public void testInstall() throws Exception { - DownSamplingConfigService downSamplingConfigService = new DownSamplingConfigService(Arrays.asList("minute")); - ModuleManager moduleManager = mock(ModuleManager.class); - ModuleProviderHolder moduleProviderHolder = mock(ModuleProviderHolder.class); - ModuleServiceHolder moduleServiceHolder = mock(ModuleServiceHolder.class); - when(moduleManager.find(CoreModule.NAME)).thenReturn(moduleProviderHolder); - when(moduleProviderHolder.provider()).thenReturn(moduleServiceHolder); - when(moduleServiceHolder.getService(DownSamplingConfigService.class)).thenReturn(downSamplingConfigService); - - StorageModels models = new StorageModels(); - Model model = models.add(TestMetric.class, DefaultScopeDefine.SERVICE, - new Storage("testMetric", true, DownSampling.Minute) - ); - BanyanDBIndexInstaller installer = new BanyanDBIndexInstaller(client, moduleManager, config); - installer.isExists(model); - //test Group install - String groupName = MetadataRegistry.convertGroupName( - config.getGlobal().getNamespace(), - BanyanDB.MeasureGroup.METRICS_MINUTE.getName() - ); - BanyandbCommon.Group group = client.client.findGroup(groupName); - assertEquals(BanyandbCommon.Catalog.CATALOG_MEASURE, group.getCatalog()); - assertEquals(config.getMetricsMin().getSegmentInterval(), group.getResourceOpts().getSegmentInterval().getNum()); - assertEquals(config.getMetricsMin().getShardNum(), group.getResourceOpts().getShardNum()); - assertEquals(BanyandbCommon.IntervalRule.Unit.UNIT_DAY, group.getResourceOpts().getSegmentInterval().getUnit()); - assertEquals(config.getMetricsMin().getTtl(), group.getResourceOpts().getTtl().getNum()); - assertEquals(BanyandbCommon.IntervalRule.Unit.UNIT_DAY, 
group.getResourceOpts().getTtl().getUnit()); - - installer.createTable(model); - //test Measure install - BanyandbDatabase.Measure measure = client.client.findMeasure(groupName, "testMetric_minute"); - assertEquals("storage-only", measure.getTagFamilies(0).getName()); - assertEquals("service_id", measure.getTagFamilies(0).getTags(0).getName()); - assertEquals(BanyandbDatabase.TagType.TAG_TYPE_STRING, measure.getTagFamilies(0).getTags(0).getType()); - assertEquals("searchable", measure.getTagFamilies(1).getName()); - assertEquals("tag", measure.getTagFamilies(1).getTags(0).getName()); - assertEquals(BanyandbDatabase.TagType.TAG_TYPE_STRING, measure.getTagFamilies(1).getTags(0).getType()); - assertEquals("service_id", measure.getEntity().getTagNames(0)); - assertEquals("value", measure.getFields(0).getName()); - assertEquals(BanyandbDatabase.FieldType.FIELD_TYPE_INT, measure.getFields(0).getFieldType()); - //test TopNAggregation install - BanyandbDatabase.TopNAggregation topNAggregation = client.client.findTopNAggregation( - groupName, "testMetric-service"); - assertEquals("value", topNAggregation.getFieldName()); - assertEquals("service_id", topNAggregation.getGroupByTagNames(0)); - assertEquals(BanyandbModel.Sort.SORT_DESC, topNAggregation.getFieldValueSort()); - assertEquals(10, topNAggregation.getLruSize()); - assertEquals(1000, topNAggregation.getCountersNumber()); - //test IndexRule install - BanyandbDatabase.IndexRule indexRuleTag = client.client.findIndexRule(groupName, "tag"); - assertEquals("url", indexRuleTag.getAnalyzer()); - assertTrue(indexRuleTag.getNoSort()); - //test IndexRuleBinding install - BanyandbDatabase.IndexRuleBinding indexRuleBinding = client.client.findIndexRuleBinding( - groupName, "testMetric_minute"); - assertEquals("tag", indexRuleBinding.getRules(0)); - assertEquals("testMetric_minute", indexRuleBinding.getSubject().getName()); - //test data query - Instant now = Instant.now(); - Instant begin = now.minus(15, ChronoUnit.MINUTES); - 
MeasureWrite measureWrite = client.createMeasureWrite(groupName, "testMetric_minute", now.toEpochMilli()); - measureWrite.tag("storage-only", "service_id", TagAndValue.stringTagValue("service1")) - .tag("searchable", "tag", TagAndValue.stringTagValue("tag1")) - .field("value", TagAndValue.longFieldValue(100)); - CompletableFuture f = processor.add(measureWrite); - f.exceptionally(exp -> { - Assertions.fail(exp.getMessage()); - return null; - }); - f.get(10, TimeUnit.SECONDS); - - MeasureQuery query = new MeasureQuery(Lists.newArrayList(groupName), "testMetric_minute", - new TimestampRange( - begin.toEpochMilli(), - now.plus(1, ChronoUnit.MINUTES).toEpochMilli() - ), ImmutableMap.of("service_id", "storage-only", "tag", "searchable"), - ImmutableSet.of("value") - ); - await().atMost(10, TimeUnit.SECONDS).untilAsserted(() -> { - MeasureQueryResponse resp = client.query(query); - assertNotNull(resp); - assertEquals(1, resp.getDataPoints().size()); - assertEquals("service1", resp.getDataPoints().get(0).getTagValue("service_id")); - assertEquals("tag1", resp.getDataPoints().get(0).getTagValue("tag")); - assertEquals(100, (Long) resp.getDataPoints().get(0).getFieldValue("value")); - }); - - Model updatedModel = models.add(UpdateTestMetric.class, DefaultScopeDefine.SERVICE, - new Storage("testMetric", true, DownSampling.Minute) - ); - config.getMetricsMin().setShardNum(config.getMetricsMin().getShardNum() + 1); - config.getMetricsMin().setSegmentInterval(config.getMetricsMin().getSegmentInterval() + 2); - config.getMetricsMin().setTtl(config.getMetricsMin().getTtl() + 3); - BanyanDBIndexInstaller newInstaller = new BanyanDBIndexInstaller(client, moduleManager, config); - newInstaller.isExists(updatedModel); - //test Group update - BanyandbCommon.Group updatedGroup = client.client.findGroup(groupName); - assertEquals(updatedGroup.getResourceOpts().getShardNum(), 3); - assertEquals(updatedGroup.getResourceOpts().getSegmentInterval().getNum(), 3); - 
assertEquals(updatedGroup.getResourceOpts().getTtl().getNum(), 10); - //test Measure update - BanyandbDatabase.Measure updatedMeasure = client.client.findMeasure(groupName, "testMetric_minute"); - assertEquals("storage-only", updatedMeasure.getTagFamilies(0).getName()); - assertEquals("service_id", updatedMeasure.getTagFamilies(0).getTags(0).getName()); - assertEquals(BanyandbDatabase.TagType.TAG_TYPE_STRING, updatedMeasure.getTagFamilies(0).getTags(0).getType()); - assertEquals("searchable", updatedMeasure.getTagFamilies(1).getName()); - assertEquals("tag", updatedMeasure.getTagFamilies(1).getTags(0).getName()); - assertEquals("new_tag", updatedMeasure.getTagFamilies(1).getTags(1).getName()); - assertEquals(BanyandbDatabase.TagType.TAG_TYPE_STRING, updatedMeasure.getTagFamilies(1).getTags(0).getType()); - assertEquals(BanyandbDatabase.TagType.TAG_TYPE_STRING, updatedMeasure.getTagFamilies(1).getTags(1).getType()); - assertEquals("service_id", updatedMeasure.getEntity().getTagNames(0)); - assertEquals("value", updatedMeasure.getFields(0).getName()); - assertEquals(BanyandbDatabase.FieldType.FIELD_TYPE_INT, updatedMeasure.getFields(0).getFieldType()); - assertEquals("new_value", updatedMeasure.getFields(1).getName()); - assertEquals(BanyandbDatabase.FieldType.FIELD_TYPE_INT, updatedMeasure.getFields(1).getFieldType()); - //test IndexRule update - BanyandbDatabase.IndexRule updatedIndexRuleTag = client.client.findIndexRule(groupName, "tag"); - assertEquals("", updatedIndexRuleTag.getAnalyzer()); - assertFalse(updatedIndexRuleTag.getNoSort()); - BanyandbDatabase.IndexRule updatedIndexRuleNewTag = client.client.findIndexRule(groupName, "new_tag"); - assertTrue(updatedIndexRuleNewTag.getNoSort()); - //test IndexRuleBinding update - BanyandbDatabase.IndexRuleBinding updatedIndexRuleBinding = client.client.findIndexRuleBinding( - groupName, "testMetric_minute"); - assertEquals("tag", updatedIndexRuleBinding.getRules(0)); - assertEquals("new_tag", 
updatedIndexRuleBinding.getRules(1)); - assertEquals("testMetric_minute", updatedIndexRuleBinding.getSubject().getName()); - //test data - MeasureWrite updatedMeasureWrite = client.createMeasureWrite(groupName, "testMetric_minute", now.plus(10, ChronoUnit.MINUTES).toEpochMilli()); - updatedMeasureWrite.tag("storage-only", "service_id", TagAndValue.stringTagValue("service2")) - .tag("searchable", "tag", TagAndValue.stringTagValue("tag1")) - .tag("searchable", "new_tag", TagAndValue.stringTagValue("new_tag1")) - .field("value", TagAndValue.longFieldValue(101)) - .field("new_value", TagAndValue.longFieldValue(1000)); - CompletableFuture cf = processor.add(updatedMeasureWrite); - cf.exceptionally(exp -> { - Assertions.fail(exp.getMessage()); - return null; - }); - cf.get(10, TimeUnit.SECONDS); - MeasureQuery updatedQuery = new MeasureQuery( - Lists.newArrayList(groupName), "testMetric_minute", - new TimestampRange(begin.toEpochMilli(), now.plus(15, ChronoUnit.MINUTES).toEpochMilli()), - ImmutableMap.of("service_id", "storage-only", "tag", "searchable", "new_tag", "searchable"), - ImmutableSet.of("value", "new_value") - ); - await().atMost(10, TimeUnit.SECONDS).untilAsserted(() -> { - MeasureQueryResponse updatedResp = client.query(updatedQuery); - assertNotNull(updatedResp); - assertEquals(2, updatedResp.getDataPoints().size()); - assertEquals("service1", updatedResp.getDataPoints().get(0).getTagValue("service_id")); - assertEquals("tag1", updatedResp.getDataPoints().get(0).getTagValue("tag")); - assertEquals(100, (Long) updatedResp.getDataPoints().get(0).getFieldValue("value")); - assertEquals("service2", updatedResp.getDataPoints().get(1).getTagValue("service_id")); - assertEquals("tag1", updatedResp.getDataPoints().get(1).getTagValue("tag")); - assertEquals("new_tag1", updatedResp.getDataPoints().get(1).getTagValue("new_tag")); - assertEquals(101, (Long) updatedResp.getDataPoints().get(1).getFieldValue("value")); - assertEquals(1000, (Long) 
updatedResp.getDataPoints().get(1).getFieldValue("new_value")); - }); - } - - @Stream(name = "testMetric", scopeId = DefaultScopeDefine.SERVICE, - builder = TestMetric.Builder.class, processor = MetricsStreamProcessor.class) - private static class TestMetric { - @Column(name = "service_id") - @BanyanDB.SeriesID(index = 0) - @BanyanDB.ShardingKey(index = 0) - private String serviceId; - @Column(name = "tag") - @BanyanDB.MatchQuery(analyzer = BanyanDB.MatchQuery.AnalyzerType.URL) - private String tag; - @Column(name = "value", dataType = Column.ValueDataType.COMMON_VALUE) - @BanyanDB.MeasureField - private long value; - - static class Builder implements StorageBuilder { - @Override - public StorageData storage2Entity(final Convert2Entity converter) { - return null; - } - - @Override - public void entity2Storage(final StorageData entity, final Convert2Storage converter) { - - } - } - } - - @Stream(name = "testMetric", scopeId = DefaultScopeDefine.SERVICE, - builder = UpdateTestMetric.Builder.class, processor = MetricsStreamProcessor.class) - private static class UpdateTestMetric { - @Column(name = "service_id") - @BanyanDB.SeriesID(index = 0) - private String serviceId; - @Column(name = "tag") - @BanyanDB.EnableSort - private String tag; - @Column(name = "new_tag") - private String newTag; - @Column(name = "value", dataType = Column.ValueDataType.COMMON_VALUE) - @BanyanDB.MeasureField - private long value; - @Column(name = "new_value", storageOnly = true) - @BanyanDB.MeasureField - private long newValue; - - static class Builder implements StorageBuilder { - @Override - public StorageData storage2Entity(final Convert2Entity converter) { - return null; - } - - @Override - public void entity2Storage(final StorageData entity, final Convert2Storage converter) { - - } - } - } -} diff --git a/oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/StorageModuleElasticsearchProvider.java 
b/oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/StorageModuleElasticsearchProvider.java index aad41665f6a3..7bd80e7a733a 100644 --- a/oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/StorageModuleElasticsearchProvider.java +++ b/oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/StorageModuleElasticsearchProvider.java @@ -34,8 +34,11 @@ import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.core.storage.cache.INetworkAddressAliasDAO; import org.apache.skywalking.oap.server.core.storage.management.UIMenuManagementDAO; +import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; import org.apache.skywalking.oap.server.core.storage.management.UITemplateManagementDAO; -import org.apache.skywalking.oap.server.core.storage.model.ModelCreator; +import org.apache.skywalking.oap.server.core.storage.model.ModelRegistry; +import org.apache.skywalking.oap.server.core.storage.model.ModelInstaller; import org.apache.skywalking.oap.server.core.storage.profiling.asyncprofiler.IAsyncProfilerTaskLogQueryDAO; import org.apache.skywalking.oap.server.core.storage.profiling.asyncprofiler.IAsyncProfilerTaskQueryDAO; import org.apache.skywalking.oap.server.core.storage.profiling.asyncprofiler.IJFRDataQueryDAO; @@ -108,6 +111,7 @@ import org.apache.skywalking.oap.server.storage.plugin.elasticsearch.query.TopologyQueryEsDAO; import org.apache.skywalking.oap.server.storage.plugin.elasticsearch.query.TraceQueryEsDAO; import org.apache.skywalking.oap.server.storage.plugin.elasticsearch.query.UIMenuManagementEsDAO; +import 
org.apache.skywalking.oap.server.storage.plugin.elasticsearch.query.RuntimeRuleManagementEsDAO; import org.apache.skywalking.oap.server.storage.plugin.elasticsearch.query.UITemplateManagementEsDAO; import org.apache.skywalking.oap.server.storage.plugin.elasticsearch.query.zipkin.ZipkinQueryEsDAO; import org.apache.skywalking.oap.server.telemetry.TelemetryModule; @@ -213,6 +217,11 @@ public void prepare() throws ServiceNotProvidedException { config.getNumHttpClientThread() ); modelInstaller = new StorageEsInstaller(elasticSearchClient, getManager(), config); + // Expose the installer so the runtime-rule reconciler can call isExists() after a + // hot-apply to verify that DDL landed. On ES, verify compares + // the logic-shard structure + mapping + index settings; a missing field / mapping + // diff surfaces as a clear WARN to operators. + this.registerServiceImplementation(ModelInstaller.class, modelInstaller); this.registerServiceImplementation( IBatchDAO.class, @@ -246,6 +255,8 @@ IProfileThreadSnapshotQueryDAO.class, new ProfileThreadSnapshotQueryEsDAO(elasti .getProfileTaskQueryMaxSize())); this.registerServiceImplementation( UITemplateManagementDAO.class, new UITemplateManagementEsDAO(elasticSearchClient, new UITemplate.Builder())); + this.registerServiceImplementation( + RuntimeRuleManagementDAO.class, new RuntimeRuleManagementEsDAO(elasticSearchClient, new RuntimeRule.Builder())); this.registerServiceImplementation( UIMenuManagementDAO.class, new UIMenuManagementEsDAO(elasticSearchClient, new UIMenu.Builder())); @@ -330,7 +341,7 @@ public void start() throws ModuleStartException { getManager().find(CoreModule.NAME) .provider() - .getService(ModelCreator.class) + .getService(ModelRegistry.class) .addModelListener(modelInstaller); } catch (Exception e) { throw new ModuleStartException(e.getMessage(), e); diff --git 
a/oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/base/StorageEsInstaller.java b/oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/base/StorageEsInstaller.java index 7d0a577621ea..c0cc609d875a 100644 --- a/oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/base/StorageEsInstaller.java +++ b/oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/base/StorageEsInstaller.java @@ -33,6 +33,7 @@ import org.apache.skywalking.oap.server.core.storage.model.Model; import org.apache.skywalking.oap.server.core.storage.model.ModelColumn; import org.apache.skywalking.oap.server.core.storage.model.ModelInstaller; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.library.client.Client; import org.apache.skywalking.oap.server.library.client.elasticsearch.ElasticSearchClient; import org.apache.skywalking.oap.server.library.module.ModuleManager; @@ -82,7 +83,7 @@ protected IndexStructures getStructures() { } @Override - public InstallInfo isExists(Model model) throws StorageException { + public InstallInfo isExists(Model model, StorageManipulationOpt opt) throws StorageException { InstallInfoES installInfo = new InstallInfoES(model, config); ElasticSearchClient esClient = (ElasticSearchClient) client; String tableName = IndexController.INSTANCE.getTableName(model); diff --git a/oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/query/RuntimeRuleManagementEsDAO.java 
b/oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/query/RuntimeRuleManagementEsDAO.java new file mode 100644 index 000000000000..d476e7d178da --- /dev/null +++ b/oap-server/server-storage-plugin/storage-elasticsearch-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/elasticsearch/query/RuntimeRuleManagementEsDAO.java @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.storage.plugin.elasticsearch.query; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.library.elasticsearch.requests.search.BoolQueryBuilder; +import org.apache.skywalking.library.elasticsearch.requests.search.Query; +import org.apache.skywalking.library.elasticsearch.requests.search.Search; +import org.apache.skywalking.library.elasticsearch.requests.search.SearchBuilder; +import org.apache.skywalking.library.elasticsearch.response.search.SearchHit; +import org.apache.skywalking.library.elasticsearch.response.search.SearchResponse; +import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.core.storage.type.StorageBuilder; +import org.apache.skywalking.oap.server.library.client.elasticsearch.ElasticSearchClient; +import org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.ManagementCRUDEsDAO; +import org.apache.skywalking.oap.server.storage.plugin.elasticsearch.base.IndexController; + +@Slf4j +public class RuntimeRuleManagementEsDAO extends ManagementCRUDEsDAO implements RuntimeRuleManagementDAO { + @SuppressWarnings({"rawtypes", "unchecked"}) + public RuntimeRuleManagementEsDAO(final ElasticSearchClient client, + final StorageBuilder storageBuilder) { + super(client, storageBuilder); + } + + @Override + public List getAll() throws IOException { + final BoolQueryBuilder boolQuery = Query.bool(); + boolQuery.must(Query.term( + IndexController.LogicIndicesRegister.MANAGEMENT_TABLE_NAME, RuntimeRule.INDEX_NAME)); + // No upper bound on rule count — 10000 is the safety ceiling and matches the UITemplate + // DAO convention. Operators that approach this limit should split rules across files. 
+ final SearchBuilder search = Search.builder().query(boolQuery).size(10000); + final String index = + IndexController.LogicIndicesRegister.getPhysicalTableName(RuntimeRule.INDEX_NAME); + final SearchResponse response = getClient().search(index, search.build()); + + final List<RuntimeRuleFile> files = new ArrayList<>(); + for (final SearchHit hit : response.getHits()) { + final Map<String, Object> src = hit.getSource(); + files.add(new RuntimeRuleFile( + asString(src.get(RuntimeRule.CATALOG)), + asString(src.get(RuntimeRule.NAME)), + asString(src.get(RuntimeRule.CONTENT)), + asString(src.get(RuntimeRule.STATUS)), + asLong(src.get(RuntimeRule.UPDATE_TIME)) + )); + } + return files; + } + + @Override + public void save(final RuntimeRule rule) throws IOException { + final String index = + IndexController.LogicIndicesRegister.getPhysicalTableName(RuntimeRule.INDEX_NAME); + final String docId = RuntimeRule.INDEX_NAME + "_" + rule.getCatalog() + "_" + rule.getName(); + final Map<String, Object> source = new HashMap<>(); + source.put(RuntimeRule.CATALOG, rule.getCatalog()); + source.put(RuntimeRule.NAME, rule.getName()); + source.put(RuntimeRule.CONTENT, rule.getContent()); + source.put(RuntimeRule.STATUS, rule.getStatus()); + source.put(RuntimeRule.UPDATE_TIME, rule.getUpdateTime()); + source.put(IndexController.LogicIndicesRegister.MANAGEMENT_TABLE_NAME, RuntimeRule.INDEX_NAME); + // forceInsert maps to the ES `index` API which is upsert-by-_id; existing docs with + // the same id are replaced. The base class `create` / `update` helpers explicitly + // gate on existDoc, which is exactly the bug we're avoiding here. 
+ getClient().forceInsert(index, docId, source); + } + + @Override + public void delete(final String catalog, final String name) throws IOException { + final String index = + IndexController.LogicIndicesRegister.getPhysicalTableName(RuntimeRule.INDEX_NAME); + // Elasticsearch document id format for management-data records follows the StorageID + // composite — see RuntimeRule.id() which appends (catalog, name). IndexController + // passes the id through to the _id field on insert, so delete by that same id. + final String docId = RuntimeRule.INDEX_NAME + "_" + catalog + "_" + name; + getClient().deleteById(index, docId); + } + + private static String asString(final Object v) { + return v == null ? null : v.toString(); + } + + private static long asLong(final Object v) { + if (v == null) { + return 0L; + } + if (v instanceof Number) { + return ((Number) v).longValue(); + } + return Long.parseLong(v.toString()); + } +} diff --git a/oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/JDBCStorageProvider.java b/oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/JDBCStorageProvider.java index 7891c1c7b122..6295737d4909 100644 --- a/oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/JDBCStorageProvider.java +++ b/oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/JDBCStorageProvider.java @@ -27,8 +27,9 @@ import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.core.storage.cache.INetworkAddressAliasDAO; import org.apache.skywalking.oap.server.core.storage.management.UIMenuManagementDAO; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; import 
org.apache.skywalking.oap.server.core.storage.management.UITemplateManagementDAO; -import org.apache.skywalking.oap.server.core.storage.model.ModelCreator; +import org.apache.skywalking.oap.server.core.storage.model.ModelRegistry; import org.apache.skywalking.oap.server.core.storage.model.ModelInstaller; import org.apache.skywalking.oap.server.core.storage.profiling.asyncprofiler.IAsyncProfilerTaskLogQueryDAO; import org.apache.skywalking.oap.server.core.storage.profiling.asyncprofiler.IAsyncProfilerTaskQueryDAO; @@ -97,6 +98,7 @@ import org.apache.skywalking.oap.server.storage.plugin.jdbc.common.dao.JDBCTopologyQueryDAO; import org.apache.skywalking.oap.server.storage.plugin.jdbc.common.dao.JDBCTraceQueryDAO; import org.apache.skywalking.oap.server.storage.plugin.jdbc.common.dao.JDBCUIMenuManagementDAO; +import org.apache.skywalking.oap.server.storage.plugin.jdbc.common.dao.JDBCRuntimeRuleManagementDAO; import org.apache.skywalking.oap.server.storage.plugin.jdbc.common.dao.JDBCUITemplateManagementDAO; import org.apache.skywalking.oap.server.storage.plugin.jdbc.common.dao.JDBCZipkinQueryDAO; import org.apache.skywalking.oap.server.telemetry.TelemetryModule; @@ -144,6 +146,12 @@ public void prepare() throws ServiceNotProvidedException, ModuleStartException { modelInstaller = (JDBCTableInstaller) createModelInstaller(); tableHelper = new TableHelper(getManager(), jdbcClient); + // Expose the installer so the runtime-rule reconciler can call isExists() post-apply. + // On JDBC the verify is cheap: merged-table columns are + // compared against the Model's expected set; a missing column is a clear WARN signal + // rather than a silent schema mismatch. 
+ this.registerServiceImplementation(ModelInstaller.class, modelInstaller); + this.registerServiceImplementation( StorageBuilderFactory.class, new StorageBuilderFactory.Default()); @@ -205,6 +213,9 @@ public void prepare() throws ServiceNotProvidedException, ModuleStartException { this.registerServiceImplementation( UITemplateManagementDAO.class, new JDBCUITemplateManagementDAO(jdbcClient, tableHelper)); + this.registerServiceImplementation( + RuntimeRuleManagementDAO.class, + new JDBCRuntimeRuleManagementDAO(jdbcClient, tableHelper)); this.registerServiceImplementation( UIMenuManagementDAO.class, new JDBCUIMenuManagementDAO(jdbcClient, tableHelper)); @@ -299,7 +310,7 @@ public void start() throws ServiceNotProvidedException, ModuleStartException { getManager() .find(CoreModule.NAME) .provider() - .getService(ModelCreator.class) + .getService(ModelRegistry.class) .addModelListener(modelInstaller); } catch (StorageException e) { throw new ModuleStartException(e.getMessage(), e); diff --git a/oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/JDBCTableInstaller.java b/oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/JDBCTableInstaller.java index 57079de7ce09..6d8a46b779ac 100644 --- a/oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/JDBCTableInstaller.java +++ b/oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/JDBCTableInstaller.java @@ -32,6 +32,7 @@ import org.apache.skywalking.oap.server.core.storage.model.Model; import org.apache.skywalking.oap.server.core.storage.model.ModelColumn; import org.apache.skywalking.oap.server.core.storage.model.ModelInstaller; +import 
org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.core.storage.type.StorageDataComplexObject; import org.apache.skywalking.oap.server.library.client.Client; import org.apache.skywalking.oap.server.library.client.jdbc.hikaricp.JDBCClient; @@ -67,7 +68,7 @@ public JDBCTableInstaller(Client client, ModuleManager moduleManager) { @Override @SneakyThrows - public InstallInfo isExists(Model model) { + public InstallInfo isExists(Model model, StorageManipulationOpt opt) { InstallInfoJDBC installInfo = new InstallInfoJDBC(model); TableMetaInfo.addModel(model); diff --git a/oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/dao/JDBCRuntimeRuleManagementDAO.java b/oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/dao/JDBCRuntimeRuleManagementDAO.java new file mode 100644 index 000000000000..82d5bbf62207 --- /dev/null +++ b/oap-server/server-storage-plugin/storage-jdbc-hikaricp-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/jdbc/common/dao/JDBCRuntimeRuleManagementDAO.java @@ -0,0 +1,150 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.storage.plugin.jdbc.common.dao; + +import java.io.IOException; +import java.sql.Connection; +import java.sql.SQLException; +import java.util.ArrayList; +import java.util.List; +import lombok.RequiredArgsConstructor; +import lombok.SneakyThrows; +import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.core.storage.model.Model; +import org.apache.skywalking.oap.server.core.storage.type.HashMapConverter; +import org.apache.skywalking.oap.server.library.client.jdbc.hikaricp.JDBCClient; +import org.apache.skywalking.oap.server.storage.plugin.jdbc.SQLExecutor; +import org.apache.skywalking.oap.server.storage.plugin.jdbc.TableMetaInfo; +import org.apache.skywalking.oap.server.storage.plugin.jdbc.common.JDBCTableInstaller; +import org.apache.skywalking.oap.server.storage.plugin.jdbc.common.TableHelper; + +/** + * JDBC read, save, and delete for the {@link RuntimeRule} management table. Reuses the same + * {@code JDBCTableInstaller.TABLE_COLUMN} multi-entity pattern that every other management + * DAO in this plugin uses, so a single physical table can host records for multiple + * management models without schema churn. 
+ */ +@Slf4j +@RequiredArgsConstructor +public class JDBCRuntimeRuleManagementDAO extends JDBCSQLExecutor implements RuntimeRuleManagementDAO { + private final JDBCClient jdbcClient; + private final TableHelper tableHelper; + + @Override + @SneakyThrows + public List<RuntimeRuleFile> getAll() { + final List<String> tables = tableHelper.getTablesWithinTTL(RuntimeRule.INDEX_NAME); + final List<RuntimeRuleFile> files = new ArrayList<>(); + + for (final String table : tables) { + final StringBuilder sql = new StringBuilder(); + sql.append("select ") + .append(RuntimeRule.CATALOG).append(", ") + .append(RuntimeRule.NAME).append(", ") + .append(RuntimeRule.CONTENT).append(", ") + .append(RuntimeRule.STATUS).append(", ") + .append(RuntimeRule.UPDATE_TIME) + .append(" from ").append(table) + .append(" where ").append(JDBCTableInstaller.TABLE_COLUMN).append(" = ? "); + jdbcClient.executeQuery(sql.toString(), resultSet -> { + while (resultSet.next()) { + files.add(new RuntimeRuleFile( + resultSet.getString(RuntimeRule.CATALOG), + resultSet.getString(RuntimeRule.NAME), + resultSet.getString(RuntimeRule.CONTENT), + resultSet.getString(RuntimeRule.STATUS), + resultSet.getLong(RuntimeRule.UPDATE_TIME) + )); + } + return null; + }, RuntimeRule.INDEX_NAME); + } + return files; + } + + @Override + public void save(final RuntimeRule rule) throws IOException { + final Model model = TableMetaInfo.get(RuntimeRule.INDEX_NAME); + final RuntimeRule.Builder builder = new RuntimeRule.Builder(); + // The shared {@link JDBCSQLExecutor#getByID} is unusable for ManagementData like + // RuntimeRule: it always passes the lookup id through {@code TableHelper.generateId( + // String, String)} which prefixes with the model name, but the INSERT path uses + // {@code TableHelper.generateId(Model, String)} which returns the RAW id for + // non-record / non-function-metric types. 
RuntimeRule is non-record + non-metric, so + // its row is stored with the raw composite id while getByID looks up "runtimerule_" + // and never finds it. 
Without this workaround the second save() always falls through + // to INSERT and trips the primary-key constraint, breaking every /addOrUpdate update + // and every /inactivate after the first persist. + final String storedId = TableHelper.generateId(model, rule.id().build()); + try (Connection connection = jdbcClient.getConnection()) { + final boolean exists = rowExists(connection, storedId); + final SQLExecutor executor; + if (exists) { + executor = getUpdateExecutor(model, rule, 0, builder, null); + } else { + executor = getInsertExecutor( + model, rule, 0, builder, new HashMapConverter.ToStorage(), null); + } + executor.invoke(connection); + } catch (final SQLException e) { + throw new IOException("failed to save runtime rule " + + rule.getCatalog() + ":" + rule.getName(), e); + } + } + + /** + * Probe every TTL-shadow table for a row with the given stored id. Direct SELECT on the + * id column rather than the shared getByID helper for the prefix-mismatch reason + * documented in {@link #save(RuntimeRule)}. + */ + private boolean rowExists(final Connection connection, final String storedId) throws SQLException { + final List<String> tables = tableHelper.getTablesWithinTTL(RuntimeRule.INDEX_NAME); + for (final String table : tables) { + final String sql = "SELECT 1 FROM " + table + + " WHERE " + JDBCTableInstaller.ID_COLUMN + " = ? LIMIT 1"; + try (final var stmt = connection.prepareStatement(sql)) { + stmt.setString(1, storedId); + try (final var rs = stmt.executeQuery()) { + if (rs.next()) { + return true; + } + } + } + } + return false; + } + + @Override + public void delete(final String catalog, final String name) throws IOException { + final List<String> tables = tableHelper.getTablesWithinTTL(RuntimeRule.INDEX_NAME); + for (final String table : tables) { + final String sql = "delete from " + table + + " where " + JDBCTableInstaller.TABLE_COLUMN + " = ?" + + " and " + RuntimeRule.CATALOG + " = ?" 
+ + " and " + RuntimeRule.NAME + " = ?"; + try { + jdbcClient.executeUpdate(sql, RuntimeRule.INDEX_NAME, catalog, name); + } catch (final SQLException e) { + throw new IOException("failed to delete runtime rule " + catalog + ":" + name, e); + } + } + } +} diff --git a/oap-server/server-tools/profile-exporter/tool-profile-snapshot-server-mock/src/main/java/org/apache/skywalking/oap/server/tool/profile/core/MockCoreModuleProvider.java b/oap-server/server-tools/profile-exporter/tool-profile-snapshot-server-mock/src/main/java/org/apache/skywalking/oap/server/tool/profile/core/MockCoreModuleProvider.java index 14c67e35624c..3e08e56ba7b2 100755 --- a/oap-server/server-tools/profile-exporter/tool-profile-snapshot-server-mock/src/main/java/org/apache/skywalking/oap/server/tool/profile/core/MockCoreModuleProvider.java +++ b/oap-server/server-tools/profile-exporter/tool-profile-snapshot-server-mock/src/main/java/org/apache/skywalking/oap/server/tool/profile/core/MockCoreModuleProvider.java @@ -74,7 +74,7 @@ import org.apache.skywalking.oap.server.core.storage.StorageException; import org.apache.skywalking.oap.server.core.storage.model.IModelManager; import org.apache.skywalking.oap.server.core.trace.SpanListenerManager; -import org.apache.skywalking.oap.server.core.storage.model.ModelCreator; +import org.apache.skywalking.oap.server.core.storage.model.ModelRegistry; import org.apache.skywalking.oap.server.core.storage.model.ModelManipulator; import org.apache.skywalking.oap.server.core.storage.model.StorageModels; import org.apache.skywalking.oap.server.core.worker.IWorkerInstanceGetter; @@ -162,7 +162,7 @@ public void prepare() throws ServiceNotProvidedException, ModuleStartException { this.registerServiceImplementation(IWorkerInstanceSetter.class, instancesService); this.registerServiceImplementation(RemoteSenderService.class, new RemoteSenderService(getManager())); - this.registerServiceImplementation(ModelCreator.class, storageModels); + 
this.registerServiceImplementation(ModelRegistry.class, storageModels); this.registerServiceImplementation(IModelManager.class, storageModels); this.registerServiceImplementation(ModelManipulator.class, storageModels); diff --git a/oap-server/server-tools/profile-exporter/tool-profile-snapshot-server-mock/src/main/java/org/apache/skywalking/oap/server/tool/profile/core/mock/MockWorkerInstancesService.java b/oap-server/server-tools/profile-exporter/tool-profile-snapshot-server-mock/src/main/java/org/apache/skywalking/oap/server/tool/profile/core/mock/MockWorkerInstancesService.java index 54a734471ede..922881a16921 100644 --- a/oap-server/server-tools/profile-exporter/tool-profile-snapshot-server-mock/src/main/java/org/apache/skywalking/oap/server/tool/profile/core/mock/MockWorkerInstancesService.java +++ b/oap-server/server-tools/profile-exporter/tool-profile-snapshot-server-mock/src/main/java/org/apache/skywalking/oap/server/tool/profile/core/mock/MockWorkerInstancesService.java @@ -39,4 +39,8 @@ public RemoteHandleWorker get(String nextWorkerName) { public void put(String remoteReceiverWorkName, AbstractWorker instance, MetricStreamKind kind, Class streamDataClass) { } + + @Override + public void remove(String remoteReceiverWorkName) { + } } diff --git a/pom.xml b/pom.xml index 1f8e8764e8aa..42e88963b02d 100755 --- a/pom.xml +++ b/pom.xml @@ -163,15 +163,14 @@ 1.18.7 - 1.70.0 - 4.2.10.Final - 2.0.75.Final + 1.80.0 + 4.33.1 + 4.2.12.Final + 2.0.77.Final 2.9.0 1.6.2 0.6.1 - 3.19.2 - 1.42.1 - 1.2.1 + 1.3.0 1.3.2 3.1 @@ -344,6 +343,14 @@ ${antlr.version} + + com.google.protobuf + protobuf-bom + ${protobuf-java.version} + pom + import + + diff --git a/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh b/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh new file mode 100755 index 000000000000..bf2e7cf0e9ed --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh @@ -0,0 +1,178 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software 
Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Drives a runtime-rule apply on OAP-1 and asserts OAP-2 converges on the same +# (catalog, name, contentHash) within the reconciler tick window. Run from the +# repo root. +# +# Coverage: +# 1. Apply seed-rule on OAP-1 → ACTIVE +# 2. Wait for OAP-2 to see the rule via /list (one tick = ~30 s default) +# 3. STRUCTURAL update on OAP-1 → re-converge on OAP-2 (different content hash) +# 4. Inactivate on OAP-1 → INACTIVE on OAP-2 +# 5. Delete on OAP-1 → row gone on OAP-2 +# +# Failures route to stderr so the e2e harness's stdout capture stays clean. + +set -euo pipefail + +log() { echo "[cluster-flow] $*" >&2; } +fail() { log "FAIL: $*"; exit 1; } + +OAP1_PORT="${OAP1_PORT:-17128}" +OAP2_PORT="${OAP2_PORT:-17129}" +OAP1_BASE="http://127.0.0.1:${OAP1_PORT}" +OAP2_BASE="http://127.0.0.1:${OAP2_PORT}" +SEED_DIR="${SEED_DIR:-$(pwd)/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules}" +SEED_NEW="${SEED_DIR}/seed-rule.yaml" +SEED_STRUCT="${SEED_DIR}/seed-rule-structural.yaml" +CATALOG="otel-rules" +NAME="cluster_rr" + +# Three ticks' worth: default reconciler interval is 30 s; allow a generous 90 s for +# convergence on a busy CI host. 
+CONVERGE_TIMEOUT_S="${CONVERGE_TIMEOUT_S:-90}" + +[ -f "${SEED_NEW}" ] || fail "seed-rule.yaml missing at ${SEED_NEW}" + +list_row() { + local base="$1" + curl -fsS "${base}/runtime/rule/list" 2>/dev/null \ + | jq -c '.rules[] + | select(.catalog == "'"${CATALOG}"'" and .name == "'"${NAME}"'") + | select(.status != "n/a")' \ + | head -1 +} + +list_status() { + local base="$1" + list_row "${base}" | jq -r '.status // empty' +} + +list_hash() { + local base="$1" + list_row "${base}" | jq -r '.contentHash // empty' +} + +await_status() { + local base="$1" expected="$2" deadline=$(( $(date +%s) + CONVERGE_TIMEOUT_S )) + while true; do + local got + got="$(list_status "${base}")" + if [ "${got}" = "${expected}" ]; then + return 0 + fi + if [ "$(date +%s)" -ge "${deadline}" ]; then + fail "${base} did not reach status='${expected}' within ${CONVERGE_TIMEOUT_S}s (last='${got}')" + fi + sleep 2 + done +} + +await_hash() { + local base="$1" expected_hash="$2" deadline=$(( $(date +%s) + CONVERGE_TIMEOUT_S )) + while true; do + local got + got="$(list_hash "${base}")" + if [ "${got}" = "${expected_hash}" ]; then + return 0 + fi + if [ "$(date +%s)" -ge "${deadline}" ]; then + fail "${base} did not converge to contentHash='${expected_hash:0:8}…' within ${CONVERGE_TIMEOUT_S}s (last='${got:0:8}…')" + fi + sleep 2 + done +} + +await_absent() { + local base="$1" deadline=$(( $(date +%s) + CONVERGE_TIMEOUT_S )) + while true; do + if [ -z "$(list_row "${base}")" ]; then + return 0 + fi + if [ "$(date +%s)" -ge "${deadline}" ]; then + fail "${base} did not drop row within ${CONVERGE_TIMEOUT_S}s" + fi + sleep 2 + done +} + +apply_on() { + local base="$1" body="$2" extra="${3:-}" + local query="catalog=${CATALOG}&name=${NAME}" + if [ -n "${extra}" ]; then + query="${query}&${extra}" + fi + local resp; resp="$(curl -fsS -XPOST -H 'Content-Type: text/plain' \ + --data-binary "@${body}" "${base}/runtime/rule/addOrUpdate?${query}")" \ + || fail "addOrUpdate against ${base} failed" + echo 
"${resp}" +} + +# --- Wait for both OAPs to come up ------------------------------------------------- +log "waiting for OAP-1 (${OAP1_BASE})" +deadline=$(( $(date +%s) + 120 )) +until curl -fsS "${OAP1_BASE}/runtime/rule/list" >/dev/null 2>&1; do + if [ "$(date +%s)" -ge "${deadline}" ]; then fail "OAP-1 not ready after 120s"; fi + sleep 2 +done +log "waiting for OAP-2 (${OAP2_BASE})" +deadline=$(( $(date +%s) + 120 )) +until curl -fsS "${OAP2_BASE}/runtime/rule/list" >/dev/null 2>&1; do + if [ "$(date +%s)" -ge "${deadline}" ]; then fail "OAP-2 not ready after 120s"; fi + sleep 2 +done +log "both OAPs ready" + +# --- Phase 1: apply on OAP-1, observe convergence on OAP-2 ------------------------- +log "=== Phase 1: apply (NEW) on OAP-1 ===" +apply_on "${OAP1_BASE}" "${SEED_NEW}" >/dev/null +await_status "${OAP1_BASE}" "ACTIVE" +hash_initial="$(list_hash "${OAP1_BASE}")" +log "OAP-1 → ACTIVE @ ${hash_initial:0:8}…" +await_status "${OAP2_BASE}" "ACTIVE" +await_hash "${OAP2_BASE}" "${hash_initial}" +log "OAP-2 converged to ${hash_initial:0:8}…" + +# --- Phase 2: STRUCTURAL update on OAP-1, observe new hash on OAP-2 ---------------- +log "=== Phase 2: STRUCTURAL on OAP-1 ===" +apply_on "${OAP1_BASE}" "${SEED_STRUCT}" "allowStorageChange=true" >/dev/null +hash_struct="$(list_hash "${OAP1_BASE}")" +[ "${hash_struct}" != "${hash_initial}" ] || fail "OAP-1 contentHash unchanged after STRUCTURAL apply" +log "OAP-1 → ACTIVE @ ${hash_struct:0:8}… (was ${hash_initial:0:8}…)" +await_hash "${OAP2_BASE}" "${hash_struct}" +log "OAP-2 converged to ${hash_struct:0:8}…" + +# --- Phase 3: inactivate on OAP-1, observe INACTIVE on OAP-2 ----------------------- +log "=== Phase 3: /inactivate on OAP-1 ===" +curl -fsS -XPOST "${OAP1_BASE}/runtime/rule/inactivate?catalog=${CATALOG}&name=${NAME}" >/dev/null \ + || fail "inactivate against OAP-1 failed" +await_status "${OAP1_BASE}" "INACTIVE" +log "OAP-1 → INACTIVE" +await_status "${OAP2_BASE}" "INACTIVE" +log "OAP-2 converged to INACTIVE" + 
+# --- Phase 4: delete on OAP-1, observe row gone on OAP-2 --------------------------- +log "=== Phase 4: /delete on OAP-1 ===" +curl -fsS -XPOST "${OAP1_BASE}/runtime/rule/delete?catalog=${CATALOG}&name=${NAME}" >/dev/null \ + || fail "delete against OAP-1 failed" +await_absent "${OAP1_BASE}" +log "OAP-1 → row gone" +await_absent "${OAP2_BASE}" +log "OAP-2 converged: row gone" + +log "=== ALL CLUSTER PHASES PASSED ===" diff --git a/test/e2e-v2/cases/runtime-rule/cluster/docker-compose.yml b/test/e2e-v2/cases/runtime-rule/cluster/docker-compose.yml new file mode 100644 index 000000000000..0b27757c79cf --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/cluster/docker-compose.yml @@ -0,0 +1,91 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Cluster convergence — 2 OAPs behind a ZooKeeper coordinator + BanyanDB. +# Verifies that a runtime-rule apply on one node propagates to the other within a +# reconciler tick (default 30 s) and that the Suspend / Resume RPC bracket dispatch +# correctly across the cluster. 
+services: + zookeeper: + image: zookeeper:3.8 + networks: + - e2e + environment: + ZOO_4LW_COMMANDS_WHITELIST: "ruok,stat,srvr" + healthcheck: + # Use the zkServer.sh status wrapper (image's own /bin): the official + # zookeeper:3.8 image does not ship `nc`, so the more obvious `echo ruok | nc ...` + # idiom fails. The check passes once zkServer.sh status reports standalone, + # leader, or follower mode. + test: ["CMD-SHELL", "zkServer.sh status 2>/dev/null | grep -E 'Mode: (standalone|leader|follower)'"] + interval: 5s + timeout: 10s + retries: 30 + + banyandb: + extends: + file: ../../../script/docker-compose/base-compose.yml + service: banyandb + + oap1: + extends: + file: ../../../script/docker-compose/base-compose.yml + service: oap + hostname: oap1 + environment: + SW_RECEIVER_RUNTIME_RULE: default + SW_STORAGE: banyandb + SW_CLUSTER: zookeeper + SW_CLUSTER_ZK_HOST_PORT: zookeeper:2181 + # First-up node also doubles as the static-rule installer; nothing to coordinate + # with peers on storage init. 
+ ports: + - "11800:11800" + - "12800:12800" + - "17128:17128" + depends_on: + zookeeper: + condition: service_healthy + banyandb: + condition: service_healthy + networks: + - e2e + + oap2: + extends: + file: ../../../script/docker-compose/base-compose.yml + service: oap + hostname: oap2 + environment: + SW_RECEIVER_RUNTIME_RULE: default + SW_STORAGE: banyandb + SW_CLUSTER: zookeeper + SW_CLUSTER_ZK_HOST_PORT: zookeeper:2181 + ports: + - "11801:11800" + - "12801:12800" + - "17129:17128" + depends_on: + zookeeper: + condition: service_healthy + banyandb: + condition: service_healthy + oap1: + condition: service_healthy + networks: + - e2e + +networks: + e2e: diff --git a/test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml b/test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml new file mode 100644 index 000000000000..477a8a52b374 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/cluster/e2e.yaml @@ -0,0 +1,59 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# 2-OAP cluster + ZK + BanyanDB. Drives apply / inactivate / delete on OAP-1 and +# verifies OAP-2 converges within a reconciler tick (default 30 s). 
+ +setup: + env: compose + file: docker-compose.yml + timeout: 25m + init-system-environment: ../../../script/env + steps: + - name: set PATH + command: export PATH=/tmp/skywalking-infra-e2e/bin:$PATH + - name: install jq + command: | + if ! command -v jq >/dev/null 2>&1; then + curl -fsSL -o /tmp/skywalking-infra-e2e/bin/jq \ + https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 + chmod +x /tmp/skywalking-infra-e2e/bin/jq + fi + - name: drive cluster convergence flow + command: | + set -euo pipefail + export PATH=/tmp/skywalking-infra-e2e/bin:$PATH + bash test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh + +verify: + retry: + count: 1 + interval: 1s + cases: + - query: curl -fsS http://127.0.0.1:17128/runtime/rule/list >/dev/null && echo ok + expected: expected/ok.txt + +cleanup: + on: always + collect: + on: failure + output-dir: $SW_INFRA_E2E_LOG_DIR/runtime-rule/cluster + items: + - service: oap1 + paths: + - /skywalking/logs/ + - service: oap2 + paths: + - /skywalking/logs/ diff --git a/test/e2e-v2/cases/runtime-rule/cluster/expected/ok.txt b/test/e2e-v2/cases/runtime-rule/cluster/expected/ok.txt new file mode 100644 index 000000000000..9766475a4185 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/cluster/expected/ok.txt @@ -0,0 +1 @@ +ok diff --git a/test/e2e-v2/cases/runtime-rule/lal/docker-compose.yml b/test/e2e-v2/cases/runtime-rule/lal/docker-compose.yml new file mode 100644 index 000000000000..e58522d0c1b8 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/lal/docker-compose.yml @@ -0,0 +1,60 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# LAL live-swap — single OAP + BanyanDB + an OTLP log emitter. The flow applies +# a LAL rule (v1), swaps it for v2 with the same (layer, ruleName) key, then +# inactivates and deletes. The log emitter pushes one log every second so the +# LAL filter pipeline has actual data to match across the swap window — failures +# in the per-file classloader retirement / Factory.addOrReplace path show up as +# stalled or duplicated dispatch (caught by the next-tick /list assertions). +services: + oap: + extends: + file: ../../../script/docker-compose/base-compose.yml + service: oap + environment: + SW_RECEIVER_RUNTIME_RULE: default + SW_STORAGE: banyandb + # Open the OTLP gRPC log receiver on the standard port (11800) — base-compose + # already opens 11800; we just need the log handler to be on. 
+ SW_OTEL_LOG_RECEIVER: default + ports: + - "11800:11800" + - "12800:12800" + - "17128:17128" + networks: + - e2e + + banyandb: + extends: + file: ../../../script/docker-compose/base-compose.yml + service: banyandb + + log-emitter: + build: + context: ./log-emitter + networks: + - e2e + environment: + OTLP_ENDPOINT: http://oap:11800 + EMITTER_SERVICE: e2e-rr-lal-svc + EMITTER_INSTANCE: e2e-rr-lal-i1 + EMITTER_INTERVAL_S: "1" + depends_on: + oap: + condition: service_healthy + +networks: + e2e: diff --git a/test/e2e-v2/cases/runtime-rule/lal/e2e.yaml b/test/e2e-v2/cases/runtime-rule/lal/e2e.yaml new file mode 100644 index 000000000000..3cc008a42998 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/lal/e2e.yaml @@ -0,0 +1,66 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# LAL hot-update with end-to-end MAL extraction proof — applies a log-mal +# aggregation rule via runtime-rule, then hot-swaps a LAL extractor rule +# whose extractor stamps a per-version `step` label so swctl can verify the +# swap actually changed the running extraction. 
+ +setup: + env: compose + file: docker-compose.yml + timeout: 25m + init-system-environment: ../../../script/env + steps: + - name: set PATH + command: export PATH=/tmp/skywalking-infra-e2e/bin:$PATH + - name: install yq + command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh yq + - name: install swctl + command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh swctl + - name: install jq + command: | + if ! command -v jq >/dev/null 2>&1; then + curl -fsSL -o /tmp/skywalking-infra-e2e/bin/jq \ + https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 + chmod +x /tmp/skywalking-infra-e2e/bin/jq + fi + - name: drive LAL hot-update flow + command: | + set -euo pipefail + export PATH=/tmp/skywalking-infra-e2e/bin:$PATH + export OAP_HOST=127.0.0.1 + export OAP_REST_PORT=17128 + export OAP_GQL_PORT=12800 + export SEED_DIR=$(pwd)/test/e2e-v2/cases/runtime-rule/lal/seed-rules + bash test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh + +verify: + retry: + count: 1 + interval: 1s + cases: + - query: curl -fsS http://127.0.0.1:17128/runtime/rule/list >/dev/null && echo ok + expected: expected/ok.txt + +cleanup: + on: always + collect: + on: failure + output-dir: $SW_INFRA_E2E_LOG_DIR/runtime-rule/lal + items: + - service: oap + paths: + - /skywalking/logs/ diff --git a/test/e2e-v2/cases/runtime-rule/lal/expected/ok.txt b/test/e2e-v2/cases/runtime-rule/lal/expected/ok.txt new file mode 100644 index 000000000000..9766475a4185 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/lal/expected/ok.txt @@ -0,0 +1 @@ +ok diff --git a/test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh b/test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh new file mode 100755 index 000000000000..5c3a9058698d --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh @@ -0,0 +1,209 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. 
See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Drives runtime-rule LAL hot-update with end-to-end MAL extraction proof: +# +# 0. Apply log-mal aggregation rule (catalog=log-mal-rules) +# 1. Apply LAL v1 — extractor stamps step=v1 +# Verify swctl returns a value for meter_e2e_lal_log_count{step='v1'} +# 2. Swap to LAL v2 — same name, extractor stamps step=v2 +# Verify swctl returns a value for ...{step='v2'} (proves the running +# extraction switched bodies, not just the persisted rule row) +# 3. Inactivate — soft-pause; new step values stop landing +# 4. Delete — destructive; /list row gone, MAL rule +# removed too +# +# Run from the repo root. 
+ +set -euo pipefail + +log() { echo "[lal-flow] $*" >&2; } +fail() { log "FAIL: $*"; exit 1; } + +OAP_HOST="${OAP_HOST:-127.0.0.1}" +OAP_REST_PORT="${OAP_REST_PORT:-17128}" +OAP_GQL_PORT="${OAP_GQL_PORT:-12800}" +OAP_BASE="http://${OAP_HOST}:${OAP_REST_PORT}" +GQL_BASE="http://${OAP_HOST}:${OAP_GQL_PORT}" +SEED_DIR="${SEED_DIR:-$(pwd)/test/e2e-v2/cases/runtime-rule/lal/seed-rules}" +SEED_V1="${SEED_DIR}/lal-v1.yaml" +SEED_V2="${SEED_DIR}/lal-v2.yaml" +SEED_MAL="${SEED_DIR}/log-mal.yaml" +LAL_CATALOG="lal" +LAL_NAME="e2e-rr-lal-live" +MAL_CATALOG="log-mal-rules" +MAL_NAME="e2e_lal" +METRIC="meter_e2e_lal_log_count" +SERVICE_NAME="e2e-rr-lal-svc" +# First-phase budget covers minute-bucket boundary + OTLP export interval + +# extraction-then-aggregation latency. Subsequent phases land sooner but the +# upper bound stays here for resilience under CI load. +SETTLE_SECONDS="${SETTLE_SECONDS:-360}" + +[ -f "${SEED_V1}" ] || fail "seed v1 missing at ${SEED_V1}" +[ -f "${SEED_V2}" ] || fail "seed v2 missing at ${SEED_V2}" +[ -f "${SEED_MAL}" ] || fail "seed mal missing at ${SEED_MAL}" + +list_row() { + local catalog="$1" name="$2" + curl -fsS "${OAP_BASE}/runtime/rule/list" 2>/dev/null \ + | jq -c '.rules[] + | select(.catalog == "'"${catalog}"'" and .name == "'"${name}"'") + | select(.status != "n/a")' \ + | head -1 +} +list_field() { + local catalog="$1" name="$2" field="$3" + list_row "${catalog}" "${name}" | jq -r '."'"${field}"'" // empty' +} + +apply_rule() { + local catalog="$1" name="$2" body="$3" + curl -fsS -XPOST -H 'Content-Type: text/plain' \ + --data-binary "@${body}" \ + "${OAP_BASE}/runtime/rule/addOrUpdate?catalog=${catalog}&name=${name}" >/dev/null \ + || fail "addOrUpdate ${catalog}/${name} from ${body} failed" +} + +# Retries 503 cluster_not_ready for up to 60s — the reconciler's peer-refresh +# window briefly returns 503 right after a structural reshape (e.g. LAL +# delete that retires its dispatcher). Mirrors the MAL flow's pattern. 
+retry_post() { + local url="$1" + local deadline=$(( $(date +%s) + 60 )) + local out + while (( $(date +%s) < deadline )); do + out="$(curl -fsS -XPOST "${url}" 2>&1)" && return 0 + if [[ "${out}" == *503* ]]; then + sleep 2 + continue + fi + echo "${out}" >&2 + return 1 + done + echo "${out}" >&2 + return 1 +} + +inactivate_rule() { + local catalog="$1" name="$2" + retry_post "${OAP_BASE}/runtime/rule/inactivate?catalog=${catalog}&name=${name}" >/dev/null \ + || fail "inactivate ${catalog}/${name} failed" +} + +delete_rule() { + local catalog="$1" name="$2" + retry_post "${OAP_BASE}/runtime/rule/delete?catalog=${catalog}&name=${name}" >/dev/null \ + || fail "delete ${catalog}/${name} failed" +} + +# Returns 0 if swctl finds at least one minute-bucket sample for the given +# metric+step combo. Reads through GraphQL via swctl's `metrics exec`. +swctl_metric_for_step() { + local step="$1" + local expr="${METRIC}{step='${step}'}" + local out + out="$(swctl --display yaml --base-url="${GQL_BASE}/graphql" \ + metrics exec --expression="${expr}" \ + --service-name="${SERVICE_NAME}" 2>&1)" || { + log " swctl exec ${expr} failed: ${out}" + return 1 + } + log " swctl ${expr} → ${out}" + echo "${out}" | grep -qE '^\s*value:\s*"?-?[0-9]+(\.[0-9]+)?"?\s*$' +} + +await_metric_for_step() { + local step="$1" + log " await ${METRIC}{step='${step}'} (budget ${SETTLE_SECONDS}s)" + local deadline=$(( $(date +%s) + SETTLE_SECONDS )) + while (( $(date +%s) < deadline )); do + if swctl_metric_for_step "${step}"; then + log " ✓ ${METRIC}{step='${step}'} has values" + return 0 + fi + sleep 5 + done + fail "${METRIC}{step='${step}'} did not produce a value within ${SETTLE_SECONDS}s" +} + +log "waiting for OAP runtime-rule port" +deadline=$(( $(date +%s) + 90 )) +until curl -fsS "${OAP_BASE}/runtime/rule/list" >/dev/null 2>&1; do + if [ "$(date +%s)" -ge "${deadline}" ]; then fail "OAP not ready after 90s"; fi + sleep 2 +done +log "OAP ready" + +# --- Phase 0: apply log-mal aggregation 
----------------------------------------------- +log "=== Phase 0: apply log-mal aggregation rule ===" +apply_rule "${MAL_CATALOG}" "${MAL_NAME}" "${SEED_MAL}" +mal_status="$(list_field "${MAL_CATALOG}" "${MAL_NAME}" status)" +[ "${mal_status}" = "ACTIVE" ] || fail "MAL rule expected ACTIVE, got '${mal_status}'" +log "log-mal → ACTIVE" + +# --- Phase 1: apply LAL v1 ------------------------------------------------------------ +log "=== Phase 1: apply LAL v1 (extractor stamps step=v1) ===" +apply_rule "${LAL_CATALOG}" "${LAL_NAME}" "${SEED_V1}" +status="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" status)" +[ "${status}" = "ACTIVE" ] || fail "v1 expected ACTIVE, got '${status}'" +hash_v1="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" contentHash)" +[ -n "${hash_v1}" ] || fail "v1 contentHash empty" +log "v1 → ACTIVE @ ${hash_v1:0:8}…" +await_metric_for_step "v1" + +# --- Phase 2: swap to LAL v2 (same key, step flips to v2) ----------------------------- +log "=== Phase 2: swap to LAL v2 (extractor stamps step=v2) ===" +apply_rule "${LAL_CATALOG}" "${LAL_NAME}" "${SEED_V2}" +status="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" status)" +[ "${status}" = "ACTIVE" ] || fail "v2 expected ACTIVE, got '${status}'" +hash_v2="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" contentHash)" +[ "${hash_v2}" != "${hash_v1}" ] || fail "v2 contentHash unchanged from v1 (${hash_v2:0:8}…)" +log "v2 → ACTIVE @ ${hash_v2:0:8}… (was ${hash_v1:0:8}…) — swap applied" +await_metric_for_step "v2" + +# --- Phase 3: inactivate LAL ---------------------------------------------------------- +log "=== Phase 3: inactivate LAL ===" +inactivate_rule "${LAL_CATALOG}" "${LAL_NAME}" +status="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" status)" +[ "${status}" = "INACTIVE" ] || fail "expected INACTIVE, got '${status}'" +log "inactivate → INACTIVE OK" + +# --- Phase 4: delete LAL + MAL -------------------------------------------------------- +log "=== Phase 4: delete LAL + log-mal rules ===" +delete_rule 
"${LAL_CATALOG}" "${LAL_NAME}" +deadline=$(( $(date +%s) + 30 )) +while [ -n "$(list_row "${LAL_CATALOG}" "${LAL_NAME}")" ]; do + if [ "$(date +%s)" -ge "${deadline}" ]; then + fail "LAL row still present 30s after delete" + fi + sleep 2 +done +log "LAL row gone OK" + +inactivate_rule "${MAL_CATALOG}" "${MAL_NAME}" +delete_rule "${MAL_CATALOG}" "${MAL_NAME}" +deadline=$(( $(date +%s) + 30 )) +while [ -n "$(list_row "${MAL_CATALOG}" "${MAL_NAME}")" ]; do + if [ "$(date +%s)" -ge "${deadline}" ]; then + fail "MAL row still present 30s after delete" + fi + sleep 2 +done +log "MAL row gone OK" + +log "=== ALL LAL FLOW PHASES PASSED ===" diff --git a/test/e2e-v2/cases/runtime-rule/lal/log-emitter/Dockerfile b/test/e2e-v2/cases/runtime-rule/lal/log-emitter/Dockerfile new file mode 100644 index 000000000000..6226c3b9cd21 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/lal/log-emitter/Dockerfile @@ -0,0 +1,22 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +FROM python:3.11-slim +RUN pip install --no-cache-dir \ + opentelemetry-sdk==1.27.0 \ + opentelemetry-exporter-otlp-proto-grpc==1.27.0 \ + opentelemetry-api==1.27.0 +COPY emitter.py / +CMD ["python", "/emitter.py"] diff --git a/test/e2e-v2/cases/runtime-rule/lal/log-emitter/emitter.py b/test/e2e-v2/cases/runtime-rule/lal/log-emitter/emitter.py new file mode 100644 index 000000000000..000149999dfb --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/lal/log-emitter/emitter.py @@ -0,0 +1,78 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Synthetic OTLP log emitter for the runtime-rule LAL live-swap e2e. + +Pushes one log record per producer-interval to OAP via OTLP gRPC. The log +body carries a fixed marker ("e2e_rr_lal_live") that the LAL rule under test +matches on; the per-log severity alternates between INFO and ERROR. The seed +rules (lal-v1 / lal-v2) differ only in their `step` label, not in severity filtering.
+""" + +import logging +import os +import time + +from opentelemetry._logs import set_logger_provider +from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter +from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler +from opentelemetry.sdk._logs.export import BatchLogRecordProcessor +from opentelemetry.sdk.resources import Resource + +ENDPOINT = os.environ.get("OTLP_ENDPOINT", "http://oap:11800") +SERVICE_NAME = os.environ.get("EMITTER_SERVICE", "e2e-rr-lal-svc") +INSTANCE_NAME = os.environ.get("EMITTER_INSTANCE", "e2e-rr-lal-i1") +PRODUCER_INTERVAL_SECONDS = float(os.environ.get("EMITTER_INTERVAL_S", "1")) + + +def main() -> None: + resource = Resource.create({ + "service.name": SERVICE_NAME, + "service.instance.id": INSTANCE_NAME, + }) + provider = LoggerProvider(resource=resource) + provider.add_log_record_processor( + BatchLogRecordProcessor(OTLPLogExporter(endpoint=ENDPOINT, insecure=True)) + ) + set_logger_provider(provider) + + handler = LoggingHandler(level=logging.NOTSET, logger_provider=provider) + logger = logging.getLogger("e2e_rr_lal") + logger.setLevel(logging.INFO) + logger.addHandler(handler) + + print( + f"lal-log-emitter started — endpoint={ENDPOINT} service={SERVICE_NAME} " + f"instance={INSTANCE_NAME} producer_interval={PRODUCER_INTERVAL_SECONDS}s", + flush=True, + ) + + seq = 0 + while True: + seq += 1 + # Alternate INFO / ERROR every other tick so a LAL filter has both shapes + # to choose from; the current swap test (lal-v1 / lal-v2) keys on the `step` + # label and does not assert on severity.
+ level = logging.ERROR if seq % 2 == 0 else logging.INFO + logger.log(level, "e2e_rr_lal_live seq=%d", seq, extra={ + "marker": "e2e_rr_lal_live", + "service.name": SERVICE_NAME, + "service.instance.id": INSTANCE_NAME, + }) + time.sleep(PRODUCER_INTERVAL_SECONDS) + + +if __name__ == "__main__": + main() diff --git a/test/e2e-v2/cases/runtime-rule/lal/seed-rules/lal-v1.yaml b/test/e2e-v2/cases/runtime-rule/lal/seed-rules/lal-v1.yaml new file mode 100644 index 000000000000..83cb4d37325d --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/lal/seed-rules/lal-v1.yaml @@ -0,0 +1,40 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# LAL v1 — extracts one meter sample per matching log under metric name +# `e2e_lal_log_count` and stamps it with `step=v1`. The companion log-mal +# rule (seed-rules/log-mal.yaml) aggregates this counter; swctl asserts the +# aggregated value carries step=v1 once data lands. Swapping to v2 (same +# rule key, different `step` literal) proves the hot-update actually changed +# the running extraction — the new step value shows up in queries. 
+rules: + - name: e2e-rr-lal-live + layer: GENERAL + dsl: | + filter { + text { + regexp $/.*e2e_rr_lal_live.*/$ + } + extractor { + metrics { + timestamp log.timestamp as Long + labels service: log.service, service_instance_id: log.serviceInstance, step: 'v1' + name 'e2e_lal_log_count' + value 1 + } + } + sink { + } + } diff --git a/test/e2e-v2/cases/runtime-rule/lal/seed-rules/lal-v2.yaml b/test/e2e-v2/cases/runtime-rule/lal/seed-rules/lal-v2.yaml new file mode 100644 index 000000000000..854fdcd5ae43 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/lal/seed-rules/lal-v2.yaml @@ -0,0 +1,39 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# LAL v2 — same rule key as v1, same metric name `e2e_lal_log_count`, but the +# `step` label is now 'v2'. Successful hot-swap shows up as the aggregated +# metric beginning to publish step=v2 samples after the apply lands; swctl +# queries for step=v2 only become non-empty post-swap, proving the running +# extraction switched bodies (not just the persisted rule row). 
+rules: + - name: e2e-rr-lal-live + layer: GENERAL + dsl: | + filter { + text { + regexp $/.*e2e_rr_lal_live.*/$ + } + extractor { + metrics { + timestamp log.timestamp as Long + labels service: log.service, service_instance_id: log.serviceInstance, step: 'v2' + name 'e2e_lal_log_count' + value 1 + } + } + sink { + } + } diff --git a/test/e2e-v2/cases/runtime-rule/lal/seed-rules/log-mal.yaml b/test/e2e-v2/cases/runtime-rule/lal/seed-rules/log-mal.yaml new file mode 100644 index 000000000000..6055ce1741c6 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/lal/seed-rules/log-mal.yaml @@ -0,0 +1,26 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Log-MAL aggregation for the LAL extractor's emitted counter. The LAL rule +# (lal-v1 / lal-v2) attaches a `step` label whose value flips on hot-swap; +# this MAL rule retains `step` in the grouping so swctl can filter by phase. +# The flow applies this rule (catalog=log-mal-rules, name=e2e_lal) before +# any LAL apply so the aggregation pipeline is ready by the time samples +# start flowing. 
+metricPrefix: meter_e2e_lal +expSuffix: service(['service'], Layer.GENERAL) +metricsRules: + - name: log_count + exp: e2e_lal_log_count.sum(['service','service_instance_id','step']) diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/docker-compose.yml b/test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/docker-compose.yml new file mode 100644 index 000000000000..2341bec7c540 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/docker-compose.yml @@ -0,0 +1,69 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +services: + oap: + extends: + file: ../../../../script/docker-compose/base-compose.yml + service: oap + environment: + # Hot-update receiver — opens the admin port on 17128. Default is "" (disabled). + SW_RECEIVER_RUNTIME_RULE: default + # Steer OAP to BanyanDB. Reuses the standard env var the base-compose already wires. + SW_STORAGE: banyandb + # Tighten the persistence timer for the e2e: every 10s (default 25s) so the + # lifecycle flow's per-phase awaits land sooner. Doesn't change the minute- + # bucket boundary itself (which is what dominates the wait), just shortens + # the flush gap between bucket close and BanyanDB-side visibility. 
+ SW_CORE_PERSISTENT_PERIOD: "10" + # Static rule catalogs stay at their defaults. The runtime-rule lifecycle uses a + # unique e2e_rr_-prefixed metric name that does not collide with anything the + # static catalogs ship, so the apply path's CREATE phase still exercises a + # first-time DDL register against the backend for that specific metric — even + # though the storage-side merging tables (meter_sum_*) may already exist for + # other static metrics that share the function. + ports: + - "11800:11800" + - "12800:12800" + - "17128:17128" + networks: + - e2e + + banyandb: + extends: + file: ../../../../script/docker-compose/base-compose.yml + service: banyandb + + otlp-emitter: + build: + context: ../otlp-emitter + networks: + - e2e + environment: + OTLP_ENDPOINT: http://oap:11800 + EMITTER_SERVICE: e2e-rr-svc + EMITTER_INSTANCE: e2e-rr-i1 + # The lifecycle flow rewrites this file via `docker compose exec` + # between phases so each emitted sample carries the current phase's + # `step` label. Defaults to "create" until the flow writes it for + # the first time, matching the first-phase emitter behaviour. + STEP_FILE: /tmp/step + STEP_DEFAULT: create + depends_on: + oap: + condition: service_healthy + +networks: + e2e: diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/e2e.yaml b/test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/e2e.yaml new file mode 100644 index 000000000000..af1fcededf3b --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/e2e.yaml @@ -0,0 +1,82 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Runtime-rule lifecycle e2e — BanyanDB variant. +# +# Boots OAP with the static rule catalogs at their defaults; the e2e_rr_-prefixed +# metric still exercises first-time DDL on BanyanDB. The python OTLP emitter pushes a +# counter (e2e_rr_request_count_total) and gauge (e2e_rr_pool_size) tagged +# with a per-phase `step` label that the flow script flips between phases; +# the runtime-rule HTTP handler accepts seed rules that derive SkyWalking +# metrics from those signals; the flow script drives the full lifecycle — +# CREATE → UPDATE-FILTER → UPDATE-STRUCTURAL → DUMP → ILLEGAL × 4 → +# SHAPE-BREAK → INACTIVATE → ACTIVATE → DELETE → DUMP — and asserts each +# phase against /list, swctl, the BanyanDB-side measure existence, and the +# step-label-attributed query results. + +setup: + env: compose + file: docker-compose.yml + timeout: 25m + init-system-environment: ../../../../script/env + steps: + - name: set PATH + command: export PATH=/tmp/skywalking-infra-e2e/bin:$PATH + - name: install yq + command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh yq + - name: install swctl + command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh swctl + - name: install jq + command: | + if !
command -v jq >/dev/null 2>&1; then + curl -fsSL -o /tmp/skywalking-infra-e2e/bin/jq \ + https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 + chmod +x /tmp/skywalking-infra-e2e/bin/jq + fi + - name: drive runtime-rule lifecycle + command: | + set -euo pipefail + export PATH=/tmp/skywalking-infra-e2e/bin:$PATH + export OAP_HOST=127.0.0.1 + export OAP_REST_PORT=17128 + export OAP_GQL_PORT=12800 + export SEED_RULES_DIR=$(pwd)/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules + bash test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh + +verify: + retry: + count: 1 + interval: 1s + cases: + # The flow script above drives every assertion inline; verify.cases is kept + # minimal so the harness reports the script's pass/fail directly. A trailing + # /list smoke check confirms the receiver port is still serving after the + # destructive phase (catches regressions where /delete crashes the handler). + - query: curl -fsS http://127.0.0.1:17128/runtime/rule/list >/dev/null && echo ok + expected: expected/ok.txt + +cleanup: + on: always + collect: + on: failure + output-dir: $SW_INFRA_E2E_LOG_DIR/runtime-rule/mal-storage-banyandb + items: + - service: banyandb + paths: + - /tmp/banyandb-stream-data/ + - /tmp/banyand-measure-data/ + - service: oap + paths: + - /skywalking/logs/ diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/expected/ok.txt b/test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/expected/ok.txt new file mode 100644 index 000000000000..9766475a4185 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/banyandb/expected/ok.txt @@ -0,0 +1 @@ +ok diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/docker-compose.yml b/test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/docker-compose.yml new file mode 100644 index 000000000000..c621d3d824b6 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/docker-compose.yml @@ -0,0 +1,78 @@ +# Licensed to the Apache 
Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +services: + es: + image: elastic/elasticsearch:${ES_VERSION} + networks: + - e2e + ports: + - "9200:9200" + environment: + - discovery.type=single-node + - cluster.routing.allocation.disk.threshold_enabled=false + - xpack.security.enabled=false + healthcheck: + test: ["CMD", "bash", "-c", "cat < /dev/null > /dev/tcp/127.0.0.1/9200"] + interval: 5s + timeout: 60s + retries: 120 + + oap: + extends: + file: ../../../../script/docker-compose/base-compose.yml + service: oap + environment: + SW_STORAGE: elasticsearch + SW_STORAGE_ES_CLUSTER_NODES: es:9200 + SW_RECEIVER_RUNTIME_RULE: default + # Tighten the persistence timer for the e2e: every 10s (default 25s) so the + # lifecycle flow's per-phase awaits land sooner. + SW_CORE_PERSISTENT_PERIOD: "10" + # Static rule catalogs stay at their defaults — the e2e_rr_-prefixed metric name + # the runtime-rule lifecycle uses doesn't collide with anything static. ES default + # logicSharding=false means all metrics merge into a single metrics-all index; + # CREATE still exercises a first-time register for our specific metric even + # though the index may already exist for other static metrics. 
+ ports: + - "11800:11800" + - "12800:12800" + - "17128:17128" + depends_on: + es: + condition: service_healthy + networks: + - e2e + + otlp-emitter: + build: + context: ../otlp-emitter + networks: + - e2e + environment: + OTLP_ENDPOINT: http://oap:11800 + EMITTER_SERVICE: e2e-rr-svc + EMITTER_INSTANCE: e2e-rr-i1 + # The lifecycle flow rewrites this file via `docker compose exec` + # between phases so each emitted sample carries the current phase's + # `step` label. + STEP_FILE: /tmp/step + STEP_DEFAULT: create + depends_on: + oap: + condition: service_healthy + +networks: + e2e: diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/e2e.yaml b/test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/e2e.yaml new file mode 100644 index 000000000000..260773f1b2ba --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/elasticsearch/e2e.yaml @@ -0,0 +1,70 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Storage-matrix e2e — Elasticsearch variant. +# +# OAP runs ES with default logicSharding=false → all function metrics land in +# a single merged index `metrics-all`. 
dropTable on ES is a documented no-op +# (append-only policy), so after /delete the index stays and historical data +# remains queryable — the lifecycle proof comes from the /list row check +# (rule gone) and the swctl/MQE assertions (current step's row absent / +# present), not from any backend-direct introspection. + +setup: + env: compose + file: docker-compose.yml + timeout: 25m + init-system-environment: ../../../../script/env + steps: + - name: set PATH + command: export PATH=/tmp/skywalking-infra-e2e/bin:$PATH + - name: install yq + command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh yq + - name: install swctl + command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh swctl + - name: install jq + command: | + if ! command -v jq >/dev/null 2>&1; then + curl -fsSL -o /tmp/skywalking-infra-e2e/bin/jq \ + https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 + chmod +x /tmp/skywalking-infra-e2e/bin/jq + fi + - name: drive runtime-rule lifecycle (CREATE → FILTER_ONLY → STRUCTURAL → INACTIVATE → DELETE) + command: | + set -euo pipefail + export PATH=/tmp/skywalking-infra-e2e/bin:$PATH + export OAP_HOST=127.0.0.1 + export OAP_REST_PORT=17128 + export OAP_GQL_PORT=12800 + export SEED_RULES_DIR=$(pwd)/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules + bash test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh + +verify: + retry: + count: 1 + interval: 1s + cases: + - query: curl -fsS http://127.0.0.1:17128/runtime/rule/list >/dev/null && echo ok + expected: ../expected/ok.txt + +cleanup: + on: always + collect: + on: failure + output-dir: $SW_INFRA_E2E_LOG_DIR/runtime-rule/mal-storage-elasticsearch + items: + - service: oap + paths: + - /skywalking/logs/ diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/expected/ok.txt b/test/e2e-v2/cases/runtime-rule/mal-storage/expected/ok.txt new file mode 100644 index 000000000000..9766475a4185 --- /dev/null +++ 
b/test/e2e-v2/cases/runtime-rule/mal-storage/expected/ok.txt @@ -0,0 +1 @@ +ok diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/otlp-emitter/Dockerfile b/test/e2e-v2/cases/runtime-rule/mal-storage/otlp-emitter/Dockerfile new file mode 100644 index 000000000000..ca1f0a97378e --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/otlp-emitter/Dockerfile @@ -0,0 +1,21 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +FROM python:3.11-slim +RUN pip install --no-cache-dir \ + opentelemetry-sdk==1.27.0 \ + opentelemetry-exporter-otlp-proto-grpc==1.27.0 +COPY emitter.py / +CMD ["python", "/emitter.py"] diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/otlp-emitter/emitter.py b/test/e2e-v2/cases/runtime-rule/mal-storage/otlp-emitter/emitter.py new file mode 100644 index 000000000000..83109070bdea --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/otlp-emitter/emitter.py @@ -0,0 +1,129 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. 
+# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Synthetic OTLP metric emitter for the runtime-rule storage-matrix e2e. + +Emits two metrics on a steady cadence so the runtime-rule MAL pipeline +has predictable input regardless of which backend the OAP under test is +talking to. Names are deliberately namespaced with ``e2e_rr_`` so they +do not collide with any static rule shipped by OAP. + +Counter ``e2e_rr_request_count_total`` is monotonically increasing and +drives the FILTER_ONLY / STRUCTURAL apply assertions through ``sum(...)`` +in MAL. Gauge ``e2e_rr_pool_size`` is held constant so STRUCTURAL adds +have a second metric to derive from without coupling to time. + +The emitter sleeps a short interval between samples so the L1 / L2 +aggregation pipeline on OAP has steady ticks; the OTLP exporter's own +periodic reader pushes whatever has been recorded since the last flush. +""" +import os +import time + +from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter +from opentelemetry.sdk.metrics import MeterProvider +from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader +from opentelemetry.sdk.resources import Resource + +ENDPOINT = os.environ.get("OTLP_ENDPOINT", "http://oap:11800") +SERVICE_NAME = os.environ.get("EMITTER_SERVICE", "e2e-rr-svc") +INSTANCE_NAME = os.environ.get("EMITTER_INSTANCE", "e2e-rr-i1") + +# 5 s OTLP export interval so OAP sees fresh data within one minute bucket. 
+EXPORT_INTERVAL_MILLIS = int(os.environ.get("OTLP_EXPORT_INTERVAL_MS", "5000")) +# 2 s producer sleep — independent of the export interval so we always have +# at least one observation per export window. +PRODUCER_INTERVAL_SECONDS = float(os.environ.get("EMITTER_INTERVAL_S", "2")) + +# Shared file the flow script rewrites between phases. Each emitter tick +# reads the file so samples carry the *current* phase's `step` label and +# the lifecycle e2e can attribute storage rows back to the phase that +# wrote them. Defaults to "create" for back-compat with the simpler flow. +STEP_FILE = os.environ.get("STEP_FILE", "/shared/step") +STEP_DEFAULT = os.environ.get("STEP_DEFAULT", "create") + + +def read_step() -> str: + try: + with open(STEP_FILE, "r") as f: + value = f.read().strip() + return value or STEP_DEFAULT + except FileNotFoundError: + return STEP_DEFAULT + + +def main() -> None: + resource = Resource.create({ + "service.name": SERVICE_NAME, + "service.instance.id": INSTANCE_NAME, + }) + + exporter = OTLPMetricExporter(endpoint=ENDPOINT, insecure=True) + reader = PeriodicExportingMetricReader( + exporter, + export_interval_millis=EXPORT_INTERVAL_MILLIS, + ) + provider = MeterProvider(resource=resource, metric_readers=[reader]) + meter = provider.get_meter("e2e-rr-otlp-emitter") + + counter = meter.create_counter( + name="e2e_rr_request_count_total", + description="Synthetic request counter for runtime-rule e2e.", + ) + + # Hold the pool-size gauge constant — an ObservableGauge needs a callback + # but the value is otherwise stable so STRUCTURAL assertions can pin a + # specific number. The callback reads the current step so samples produced + # via the gauge's periodic export carry the same label as the counter's. 
+ def pool_size_callback(_options): + from opentelemetry.metrics import Observation + step = read_step() + return [ + Observation(value=42, attributes={ + "service.name": SERVICE_NAME, + "service.instance.id": INSTANCE_NAME, + "step": step, + }), + ] + + meter.create_observable_gauge( + name="e2e_rr_pool_size", + callbacks=[pool_size_callback], + description="Synthetic pool-size gauge for runtime-rule e2e.", + ) + + print( + f"otlp-emitter started — endpoint={ENDPOINT} service={SERVICE_NAME} " + f"instance={INSTANCE_NAME} producer_interval={PRODUCER_INTERVAL_SECONDS}s " + f"export_interval={EXPORT_INTERVAL_MILLIS}ms step_file={STEP_FILE}", + flush=True, + ) + + last_step = None + while True: + step = read_step() + if step != last_step: + print(f"otlp-emitter: step={step}", flush=True) + last_step = step + counter.add(1, attributes={ + "service.name": SERVICE_NAME, + "service.instance.id": INSTANCE_NAME, + "step": step, + }) + time.sleep(PRODUCER_INTERVAL_SECONDS) + + +if __name__ == "__main__": + main() diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/docker-compose.yml b/test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/docker-compose.yml new file mode 100644 index 000000000000..7bd2def287fb --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/docker-compose.yml @@ -0,0 +1,79 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +services: + oap: + extends: + file: ../../../../script/docker-compose/base-compose.yml + service: oap + environment: + SW_STORAGE: postgresql + SW_JDBC_URL: jdbc:postgresql://postgres:5432/swtest + SW_DATA_SOURCE_USER: postgres + SW_DATA_SOURCE_PASSWORD: root@1234 + SW_RECEIVER_RUNTIME_RULE: default + # Tighten the persistence timer for the e2e: every 10s (default 25s) so the + # lifecycle flow's per-phase awaits land sooner. + SW_CORE_PERSISTENT_PERIOD: "10" + # Static rule catalogs stay at their defaults — the e2e_rr_-prefixed metric name + # the runtime-rule lifecycle uses doesn't collide with anything static, so CREATE + # still exercises a first-time register for that specific metric even though the + # merging-table (meter_sum_*) may already exist for other static metrics. + ports: + - "11800:11800" + - "12800:12800" + - "17128:17128" + depends_on: + postgres: + condition: service_healthy + networks: + - e2e + + postgres: + image: postgres:14.1 + environment: + TZ: Asia/Shanghai + POSTGRES_PASSWORD: root@1234 + POSTGRES_DB: swtest + ports: + - "5432:5432" + networks: + - e2e + healthcheck: + test: ["CMD", "bash", "-c", "cat < /dev/null > /dev/tcp/127.0.0.1/5432"] + interval: 5s + timeout: 60s + retries: 60 + + otlp-emitter: + build: + context: ../otlp-emitter + networks: + - e2e + environment: + OTLP_ENDPOINT: http://oap:11800 + EMITTER_SERVICE: e2e-rr-svc + EMITTER_INSTANCE: e2e-rr-i1 + # The lifecycle flow rewrites this file via `docker compose exec` + # between phases so each emitted sample carries the current phase's + # `step` label. 
+ STEP_FILE: /tmp/step + STEP_DEFAULT: create + depends_on: + oap: + condition: service_healthy + +networks: + e2e: diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/e2e.yaml b/test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/e2e.yaml new file mode 100644 index 000000000000..f15d0afd19de --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/postgresql/e2e.yaml @@ -0,0 +1,71 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Storage-matrix e2e — PostgreSQL (JDBC) variant. +# PostgreSQL is preferred over MySQL here because the OAP image already bundles +# the PostgreSQL JDBC driver — no runtime download dance. +# +# dropTable on JDBC is a documented no-op (append-only policy), so after +# /delete the merging table stays and historical data remains queryable — +# the lifecycle proof comes from the /list row check (rule gone) and the +# swctl/MQE assertions (current step's row absent / present), not from any +# backend-direct introspection. 
+ +setup: + env: compose + file: docker-compose.yml + timeout: 25m + init-system-environment: ../../../../script/env + steps: + - name: set PATH + command: export PATH=/tmp/skywalking-infra-e2e/bin:$PATH + - name: install yq + command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh yq + - name: install swctl + command: bash test/e2e-v2/script/prepare/setup-e2e-shell/install.sh swctl + - name: install jq + command: | + if ! command -v jq >/dev/null 2>&1; then + curl -fsSL -o /tmp/skywalking-infra-e2e/bin/jq \ + https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-amd64 + chmod +x /tmp/skywalking-infra-e2e/bin/jq + fi + - name: drive runtime-rule lifecycle (CREATE → FILTER_ONLY → STRUCTURAL → INACTIVATE → DELETE) + command: | + set -euo pipefail + export PATH=/tmp/skywalking-infra-e2e/bin:$PATH + export OAP_HOST=127.0.0.1 + export OAP_REST_PORT=17128 + export OAP_GQL_PORT=12800 + export SEED_RULES_DIR=$(pwd)/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules + bash test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh + +verify: + retry: + count: 1 + interval: 1s + cases: + - query: curl -fsS http://127.0.0.1:17128/runtime/rule/list >/dev/null && echo ok + expected: ../expected/ok.txt + +cleanup: + on: always + collect: + on: failure + output-dir: $SW_INFRA_E2E_LOG_DIR/runtime-rule/mal-storage-postgresql + items: + - service: oap + paths: + - /skywalking/logs/ diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh b/test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh new file mode 100755 index 000000000000..7f80d2b032dd --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh @@ -0,0 +1,578 @@ +#!/usr/bin/env bash +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. 
+# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Runtime-rule lifecycle flow. +# +# Drives the full runtime-rule API surface against an OAP under test on +# port 17128 and asserts each transition end-to-end: +# +# 1. CREATE — POST rule v1 (1 metric, SERVICE-scope) +# 2. UPDATE-FILTER — POST rule v2 (body change ×10, same shape) +# 3. UPDATE-STRUCTURAL — POST rule v3 (adds 2nd metric) +# 4. DUMP (mid-flight) — GET /dump returns tar.gz with the live ruleset +# 5. ILLEGAL-APPLY × 4 — verify rejection paths +# 5a. malformed YAML → 400 compile_failed +# 5b. shape flip without allowStorageChange → 409 +# 5c. /delete on ACTIVE row → 409 requires_inactivate_first +# 5d. sibling rule claims the same metric name → 409 ownership conflict +# 6. SHAPE-BREAK — /inactivate → /delete → POST rule v4 (INSTANCE-scope) +# 7. INACTIVATE — POST /inactivate (soft-pause) +# 8. ACTIVATE — re-POST /addOrUpdate (lossless reactivate) +# 9. DELETE — /inactivate → /delete (destructive) +# 10. DUMP (final) — GET /dump returns tar.gz with manifest only +# +# Per-phase data attribution: the emitter publishes a `step` label whose +# value the flow rewrites between phases via `docker exec`. After each +# phase that writes data the verify queries select rows by the current +# step value, so the e2e proves "the row carrying step= exists in +# storage" or "step= rows never appeared because the rule was +# rejected". 
+# +set -euo pipefail + +OAP_HOST="${OAP_HOST:-oap}" +OAP_REST_PORT="${OAP_REST_PORT:-17128}" +OAP_GQL_PORT="${OAP_GQL_PORT:-12800}" +SEED_RULES_DIR="${SEED_RULES_DIR:-/seed-rules}" +SETTLE_SECONDS="${SETTLE_SECONDS:-360}" # first phase needs minute-bucket boundary + OTLP export interval + flush latency + register-and-aggregate path; subsequent phases land within ~120s but the upper bound stays here for resilience under CI load +CATALOG="otel-rules" +NAME="e2e_rr" +SIBLING_NAME="e2e_rr_sibling" + +REST_BASE="http://${OAP_HOST}:${OAP_REST_PORT}" +GQL_BASE="http://${OAP_HOST}:${OAP_GQL_PORT}" + +# ---- helpers -------------------------------------------------------------- + +log() { echo "[runtime-rule-flow] $*" >&2; } +fail() { echo "[runtime-rule-flow] FAIL: $*" >&2; exit 1; } + +# Resolve the otlp-emitter container by name fragment so we don't need to know +# the compose project name. Cached on first lookup. +EMITTER_CONTAINER="" +emitter_container() { + if [[ -n "${EMITTER_CONTAINER}" ]]; then + echo "${EMITTER_CONTAINER}" + return + fi + EMITTER_CONTAINER="$(docker ps --filter "name=otlp-emitter" --format "{{.Names}}" | head -1)" + [[ -n "${EMITTER_CONTAINER}" ]] || fail "no running container matching name=otlp-emitter" + echo "${EMITTER_CONTAINER}" +} + +# Flip the emitter's `step` label. The emitter re-reads /tmp/step on every +# tick, so subsequent samples carry the new value within a producer interval +# (~2 s) and reach storage after one OTLP export + one MAL aggregation hop. +step_set() { + local value="$1" + local container + container="$(emitter_container)" + docker exec "${container}" sh -c "echo '${value}' > /tmp/step" \ + || fail "failed to set step=${value} on ${container}" + log " step=${value}" +} + +# Retry a 2xx-or-fail curl for up to RETRY_BUDGET_S seconds. 
Exists because the +# cluster routing layer transiently returns 503 cluster_not_ready when its peer +# refresh is in flight; happens reliably right after a STRUCTURAL apply (the +# reconciler's cache may be paused). Operator retries after a few seconds work +# in practice, so the e2e applies the same pattern automatically. +RETRY_BUDGET_S="${RETRY_BUDGET_S:-60}" +retry_curl_post() { + local url="$1" + local body_arg="${2:-}" # e.g. --data-binary @file ; empty for empty-body POST + local deadline=$(( $(date +%s) + RETRY_BUDGET_S )) + local out + while (( $(date +%s) < deadline )); do + if [[ -n "${body_arg}" ]]; then + # shellcheck disable=SC2086 + out="$(curl -fsS -XPOST ${body_arg} -H "Content-Type: text/plain" "${url}" 2>&1)" && { + echo "${out}"; return 0; + } + else + out="$(curl -fsS -XPOST "${url}" 2>&1)" && { echo "${out}"; return 0; } + fi + if [[ "${out}" == *503* ]]; then + log " transient 503 on ${url} — retrying" + sleep 2 + continue + fi + echo "${out}" + return 1 + done + echo "${out}" + return 1 +} + +# POST a rule file to /addOrUpdate. Echoes the JSON response. Asserts 200. +post_rule() { + local file="$1" + local extra_qs="${2:-}" + local rule_name="${3:-${NAME}}" + local url="${REST_BASE}/runtime/rule/addOrUpdate?catalog=${CATALOG}&name=${rule_name}${extra_qs:+&${extra_qs}}" + log "POST ${url} (body=${file})" + local resp + resp="$(curl -fsS -XPOST --data-binary "@${file}" -H "Content-Type: text/plain" "${url}")" \ + || fail "addOrUpdate of ${file} returned non-2xx" + log " → ${resp}" + echo "${resp}" +} + +# POST a rule that's expected to be REJECTED. Captures the HTTP status and the +# response body via curl's separate -w / -o, asserts the status matches, and +# echoes the body so callers can grep for a specific failure code/string. 
+post_rule_expect_status() { + local file="$1" + local expected_status="$2" + local extra_qs="${3:-}" + local rule_name="${4:-${NAME}}" + local url="${REST_BASE}/runtime/rule/addOrUpdate?catalog=${CATALOG}&name=${rule_name}${extra_qs:+&${extra_qs}}" + log "POST ${url} (expect HTTP ${expected_status}, body=${file})" + local body_file http_status + body_file="$(mktemp)" + http_status="$(curl -sS -o "${body_file}" -w '%{http_code}' \ + -XPOST --data-binary "@${file}" -H "Content-Type: text/plain" "${url}")" + local body + body="$(cat "${body_file}")" + rm -f "${body_file}" + log " ← HTTP ${http_status} body=${body}" + [[ "${http_status}" == "${expected_status}" ]] \ + || fail "expected HTTP ${expected_status}, got ${http_status} (body: ${body})" + echo "${body}" +} + +# POST a non-/addOrUpdate endpoint that's expected to be REJECTED. Same +# semantics as post_rule_expect_status but takes an explicit URL. +post_url_expect_status() { + local url="$1" + local expected_status="$2" + log "POST ${url} (expect HTTP ${expected_status})" + local body_file http_status + body_file="$(mktemp)" + http_status="$(curl -sS -o "${body_file}" -w '%{http_code}' -XPOST "${url}")" + local body + body="$(cat "${body_file}")" + rm -f "${body_file}" + log " ← HTTP ${http_status} body=${body}" + [[ "${http_status}" == "${expected_status}" ]] \ + || fail "expected HTTP ${expected_status}, got ${http_status} (body: ${body})" + echo "${body}" +} + +# Assert the JSON response carries the expected applyStatus. +assert_apply_status() { + local expected="$1" + local actual_json="$2" + local actual + actual="$(echo "${actual_json}" | jq -r '.applyStatus // empty')" + [[ "${actual}" == "${expected}" ]] \ + || fail "expected applyStatus=${expected}, got '${actual}' (full: ${actual_json})" +} + +# GET /runtime/rule/list and ensure the row matches the expected status. Returns +# the matching JSON line on stdout for callers that want to inspect contentHash. 
+list_row() { + local expected_status="$1" + local rule_name="${2:-${NAME}}" + log "GET /runtime/rule/list → looking for ${CATALOG}/${rule_name} status=${expected_status}" + local lines + lines="$(curl -fsS "${REST_BASE}/runtime/rule/list")" \ + || fail "GET /runtime/rule/list failed" + local match + match="$(echo "${lines}" | jq -c ".rules[] | select(.catalog==\"${CATALOG}\" and .name==\"${rule_name}\")" 2>/dev/null || true)" + [[ -n "${match}" ]] \ + || fail "/list has no row for ${CATALOG}/${rule_name} (got: ${lines})" + local actual_status + actual_status="$(echo "${match}" | jq -r '.status')" + [[ "${actual_status}" == "${expected_status}" ]] \ + || fail "expected /list status=${expected_status}, got '${actual_status}' (row: ${match})" + echo "${match}" +} + +# Assert that /list does NOT have a row for the given (catalog, name). +list_no_row() { + local rule_name="${1:-${NAME}}" + log "GET /runtime/rule/list → expect NO row for ${CATALOG}/${rule_name}" + local lines match + lines="$(curl -fsS "${REST_BASE}/runtime/rule/list")" \ + || fail "GET /runtime/rule/list failed" + match="$(echo "${lines}" | jq -c ".rules[] | select(.catalog==\"${CATALOG}\" and .name==\"${rule_name}\")" 2>/dev/null || true)" + if [[ -n "${match}" ]]; then + local status + status="$(echo "${match}" | jq -r '.status')" + [[ "${status}" == "n/a" ]] \ + || fail "/list still has row for ${CATALOG}/${rule_name} status=${status} (row: ${match})" + fi +} + +# Per-phase entity scope. SHAPE-BREAK reshapes the metric from SERVICE to +# SERVICE_INSTANCE, after which swctl needs both --service-name AND +# --instance-name to resolve the entity. Phases set this before calling +# the helpers; default is SERVICE. +ENTITY_INSTANCE="${ENTITY_INSTANCE:-}" + +# Sample query: swctl returns YAML; non-empty .values means at least one minute +# bucket has data for the given metric scoped to the current ENTITY_INSTANCE +# (empty = SERVICE-scope, set = SERVICE_INSTANCE-scope). 
+swctl_metric_has_value() { + local metric="$1" + local out + if [[ -n "${ENTITY_INSTANCE}" ]]; then + out="$(swctl --display yaml --base-url="${GQL_BASE}/graphql" \ + metrics exec --expression="${metric}" \ + --service-name="e2e-rr-svc" --instance-name="${ENTITY_INSTANCE}" 2>&1)" || { + log " swctl exec ${metric} (instance=${ENTITY_INSTANCE}) failed: ${out}" + return 1 + } + else + out="$(swctl --display yaml --base-url="${GQL_BASE}/graphql" \ + metrics exec --expression="${metric}" --service-name="e2e-rr-svc" 2>&1)" || { + log " swctl exec ${metric} failed: ${out}" + return 1 + } + fi + log " swctl ${metric} → ${out}" + echo "${out}" | grep -qE '^\s*value:\s*"?-?[0-9]+(\.[0-9]+)?"?\s*$' && return 0 + return 1 +} + +# Like swctl_metric_has_value but the MQE expression filters on a `step` +# label so the result is restricted to a specific lifecycle phase. +swctl_metric_has_value_for_step() { + local metric="$1" + local step="$2" + local expr="${metric}{step='${step}'}" + swctl_metric_has_value "${expr}" +} + +await_metric() { + local metric="$1" + log "awaiting metric ${metric} (up to ${SETTLE_SECONDS}s)" + local deadline=$(( $(date +%s) + SETTLE_SECONDS )) + while (( $(date +%s) < deadline )); do + if swctl_metric_has_value "${metric}"; then + log " ✓ ${metric} has values" + return 0 + fi + sleep 5 + done + fail "metric ${metric} never produced a value within ${SETTLE_SECONDS}s" +} + +await_metric_for_step() { + local metric="$1" + local step="$2" + log "awaiting metric ${metric}{step='${step}'} (up to ${SETTLE_SECONDS}s)" + local deadline=$(( $(date +%s) + SETTLE_SECONDS )) + while (( $(date +%s) < deadline )); do + if swctl_metric_has_value_for_step "${metric}" "${step}"; then + log " ✓ ${metric}{step='${step}'} has values" + return 0 + fi + sleep 5 + done + fail "metric ${metric}{step='${step}'} never produced a value within ${SETTLE_SECONDS}s" +} + +# Storage-direct version of await_metric_for_step that bypasses the MQE +# query path and asks BanyanDB for the 
measure's raw rows. Needed for +# phases that change the metric's scope (e.g. SERVICE → SERVICE_INSTANCE +# in the SHAPE-BREAK phase): MQE resolves entity by service / instance +# binding which lags behind the BanyanDB-side schema change, so the right +# truth signal for "the new rule is producing data" is the storage layer. +await_step_in_banyandb() { + local metric="$1" # e.g. e2e_rr_requests + local step="$2" # e.g. shape_break_new + local since_ms="${3:-$(( $(date +%s) - 120 ))000}" + local measure="${metric}_minute" + local group="sw_metricsMinute" + log "awaiting BanyanDB row in ${group}/${measure} carrying step='${step}' (up to ${SETTLE_SECONDS}s, since ts=${since_ms}ms)" + local deadline=$(( $(date +%s) + SETTLE_SECONDS )) + local now_iso since_iso + since_iso="$(python3 -c "import datetime; print(datetime.datetime.fromtimestamp(${since_ms}/1000, tz=datetime.timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'))")" + while (( $(date +%s) < deadline )); do + now_iso="$(python3 -c "import datetime; print(datetime.datetime.now(datetime.timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'))")" + local body + body="$(curl -s -X POST "${BYDB_BASE}/api/v1/measure/data" -H 'Content-Type: application/json' \ + -d "{\"groups\":[\"${group}\"],\"name\":\"${measure}\",\"timeRange\":{\"begin\":\"${since_iso}\",\"end\":\"${now_iso}\"},\"tagProjection\":{\"tagFamilies\":[{\"name\":\"storage-only\",\"tags\":[\"entity_id\"]}]},\"fieldProjection\":{\"names\":[\"datatable_value\"]}}" 2>/dev/null)" + # Each post-shape-break row's datatable_value is a packed string like + # {step=create},286|{step=shape_break_new},168|{step=structural},112 + # Match the literal `step=,` substring to confirm the rule + # actually emitted that step's bucket. Bypasses MQE entity binding. 
+ if echo "${body}" | grep -q "step=${step},[0-9]"; then + log " ✓ BanyanDB row found with step=${step}" + return 0 + fi + sleep 5 + done + fail "no BanyanDB row in ${group}/${measure} carrying step=${step} within ${SETTLE_SECONDS}s (last body: ${body:-})" +} + +# Negative-direction await. Polls for SETTLE_SECONDS hoping the metric STAYS +# empty; succeeds if no value materialises in that window. Used to assert +# the INACTIVATE soft-pause window genuinely drops samples (the rule's MAL +# converter is unregistered, so emitter samples produce no rows). +expect_no_metric_for_step() { + local metric="$1" + local step="$2" + local window="${3:-${SETTLE_SECONDS}}" + log "expecting NO metric ${metric}{step='${step}'} within ${window}s" + local deadline=$(( $(date +%s) + window )) + while (( $(date +%s) < deadline )); do + if swctl_metric_has_value_for_step "${metric}" "${step}"; then + fail "metric ${metric}{step='${step}'} unexpectedly produced a value (proves the phase wasn't rejected / paused)" + fi + sleep 5 + done + log " ✓ ${metric}{step='${step}'} stayed empty for ${window}s" +} + +# Capture the latest non-null bucket id for {metric, step} so a follow-up +# call to assert_metric_step_advanced can prove a NEW bucket landed after +# some intervening event. Used after ILLEGAL-APPLY rejections to prove the +# existing rule's MAL converter kept aggregating (a rejection that +# accidentally tore down the converter would freeze the bucket id even +# though contentHash stays unchanged). +latest_bucket_id_for_step() { + local metric="$1" + local step="$2" + local out + out="$(swctl --display yaml --base-url="${GQL_BASE}/graphql" \ + metrics exec --expression="${metric}{step='${step}'}" --service-name="e2e-rr-svc" 2>/dev/null)" || { + echo "" + return + } + # Parse YAML: rows look like + # - id: "1777335720000" + # value: "19" + # Match the id of the most-recent row whose value is NOT null. awk pairs + # adjacent id+value lines and prints the id only when value is numeric. 
+ echo "${out}" | awk ' + /^[[:space:]]*-[[:space:]]*id:/ { id = $0 } + /^[[:space:]]*value:/ { + if ($0 ~ /value:[[:space:]]*"?-?[0-9]/) { + gsub(/[^0-9]/, "", id) + print id + } + } + ' | tail -1 +} + +# Assert the {metric, step} latest bucket id is strictly greater than the +# baseline captured earlier — proves a NEW bucket landed. +assert_metric_step_advanced() { + local metric="$1" + local step="$2" + local baseline="$3" + local window="${4:-${SETTLE_SECONDS}}" + log "expecting metric ${metric}{step='${step}'} to advance past id=${baseline} within ${window}s" + local deadline=$(( $(date +%s) + window )) + local latest + while (( $(date +%s) < deadline )); do + latest="$(latest_bucket_id_for_step "${metric}" "${step}")" + if [[ -n "${latest}" && -n "${baseline}" && "${latest}" -gt "${baseline}" ]]; then + log " ✓ ${metric}{step='${step}'} advanced ${baseline} → ${latest}" + return 0 + fi + sleep 5 + done + fail "metric ${metric}{step='${step}'} did not advance past ${baseline} within ${window}s (latest=${latest:-})" +} + +# Fetch /runtime/rule/dump, save the tar.gz, and assert it contains expected +# entries. Pass expected basenames (without leading paths) — any one missing +# is a failure. To assert "manifest only", pass just `manifest.yaml`. 
+assert_dump_contains() { + local label="$1" + shift + local tar_file + tar_file="$(mktemp)" + curl -fsS "${REST_BASE}/runtime/rule/dump" -o "${tar_file}" \ + || fail "GET /runtime/rule/dump failed (${label})" + local entries + entries="$(tar -tzf "${tar_file}" 2>&1)" \ + || { rm -f "${tar_file}"; fail "${label}: dump body is not a valid tar.gz: ${entries}"; } + log "${label} dump entries: ${entries//$'\n'/, }" + rm -f "${tar_file}" + for required in "$@"; do + echo "${entries}" | grep -q "${required}" \ + || fail "${label}: dump missing required entry ${required} (got: ${entries})" + done + log " ✓ ${label} dump contains: $*" +} + +# ---- flow ----------------------------------------------------------------- + +log "waiting for OAP runtime-rule port ${OAP_REST_PORT}" +for _ in $(seq 1 60); do + curl -fsS "${REST_BASE}/runtime/rule/list" >/dev/null 2>&1 && break + sleep 2 +done + +# Resolve the emitter container so subsequent step_set calls don't pay the +# `docker ps` cost twice. +emitter_container >/dev/null + +# Phase 1 — CREATE. +log "=== Phase 1: CREATE seed-rule.yaml ===" +step_set "create" +resp="$(post_rule "${SEED_RULES_DIR}/seed-rule.yaml")" +assert_apply_status "structural_applied" "${resp}" +list_row "ACTIVE" >/dev/null +hash_initial="$(list_row ACTIVE | jq -r '.contentHash')" +log " initial contentHash=${hash_initial}" +await_metric_for_step "e2e_rr_requests" "create" + +# Phase 2 — UPDATE-FILTER (body-only, same shape). 
+log "=== Phase 2: UPDATE-FILTER seed-rule-filter-only.yaml ===" +step_set "update_filter" +resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-filter-only.yaml")" +assert_apply_status "filter_only_applied" "${resp}" +hash_filter_only="$(list_row ACTIVE | jq -r '.contentHash')" +[[ "${hash_filter_only}" != "${hash_initial}" ]] \ + || fail "FILTER_ONLY apply did not advance /list contentHash" +log " contentHash advanced to ${hash_filter_only}" +await_metric_for_step "e2e_rr_requests" "update_filter" + +# Phase 3 — UPDATE-STRUCTURAL (adds e2e_rr_pool metric). +log "=== Phase 3: UPDATE-STRUCTURAL seed-rule-structural.yaml ===" +step_set "structural" +resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-structural.yaml" "allowStorageChange=true")" +assert_apply_status "structural_applied" "${resp}" +hash_structural="$(list_row ACTIVE | jq -r '.contentHash')" +[[ "${hash_structural}" != "${hash_filter_only}" ]] \ + || fail "STRUCTURAL apply did not advance /list contentHash" +log " contentHash advanced to ${hash_structural}" +await_metric_for_step "e2e_rr_requests" "structural" +await_metric_for_step "e2e_rr_pool" "structural" + +# Phase 4 — DUMP (mid-flight). +log "=== Phase 4: DUMP (mid-flight) ===" +assert_dump_contains "mid-flight" "manifest" "${NAME}" + +# Phase 5 — ILLEGAL-APPLY × 4. Each rejection must: +# (a) return the documented HTTP status code, +# (b) leave /list contentHash unchanged (the bad rule never replaced the +# active one), +# (c) leave the existing rule's MAL converter alive — a rejection that +# accidentally tore down the converter would freeze the bucket id for +# step=structural even though contentHash stays unchanged. +# We do NOT change `step` here: the structural rule from phase 3 keeps +# aggregating regardless of which rule the operator just tried to push, so +# any `step=illegal_*` rows that appear would be the existing rule's output, +# not evidence the rejection failed. 
+ +log "=== Phase 5a: ILLEGAL malformed YAML ===" +struct_baseline="$(latest_bucket_id_for_step "e2e_rr_requests" "structural")" +post_rule_expect_status "${SEED_RULES_DIR}/illegal-malformed.yaml" "400" >/dev/null +[[ "$(list_row ACTIVE | jq -r '.contentHash')" == "${hash_structural}" ]] \ + || fail "5a: contentHash moved after malformed YAML rejection" +assert_metric_step_advanced "e2e_rr_requests" "structural" "${struct_baseline}" 180 + +log "=== Phase 5b: ILLEGAL shape flip without allowStorageChange ===" +struct_baseline="$(latest_bucket_id_for_step "e2e_rr_requests" "structural")" +post_rule_expect_status "${SEED_RULES_DIR}/illegal-shape-flip.yaml" "409" >/dev/null +[[ "$(list_row ACTIVE | jq -r '.contentHash')" == "${hash_structural}" ]] \ + || fail "5b: contentHash moved after shape-flip rejection" +assert_metric_step_advanced "e2e_rr_requests" "structural" "${struct_baseline}" 180 + +log "=== Phase 5c: ILLEGAL /delete on ACTIVE row ===" +struct_baseline="$(latest_bucket_id_for_step "e2e_rr_requests" "structural")" +post_url_expect_status "${REST_BASE}/runtime/rule/delete?catalog=${CATALOG}&name=${NAME}" "409" >/dev/null +[[ "$(list_row ACTIVE | jq -r '.contentHash')" == "${hash_structural}" ]] \ + || fail "5c: row state changed after /delete-on-ACTIVE rejection" +assert_metric_step_advanced "e2e_rr_requests" "structural" "${struct_baseline}" 180 + +log "=== Phase 5d: ILLEGAL duplicate metric ownership (sibling rule) ===" +struct_baseline="$(latest_bucket_id_for_step "e2e_rr_requests" "structural")" +post_rule_expect_status "${SEED_RULES_DIR}/illegal-duplicate-metric.yaml" "409" "" "${SIBLING_NAME}" >/dev/null +list_no_row "${SIBLING_NAME}" +[[ "$(list_row ACTIVE | jq -r '.contentHash')" == "${hash_structural}" ]] \ + || fail "5d: primary rule's contentHash moved after sibling-conflict rejection" +assert_metric_step_advanced "e2e_rr_requests" "structural" "${struct_baseline}" 180 + +# Phase 6 — SHAPE-BREAK via the supported route: /inactivate → /delete → +# 
POST a new shape under the same (catalog, name). +log "=== Phase 6: SHAPE-BREAK ===" +step_set "shape_break_old" +log " /inactivate to release the old shape" +inactivate_url="${REST_BASE}/runtime/rule/inactivate?catalog=${CATALOG}&name=${NAME}" +retry_curl_post "${inactivate_url}" >/dev/null \ + || fail "shape-break: inactivate failed" +list_row "INACTIVE" >/dev/null +log " /delete to drop the old measure" +delete_url="${REST_BASE}/runtime/rule/delete?catalog=${CATALOG}&name=${NAME}" +retry_curl_post "${delete_url}" >/dev/null \ + || fail "shape-break: delete failed" +list_no_row + +step_set "shape_break_new" +log " POST INSTANCE-scope rule v4" +resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-instance.yaml")" +assert_apply_status "structural_applied" "${resp}" +hash_shape_break="$(list_row ACTIVE | jq -r '.contentHash')" +log " contentHash after shape break = ${hash_shape_break}" +# Rule v4 is INSTANCE-scope; swctl now needs --instance-name to resolve +# the entity. Set ENTITY_INSTANCE for the remainder of the flow (phases +# 6, 8 read it; phase 7's expect-empty doesn't need it but harmless). +ENTITY_INSTANCE="e2e-rr-i1" +await_metric_for_step "e2e_rr_requests" "shape_break_new" + +# Phase 7 — INACTIVATE (soft-pause: backend schema + data preserved). +# Order matters: /inactivate FIRST, then flip step. Otherwise the brief +# window where the rule is still active aggregates a few `step=inactivate` +# samples and the soft-pause assertion below fails for the wrong reason. +log "=== Phase 7: INACTIVATE (soft-pause) ===" +retry_curl_post "${inactivate_url}" >/dev/null \ + || fail "phase-7: inactivate failed" +list_row "INACTIVE" >/dev/null +step_set "inactivate" +expect_no_metric_for_step "e2e_rr_requests" "inactivate" 30 + +# Phase 8 — ACTIVATE (re-POST same content; lossless). 
+log "=== Phase 8: ACTIVATE ===" +step_set "activate" +resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-instance.yaml")" +status="$(echo "${resp}" | jq -r '.applyStatus // empty')" +[[ "${status}" == "structural_applied" || "${status}" == "no_change" ]] \ + || fail "ACTIVATE: unexpected applyStatus=${status} (full: ${resp})" +list_row "ACTIVE" >/dev/null +await_metric_for_step "e2e_rr_requests" "activate" +# NOTE: we do NOT re-assert "no step=inactivate rows" here. Phase 7's in-window +# check already proved the soft-pause window dropped samples. After re-activate, +# OTel's PeriodicExportingMetricReader keeps exporting every counter's +# cumulative value on each tick, including the {step=inactivate} counter that +# still holds its last value from phase 7. Once the MAL converter is back, those +# cumulative re-exports flow into storage — that's emitter-side OTel semantics, +# not a runtime-rule contract violation. + +# Phase 9 — DELETE (destructive). +log "=== Phase 9: DELETE ===" +step_set "delete_attempt" +retry_curl_post "${inactivate_url}" >/dev/null \ + || fail "phase-9: inactivate-before-delete failed" +list_row "INACTIVE" >/dev/null +retry_curl_post "${delete_url}" >/dev/null \ + || fail "phase-9: delete failed" +list_no_row +log " ✓ row gone + backend probe agrees" + +# Phase 10 — DUMP (final). After DELETE, the dump should contain only the +# manifest — no rule files. +log "=== Phase 10: DUMP (final) ===" +assert_dump_contains "final" "manifest" + +log "=== runtime-rule-flow.sh PASSED ===" diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-duplicate-metric.yaml b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-duplicate-metric.yaml new file mode 100644 index 000000000000..0e366c56e6be --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-duplicate-metric.yaml @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. 
See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# ILLEGAL: a sibling rule that claims the same metric name (`requests`) +# the structural rule already owns. The lifecycle e2e ILLEGAL-APPLY phase +# posts this under a *different* `name` query parameter so the cross-file +# ownership guard fires, and expects HTTP 409 with a body that names the +# colliding owner. +metricPrefix: e2e_rr +expSuffix: service(['service_name'], Layer.GENERAL) +metricsRules: + - name: requests + exp: e2e_rr_request_count_total.sum(['service_name', 'step']) diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-malformed.yaml b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-malformed.yaml new file mode 100644 index 000000000000..dc68e8e6f850 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-malformed.yaml @@ -0,0 +1,24 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# ILLEGAL: malformed YAML — used by the lifecycle e2e ILLEGAL-APPLY phase +# to assert the runtime-rule REST handler returns 400 compile_failed +# without mutating /list or backend storage. The MAL parser rejects the +# unquoted `*** broken ***` token before any registration runs. +metricPrefix: e2e_rr +expSuffix: service(['service_name'], Layer.GENERAL) +metricsRules: + - name: requests + exp: *** broken *** diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-shape-flip.yaml b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-shape-flip.yaml new file mode 100644 index 000000000000..1661803efd42 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/illegal-shape-flip.yaml @@ -0,0 +1,24 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +# ILLEGAL: same-name shape break (SERVICE → SERVICE_INSTANCE scope) posted +# WITHOUT `?allowStorageChange=true`. The lifecycle e2e ILLEGAL-APPLY +# phase posts this against the structural rule already on the OAP and +# expects HTTP 409 storage_change_requires_explicit_approval. +metricPrefix: e2e_rr +expSuffix: instance(['service_name'], ['service_instance_id'], Layer.GENERAL) +metricsRules: + - name: requests + exp: e2e_rr_request_count_total.sum(['service_name', 'service_instance_id', 'step']) diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-filter-only.yaml b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-filter-only.yaml new file mode 100644 index 000000000000..46b3a93bc541 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-filter-only.yaml @@ -0,0 +1,24 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# FILTER_ONLY update of seed-rule.yaml. +# Same metric name, same shape (still service-scope sum), expression body +# scaled by 10 so post-update bucket values jump by an order of magnitude +# and the e2e can confirm the swap took effect without backend DDL. 
+metricPrefix: e2e_rr +expSuffix: service(['service_name'], Layer.GENERAL) +metricsRules: + - name: requests + exp: (e2e_rr_request_count_total * 10).sum(['service_name', 'step']) diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-instance.yaml b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-instance.yaml new file mode 100644 index 000000000000..55d738564de7 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-instance.yaml @@ -0,0 +1,27 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# v4 — shape-break variant. Same metric name (`requests`), but the scope +# changes from SERVICE to SERVICE_INSTANCE, which is a same-name shape +# break that the runtime-rule guardrail refuses unless the operator +# explicitly approves. Posted via the SHAPE-BREAK phase after a +# /inactivate + /delete cycle (the supported route through the API), +# and again via the ACTIVATE phase to prove the soft-pause window +# preserves history while the new shape is reapplied. 
+metricPrefix: e2e_rr +expSuffix: instance(['service_name'], ['service_instance_id'], Layer.GENERAL) +metricsRules: + - name: requests + exp: e2e_rr_request_count_total.sum(['service_name', 'service_instance_id', 'step']) diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-structural.yaml b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-structural.yaml new file mode 100644 index 000000000000..3ad6aab6d1f8 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule-structural.yaml @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# STRUCTURAL update of seed-rule-filter-only.yaml. +# Adds a second metric (e2e_rr_pool) derived from the gauge — STRUCTURAL +# because the metric set changes. Each backend's CreatingListener fires +# again; the e2e asserts the new measure / table / index lands and the +# new metric is queryable. The pre-existing e2e_rr_requests metric stays +# under its scaled expression. 
+metricPrefix: e2e_rr +expSuffix: service(['service_name'], Layer.GENERAL) +metricsRules: + - name: requests + exp: (e2e_rr_request_count_total * 10).sum(['service_name', 'step']) + - name: pool + exp: e2e_rr_pool_size.sum(['service_name', 'step']) diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule.yaml b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule.yaml new file mode 100644 index 000000000000..4e00e9aaf175 --- /dev/null +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/seed-rules/seed-rule.yaml @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# v1 — initial CREATE rule for the runtime-rule lifecycle e2e. +# Derives one SkyWalking metric (e2e_rr_requests) from the OTLP counter +# emitted by the otlp-emitter container, keyed by service + step so each +# lifecycle phase's samples land on a distinct row the verify queries can +# attribute back to the phase that wrote them. +# Verifies first-time DDL on each backend — boots without any static +# otel-rules / lal so the runtime-rule apply path is the one that +# creates the underlying measure / table / index. 
+metricPrefix: e2e_rr +expSuffix: service(['service_name'], Layer.GENERAL) +metricsRules: + - name: requests + exp: e2e_rr_request_count_total.sum(['service_name', 'step']) diff --git a/test/e2e-v2/java-test-service/e2e-mock-baseline-server/pom.xml b/test/e2e-v2/java-test-service/e2e-mock-baseline-server/pom.xml index 167c1bd65c44..e24dc67e3a3a 100644 --- a/test/e2e-v2/java-test-service/e2e-mock-baseline-server/pom.xml +++ b/test/e2e-v2/java-test-service/e2e-mock-baseline-server/pom.xml @@ -117,7 +117,7 @@ com.google.protobuf:protoc:3.19.2:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:1.42.1:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/test/e2e-v2/java-test-service/e2e-protocol/pom.xml b/test/e2e-v2/java-test-service/e2e-protocol/pom.xml index 868fc6215a44..fa2fc16dc295 100644 --- a/test/e2e-v2/java-test-service/e2e-protocol/pom.xml +++ b/test/e2e-v2/java-test-service/e2e-protocol/pom.xml @@ -81,7 +81,7 @@ com.google.protobuf:protoc:3.19.2:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:1.42.1:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/test/e2e-v2/java-test-service/opentelemetry-proto/pom.xml b/test/e2e-v2/java-test-service/opentelemetry-proto/pom.xml index 12823fffcdc8..353fdcf20a76 100644 --- a/test/e2e-v2/java-test-service/opentelemetry-proto/pom.xml +++ b/test/e2e-v2/java-test-service/opentelemetry-proto/pom.xml @@ -81,7 +81,7 @@ com.google.protobuf:protoc:3.19.2:exe:${os.detected.classifier} grpc-java - io.grpc:protoc-gen-grpc-java:1.42.1:exe:${os.detected.classifier} + io.grpc:protoc-gen-grpc-java:${grpc.version}:exe:${os.detected.classifier} diff --git a/test/e2e-v2/script/env b/test/e2e-v2/script/env index ae6486451f04..5fb3415a029d 100644 --- a/test/e2e-v2/script/env +++ b/test/e2e-v2/script/env @@ -23,7 +23,7 @@ 
SW_AGENT_CLIENT_JS_COMMIT=f08776d909eb1d9bc79c600e493030651b97e491 SW_AGENT_CLIENT_JS_TEST_COMMIT=4f1eb1dcdbde3ec4a38534bf01dded4ab5d2f016 SW_KUBERNETES_COMMIT_SHA=2850db1502283a2d8516146c57cc2b49f1da934b SW_ROVER_COMMIT=79292fe07f17f98f486e0c4471213e1961fb2d1d -SW_BANYANDB_COMMIT=d0f41f9ad139c917c1398c2e62a9f7034214495f +SW_BANYANDB_COMMIT=69c8f4d20ebb6532ea4c16a7ed7114dd6ec9770b SW_AGENT_PHP_COMMIT=d1114e7be5d89881eec76e5b56e69ff844691e35 SW_PREDICTOR_COMMIT=54a0197654a3781a6f73ce35146c712af297c994 From 9eeb0a8f66beabe6eb2d98436251faf41e9fc30f Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Wed, 29 Apr 2026 23:01:27 +0800 Subject: [PATCH 2/5] Runtime-rule hot-update: rename SPI for clarity, unify revertToBundled flow MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rename `StorageManipulationOpt` factory methods + `Mode` constants throughout core and consumers (`fullInstall` → `withSchemaChange`, `localCacheOnly` → `withoutSchemaChange`, `localCacheVerify` → `verifySchemaOnly`, `createIfAbsent` → `schemaCreateIfAbsent`). Rename `Kind.STATIC` → `Kind.BUNDLED` to align with `Status.BUNDLED`. Loader name prefix flips from `static:` to `bundled:` so diagnostics match the vocabulary. Rename runtime-rule SPI: `loadStaticRuleFile` → `recordBundledClaims` (it stamps claim metadata, not load), `reloadStatic` → `installBundled`, `dropBackend` → `installRuntime` (purpose: install runtime DSL locally for delta computation), and `DSLRuntimeApply.applyInline` → `apply`. Thread `Kind` through `RuleEngine.compile`, `DSLRuntimeApply.apply`, and `compileAndVerify`. Unify `/delete?mode=revertToBundled` into a single `DSLRuntimeDelete.revertToBundled` method that runs the standard apply pipeline against the bundled YAML: install runtime locally → `apply(bundled, STRUCTURAL, BUNDLED, withSchemaChange)` so the commit's delta drops runtime-only metrics and installs bundled-only ones → reset state to boot-seeded. 
Eliminates the prior re-register-then-drop dance and reuses the same code path operators already exercise via `/addOrUpdate`.

Default `/delete` no longer drops backend schema; the row is removed and any backend resource is left as an inert artefact (matches bundled-rule deletion semantics on disk). The schema-change moment lives only on the explicit `?mode=revertToBundled` path.

Fix StaticRuleLoader.loadAll: now uses `rules.compute` to overlay bundled content and RUNNING state on the engine-installed Applied (`putIfAbsent` was a no-op because `recordBundledClaims` had already created the entry, so bundled-only rules were missing content/state for `/list`, suspend, and first-edit classify).

Fix RuleSync.cleanupGoneKeys: pass `withoutSchemaChange` unconditionally so a peer-promoted-to-main node cannot drop the backend during gone-keys cleanup (a drop there would contradict the operator-facing contract that default `/delete` preserves backend resources).

Fix revertToBundled rollback: when the bundled apply fails, unregister the step-1 runtime install so local state matches the persisted INACTIVE row (previously the runtime rule was left serving silently after a failed revert).

Add `requires_revert_to_bundled` 409 response: default `/delete` against a rule with a bundled YAML twin is refused, because letting bundled silently take over the `(catalog, name)` should require an explicit operator decision. Reword `requires_inactivate_first` and `revert_to_bundled_failed` responses to match the new behavior.

Documentation: update `backend-runtime-rule-api.md` (status table, error codes, loaderKind values, `/delete` storage semantics per backend), `runtime-rule-hot-update.md` design doc (four `/delete` paths spelled out), and `changes.md` changelog entry.
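
For reviewers, the `/delete` outcomes documented by this patch can be read as a small decision table. The bash sketch below is illustrative only and is not part of the change: the function name, its argument encoding, and the `row_removed` / `reverted_to_bundled` labels are hypothetical, and the order in which the row-state and mode checks run is an assumption. Only the quoted HTTP codes and error names are taken from the updated docs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: predicts the documented /delete outcome.
# Only the HTTP codes and error names come from the docs in this patch;
# everything else (names, labels, check ordering) is assumed for illustration.
expected_delete_outcome() {
  local row_state="$1"        # ACTIVE | INACTIVE | ABSENT
  local bundled_twin="$2"     # yes | no
  local mode="${3:-default}"  # default | revertToBundled

  # Row-state gates are evaluated first (assumed ordering).
  case "${row_state}" in
    ACTIVE) echo "409 requires_inactivate_first"; return 0 ;;
    ABSENT) echo "200 not_found"; return 0 ;;
  esac

  # INACTIVE row: the outcome depends on mode and on the bundled twin.
  case "${mode}:${bundled_twin}" in
    default:no)          echo "row_removed (backend measure left in place)" ;;
    default:yes)         echo "409 requires_revert_to_bundled" ;;
    revertToBundled:yes) echo "reverted_to_bundled (schema-change pipeline)" ;;
    revertToBundled:no)  echo "400 no_bundled_twin" ;;
    *)                   echo "unknown"; return 1 ;;
  esac
}
```

For example, `expected_delete_outcome INACTIVE yes` prints `409 requires_revert_to_bundled`, which is the refusal that forces an explicit `?mode=revertToBundled` decision.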
--- docs/en/changes/changes.md | 19 +- .../runtime-rule-hot-update.md | 39 ++- .../setup/backend/backend-runtime-rule-api.md | 63 +++-- .../oap/meter/analyzer/v2/Analyzer.java | 18 +- .../oap/meter/analyzer/v2/MetricConvert.java | 10 +- .../core/analysis/meter/MeterSystem.java | 20 +- .../worker/ManagementStreamProcessor.java | 2 +- .../worker/MetricsStreamProcessor.java | 8 +- .../analysis/worker/NoneStreamProcessor.java | 2 +- .../worker/RecordStreamProcessor.java | 2 +- .../analysis/worker/TopNStreamProcessor.java | 2 +- .../core/classloader/ClassLoaderGc.java | 2 +- .../classloader/DSLClassLoaderManager.java | 8 +- .../core/classloader/RuleClassLoader.java | 6 +- .../core/storage/model/ModelInstaller.java | 2 +- .../core/storage/model/ModelRegistry.java | 6 +- .../storage/model/StorageManipulationOpt.java | 86 +++--- .../core/storage/model/StorageModels.java | 4 +- .../DSLClassLoaderManagerTest.java | 6 +- .../core/classloader/RuleClassLoaderTest.java | 8 +- .../core/storage/model/StorageModelsTest.java | 8 +- .../runtimerule/apply/LalFileApplier.java | 4 +- .../runtimerule/apply/MalFileApplier.java | 16 +- .../runtimerule/engine/RuleEngine.java | 96 ++++--- .../runtimerule/engine/lal/LalRuleEngine.java | 56 ++-- .../runtimerule/engine/mal/MalRuleEngine.java | 155 ++++------ .../module/RuntimeRuleModuleProvider.java | 20 +- .../runtimerule/reconcile/DSLManager.java | 59 ++-- .../reconcile/DSLRuntimeApply.java | 38 ++- .../reconcile/DSLRuntimeDelete.java | 265 ++++++++++++------ .../reconcile/DSLRuntimeUnregister.java | 25 +- .../runtimerule/reconcile/RuleSync.java | 26 +- .../reconcile/StaticRuleLoader.java | 26 +- .../receiver/runtimerule/rest/DeleteMode.java | 20 +- .../runtimerule/rest/RuntimeRuleService.java | 175 +++++++----- .../runtimerule/apply/MalFileApplierTest.java | 2 +- .../rest/RuntimeRuleRestHandlerTest.java | 10 +- .../banyandb/BanyanDBIndexInstaller.java | 16 +- 38 files changed, 749 insertions(+), 581 deletions(-) diff --git 
a/docs/en/changes/changes.md b/docs/en/changes/changes.md index ff8d7151c6d8..8a146805ab13 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -15,13 +15,18 @@ on disk are not auto-resurrected when an `inactivate` removes the runtime override. This is the safe way to take a rule offline. * `delete` — removes an `INACTIVE` row (active rules return `409 requires_inactivate_first`). - For runtime-only rules with no bundled YAML on disk, the backend measure is dropped and - the rule is fully gone. For rules that have a bundled YAML, `delete` is non-destructive: - backend resources runtime claimed that bundled does not (or claims at a different shape) - are dropped, bundled-shared at matching shape is preserved, the row is removed, and the - bundled rule is reinstalled into a `static:` loader on the local node — peers converge - via the periodic reconcile. `?mode=revertToBundled` is an explicit operator hint that - fails with `400 no_bundled_twin` when no bundled YAML exists. + For runtime-only rules with no bundled YAML on disk, the row is dropped; the backend + measure (if any) is left in place as an inert artefact, matching bundled-rule deletion + semantics (removing a YAML from `otel-rules/` on disk doesn't drop its measure either). + For rules that have a bundled YAML twin, plain `delete` returns `409 + requires_revert_to_bundled` — letting bundled silently take over the + `(catalog, name)` is a meaningful state change that requires an explicit operator + decision. Re-issue with `?mode=revertToBundled` to fall back to bundled: that path runs + the schema-change pipeline (rehydrates the runtime DSL locally, then applies the + bundled YAML through the standard apply pipeline so the runtime→bundled delta drops + runtime-only metrics, registers bundled-only metrics, and reuses bundled-shared metrics + at matching shape) before removing the row. Returns `400 no_bundled_twin` when + `?mode=revertToBundled` is used without a bundled YAML on disk. 
* `get` / `bundled` / `list` / `dump` — read-side endpoints for fetching a single rule's YAML (with `ETag` support; `?source=bundled` reads the on-disk bundled YAML even when a runtime override is in place), listing the bundled-vs-runtime overlay per catalog, diff --git a/docs/en/concepts-and-designs/runtime-rule-hot-update.md b/docs/en/concepts-and-designs/runtime-rule-hot-update.md index 9ee963f4cca6..135c8a301fd0 100644 --- a/docs/en/concepts-and-designs/runtime-rule-hot-update.md +++ b/docs/en/concepts-and-designs/runtime-rule-hot-update.md @@ -221,17 +221,34 @@ never destroys data they might want back: reset), but the **backend measure and its data are explicitly preserved**. Re-activation via `/addOrUpdate` reuses the existing measure; the cost is a recompile, not a backfill or a metric-identity change. -- **`/delete`** is the **destructive** endpoint — the **only** one that drops - data. It refuses to operate on an `ACTIVE` row (returns `HTTP 409 - requires_inactivate_first`), so destruction always goes through the explicit - two-step `/inactivate → /delete` workflow. On an `INACTIVE` row it drops the - backend measure and removes the entry; on an absent row it is an idempotent - `200 not_found`. - -If a static version of the rule exists on disk, `/delete` of the runtime entry -causes the rule to revert to the static content on the next periodic scan. This is -the intended recovery path for "undo all operator state, go back to what ships in -the OAP distribution." +- **`/delete`** removes the runtime row. It refuses to operate on an `ACTIVE` + row (returns `HTTP 409 requires_inactivate_first`), so destruction always goes + through the explicit two-step `/inactivate → /delete` workflow. On an absent + row it is an idempotent `200 not_found`. The `/inactivate` step has already + torn down OAP-internal state under `withoutSchemaChange` (handlers, prototypes, + Models cleared; backend measure preserved). 
What `/delete` does next depends on + whether a bundled YAML twin exists on disk: + - **No bundled twin (default mode)** — drops the row only; the backend measure + (if any) is left in place as an inert artefact. This matches bundled-rule + deletion semantics: removing a YAML from `otel-rules/` on disk doesn't drop + its measure either. Operators who want backend cleanup must purge the + measure out-of-band with the storage backend's tools. + - **Bundled twin exists, default mode** — refused with `HTTP 409 + requires_revert_to_bundled`. Letting bundled silently take over the + `(catalog, name)` after the row goes away is a meaningful state change that + requires an explicit operator decision. + - **Bundled twin exists, `?mode=revertToBundled`** — schema-change path. + Bundled may have a different shape than runtime, so the apply pipeline runs + a unified flow: re-install the prior runtime DSL locally under + `withoutSchemaChange` (no backend touch), apply the bundled YAML through + the standard `compile → fireSchemaChanges → verify → commit` pipeline with + `withSchemaChange`. The commit's delta drops runtime-only metrics through + the listener chain, registers bundled-only metrics, and reuses bundled-shared + metrics at matching shape. The runtime row is then removed and the bundled + rule is the active loader on this node. Peers converge via the periodic + scan. + - **No bundled twin, `?mode=revertToBundled`** — refused with `HTTP 400 + no_bundled_twin`. ### Inactive rules still hold their names diff --git a/docs/en/setup/backend/backend-runtime-rule-api.md b/docs/en/setup/backend/backend-runtime-rule-api.md index 40243f1699e6..d94fc879384e 100644 --- a/docs/en/setup/backend/backend-runtime-rule-api.md +++ b/docs/en/setup/backend/backend-runtime-rule-api.md @@ -86,7 +86,7 @@ server returns `400 compile_failed`. 
|--------|------------------------------------------------------------------------------------|---------------|--------| | POST | `/runtime/rule/addOrUpdate?catalog=&name=[&allowStorageChange=true][&force=true]` | raw rule YAML | Creates or replaces a rule. Edits that keep the same metric storage shape are applied without pausing the cluster. Edits that add, remove, or reshape metrics pause affected traffic, update and verify backend storage, save the rule, and then resume. If the posted content exactly matches the current `ACTIVE` rule, the server returns `no_change`; `force=true` skips that shortcut for recovery. | | POST | `/runtime/rule/inactivate?catalog=&name=` | empty | Soft-pauses a rule. OAP stops using the rule and saves it as `INACTIVE`, while the backend measure and historical data remain available for reactivation. | -| POST | `/runtime/rule/delete?catalog=&name=[&mode=revertToBundled]` | empty | Removes an `INACTIVE` runtime row. Active rules return `409 requires_inactivate_first`. **No bundled twin on disk** → destructive: backend resource is dropped and the rule is fully gone. **Bundled twin on disk** → non-destructive: backend is preserved (bundled will reuse it), the row is removed, and the bundled rule is reinstalled into a `static:` loader on the local node. Peers converge via the gone-keys reconcile path on their next tick. `?mode=revertToBundled` is an explicit operator hint that requires a bundled twin (returns `400 no_bundled_twin` when none exists) — useful for scripts that want to fail loudly if their assumption was wrong. The OAP-side teardown (cluster-wide unparking, dispatcher / worker / catalog / model removal, stored-rule removal) is uniform; the **storage-side** effect is per-backend (see below). | +| POST | `/runtime/rule/delete?catalog=&name=[&mode=revertToBundled]` | empty | Removes an `INACTIVE` runtime row. Active rules return `409 requires_inactivate_first`. 
**No bundled twin, default mode** → drops the row; the backend measure (if any) stays as an inert artefact. **Bundled twin exists, default mode** → returns `409 requires_revert_to_bundled` because letting bundled silently take over the `(catalog, name)` is a meaningful state change requiring an explicit operator decision. **Bundled twin exists, `?mode=revertToBundled`** → schema-change path: the orchestrator rehydrates the runtime DSL locally and runs the bundled YAML through the standard apply pipeline so the runtime→bundled delta drops runtime-only metrics, registers bundled-only metrics, and reuses bundled-shared metrics at matching shape; the row is then removed and the bundled rule is the active loader on the local node. Peers converge via the gone-keys reconcile path on their next tick. **No bundled twin, `?mode=revertToBundled`** → returns `400 no_bundled_twin`. | **Read endpoints** @@ -99,23 +99,22 @@ server returns `400 compile_failed`. ### `/delete` storage semantics — per backend -`/delete` always tears down the rule on the OAP side: the cluster unparks the affected -dispatchers, removes the workers, drops the model from the in-memory registry, removes -the stored rule, and the rule no longer appears in `/runtime/rule/list`. What happens to -the **on-disk data** depends on the storage plugin: +`/delete` tears down the runtime DSL on the OAP side (the `/inactivate` step that must +precede it has already unparked dispatchers, removed workers, and dropped the model from +the in-memory registry) and removes the stored rule from `/runtime/rule/list`. What +happens to the **backend schema and data** depends on the path taken: -| Backend | After `/delete` | Old data still queryable? | -|---|---|---| -| **BanyanDB** | The measure / stream group + schema are dropped (`dropMeasure` / `dropStream`). | No — rows are gone. | -| **Elasticsearch** | `dropTable` is a documented **no-op**. The merging index (e.g. `metrics-all`) and any per-metric index stay. 
| Yes — historical samples remain in place until TTL expires. | -| **JDBC (H2 / MySQL / PostgreSQL / TiDB / OceanBase)** | `dropTable` is a documented **no-op**. The merging table (e.g. `meter_sum_`) stays. | Yes — historical samples remain in place until TTL expires. | +| Path | Backend effect | +|---|---| +| **Default mode, no bundled twin** | The runtime DSL was already torn down by `/inactivate`. The backend measure (if any) is left in place as an inert artefact — no listener writes to it, but the schema and historical rows stay. This matches bundled-rule deletion semantics on disk: removing a YAML from `otel-rules/` doesn't drop its measure either. Reclaim manually via the storage backend's tools if you need the schema gone. | +| **Default mode, bundled twin** | Refused with `409 requires_revert_to_bundled`. The operator must opt in explicitly. | +| **`?mode=revertToBundled`, bundled twin** | Schema-change path. Bundled may have a different shape than runtime. The runtime DSL is rehydrated locally so the apply pipeline can compute the runtime→bundled delta. Metrics in the delta:<br>• **Runtime-only** (in runtime, not in bundled) — dropped via the listener chain. On BanyanDB the measure / stream is dropped. On ES / JDBC `dropTable` is a documented no-op (their tables are append-only; TTL reclaims space).<br>• **Bundled-only** (in bundled, not in runtime) — created.<br>• **Bundled-shared at matching shape** — reused; no schema mutation.<br>• **Bundled-shared at differing shape** — reshaped via the listener chain (additive subset each backend supports online: `client.update` for BanyanDB, add-column for JDBC, mapping append for ES). | +| **`?mode=revertToBundled`, no bundled twin** | Refused with `400 no_bundled_twin`. | -The ES / JDBC behaviour is intentional and consistent with how the static catalog treats -table lifecycle on those backends: tables are append-only, and TTL — not DDL — reclaims -space. If you need the data gone immediately, drop the table out-of-band with the storage -backend's own tools after `/delete` returns. +Historical query semantics on ES and JDBC are unchanged from prior releases: tables stay +beyond `dropTable` and TTL reclaims rows. -A re-`addOrUpdate` of the same rule (same name, same scope and downsampling) replays +A re-`addOrUpdate` of the same rule (same name, scope and downsampling) replays schema registration. On BanyanDB this re-creates the measure; on ES / JDBC this is a no-op against the existing index / table. In both cases new samples land alongside any retained history. @@ -175,6 +174,16 @@ false. > a clean identity. Treat the flag as an explicit "I accept data loss" affirmation, not a > convenience toggle. +> **Edge case — `/addOrUpdate` after a `/delete` of a runtime-only rule.** Default `/delete` +> with no bundled twin leaves the backend measure in place as an inert artefact. A later +> `/addOrUpdate` against the same `(catalog, name)` has no `priorContent` to diff against, +> so the storage-change guardrail will not refuse the request even when the new content +> reuses the same metric names with a different shape. The apply pipeline's listener chain +> may reshape the inert backend measure silently. If you suspect a stale schema from a +> previously-deleted rule, push the new rule with `allowStorageChange=true` so the intent +> is explicit; or rename the metrics in the new rule so the old schema stays inert and a +> fresh measure is created instead.
+ ### Recovery from a failed apply When an `/addOrUpdate` fails during validation or apply, the node does **not** lose the @@ -279,7 +288,8 @@ the response formats listed above; their error responses use the same JSON shape | 200 OK | `inactivated` | row flipped to `INACTIVE`; backend measure and data preserved | | 200 OK | `static_tombstoned` | `/inactivate` against a rule that exists only on disk; an `INACTIVE` tombstone row is now persisted | | 200 OK | `already_inactive` | `/inactivate` against an already-inactive row; idempotent no-op | -| 200 OK | `deleted` | row hard-deleted; backend measure dropped (MAL) or in-process handlers removed (LAL) | +| 200 OK | `deleted` | `/delete` of a rule with no bundled twin; row removed, backend measure left as inert artefact | +| 200 OK | `reverted_to_bundled` | `/delete?mode=revertToBundled`; runtime row removed, bundled rule installed via the apply pipeline (schema change handled by the standard delta path) | | 200 OK | `not_found` | `/inactivate` or `/delete` against an absent rule; idempotent no-op | | 200 OK | `filter_only_persisted` | row persisted but the in-memory swap threw on this node; converges on the next periodic scan | @@ -288,9 +298,13 @@ the response formats listed above; their error responses use the same JSON shape | Status | `applyStatus` | Meaning | |-------------------|-----------------------------------------------|------------------------------------------------------------------------------------------------------------------------| | 400 Bad Request | `compile_failed`, `empty_body`, `invalid_*` | rule parse failure or request validation failure; row was NOT persisted | +| 400 Bad Request | `invalid_catalog`, `invalid_mode` | unknown `catalog=` or `mode=` query value | +| 400 Bad Request | `no_bundled_twin` | `/delete?mode=revertToBundled` against a rule with no bundled YAML on disk; drop the mode flag, or check that the bundled YAML exists | | 409 Conflict | 
`storage_change_requires_explicit_approval` | update would move storage identity and `allowStorageChange` was not set — no cluster pause, no persist, no side effects | | 409 Conflict | `update_in_progress` | another apply is already in flight for this rule; retry after a few seconds | | 409 Conflict | `requires_inactivate_first` | `/delete` against an `ACTIVE` row; run `/inactivate` first, then `/delete` | +| 409 Conflict | `requires_revert_to_bundled` | `/delete` (default mode) against a rule with a bundled YAML twin on disk; either re-issue with `?mode=revertToBundled` to fall back to bundled, or leave the row `INACTIVE` | +| 409 Conflict | `delete_refused` | cross-file ownership conflict: bundled's claims overlap another active bundle. Update or `/inactivate` the conflicting bundle(s) first | | 503 Service Unavailable | `storage_unavailable` | storage could not be read while checking the current rule; retry when storage is healthy | **Cluster-routing errors — usually transient** @@ -311,7 +325,9 @@ the response formats listed above; their error responses use the same JSON shape | 500 Internal Server Error | `persist_failed` | row write failed; on filter-only this node still serves the pre-edit rule, on structural the local node rolled back and resumed peers | | 500 Internal Server Error | `commit_deferred` | apply succeeded and row was persisted, but the local finishing step failed on this node. 
Storage is authoritative and peers will converge; this node will retry on its next periodic scan | | 500 Internal Server Error | `teardown_deferred` | row was inactivated, but local cleanup failed; this node retries on the next periodic scan | -| 500 Internal Server Error | `dao_unavailable`, `inactivate_failed`, `delete_backend_drop_failed`, `delete_failed`, other `*_failed` | management storage or backend cleanup failed; no destructive row removal is completed unless the backend cleanup succeeded | +| 500 Internal Server Error | `revert_to_bundled_failed` | bundled apply failed during DDL or verify (typically a backend-storage issue — BanyanDB unreachable, shape rejection, or schema-barrier timeout). The orchestrator unwound the step-1 runtime install so local state matches the persisted INACTIVE row. Retry once storage recovers. | +| 500 Internal Server Error | `revert_to_bundled_precondition_failed` | revertToBundled prep step failed (no engine for catalog, MeterSystem unavailable for installRuntime). Local state is unchanged. Retry when the prerequisite recovers. | +| 500 Internal Server Error | `dao_unavailable`, `inactivate_failed`, `delete_failed`, other `*_failed` | management storage or local cleanup failed; check the message for the specific failure point. | ## Per-node list output @@ -355,10 +371,10 @@ the OAP is actually serving on this node. Reading these three fields together: | Bundled rule shipped on disk; operator never touched it | `BUNDLED` | `NONE` | `true` | Bundled YAML, served from the OAP's shared default classloader (registered at boot by the catalog loaders). | | Operator pushed `/addOrUpdate` overriding a bundled rule | `ACTIVE` | `RUNTIME` | `true` | Runtime override in a per-file `runtime-rule:` loader. Compare `contentHash` with `bundledContentHash` to detect drift. | | Operator pushed `/addOrUpdate` for a brand-new rule (no bundled twin) | `ACTIVE` | `RUNTIME` | `false` | Runtime override in a per-file `runtime-rule:` loader. 
No bundled fallback. | -| Operator `/inactivate`d a runtime override of a bundled rule | `INACTIVE` | `NONE` | `true` | Nothing — handlers are unregistered. The bundled rule does **not** auto-resurrect; to turn it back on, push `/addOrUpdate` (with the bundled YAML or your own) or call `/delete` (which reverts to bundled). | +| Operator `/inactivate`d a runtime override of a bundled rule | `INACTIVE` | `NONE` | `true` | Nothing — handlers are unregistered. The bundled rule does **not** auto-resurrect; to turn it back on, push `/addOrUpdate` (with the bundled YAML or your own) or call `/delete?mode=revertToBundled` (which reverts to bundled via the schema-change path). Plain `/delete` is refused with `409 requires_revert_to_bundled` to force the explicit decision. | | Operator `/inactivate`d a bundled-only rule | `INACTIVE` | `NONE` | `true` | Nothing — same as above. The `INACTIVE` row is a tombstone carrying the bundled YAML at inactivate-time. | | Operator `/inactivate`d a brand-new runtime rule | `INACTIVE` | `NONE` | `false` | Nothing — handlers gone. To turn back on: `/addOrUpdate` (with new content) or `/delete` (rule is fully gone). | -| `/delete` propagating after a bundled-twin row was removed | `n/a` (no row) | `STATIC` | `true` | Bundled rule, freshly compiled into a `static:` loader. Equivalent to a fresh boot of bundled. | +| `/delete?mode=revertToBundled` propagating after a bundled-twin row was removed | `n/a` (no row) | `BUNDLED` | `true` | Bundled rule, freshly compiled into a `bundled:` loader. Equivalent to a fresh boot of bundled. | Quick decision rules for an operator reading `/list`: @@ -367,7 +383,7 @@ Quick decision rules for an operator reading `/list`: - `status=ACTIVE` + `bundled=true` + `contentHash == bundledContentHash` → runtime override matches bundled. UIs typically render this as "Override (matches bundled)" — common after an explicit `/addOrUpdate ?source=bundled` revert. 
- `status=ACTIVE` + `bundled=false` → runtime-only rule, no on-disk twin. - `status=INACTIVE` → soft-paused. The DAO row preserves the content the operator last had; `/list` does not surface it (call `GET /runtime/rule` for the YAML). -- `loaderKind=STATIC` → a `static:` loader is currently serving (transient, between `/delete` and the next clean state). +- `loaderKind=BUNDLED` → a `bundled:` loader is currently serving (typical after `/delete?mode=revertToBundled`, where the bundled YAML was compiled into a fresh per-file loader). - `loaderKind=NONE` → no per-file loader. For `BUNDLED` this is normal (shared default loader). For `INACTIVE` this is the rule being off. - `status` — `ACTIVE` or `INACTIVE` for stored rows. `BUNDLED` and `n/a` are synthesized @@ -393,8 +409,9 @@ Quick decision rules for an operator reading `/list`: rule cleanup issue worth investigating. - `loaderKind` — origin of the per-file class loader currently serving this rule: - `RUNTIME` — operator-pushed runtime override. - - `STATIC` — bundled rule serving via static fall-over (a runtime override was previously - in place, then removed; the bundled YAML was reloaded into a fresh `static:` loader). + - `BUNDLED` — bundled rule serving via bundled fall-over (a runtime override was previously + in place, then `/delete?mode=revertToBundled` reinstalled the bundled YAML in a fresh + `bundled:` loader). - `NONE` — no per-file loader (typical for bundled-only rules served from the shared default loader; also a row whose loader has been retired but not yet replaced). - `loaderName` — formatted loader name (`:/@`), the same @@ -431,7 +448,7 @@ This makes the "compare runtime override against bundled" workflow a two-call se fetch the runtime body with the default request, then fetch the bundled body with `?source=bundled` and diff in the editor. `POST /runtime/rule/delete` drops the runtime override; the next `/list` will show the row served by the bundled fall-over -(`loaderKind=STATIC`). 
+(`loaderKind=BUNDLED`). ## Consistency model — at a glance diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/Analyzer.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/Analyzer.java index 3a607214a4ec..3186609c32d0 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/Analyzer.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/Analyzer.java @@ -177,14 +177,14 @@ public static Analyzer prepare(final String metricName, final javassist.ClassPool pool, final ClassLoader targetClassLoader) { // Static boot / default path: create-if-absent. Runtime-rule on-demand apply passes - // fullInstall() via the explicit-opt overload. + // withSchemaChange() via the explicit-opt overload. return prepare(metricName, filter, expression, meterSystem, yamlSource, pool, targetClassLoader, - StorageManipulationOpt.createIfAbsent()); + StorageManipulationOpt.schemaCreateIfAbsent()); } /** * Prepare overload that carries a {@link StorageManipulationOpt}. Runtime-rule peer-side - * apply passes {@link StorageManipulationOpt#localCacheOnly()} so subsequent + * apply passes {@link StorageManipulationOpt#withoutSchemaChange()} so subsequent * {@link #register()} call skips server-side DDL. */ public static Analyzer prepare(final String metricName, @@ -200,7 +200,7 @@ public static Analyzer prepare(final String metricName, Analyzer analyzer = new Analyzer(metricName, filter, e, meterSystem, ctx); analyzer.pool = pool; analyzer.targetClassLoader = targetClassLoader; - analyzer.storageOpt = storageOpt == null ? StorageManipulationOpt.createIfAbsent() : storageOpt; + analyzer.storageOpt = storageOpt == null ? 
StorageManipulationOpt.schemaCreateIfAbsent() : storageOpt; analyzer.resolveTypeFromMetadata(); return analyzer; } @@ -238,13 +238,13 @@ public void register() { private ClassLoader targetClassLoader; /** * Storage-install policy threaded through to {@link MeterSystem#create}. Startup uses - * {@link StorageManipulationOpt#createIfAbsent()} (the default when callers don't set + * {@link StorageManipulationOpt#schemaCreateIfAbsent()} (the default when callers don't set * it — never reshape the backend at boot). Main-node on-demand apply sets - * {@link StorageManipulationOpt#fullInstall()}. Peer-node apply sets - * {@link StorageManipulationOpt#localCacheOnly()} so local Metrics classes + BanyanDB + * {@link StorageManipulationOpt#withSchemaChange()}. Peer-node apply sets + * {@link StorageManipulationOpt#withoutSchemaChange()} so local Metrics classes + BanyanDB * MetadataRegistry populate without server-side DDL. */ - private StorageManipulationOpt storageOpt = StorageManipulationOpt.createIfAbsent(); + private StorageManipulationOpt storageOpt = StorageManipulationOpt.schemaCreateIfAbsent(); /** * Analyse the full sample family map and produce meter-system metrics. @@ -423,7 +423,7 @@ private void createMetric(final ScopeType scopeType, // bundle drops together on hot-remove. if (pool != null && targetClassLoader != null) { // Per-file: generated Metrics class goes directly into the supplied RuleClassLoader. - // storageOpt controls server-side DDL: fullInstall() on main, localCacheOnly() + // storageOpt controls server-side DDL: withSchemaChange() on main, withoutSchemaChange() // on peer — see the Analyzer class-level Javadoc for the main/peer contract. 
meterSystem.create(metricName, functionName, scopeType, pool, targetClassLoader, storageOpt); } else { diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MetricConvert.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MetricConvert.java index cd721e809dac..cd86c1424054 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MetricConvert.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/MetricConvert.java @@ -80,15 +80,15 @@ public static Stream log(Try t, String debugMessage) { public MetricConvert(MetricRuleConfig rule, MeterSystem service) { // Static boot default: create-if-absent semantics. Runtime-rule on-demand callers use - // the explicit-opt overload and pass fullInstall() to get reshape permission. - this(rule, service, null, null, StorageManipulationOpt.createIfAbsent()); + // the explicit-opt overload and pass withSchemaChange() to get reshape permission. 
+ this(rule, service, null, null, StorageManipulationOpt.schemaCreateIfAbsent()); } public MetricConvert(final MetricRuleConfig rule, final MeterSystem service, final javassist.ClassPool pool, final ClassLoader targetClassLoader) { this(rule, service, pool, targetClassLoader, - StorageManipulationOpt.createIfAbsent()); + StorageManipulationOpt.schemaCreateIfAbsent()); } /** @@ -98,8 +98,8 @@ public MetricConvert(final MetricRuleConfig rule, final MeterSystem service, * @param service MeterSystem target for registration * @param pool per-file Javassist pool, or null to use the shared default * @param targetClassLoader per-file ClassLoader, or null to use the shared default - * @param storageOpt policy for backend-side install; main-node passes fullInstall, - * peer-node passes localCacheOnly to skip server DDL + * @param storageOpt policy for backend-side install; main-node passes withSchemaChange, + * peer-node passes withoutSchemaChange to skip server DDL */ public MetricConvert(final MetricRuleConfig rule, final MeterSystem service, final javassist.ClassPool pool, diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystem.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystem.java index 90c225adf530..4d84de678fb1 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystem.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/meter/MeterSystem.java @@ -152,7 +152,7 @@ public synchronized void create(String metricsName, // Static boot path: create-if-absent semantics so a backend that already holds this // metric under a different shape is preserved and reported, not silently reshaped. 
createInternal(metricsName, functionName, type, dataType, classPool, MeterClassPackageHolder.class, - StorageManipulationOpt.createIfAbsent()); + StorageManipulationOpt.schemaCreateIfAbsent()); } /** @@ -167,8 +167,8 @@ public synchronized void create(String metricsName, * Runtime-rule entry point: create a streaming calculation under a caller-supplied * per-file {@code ClassPool} + {@code ClassLoader}, with a caller-specified * {@link StorageManipulationOpt} policy. Main-node apply passes - * {@link StorageManipulationOpt#fullInstall()} (the usual install path); peer-node apply - * passes {@link StorageManipulationOpt#localCacheOnly()} so local state is populated + * {@link StorageManipulationOpt#withSchemaChange()} (the usual install path); peer-node apply + * passes {@link StorageManipulationOpt#withoutSchemaChange()} so local state is populated * (MeterSystem meterPrototypes, BanyanDB MetadataRegistry, StorageModels entry) without * firing server-side {@code createMeasure} / {@code update}. */ @@ -383,7 +383,7 @@ public synchronized void create(String metricsName, throw new IllegalArgumentException("classLoaderNeighbor must not be null"); } createInternal(metricsName, functionName, type, dataType, pool, classLoaderNeighbor, - StorageManipulationOpt.fullInstall()); + StorageManipulationOpt.withSchemaChange()); } /** @@ -418,12 +418,12 @@ public synchronized void create(String metricsName, * @return {@code true} if a metric was found and removed, {@code false} otherwise */ public synchronized boolean removeMetric(final String metricsName) { - return removeMetric(metricsName, StorageManipulationOpt.fullInstall()); + return removeMetric(metricsName, StorageManipulationOpt.withSchemaChange()); } /** * Opt-aware {@code removeMetric}. 
Runtime-rule peer-side callers pass - * {@link StorageManipulationOpt#localCacheOnly()} so {@code ModelInstaller.dropTable} is + * {@link StorageManipulationOpt#withoutSchemaChange()} so {@code ModelInstaller.dropTable} is * NOT invoked on the shared storage — the cluster main owns that side-effect. * *
<p>Order is backend-first / local-state-second so failure is retriable. The earlier @@ -435,11 +435,11 @@ public synchronized boolean removeMetric(final String metricsName) { * caches; on failure we leave {@code meterPrototypes} populated and the CtClass attached * so a retry hits the backend again. * - *
<p>Failure surface: under {@code fullInstall} the storage-model cascade failure is + *
<p>Failure surface: under {@code withSchemaChange} the storage-model cascade failure is * propagated as a {@link RuntimeException}. The REST {@code /inactivate} path depends on * this to surface 500 {@code teardown_deferred} when BanyanDB's delete-measure threw — * without it the handler would return 200 inactivated despite the measure still being - * live. Under {@code localCacheOnly} the cascade fires {@code whenRemoving} but the + * live. Under {@code withoutSchemaChange} the cascade fires {@code whenRemoving} but the * peer's {@code ModelInstaller.dropTable} is suppressed by policy, so any throw is * logged and swallowed — the peer has no backend debt. Streaming-chain drain failures * are always logged and swallowed: stale workers self-drain within one tick. @@ -454,7 +454,7 @@ public synchronized boolean removeMetric(final String metricsName, final Storage // Cascade storage-model removal (Hour / Day / Minute) FIRST. ModelRegistry.remove // fires whenRemoving on every listener, so each backend's ModelInstaller.dropTable // runs — real delete for BanyanDB, no-op for JDBC / Elasticsearch, skipped outright - // when the caller is a peer-side (LOCAL_CACHE_ONLY) apply. If a listener throws, + // when the caller is a peer-side (WITHOUT_SCHEMA_CHANGE) apply. If a listener throws, // ModelRegistry.remove keeps the model in its registry so this retry path stays // open: the caller (Reconciler unregisterBundle) preserves appliedMal[key] and the // next tick (or operator retry) re-enters this method, finds meterPrototypes still @@ -472,7 +472,7 @@ public synchronized boolean removeMetric(final String metricsName, final Storage + "; backend drop did not complete. Local state preserved for retry.", t); } - // Non-escalating opt (peer-side localCacheOnly, etc.) - // Non-escalating opt (peer-side localCacheOnly, etc.)
— backend drop is // suppressed by policy anyway, so a listener throw here is the listener's // local bookkeeping misbehaving, not real backend debt. Fall through to // clear local state. diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementStreamProcessor.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementStreamProcessor.java index 992466530e7a..c11660fe636a 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementStreamProcessor.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/ManagementStreamProcessor.java @@ -81,7 +81,7 @@ public void create(final ModuleDefineHolder moduleDefineHolder, final Stream str // Management stream doesn't read data from database during the persistent process. Keep the timeRelativeID == false always. Model model = modelSetter.add(streamClass, stream.scopeId(), new Storage(stream.name(), false, DownSampling.None), - StorageManipulationOpt.createIfAbsent()); + StorageManipulationOpt.schemaCreateIfAbsent()); final ManagementPersistentWorker persistentWorker = new ManagementPersistentWorker(moduleDefineHolder, model, managementDAO); workers.put(streamClass, persistentWorker); diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessor.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessor.java index d147de336c76..e9d22434e15e 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessor.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/MetricsStreamProcessor.java @@ -179,11 +179,11 @@ public void create(ModuleDefineHolder moduleDefineHolder, /** * Opt-aware variant 
invoked from the runtime-rule MAL path. Peer nodes pass - * {@link StorageManipulationOpt#localCacheOnly()} so every downstream {@code ModelRegistry.add} + * {@link StorageManipulationOpt#withoutSchemaChange()} so every downstream {@code ModelRegistry.add} * records per-resource outcomes and suppresses server-side install. Main-node on-demand - * callers (REST {@code /addOrUpdate}) pass {@link StorageManipulationOpt#fullInstall()}. + * callers (REST {@code /addOrUpdate}) pass {@link StorageManipulationOpt#withSchemaChange()}. * Startup-path callers (stream registration for static rules) pass - * {@link StorageManipulationOpt#createIfAbsent()} so boot never reshapes the backend. + * {@link StorageManipulationOpt#schemaCreateIfAbsent()} so boot never reshapes the backend. */ public void create(ModuleDefineHolder moduleDefineHolder, StreamDefinition stream, @@ -196,7 +196,7 @@ private void create(ModuleDefineHolder moduleDefineHolder, StreamDefinition stream, Class metricsClass, MetricStreamKind kind) throws StorageException { - this.create(moduleDefineHolder, stream, metricsClass, kind, StorageManipulationOpt.createIfAbsent()); + this.create(moduleDefineHolder, stream, metricsClass, kind, StorageManipulationOpt.schemaCreateIfAbsent()); } private void create(ModuleDefineHolder moduleDefineHolder, diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/NoneStreamProcessor.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/NoneStreamProcessor.java index f0f3fa2efc1f..014363735ae2 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/NoneStreamProcessor.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/analysis/worker/NoneStreamProcessor.java @@ -80,7 +80,7 @@ public void create(ModuleDefineHolder moduleDefineHolder, Stream stream, Class sweep() { if (done != null) { 
collectedTotal.incrementAndGet(); log.info("rule loader collected: {}:{}/{} hash={} ttg={}ms", - done.kind() == DSLClassLoaderManager.Kind.STATIC ? "static" : "runtime-rule", + done.kind() == DSLClassLoaderManager.Kind.BUNDLED ? "bundled" : "runtime-rule", done.catalog().getWireName(), done.rule(), done.contentHashShort(), System.currentTimeMillis() - done.retiredAtMs()); drained.add(done); diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManager.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManager.java index 78766183ee49..629fab2f28ee 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManager.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManager.java @@ -48,7 +48,7 @@ * Per-file static loaders only appear after a runtime override on a bundled rule is removed * (via {@code /inactivate} or {@code /delete}): the runtime loader retires here, then the * engine reloads the bundled YAML from {@code StaticRuleRegistry} and calls - * {@link #newBuilder} with {@link Kind#STATIC} to mint a fresh loader hosting the bundled + * {@link #newBuilder} with {@link Kind#BUNDLED} to mint a fresh loader hosting the bundled * compile output. So at any moment there is at most one per-file loader for a given key, and * only when the key has actually fallen over. * @@ -61,11 +61,11 @@ public final class DSLClassLoaderManager { /** Origin of a loader. {@code RUNTIME} loaders host operator-pushed runtime-rule overrides; - * {@code STATIC} loaders host bundled rules brought back into service after a runtime + * {@code BUNDLED} loaders host bundled rules brought back into service after a runtime * override on the same key was removed. 
The active loader for a given key is always at * most one; manager keys are {@code (catalog, rule)}, not {@code (catalog, rule, kind)}. */ public enum Kind { - STATIC, RUNTIME + BUNDLED, RUNTIME } /** Process-wide singleton. */ @@ -239,7 +239,7 @@ private void sweepInternal() { log.warn("rule loader leak suspected: {}:{}/{} hash={} pending {} ms " + "(threshold {}). Check for lingering handler registrations or " + "samples buffered in DataCarrier partitions.", - r.kind() == Kind.STATIC ? "static" : "runtime-rule", + r.kind() == Kind.BUNDLED ? "bundled" : "runtime-rule", r.catalog().getWireName(), r.rule(), r.contentHashShort(), ageMs, STALE_LOADER_WARN_THRESHOLD_MS); } diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoader.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoader.java index 623946c0e751..95f1076e689c 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoader.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoader.java @@ -51,9 +51,9 @@ public final class RuleClassLoader extends URLClassLoader implements BytecodeClassDefiner { private static final DateTimeFormatter NAME_TS = DateTimeFormatter.ofPattern("MMdd-HHmmss"); - /** Origin tag — {@link DSLClassLoaderManager.Kind#STATIC} for fall-over reload of bundled + /** Origin tag — {@link DSLClassLoaderManager.Kind#BUNDLED} for fall-over reload of bundled * rules; {@link DSLClassLoaderManager.Kind#RUNTIME} for operator-pushed overrides. Visible - * in {@link #getName()} via the {@code static:} / {@code runtime-rule:} prefix. */ + * in {@link #getName()} via the {@code bundled:} / {@code runtime-rule:} prefix. 
*/ @Getter private final DSLClassLoaderManager.Kind kind; @Getter @@ -94,7 +94,7 @@ public Class defineClass(final String className, final byte[] bytecode) { private static String buildLoaderName(final DSLClassLoaderManager.Kind kind, final Catalog catalog, final String rule) { - final String prefix = kind == DSLClassLoaderManager.Kind.STATIC ? "static" : "runtime-rule"; + final String prefix = kind == DSLClassLoaderManager.Kind.BUNDLED ? "bundled" : "runtime-rule"; return prefix + ":" + catalog.getWireName() + "/" + rule + "@" + LocalDateTime.now().format(NAME_TS); } diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java index 721bd8bf0d9a..735e7b379dc2 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java @@ -166,7 +166,7 @@ protected final void overrideColumnName(String columnName, String newName) { * Check whether the storage entity exists, reporting per-resource outcomes on * {@code opt}. Backends with in-isExists side effects (BanyanDB's auto-update of * {@code Measure}/{@code IndexRule}/{@code IndexRuleBinding}) honour - * {@link StorageManipulationOpt#isLocalCacheOnly()} to suppress server writes when the + * {@link StorageManipulationOpt#isWithoutSchemaChange()} to suppress server writes when the * caller is a peer node. 
*/ public abstract InstallInfo isExists(Model model, StorageManipulationOpt opt) throws StorageException; diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelRegistry.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelRegistry.java index b05dd785a349..c6aadcf4d266 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelRegistry.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelRegistry.java @@ -53,7 +53,7 @@ Model add(Class aClass, int scopeId, Storage storage, StorageManipulationOpt * each. Used by runtime rule hot-update (MAL/LAL hot-remove); not intended to be called during * the startup path. * - *

Peer-node callers pass {@link StorageManipulationOpt#localCacheOnly()} so installers + *
Peer-node callers pass {@link StorageManipulationOpt#withoutSchemaChange()} so installers * skip the server-side drop and record {@link StorageManipulationOpt.Outcome#SKIPPED_NOT_ALLOWED} * against the affected resources. * @@ -67,7 +67,7 @@ interface CreatingListener { /** * Invoked when a model is registered via {@link ModelRegistry#add}. Listeners receive * the {@link StorageManipulationOpt} the caller threaded through the registry — skip - * server-side DDL when {@link StorageManipulationOpt#isLocalCacheOnly()}, and record + * server-side DDL when {@link StorageManipulationOpt#isWithoutSchemaChange()}, and record * per-resource outcomes on the opt for the caller to inspect. */ void whenCreating(Model model, StorageManipulationOpt opt) throws StorageException; @@ -77,7 +77,7 @@ interface CreatingListener { * so listeners that don't own server-side resources (e.g., pure schema caches) compile * without boilerplate. Storage installers that own physical schema (BanyanDB measures) * override this and skip the server-side drop when - * {@link StorageManipulationOpt#isLocalCacheOnly()}. + * {@link StorageManipulationOpt#isWithoutSchemaChange()}. */ default void whenRemoving(Model model, StorageManipulationOpt opt) throws StorageException { } diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java index 935788772112..3fe2e66eab50 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java @@ -37,7 +37,7 @@ * constructor is private. If a future scenario genuinely needs a fifth mode, add it to * {@link Mode} here so every caller keeps picking from a known set. * - *

{@link #fullInstall()} — {@link Mode#FULL_INSTALL} (predicate: {@link #isFullInstall()})
+ * {@link #withSchemaChange()} — {@link Mode#WITH_SCHEMA_CHANGE} (predicate: {@link #isWithSchemaChange()})
 *
 * Callers:
    *
  • Main-node REST apply ({@code /addOrUpdate}, {@code /delete}) — operator-driven, @@ -48,7 +48,7 @@ * (rare — REST usually wins the race)
  • *
*

Note: {@code /inactivate} is a soft-pause that goes through - * {@link Mode#LOCAL_CACHE_ONLY} — backend schema and data are preserved; only + * {@link Mode#WITHOUT_SCHEMA_CHANGE} — backend schema and data are preserved; only * OAP-internal state (compiled bundles, dispatch, prototypes) is torn down so * cheap re-activation works on the next {@code /addOrUpdate}. *

Backend behaviour: full DDL — create missing tables / measures, drop retired ones, @@ -56,7 +56,7 @@ * shape mismatch, and create / update index rules + bindings. Reshaping is treated as * intended because the caller came in through an on-demand operator request. * - *

{@link #createIfAbsent()} — {@link Mode#CREATE_IF_ABSENT} (predicate: {@link #isCreateIfAbsent()})
+ * {@link #schemaCreateIfAbsent()} — {@link Mode#SCHEMA_CREATE_IF_ABSENT} (predicate: {@link #isSchemaCreateIfAbsent()})
 *
 * Callers:
    *
  • Startup-time model registration (every OAP, via stream processors — static MAL / @@ -70,7 +70,7 @@ * skip surfaces the mismatch to the operator, who must reshape via the on-demand * runtime-rule REST endpoint (the only workflow that may change backend schema). * - *

{@link #localCacheVerify()} — {@link Mode#LOCAL_CACHE_VERIFY} (predicate: {@link #isLocalCacheVerify()})
+ * {@link #verifySchemaOnly()} — {@link Mode#VERIFY_SCHEMA_ONLY} (predicate: {@link #isVerifySchemaOnly()})
 *
 * Callers:
      *
    • Boot-time reconciler pass on a non-init OAP — the operator declared @@ -79,7 +79,7 @@ * declares.
    • *
    *

    Backend behaviour: read-only inspection. The installer issues the same metadata - * read RPCs as {@link Mode#CREATE_IF_ABSENT} but never invokes create / update / drop. On + * read RPCs as {@link Mode#SCHEMA_CREATE_IF_ABSENT} but never invokes create / update / drop. On * resource missing OR shape mismatch the installer throws — the exception propagates up * through the module bootstrap and causes the OAP process to exit, which under k8s * results in a pod backloop until either the init OAP has caught up or the operator has @@ -88,7 +88,7 @@ * what's declared. Local {@code MetadataRegistry} is populated only when the live shape * matches the declared shape. * - *

{@link #localCacheOnly()} — {@link Mode#LOCAL_CACHE_ONLY} (predicate: {@link #isLocalCacheOnly()})
+ * {@link #withoutSchemaChange()} — {@link Mode#WITHOUT_SCHEMA_CHANGE} (predicate: {@link #isWithoutSchemaChange()})
 *
 * Callers:
      *
    • Peer-node reconciler tick (peer is not the hash-selected main for this file — @@ -103,7 +103,7 @@ * {@link Outcome#SKIPPED_NOT_ALLOWED SKIPPED_NOT_ALLOWED} outcomes instead of firing * {@code createTable} / {@code dropTable}. Peer's local MeterSystem still compiles * Metrics classes and populates {@code meterPrototypes} — that's pure in-JVM work the - * opt doesn't (and shouldn't) gate. Differs from {@link Mode#LOCAL_CACHE_VERIFY} in two + * opt doesn't (and shouldn't) gate. Differs from {@link Mode#VERIFY_SCHEMA_ONLY} in two * ways: no server RPCs (cache populates from local model), and missing / mismatched * resources are not a fatal error (the next tick will retry, or the * main will catch up). @@ -129,7 +129,7 @@ public enum Mode { * for ES). Reshape is treated as intended because the caller explicitly asked * for it via the operator REST endpoint. */ - FULL_INSTALL(Flags.builder() + WITH_SCHEMA_CHANGE(Flags.builder() .inspectBackend(true) .createMissing(true) .updateOnMismatch(true) @@ -143,20 +143,20 @@ public enum Mode { * call update / reshape. Operator must reconcile via the runtime-rule REST * endpoint — boot is not allowed to silently mutate backend shape. */ - CREATE_IF_ABSENT(Flags.builder() + SCHEMA_CREATE_IF_ABSENT(Flags.builder() .inspectBackend(true) .createMissing(true) .build()), /** * Boot path on a non-init OAP. Installer issues the same read-only inspection - * RPCs as {@link #CREATE_IF_ABSENT} but never creates / updates / drops. On + * RPCs as {@link #SCHEMA_CREATE_IF_ABSENT} but never creates / updates / drops. On * resource missing or shape mismatch the installer throws; the * exception propagates up through module bootstrap and exits the process. * Under k8s this causes a pod backloop until the init OAP has caught up or the * operator has aligned rule files with the backend. Local {@code MetadataRegistry} * is populated only when the live shape matches the declared shape. 
*/ - LOCAL_CACHE_VERIFY(Flags.builder() + VERIFY_SCHEMA_ONLY(Flags.builder() .inspectBackend(true) .failOnAbsence(true) .failOnShapeMismatch(true) @@ -165,10 +165,10 @@ public enum Mode { * Peer-node reconciler tick path. Zero server RPCs — local caches populate from * the declared model and the main is trusted to own backend DDL. Missing or * mismatched resources are not an error: the next tick will retry, and the main - * will eventually converge. Distinct from {@link #LOCAL_CACHE_VERIFY} in that + * will eventually converge. Distinct from {@link #VERIFY_SCHEMA_ONLY} in that * verification is skipped entirely, not run-and-fail. */ - LOCAL_CACHE_ONLY(Flags.builder().build()); + WITHOUT_SCHEMA_CHANGE(Flags.builder().build()); @Getter private final Flags flags; @@ -195,7 +195,7 @@ public enum Mode { public static final class Flags { /** * Issue read RPCs to the backend (existence + shape compare). False on - * {@link Mode#LOCAL_CACHE_ONLY} where the contract is "zero server RPCs". When + * {@link Mode#WITHOUT_SCHEMA_CHANGE} where the contract is "zero server RPCs". When * false the installer must populate local caches from the declared model and * return early without inspecting the backend. */ @@ -209,40 +209,40 @@ public static final class Flags { /** * Call backend update primitives ({@code client.update}, JDBC {@code ALTER * TABLE}, ES mapping append) when a present resource's live shape diverges from - * the declared shape. Only {@link Mode#FULL_INSTALL} (the operator-driven path) + * the declared shape. Only {@link Mode#WITH_SCHEMA_CHANGE} (the operator-driven path) * permits this — boot must never silently reshape backend storage. * *

      Note: BanyanDB's index-rule / index-rule-binding update path is gated by * {@link #failOnShapeMismatch} instead of this flag, preserving the long-standing * behaviour that init-mode OAPs reconcile index rules even under - * {@link Mode#CREATE_IF_ABSENT}.

+ * {@link Mode#SCHEMA_CREATE_IF_ABSENT}.
      */ private final boolean updateOnMismatch; /** * Call backend drop primitives ({@code client.dropMeasure} / {@code dropStream} * / etc.) from {@link ModelRegistry.CreatingListener#whenRemoving}. Only - * {@link Mode#FULL_INSTALL} (operator-driven runtime-rule deletion) permits - * this; peers under {@link Mode#LOCAL_CACHE_ONLY} short-circuit with + * {@link Mode#WITH_SCHEMA_CHANGE} (operator-driven runtime-rule deletion) permits + * this; peers under {@link Mode#WITHOUT_SCHEMA_CHANGE} short-circuit with * {@link Outcome#SKIPPED_NOT_ALLOWED}. */ private final boolean dropOnRemoval; /** * Throw a {@link org.apache.skywalking.oap.server.core.storage.StorageException} * when a resource is absent on the backend after inspection. Used by - * {@link Mode#LOCAL_CACHE_VERIFY} to fail boot rather than silently start + * {@link Mode#VERIFY_SCHEMA_ONLY} to fail boot rather than silently start * against an unprepared backend. */ private final boolean failOnAbsence; /** * Throw a {@link org.apache.skywalking.oap.server.core.storage.StorageException} * when a present resource's live shape diverges from the declared shape. Used - * by {@link Mode#LOCAL_CACHE_VERIFY} so boot does not silently start against a + * by {@link Mode#VERIFY_SCHEMA_ONLY} so boot does not silently start against a * backend whose schema disagrees with the rule file. */ private final boolean failOnShapeMismatch; /** * Re-throw cascaded backend errors to the caller (REST handler, operator - * tooling) instead of swallowing them. Set on {@link Mode#FULL_INSTALL}; other + * tooling) instead of swallowing them. Set on {@link Mode#WITH_SCHEMA_CHANGE}; other * modes log and continue so a peer-side bookkeeping glitch doesn't take down * the node. 
*/ @@ -267,55 +267,55 @@ public Flags getFlags() { return mode.getFlags(); } - public static StorageManipulationOpt fullInstall() { - return new StorageManipulationOpt(Mode.FULL_INSTALL); + public static StorageManipulationOpt withSchemaChange() { + return new StorageManipulationOpt(Mode.WITH_SCHEMA_CHANGE); } - public static StorageManipulationOpt createIfAbsent() { - return new StorageManipulationOpt(Mode.CREATE_IF_ABSENT); + public static StorageManipulationOpt schemaCreateIfAbsent() { + return new StorageManipulationOpt(Mode.SCHEMA_CREATE_IF_ABSENT); } - public static StorageManipulationOpt localCacheVerify() { - return new StorageManipulationOpt(Mode.LOCAL_CACHE_VERIFY); + public static StorageManipulationOpt verifySchemaOnly() { + return new StorageManipulationOpt(Mode.VERIFY_SCHEMA_ONLY); } - public static StorageManipulationOpt localCacheOnly() { - return new StorageManipulationOpt(Mode.LOCAL_CACHE_ONLY); + public static StorageManipulationOpt withoutSchemaChange() { + return new StorageManipulationOpt(Mode.WITHOUT_SCHEMA_CHANGE); } /** - * True for {@link Mode#FULL_INSTALL}. The on-demand operator workflow — drops, + * True for {@link Mode#WITH_SCHEMA_CHANGE}. The on-demand operator workflow — drops, * updates, and reshapes are permitted because the caller explicitly asked for them. */ - public boolean isFullInstall() { - return mode == Mode.FULL_INSTALL; + public boolean isWithSchemaChange() { + return mode == Mode.WITH_SCHEMA_CHANGE; } /** - * True for {@link Mode#CREATE_IF_ABSENT}. The static boot workflow — create absent + * True for {@link Mode#SCHEMA_CREATE_IF_ABSENT}. The static boot workflow — create absent * resources, skip + record {@link Outcome#SKIPPED_SHAPE_MISMATCH} on a resource that * already exists with a different shape. Never update or drop. 
*/ - public boolean isCreateIfAbsent() { - return mode == Mode.CREATE_IF_ABSENT; + public boolean isSchemaCreateIfAbsent() { + return mode == Mode.SCHEMA_CREATE_IF_ABSENT; } /** - * True for {@link Mode#LOCAL_CACHE_VERIFY}. Boot-time strict verification on a + * True for {@link Mode#VERIFY_SCHEMA_ONLY}. Boot-time strict verification on a * non-init OAP — installer issues read-only inspection RPCs and throws on missing or * shape-mismatched resources. No DDL. */ - public boolean isLocalCacheVerify() { - return mode == Mode.LOCAL_CACHE_VERIFY; + public boolean isVerifySchemaOnly() { + return mode == Mode.VERIFY_SCHEMA_ONLY; } /** - * True for {@link Mode#LOCAL_CACHE_ONLY}. The {@code BanyanDBIndexInstaller.isExists} + * True for {@link Mode#WITHOUT_SCHEMA_CHANGE}. The {@code BanyanDBIndexInstaller.isExists} * short-circuit reads this to skip every server RPC and populate * {@code MetadataRegistry} only. */ - public boolean isLocalCacheOnly() { - return mode == Mode.LOCAL_CACHE_ONLY; + public boolean isWithoutSchemaChange() { + return mode == Mode.WITHOUT_SCHEMA_CHANGE; } private StorageManipulationOpt(final Mode mode) { @@ -426,8 +426,8 @@ public enum Outcome { /** Resource present and matches the intended shape. No action taken. */ EXISTING_MATCHED, /** Resource present but live shape differs from intended; update was NOT applied - * because the caller is in {@link Mode#LOCAL_CACHE_ONLY}. Caller may re-push with - * {@link #fullInstall()} to reconcile. {@link ResourceOutcome#getDiff()} carries + * because the caller is in {@link Mode#WITHOUT_SCHEMA_CHANGE}. Caller may re-push with + * {@link #withSchemaChange()} to reconcile. {@link ResourceOutcome#getDiff()} carries * a short description of the difference. */ EXISTING_MISMATCH, /** Installer ran {@code createTable} (or equivalent) and the resource now exists. 
*/ diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageModels.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageModels.java index 49dcc226b52f..558b47378f34 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageModels.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageModels.java @@ -338,11 +338,11 @@ public void addModelListener(final CreatingListener listener) throws StorageExce } // A late-registering listener catches up on every previously-added model. These // models were added with their original caller's policy; the listener now receives - // them under createIfAbsent() because this catch-up is boot-time model registration, + // them under schemaCreateIfAbsent() because this catch-up is boot-time model registration, // not an on-demand operator reshape — we want the same "create-if-absent + report // shape mismatch" semantics, never auto-reshape. 
for (Model model : modelsSnapshot) { - listener.whenCreating(model, StorageManipulationOpt.createIfAbsent()); + listener.whenCreating(model, StorageManipulationOpt.schemaCreateIfAbsent()); } } diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManagerTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManagerTest.java index 6c07e4e27b25..abac015e685f 100644 --- a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManagerTest.java +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/DSLClassLoaderManagerTest.java @@ -106,8 +106,8 @@ void loaderNameKindPrefixIsConsistentWithBuildKind() { Catalog.LAL, rule, DSLClassLoaderManager.Kind.RUNTIME, "h"); assertTrue(runtimeLoader.getName().startsWith("runtime-rule:lal/" + rule)); - final RuleClassLoader staticLoader = DSLClassLoaderManager.INSTANCE.newBuilder( - Catalog.LAL, rule, DSLClassLoaderManager.Kind.STATIC, "h"); - assertTrue(staticLoader.getName().startsWith("static:lal/" + rule)); + final RuleClassLoader bundledLoader = DSLClassLoaderManager.INSTANCE.newBuilder( + Catalog.LAL, rule, DSLClassLoaderManager.Kind.BUNDLED, "h"); + assertTrue(bundledLoader.getName().startsWith("bundled:lal/" + rule)); } } diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoaderTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoaderTest.java index ab0df81e3678..1a03400c03ad 100644 --- a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoaderTest.java +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/classloader/RuleClassLoaderTest.java @@ -53,12 +53,12 @@ void runtimeKindLoaderNameHasRuntimeRulePrefix() { } @Test - void 
staticKindLoaderNameHasStaticPrefix() { + void bundledKindLoaderNameHasBundledPrefix() { final RuleClassLoader loader = new RuleClassLoader( - DSLClassLoaderManager.Kind.STATIC, Catalog.LOG_MAL_RULES, "service-resp", "h", + DSLClassLoaderManager.Kind.BUNDLED, Catalog.LOG_MAL_RULES, "service-resp", "h", Thread.currentThread().getContextClassLoader()); - assertTrue(loader.getName().startsWith("static:log-mal-rules/service-resp@"), - "expected static prefix, got: " + loader.getName()); + assertTrue(loader.getName().startsWith("bundled:log-mal-rules/service-resp@"), + "expected bundled prefix, got: " + loader.getName()); } @Test diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageModelsTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageModelsTest.java index 8aee0e2d86b4..0a91d2e780f2 100644 --- a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageModelsTest.java +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageModelsTest.java @@ -72,7 +72,7 @@ public void rolledBackOnListenerFailure() throws StorageException { }); Assertions.assertThrows(StorageException.class, () -> models.add(TestModel.class, -1, new Storage("StorageModelsRollbackTest", false, DownSampling.Hour), - StorageManipulationOpt.fullInstall())); + StorageManipulationOpt.withSchemaChange())); // Registry must not retain the model — a retry would otherwise dedup-skip the // listener instead of attempting the DDL again. 
assertEquals(0, models.allModels().size()); @@ -87,7 +87,7 @@ public void removeKeepsModelOnListenerFailure() throws StorageException { StorageModels models = new StorageModels(); models.add(TestModel.class, -1, new Storage("StorageModelsRemoveRetryTest", false, DownSampling.Hour), - StorageManipulationOpt.fullInstall()); + StorageManipulationOpt.withSchemaChange()); assertEquals(1, models.allModels().size()); // Listener that throws on remove (simulating BanyanDB delete-measure transient failure). @@ -106,7 +106,7 @@ public void whenRemoving(final Model model, final StorageManipulationOpt opt) th }); Assertions.assertThrows(StorageException.class, - () -> models.remove(TestModel.class, StorageManipulationOpt.fullInstall())); + () -> models.remove(TestModel.class, StorageManipulationOpt.withSchemaChange())); // Model must still be in the registry — the next retry needs to find and drive // dropTable again. Otherwise the operator's /inactivate succeeds locally but the // backend measure stays orphaned forever. 
@@ -118,7 +118,7 @@ public void testStorageModels() throws StorageException { StorageModels models = new StorageModels(); models.add(TestModel.class, -1, new Storage("StorageModelsTest", false, DownSampling.Hour), - StorageManipulationOpt.fullInstall() + StorageManipulationOpt.withSchemaChange() ); final List allModules = models.allModels(); diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplier.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplier.java index 5345312bdd20..8ba89cb2bf04 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplier.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/LalFileApplier.java @@ -109,8 +109,8 @@ public Applied apply(final String yamlContent, final String sourceName, } /** - * Origin-tagged overload: {@link DSLClassLoaderManager.Kind#STATIC} mints a {@code static:} - * loader so the static fall-over path (bundled rule serving again after the runtime + * Origin-tagged overload: {@link DSLClassLoaderManager.Kind#BUNDLED} mints a {@code bundled:} + * loader so the bundled fall-over path (bundled rule serving again after the runtime * override is removed) is distinguishable from the runtime path in logs and diagnostics. 
*/ public Applied apply(final String yamlContent, final String sourceName, diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java index d6a96e905864..b9560aaa26c5 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java @@ -90,8 +90,8 @@ public Applied apply(final String yamlContent, final String sourceName, } /** - * Origin-tagged overload: {@link DSLClassLoaderManager.Kind#STATIC} mints a {@code static:} - * loader so the static fall-over path (bundled rule serving again after the runtime + * Origin-tagged overload: {@link DSLClassLoaderManager.Kind#BUNDLED} mints a {@code bundled:} + * loader so the bundled fall-over path (bundled rule serving again after the runtime * override is removed) is distinguishable from the runtime path in logs and diagnostics. */ public Applied apply(final String yamlContent, final String sourceName, @@ -143,11 +143,11 @@ public Applied apply(final String yamlContent, final String sourceName, /** * Back-compat overload: callers that haven't yet picked a storage policy pass - * {@link StorageManipulationOpt#fullInstall()}. Main-node apply path. + * {@link StorageManipulationOpt#withSchemaChange()}. Main-node apply path. 
*/ public Applied apply(final String yamlContent, final String sourceName, final String contentHash) throws ApplyException { - return apply(yamlContent, sourceName, contentHash, StorageManipulationOpt.fullInstall()); + return apply(yamlContent, sourceName, contentHash, StorageManipulationOpt.withSchemaChange()); } /** @@ -156,15 +156,15 @@ public Applied apply(final String yamlContent, final String sourceName, * {@code ClassLoaderGc} output. */ public Applied apply(final String yamlContent, final String sourceName) throws ApplyException { - return apply(yamlContent, sourceName, "", StorageManipulationOpt.fullInstall()); + return apply(yamlContent, sourceName, "", StorageManipulationOpt.withSchemaChange()); } /** * Reverse of {@link #apply}: drop every metric name the previous apply registered under * the given {@link StorageManipulationOpt storage policy}. Main-node callers pass - * {@link StorageManipulationOpt#fullInstall()} so {@code BanyanDBIndexInstaller.dropTable} + * {@link StorageManipulationOpt#withSchemaChange()} so {@code BanyanDBIndexInstaller.dropTable} * actually deletes the server-side measure. Peer-node callers pass - * {@link StorageManipulationOpt#localCacheOnly()} so local teardown (L1/L2 drain, + * {@link StorageManipulationOpt#withoutSchemaChange()} so local teardown (L1/L2 drain, * {@code meterPrototypes} eviction, CtClass detach) still runs but the server-side drop * is suppressed — main owns server-side state. * @@ -204,7 +204,7 @@ public void remove(final Set metricNames, final StorageManipulationOpt s /** Back-compat overload: full-install policy (server-side drop fires). 
*/ public void remove(final Set metricNames) { - remove(metricNames, StorageManipulationOpt.fullInstall()); + remove(metricNames, StorageManipulationOpt.withSchemaChange()); } private Rule parse(final String yamlContent, final String sourceName) throws ApplyException { diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngine.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngine.java index f2f5bc3ef76d..3056694f175c 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngine.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/RuleEngine.java @@ -21,7 +21,9 @@ import java.util.Map; import java.util.Set; import java.util.function.Consumer; +import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.library.module.ModuleManager; /** @@ -121,24 +123,27 @@ * Engine clears its applied-state entry, drops registered * dispatcher handlers, retires the classloader, fires alarm reset * for the prior metric set. Storage opt determines whether - * backend schema is dropped (fullInstall) or preserved - * (localCacheOnly — the {@code /inactivate} contract). + * backend schema is dropped (withSchemaChange) or preserved + * (withoutSchemaChange — the {@code /inactivate} contract). * * - *

4. Destructive {@code /delete} (driven by {@code DSLRuntimeDelete})
+ * 4. Bundled-revert prep (driven by {@code DSLRuntimeDelete})
 *
      - *                         after {@code /inactivate} has already cleared the engine's
      - *                         applied state. Engines with backend schema (MAL) re-register
      - *                         prototypes locally then tear down under fullInstall so the
      - *                         listener chain runs the destructive cascade. Engines without
      - *                         backend (LAL) implement as no-op — the DAO row deletion alone
      - *                         discharges the rule.
      + *   installRuntime(catalog, name, content, ctx) — called by REST
      + *                         {@code /delete?mode=revertToBundled} after {@code /inactivate}
      + *                         has already cleared the engine's applied state. Engines with
      + *                         backend schema (MAL) re-register prototypes under
      + *                         {@code withoutSchemaChange} so a subsequent {@code apply} of
      + *                         the bundled YAML can compute its delta against runtime and
      + *                         drop runtime-only metrics through the standard commit path.
      + *                         Engines without backend (LAL) implement as no-op. Symmetric
      + *                         with {@link #recordBundledClaims} — both install a DSL into
      + *                         local applied state without firing the listener chain.
        * 
      * *

      5. Boot / recovery (driven by {@code StaticRuleLoader}) *

      - *   loadStaticRuleFile(catalog, name, content) — called once at boot for every static rule
      + *   recordBundledClaims(catalog, name, content) — called once at boot for every static rule
        *                         the catalog loaders compiled at module start, and again on each
        *                         tick for any static rule whose DB row got {@code /delete}d while
        *                         the disk content remained. Engine seeds a synthetic applied
      @@ -243,7 +248,7 @@ public interface RuleEngine {
            * the loader stay DSL-agnostic — it doesn't need to know whether the engine's applied
            * state is keyed on metric names or {@code (layer, ruleName)} tuples.
            */
      -    boolean loadStaticRuleFile(String catalog, String name, String content);
      +    boolean recordBundledClaims(String catalog, String name, String content);
       
           /**
            * Build the engine's concrete {@link ApplyContext} subtype from the shared
      @@ -257,16 +262,25 @@ public interface RuleEngine {
            * generated classes + per-file classloader + delta info. NO backend DDL fired here, NO
            * scheduler-cache mutation. Throws {@link RuntimeException} on compile failure;
            * scheduler stamps {@code applyError} on the snapshot and surfaces to the caller.
      +     *
      +     * 

      {@code kind} controls how the per-file classloader is tagged in {@link + * org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager}: + * {@link DSLClassLoaderManager.Kind#RUNTIME} for {@code /addOrUpdate} and tick paths, + * {@link DSLClassLoaderManager.Kind#BUNDLED} for the {@code /delete?mode=revertToBundled} + * path that re-installs the bundled YAML through the standard apply pipeline. The kind + * surfaces in {@code /list} so operators can tell at a glance whether a key is being + * served by a runtime override or a bundled fall-over. */ CompiledDSL compile(RuntimeRuleManagementDAO.RuntimeRuleFile file, Classification classification, + DSLClassLoaderManager.Kind kind, C ctx); /** * Phase: schema changes. Drive the listener chain (BanyanDB define / drop, ES index * mapping, JDBC table, etc.) for the deltas this CompiledDSL represents. The * {@code StorageManipulationOpt} on the context controls whether the listeners actually - * fire (full / localCacheOnly / localCacheVerify). LAL impl is a no-op (no backend + * fire (withSchemaChange / withoutSchemaChange / verifySchemaOnly). LAL impl is a no-op (no backend * schema). Throws on backend failure; scheduler invokes {@link #rollback}. */ void fireSchemaChanges(CompiledDSL compiled, C ctx); @@ -297,53 +311,44 @@ CompiledDSL compile(RuntimeRuleManagementDAO.RuntimeRuleFile file, /** * Tear down a previously-applied bundle (or a static-only bundle). Driven by - * {@code /inactivate} (with {@code localCacheOnly} so backend stays), {@code /delete} - * (with {@code fullInstall} so backend drops), and the tick's gone-keys cleanup on main. + * {@code /inactivate} (with {@code withoutSchemaChange} so backend stays), {@code /delete} + * (with {@code withSchemaChange} so backend drops), and the tick's gone-keys cleanup on main. * Engine clears its own dispatcher state + per-key applied entry. Shared post-cleanup * (content-cache clear) is the orchestrator's concern after this call returns.
*/ void unregister(String catalog, String name, C ctx); /** - * Discharge backend schema for {@code /delete}. By the time the REST handler invokes - * {@code /delete}, {@code /inactivate} has already cleared the engine's applied state - * — a naive {@link #unregister} call would no-op the destructive cascade and the - * backend resource would orphan once the DAO row is deleted. Engines that own backend - * schema (MAL) re-register prototypes locally then tear down under fullInstall so the - * listener chain runs the destructive cascade on the existing resource. Engines without - * backend (LAL) implement this as a no-op — the row deletion alone discharges the rule. + * Install a runtime DSL into local applied state without firing the backend listener + * chain. Pairs with {@link #recordBundledClaims}: that method seeds bundled YAML at boot + * time; this one re-installs runtime content after {@code /inactivate} has cleared the + * applied state, so a subsequent {@link + * org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLRuntimeApply#apply} + * of the bundled YAML can compute a runtime→bundled delta and drop runtime-only metrics + * through the standard commit path. * - *

      {@code bundledContent} controls the destructiveness: - *

        - *
      • {@code null} — destructive: drop all backend resources the runtime row - * claimed. The rule is being permanently removed (no bundled twin on disk to - * fall back to).
      • - *
      • non-null — delta: drop only metrics that {@code runtimeContent} claims but - * {@code bundledContent} does not, plus metrics in both at different shape. - * Bundled-shared metrics at matching shape are preserved (no data loss for the - * measures bundled will reuse on its synchronous reload). Used when {@code - * /delete} reverts to a bundled twin.
      • - *
      + *

      Engines with backend schema (MAL) compile + register meterPrototypes / Models under + * {@code withoutSchemaChange} so BanyanDB / ES / JDBC stay untouched. Engines without backend + * (LAL) implement this as a no-op — bundled's apply pipeline reinstalls handlers without + * needing a runtime delta. * *

      Throws {@link IllegalStateException} when a prerequisite fails (e.g., MeterSystem - * unavailable, parse error in either content); the caller (the {@code DSLRuntimeDelete} - * orchestrator) propagates the throw so the REST handler aborts the row deletion — - * refusing to delete the row is the correct failure mode (an orphaned backend resource - * with no DAO row to drive a retry is worse). + * unavailable, parse error in the runtime content); the caller (the {@code + * DSLRuntimeDelete} orchestrator) propagates the throw so the REST handler aborts + * the row deletion before any destructive moment. */ - void dropBackend(String catalog, String name, String runtimeContent, - String bundledContent, C ctx); + void installRuntime(String catalog, String name, String runtimeContent, C ctx); /** * After a runtime override has been removed for {@code (catalog, name)}, reload the * bundled rule from {@link * org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry} (if any) and bring - * it back into service via a fresh {@code static:} loader from + * it back into service via a fresh {@code bundled:} loader from * {@link org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager}. * *

      Returns {@code true} when a bundled rule was found and reinstalled; {@code false} * when no bundled rule exists for this key (the rule is genuinely gone) or the engine - * doesn't participate in static fall-over (e.g. its catalog has no {@code StaticRuleRegistry} + * doesn't participate in bundled fall-over (e.g. its catalog has no {@code StaticRuleRegistry} * entries). * *

      Errors during reload propagate as {@link RuntimeException}s the orchestrator logs @@ -356,7 +361,12 @@ void dropBackend(String catalog, String name, String runtimeContent, * drives reset itself) doesn't double-reset. * @param moduleManager scheduler-supplied module manager so the engine can resolve its * backend dispatcher (MeterSystem / LogFilterListener.Factory). + * @param storageOpt storage policy for the reload. Main-node {@code /delete} passes + * {@code schemaCreateIfAbsent} so backend resources removed by the delta + * drop get recreated; peer-node tick reconcile passes the tick's + * own opt (typically {@code withoutSchemaChange}) so peers don't + * double-write DDL. */ - boolean reloadStatic(String catalog, String name, Consumer> alarmResetter, - ModuleManager moduleManager); + boolean installBundled(String catalog, String name, Consumer> alarmResetter, + ModuleManager moduleManager, StorageManipulationOpt storageOpt); } diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalRuleEngine.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalRuleEngine.java index 6fd1de48ab64..d5ef5e416adf 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalRuleEngine.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/lal/LalRuleEngine.java @@ -34,6 +34,7 @@ import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; import org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry; import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; +import 
org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.library.module.ModuleManager; import org.apache.skywalking.oap.server.receiver.runtimerule.apply.DeltaClassifier; import org.apache.skywalking.oap.server.receiver.runtimerule.apply.LalFileApplier; @@ -171,7 +172,7 @@ public Map> activeClaimsExcluding(final String selfKey) { } @Override - public boolean loadStaticRuleFile(final String catalog, final String name, final String content) { + public boolean recordBundledClaims(final String catalog, final String name, final String content) { final String key = DSLScriptKey.key(catalog, name); if (appliedFor(rules, key) != null) { return false; @@ -208,6 +209,7 @@ public LalApplyContext newApplyContext(final ApplyInputs inputs) { @Override public CompiledDSL compile(final RuntimeRuleManagementDAO.RuntimeRuleFile file, final Classification classification, + final DSLClassLoaderManager.Kind kind, final LalApplyContext ctx) { final String key = DSLScriptKey.key(file.getCatalog(), file.getName()); final String sourceName = file.getCatalog() + "/" + file.getName(); @@ -221,7 +223,7 @@ public CompiledDSL compile(final RuntimeRuleManagementDAO.RuntimeRuleFile file, final LalFileApplier.Applied oldApplied = appliedFor(ctx.getRules(), key); try { final LalFileApplier.Applied newApplied = lalApplier.apply( - file.getContent(), sourceName, newHash); + file.getContent(), sourceName, newHash, kind); return new CompiledLalDSL(file.getCatalog(), file.getName(), newHash, classification, file.getContent(), oldApplied, newApplied); } catch (final LalFileApplier.ApplyException ae) { @@ -353,9 +355,23 @@ public void rollback(final CompiledDSL compiled, final LalApplyContext ctx) { return; } try { - final String oldHash = ContentHash - .sha256Hex(prior.getContent()); - lalApplier.apply(prior.getContent(), sourceName, oldHash); + final String oldHash = ContentHash.sha256Hex(prior.getContent()); + final LalFileApplier.Applied 
restored = lalApplier.apply( + prior.getContent(), sourceName, oldHash); + // Promote the restored loader through the manager so /list reflects the + // actual serving loader and a later /delete or fall-over can retire it + // through the graveyard. Without this promote, the manager's active map + // would still point at the (failed-and-discarded) new loader's prior entry + // — a stale view the orchestrator never recovers from until the next apply. + if (restored.getRuleClassLoader() != null) { + DSLClassLoaderManager.INSTANCE.commit(restored.getRuleClassLoader()) + .filter(displaced -> displaced != restored.getRuleClassLoader()) + .ifPresent(DSLClassLoaderManager.INSTANCE::retire); + } + ctx.getRules().compute(key, (k, prev) -> prev == null + ? new AppliedRuleScript(c.getCatalog(), c.getName(), + prior.getContent(), null).withApplied(restored) + : prev.withContentAndApplied(prior.getContent(), restored)); log.info("runtime-rule LAL engine: rollback OK for {}/{} — {} partial registration(s) removed and prior DSL restored", c.getCatalog(), c.getName(), c.getNewApplied().getRegistered().size()); } catch (final LalFileApplier.ApplyException e) { @@ -426,12 +442,12 @@ public void unregister(final String catalog, final String name, final LalApplyCo staticKeys.size(), catalog, name); } - /** No-op: LAL has no backend schema. {@code /delete}'s row deletion alone discharges - * the rule — no destructive cascade or delta-drop needed. */ + /** No-op: LAL has no backend schema, so {@code /delete?mode=revertToBundled} doesn't + * need to install prior runtime claims for delta computation — bundled's apply + * pipeline reinstalls handlers without needing a runtime delta. 
*/ @Override - public void dropBackend(final String catalog, final String name, - final String runtimeContent, final String bundledContent, - final LalApplyContext ctx) { + public void installRuntime(final String catalog, final String name, + final String runtimeContent, final LalApplyContext ctx) { // Intentionally no-op. } @@ -441,14 +457,16 @@ public void dropBackend(final String catalog, final String name, * the bundled rule's compiled classes would be gone and operators would have to restart * the OAP to get the bundled DSL serving again. * - *

      Compiles via {@code lalApplier.apply(..., Kind.STATIC)} so the per-file loader is - * minted with the {@code static:} prefix — diagnostics can tell at a glance whether a - * key is being served by a runtime override or a static fall-over. + *

      Compiles via {@code lalApplier.apply(..., Kind.BUNDLED)} so the per-file loader is + * minted with the {@code bundled:} prefix — diagnostics can tell at a glance whether a + * key is being served by a runtime override or a bundled fall-over. */ @Override - public boolean reloadStatic(final String catalog, final String name, + public boolean installBundled(final String catalog, final String name, final Consumer> alarmResetter, - final ModuleManager moduleManager) { + final ModuleManager moduleManager, + final StorageManipulationOpt storageOpt) { + // LAL has no backend schema — storageOpt is unused here, accepted for SPI symmetry. if (!CATALOGS.contains(catalog)) { return false; } @@ -466,8 +484,8 @@ public boolean reloadStatic(final String catalog, final String name, final String hash = ContentHash.sha256Hex(staticContent); try { final LalFileApplier.Applied fresh = lalApplier.apply( - staticContent, sourceName, hash, DSLClassLoaderManager.Kind.STATIC); - // Promote the new static: loader. The displaced prior, if any, is retired — + staticContent, sourceName, hash, DSLClassLoaderManager.Kind.BUNDLED); + // Promote the new bundled: loader. The displaced prior, if any, is retired — // typically null here (we're called immediately after unregister, which already // dropRuntime'd the old runtime loader). if (fresh.getRuleClassLoader() != null) { @@ -486,12 +504,12 @@ public boolean reloadStatic(final String catalog, final String name, final ReentrantLock lock = prev != null ? 
prev.getLock() : new ReentrantLock(); return new AppliedRuleScript(catalog, name, staticContent, null, lock, fresh); }); - log.info("runtime-rule LAL engine: static fall-over OK for {}/{} — {} rule(s) " + log.info("runtime-rule LAL engine: bundled fall-over OK for {}/{} — {} rule(s) " + "registered from bundled YAML", catalog, name, fresh.getRegistered().size()); return true; } catch (final LalFileApplier.ApplyException ae) { - log.warn("runtime-rule LAL engine: static fall-over for {}/{} failed to compile " + log.warn("runtime-rule LAL engine: bundled fall-over for {}/{} failed to compile " + "the bundled YAML; bundled rule will stay dark until next /addOrUpdate " + "or restart", catalog, name, ae); return false; diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalRuleEngine.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalRuleEngine.java index d4af4d8d9bae..9f670200b79c 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalRuleEngine.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/engine/mal/MalRuleEngine.java @@ -252,7 +252,7 @@ public Map> activeClaimsExcluding(final String selfKey) { } @Override - public boolean loadStaticRuleFile(final String catalog, final String name, final String content) { + public boolean recordBundledClaims(final String catalog, final String name, final String content) { final String key = DSLScriptKey.key(catalog, name); if (appliedFor(rules, key) != null) { return false; @@ -292,6 +292,7 @@ public MalApplyContext newApplyContext(final ApplyInputs inputs) { @Override public CompiledDSL 
compile(final RuntimeRuleManagementDAO.RuntimeRuleFile file, final Classification classification, + final DSLClassLoaderManager.Kind kind, final MalApplyContext ctx) { final String key = DSLScriptKey.key(file.getCatalog(), file.getName()); final String sourceName = file.getCatalog() + "/" + file.getName(); @@ -311,7 +312,7 @@ public CompiledDSL compile(final RuntimeRuleManagementDAO.RuntimeRuleFile file, final MalFileApplier.Applied fresh; try { fresh = applier.apply( - file.getContent(), sourceName, newHash, ctx.getStorageOpt()); + file.getContent(), sourceName, newHash, ctx.getStorageOpt(), kind); } catch (final MalFileApplier.ApplyException ae) { // Engine-internal partial rollback: undo whatever this attempt managed to // register before the throw. Old appliedMal[key] is untouched — it's still @@ -346,7 +347,7 @@ public CompiledDSL compile(final RuntimeRuleManagementDAO.RuntimeRuleFile file, final MalFileApplier.Applied newApplied; try { newApplied = applier.apply( - file.getContent(), sourceName, newHash, ctx.getStorageOpt()); + file.getContent(), sourceName, newHash, ctx.getStorageOpt(), kind); } catch (final MalFileApplier.ApplyException ae) { // Engine-internal partial rollback: undo only the metrics this attempt would // have created or re-shaped (added ∪ shape-break). Unchanged metrics short- @@ -459,8 +460,8 @@ public void commit(final CompiledDSL compiled, final MalApplyContext ctx) { final CompiledMalDSL c = (CompiledMalDSL) compiled; // Drop metrics this bundle no longer claims (STRUCTURAL/NEW only — FILTER_ONLY has // identical metric sets). Honours the caller's storage opt: a peer-driven tick uses - // localCacheOnly here so the cluster-shared backend isn't touched; the main's REST - // path uses fullInstall to fire dropTable through the listener chain. Must run + // withoutSchemaChange here so the cluster-shared backend isn't touched; the main's REST + // path uses withSchemaChange to fire dropTable through the listener chain. 
Must run // BEFORE the swap so the about-to-be-displaced applier still owns the prototypes. if (c.getClassification() != Classification.FILTER_ONLY && c.getDelta() != null @@ -537,13 +538,15 @@ public void rollback(final CompiledDSL compiled, final MalApplyContext ctx) { *

      The {@link MalApplyContext#getStorageOpt()} parameter decides whether the listener * chain reaches the backend: *

        - *
      • {@code localCacheOnly} — soft-pause path. Local state is cleared (meterPrototypes, + *
      • {@code withoutSchemaChange} — soft-pause path. Local state is cleared (meterPrototypes, * Models from registry, appliedMal entry, classloader retired) but the listener's * {@code dropTable} is skipped, so the BanyanDB measure / ES index / JDBC table stays * intact. This is the {@code /inactivate} contract.
      • - *
      • {@code fullInstall} — destructive path. Same local cleanup PLUS the listener fires - * {@code dropTable} so the backend resource is removed. This is the {@code /delete} - * contract and the tick's gone-keys cleanup on main.
      • + *
      • {@code withSchemaChange} — schema-change path. Same local cleanup PLUS the listener + * fires {@code dropTable} so the backend resource is removed. Used by the + * {@code /delete?mode=revertToBundled} pipeline (where the engine's commit drops + * runtime-only metrics through the listener chain) and by STRUCTURAL + * {@code /addOrUpdate} that drops shape-broken metrics.
      • *
      * *

      Cascade-first ordering. {@code applier.remove} runs before {@code @@ -618,77 +621,35 @@ public void unregister(final String catalog, final String name, final MalApplyCo } /** - * Discharge backend schema for {@code /delete}. {@code bundledContent} controls the - * destructiveness: + * Install the runtime MAL DSL into local MeterSystem state under + * {@code withoutSchemaChange} so a subsequent {@code apply} of the bundled YAML can + * compute its delta against the prior runtime claim set. The listener chain is + * suppressed — backend schema is untouched here. The destructive moment lives inside + * the standard apply pipeline that runs immediately after this call: its + * {@link #commit} sees the loaded prior, calls + * {@code applier.remove(removedMetrics, withSchemaChange)}, and the listener cascade fires + * {@code dropTable} on the runtime-only metrics that bundled doesn't claim. * - *

        - *
      • {@code null} — destructive: re-register prototypes locally under - * {@code localCacheOnly} (so the listener chain doesn't re-create the measure - * we're about to drop) and then tear down via {@link #unregister} under - * {@code fullInstall}. The two-step dance is needed because {@code /inactivate} - * has already cleared {@code appliedMal[key]}; without re-register, unregister - * would no-op the cascade and the backend would orphan.
      • - *
      • non-null — delta: classify {@code runtimeContent} → {@code bundledContent} - * and drop only metrics the runtime row claims that bundled does NOT claim, plus - * metrics in both at different shape. Bundled-shared metrics at matching shape - * are preserved (no data loss for the measures bundled will reuse on its - * synchronous reload). The drop runs under {@code fullInstall} so the listener - * cascade fires.
      • - *
      - * - *

      Throws {@link IllegalStateException} on MeterSystem unavailability or re-register - * failure; the caller propagates so the REST handler aborts {@code dao.delete}. + *

      Throws {@link IllegalStateException} on MeterSystem unavailability or register + * failure; the caller propagates so the REST handler aborts {@code dao.delete} before + * any destructive moment. */ @Override - public void dropBackend(final String catalog, final String name, - final String runtimeContent, final String bundledContent, - final MalApplyContext ctx) { + public void installRuntime(final String catalog, final String name, + final String runtimeContent, final MalApplyContext ctx) { final MalFileApplier applier = resolveApplier(); if (applier == null) { throw new IllegalStateException( - "MeterSystem unavailable; cannot drop backend measure for " + catalog + "/" - + name + " — refusing to delete the row and orphan the measure. Retry " - + "when MeterSystem is up."); - } - if (bundledContent != null) { - dropBackendDelta(catalog, name, runtimeContent, bundledContent, applier); - return; + "MeterSystem unavailable; cannot install runtime locally for " + catalog + "/" + + name + " — refusing to delete the row. 
Retry when MeterSystem is up."); } - dropBackendDestructive(catalog, name, runtimeContent, applier, ctx); - } - - private void dropBackendDelta(final String catalog, final String name, - final String runtimeContent, final String bundledContent, - final MalFileApplier applier) { - final DSLDelta delta = DeltaClassifier.classifyMal(runtimeContent, bundledContent); - final Set toDrop = new HashSet<>(); - toDrop.addAll(delta.removedMetrics()); - toDrop.addAll(delta.shapeBreakMetrics()); - if (toDrop.isEmpty()) { - log.info("runtime-rule MAL engine: /delete bundled-twin delta empty for {}/{} — " - + "nothing to drop, bundled will reuse all existing measures", - catalog, name); - return; - } - log.info("runtime-rule MAL engine: /delete bundled-twin delta for {}/{} — dropping {} " - + "runtime-only / shape-break metric(s): {}", - catalog, name, toDrop.size(), toDrop); - applier.remove(toDrop, StorageManipulationOpt.fullInstall()); - } - - private void dropBackendDestructive(final String catalog, final String name, - final String runtimeContent, final MalFileApplier applier, - final MalApplyContext ctx) { final String key = DSLScriptKey.key(catalog, name); final String sourceName = catalog + "/" + name; final String hash = ContentHash.sha256Hex(runtimeContent); - // Re-register prototypes locally so unregister has Models + meterPrototypes to walk. - // localCacheOnly suppresses listener-side backend define — we don't want to recreate - // the measure we're about to drop. 
try { final MalFileApplier.Applied applied = applier.apply( - runtimeContent, sourceName, hash, StorageManipulationOpt.localCacheOnly()); + runtimeContent, sourceName, hash, StorageManipulationOpt.withoutSchemaChange()); if (applied.getRuleClassLoader() != null) { DSLClassLoaderManager.INSTANCE.commit(applied.getRuleClassLoader()) .filter(prior -> prior != applied.getRuleClassLoader()) @@ -699,41 +660,21 @@ private void dropBackendDestructive(final String catalog, final String name, .withContentAndApplied(runtimeContent, applied) : prev.withContentAndApplied(runtimeContent, applied)); } catch (final MalFileApplier.ApplyException ae) { - // Roll back any partial state that DID land before the throw — every other apply - // path does the same. localCacheOnly matches the apply: backend was untouched. if (ae.getPartiallyRegistered() != null && !ae.getPartiallyRegistered().isEmpty()) { try { applier.remove(ae.getPartiallyRegistered(), - StorageManipulationOpt.localCacheOnly()); + StorageManipulationOpt.withoutSchemaChange()); } catch (final Throwable rollbackErr) { - log.warn("runtime-rule /delete: rollback of partial re-register also " + log.warn("runtime-rule installRuntime: rollback of partial re-register also " + "failed for {}/{}; {} prototype(s) may persist locally until OAP " + "restart.", catalog, name, ae.getPartiallyRegistered().size(), rollbackErr); } } throw new IllegalStateException( - "re-register for backend drop failed for " + catalog + "/" + name - + "; refusing to delete the row to avoid orphaning the measure. " - + "Cause: " + ae.getMessage(), ae); + "installRuntime for revert-to-bundled failed for " + catalog + "/" + name + + "; refusing to delete the row. Cause: " + ae.getMessage(), ae); } - - // Tear down with fullInstall: drops backend (listener whenRemoving fires dropTable - // for each downsampling variant) and clears the re-registered local state. We need - // to swap the storage opt for this call — clone the context with fullInstall. 
- final MalApplyContext fullInstallCtx = withStorageOpt(ctx, StorageManipulationOpt.fullInstall()); - unregister(catalog, name, fullInstallCtx); - } - - /** Clone {@code ctx} with the given storage opt. Used by the destructive - * {@link #dropBackend} path to flip from {@code localCacheOnly} (re-register) to - * {@code fullInstall} (destructive teardown). */ - private static MalApplyContext withStorageOpt(final MalApplyContext ctx, - final StorageManipulationOpt opt) { - final ApplyInputs inputs = new ApplyInputs( - ctx.getModuleManager(), opt, - ctx.getAlarmResetter(), ctx.getRules()); - return new MalApplyContext(inputs); } /** @@ -743,10 +684,10 @@ private static MalApplyContext withStorageOpt(final MalApplyContext ctx, * gone and operators would have to restart the OAP to get the bundled metrics flowing * again. * - *

      Compiles via {@code applier.apply(..., Kind.STATIC)} so the per-file loader is minted - * with the {@code static:} prefix — diagnostics can tell at a glance whether a key is - * being served by a runtime override or a static fall-over. The applier internally runs - * under {@code localCacheOnly}: the bundled metric backend already exists (it pre-dates + *

      Compiles via {@code applier.apply(..., Kind.BUNDLED)} so the per-file loader is minted + * with the {@code bundled:} prefix — diagnostics can tell at a glance whether a key is + * being served by a runtime override or a bundled fall-over. The applier runs under the + * caller's {@code storageOpt}; under {@code withoutSchemaChange} the backend is untouched (it pre-dates * the override), so we only need to re-register local prototypes and re-publish the * MetricConvert. * @@ -754,9 +695,10 @@ private static MalApplyContext withStorageOpt(final MalApplyContext ctx, * clean window. */ @Override - public boolean reloadStatic(final String catalog, final String name, + public boolean installBundled(final String catalog, final String name, final Consumer> alarmResetter, - final ModuleManager moduleManager) { + final ModuleManager moduleManager, + final StorageManipulationOpt storageOpt) { if (!CATALOGS.contains(catalog)) { return false; } @@ -773,15 +715,14 @@ public boolean reloadStatic(final String catalog, final String name, final String sourceName = catalog + "/" + name; final String hash = ContentHash.sha256Hex(staticContent); try { - // createIfAbsent rather than localCacheOnly: when reload follows a /delete that - // dropped runtime-only / shape-break measures (via dropBundledTwinDelta), some - // bundled-claimed measures may be missing in the backend. createIfAbsent recreates - // them without affecting backends that already match. + // Storage opt is caller-supplied: main-node /delete passes schemaCreateIfAbsent so + // measures the delta-drop just removed get recreated; peer-node tick reconcile + // passes the tick's own opt (typically withoutSchemaChange) so peers don't double- + // write DDL the main has already applied. final MalFileApplier.Applied fresh = applier.apply( - staticContent, sourceName, hash, - StorageManipulationOpt.createIfAbsent(), - DSLClassLoaderManager.Kind.STATIC); - // Promote the new static: loader.
Any prior loader (typically null — unregister + staticContent, sourceName, hash, storageOpt, + DSLClassLoaderManager.Kind.BUNDLED); + // Promote the new bundled: loader. Any prior loader (typically null — unregister // already dropRuntime'd it) is retired so the graveyard observes its collection. if (fresh.getRuleClassLoader() != null) { DSLClassLoaderManager.INSTANCE.commit(fresh.getRuleClassLoader()) @@ -799,12 +740,12 @@ public boolean reloadStatic(final String catalog, final String name, }); pushRuntimeConverter(catalog, name, fresh.getMetricConvert()); alarmResetter.accept(fresh.getRegisteredMetricNames()); - log.info("runtime-rule MAL engine: static fall-over OK for {}/{} — {} metric(s) " + log.info("runtime-rule MAL engine: bundled fall-over OK for {}/{} — {} metric(s) " + "re-registered from bundled YAML", catalog, name, fresh.getRegisteredMetricNames().size()); return true; } catch (final MalFileApplier.ApplyException ae) { - log.warn("runtime-rule MAL engine: static fall-over for {}/{} failed to compile " + log.warn("runtime-rule MAL engine: bundled fall-over for {}/{} failed to compile " + "the bundled YAML; bundled metrics will stay dark until next /addOrUpdate " + "or restart", catalog, name, ae); return false; diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java index db226847aac3..66bd6de55d58 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java +++ 
b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java @@ -112,8 +112,8 @@ * │ • per-file lock acquisition │ * │ • Suspend/Resume RPC fan-out │ * │ • cross-file ownership guard (DAO + appliedX) │ - * │ • storage-opt selection (fullInstall / localCacheOnly / │ - * │ localCacheVerify) — gates whether DDL fires │ + * │ • storage-opt selection (withSchemaChange / withoutSchemaChange / │ + * │ verifySchemaOnly) — gates whether DDL fires │ * │ • persistence (RuntimeRuleManagementDAO.save) + 2-PC stash │ * │ for STRUCTURAL via StructuralCommitCoordinator │ * │ • DSLRuntimeUnregister orchestrator routes teardown to engine │ @@ -169,10 +169,10 @@ *

      * *

      Peers converge on the next dslManager tick by reading the persisted row and re-running - * the same engines under {@link StorageManipulationOpt#localCacheOnly} — peers register + * the same engines under {@link StorageManipulationOpt#withoutSchemaChange} — peers register * local handlers + prototypes but skip backend DDL since main has already done the writes. - * {@code /inactivate} is soft-pause (localCacheOnly — backend preserved, OAP-internal state - * torn down); {@code /delete} is destructive (fullInstall so the listener chain fires + * {@code /inactivate} is soft-pause (withoutSchemaChange — backend preserved, OAP-internal state + * torn down); {@code /delete} is destructive (withSchemaChange so the listener chain fires * {@code dropTable}). Both ride the same {@link * org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLRuntimeUnregister} * orchestrator that dispatches to {@code engine.unregister}. @@ -349,12 +349,12 @@ public void notifyAfterCompleted() throws ModuleStartException { // In a k8s rollout the list flips to non-empty as soon as self joins it, then keeps // changing for tens of seconds as more pods boot. Gating on it neither guaranteed // membership stability nor saved a wasteful first apply. If this tick runs under - // {@code localCacheOnly} because peer list is empty, the next scheduled tick (2 s + // {@code withoutSchemaChange} because peer list is empty, the next scheduled tick (2 s // later) re-evaluates with whatever {@code RemoteClientManager} now shows and re- - // applies under {@code fullInstall} if this node resolves as main. Backend DDL is + // applies under {@code withSchemaChange} if this node resolves as main. Backend DDL is // idempotent so the re-apply costs nothing. 
try { - // atBoot=true so a no-init OAP picks localCacheVerify and refuses to + // atBoot=true so a no-init OAP picks verifySchemaOnly and refuses to // start with a missing or shape-mismatched backend (k8s pod backloop) // instead of silently registering local workers against schema that // doesn't exist. Init / default-mode OAPs are unaffected — their boot @@ -363,12 +363,12 @@ public void notifyAfterCompleted() throws ModuleStartException { log.info("Runtime rule dslManager: synchronous first tick completed " + "(runtime-only DB rows are now applied locally)."); } catch (final RuntimeException re) { - // Boot pass under localCacheVerify re-throws missing/mismatch as a + // Boot pass under verifySchemaOnly re-throws missing/mismatch as a // RuntimeException so module bootstrap aborts. Translate to // ModuleStartException so the OAP exit message points the operator at // the right place. throw new ModuleStartException( - "Runtime rule dslManager boot pass failed under localCacheVerify; " + "Runtime rule dslManager boot pass failed under verifySchemaOnly; " + "the backend schema is missing or diverges from the declared rule. 
" + "Bring up the init OAP first or align rule files with the backend, " + "then restart this node.", diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java index 317128476fef..ac085f77e557 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java @@ -33,6 +33,7 @@ import org.apache.skywalking.oap.server.core.CoreModule; import org.apache.skywalking.oap.server.core.alarm.AlarmKernelService; import org.apache.skywalking.oap.server.core.alarm.AlarmModule; +import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; @@ -105,7 +106,7 @@ * only when a forwarded request arrives at a node that itself doesn't believe it's main * (split cluster view). The tick picks its storage opt via {@link #tickStorageOpt(boolean)}; * the per-endpoint REST opts are routed by the handler ({@code /addOrUpdate} → - * {@code fullInstall}, {@code /inactivate} → {@code localCacheOnly}, {@code /delete} → + * {@code withSchemaChange}, {@code /inactivate} → {@code withoutSchemaChange}, {@code /delete} → * dedicated {@link DSLRuntimeDelete} path). 
*/ @Slf4j @@ -162,17 +163,17 @@ public final class DSLManager { private final DSLRuntimeApply dslRuntimeApply; /** Destructive {@code /delete} pipeline. Re-registers prototypes locally then tears down - * under fullInstall so the backend cascade fires before the DAO row is deleted. + * under withSchemaChange so the backend cascade fires before the DAO row is deleted. * Exposed via {@code @Getter}. */ @Getter private final DSLRuntimeDelete dslRuntimeDelete; - /** Boot-time seed + tick-time rehydrate of static rules. Exposed via {@code @Getter} + /** Boot-time seed + tick-time re-install of bundled rules. Exposed via {@code @Getter} * so the module provider can drive the boot-time load directly. */ @Getter private final StaticRuleLoader staticRuleLoader; - /** One-tick body — DB diff + apply + gone-keys cleanup + static rehydrate. */ + /** One-tick body — DB diff + apply + gone-keys cleanup + bundled re-install. */ private final RuleSync ruleSync; /** Catalog → engine lookup. Built once here from the per-DSL maps the scheduler owns; @@ -203,7 +204,7 @@ public DSLManager(final ModuleManager moduleManager, ); this.dslRuntimeDelete = new DSLRuntimeDelete( this.engineRegistry, this.moduleManager, - this.rules, this::invokeAlarmReset + this.rules, this::invokeAlarmReset, this.dslRuntimeApply ); this.staticRuleLoader = new StaticRuleLoader( this.engineRegistry, this.rules, @@ -228,13 +229,13 @@ public void tick() { /** * Variant invoked once at boot from {@code RuntimeRuleModuleProvider.notifyAfterCompleted} * with {@code atBoot=true}. The boot pass on a no-init OAP picks - * {@link StorageManipulationOpt#localCacheVerify()} so missing or shape-mismatched + * {@link StorageManipulationOpt#verifySchemaOnly()} so missing or shape-mismatched * backend schema fails the bootstrap (k8s pod backloop) instead of silently * proceeding. The scheduled executor calls the no-arg overload so subsequent ticks - * stay on the lenient {@code localCacheOnly} retry path. 
+ * stay on the lenient {@code withoutSchemaChange} retry path. * *

      Boot semantics are scoped to no-init mode only — init-mode OAPs continue to - * pick {@link StorageManipulationOpt#createIfAbsent()} (boot creates), and + * pick {@link StorageManipulationOpt#schemaCreateIfAbsent()} (boot creates), and * default-mode OAPs continue to pick by cluster main-ness. */ public void tick(final boolean atBoot) { @@ -317,7 +318,7 @@ private Map readCurrentDbRules return rules; } - /** Run one tick — DB diff + apply + gone-keys cleanup + static rehydrate. Delegates to + /** Run one tick — DB diff + apply + gone-keys cleanup + bundled re-install. Delegates to * {@link RuleSync}. */ private void applyDeltasFromDatabase(final boolean atBoot) { ruleSync.runOnce(atBoot); @@ -357,22 +358,24 @@ public DSLRuntimeState applyNowForRuleFile(final RuntimeRuleManagementDAO.Runtim */ public DSLRuntimeState applyNowForRuleFile(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, final boolean deferCommit) { - return applyNowForRuleFile(ruleFile, deferCommit, StorageManipulationOpt.fullInstall()); + return applyNowForRuleFile(ruleFile, deferCommit, StorageManipulationOpt.withSchemaChange()); } /** * Storage-opt overload of {@link #applyNowForRuleFile(RuntimeRuleManagementDAO.RuntimeRuleFile, boolean)}. * - *

      The REST {@code /inactivate} path passes {@link StorageManipulationOpt#localCacheOnly()} + *

      The REST {@code /inactivate} path passes {@link StorageManipulationOpt#withoutSchemaChange()} * here so the OAP-internal teardown — MeterSystem prototypes, MetricsStreamProcessor * entry / persistent workers, BatchQueue handlers, retired RuleClassLoader — runs to * completion while the backend's measure / table / index, and the data already stored - * under the pre-inactivate metric, are left intact. {@code /delete} (and STRUCTURAL - * {@code /addOrUpdate} that drops shape-broken metrics) keeps {@code fullInstall()} so - * the destructive cascade reaches the backend as before. + * under the pre-inactivate metric, are left intact. STRUCTURAL {@code /addOrUpdate} + * keeps {@code withSchemaChange()} so the listener chain reaches the backend for shape + * changes; {@code /delete} default mode does not run the apply pipeline at all (the + * row is just removed), and {@code /delete?mode=revertToBundled} uses + * {@code withSchemaChange()} via {@link DSLRuntimeDelete#revertToBundled}. * *

      Other call sites should keep using the no-opt overload above so the documented - * "REST path = fullInstall, peer tick = localCacheOnly" routing rule is unchanged. + * "REST path = withSchemaChange, peer tick = withoutSchemaChange" routing rule is unchanged. */ public DSLRuntimeState applyNowForRuleFile(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, final boolean deferCommit, @@ -545,7 +548,7 @@ private void handleApply(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile // 5. Engine pipeline — compile + fireSchemaChanges + verify. final DSLRuntimeApply.Outcome outcome = dslRuntimeApply.compileAndVerify( - ruleFile, cl, buildApplyInputs(storageOpt)); + ruleFile, cl, DSLClassLoaderManager.Kind.RUNTIME, buildApplyInputs(storageOpt)); if (outcome.status == DSLRuntimeApply.Outcome.Status.COMPILE_FAILED) { // Engine has already rolled back partial registrations. log.error("runtime-rule dslManager CRITICAL: apply COMPILE_FAILED for {}/{}: {}", @@ -707,7 +710,7 @@ private void invokeAlarmReset(final Set affectedMetricNames) { *

<p>RunningMode (boot/init context).
 * <ul>
 *   <li>{@code init} mode — OAP is the dedicated initialiser; install schema if
- *       absent. {@link StorageManipulationOpt#createIfAbsent()} matches what the
+ *       absent. {@link StorageManipulationOpt#schemaCreateIfAbsent()} matches what the
 *       rest of the static-rule install path does in init mode (idempotent against
 *       backends that already hold the table).
 *   <li>{@code no-init} mode — this OAP must NOT touch the backend; the init OAP
@@ -715,12 +718,12 @@ private void invokeAlarmReset(final Set affectedMetricNames) {
 *       or a scheduled tick:
 *       <ul>
 *         <li>Boot pass ({@code atBoot=true}) →
- *             {@link StorageManipulationOpt#localCacheVerify()}. Strict: backend
+ *             {@link StorageManipulationOpt#verifySchemaOnly()}. Strict: backend
 *             resources must already exist with the declared shape. A missing or
 *             mismatched schema fails the bootstrap (k8s pod backloop) — operator must
 *             bring up the init OAP first, or align rule files with the backend.
 *         <li>Scheduled tick ({@code atBoot=false}) →
- *             {@link StorageManipulationOpt#localCacheOnly()}. Lenient: the timer
+ *             {@link StorageManipulationOpt#withoutSchemaChange()}. Lenient: the timer
 *             retries forever without raising errors so transient absence (init OAP
 *             still catching up between ticks) self-heals.
 *       </ul>
@@ -729,17 +732,17 @@ private void invokeAlarmReset(final Set affectedMetricNames) {
 *
 * <p>Cluster main-ness (default mode only).
 * <ul>
- *   <li>Self is main → {@link StorageManipulationOpt#fullInstall()}. The REST path
+ *   <li>Self is main → {@link StorageManipulationOpt#withSchemaChange()}. The REST path
 *       has the same shape; tick rarely runs on main because REST usually
 *       converges the main's state first.
- *   <li>Peer (someone else is main) → {@link StorageManipulationOpt#localCacheOnly()}.
+ *   <li>Peer (someone else is main) → {@link StorageManipulationOpt#withoutSchemaChange()}.
 *       Local MeterSystem + MetadataRegistry populate so the peer dispatches samples
 *       correctly, but no server-side DDL fires.
 * </ul>
        * *

        When the cluster module isn't wired (embedded test topology), {@link * MainRouter#isSelfMain} returns {@code true} and the default-mode branch falls - * through to {@code fullInstall} — single-process deployments are always main. + * through to {@code withSchemaChange} — single-process deployments are always main. * * @param atBoot true for the synchronous one-shot pass invoked from * {@code RuntimeRuleModuleProvider.notifyAfterCompleted}; false for @@ -747,21 +750,21 @@ private void invokeAlarmReset(final Set affectedMetricNames) { */ private StorageManipulationOpt tickStorageOpt(final boolean atBoot) { if (RunningMode.isInitMode()) { - return StorageManipulationOpt.createIfAbsent(); + return StorageManipulationOpt.schemaCreateIfAbsent(); } if (RunningMode.isNoInitMode()) { return atBoot - ? StorageManipulationOpt.localCacheVerify() - : StorageManipulationOpt.localCacheOnly(); + ? StorageManipulationOpt.verifySchemaOnly() + : StorageManipulationOpt.withoutSchemaChange(); } try { final RemoteClientManager rcm = moduleManager.find(CoreModule.NAME).provider() .getService(RemoteClientManager.class); return MainRouter.isSelfMain(rcm) - ? StorageManipulationOpt.fullInstall() - : StorageManipulationOpt.localCacheOnly(); + ? 
StorageManipulationOpt.withSchemaChange() + : StorageManipulationOpt.withoutSchemaChange(); } catch (final Throwable t) { - return StorageManipulationOpt.fullInstall(); + return StorageManipulationOpt.withSchemaChange(); } } } diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeApply.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeApply.java index e4b1277b28a2..31a017828798 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeApply.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeApply.java @@ -19,6 +19,7 @@ package org.apache.skywalking.oap.server.receiver.runtimerule.reconcile; import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyContext; import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyInputs; @@ -58,8 +59,14 @@ * *

        Deferred commit. The scheduler can invoke {@link #compileAndVerify} (no commit) * for the STRUCTURAL REST 2-PC path, then drive {@link #commit} or {@link #rollback} - * separately after row-persist resolves. The simpler {@link #applyInline} variant does + * separately after row-persist resolves. The simpler {@link #apply} variant does * compile + verify + commit in one call for the tick path and FILTER_ONLY REST path. + * + *

        Loader kind. All entry points take a {@link DSLClassLoaderManager.Kind} that + * tags the per-file classloader. {@link DSLClassLoaderManager.Kind#RUNTIME} for {@code + * /addOrUpdate} and tick paths; {@link DSLClassLoaderManager.Kind#BUNDLED} for the + * {@code /delete?mode=revertToBundled} path that re-installs the bundled YAML through + * this same pipeline. */ @Slf4j public final class DSLRuntimeApply { @@ -71,19 +78,20 @@ public DSLRuntimeApply(final RuleEngineRegistry engineRegistry) { } /** - * Run compile → fireSchemaChanges → verify → commit (or rollback on verify failure) inline. - * Used by the tick path and the FILTER_ONLY REST path where there is no row-persist gate - * to wait on. + * Run compile → fireSchemaChanges → verify → commit (or rollback on verify failure) all in + * one call. Used by the tick path and the FILTER_ONLY REST path where there is no row- + * persist gate to wait on. */ - public Outcome applyInline(final RuntimeRuleManagementDAO.RuntimeRuleFile file, - final Classification classification, - final ApplyInputs inputs) { + public Outcome apply(final RuntimeRuleManagementDAO.RuntimeRuleFile file, + final Classification classification, + final DSLClassLoaderManager.Kind kind, + final ApplyInputs inputs) { final RuleEngine engine = engineRegistry.forCatalog(file.getCatalog()); if (engine == null) { return Outcome.compileFailed( "no engine registered for catalog '" + file.getCatalog() + "'", null); } - return applyInlineTyped(engine, file, classification, inputs); + return applyTyped(engine, file, classification, kind, inputs); } /** @@ -93,13 +101,14 @@ public Outcome applyInline(final RuntimeRuleManagementDAO.RuntimeRuleFile file, */ public Outcome compileAndVerify(final RuntimeRuleManagementDAO.RuntimeRuleFile file, final Classification classification, + final DSLClassLoaderManager.Kind kind, final ApplyInputs inputs) { final RuleEngine engine = engineRegistry.forCatalog(file.getCatalog()); if (engine == null) { return 
Outcome.compileFailed( "no engine registered for catalog '" + file.getCatalog() + "'", null); } - return compileAndVerifyTyped(engine, file, classification, inputs); + return compileAndVerifyTyped(engine, file, classification, kind, inputs); } /** Drive {@code engine.commit} on a previously {@link #compileAndVerify}-produced outcome. */ @@ -121,12 +130,13 @@ public void rollback(final Outcome outcome) { rollbackTyped(outcome); } - private static Outcome applyInlineTyped( + private static Outcome applyTyped( final RuleEngine engine, final RuntimeRuleManagementDAO.RuntimeRuleFile file, final Classification classification, + final DSLClassLoaderManager.Kind kind, final ApplyInputs inputs) { - final Outcome step = compileAndVerifyTypedHelper(engine, file, classification, inputs); + final Outcome step = compileAndVerifyTypedHelper(engine, file, classification, kind, inputs); if (step.status != Outcome.Status.READY_TO_COMMIT) { return step; } @@ -140,19 +150,21 @@ private static Outcome compileAndVerifyTyped( final RuleEngine engine, final RuntimeRuleManagementDAO.RuntimeRuleFile file, final Classification classification, + final DSLClassLoaderManager.Kind kind, final ApplyInputs inputs) { - return compileAndVerifyTypedHelper(engine, file, classification, inputs); + return compileAndVerifyTypedHelper(engine, file, classification, kind, inputs); } private static Outcome compileAndVerifyTypedHelper( final RuleEngine engine, final RuntimeRuleManagementDAO.RuntimeRuleFile file, final Classification classification, + final DSLClassLoaderManager.Kind kind, final ApplyInputs inputs) { final C ctx = engine.newApplyContext(inputs); final CompiledDSL compiled; try { - compiled = engine.compile(file, classification, ctx); + compiled = engine.compile(file, classification, kind, ctx); } catch (final EngineCompileException ece) { log.error("runtime-rule apply: compile FAILED for {}/{}: {}", file.getCatalog(), file.getName(), ece.getMessage(), ece); diff --git 
a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java index 3bf4b259c8ea..ce6af280a0ce 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java @@ -21,34 +21,74 @@ import java.util.ArrayList; import java.util.List; import java.util.Map; +import java.util.Optional; import java.util.Set; import java.util.concurrent.locks.ReentrantLock; import java.util.function.Consumer; import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; import org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry; +import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.library.module.ModuleManager; import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyContext; import org.apache.skywalking.oap.server.receiver.runtimerule.engine.ApplyInputs; +import org.apache.skywalking.oap.server.receiver.runtimerule.engine.Classification; import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngine; import org.apache.skywalking.oap.server.receiver.runtimerule.engine.RuleEngineRegistry; import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; /** - * Destructive {@code /delete} pipeline. 
Third orchestrator alongside {@link DSLRuntimeApply} - * (NEW / FILTER_ONLY / STRUCTURAL apply) and {@link DSLRuntimeUnregister} (INACTIVE / gone-keys - * tear-down). {@code /delete} is the one endpoint that physically drops backend schema — - * {@code /inactivate} preserves it for cheap re-activation. + * {@code /delete?mode=revertToBundled} orchestrator. Third orchestrator alongside + * {@link DSLRuntimeApply} (NEW / FILTER_ONLY / STRUCTURAL apply) and + * {@link DSLRuntimeUnregister} (INACTIVE / gone-keys tear-down). * - *

<p>This orchestrator is a thin dispatcher: it acquires the per-file lock, runs the cross-
- * file ownership guard (defence-in-depth — {@code /addOrUpdate} should have caught it
- * already), and routes to {@link RuleEngine#dropBackend}. Engines that own backend
- * schema (MAL) execute the re-register-then-drop dance there; engines without backend (LAL)
- * implement the SPI method as a no-op.
+ * <p>Two paths through {@code /delete}. The REST handler chooses based on operator
+ * intent and bundled-twin presence:
+ * <ul>
+ *   <li>DEFAULT mode, no bundled twin — REST does {@code dao.delete} directly. The runtime
+ *       was already torn down locally by the prior {@code /inactivate}; the backend measure
+ *       (if any) stays as an inert artefact, matching bundled-rule deletion semantics.
+ *       This orchestrator is not involved.</li>
+ *   <li>DEFAULT mode, bundled twin exists — REST refuses with 409
+ *       {@code requires_revert_to_bundled}. Operator must opt in.</li>
+ *   <li>{@code revertToBundled} mode, bundled twin exists — REST calls
+ *       {@link #revertToBundled} and then {@code dao.delete}. This is the schema-change
+ *       path; bundled may have a different shape than runtime, so the runtime backend
+ *       must be dropped cleanly before bundled installs its own measure.</li>
+ *   <li>{@code revertToBundled} mode, no bundled twin — REST returns 400; this orchestrator
+ *       is not invoked.</li>
+ * </ul>
 *
- * <p>The caller (REST {@code /delete}) holds the per-file lock; this orchestrator re-acquires
- * it (lock is reentrant) so the implementation is correct whether called inline or from a
- * background path.
+ * <p>How the schema change happens. {@code /inactivate} cleared the engine's applied
+ * state, so a naive bundled apply has no prior state to diff against — it would just
+ * register bundled's metrics and leave any runtime-only metrics orphaned. To get the
+ * proper diff, {@link #revertToBundled} runs the steps below in order:
+ * <pre>
+ *   1. {@link RuleEngine#installRuntime} re-registers prior runtime claims under
+ *      {@code withoutSchemaChange} (no backend touch). Now the rules map points at
+ *      runtime's claim set as if it were ACTIVE again.
+ *   2. {@link DSLRuntimeApply#apply} runs the standard pipeline against the bundled YAML
+ *      with {@code Kind.BUNDLED} + {@code withSchemaChange}. Engine.compile sees the
+ *      step-1 runtime install as prior, classifies STRUCTURAL, computes the runtime→bundled
+ *      delta. Engine.commit fires {@code applier.remove(removedMetrics, withSchemaChange)}
+ *      which drops runtime-only measures via the listener chain, and the new compile
+ *      registers bundled-only measures. Bundled-shared metrics at matching shape are
+ *      reused; at differing shape, the listener cascade reshapes them via the standard
+ *      {@code allowStorageChange} contract.
+ *   3. The rules-map entry's state is reset to {@code null} so the next gone-keys
+ *      reconcile (peer ticks see the absent DAO row) treats this as boot-seeded bundled
+ *      rather than a dangling ACTIVE.
+ * </pre>
        + * + *

        The REST handler invokes {@code dao.delete} after this orchestrator returns success; + * a DAO failure leaves the local node with bundled installed and the runtime row still + * present, which the next reconciler tick reapplies as runtime — eventually consistent, + * and the operator can retry the revert. + * + *

        The orchestrator re-acquires the per-file lock (REST already holds it; the lock is + * reentrant) so the implementation is correct whether called inline or from a background + * path. */ @Slf4j public class DSLRuntimeDelete { @@ -57,71 +97,98 @@ public class DSLRuntimeDelete { private final ModuleManager moduleManager; private final Map rules; private final Consumer> alarmResetter; + private final DSLRuntimeApply dslRuntimeApply; public DSLRuntimeDelete(final RuleEngineRegistry engineRegistry, final ModuleManager moduleManager, final Map rules, - final Consumer> alarmResetter) { + final Consumer> alarmResetter, + final DSLRuntimeApply dslRuntimeApply) { this.engineRegistry = engineRegistry; this.moduleManager = moduleManager; this.rules = rules; this.alarmResetter = alarmResetter; + this.dslRuntimeApply = dslRuntimeApply; + } + + /** Outcome of a {@link #revertToBundled} call. */ + public static final class Result { + public enum Status { + /** Steps 1–3 all succeeded. Bundled is now serving locally. */ + REVERTED, + /** Step 1 succeeded; step 2 (bundled apply pipeline) failed. In practice this + * is almost always a backend-storage failure during DDL or verify — BanyanDB + * rejected the measure shape, the schema-barrier didn't propagate within the + * timeout, or the storage backend was unreachable. Bundled YAML parse and + * Javassist generation are theoretical failure modes but extremely rare + * (bundled YAML has already been loaded successfully at boot). The engine + * has self-rolled-back its partial registrations and the orchestrator has + * unregistered the step-1 runtime install, so local state matches the + * persisted INACTIVE row. The operator can retry the revert once the storage + * backend recovers. */ + BUNDLED_APPLY_FAILED, + /** Cross-file ownership guard rejected the revert. Bundled's claims overlap + * with another active bundle. 
*/ + REFUSED_CONFLICT, + /** Pre-step bookkeeping failed (no engine for catalog, MeterSystem unavailable, + * bundled YAML missing, etc.). Local state has not been mutated. */ + PRECONDITION_FAILED + } + + public final Status status; + public final String error; + + Result(final Status status, final String error) { + this.status = status; + this.error = error; + } } /** - * Discharge backend debt for the {@code (catalog, name)} bundle the REST handler is about - * to {@code /delete}. Routes to {@link RuleEngine#dropBackend} — engines that own - * backend schema do the re-register-then-drop dance; engines without backend no-op. - * - * @throws IllegalStateException if a cross-file ownership conflict is detected, or the - * engine cannot discharge its backend debt (MeterSystem unavailable, parse error in - * the inactive content). The caller (REST handler) aborts {@code dao.delete} on this - * throw — refusing to delete the row is the correct failure mode. + * Run the revert-to-bundled pipeline for {@code (catalog, name)}. The REST handler + * has already verified that the bundled YAML twin exists on disk. Returns a + * {@link Result} describing the outcome; the REST handler maps that to an HTTP + * response and decides whether to proceed with {@code dao.delete}. 
*/ - public void dropBackendForDelete(final String catalog, final String name, final String content) { + public Result revertToBundled(final String catalog, final String name, + final String runtimeContent) { final RuleEngine engine = engineRegistry.forCatalog(catalog); if (engine == null) { - log.warn("runtime-rule dslManager: no engine registered for catalog '{}' on " - + "/delete of {}/{}; skipping", catalog, catalog, name); - return; + return new Result(Result.Status.PRECONDITION_FAILED, + "no engine registered for catalog '" + catalog + "'"); + } + final Optional bundled = StaticRuleRegistry.active().find(catalog, name); + if (!bundled.isPresent()) { + return new Result(Result.Status.PRECONDITION_FAILED, + "no bundled YAML on disk for " + catalog + "/" + name); } final ReentrantLock perFile = AppliedRuleScript.lockFor(rules, catalog, name); perFile.lock(); try { - // Defence-in-depth ownership guard. /addOrUpdate's check should have prevented - // this — if a race or DAO blip slipped one through, dropping the backend resource - // here would tear down a metric another active file is still using. - final List activeConflicts = checkOwnershipConflicts(engine, catalog, name, content); + // Defence-in-depth ownership guard. After the revert, bundled's claims become + // the live claim set for this key. Refuse if any of bundled's claims are + // already owned by another active bundle — letting bundled register would + // clobber whatever the other active bundle is currently serving. + final List activeConflicts = checkOwnershipConflicts( + engine, catalog, name, bundled.get()); if (!activeConflicts.isEmpty()) { - throw new IllegalStateException( - "/delete refused for " + catalog + "/" + name + ": claim(s) " - + activeConflicts + " are now owned by another active bundle. " - + "The /addOrUpdate cross-file ownership check should have caught " - + "this; this is a safety net. 
Update or /inactivate the conflicting " - + "bundle(s) first."); - } - // The engine's dropBackend handles both modes via bundledContent: - // * null → destructive cascade (drop everything runtime claimed) - // * non-null → delta drop (only runtime-only + shape-break metrics; bundled- - // shared at matching shape is preserved for bundled to reuse on - // its synchronous reload below). - final String bundledContent = - StaticRuleRegistry.active().find(catalog, name).orElse(null); - if (bundledContent != null) { - log.info("runtime-rule /delete: bundled twin exists for {}/{} — running " - + "delta-aware cleanup (drop runtime-only / shape-break, keep bundled-shared)", - catalog, name); + return new Result(Result.Status.REFUSED_CONFLICT, + "/delete?mode=revertToBundled refused for " + catalog + "/" + name + + ": bundled claim(s) " + activeConflicts + " are owned by " + + "another active bundle. Update or /inactivate the conflicting " + + "bundle(s) first, or accept that the bundled rule is masked " + + "until they are released."); } - dropBackend(engine, catalog, name, content, bundledContent); + return runRevert(engine, catalog, name, runtimeContent, bundled.get()); } finally { perFile.unlock(); } } private List checkOwnershipConflicts(final RuleEngine engine, final String catalog, - final String name, final String content) { + final String name, final String bundledContent) { final String selfKey = DSLScriptKey.key(catalog, name); - final Set planned = engine.claimedKeys(content, catalog + "/" + name); + final Set planned = engine.claimedKeys(bundledContent, catalog + "/" + name); final List conflicts = new ArrayList<>(); for (final Map.Entry> other : engine.activeClaimsExcluding(selfKey).entrySet()) { for (final String pk : planned) { @@ -133,52 +200,64 @@ private List checkOwnershipConflicts(final RuleEngine engine, final S return conflicts; } - /** - * Synchronously reload the bundled rule into a fresh {@code static:} loader after a - * {@code /delete} of a row 
whose {@code (catalog, name)} has a bundled YAML on disk. - * The REST handler calls this so the operator's response reflects the post-delete - * reality (bundled is already serving) rather than waiting for the next tick. - * - * @return {@code true} when a bundled rule was reloaded; {@code false} when no bundled - * twin exists or the engine doesn't participate in static fall-over for this - * catalog. Errors are logged at WARN and surfaced as {@code false}. - */ - public boolean reloadBundledIfPresent(final String catalog, final String name) { - final RuleEngine engine = engineRegistry.forCatalog(catalog); - if (engine == null) { - return false; - } - if (!StaticRuleRegistry.active().find(catalog, name).isPresent()) { - return false; - } - final ReentrantLock perFile = AppliedRuleScript.lockFor(rules, catalog, name); - perFile.lock(); + /** Wildcard-capture helper. Threads the engine's typed context through the three + * steps that need a strong-typed {@code C}: install runtime, run apply, reset state. */ + private Result runRevert(final RuleEngine engine, + final String catalog, final String name, + final String runtimeContent, + final String bundledContent) { + // Step 1. Install runtime locally (no backend touch). The next step's compile + // sees this as the prior state and computes the runtime→bundled delta against it. 
+ final ApplyInputs withoutSchema = new ApplyInputs( + moduleManager, StorageManipulationOpt.withoutSchemaChange(), + alarmResetter, rules); + final C ctx = engine.newApplyContext(withoutSchema); try { - return engine.reloadStatic(catalog, name, alarmResetter, moduleManager); + engine.installRuntime(catalog, name, runtimeContent, ctx); + } catch (final IllegalStateException ise) { + return new Result(Result.Status.PRECONDITION_FAILED, + "installRuntime failed for " + catalog + "/" + name + ": " + ise.getMessage()); } catch (final Throwable t) { - log.warn("runtime-rule /delete: bundled fall-over reload failed for {}/{}; " - + "peer tick will retry via gone-keys path", catalog, name, t); - return false; - } finally { - perFile.unlock(); + log.error("runtime-rule revertToBundled: installRuntime threw for {}/{}", + catalog, name, t); + return new Result(Result.Status.PRECONDITION_FAILED, t.getMessage()); } - } - /** - * Wildcard-capture helper. Threads {@code bundledContent} through to {@link - * RuleEngine#dropBackend}: a null value triggers the destructive cascade (drop - * everything runtime had); a non-null value triggers the delta drop (drop only - * metrics runtime had that bundled doesn't claim, preserve bundled-shared at - * matching shape). fullInstall makes the listener chain run. - */ - private void dropBackend( - final RuleEngine engine, final String catalog, final String name, - final String runtimeContent, final String bundledContent) { - final ApplyInputs inputs = new ApplyInputs( - moduleManager, StorageManipulationOpt.fullInstall(), - alarmResetter, rules - ); - final C ctx = engine.newApplyContext(inputs); - engine.dropBackend(catalog, name, runtimeContent, bundledContent, ctx); + // Step 2. Run the standard apply pipeline against bundled. Engine.commit drops + // runtime-only measures via the delta, registers bundled-only measures, and + // reuses bundled-shared measures at matching shape. 
+ final RuntimeRuleManagementDAO.RuntimeRuleFile bundledFile = + new RuntimeRuleManagementDAO.RuntimeRuleFile( + catalog, name, bundledContent, /* status */ null, /* updateTime */ 0L); + final ApplyInputs withSchema = new ApplyInputs( + moduleManager, StorageManipulationOpt.withSchemaChange(), + alarmResetter, rules); + final DSLRuntimeApply.Outcome outcome = dslRuntimeApply.apply( + bundledFile, Classification.STRUCTURAL, + DSLClassLoaderManager.Kind.BUNDLED, withSchema); + if (outcome.status != DSLRuntimeApply.Outcome.Status.COMMITTED) { + log.warn("runtime-rule revertToBundled: bundled apply did not commit for {}/{}: {} ({})", + catalog, name, outcome.error, outcome.status); + // Step 1 installed runtime locally under withoutSchemaChange — handlers and + // meterPrototypes are live again. Bundled apply failed (engine self-rolled + // back its own partial registrations) but step 1 is still in place. The + // operator's intent was /inactivate (handlers OFF) followed by a failed + // revert — leaving runtime live silently violates the inactivate. Tear it + // back down so local state matches the persisted INACTIVE row. The operator + // must retry /delete?mode=revertToBundled after fixing the bundled YAML. + final ApplyInputs cleanup = new ApplyInputs( + moduleManager, StorageManipulationOpt.withoutSchemaChange(), + alarmResetter, rules); + final C cleanupCtx = engine.newApplyContext(cleanup); + engine.unregister(catalog, name, cleanupCtx); + return new Result(Result.Status.BUNDLED_APPLY_FAILED, outcome.error); + } + + // Step 3. Mark the entry boot-seeded so gone-keys reconcile leaves it alone after + // dao.delete removes the row. Without this reset, the next tick would see state + // != null + DAO row absent and tear down what we just installed. 
+ rules.computeIfPresent(DSLScriptKey.key(catalog, name), + (k, prev) -> prev.withState(null)); + return new Result(Result.Status.REVERTED, null); } } diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeUnregister.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeUnregister.java index 50177efac1ee..89de25653449 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeUnregister.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeUnregister.java @@ -43,9 +43,9 @@ * static-rule fallback, alarm reset target). The orchestrator owns the cross-DSL bookkeeping * (clearing the content side of {@link AppliedRuleScript} on success). * - *

        After a successful teardown, the engine's {@code reloadStatic} hook is invoked so any - * bundled-static rule that the now-removed runtime override was masking gets brought back into - * service via a fresh {@code static:} loader from + *

        After a successful teardown, the engine's {@code installBundled} hook is invoked so any + * bundled rule that the now-removed runtime override was masking gets brought back into + * service via a fresh {@code bundled:} loader from * {@link org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager}. * *

        {@code invokeAlarmOnRemove}. Two legitimate call modes: @@ -86,7 +86,7 @@ public boolean unregister(final String catalog, final String name, } /** - * Tear down a bundle's local registrations. {@code reloadStaticAfter} controls whether + * Tear down a bundle's local registrations. {@code installBundledAfter} controls whether * the bundled rule (if any) is reinstalled after the unregister: * *

          @@ -97,18 +97,18 @@ public boolean unregister(final String catalog, final String name, *
        • {@code true} — used by the row-gone reconcile path (a {@code /delete} cleared * the row, peer ticks observe the absence). The runtime override no longer * exists, so the bundled YAML (if any) should serve again — engines reload via - * {@link RuleEngine#reloadStatic} into a fresh {@code static:} loader.
        • + * {@link RuleEngine#installBundled} into a fresh {@code bundled:} loader. *
        * * @return {@code true} when a bundled fall-over was actually installed (caller may want * to retain the entry in the unified rules map rather than removing it); * {@code false} otherwise (no engine, no bundled twin, reload failed, or - * {@code reloadStaticAfter=false}). + * {@code installBundledAfter=false}). */ public boolean unregister(final String catalog, final String name, final boolean invokeAlarmOnRemove, final StorageManipulationOpt storageOpt, - final boolean reloadStaticAfter) { + final boolean installBundledAfter) { final RuleEngine engine = engineRegistry.forCatalog(catalog); if (engine == null) { log.warn("runtime-rule dslManager: no engine registered for catalog '{}' on " @@ -123,17 +123,20 @@ public boolean unregister(final String catalog, final String name, // Cross-DSL bookkeeping: clear the cached raw content so the next classify call sees // "no prior bundle". Engines deliberately don't touch this — it's shared between // catalogs and the orchestrator owns the lifecycle. State is preserved (set - // elsewhere — INACTIVE tombstone, NOT_LOADED, or reset by reloadStatic below). + // elsewhere — INACTIVE tombstone, NOT_LOADED, or reset by installBundled below). rules.computeIfPresent(DSLScriptKey.key(catalog, name), (k, prev) -> prev.withContent(null)); - if (!reloadStaticAfter) { + if (!installBundledAfter) { return false; } try { - return engine.reloadStatic(catalog, name, resetter, moduleManager); + // Pass the same storage opt the unregister ran under so peer-tick gone-keys + // doesn't write DDL the main has already applied (tickStorageOpt picks + // withoutSchemaChange for peers). 
+ return engine.installBundled(catalog, name, resetter, moduleManager, storageOpt); } catch (final Throwable t) { - log.warn("runtime-rule dslManager: static fall-over reload failed for {}/{}; " + log.warn("runtime-rule dslManager: bundled fall-over reload failed for {}/{}; " + "bundled rule may stay dark until a successful re-apply or restart", catalog, name, t); return false; diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/RuleSync.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/RuleSync.java index 0cba691a911c..9667aed66988 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/RuleSync.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/RuleSync.java @@ -56,7 +56,7 @@ * DSLManager#applyOneRuleFile} (which delegates to {@link DSLRuntimeApply} or {@link * DSLRuntimeUnregister} via the per-DSL drivers). Honours the per-tick * {@link StorageManipulationOpt} and the marker-debt promotion (peer that was - * localCacheOnly is now main → re-fire under fullInstall).
      • + * withoutSchemaChange is now main → re-fire under withSchemaChange). *
      • Gone-keys cleanup — anything in the snapshot that's not in the DB and not * static-shadowed gets {@link DSLRuntimeUnregister}'d. Snapshot removal is deferred * past unregister so a transient teardown failure doesn't lose the retry.
      • @@ -106,7 +106,7 @@ public RuleSync(final ModuleManager moduleManager, /** * Run the full tick body once. {@code atBoot=true} on the synchronous first tick from * {@code RuntimeRuleModuleProvider.notifyAfterCompleted}; the storage-opt picker uses this - * to choose {@code localCacheVerify} on no-init OAPs (fail boot if backend is not in shape). + * to choose {@code verifySchemaOnly} on no-init OAPs (fail boot if backend is not in shape). */ public void runOnce(final boolean atBoot) { final RuntimeRuleManagementDAO dao; @@ -220,15 +220,23 @@ private void cleanupGoneKeys(final Set seenKeys, final StorageManipulati // Map removal deferred to AFTER unregister succeeds. If unregister throws, // the entry stays so the next tick retries via the same removedKeys path. try { - // unregisterBundle with reloadStaticAfter=true: tear down the removed + // unregisterBundle with installBundledAfter=true: tear down the removed // runtime registrations, then if the rule has a bundled twin install - // it fresh via a static: loader. Returns true when a bundled fall-over - // landed — in that case we KEEP the rules entry (reloadStatic re-seeded + // it fresh via a bundled: loader. Returns true when a bundled fall-over + // landed — in that case we KEEP the rules entry (installBundled re-seeded // it as a bundled-served entry, equivalent to a boot-seeded one). // Otherwise the entry is fully gone and we remove it. - final boolean staticReloaded = unregister.unregisterBundle( - parts[0], parts[1], true, tickOpt, true); - if (!staticReloaded) { + // + // Storage opt: ALWAYS withoutSchemaChange for the unregister leg. The + // new /delete contract says default-mode (no bundled twin) leaves the + // backend as inert artefact, and the only schema-changing /delete path + // (revertToBundled) drives the schema mutation through the apply + // pipeline at REST time, not here. 
Passing the tickOpt would let a + // peer-promoted-to-main node drop the backend during gone-keys cleanup, + // contradicting the operator-facing contract. + final boolean bundledReloaded = unregister.unregisterBundle( + parts[0], parts[1], true, StorageManipulationOpt.withoutSchemaChange(), true); + if (!bundledReloaded) { rules.remove(gone); } } catch (final Throwable t) { @@ -253,7 +261,7 @@ void applyOneRuleFile(RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, String @FunctionalInterface public interface Unregister { boolean unregisterBundle(String catalog, String name, boolean invokeAlarmOnRemove, - StorageManipulationOpt storageOpt, boolean reloadStaticAfter); + StorageManipulationOpt storageOpt, boolean installBundledAfter); } /** Functional handle for per-tick storage-opt picking (init / no-init / main vs peer). */ diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StaticRuleLoader.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StaticRuleLoader.java index 3c112196f940..9871ea69508d 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StaticRuleLoader.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/StaticRuleLoader.java @@ -40,7 +40,7 @@ *
          *
        • {@link #loadAll} — boot-time load. Asks every engine to load its catalog's static * rules into the engine's internal applied state (via - * {@link RuleEngine#loadStaticRuleFile}), then seeds the shared + * {@link RuleEngine#recordBundledClaims}), then seeds the shared * {@code appliedContent} + {@code snapshot} maps so the first {@code /addOrUpdate} * classifier and the first Suspend lookup see the bundle.
        • *
        • {@link #loadIfMissing} — tick-time load. Re-loads any static rule whose DB row got @@ -67,7 +67,7 @@ * *

          DSL-agnostic. The actual per-DSL load — building a synthetic Applied artifact * with metric names (MAL) or registered-rule list (LAL) — happens behind - * {@link RuleEngine#loadStaticRuleFile}. This class only iterates {@code StaticRuleRegistry}, + * {@link RuleEngine#recordBundledClaims}. This class only iterates {@code StaticRuleRegistry}, * routes each entry to the matching engine, and updates the shared scheduler-side state on * success. */ @@ -116,16 +116,24 @@ public void loadAll() { continue; } final String content = e.getValue(); - if (!engine.loadStaticRuleFile(catalog, name, content)) { + if (!engine.recordBundledClaims(catalog, name, content)) { continue; } final String key = DSLScriptKey.key(catalog, name); final String contentHash = ContentHash.sha256Hex(content); - // Without these the first REST /addOrUpdate would classify against null prior - // content and return NEW even on a filter-only edit; the first Suspend RPC - // would lookup-miss. - rules.putIfAbsent(key, new AppliedRuleScript(catalog, name, content, - DSLRuntimeState.running(catalog, name, contentHash, nowMs))); + // recordBundledClaims has already stamped the synthetic Applied into the + // rules map (with content=null and state=null). Overlay the bundled content + // and a RUNNING state on that entry — without these, the first REST + // /addOrUpdate would classify against null prior content and return NEW even + // on a filter-only edit, and the first Suspend RPC would lookup-miss. + // putIfAbsent would no-op here because the engine already created the entry. + rules.compute(key, (k, prev) -> { + final DSLRuntimeState state = + DSLRuntimeState.running(catalog, name, contentHash, nowMs); + return prev == null + ? 
new AppliedRuleScript(catalog, name, content, state) + : prev.withContentAndState(content, state); + }); loaded++; } if (loaded > 0) { @@ -159,7 +167,7 @@ public void loadIfMissing(final Set seenKeys, final long nowMs, continue; } // Snapshot presence is the scheduler's "is this bundle tracked?" signal — engine - // ownership lives behind loadStaticRuleFile. If snapshot has the key, either the + // ownership lives behind recordBundledClaims. If snapshot has the key, either the // engine has it loaded or a runtime apply did it; either way nothing to redo. if (rules.containsKey(key)) { continue; diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/DeleteMode.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/DeleteMode.java index 2af590618dce..bdf0d2c96e41 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/DeleteMode.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/DeleteMode.java @@ -26,15 +26,19 @@ * of free-form strings. */ public enum DeleteMode { - /** No mode flag — apply the default {@code /delete} behaviour. If the rule has a - * bundled YAML on disk for {@code (catalog, name)}, the row is removed and bundled is - * reinstalled into a {@code static:} loader; backend resources are preserved. If no - * bundled twin exists, the destructive cascade fires (drops the backend resource + - * removes the row). */ + /** No mode flag — apply the default {@code /delete} behaviour. 
With no bundled twin + * on disk, the row is dropped and the backend measure (if any) is left in place as + * an inert artefact (operator-side cleanup of orphaned schemas is out of scope, same + * as for static rules removed from {@code otel-rules/}). With a bundled twin, the + * request is refused with {@code 409 requires_revert_to_bundled} so letting bundled + * silently take over the {@code (catalog, name)} requires an explicit operator + * decision. */ DEFAULT(""), - /** Operator explicitly asked to revert this rule to its bundled YAML. Identical to - * {@link #DEFAULT} when a bundled twin exists; returns {@code 400 no_bundled_twin} - * when one does not (vs {@link #DEFAULT}, which would still drop the runtime row). */ + /** Operator explicitly asked to revert this rule to its bundled YAML. Runs the + * schema-change pipeline (install runtime locally, apply bundled through the + * standard pipeline so the runtime→bundled delta drops runtime-only metrics and + * installs bundled-only ones) before removing the row. Returns {@code 400 + * no_bundled_twin} when no bundled YAML exists on disk for {@code (catalog, name)}. 
*/ REVERT_TO_BUNDLED("revertToBundled"); @Getter diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java index d7c60749ca83..3735a1acd788 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java @@ -95,7 +95,7 @@ * storage_change_requires_explicit_approval).

        • *
        • {@code /inactivate} — the soft-pause path. Broadcasts Suspend, flips the row to * INACTIVE, runs the OAP-internal teardown under - * {@link StorageManipulationOpt#localCacheOnly} via + * {@link StorageManipulationOpt#withoutSchemaChange} via * {@link DSLManager#applyNowForRuleFile}: dispatch handlers unregistered, prototypes * and Models cleared, alarm windows reset. The BanyanDB measure and its data are * explicitly preserved so reactivation via {@code /addOrUpdate} on the INACTIVE row @@ -103,16 +103,15 @@ * the same OAP-internal teardown. The inactive rule still HOLDS its metric / rule * names per the soft-pause contract — another file claiming any of those names is * rejected by the cross-file ownership guard.
        • - *
        • {@code /delete} — the destructive path. Requires the rule to already be INACTIVE - * (returns HTTP 409 {@code requires_inactivate_first} otherwise) — the two-step - * {@code /inactivate → /delete} workflow is enforced. {@code /delete} drives - * {@link DSLRuntimeDelete}: re-registers prototypes locally under - * {@code localCacheOnly} so the cascade has Models to walk, then runs the unregister - * path under {@code fullInstall} so the listener chain fires BanyanDB delete-measure - * on the live measure. Backend-drop failure aborts the row - * removal — an orphaned measure with no row left to retry is never possible. After - * the row is gone, if a static version exists on disk the rule reverts to that on - * the next dslManager tick.
        • + *
        • {@code /delete} — row removal. Requires the rule to already be INACTIVE (returns + * HTTP 409 {@code requires_inactivate_first} otherwise). Behaviour depends on whether a + * bundled twin exists and on the {@code mode} flag: default mode with no twin drops the row + * and leaves the backend measure as an inert artefact (matches bundled-rule deletion on disk); + * default mode with a bundled twin returns 409 {@code requires_revert_to_bundled} to force + * an explicit operator decision; {@code ?mode=revertToBundled} with a twin runs the + * schema-change pipeline through {@link DSLRuntimeDelete#revertToBundled} and + * reinstalls bundled before removing the row; {@code ?mode=revertToBundled} without + * a twin returns 400 {@code no_bundled_twin}.
        • *
        • {@code /list} returns an NDJSON view of every row merged with the dslManager's * per-node {@link DSLRuntimeState}. {@code /dump} streams a tar.gz of every row plus a * manifest so the entire admin surface can be backed up and restored.
        • @@ -473,8 +472,8 @@ public HttpResponse list(final String catalogFilter) { *
        • DAO row for {@code (catalog, name)} regardless of status — INACTIVE rules keep * their content under the soft-pause contract so the editor can re-edit.
        • *
        • {@link StaticRuleRegistry} fallback — bundled rules that have never been - * overridden by the operator. Returned with synthetic status {@code STATIC} and - * source {@code static}.
        • + * overridden by the operator. Returned with synthetic status {@code BUNDLED} + * and source {@code bundled}. *
        • Otherwise 404 {@code not_found}.
        • * * @@ -1452,14 +1451,16 @@ private HttpResponse runInactivePipeline(final String catalog, final String name // applyNowForRuleFile is idempotent; if the tick fires first, the second call is a // fast no-op on the matching hash. // - // SOFT-PAUSE semantics: pass {@link StorageManipulationOpt#localCacheOnly()} so the + // SOFT-PAUSE semantics: pass {@link StorageManipulationOpt#withoutSchemaChange()} so the // teardown unregisters every OAP-internal artefact (MeterSystem prototypes, // MetricsStreamProcessor entry / persistent workers, BatchQueue handlers, retired // RuleClassLoader) without firing the backend dropTable cascade. The measure / table // / index and any data already persisted under the pre-inactivate metric stay // intact — operators reactivate via {@code /addOrUpdate} and the existing data - // remains queryable through the new bundle. {@code /delete} is the only path that - // drops the backend schema. + // remains queryable through the new bundle. {@code /delete} removes the row but + // also leaves the backend schema in place (default mode); the only schema-changing + // {@code /delete} path is {@code ?mode=revertToBundled}, which runs through the + // apply pipeline so bundled's shape can replace runtime's. // // Teardown failure handling: surface as 500 teardown_deferred rather than 200 // inactivated. The DB row IS INACTIVE (persist already succeeded above) so peers @@ -1467,14 +1468,14 @@ private HttpResponse runInactivePipeline(final String catalog, final String name // completed (MalFileApplier swallowed per-metric failures, MetricsStreamProcessor // worker drain threw, etc.). Returning 200 would tell the operator "done" while // dispatch is still live; 500 + "teardown_deferred" accurately signals retriable - // state — the next dslManager tick re-runs the same localCacheOnly teardown. + // state — the next dslManager tick re-runs the same withoutSchemaChange teardown. 
final RuntimeRuleManagementDAO.RuntimeRuleFile inactiveFile = new RuntimeRuleManagementDAO.RuntimeRuleFile( catalog, name, content, RuntimeRule.STATUS_INACTIVE, rule.getUpdateTime()); try { dslManager.applyNowForRuleFile(inactiveFile, false, - StorageManipulationOpt.localCacheOnly()); + StorageManipulationOpt.withoutSchemaChange()); } catch (final Throwable t) { log.warn("runtime-rule inactivate: local teardown deferred to tick for {}/{}", catalog, name, t); @@ -1536,15 +1537,18 @@ private HttpResponse doDelete(final String catalog, final String name, private HttpResponse doDeleteLocked(final String catalog, final String name, final DeleteMode mode, final RuntimeRuleManagementDAO dao) { - // /delete is the one destructive endpoint. /inactivate is a soft-pause that runs the - // OAP-internal teardown under localCacheOnly, deliberately preserving the BanyanDB - // measure + its data so a re-activation via /addOrUpdate is cheap and lossless. - // /delete drops the backend measure first, then removes the tombstone row. + // /inactivate is a soft-pause that runs the OAP-internal teardown under + // withoutSchemaChange, deliberately preserving the BanyanDB measure + its data so + // a re-activation via /addOrUpdate is cheap and lossless. /delete then removes the + // INACTIVE row. Default mode without a bundled twin leaves the backend measure as + // an inert artefact; with a bundled twin, default mode is refused (operator must + // opt in to ?mode=revertToBundled which runs the schema-change pipeline before + // removing the row). // // The two-step workflow (/inactivate → /delete) is enforced by the INACTIVE-status - // check below: an ACTIVE rule cannot be deleted in one shot. This separation makes - // the destructive moment explicit and lets operators reverse the soft-pause for a - // bounded window before committing to data loss. + // check below: an ACTIVE rule cannot be deleted in one shot. 
This separation lets + // operators reverse the soft-pause for a bounded window before committing to row + // removal. final RuntimeRuleManagementDAO.RuntimeRuleFile prior; try { prior = findRule(dao, catalog, name); @@ -1564,7 +1568,8 @@ private HttpResponse doDeleteLocked(final String catalog, final String name, jsonBody("requires_inactivate_first", catalog, name, "rule is ACTIVE; POST /runtime/rule/inactivate first, then /runtime/rule/delete. " + "Inactivate runs the soft-pause (handlers stop dispatching; backend " - + "measure preserved); delete drops the backend measure and removes the row.")); + + "measure preserved); /delete then removes the INACTIVE row " + + "(use ?mode=revertToBundled to revert to a bundled YAML twin).")); } final boolean bundledTwinExists = @@ -1578,54 +1583,92 @@ private HttpResponse doDeleteLocked(final String catalog, final String name, "mode=revertToBundled requires a bundled YAML on disk for this " + "(catalog, name); none was found"); } - - // Backend drop. /inactivate preserved the BanyanDB measure under localCacheOnly; - // discharge that debt now via the dslManager before the row goes away. The - // orchestrator skips the destructive cascade when a bundled twin exists (bundled - // will reuse the backend resource on the synchronous reload below). LAL has no - // backend schema so the call is a no-op for the lal catalog. A throw here aborts - // the row deletion — we do NOT proceed with dao.delete on backend-drop failure: - // that would orphan the measure with no way to find it again. - try { - dslManager.getDslRuntimeDelete().dropBackendForDelete(catalog, name, prior.getContent()); - } catch (final IllegalStateException refused) { - // Cross-file ownership conflict /addOrUpdate's guard didn't catch. Surface as - // 409 so the operator sees a clear "fix and retry" signal rather than 500. 
- log.warn("runtime-rule /delete refused for {}/{}: {}", catalog, name, refused.getMessage()); + if (mode == DeleteMode.DEFAULT && bundledTwinExists) { + // Refuse the implicit revert. A bundled YAML on disk would silently take over + // the (catalog, name) the moment the runtime row goes away — that's a + // semantically meaningful state change and we want the operator to have + // declared it. Two legitimate continuations: (a) re-issue /delete with + // ?mode=revertToBundled to fall back to bundled, or (b) leave the rule + // INACTIVE (the soft-pause state from the previous /inactivate), which keeps + // both runtime and bundled effectively off until the operator re-activates. return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, - jsonBody("delete_refused", catalog, name, refused.getMessage())); - } catch (final Throwable t) { - log.error("runtime-rule /delete: backend drop threw for {}/{}", catalog, name, t); - return serverError("delete_backend_drop_failed", catalog, name, t.getMessage()); - } + jsonBody("requires_revert_to_bundled", catalog, name, + "a bundled YAML twin exists for this (catalog, name); deleting the " + + "runtime row would let bundled take over without an explicit " + + "operator decision. Re-issue with ?mode=revertToBundled to " + + "fall back to the bundled rule, or leave the row INACTIVE " + + "(soft-pause) to keep the rule off.")); + } + + if (mode == DeleteMode.REVERT_TO_BUNDLED) { + // Bundled-revert path is the schema-change path: bundled may have a different + // shape than runtime. The orchestrator runs the unified pipeline: + // (1) installRuntime to put prior runtime claims back locally, + // (2) apply(bundled, STRUCTURAL, BUNDLED, withSchemaChange) — engine.commit + // drops runtime-only metrics through the standard delta path, + // (3) reset rules-map state to boot-seeded so gone-keys reconcile leaves + // it alone after dao.delete. 
+ // dao.delete only runs after revertToBundled returns REVERTED — a precondition + // or compile failure aborts the row deletion so the operator can retry. + final DSLRuntimeDelete.Result revert; + try { + revert = dslManager.getDslRuntimeDelete() + .revertToBundled(catalog, name, prior.getContent()); + } catch (final Throwable t) { + log.error("runtime-rule /delete: revertToBundled threw for {}/{}", catalog, name, t); + return serverError("revert_to_bundled_failed", catalog, name, t.getMessage()); + } + switch (revert.status) { + case REFUSED_CONFLICT: + log.warn("runtime-rule /delete refused for {}/{}: {}", catalog, name, revert.error); + return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, + jsonBody("delete_refused", catalog, name, revert.error)); + case PRECONDITION_FAILED: + log.error("runtime-rule /delete: revertToBundled precondition failed for {}/{}: {}", + catalog, name, revert.error); + return serverError("revert_to_bundled_precondition_failed", catalog, name, revert.error); + case BUNDLED_APPLY_FAILED: + log.error("runtime-rule /delete: bundled apply failed for {}/{}: {}", + catalog, name, revert.error); + return serverError("revert_to_bundled_failed", catalog, name, + "bundled apply failed (typically a storage-backend DDL/verify " + + "issue — BanyanDB unreachable, shape rejection, or schema-" + + "barrier timeout). The orchestrator unwound the step-1 " + + "runtime install so local state matches the persisted " + + "INACTIVE row. Retry once storage recovers. 
Cause: " + + revert.error); + case REVERTED: + default: + break; + } + try { + dao.delete(catalog, name); + } catch (final IOException e) { + log.error("failed to delete runtime rule {}/{}", catalog, name, e); + return serverError("delete_failed", catalog, name, e.getMessage()); + } + return ok(HttpStatus.OK, "reverted_to_bundled", catalog, name, + "runtime row removed; bundled rule installed via apply pipeline (schema " + + "change handled by the standard delta path); peers converge on next tick"); + } + + // No-bundled-twin DEFAULT path. /inactivate already tore down local handlers under + // withoutSchemaChange, so the runtime rule is no longer dispatching. The backend measure + // (if any) is left in place — it becomes an inert schema artefact, matching the + // bundled-rule deletion semantics (removing a YAML from otel-rules/ on disk doesn't + // drop its measure either). Operators who want backend cleanup must purge the + // measure out-of-band; this endpoint never re-installs the runtime DSL just to + // tear it down again. try { dao.delete(catalog, name); } catch (final IOException e) { log.error("failed to delete runtime rule {}/{}", catalog, name, e); return serverError("delete_failed", catalog, name, e.getMessage()); } - - // Synchronously reload the bundled rule (if any) so the operator's response - // reflects the post-delete reality — bundled is already serving via a static: - // loader on this node. Peer nodes converge via the gone-keys reconcile path on - // their next tick. A reload failure is logged and surfaced as a partial-success - // response (200 with applyStatus=reverted_to_bundled_partial) — the row is gone, - // the operator's intent landed, but bundled didn't compile cleanly on this node. - if (bundledTwinExists) { - final boolean reloaded = dslManager.getDslRuntimeDelete() - .reloadBundledIfPresent(catalog, name); - return ok(HttpStatus.OK, - reloaded ? 
"reverted_to_bundled" : "reverted_to_bundled_partial", - catalog, name, - reloaded - ? "runtime row removed; bundled rule reinstalled into a static: loader " - + "on this node; peers converge on next tick" - : "runtime row removed; bundled reload deferred (compile failed or " - + "engine unavailable); peers will retry via the gone-keys " - + "reconcile on their next tick"); - } return ok(HttpStatus.OK, "deleted", catalog, name, - "backend measure dropped, runtime row removed from storage; rule is fully gone"); + "runtime row removed; local handlers were already unregistered by /inactivate; " + + "any backend schema this rule installed is left in place as an inert " + + "artefact (drop manually if needed)"); } /** diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java index c07c74f77512..726b1751aee2 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java @@ -136,7 +136,7 @@ void removeCallsMeterSystemPerName() { // The inverse side of the contract: on unregister every metric name the prior apply // recorded must flow to MeterSystem.removeMetric. The dslManager relies on this to // drain L1/L2 handlers + drop the BanyanDB measure. The applier's no-opt overload - // delegates to the opt-aware removeMetric with fullInstall(), which is what we + // delegates to the opt-aware removeMetric with withSchemaChange(), which is what we // verify here. 
final Set<String> names = setOf("meter_a", "meter_b", "meter_c"); applier.remove(names); diff --git a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java index c2e61548c56d..ac1fd1880b1c 100644 --- a/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java +++ b/oap-server/server-receiver-plugin/skywalking-runtime-rule-receiver-plugin/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java @@ -287,10 +287,10 @@ void deleteIsIdempotentOnAbsentRow() throws Exception { void inactivateUsesLocalCacheOnlySoBackendSchemaIsPreserved() throws Exception { // Soft-pause contract: /inactivate must drive the local teardown via the // applyNowForRuleFile overload that takes a StorageManipulationOpt — and that opt - // must be localCacheOnly(). The localCacheOnly path makes per-backend + // must be withoutSchemaChange(). The withoutSchemaChange path makes per-backend // whenRemoving record SKIPPED_NOT_ALLOWED instead of firing dropTable, so the // BanyanDB measure / JDBC table / ES index plus stored data survive the pause. - // /delete is the only path that drops backend schema (still uses fullInstall()). + // /delete is the only path that drops backend schema (still uses withSchemaChange()). 
final String yaml = minimalMalYaml(); whenDaoHasRow("otel-rules", "vm", yaml, RuntimeRule.STATUS_ACTIVE); whenReconcilerApplySucceeds("otel-rules", "vm"); @@ -304,10 +304,10 @@ void inactivateUsesLocalCacheOnlySoBackendSchemaIsPreserved() throws Exception { assertHttpStatus(resp, HttpStatus.OK); // Verify the soft-pause path was taken: 3-arg overload with deferCommit=false and - // a localCacheOnly opt. The destructive 2-arg overload (which would mean - // fullInstall and a dropTable cascade) must NOT have fired. + // a withoutSchemaChange opt. The destructive 2-arg overload (which would mean + // withSchemaChange and a dropTable cascade) must NOT have fired. verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(false), - Mockito.argThat(opt -> opt != null && opt.isLocalCacheOnly())); + Mockito.argThat(opt -> opt != null && opt.isWithoutSchemaChange())); verify(dslManager, never()).applyNowForRuleFile(any()); verify(dslManager, never()).applyNowForRuleFile(any(), Mockito.anyBoolean()); } diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java index 7fc63ec649cb..abf4f288e392 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java @@ -146,7 +146,7 @@ public InstallInfo isExists(Model model, StorageManipulationOpt opt) throws Stor } else { // Run shape-compat checks unless we're in the legacy no-init poll loop // path. 
failOnAbsence implies the caller wants strict verification even - // in non-init mode (LOCAL_CACHE_VERIFY), so honour that instead of just + // in non-init mode (VERIFY_SCHEMA_ONLY), so honour that instead of just // gating on RunningMode. final boolean runShapeChecks = !RunningMode.isNoInitMode() || opt.getFlags().isFailOnAbsence(); if (model.isTimeSeries()) { @@ -248,7 +248,7 @@ private void fenceOnRevision(final BanyanDBClient client, final StorageManipulat public void createTable(Model model) throws StorageException { // Legacy entry point preserved for binary compatibility; orchestrator calls // the opt-aware overload. - createTable(model, StorageManipulationOpt.fullInstall()); + createTable(model, StorageManipulationOpt.withSchemaChange()); } @Override @@ -386,7 +386,7 @@ public void createTable(Model model, StorageManipulationOpt opt) throws StorageE public void dropTable(Model model) throws StorageException { // Legacy entry point: delegate to opt-aware overload with a default opt so // existing callers don't need to construct one. - dropTable(model, StorageManipulationOpt.fullInstall()); + dropTable(model, StorageManipulationOpt.withSchemaChange()); } @Override @@ -764,9 +764,9 @@ private long defineIndexRuleBinding(List indexRules, /** * Check if the measure exists and, when the live shape differs from the intended shape, - * either update it (on-demand operator workflow — {@link StorageManipulationOpt#isFullInstall()}) + * either update it (on-demand operator workflow — {@link StorageManipulationOpt#isWithSchemaChange()}) * or skip the update and record {@link StorageManipulationOpt.Outcome#SKIPPED_SHAPE_MISMATCH} - * (static boot workflow — {@link StorageManipulationOpt#isCreateIfAbsent()}). Boot MUST + * (static boot workflow — {@link StorageManipulationOpt#isSchemaCreateIfAbsent()}). Boot MUST * NOT reshape the backend — reshape is an explicit operator action only. 
*/ private void checkMeasure(Measure measure, BanyanDBClient client, StorageManipulationOpt opt) throws BanyanDBException { @@ -893,7 +893,7 @@ private void checkProperty(Property property, BanyanDBClient client, StorageMani /** * Check if the index rules exist and update them if necessary. In - * {@link StorageManipulationOpt#isLocalCacheVerify() verify} mode the writes are + * {@link StorageManipulationOpt#isVerifySchemaOnly() verify} mode the writes are * skipped and a {@link StorageManipulationOpt.Outcome#SKIPPED_SHAPE_MISMATCH} is * recorded instead — the orchestrator promotes that to a fatal boot error. */ @@ -947,7 +947,7 @@ private void checkIndexRules(String modelName, List indexRules, Banya /** * Check if the index rule binding exists and update it if necessary. In - * {@link StorageManipulationOpt#isLocalCacheVerify() verify} mode skip the write and + * {@link StorageManipulationOpt#isVerifySchemaOnly() verify} mode skip the write and * record {@link StorageManipulationOpt.Outcome#SKIPPED_SHAPE_MISMATCH}. */ private void checkIndexRuleBinding(List indexRules, @@ -1022,7 +1022,7 @@ private void checkIndexRuleBinding(List indexRules, /** * Check if the TopN aggregation exists and update it if necessary. * If the TopN rules are not used, will be checked and deleted after install, in the `BanyanDBStorageProvider.notifyAfterCompleted()`. - * In {@link StorageManipulationOpt#isLocalCacheVerify() verify} mode skip the write + * In {@link StorageManipulationOpt#isVerifySchemaOnly() verify} mode skip the write * and record {@link StorageManipulationOpt.Outcome#SKIPPED_SHAPE_MISMATCH}. 
*/ private void checkTopNAggregation(Model model, BanyanDBClient client, StorageManipulationOpt opt) throws BanyanDBException { From e52a03cbbc93ec6980503e1bb7372748437ecc0c Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Thu, 30 Apr 2026 07:47:51 +0800 Subject: [PATCH 3/5] Document receiver-runtime-rule config in vocabulary doc MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit application.yml gained a new receiver-runtime-rule provider block (port 17128, disabled by default, no authentication). configuration-vocabulary.md was missing this entry — add it with the nine env-var-backed knobs (selector, REST host / port / context path / idle timeout / accept queue / max header size, reconciler interval, self-heal threshold). --- docs/en/setup/backend/configuration-vocabulary.md | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/docs/en/setup/backend/configuration-vocabulary.md b/docs/en/setup/backend/configuration-vocabulary.md index a1ef39c2aa4e..6effdea9dbeb 100644 --- a/docs/en/setup/backend/configuration-vocabulary.md +++ b/docs/en/setup/backend/configuration-vocabulary.md @@ -348,6 +348,15 @@ It divided into several modules, each of which has its own settings. The followi | - | - | enableTLS | Indicate if enable HTTPS for the server | SW_RECEIVER_AWS_FIREHOSE_HTTP_ENABLE_TLS | false | | - | - | tlsKeyPath | TLS key path | SW_RECEIVER_AWS_FIREHOSE_HTTP_TLS_KEY_PATH | | | - | - | tlsCertChainPath | TLS certificate chain path | SW_RECEIVER_AWS_FIREHOSE_HTTP_TLS_CERT_CHAIN_PATH | | +| receiver-runtime-rule | default | - | Hot-update admin endpoint for MAL / LAL rule files. **Disabled by default**: leave the selector empty to keep the provider unloaded; set `SW_RECEIVER_RUNTIME_RULE=default` to enable. The endpoint has **no authentication** in this iteration — gateway-protect with IP allow-lists and never expose it to the public internet. See [Runtime Rule Hot-Update API](backend-runtime-rule-api.md). 
| SW_RECEIVER_RUNTIME_RULE | (empty — disabled) | +| - | - | restHost | Binding IP of the runtime-rule admin endpoint. | SW_RECEIVER_RUNTIME_RULE_REST_HOST | 0.0.0.0 | +| - | - | restPort | Binding port of the runtime-rule admin endpoint. | SW_RECEIVER_RUNTIME_RULE_REST_PORT | 17128 | +| - | - | restContextPath | Web context path of the runtime-rule admin endpoint. | SW_RECEIVER_RUNTIME_RULE_REST_CONTEXT_PATH | / | +| - | - | restIdleTimeOut | Connector idle timeout of the runtime-rule admin endpoint (in milliseconds). | SW_RECEIVER_RUNTIME_RULE_REST_IDLE_TIMEOUT | 30000 | +| - | - | restAcceptQueueSize | ServerSocketChannel backlog of the runtime-rule admin endpoint. | SW_RECEIVER_RUNTIME_RULE_REST_QUEUE_SIZE | 0 | +| - | - | httpMaxRequestHeaderSize | Maximum length of all HTTP/1 request headers accepted by the admin endpoint (bytes). | SW_RECEIVER_RUNTIME_RULE_HTTP_MAX_REQUEST_HEADER_SIZE | 8192 | +| - | - | reconcilerIntervalSeconds | Period (seconds) of the cluster reconcile tick. Each node periodically re-reads stored rules and reconciles its local state against the DAO. | SW_RECEIVER_RUNTIME_RULE_RECONCILER_INTERVAL_SECONDS | 30 | +| - | - | selfHealThresholdSeconds | Time (seconds) a SUSPENDED rule waits before the self-heal backstop forces it back to RUNNING. | SW_RECEIVER_RUNTIME_RULE_SELF_HEAL_THRESHOLD_SECONDS | 60 | | ai-pipeline | default | | | | | - | - | uriRecognitionServerAddr | The address of the URI recognition server. | SW_AI_PIPELINE_URI_RECOGNITION_SERVER_ADDR | - | | - | - | uriRecognitionServerPort | The port of the URI recognition server. 
| SW_AI_PIPELINE_URI_RECOGNITION_SERVER_PORT | 17128 | From 8a8eb10d0420b1f83b15a39f9f1718a53b2e35b5 Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Thu, 30 Apr 2026 07:50:32 +0800 Subject: [PATCH 4/5] Security notice: explicitly cover log data and validation-layer guidance MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reviewer feedback: 'telemetry' previously read as numeric metrics + response times only, but the same trust model applies to log lines — and log payloads are far more likely to carry attacker-controllable text (URIs, headers, exception messages from poisoned input) than numeric samples. - Rewrite the trust paragraph to say 'metrics, traces, and logs' explicitly and call out log data as the most common XSS/RCE vector landing in OAP/UI. - Add an explicit policy item recommending operators build a gateway / sidecar / service-mesh validation layer between agents and OAP. Several security vendors ship this; OAP does not validate telemetry itself. --- docs/en/security/README.md | 25 +++++++++++++++++++------ 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/docs/en/security/README.md b/docs/en/security/README.md index 1dfb7ace34d1..7907009a192f 100644 --- a/docs/en/security/README.md +++ b/docs/en/security/README.md @@ -4,8 +4,13 @@ The SkyWalking OAP server, UI, and agent deployments should run in a secure envi OAP server, UI, and agent deployments should only be reachable by the operation team on default deployment. -All telemetry data are trusted. The OAP server **would not validate any field** of the telemetry data to avoid extra -load for the server. +All telemetry data — including **metrics, traces, and logs** — are trusted. The OAP server +**would not validate any field** of the telemetry data to avoid extra load for the server. 
+Log data deserves explicit attention here: unlike numeric metric or response-time samples, +log lines are free-form text emitted by applications and routinely contain +attacker-controllable fragments (request URIs, query strings, headers, stack traces from +poisoned input). Treat log payloads as untrusted at the same level as raw HTTP request +bodies. It is up to the operator(OPS team) whether to expose the OAP server, UI, or some agent deployment to unsecured environment. @@ -15,11 +20,19 @@ The following security policies should be considered to add to secure your SkyWa 2. Set up TOKEN or username/password based authentications for the OAP server and UI through your Gateway. 3. Validate all fields of the traceable RPC(including HTTP 1/2, MQ) headers(header names are `sw8`, `sw8-x` and `sw8-correlation`) when requests are from out of the trusted zone. Or simply block/remove those headers unless you are using the client-js agent. -4. All fields of telemetry data(HTTP in raw text or encoded Protobuf format) should be validated and reject malicious - data. +4. All fields of telemetry data — metrics, traces, **and logs** (HTTP in raw text or encoded + Protobuf format) — should be validated and reject malicious data. Log fields in particular + carry attacker-controllable text (request URIs, headers, exception messages from poisoned + input) and are the most common vector for XSS/RCE payloads landing in the OAP and UI. +5. **Build a validation layer between agents and OAP.** The recommended deployment shape is an + operator-controlled gateway / sidecar / service mesh between agents and OAP that + authenticates the source, enforces rate limits, and validates / sanitises telemetry — + metrics, traces, and logs alike — before forwarding to OAP. Several security vendors offer + commercial implementations of this layer; the OAP itself does not perform that validation. 
-Without these protections, an attacker could embed executable Javascript code in those fields, causing XSS or even -Remote Code Execution (RCE) issues. +Without these protections, an attacker could embed executable Javascript code in those fields, +causing XSS or even Remote Code Execution (RCE) issues. Log data is especially exposed +because applications routinely emit raw user input verbatim into log messages. For some sensitive environment, consider to limit the telemetry report frequency in case of DoS/DDoS for exposed OAP and UI services. From 2d5784d5c6f88d335d2ebab8c412e0ff23023ebc Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Thu, 30 Apr 2026 07:57:48 +0800 Subject: [PATCH 5/5] Security notice: validate every field of every telemetry category MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reviewer feedback (apache/skywalking-mailing-list, 2026-04-29): the existing 'all telemetry data should be validated' wording read as numeric-metric-only to some readers. Make explicit that the validation contract covers every field of every category — metrics (names + label keys + values), traces (span names / tags / span logs / endpoints), logs (body + structured fields), profiling results, HTTP capture/debugging dumps, and any future telemetry surface. Add the operator-facing recommendation to deploy a gateway / sidecar / service-mesh validation layer between agents and OAP as a security enhancement (several security vendors ship this; OAP does not validate telemetry itself). Frame the bullet list as examples, not an enumeration: the rule is 'validate every field,' not 'validate the ones we enumerated here.' 
--- docs/en/security/README.md | 54 ++++++++++++++++++++++++-------------- 1 file changed, 34 insertions(+), 20 deletions(-) diff --git a/docs/en/security/README.md b/docs/en/security/README.md index 7907009a192f..f0cfaf8a7cbc 100644 --- a/docs/en/security/README.md +++ b/docs/en/security/README.md @@ -4,13 +4,26 @@ The SkyWalking OAP server, UI, and agent deployments should run in a secure envi OAP server, UI, and agent deployments should only be reachable by the operation team on default deployment. -All telemetry data — including **metrics, traces, and logs** — are trusted. The OAP server -**would not validate any field** of the telemetry data to avoid extra load for the server. -Log data deserves explicit attention here: unlike numeric metric or response-time samples, -log lines are free-form text emitted by applications and routinely contain -attacker-controllable fragments (request URIs, query strings, headers, stack traces from -poisoned input). Treat log payloads as untrusted at the same level as raw HTTP request -bodies. +All telemetry data are trusted. The OAP server **would not validate any field** of the +telemetry data to avoid extra load for the server. **Every field of every telemetry +category should be validated by the operator before it reaches OAP** — none are +inherently safer than the others. + +Examples of surfaces that routinely carry attacker-controllable strings (non-exhaustive): + +- **Metrics**: metric names, label keys, label values. +- **Traces**: span operation names, span tags (keys and values), span logs / events, + endpoint and peer identifiers. +- **Logs**: log body, structured fields. +- **Profiling**: profiling results (eBPF / async-profiler / JFR samples), captured stack + frames, symbol names. +- **HTTP capture**: HTTP request and response bodies, headers, query + strings, and dumps collected by agent-side body-capture profiling plugins. 
+ +A request URI, a header value, an exception message from poisoned input, or any other +free-form string an instrumented application happens to attach to any of the above will +reach OAP and the UI verbatim. The list grows with every new feature; the operator +contract is "validate everything," not "validate this enumerated set." It is up to the operator(OPS team) whether to expose the OAP server, UI, or some agent deployment to unsecured environment. @@ -20,19 +33,20 @@ The following security policies should be considered to add to secure your SkyWa 2. Set up TOKEN or username/password based authentications for the OAP server and UI through your Gateway. 3. Validate all fields of the traceable RPC(including HTTP 1/2, MQ) headers(header names are `sw8`, `sw8-x` and `sw8-correlation`) when requests are from out of the trusted zone. Or simply block/remove those headers unless you are using the client-js agent. -4. All fields of telemetry data — metrics, traces, **and logs** (HTTP in raw text or encoded - Protobuf format) — should be validated and reject malicious data. Log fields in particular - carry attacker-controllable text (request URIs, headers, exception messages from poisoned - input) and are the most common vector for XSS/RCE payloads landing in the OAP and UI. -5. **Build a validation layer between agents and OAP.** The recommended deployment shape is an - operator-controlled gateway / sidecar / service mesh between agents and OAP that - authenticates the source, enforces rate limits, and validates / sanitises telemetry — - metrics, traces, and logs alike — before forwarding to OAP. Several security vendors offer - commercial implementations of this layer; the OAP itself does not perform that validation. - -Without these protections, an attacker could embed executable Javascript code in those fields, -causing XSS or even Remote Code Execution (RCE) issues. 
Log data is especially exposed -because applications routinely emit raw user input verbatim into log messages. +4. **All fields of telemetry data should be validated and rejected when malicious** — in + both HTTP raw-text and encoded Protobuf transports. The scope is every category an + agent can emit (metrics, traces, logs, profiling results, HTTP capture / debugging + dumps, and any future telemetry surface), and every field within each category. Treat + the list above as examples; the rule is "validate every field," not "validate the + ones we enumerated." None of these surfaces are inherently safer than the others. +5. **Build a validation layer between agents and OAP** as a security enhancement. The + recommended deployment shape is an operator-controlled gateway / sidecar / service mesh + that authenticates the source, enforces rate limits, and validates / sanitises every + telemetry category before forwarding to OAP. Several security vendors offer commercial + implementations of this layer; the OAP itself does not perform that validation. + +Without these protections, an attacker could embed executable JavaScript code in any of +those fields, causing XSS or even Remote Code Execution (RCE) issues. For some sensitive environment, consider to limit the telemetry report frequency in case of DoS/DDoS for exposed OAP and UI services.
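
The "validate every field" rule in policy items 4 and 5 above could be sketched roughly as below. This is a minimal illustration only, not part of this patch or of SkyWalking: `TelemetryFieldValidator`, `check`, `checkAll`, and the `MAX_FIELD_LENGTH` limit are hypothetical names and values chosen for the example, and a real gateway or sidecar would need rules tuned per telemetry category (a log body, for instance, legitimately contains newlines that this strict check rejects).

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for an operator-owned gateway or sidecar.
// Not a SkyWalking API; all names and limits here are assumptions.
public final class TelemetryFieldValidator {
    // Assumed upper bound for a single field; tune per deployment.
    private static final int MAX_FIELD_LENGTH = 1024;

    private TelemetryFieldValidator() {
    }

    /** Returns a rejection reason, or empty when the value passes every check. */
    public static Optional<String> check(final String value) {
        if (value == null) {
            return Optional.of("null value");
        }
        if (value.length() > MAX_FIELD_LENGTH) {
            return Optional.of("longer than " + MAX_FIELD_LENGTH + " characters");
        }
        for (int i = 0; i < value.length(); i++) {
            final char c = value.charAt(i);
            // Control characters (including newline and tab) are rejected outright.
            if (Character.isISOControl(c)) {
                return Optional.of("control character at index " + i);
            }
            // Angle brackets are the simplest marker of embedded markup or script.
            if (c == '<' || c == '>') {
                return Optional.of("markup character at index " + i);
            }
        }
        return Optional.empty();
    }

    /** Checks every key and every value; returns the first failure found. */
    public static Optional<String> checkAll(final Map<String, String> fields) {
        for (final Map.Entry<String, String> entry : fields.entrySet()) {
            final Optional<String> keyFailure = check(entry.getKey());
            if (keyFailure.isPresent()) {
                return Optional.of("key rejected: " + keyFailure.get());
            }
            final Optional<String> valueFailure = check(entry.getValue());
            if (valueFailure.isPresent()) {
                return Optional.of("value rejected: " + valueFailure.get());
            }
        }
        return Optional.empty();
    }
}
```

A gateway built along these lines would call `checkAll` on span tags or metric labels and drop the report when it returns a reason, so keys get the same treatment as values, in line with "none are inherently safer than the others."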