Service CPU optimisations from JFR profiling #2440
Merged
thjaeckle merged 5 commits on May 12, 2026
Five independent optimisations on hot paths observed in a 60s JFR
recording of the things service (~22% CPU per pod). Together they
remove the dominant Java-level hotspots and the bulk of allocation
pressure while preserving all observable semantics.
PolicyEnforcerCache (enforcement-dispatcher hot path):
Replace the O(N) full-map scan in deregisterImportMappings with an
O(k) forward-index lookup (importingPolicyId -> imported set), and
short-circuit null PolicyIds in AbstractEnforcerActor.loadPolicyEnforcer
so the provider is bypassed entirely. The null-id WARN in
CachingPolicyEnforcerProvider / DefaultPolicyEnforcerProvider is
downgraded to DEBUG (was producing ~2 200 log lines/minute).
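The forward-index idea can be sketched as follows. This is a minimal illustration, not Ditto's actual `PolicyEnforcerCache`: the class name `ImportIndexSketch`, its method names, and the use of plain `String` ids are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the import bookkeeping; names are illustrative only.
final class ImportIndexSketch {

    // Reverse index: imported policy id -> ids of the policies importing it.
    private final Map<String, Set<String>> importedToImporting = new HashMap<>();
    // Forward index: importing policy id -> ids of the policies it imports.
    private final Map<String, Set<String>> importingToImported = new HashMap<>();

    void register(String importingId, Set<String> importedIds) {
        deregister(importingId); // drop stale mappings first
        if (importedIds.isEmpty()) {
            return;
        }
        importingToImported.put(importingId, new HashSet<>(importedIds));
        for (String importedId : importedIds) {
            importedToImporting.computeIfAbsent(importedId, k -> new HashSet<>()).add(importingId);
        }
    }

    // O(k) in the number of imports of this one policy; no scan over the whole reverse index.
    void deregister(String importingId) {
        Set<String> imported = importingToImported.remove(importingId);
        if (imported == null) {
            return;
        }
        for (String importedId : imported) {
            Set<String> importers = importedToImporting.get(importedId);
            if (importers != null) {
                importers.remove(importingId);
                if (importers.isEmpty()) {
                    importedToImporting.remove(importedId); // keep the reverse index free of empty sets
                }
            }
        }
    }

    Set<String> importersOf(String importedId) {
        return importedToImporting.getOrDefault(importedId, Set.of());
    }

    int reverseIndexSize() {
        return importedToImporting.size();
    }
}
```

The cost of the extra map is one additional set per importing policy; in exchange, deregistration touches only the entries that policy actually contributed to.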
JacksonSerializationContext (Pekko-remote dispatcher):
Reuse a single static CBORFactory in the package-private constructor
instead of allocating a fresh one per serialisation. Removes the top
byte[]/int[] allocation pressure (Jackson symbol-table canonicalizer).
DefaultDittoJsonHandler.parseToIntegerOrLong:
Stop using NumberFormatException as control flow. Parse once as long
and downcast iff in int range. Profile recorded ~113 NFE/s, mostly
from Publisher.deserializeGroupedHashes where 64-bit hashes always
overflow Integer.
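A minimal sketch of the parse-once approach, assuming the method receives a decimal string and returns a boxed `Integer` or `Long`. The class name `IntegerOrLongSketch` and the exact signature are assumptions; the real `DefaultDittoJsonHandler` method may differ.

```java
// Hypothetical illustration of "parse once as long, downcast iff in int range".
final class IntegerOrLongSketch {

    static Number parseToIntegerOrLong(String text) {
        // A single parse. This still throws NumberFormatException for
        // non-numeric input or values outside the long range, but no longer
        // for every value that merely exceeds the int range.
        long value = Long.parseLong(text);
        if (value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE) {
            return Integer.valueOf((int) value);
        }
        return Long.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(parseToIntegerOrLong("42").getClass().getSimpleName());
        System.out.println(parseToIntegerOrLong("9007199254740993").getClass().getSimpleName());
    }
}
```

The two explicit `valueOf` calls avoid a conditional expression, which would promote both branches to `long` and always yield a `Long`.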
JavaStringToEscapedJsonString.apply:
Replace char-by-char append + replace loop with a no-escape fast path
and bulk StringBuilder.append(CharSequence, start, end) runs. Same
RFC 8259 output, large reduction in StringBuilder work for the
remote-dispatcher header-parsing path.
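The fast path plus bulk-append pattern might look roughly like this. It is a sketch under assumptions: the class `JsonEscapeSketch` is hypothetical, it escapes only what RFC 8259 requires (quotation mark, reverse solidus, control characters below U+0020), and it does not add surrounding quotes, which the real class may or may not do.

```java
// Hypothetical illustration of a no-escape fast path with bulk appends.
final class JsonEscapeSketch {

    static String escape(String s) {
        int first = firstEscapeIndex(s);
        if (first < 0) {
            return s; // fast path: nothing to escape, no StringBuilder at all
        }
        StringBuilder sb = new StringBuilder(s.length() + 16);
        int runStart = 0;
        for (int i = first; i < s.length(); i++) {
            String replacement = replacementFor(s.charAt(i));
            if (replacement != null) {
                sb.append(s, runStart, i); // bulk-copy the unescaped run before this char
                sb.append(replacement);
                runStart = i + 1;
            }
        }
        sb.append(s, runStart, s.length()); // trailing unescaped run
        return sb.toString();
    }

    private static int firstEscapeIndex(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (replacementFor(s.charAt(i)) != null) {
                return i;
            }
        }
        return -1;
    }

    // Returns the escape sequence for c, or null if c can be copied verbatim.
    private static String replacementFor(char c) {
        switch (c) {
            case '"': return "\\\"";
            case '\\': return "\\\\";
            case '\b': return "\\b";
            case '\f': return "\\f";
            case '\n': return "\\n";
            case '\r': return "\\r";
            case '\t': return "\\t";
            default:
                return c < 0x20 ? String.format("\\u%04x", (int) c) : null;
        }
    }
}
```

For header values that rarely contain characters needing escapes, the common case returns the input unchanged and allocates nothing.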
Pekko dispatchers (thread churn):
Add the missing thread-pool-executor block to ask-with-retry-dispatcher
with allow-core-timeout = off (Pekko's default-dispatcher fallback was
letting core threads die after 60s idle, causing ~2 thread starts/s).
Raise parallelism-min 4 -> 8 on wot-dispatcher and
wot-dispatcher-cache-loader to match observed sustained demand. All
tunables exposed via env-var overrides; things-deployment Helm chart
templates the WoT-dispatcher knobs through new
things.config.dispatchers.{wot,wotCacheLoader} values.
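The thread churn described above can be reproduced with a plain JDK `ThreadPoolExecutor`. This is an illustration under assumptions rather than Pekko code: it assumes the `allow-core-timeout` setting corresponds to `ThreadPoolExecutor.allowCoreThreadTimeOut`, and the class `CoreTimeoutSketch`, the pool size of 1, and the shortened 20 ms keep-alive are chosen only to keep the demo fast.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: counts how many threads a pool starts across idle gaps.
final class CoreTimeoutSketch {

    static int threadStarts(boolean allowCoreTimeout) {
        AtomicInteger started = new AtomicInteger();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 20, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>(),
                runnable -> {
                    started.incrementAndGet(); // every call here is one thread birth
                    Thread thread = new Thread(runnable);
                    thread.setDaemon(true);
                    return thread;
                });
        pool.allowCoreThreadTimeOut(allowCoreTimeout);
        try {
            for (int burst = 0; burst < 3; burst++) {
                pool.submit(() -> { }).get(); // a short burst of work
                Thread.sleep(300);            // idle gap well beyond the keep-alive
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return started.get();
    }

    public static void main(String[] args) {
        System.out.println("core timeout off: " + threadStarts(false) + " thread start(s)");
        System.out.println("core timeout on:  " + threadStarts(true) + " thread start(s)");
    }
}
```

With core timeout off, the single core thread survives the idle gaps and is reused; with it on, each burst after an idle gap has to start a fresh thread, which is the start/die cycle the JFR recording showed.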
Tests:
- json: 917 tests pass (including 9 new escape-loop cases and 13 new
number-parse cases via the parsing path used by JsonObject.of).
- json-cbor: 51 tests pass (Jackson round-trips unchanged).
- policies/enforcement: 261 tests pass, including the three cascade
tests that exercise forward-index bookkeeping.
- things/service: 619 tests pass (includes MultiStageCommandEnforcementTest).
- internal/utils/cache-loaders: 22 tests pass (dispatcher config
loaded during test setup).
- helm template renders the new env vars with correct defaults.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two structural fixes mirroring the patterns landed for things-service:
thing-cache-dispatcher / policy-enforcer-cache-dispatcher (search.conf):
Add the missing thread-pool-executor block to both dispatchers so they
no longer fall back to Pekko's default-dispatcher config with
allow-core-timeout = on (which kills idle core threads after 60s and
causes a continual birth/death cycle - JFR observed ~0.95 starts/s
on thing-cache-dispatcher). Same template as ask-with-retry-dispatcher
in the previous commit on this branch; all tunables exposed via
THING_CACHE_DISPATCHER_* / POLICY_ENFORCER_CACHE_DISPATCHER_*
env vars. Helm thingsSearch.config.dispatchers.{thingCache,policyEnforcerCache}
values + thingssearch-deployment.yaml env-var bindings wire the
tunables through.
Netty ResourceLeakDetector (global jvmOptions):
Append -Dio.netty.leakDetection.level=disabled to global.jvmOptions
in values.yaml. Netty defaults to SIMPLE which samples 1/128 buffer
allocations and captures a Throwable stack trace per sampled buffer;
JFR recorded 204 such captures in 366s (~0.56/s) on the Pekko remote
+ Mongo client hot paths. Operators who want leak detection enabled
for debugging / canary can append -Dio.netty.leakDetection.level=simple
to a service's additionalJvmOptions (per-service is rendered after
global, so last-wins).
Tests:
- thingsearch/service: 619 tests pass
- helm template renders all 4 dispatcher env vars + the new
-Dio.netty.leakDetection.level=disabled flag across all 5 services
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a new validateValueTypes(Map<String, String>, Map<String, HeaderDefinition>)
overload (since 3.9.0) that iterates the (typically small) incoming
headers map and looks up each entry in the supplied definitions map.
The pre-existing Collection-based overload iterated every known
HeaderDefinition (D ~= 50+) and called headers.get on each - a hot path
on every inbound Pekko cluster message (JFR observed ~10% of all
execution samples on remote-dispatcher threads spent inside
LinkedHashMap.get from this loop).
The constructor at line 120-135 already accepts a definitionsMap
parameter; switch the call site to the new overload and pass it.
Null values are explicitly skipped in the new overload to preserve
identical behaviour to the old code (headers.get(key) returned null for
both absent and null-valued keys, implicitly skipping both).
The Collection-based overload is retained and marked @deprecated for
binary compatibility with any external subclass that overrode it.
Tests:
- base/model: 727 tests pass (incl. new
createInstanceWithMapContainingNullValueForKnownHeaderDoesNotThrow)
- internal/utils/cluster: 67 tests pass (JFR hot-path consumer)
- protocol: 1453 tests pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
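The loop inversion can be sketched as below. The `HeaderDefinition` interface here is a simplified, hypothetical stand-in for Ditto's type, and the class `HeaderValidationSketch` with its `int` return values (the number of validated entries) exists only so the two shapes can be compared; the real overloads may be structured differently.

```java
import java.util.Collection;
import java.util.Map;

// Hypothetical illustration of iterating headers (H) instead of definitions (D).
final class HeaderValidationSketch {

    // Simplified stand-in for a header definition.
    interface HeaderDefinition {
        String getKey();
        void validateValue(CharSequence value);
    }

    // Legacy shape: one map lookup per known definition, O(D).
    static int validateByDefinitions(Map<String, String> headers,
            Collection<HeaderDefinition> definitions) {
        int validated = 0;
        for (HeaderDefinition definition : definitions) {
            String value = headers.get(definition.getKey());
            if (value != null) { // null covers both absent and null-valued keys
                definition.validateValue(value);
                validated++;
            }
        }
        return validated;
    }

    // Inverted shape: one map lookup per incoming header, O(H).
    static int validateByHeaders(Map<String, String> headers,
            Map<String, HeaderDefinition> definitionsByKey) {
        int validated = 0;
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            String value = entry.getValue();
            if (value == null) {
                continue; // explicit skip keeps behaviour identical to the legacy loop
            }
            HeaderDefinition definition = definitionsByKey.get(entry.getKey());
            if (definition != null) {
                definition.validateValue(value);
                validated++;
            }
        }
        return validated;
    }

    // Test helper: a definition that rejects empty values.
    static HeaderDefinition nonEmpty(String key) {
        return new HeaderDefinition() {
            @Override public String getKey() { return key; }
            @Override public void validateValue(CharSequence value) {
                if (value.length() == 0) {
                    throw new IllegalArgumentException("empty value for " + key);
                }
            }
        };
    }
}
```

Both methods validate the same entries; the difference is only which collection drives the iteration, which matters when the headers map is much smaller than the set of known definitions.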
A repo-wide audit of dispatchers declaring
InstrumentedThreadPoolExecutorServiceConfigurator without a
thread-pool-executor block found two remaining occurrences after the
previous commits in this branch addressed ask-with-retry-dispatcher,
thing-cache-dispatcher, and policy-enforcer-cache-dispatcher:
- gateway.conf:719 signal-enrichment-cache-dispatcher
- connectivity.conf:1360 signal-enrichment-cache-dispatcher
Without the block Pekko's default-dispatcher fallback applies and
includes allow-core-timeout = on, letting every core thread die after
the keep-alive interval. JFR profiling on a connectivity pod observed
~0.06 starts/s for this dispatcher pre-fix.
Apply the same template as the three previously fixed dispatchers:
core-pool-size-min = 4, keep-alive-time = 60s, allow-core-timeout = off,
all overridable via SIGNAL_ENRICHMENT_CACHE_DISPATCHER_* env vars.
Helm gateway.config.dispatchers.signalEnrichmentCache and
connectivity.config.dispatchers.signalEnrichmentCache values + two new
env-var bindings per service deployment template.
Tests:
- gateway/service + connectivity/service: 1735 tests pass
- helm template renders 2 new env vars per service (4 total)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fd13a72 to
6c8b242
A repo-wide audit for the stricter pattern (dispatchers using
InstrumentedThreadPoolExecutorServiceConfigurator where the
thread-pool-executor block does not explicitly set allow-core-timeout)
found gateway's authentication-dispatcher missing the override.
Without an explicit value Pekko's default-dispatcher reference cascades
`allow-core-timeout = on`, killing idle core threads after 60s and
producing a continual start/die cycle. JFR profiling on a gateway pod
observed ~0.32 starts/s for this dispatcher pre-fix.
Add `allow-core-timeout = off` + corresponding env-var override
AUTHENTICATION_DISPATCHER_ALLOW_CORE_TIMEOUT, plus the missing
core-pool-size-min and keep-alive-time env-var overrides.
(The audit also flagged connectivity's http-push-connection-dispatcher,
but that block explicitly sets allow-core-timeout = on with a comment
explaining the intent: operator wants the pool to scale back from its
max of 512 threads. Left unchanged.)
Tests:
- gateway/service: 782 tests pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hu-ahmed
approved these changes
May 12, 2026
Contributor
hu-ahmed
left a comment
LGTM!
Thanks for the well-targeted optimizations
thjaeckle
added a commit
to beyonnex-io/ditto
that referenced
this pull request
May 12, 2026
Adds PRs eclipse-ditto#2440 (JFR CPU optimisations), eclipse-ditto#2441 / eclipse-ditto#2439 (AskTimeoutException to 503), eclipse-ditto#2442 (Helm OIDC issuers), eclipse-ditto#2444 / eclipse-ditto#2443 (ssl-config fix), eclipse-ditto#2445 (MongoDB X509 auth), eclipse-ditto#2446 (stackless 4xx exceptions) and eclipse-ditto#2447 (configurable SSE backpressure) to the release notes and blogpost. Renames the announcement post to 2026-05-13. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
CPU optimisations identified by 60-second to 5-minute JFR recordings on the things, things-search, and connectivity dev pods. Four commits, all independent and behaviour-preserving:

- things-service (66ef7c6eb0): five fixes targeting the JFR-observed hot paths:
  - `PolicyEnforcerCache.deregisterImportMappings` (was O(N) full-map scan); null-PolicyId short-circuit in `AbstractEnforcerActor.loadPolicyEnforcer` plus WARN→DEBUG downgrade (was producing ~200 k log lines/day on dev).
  - `CBORFactory` reuse in `JacksonSerializationContext` (was allocated per serialisation; top byte[]/int[] allocator).
  - `DefaultDittoJsonHandler.parseToIntegerOrLong` parses long-then-downcast instead of try-`Integer.parseInt`-catch-`NumberFormatException`-fallback (JFR observed ~113 NFE/s from 64-bit pub/sub hashes).
  - `JavaStringToEscapedJsonString.apply`: fast path for no-escape strings + `append(CharSequence, start, end)` runs instead of per-char `replace`.
  - `ask-with-retry-dispatcher`: add the missing `thread-pool-executor` block with `allow-core-timeout = off` (Pekko's default-dispatcher fallback was killing idle core threads after 60 s); raise `wot-dispatcher` / `wot-dispatcher-cache-loader` `parallelism-min` 4 → 8.
- things-search (3578855ef6): two structural fixes mirroring the things-service patterns:
  - `thread-pool-executor` template for `thing-cache-dispatcher` and `policy-enforcer-cache-dispatcher` (JFR observed ~0.95 thread-births/s on thing-cache-dispatcher pre-fix).
  - Append `-Dio.netty.leakDetection.level=disabled` to `global.jvmOptions` in Helm. Netty defaults to SIMPLE which captures a Throwable stack per sampled buffer; ~0.56/s on the Pekko-remote/Mongo paths.
- base/model `validateValueTypes` (5b9c4b7381): invert the loop in `AbstractDittoHeadersBuilder.validateValueTypes`. The pre-existing variant iterates all known `HeaderDefinition`s (~50+) and calls `headers.get(definitionKey)` per definition; the new `@since 3.9.0` overload iterates the (typically small) headers map and looks up each entry in a `Map<String, HeaderDefinition>`, O(H) instead of O(D). Hits on every inbound Pekko cluster message; the legacy `Collection` overload is kept `@Deprecated` for binary compatibility with external subclasses. Defensive null-value skip preserves identical observable behaviour.
- `signal-enrichment-cache-dispatcher` (fd13a72e9b): repo-wide audit found two remaining occurrences of the same defect (gateway + connectivity); fixed identically. After this commit, no instances of `type = Dispatcher` + `InstrumentedThreadPoolExecutorServiceConfigurator` + missing `thread-pool-executor` block remain in `*/main/resources/*.conf`.

All new HOCON tunables are `${?ENV_VAR}`-overridable; Helm wiring (`{things,thingsSearch,gateway,connectivity}.config.dispatchers.*` + deployment-template env-var bindings) is included.