Service CPU optimisations from JFR profiling #2440

Merged
thjaeckle merged 5 commits into eclipse-ditto:master from beyonnex-io:feature/services-cpu-optimisations
May 12, 2026

@thjaeckle thjaeckle commented May 11, 2026

Summary

CPU optimisations identified by 60-second to 5-minute JFR recordings on the things, things-search, and connectivity dev pods. Four commits, all independent and behaviour-preserving:

  • things-service (66ef7c6eb0) — five fixes targeting the JFR-observed hot paths:

    • O(k) forward-index in PolicyEnforcerCache.deregisterImportMappings (was O(N) full-map scan); null-PolicyId short-circuit in AbstractEnforcerActor.loadPolicyEnforcer plus WARN→DEBUG downgrade (was producing ~200 k log lines/day on dev).
    • Static CBORFactory reuse in JacksonSerializationContext (was allocated per serialisation; top byte[]/int[] allocator).
    • DefaultDittoJsonHandler.parseToIntegerOrLong parses long-then-downcast instead of try-Integer.parseInt-catch-NumberFormatException-fallback (JFR observed ~113 NFE/s from 64-bit pub/sub hashes).
    • Bulk-copy fast path in JavaStringToEscapedJsonString.apply (no-escape strings) + append(CharSequence, start, end) runs instead of per-char replace.
    • ask-with-retry-dispatcher: add the missing thread-pool-executor block with allow-core-timeout = off (Pekko's default-dispatcher fallback was killing idle core threads after 60 s); raise wot-dispatcher / wot-dispatcher-cache-loader parallelism-min 4 → 8.
  • things-search (3578855ef6) — two structural fixes mirroring the things-service patterns:

    • Same thread-pool-executor template for thing-cache-dispatcher and policy-enforcer-cache-dispatcher (JFR observed ~0.95 thread-births/s on thing-cache-dispatcher pre-fix).
    • Append -Dio.netty.leakDetection.level=disabled to global.jvmOptions in Helm. Netty defaults to SIMPLE, which captures a Throwable stack per sampled buffer; JFR observed ~0.56 captures/s on the Pekko-remote/Mongo paths.
  • base/model validateValueTypes (5b9c4b7381) — invert the loop in AbstractDittoHeadersBuilder.validateValueTypes. The pre-existing variant iterates all known HeaderDefinitions (~50+) and headers.get(definitionKey) per definition; the new @since 3.9.0 overload iterates the (typically small) headers map and looks up each entry in a Map<String, HeaderDefinition> — O(H) instead of O(D). Hits on every inbound Pekko cluster message; the legacy Collection overload is kept @Deprecated for binary compatibility with external subclasses. Defensive null-value skip preserves identical observable behaviour.

  • signal-enrichment-cache-dispatcher (fd13a72e9b) — repo-wide audit found two remaining occurrences of the same defect (gateway + connectivity); fixed identically. After this commit, no instances of type = Dispatcher + InstrumentedThreadPoolExecutorServiceConfigurator + missing thread-pool-executor block remain in */main/resources/*.conf.

All new HOCON tunables are ${?ENV_VAR}-overridable; Helm wiring ({things,thingsSearch,gateway,connectivity}.config.dispatchers.* + deployment-template env-var bindings) is included.

thjaeckle and others added 3 commits May 11, 2026 16:26
Five independent optimisations on hot paths observed in a 60s JFR
recording of the things service (~22% CPU per pod). Together they
remove the dominant Java-level hotspots and the bulk of allocation
pressure while preserving all observable semantics.

PolicyEnforcerCache (enforcement-dispatcher hot path):
  Replace the O(N) full-map scan in deregisterImportMappings with an
  O(k) forward-index lookup (importingPolicyId -> imported set), and
  short-circuit null PolicyIds in AbstractEnforcerActor.loadPolicyEnforcer
  so the provider is bypassed entirely. The null-id WARN in
  CachingPolicyEnforcerProvider / DefaultPolicyEnforcerProvider is
  downgraded to DEBUG (was producing ~2 200 log lines/minute).
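
The forward-index idea can be sketched as follows. This is a minimal illustration, not the PolicyEnforcerCache code: the class, its String ids and the two map names are hypothetical, and the sketch is not thread-safe. It keeps a second map from each importing policy to the set it imports, so deregistration touches only those k entries instead of scanning every entry of the reverse map.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the import bookkeeping; all names are illustrative.
final class ImportIndexSketch {

    // Reverse index: imported policy id -> ids of the policies importing it.
    private final Map<String, Set<String>> importedToImporting = new HashMap<>();
    // Forward index: importing policy id -> ids it imports (enables O(k) deregistration).
    private final Map<String, Set<String>> importingToImported = new HashMap<>();

    void register(String importingId, Set<String> importedIds) {
        deregister(importingId); // drop stale mappings first
        if (importedIds.isEmpty()) {
            return;
        }
        importingToImported.put(importingId, new HashSet<>(importedIds));
        for (String importedId : importedIds) {
            importedToImporting.computeIfAbsent(importedId, k -> new HashSet<>()).add(importingId);
        }
    }

    void deregister(String importingId) {
        // Only the k imported ids of this policy are visited, not the whole reverse map.
        Set<String> importedIds = importingToImported.remove(importingId);
        if (importedIds == null) {
            return;
        }
        for (String importedId : importedIds) {
            Set<String> importers = importedToImporting.get(importedId);
            if (importers != null) {
                importers.remove(importingId);
                if (importers.isEmpty()) {
                    importedToImporting.remove(importedId);
                }
            }
        }
    }

    Set<String> importersOf(String importedId) {
        return new HashSet<>(importedToImporting.getOrDefault(importedId, Set.of()));
    }

    public static void main(String[] args) {
        ImportIndexSketch index = new ImportIndexSketch();
        index.register("ns:a", Set.of("ns:x", "ns:y"));
        index.register("ns:b", Set.of("ns:x"));
        index.deregister("ns:a");
        System.out.println(index.importersOf("ns:x"));
    }
}
```

The cost is one extra map that has to be kept in sync on every register and deregister, which is why the sketch routes every re-registration through deregister first.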

JacksonSerializationContext (Pekko-remote dispatcher):
  Reuse a single static CBORFactory in the package-private constructor
  instead of allocating a fresh one per serialisation. Removes the top
  byte[]/int[] allocation pressure (Jackson symbol-table canonicalizer).

DefaultDittoJsonHandler.parseToIntegerOrLong:
  Stop using NumberFormatException as control flow. Parse once as long
  and downcast iff in int range. Profile recorded ~113 NFE/s, mostly
  from Publisher.deserializeGroupedHashes where 64-bit hashes always
  overflow Integer.
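
A minimal sketch of the parse-once approach, assuming a plain String input (the real method's signature and return type are not shown in this PR text, so the class and method shape here are hypothetical):

```java
// Hypothetical sketch; not the DefaultDittoJsonHandler implementation.
final class NumberParseSketch {

    // Parses once as long and narrows to Integer only when the value fits,
    // so no exception is thrown for ordinary 64-bit inputs.
    static Number parseToIntegerOrLong(String text) {
        long value = Long.parseLong(text); // still throws NumberFormatException for non-numeric input
        if (value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE) {
            return (int) value; // boxed as Integer
        }
        return value; // boxed as Long
    }

    public static void main(String[] args) {
        System.out.println(parseToIntegerOrLong("42").getClass().getSimpleName());
        System.out.println(parseToIntegerOrLong("9007199254740993").getClass().getSimpleName());
    }
}
```

The if/return form is deliberate: a conditional expression such as `fits ? (int) value : value` would promote both branches to long and always box a Long.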

JavaStringToEscapedJsonString.apply:
  Replace char-by-char append + replace loop with a no-escape fast path
  and bulk StringBuilder.append(CharSequence, start, end) runs. Same
  RFC 8259 output, large reduction in StringBuilder work for the
  remote-dispatcher header-parsing path.
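
The fast path and the run-based appends might look roughly like this. It is a sketch under assumptions: the class name is hypothetical, surrounding quotes are not added, and only the escapes that RFC 8259 requires (quote, backslash, and control characters below U+0020) are handled.

```java
// Hypothetical sketch; not the JavaStringToEscapedJsonString implementation.
final class JsonEscapeSketch {

    static String escape(String s) {
        int first = firstEscapeIndex(s);
        if (first < 0) {
            return s; // fast path: nothing to escape, no StringBuilder at all
        }
        StringBuilder sb = new StringBuilder(s.length() + 16);
        int runStart = 0;
        for (int i = first; i < s.length(); i++) {
            String replacement = replacementFor(s.charAt(i));
            if (replacement != null) {
                sb.append(s, runStart, i); // bulk-copy the unescaped run before this char
                sb.append(replacement);
                runStart = i + 1;
            }
        }
        sb.append(s, runStart, s.length()); // trailing unescaped run
        return sb.toString();
    }

    private static int firstEscapeIndex(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (replacementFor(s.charAt(i)) != null) {
                return i;
            }
        }
        return -1;
    }

    // Returns the escape sequence for c, or null if c can be copied verbatim.
    private static String replacementFor(char c) {
        switch (c) {
            case '"': return "\\\"";
            case '\\': return "\\\\";
            case '\b': return "\\b";
            case '\f': return "\\f";
            case '\n': return "\\n";
            case '\r': return "\\r";
            case '\t': return "\\t";
            default:
                // Remaining control characters use the six-character hex form.
                return c < 0x20 ? String.format("\\u%04x", (int) c) : null;
        }
    }

    public static void main(String[] args) {
        System.out.println(escape("plain"));
        System.out.println(escape("say \"hi\"\n"));
    }
}
```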

Pekko dispatchers (thread churn):
  Add the missing thread-pool-executor block to ask-with-retry-dispatcher
  with allow-core-timeout = off (Pekko's default-dispatcher fallback was
  letting core threads die after 60s idle, causing ~2 thread starts/s).
  Raise parallelism-min 4 -> 8 on wot-dispatcher and
  wot-dispatcher-cache-loader to match observed sustained demand. All
  tunables exposed via env-var overrides; things-deployment Helm chart
  templates the WoT-dispatcher knobs through new
  things.config.dispatchers.{wot,wotCacheLoader} values.
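
The thread churn can be reproduced outside Pekko with a plain java.util.concurrent.ThreadPoolExecutor. The sketch assumes that allow-core-timeout corresponds to ThreadPoolExecutor.allowCoreThreadTimeOut; the class name, pool size and timings are illustrative only and much shorter than the 60 s keep-alive discussed above.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo of core-thread timeout; not Ditto or Pekko code.
final class CoreTimeoutSketch {

    // Runs three short bursts separated by idle gaps longer than the keep-alive
    // and returns how many threads the pool had to start in total.
    static int threadsStarted(boolean allowCoreTimeout) {
        AtomicInteger started = new AtomicInteger();
        ThreadFactory countingFactory = runnable -> {
            started.incrementAndGet();
            return new Thread(runnable);
        };
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 50, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>(), countingFactory);
        pool.allowCoreThreadTimeOut(allowCoreTimeout);
        try {
            for (int burst = 0; burst < 3; burst++) {
                pool.submit(() -> { }).get(); // one trivial task per burst
                Thread.sleep(300);            // idle gap well beyond the 50 ms keep-alive
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return started.get();
    }

    public static void main(String[] args) {
        System.out.println("core timeout off: " + threadsStarted(false) + " thread(s) started");
        System.out.println("core timeout on:  " + threadsStarted(true) + " thread(s) started");
    }
}
```

With the timeout off, the single core thread is expected to survive the idle gaps and be reused; with it on, each burst should have to start a fresh thread, which is the start/die cycle the JFR recording showed.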

Tests:
  - json: 917 tests pass (including 9 new escape-loop cases and 13 new
    number-parse cases via the parsing path used by JsonObject.of).
  - json-cbor: 51 tests pass (Jackson round-trips unchanged).
  - policies/enforcement: 261 tests pass, including the three cascade
    tests that exercise forward-index bookkeeping.
  - things/service: 619 tests pass (includes MultiStageCommandEnforcementTest).
  - internal/utils/cache-loaders: 22 tests pass (dispatcher config
    loaded during test setup).
  - helm template renders the new env vars with correct defaults.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two structural fixes mirroring the patterns landed for things-service:

thing-cache-dispatcher / policy-enforcer-cache-dispatcher (search.conf):
  Add the missing thread-pool-executor block to both dispatchers so they
  no longer fall back to Pekko's default-dispatcher config with
  allow-core-timeout = on (which kills idle core threads after 60s and
  causes a continual birth/death cycle - JFR observed ~0.95 starts/s
  on thing-cache-dispatcher). Same template as ask-with-retry-dispatcher
  in the previous commit on this branch; all tunables exposed via
  THING_CACHE_DISPATCHER_* / POLICY_ENFORCER_CACHE_DISPATCHER_*
  env vars. Helm thingsSearch.config.dispatchers.{thingCache,policyEnforcerCache}
  values + thingssearch-deployment.yaml env-var bindings wire the
  tunables through.

Netty ResourceLeakDetector (global jvmOptions):
  Append -Dio.netty.leakDetection.level=disabled to global.jvmOptions
  in values.yaml. Netty defaults to SIMPLE which samples 1/128 buffer
  allocations and captures a Throwable stack trace per sampled buffer;
  JFR recorded 204 such captures in 366s (~0.56/s) on the Pekko remote
  + Mongo client hot paths. Operators who want leak detection enabled
  for debugging / canary can append -Dio.netty.leakDetection.level=simple
  to a service's additionalJvmOptions (per-service is rendered after
  global, so last-wins).

Tests:
  - thingsearch/service: 619 tests pass
  - helm template renders all 4 dispatcher env vars + the new
    -Dio.netty.leakDetection.level=disabled flag across all 5 services

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a new validateValueTypes(Map<String, String>, Map<String, HeaderDefinition>)
overload (since 3.9.0) that iterates the (typically small) incoming
headers map and looks up each entry in the supplied definitions map.
The pre-existing Collection-based overload iterated every known
HeaderDefinition (D ~= 50+) and called headers.get on each - a hot
path on every inbound Pekko cluster message (JFR observed ~10% of
all execution samples on remote-dispatcher threads spent inside
LinkedHashMap.get from this loop).

The constructor at lines 120-135 already accepts a definitionsMap
parameter; switch the call site to the new overload and pass it.
Null values are explicitly skipped in the new overload to preserve
identical behaviour to the old code (headers.get(key) returned null
for both absent and null-valued keys, implicitly skipping both).
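
The loop inversion can be sketched as follows. The Definition interface is a hypothetical minimal stand-in for HeaderDefinition, and both methods are illustrative shapes rather than the AbstractDittoHeadersBuilder code.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two loop shapes; not the Ditto implementation.
final class HeaderValidationSketch {

    // Minimal stand-in for HeaderDefinition.
    interface Definition {
        String key();
        void validate(String value); // throws IllegalArgumentException on a bad value
    }

    // Old shape: one map lookup per known definition (D lookups per message).
    static void validateByDefinitions(Map<String, String> headers, Collection<Definition> definitions) {
        for (Definition definition : definitions) {
            String value = headers.get(definition.key());
            if (value != null) { // absent and null-valued keys are both skipped
                definition.validate(value);
            }
        }
    }

    // New shape: one lookup per header actually present (H lookups per message).
    static void validateByHeaders(Map<String, String> headers, Map<String, Definition> definitionsByKey) {
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            String value = entry.getValue();
            if (value == null) {
                continue; // explicit skip keeps the old behaviour for null values
            }
            Definition definition = definitionsByKey.get(entry.getKey());
            if (definition != null) {
                definition.validate(value);
            }
        }
    }

    // Example definition that accepts only non-empty digit strings.
    static Definition digitsOnly(String key) {
        return new Definition() {
            @Override public String key() { return key; }
            @Override public void validate(String value) {
                if (value.isEmpty() || !value.chars().allMatch(Character::isDigit)) {
                    throw new IllegalArgumentException("Header '" + key + "' is not numeric: " + value);
                }
            }
        };
    }

    public static void main(String[] args) {
        Map<String, Definition> definitions = new HashMap<>();
        definitions.put("timeout", digitsOnly("timeout"));
        Map<String, String> headers = new HashMap<>();
        headers.put("timeout", "60");
        headers.put("x-custom", "anything");
        validateByHeaders(headers, definitions);
        System.out.println("headers valid");
    }
}
```

Unknown headers such as x-custom fall through the definitions lookup and are ignored in both shapes, so the two loops accept and reject the same inputs.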

The Collection-based overload is retained and marked @deprecated for
binary compatibility with any external subclass that overrode it.

Tests:
  - base/model: 727 tests pass (incl. new
    createInstanceWithMapContainingNullValueForKnownHeaderDoesNotThrow)
  - internal/utils/cluster: 67 tests pass (JFR hot-path consumer)
  - protocol: 1453 tests pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thjaeckle thjaeckle self-assigned this May 11, 2026
@thjaeckle thjaeckle added this to the 3.9.0 milestone May 11, 2026
A repo-wide audit of dispatchers declaring
InstrumentedThreadPoolExecutorServiceConfigurator without a
thread-pool-executor block found two remaining occurrences after
the previous commits in this branch addressed ask-with-retry-dispatcher,
thing-cache-dispatcher, and policy-enforcer-cache-dispatcher:

  - gateway.conf:719          signal-enrichment-cache-dispatcher
  - connectivity.conf:1360    signal-enrichment-cache-dispatcher

Without the block Pekko's default-dispatcher fallback applies and includes
allow-core-timeout = on, letting every core thread die after the keep-alive
interval. JFR profiling on a connectivity pod observed ~0.06 starts/s for
this dispatcher pre-fix.

Apply the same template as the three previously fixed dispatchers:
core-pool-size-min = 4, keep-alive-time = 60s, allow-core-timeout = off,
all overridable via SIGNAL_ENRICHMENT_CACHE_DISPATCHER_* env vars.
Helm gateway.config.dispatchers.signalEnrichmentCache and
connectivity.config.dispatchers.signalEnrichmentCache values +
two new env-var bindings per service deployment template.

Tests:
  - gateway/service + connectivity/service: 1735 tests pass
  - helm template renders 2 new env vars per service (4 total)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thjaeckle thjaeckle force-pushed the feature/services-cpu-optimisations branch from fd13a72 to 6c8b242 May 11, 2026 17:23
A repo-wide audit for the stricter pattern (dispatchers using
InstrumentedThreadPoolExecutorServiceConfigurator where the
thread-pool-executor block does not explicitly set allow-core-timeout)
found gateway's authentication-dispatcher missing the override.
Without an explicit value Pekko's default-dispatcher reference cascades
`allow-core-timeout = on`, killing idle core threads after 60s and
producing a continual start/die cycle. JFR profiling on a gateway pod
observed ~0.32 starts/s for this dispatcher pre-fix.

Add `allow-core-timeout = off` + corresponding env-var override
AUTHENTICATION_DISPATCHER_ALLOW_CORE_TIMEOUT, plus the missing
core-pool-size-min and keep-alive-time env-var overrides.

(The audit also flagged connectivity's http-push-connection-dispatcher,
but that block explicitly sets allow-core-timeout = on with a comment
explaining the intent — operator wants the pool to scale back from
its max of 512 threads. Left unchanged.)

Tests:
  - gateway/service: 782 tests pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hu-ahmed hu-ahmed self-requested a review May 12, 2026 06:43

@hu-ahmed hu-ahmed left a comment

LGTM!
Thanks for the well-targeted optimizations

@thjaeckle thjaeckle merged commit 0ee1d26 into eclipse-ditto:master May 12, 2026
10 checks passed
thjaeckle added a commit to beyonnex-io/ditto that referenced this pull request May 12, 2026
Adds PRs eclipse-ditto#2440 (JFR CPU optimisations), eclipse-ditto#2441 / eclipse-ditto#2439 (AskTimeoutException
to 503), eclipse-ditto#2442 (Helm OIDC issuers), eclipse-ditto#2444 / eclipse-ditto#2443 (ssl-config fix), eclipse-ditto#2445
(MongoDB X509 auth), eclipse-ditto#2446 (stackless 4xx exceptions) and eclipse-ditto#2447
(configurable SSE backpressure) to the release notes and blogpost.

Renames the announcement post to 2026-05-13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>