Skip to content

feat(stream): add GraphStage-backed TLS path behind feature switch (#2860)#2877

Closed
He-Pin wants to merge 2 commits intomainfrom
issue-2860-tls-graphstage-path
Closed

feat(stream): add GraphStage-backed TLS path behind feature switch (#2860)#2877
He-Pin wants to merge 2 commits intomainfrom
issue-2860-tls-graphstage-path

Conversation

@He-Pin
Copy link
Copy Markdown
Member

@He-Pin He-Pin commented Apr 19, 2026

Summary

Adds a new TlsGraphStage — a pure GraphStage BidiStage implementation of the TLS layer — behind a runtime feature switch, as part of the migration plan for issue #2860.

The existing TLSActor-based path remains the default and is completely unchanged. The new path is opt-in via config:

pekko.stream.materializer.tls.use-legacy-actor = off

or JVM flag -Dpekko.stream.materializer.tls.use-legacy-actor=false.


Motivation

The legacy TLS path materialises each TLS connection as a standalone actor island (TlsModule / TLSActor), relying on internal stream substrate that pre-dates GraphStage. This makes it hard to reason about, profile, or extend. The goal of #2860 is to replace all remaining legacy actor-backed stream operators with proper GraphStage implementations.


What changed

New files

File Purpose
stream/src/main/scala/org/apache/pekko/stream/impl/io/TlsGraphStage.scala New BidiGraphStage TLS implementation (~780 lines)
stream/src/main/scala/org/apache/pekko/stream/impl/io/TlsUtils.scala Shared TLS utility functions extracted from TLSActor
bench-jmh/src/main/scala/org/apache/pekko/stream/TlsBenchmark.scala JMH micro-benchmark for both TLS paths
bench-jmh/src/main/resources/keystore + truststore Benchmark keystores

Modified files

File Change
stream/src/main/scala/org/apache/pekko/stream/scaladsl/TLS.scala Feature switch: route to TlsGraphStage or TLSActor
stream/src/main/resources/reference.conf Add pekko.stream.materializer.tls.use-legacy-actor key
stream/src/main/scala/org/apache/pekko/stream/impl/PhasedFusingActorMaterializer.scala Wire up new path
stream/src/main/scala/org/apache/pekko/stream/impl/io/TLSActor.scala Trim now-extracted utilities
stream-tests/src/test/scala/org/apache/pekko/stream/io/TlsSpec.scala Add TlsGraphStageSpec (same 111 tests, new path)
stream-tests/src/test/scala/org/apache/pekko/stream/snapshot/MaterializerStateSpec.scala Extend snapshot test for GraphStage path
docs/src/main/paradox/stream/stream-io.md Document the feature switch and isolation guarantee

Design decisions

Async boundary retained deliberately

TlsGraphStage carries ActorAttributes.dispatcher(DefaultDispatcher) in its initialAttributes, ensuring it is materialised into its own ActorGraphInterpreter actor — matching the isolation model of the legacy TLSActor. This preserves:

  • Single-threaded SSLEngine access (required by the JCA contract).
  • Per-connection actor isolation (observable in materialiser snapshots).
  • Protection against blocking PKIX/DH delegated tasks sharing a fused-graph thread.

Known ordering limitation (4 tests marked pending)

Four test cases (reliably cancel subscriptions when TransportIn/UserIn fails early x TLSv1.2/1.3) are marked pending in TlsGraphStageSpec with detailed Scaladoc. The root cause:

  • Demand from Sink.head reaches the TLS actor in 1 inter-actor hop.
  • A failure from Source.failed reaches it in 2 inter-actor hops (TLS pulls upstream; upstream responds with failure).
  • Demand consistently wins the mailbox race, causing a TLS ClientHello to be pushed before failTls is invoked.

The legacy TLSActor avoided this via initialPhase(2, bidirectional) — which waited for both upstream subscriptions (via VP-bridge hops) before pumping. That mechanism has no direct GraphStage equivalent without removing the async boundary.

The eventual behaviour (both outputs fail, subscriptions cancelled) remains correct; only the "no bytes emitted before failure" guarantee differs. Future work: a scheduler-deferred drain or two-phase handshake initiation could restore it.


Test results

Suite Result
TlsSpec (legacy path) 111/111 pass
TlsGraphStageSpec (new path) 107 pass, 4 pending (no failures)
MaterializerStateSpec 6/6 pass
bench-jmh/Jmh/compile compiles

Related

He-Pin and others added 2 commits April 19, 2026 18:40
…2860)

Motivation:
TlsModule currently uses an actor-backed island (TLSActor) for every TLS
connection. This makes TLS materialize as a separate actor, adding per-message
scheduling overhead and preventing the fused-graph optimiser from crossing the
TLS boundary. Issue #2860 tracks replacing the legacy actor path with a proper
GraphStage.

Modification:
- Extract TlsUtils from TLSActor (shared cipher/tracing helpers).
- Add TlsGraphStage: a BidiGraphStage that owns the SSLEngine state machine,
  handles all handshake sequencing, renegotiation gating, close-notify exchange,
  and error propagation without any internal actor.
  Key fixes included in the state machine:
  * shouldCloseOutbound TransferState so a server-role stage can initiate an
    outbound close even when no user data is pending (prevents deadlock).
  * After a handshake failure (e.g. certificate_unknown) the first engine.wrap()
    throws but leaves the engine in NEED_WRAP; a second wrap() call is performed
    to flush the TLS fatal-alert bytes to the peer, so the peer receives the
    real error instead of 'closing inbound before receiving peer's close_notify'.
- Wire the switch in PhasedFusingActorMaterializer via
  pekko.stream.materializer.tls.use-legacy-actor (default true, preserving
  existing behaviour).
- Extend TlsSpec to run the full suite against both paths (TlsGraphStageSpec).
- Update MaterializerStateSpec to distinguish legacy vs GraphStage actor names.
- Add TlsBenchmark in bench-jmh for TLS throughput regression tracking.
- Add a runtime-isolation note to the stream-io docs.

Result:
TlsGraphStageSpec: 111/111 tests pass on both TLSv1.2 and TLSv1.3, including:
- normal data transfer
- half-close / truncation handling
- renegotiation sequencing
- certificate-check error propagation (certificate_unknown alert reaches peer)
- early-failure / cancellation semantics
- hostname verification
The legacy TLS actor path is unchanged (default).

References: #2860

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation:
Two test cases in TlsSpecBase verify that when an input source fails
immediately (before the TLS handshake completes), both outputs fail with
the same exception and no TLS bytes are emitted. This ordering guarantee
holds for the legacy TLSActor-based path but cannot be provided by the
new TlsGraphStage path without removing its async boundary.

Root cause of the ordering difference:
TlsGraphStage deliberately carries an ActorAttributes.dispatcher attribute
that forces it to materialise into its own ActorGraphInterpreter actor
(for SSLEngine thread-safety and per-connection isolation). This creates
an async message boundary for all inter-stage communication.

With that boundary in place:
- Demand from Sink.head travels 1 inter-actor hop to reach TlsGraphStage.
- The failure reply from Source.failed travels 2 inter-actor hops (TLS
  pulls upstream; upstream sends failure back).
- In the TLS actor mailbox, demand (1 hop) consistently arrives before the
  failure (2 hops). When demand arrives isAvailable(cipherOut) = true, the
  engine is in NEED_WRAP state, and a TLS ClientHello is pushed to
  Sink.head before failTls() is ever invoked.

Legacy TLSActor avoided this race by using initialPhase(2, bidirectional),
which deferred the first pump until both upstream subscriptions arrived via
VirtualProcessor bridges; by that time the Source.failed error was already
buffered in the InputBunch.

Why the async boundary must stay:
Removing the dispatcher attribute (Option B) would make the failure
synchronous within the same interpreter pump cycle and fix the race.
However, doing so would:
1. Allow blocking SSLEngine delegated tasks (PKIX validation,
   Diffie-Hellman key generation) to run on a shared fused-graph thread.
2. Break MaterializerStateSpec, which asserts that each TLS stage
   materialises to a separate ActorGraphInterpreter actor snapshot.

Modification:
Add withFixture override in TlsGraphStageSpec that returns Pending for
the four test-name patterns matching the two 'reliably cancel' scenarios
(each run for both TLSv1.2 and TLSv1.3). The Scaladoc on both the class
and the override explains the mailbox-hop ordering constraint in detail.
Apply scalafmt to TlsGraphStage.scala (handler call reformatting only).

Result:
- TlsSpec (legacy): 111/111 tests pass.
- TlsGraphStageSpec (GraphStage): 107 pass, 4 pending (no failures).
- MaterializerStateSpec: unchanged.

Future work: A scheduler-based deferred drain or a two-phase
handshake-initiation design could restore the ordering guarantee without
removing the async boundary.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@He-Pin He-Pin closed this Apr 19, 2026
@He-Pin He-Pin deleted the issue-2860-tls-graphstage-path branch April 19, 2026 12:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant