feat(stream): add GraphStage-backed TLS path behind feature switch (#2860)#2877
Closed
feat(stream): add GraphStage-backed TLS path behind feature switch (#2860)#2877
Conversation
…2860) Motivation: TlsModule currently uses an actor-backed island (TLSActor) for every TLS connection. This makes TLS materialize as a separate actor, adding per-message scheduling overhead and preventing the fused-graph optimiser from crossing the TLS boundary. Issue #2860 tracks replacing the legacy actor path with a proper GraphStage. Modification: - Extract TlsUtils from TLSActor (shared cipher/tracing helpers). - Add TlsGraphStage: a BidiGraphStage that owns the SSLEngine state machine, handles all handshake sequencing, renegotiation gating, close-notify exchange, and error propagation without any internal actor. Key fixes included in the state machine: * shouldCloseOutbound TransferState so a server-role stage can initiate an outbound close even when no user data is pending (prevents deadlock). * After a handshake failure (e.g. certificate_unknown) the first engine.wrap() throws but leaves the engine in NEED_WRAP; a second wrap() call is performed to flush the TLS fatal-alert bytes to the peer, so the peer receives the real error instead of 'closing inbound before receiving peer's close_notify'. - Wire the switch in PhasedFusingActorMaterializer via pekko.stream.materializer.tls.use-legacy-actor (default true, preserving existing behaviour). - Extend TlsSpec to run the full suite against both paths (TlsGraphStageSpec). - Update MaterializerStateSpec to distinguish legacy vs GraphStage actor names. - Add TlsBenchmark in bench-jmh for TLS throughput regression tracking. - Add a runtime-isolation note to the stream-io docs. Result: TlsGraphStageSpec: 111/111 tests pass on both TLSv1.2 and TLSv1.3, including: - normal data transfer - half-close / truncation handling - renegotiation sequencing - certificate-check error propagation (certificate_unknown alert reaches peer) - early-failure / cancellation semantics - hostname verification The legacy TLS actor path is unchanged (default). References: #2860 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation: Two test cases in TlsSpecBase verify that when an input source fails immediately (before the TLS handshake completes), both outputs fail with the same exception and no TLS bytes are emitted. This ordering guarantee holds for the legacy TLSActor-based path but cannot be provided by the new TlsGraphStage path without removing its async boundary. Root cause of the ordering difference: TlsGraphStage deliberately carries an ActorAttributes.dispatcher attribute that forces it to materialise into its own ActorGraphInterpreter actor (for SSLEngine thread-safety and per-connection isolation). This creates an async message boundary for all inter-stage communication. With that boundary in place: - Demand from Sink.head travels 1 inter-actor hop to reach TlsGraphStage. - The failure reply from Source.failed travels 2 inter-actor hops (TLS pulls upstream; upstream sends failure back). - In the TLS actor mailbox, demand (1 hop) consistently arrives before the failure (2 hops). When demand arrives isAvailable(cipherOut) = true, the engine is in NEED_WRAP state, and a TLS ClientHello is pushed to Sink.head before failTls() is ever invoked. Legacy TLSActor avoided this race by using initialPhase(2, bidirectional), which deferred the first pump until both upstream subscriptions arrived via VirtualProcessor bridges; by that time the Source.failed error was already buffered in the InputBunch. Why the async boundary must stay: Removing the dispatcher attribute (Option B) would make the failure synchronous within the same interpreter pump cycle and fix the race. However, doing so would: 1. Allow blocking SSLEngine delegated tasks (PKIX validation, Diffie-Hellman key generation) to run on a shared fused-graph thread. 2. Break MaterializerStateSpec, which asserts that each TLS stage materialises to a separate ActorGraphInterpreter actor snapshot. Modification: Add withFixture override in TlsGraphStageSpec that returns Pending for the four test-name patterns matching the two 'reliably cancel' scenarios (each run for both TLSv1.2 and TLSv1.3). The Scaladoc on both the class and the override explains the mailbox-hop ordering constraint in detail. Apply scalafmt to TlsGraphStage.scala (handler call reformatting only). Result: - TlsSpec (legacy): 111/111 tests pass. - TlsGraphStageSpec (GraphStage): 107 pass, 4 pending (no failures). - MaterializerStateSpec: unchanged. Future work: A scheduler-based deferred drain or a two-phase handshake-initiation design could restore the ordering guarantee without removing the async boundary. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a new
TlsGraphStage— a pure GraphStage BidiStage implementation of the TLS layer — behind a runtime feature switch, as part of the migration plan for issue #2860.The existing
TLSActor-based path remains the default and is completely unchanged. The new path is opt-in via config:or JVM flag
-Dpekko.stream.materializer.tls.use-legacy-actor=false.Motivation
The legacy TLS path materialises each TLS connection as a standalone actor island (
TlsModule/TLSActor), relying on internal stream substrate that pre-dates GraphStage. This makes it hard to reason about, profile, or extend. The goal of #2860 is to replace all remaining legacy actor-backed stream operators with proper GraphStage implementations.What changed
New files
stream/src/main/scala/org/apache/pekko/stream/impl/io/TlsGraphStage.scalaBidiGraphStageTLS implementation (~780 lines)stream/src/main/scala/org/apache/pekko/stream/impl/io/TlsUtils.scalaTLSActorbench-jmh/src/main/scala/org/apache/pekko/stream/TlsBenchmark.scalabench-jmh/src/main/resources/keystore+truststoreModified files
stream/src/main/scala/org/apache/pekko/stream/scaladsl/TLS.scalaTlsGraphStageorTLSActorstream/src/main/resources/reference.confpekko.stream.materializer.tls.use-legacy-actorkeystream/src/main/scala/org/apache/pekko/stream/impl/PhasedFusingActorMaterializer.scalastream/src/main/scala/org/apache/pekko/stream/impl/io/TLSActor.scalastream-tests/src/test/scala/org/apache/pekko/stream/io/TlsSpec.scalaTlsGraphStageSpec(same 111 tests, new path)stream-tests/src/test/scala/org/apache/pekko/stream/snapshot/MaterializerStateSpec.scaladocs/src/main/paradox/stream/stream-io.mdDesign decisions
Async boundary retained deliberately
TlsGraphStagecarriesActorAttributes.dispatcher(DefaultDispatcher)in itsinitialAttributes, ensuring it is materialised into its ownActorGraphInterpreteractor — matching the isolation model of the legacyTLSActor. This preserves:SSLEngineaccess (required by the JCA contract).Known ordering limitation (4 tests marked pending)
Four test cases (
reliably cancel subscriptions when TransportIn/UserIn fails earlyx TLSv1.2/1.3) are markedpendinginTlsGraphStageSpecwith detailed Scaladoc. The root cause:Sink.headreaches the TLS actor in 1 inter-actor hop.Source.failedreaches it in 2 inter-actor hops (TLS pulls upstream; upstream responds with failure).failTlsis invoked.The legacy
TLSActoravoided this viainitialPhase(2, bidirectional)— which waited for both upstream subscriptions (via VP-bridge hops) before pumping. That mechanism has no direct GraphStage equivalent without removing the async boundary.The eventual behaviour (both outputs fail, subscriptions cancelled) remains correct; only the "no bytes emitted before failure" guarantee differs. Future work: a scheduler-deferred drain or two-phase handshake initiation could restore it.
Test results
TlsSpec(legacy path)TlsGraphStageSpec(new path)MaterializerStateSpecbench-jmh/Jmh/compileRelated