Defer DTLS handshake consumption until parse succeeds#144
Conversation
|
Alternative take on #143 @zRedShift I felt 143 was going in a very complex direction. I try again making it smaller. |
zRedShift
left a comment
There was a problem hiding this comment.
Didn't have time to look into this, sorry for the delay. Here's a Codex/GPT-5.5 assisted review.
Findings
High: Extension bodies are still validated after handshake consumption
This PR moves set_handled() and transcript append until after Body::parse(), but Body::parse() still does not validate all known extension bodies. The Extension::parse() helpers keep raw extension data ranges, and the known extension body parsers still run later in the client/server state handlers. That leaves the core #142 failure mode intact: malformed known extension bodies can still be parsed after the handshake has already been consumed.
Evidence:
dimpl/src/dtls12/message/extension.rs
Lines 15 to 41 in 09dafc4
dimpl/src/dtls13/message/extension.rs
Lines 12 to 38 in 09dafc4
dimpl/src/dtls12/message/handshake.rs
Lines 187 to 206 in 09dafc4
dimpl/src/dtls13/message/handshake.rs
Lines 211 to 237 in 09dafc4
Lines 422 to 442 in 09dafc4
Lines 449 to 496 in 09dafc4
In DTLS 1.2, next_handshake() also advances peer_handshake_seq_no before those later extension parsers run:
Lines 664 to 672 in 09dafc4
Suggested fix: validate all known extension bodies before committing handled/transcript/sequence state, or store already-validated extension-body data in the parsed handshake body so the state handlers no longer need to run fallible extension parsers after consumption.
Suggested regression: malformed known extension body (use_srtp, supported_groups, key_share, or cookie) followed by a clean retransmission with the same handshake key. The test should prove that only the clean message reaches transcript/sequence state and that the handshake can continue.
High: Non-Finished parse leftovers are still committed
The new consume gate also treats Body::parse() success as enough even when the parser leaves trailing bytes. Both defragment paths only reject non-empty rest for Finished; other known handshake bodies can return a valid prefix plus leftover bytes and still get marked handled and appended to the transcript.
Evidence:
dimpl/src/dtls12/message/handshake.rs
Lines 187 to 206 in 09dafc4
dimpl/src/dtls13/message/handshake.rs
Lines 211 to 237 in 09dafc4
Examples of parsers that can return leftover bytes after consuming a declared inner vector:
dimpl/src/dtls12/message/client_hello.rs
Lines 176 to 185 in 09dafc4
dimpl/src/dtls13/message/encrypted_extensions.rs
Lines 17 to 48 in 09dafc4
dimpl/src/dtls12/message/certificate.rs
Lines 19 to 42 in 09dafc4
dimpl/src/dtls13/message/certificate.rs
Lines 26 to 70 in 09dafc4
Suggested fix: make the pre-consumption validation require exact whole-body consumption for every known non-opaque handshake body, not only Finished. If any known body parser leaves trailing bytes, handle it through the same discard/replacement path as other transient malformed input.
Suggested regression: non-Finished handshake body with a valid prefix plus trailing bytes, followed by a clean retransmission. The malformed prefix should not reach transcript/sequence state, and the clean retransmission should remain processable.
High: Failed body parses can leave an unhandled queue entry that blocks recovery
For malformed handshakes rejected before the post-parse commit point (Body::parse() errors or the Finished exact-tail check), the PR now returns before marking the candidate fragments handled. The public packet paths swallow transient parser errors, so the bad queued entry remains visible. A later clean retransmission with the same (message_seq, fragment_offset) can then be dropped as a duplicate instead of replacing or bypassing the failed candidate.
Evidence:
dimpl/src/dtls12/message/handshake.rs
Lines 187 to 195 in 09dafc4
Lines 298 to 315 in 09dafc4
dimpl/src/dtls13/message/handshake.rs
Lines 211 to 228 in 09dafc4
Lines 420 to 450 in 09dafc4
Lines 687 to 693 in 09dafc4
Suggested fix: keep transcript mutation deferred, but make failed assembled-body parses discard or mark only the failed candidate fragments as handled, or track parse-failed queue entries so an exact clean retransmission can replace them. Be careful with fragmented messages: replacing only the first fragment is not enough if stale later fragments still participate in defragmentation.
Suggested regression: a complete same-length malformed ClientHello/Finished/KeyUpdate candidate that fails body parsing, followed by a corrected retransmission with the same handshake key. The connection should progress and the transcript should contain only the corrected message.
Medium: The new stack cap counts a different unit than the existing defragment cap
MAX_DEFRAGMENT_HANDSHAKES caps flattened handshake entries, but the existing defragment visibility cap is expressed in records, and each record can contain multiple parsed handshakes. That introduces a new protocol-visible limit without boundary coverage.
Evidence:
dimpl/src/dtls12/message/handshake.rs
Lines 22 to 24 in 09dafc4
dimpl/src/dtls13/message/handshake.rs
Lines 14 to 16 in 09dafc4
Lines 593 to 600 in 09dafc4
Lines 729 to 735 in 09dafc4
Lines 268 to 270 in 09dafc4
Lines 353 to 355 in 09dafc4
Suggested fix: either derive the handled-fragment cap from the same unit as the caller cap, stop defragmenting once the target handshake is complete, or add boundary tests showing that valid coalesced/fragmented traffic is not rejected at the new limit.
|
Thanks for the detailed review. I went through the findings one by one and kept the scope intentionally below the broader rollback/transactionality direction from #143.
Accepted/documented rather than expanded. The malformed known extension payload is still rejected by the client/server state handler; the remaining problem is recovery from a transiently corrupted datagram after the handshake body/envelope has already been consumed. Pulling all known extension payload parsing into the handshake parser, or storing parsed extension payloads in the body, makes this PR drift toward the broader transactionality work. I added comments in the DTLS 1.2 and DTLS 1.3 defragment commit points documenting this parser/state-handler boundary and the accepted recovery edge.
Fixed. The exact-consumption check now applies to every recognized handshake message type, not only Finished. Unknown handshake types remain opaque. Added DTLS 1.2 and DTLS 1.3 coverage for known non-Finished bodies with trailing bytes, proving they are rejected before handled/transcript mutation.
Fixed with a local parser-layer discard. Once defragmentation has assembled a complete candidate, body parse failure or exact-consumption failure marks only the candidate fragments handled before returning the error. Transcript is not written and sequence state is not advanced. This reuses the existing stack ArrayVec of selected fragments and avoids queue rebuild/rollback state.
Accepted/documented as a sanity cap. The cap is 50 flattened fragments for one defragmentation attempt; exceeding that implies pathologically tiny records for ordinary handshake sizes. I added comments by the cap in both protocol versions clarifying that it intentionally does not mirror the receive queue record-count cap. Verification run:
|
|
@zRedShift I think we're ready to merge and close this out. |
zRedShift
left a comment
There was a problem hiding this comment.
From Codex's re-review:
Findings
High: clean same-key retransmissions can still be dropped after parser-layer discard
The new parser-layer discard marks the failed assembled candidate handled before returning from Body::parse() or exact-tail failures:
dimpl/src/dtls12/message/handshake.rs
Lines 191 to 208 in d7e7cd7
dimpl/src/dtls13/message/handshake.rs
Lines 215 to 246 in d7e7cd7
That avoids transcript/sequence mutation, but it does not reliably let a clean retransmission replace or bypass the discarded candidate.
In DTLS 1.2, duplicate insertion still keys only on (message_seq, fragment_offset), and the Ok(_) duplicate branch drops the retransmission without checking whether the existing candidate is already handled:
Lines 298 to 315 in d7e7cd7
The consumer then skips handled records/handshakes:
Lines 593 to 603 in d7e7cd7
purge_handled_queue_rx() does not fully save this path because it only runs from poll_output(), and only removes front entries whose records are fully handled. A clean retransmission that arrives before the next poll/purge is still dropped:
Lines 413 to 416 in d7e7cd7
Lines 470 to 478 in d7e7cd7
DTLS 1.3 has an analogous same-record case. Replacement checks existing.first().is_handled(), but Record::is_handled() is false until all handshakes in that record are handled:
Lines 420 to 450 in d7e7cd7
Lines 329 to 334 in d7e7cd7
So if the failed candidate is the first handshake in a same-record flight and later handshakes remain unhandled, a clean retransmission of the first candidate is not admitted. The parser discard only handled the selected prefix; the whole record is not handled.
Suggested fix: make duplicate replacement/purge candidate-aware rather than whole-record aware. In particular, DTLS 1.2 should be able to replace or remove handled same-key handshake fragments, and DTLS 1.3 should not require the whole existing record to be handled before replacing a handled same-key handshake candidate. Please add recovery tests where a malformed assembled candidate is followed by a clean retransmission with the same (message_seq, fragment_offset).
High: the PR still overclaims Fixes #142
Issue #142 asks that the full handshake body, including inner extension payloads, be validated before consumption/transcript/state mutation. The updated PR explicitly accepts that known extension payload parsing still happens later in the client/server state handlers:
dimpl/src/dtls12/message/handshake.rs
Lines 211 to 216 in d7e7cd7
dimpl/src/dtls13/message/handshake.rs
Lines 249 to 254 in d7e7cd7
That scope choice is understandable for keeping this PR smaller, but it means this PR does not fully close #142 as written. The PR body still says Fixes #142, and also says the unit coverage proves parse failures do not mark handshakes handled, while current tests assert the new discard behavior does mark them handled:
dimpl/src/dtls12/message/handshake.rs
Line 767 in d7e7cd7
dimpl/src/dtls13/message/handshake.rs
Line 682 in d7e7cd7
Suggested fix: either remove/narrow the Fixes #142 claim and leave the extension-payload recovery part open, or pull extension-payload validation before consumption and add the clean-retransmission regressions for that path too. The PR body should also describe the current behavior as “failed complete candidates are discarded without transcript/sequence mutation”, not “not marked handled.”
Validation
Local checks passed:
cargo fmt --check
git diff --check 32e3b02f6995969014b35aed97eca0889f623d0c...HEAD
/home/ronen/.codex/skills/dimpl/scripts/check-snowflake-local.pl 32e3b02f6995969014b35aed97eca0889f623d0c
cargo test handshake --features rcgen
cargo test --all-targets --features rcgen
cargo clippy --all-targets --features rcgen -- -D warnings
cargo test --no-default-features --features rust-crypto
cargo clippy --no-default-features --features rust-crypto -- -D warnings
cargo test --doc --features rcgen
|
@zRedShift yeah, both of those comments are invalid given that I accept the UDP resend issue described. |
Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Codex <codex@openai.com>
497064c to
5a2db0c
Compare
|
To summarize I disagree with these two as blockers. For the extension-body case, the remaining problem is recovery from UDP corruption after the handshake envelope has already parsed. That can happen, but it is very much an edge case. Moving all known extension payload parsing into the handshake parser, or making the later state-handler parsing transactional, feels like a lot of machinery for that case. The packet still errors. It just errors later, when the state handler parses the known extension payload. For the defragmentation case, I don't think we should try to preserve old fragments and mix them with later retransmitted fragments. Once a complete assembled candidate fails to parse, we don't know which fragment was bad. A clean recovery is a resend of the handshake message or flight, not trying combinations of fragments from attempt one and attempt two until something parses. That would be absurd. Also, this is Sans-IO. After |
Summary: