feat: flag valid-UTF-8 but non-ASCII SNI as RFC 6066 violation#55
Merged
feat: flag valid-UTF-8 but non-ASCII SNI as RFC 6066 violation#55
Conversation
Closes #51. PR #49 surfaced non-UTF-8 SNI bytes as a finding but deliberately left a related RFC 6066 §3 violation unflagged: SNI hostnames that are valid UTF-8 but contain non-ASCII codepoints, e.g. raw U-labels like "café.example" or "пример.example". The spec requires HostName to be ASCII (with internationalized names sent as A-labels per RFC 5890 Punycode `xn--…` form), so a non-ASCII byte on the wire is a real protocol violation. Real-world false-positive risk ------------------------------ Validated via Perplexity that all major TLS clients auto-Punycode internationalized hostnames before sending the SNI: - rustls — rejects raw UTF-8 at the ServerName API level, requires Punycode upfront - Chrome / BoringSSL — always Punycode (per RFC 3492 + 6066) - Firefox / NSS — always Punycode (URL bar may display Unicode for user-friendliness, but the wire is always Punycode) - curl / libcurl — auto-converts via libidn2 before calling OpenSSL - OpenSSL raw — passes verbatim, but expects ASCII from the caller; applications using OpenSSL directly without IDNA prep are the only legitimate path to a raw U-label SNI on the wire So the false-positive surface is small: a buggy custom client, an attacker tool, or an OpenSSL-direct app that forgot to Punycode. Severity is `Anomaly` / `Inconclusive` / `Low` — same rationale as the non-UTF-8 finding. SniValue enum ------------- Renamed `Utf8(String)` → `Ascii(String)` (the old name was misleading since ASCII is also valid UTF-8) and added a new variant: enum SniValue { Ascii(String), // RFC compliant NonAsciiUtf8 { hostname: String, hex: String }, // valid UTF-8, non-ASCII NonUtf8 { lossy: String, hex: String }, // invalid UTF-8 (existing) } `extract_sni` now uses `s.is_ascii()` to split the UTF-8 case into `Ascii` and `NonAsciiUtf8`. The hostname for `NonAsciiUtf8` is the decoded String (always valid UTF-8 by definition); `hex` is the lossless representation for forensic evidence. `handle_client_hello` keys `sni_counts` on the hostname directly for both Ascii and NonAsciiUtf8 (no collision risk for valid UTF-8) and keeps the existing `<non-utf8:HEX>` tagged form for NonUtf8. The finding summary uses `{hostname:?}` Debug formatter to escape any control codepoints that might survive UTF-8 decoding (e.g. U+0085 NEL). Tests ----- - Flipped `test_valid_utf8_non_ascii_sni_currently_not_flagged` to `test_valid_utf8_non_ascii_sni_emits_finding` — asserts the new finding fires for "café.example" with the right severity, RFC reference in summary, and hex evidence (`636166c3a92e6578616d706c65`). - Added `test_cyrillic_sni_emits_non_ascii_finding` for "пример.example". - Added `test_emoji_sni_emits_non_ascii_finding` for "🦀.example" (4-byte UTF-8 codepoint, all bytes ≥ 0x80). - Added `test_punycode_a_label_does_not_emit_non_ascii_finding` for "xn--caf-dma.example" — pins that the RFC-compliant Punycode form is NOT flagged.
There was a problem hiding this comment.
Pull request overview
This PR tightens TLS SNI analysis to flag RFC 6066 violations where the SNI hostname bytes are valid UTF-8 but include non-ASCII characters (raw U-labels), while preserving existing handling for non-UTF-8 SNI.
Changes:
- Refines SNI decoding by splitting valid SNI into
AsciivsNonAsciiUtf8(with hex evidence), and emits anAnomaly / Inconclusive / Lowfinding for the non-ASCII UTF-8 case. - Updates SNI counting to key
AsciiandNonAsciiUtf8by the decoded hostname string, while keeping the existing hex-tag keying for non-UTF-8. - Expands and flips tests to assert the new non-ASCII SNI finding, including Cyrillic/emoji positives and a Punycode A-label regression case.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/tls_analyzer_tests.rs | Flips the prior pin-test and adds new cases to validate non-ASCII UTF-8 SNI finding behavior and Punycode non-flagging. |
| src/analyzer/tls.rs | Implements NonAsciiUtf8 detection and finding emission; renames the ASCII-happy-path variant and updates SNI keying logic accordingly. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Addresses Copilot review on PR #55: "a RFC 6066" → "an RFC 6066". RFC is pronounced "arr-eff-see", which starts with a vowel sound, so the indefinite article is "an" — matches IETF style throughout the RFC corpus itself.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Anomaly/Inconclusive/Lowfinding for TLS SNI hostnames that are valid UTF-8 but contain non-ASCII codepoints (e.g. raw U-labels like"café.example","пример.example","🦀.example"). RFC 6066 §3 requiresHostNameto be ASCII, with internationalized names sent as A-labels per RFC 5890 (xn--…Punycode form). Closes TLS: flag valid-UTF-8 but non-ASCII SNI as RFC 6066 violation #51.SniValue::Utf8(String)→SniValue::Ascii(String)(the old name was misleading — ASCII is also valid UTF-8) and adds a newNonAsciiUtf8 { hostname, hex }variant. The existingNonUtf8variant is unchanged.test_valid_utf8_non_ascii_sni_currently_not_flagged→test_valid_utf8_non_ascii_sni_emits_finding) and adds 3 new positive cases (Cyrillic, emoji, Punycode regression).code-reviewer) applied pre-merge: zero critical, zero important, all suggestions explicit defers.Real-world false-positive risk
Validated with Perplexity that all major TLS clients auto-Punycode internationalized hostnames before sending the SNI:
ServerNameAPI level, requires Punycode upfrontSo the false-positive surface is small: a buggy custom client, an OpenSSL-direct app that forgot to Punycode, or an attacker tool. Severity is
Anomaly/Inconclusive/Low— same rationale as the non-UTF-8 finding from #49.Implementation
extract_sniusess.is_ascii()to split the UTF-8 case.handle_client_hellouses an exhaustivematchon the variants, so a future variant addition (e.g. for #54's ASCII-control-code case) will fail to compile until handled. Map-keying forNonAsciiUtf8is the bare hostname (no<non-ascii:HEX>tagging needed — valid UTF-8 strings have no collision risk becausefrom_utf8is a bijection on well-formed bytes).Findings shape
The new finding mirrors the existing
NonUtf8finding pattern for consistency:category: Anomaly,verdict: Inconclusive,confidence: Low"TLS SNI contains non-ASCII characters (RFC 6066 requires A-labels per RFC 5890): {hostname:?}"—{:?}Debug formatter to escape any control codepoints that might survive UTF-8 decoding (e.g. U+0085 NEL)vec![format!("hex: {hex}")]for forensic byte-level reviewmitre_technique: None(matches existing TLS findings; no clean ATT&CK technique for protocol-format violations)Test plan
cargo test --test tls_analyzer_tests— 21/21 pass (~650ms; was 18 before)cargo test— full suite greencargo clippy --all-targets -- -D warnings— cleancargo fmt --check— cleancode-reviewer): zero critical, zero important; one defer-only suggestion (S1 — ASCII control codes) filed as TLS: flag SNI hostnames containing ASCII control codes (BEL/ESC/DEL/C0) #54New tests
"café.example", asserts category/verdict/confidence/RFC mention/A-label mention/hex evidence (636166c3a92e6578616d706c65)."пример.example"(Cyrillic 2-byte UTF-8 sequences)."🦀.example"(4-byte UTF-8 codepoint)."xn--caf-dma.example"regression: the RFC-compliant Punycode form is pure ASCII and must NOT be flagged.