fix: surface non-UTF-8 SNI hostnames instead of dropping #49

Merged: Zious11 merged 2 commits into develop from worktree-tls-non-utf8-sni, Apr 8, 2026
Conversation

Zious11 (Owner) commented Apr 8, 2026

Summary

  • extract_sni() previously used String::from_utf8(...).ok() which silently dropped any TLS SNI hostname containing non-UTF-8 bytes. This is wrong for a forensics tool: per RFC 6066 §3 the HostName field must be ASCII (internationalized names use A-labels per RFC 5890), so non-UTF-8 SNI is a real protocol violation worth surfacing — even if usually a buggy client rather than a deliberate evasion.
  • New SniValue enum (Utf8(String) / NonUtf8 { lossy, hex }) preserves both a human-readable form and the lossless hex; a sketch follows this list.
  • handle_client_hello now keys sni_counts on a tagged <non-utf8:HEX> form for the NonUtf8 case (uniqueness-preserving) and emits one Anomaly / Inconclusive / Low finding per non-UTF-8 SNI, with the lossy form in the summary and hex in the evidence.
  • Closes #28 (TLS: surface non-UTF-8 SNI hostnames instead of silently dropping).
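A minimal sketch of the decode split described above, assuming the variant shapes listed; `decode_sni` is an illustrative helper name, not the PR's literal code:

```rust
enum SniValue {
    Utf8(String),                           // hostname decoded cleanly
    NonUtf8 { lossy: String, hex: String }, // display form + lossless hex
}

fn decode_sni(bytes: &[u8]) -> SniValue {
    match std::str::from_utf8(bytes) {
        Ok(s) => SniValue::Utf8(s.to_owned()),
        Err(_) => SniValue::NonUtf8 {
            lossy: String::from_utf8_lossy(bytes).into_owned(),
            hex: bytes.iter().map(|b| format!("{b:02x}")).collect(),
        },
    }
}
```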

Why hex-keyed counts (not lossy-keyed)

Using String::from_utf8_lossy as a sni_counts map key would let two distinct byte sequences collide if their U+FFFD-replaced forms align — bad for a forensic counter. Hex keying guarantees distinct byte sequences stay distinct, is terminal-safe (no ANSI injection vector), JSON-safe, and visually obvious to analysts.
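The collision is easy to demonstrate (a minimal std-only illustration):

```rust
// Two distinct invalid byte sequences collapse to the same lossy form...
assert_eq!(String::from_utf8_lossy(&[0xff]), String::from_utf8_lossy(&[0xfe]));
// ...while their hex-tagged keys stay distinct.
assert_ne!("<non-utf8:ff>", "<non-utf8:fe>");
```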

Severity rationale

  • ThreatCategory::Anomaly — it's a spec violation.
  • Verdict::Inconclusive + Confidence::Low — Perplexity validation + judgment: legitimate TLS clients occasionally produce malformed SNI from bugs; non-UTF-8 SNI is not a documented C2 / DGA technique on its own. Surface for analyst review, don't scream malicious.
  • mitre_technique: None — T1001.001 (Junk Data) and T1001.003 (Protocol Impersonation) are the closest candidates; neither is a clean fit. Matches the existing weak-cipher and SSLv3 findings in tls.rs, which also use None.

Out of scope (intentional)

  • Valid-UTF-8 but non-ASCII SNI (e.g. raw U-labels) — also an RFC 6066 violation, but a separate detection worth its own issue.
  • parse_error_count is intentionally NOT incremented — the TLS record parsed cleanly; only one extension field failed UTF-8 decoding. The counter elsewhere in the file specifically means "the record itself was unparseable".

Test plan

  • cargo test --test tls_analyzer_tests — all 12 tests pass, including the 2 new ones.
  • cargo test — full suite green.
  • cargo clippy --all-targets -- -D warnings — clean.
  • cargo fmt --check — clean.
  • Code review pass (pr-review-toolkit:code-reviewer) — no critical or important findings; all suggestions were explicit defers, with a follow-up issue to be filed for the remaining test-coverage gaps.

New tests

  • test_non_utf8_sni_emits_finding_and_counts_under_hex_key — builds a ClientHello with [0xff, 0xfe, "a.com"] SNI bytes, asserts one finding with the right severity/category, and verifies sni_counts contains the hex-tagged key <non-utf8:fffe612e636f6d> (derived by hand in the sketch after this list).
  • test_ascii_sni_does_not_emit_non_utf8_finding — regression: clean ASCII hostname must not trip the new finding.
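For reference, the hex tag in that first test can be derived by hand (a std-only sketch):

```rust
// 0xff 0xfe followed by "a.com" = 61 2e 63 6f 6d in ASCII.
let mut sni: Vec<u8> = vec![0xff, 0xfe];
sni.extend_from_slice(b"a.com");
let hex: String = sni.iter().map(|b| format!("{b:02x}")).collect();
assert_eq!(format!("<non-utf8:{hex}>"), "<non-utf8:fffe612e636f6d>");
```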

Closes #28. extract_sni() previously used String::from_utf8(...).ok()
which silently discarded any SNI hostname containing non-UTF-8 bytes.
For a forensics tool that's the wrong default — non-UTF-8 SNI is a
RFC 6066 §3 protocol violation (HostName MUST be ASCII; internationalized
names use A-labels per RFC 5890) and a possible (low-confidence)
malware/buggy-client indicator worth surfacing to analysts.

Approach
--------
- New SniValue enum: Utf8(String) for the clean case, NonUtf8 { lossy,
  hex } when from_utf8 fails. The lossy form is for human display
  (U+FFFD replacement); the hex form is the lossless evidence record.
- handle_client_hello keys sni_counts on a tagged "<non-utf8:HEX>" form
  for the NonUtf8 case rather than the lossy string. This is critical:
  using the lossy string as a key would let two distinct byte sequences
  collide if their U+FFFD-replaced forms happen to align — bad for a
  forensics counter.
- Emit one Anomaly / Inconclusive / Low finding per non-UTF-8 SNI, with
  the lossy form in the summary and the hex in the evidence vector. No
  MITRE technique attached (T1001.001/.003 don't fit cleanly; matches
  the existing weak-cipher and SSLv3 findings in this file which also
  use None).
- parse_error_count is intentionally NOT incremented — the TLS record
  parsed cleanly, only one extension field failed UTF-8 decoding.

Tests
-----
- New build_client_hello_raw_sni helper that takes &[u8] for the SNI,
  reusing the existing builder for the rest of the record.
- test_non_utf8_sni_emits_finding_and_counts_under_hex_key: builds a
  ClientHello with [0xff, 0xfe, "a.com"] as the SNI bytes, asserts
  one finding with the right severity/category/MITRE and verifies
  sni_counts contains the hex-tagged key.
- test_ascii_sni_does_not_emit_non_utf8_finding: regression — clean
  ASCII hostname must not trip the new finding.

Scope intentionally excludes valid-UTF-8-but-non-ASCII SNI (e.g. raw
U-labels). Those are also an RFC 6066 violation but are a separate
detection worth its own issue.

Copilot AI left a comment

Pull request overview

This PR updates the TLS analyzer to preserve and surface malformed (non-UTF-8) SNI hostnames instead of silently dropping them, improving forensic visibility into protocol violations (per RFC 6066) and ensuring distinct byte sequences remain distinguishable in aggregation.

Changes:

  • Introduces an internal SniValue enum to represent decoded SNI as either valid UTF-8 or non-UTF-8 (lossy + hex).
  • Updates extract_sni() / ClientHello handling to count non-UTF-8 SNI using a hex-tagged key and emit a low-confidence anomaly finding with hex evidence.
  • Adds tests for non-UTF-8 SNI counting + finding emission and a regression test ensuring normal ASCII SNI does not trigger the new finding.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/analyzer/tls.rs | Adds SniValue, changes SNI extraction to preserve non-UTF-8 bytes, counts them under a hex-tagged key, and emits a new finding. |
| tests/tls_analyzer_tests.rs | Adds a raw-SNI ClientHello builder and new tests covering non-UTF-8 SNI behavior. |


Comment thread on src/analyzer/tls.rs (marked outdated)
Addresses Copilot review on PR #49. String::from_utf8_lossy only
replaces invalid UTF-8 *sequences* with U+FFFD — it preserves
valid-but-control bytes like ESC (0x1b) verbatim. Interpolating the
lossy form into the finding summary with {} therefore left a terminal
injection vector: a malformed SNI like b"\x1b[31mpwnd" would render
in red on any terminal showing the finding.

Switch the lossy interpolation to {:?} (Debug formatter), which
escapes ESC and other control characters as the literal sequence
\u{1b}. The hex evidence is unchanged — that remains the lossless
record. No behavior change for SNIs whose lossy form is purely
printable text.

Verified the fix empirically: format!("{:?}", "\u{fffd}\x1b[31mpwnd")
contains no raw 0x1b byte in the output, while format!("{}", ...)
does.
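A std-only repro of that empirical check (illustrative):

    let lossy = "\u{fffd}\x1b[31mpwnd";
    let debug = format!("{lossy:?}");
    assert!(!debug.contains('\u{1b}')); // no raw ESC byte in the output
    assert!(debug.contains("\\u{1b}")); // the escaped form appears instead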

Tests
-----
- New test_non_utf8_sni_escapes_control_bytes_in_summary: builds a
  ClientHello with raw ESC + ANSI CSI in the SNI, asserts the finding
  summary does NOT contain a literal 0x1b byte but DOES contain the
  escaped "\u{1b}" sequence, and verifies the hex evidence still
  carries the original bytes losslessly.
Zious11 merged commit 628b646 into develop on Apr 8, 2026 (4 checks passed).
Zious11 deleted the worktree-tls-non-utf8-sni branch on April 8, 2026 at 16:08.
Zious11 added a commit that referenced this pull request Apr 8, 2026
Closes #50. Adds 5 pin-tests for SNI edge cases that the existing
non-UTF-8 SNI tests (from #49) don't reach. Each new test documents
the *current* analyzer behavior so a future refactor can't silently
change it.

New tests
---------
1. test_sni_extension_with_empty_hostname_list — SNI extension present
   but the ServerNameList has zero entries. extract_sni returns None,
   no finding, no count increment, handshake still parses. (Confirmed
   tls_parser accepts zero-entry lists at parse time.)
2. test_sni_with_empty_hostname_bytes — SNI hostname of b"". Decodes
   via from_utf8 as Ok(""), counted under the empty-string key, no
   finding. Degenerate but technically valid wire form.
3. test_valid_utf8_non_ascii_sni_currently_not_flagged — A raw U-label
   like "café.example" is also a RFC 6066 §3 violation (spec requires
   A-label Punycode form), but the current analyzer does not flag it.
   Pins current behavior; will be flipped when issue #51 lands.
4. test_multi_name_sni_list_only_first_entry_counted — SNI extension
   with 3 ServerName entries. Only the first is counted in sni_counts
   (matching extract_sni's list.first() behavior).
5. test_non_utf8_sni_finding_fires_when_sni_counts_at_capacity — fills
   sni_counts to MAX_MAP_ENTRIES (50k unique valid hostnames) via the
   public API, then sends a non-UTF-8 SNI ClientHello. Verifies that
   the new non-UTF-8 key is dropped at the cap but the finding still
   fires. This tests a real invariant: the finding push and the count
   increment are independent, and a refactor nesting one inside the
   other would silently break forensic visibility past the cap.
   Runtime ~660ms in debug builds (acceptable).
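The invariant, sketched with hypothetical surrounding code (names other
than sni_counts/MAX_MAP_ENTRIES are illustrative, not the analyzer's
literal code):

    // Count insertion respects the cap...
    if sni_counts.len() < MAX_MAP_ENTRIES {
        *sni_counts.entry(key).or_insert(0) += 1;
    }
    // ...but the finding push must not be nested inside that check.
    findings.push(non_utf8_finding);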

Builder refactor
----------------
Extracted build_client_hello_with_sni_list from build_client_hello_raw_sni
to support zero and multi-entry SNI lists. build_client_hello_raw_sni
now delegates to the new helper for a single entry, preserving the
existing public API and all prior tests.

Out of scope
------------
Issue #51 (flag valid-UTF-8 but non-ASCII SNI) is filed separately
and is the natural follow-up that would flip test 3's assertion.
Zious11 added a commit that referenced this pull request Apr 8, 2026
## Summary
- Adds 5 pin-tests for SNI edge cases that the existing non-UTF-8 SNI
tests (from #49) don't reach. Each new test documents the *current*
analyzer behavior so a future refactor can't silently change it. Closes
#50.
- Refactors the test fixture builder to support 0-or-more-entry SNI
lists: extracted `build_client_hello_with_sni_list` from
`build_client_hello_raw_sni`, preserving the existing API for prior
tests.
- One test (#5) intentionally exercises the `MAX_MAP_ENTRIES` boundary
by feeding 50,000 unique ClientHellos through the public API. Runtime
~650ms in debug (budget ~2s for CI). No production code changes — this
is test-only.
- Code review (pr-test-analyzer) applied pre-merge: RFC 6066 §3 bounds
verified, silent-false-positive guard added to test 1, test 5's
finding-delta tightened to filter specifically for non-UTF-8 findings,
comment drift fixed.
- S4 follow-up (non-zero `NameType`, max-length SNI, trailing bytes)
filed as #52.

## New tests
1. **test_sni_extension_with_empty_hostname_list** — `ServerNameList`
with zero entries. `extract_sni` returns `None`, no finding, no count,
handshake still parses. Guards against tls_parser drift via an explicit
`parse_error_count == 0` assertion (empty-list wire form is technically
RFC-violating per `server_name_list<1..2^16-1>`; a future tls_parser
strictness tightening would otherwise let this test pass for the wrong
reason).
2. **test_sni_with_empty_hostname_bytes** — one entry but the hostname
bytes are `b""`. Decodes via `from_utf8` as `Ok("")`, counted under the
empty-string key, no finding. Degenerate but valid wire form.
3. **test_valid_utf8_non_ascii_sni_currently_not_flagged** —
`"café.example"` (a raw U-label, also a RFC 6066 §3 violation per the
A-label requirement). Pins current no-finding behavior; the assertion
includes the explicit #51 URL so the cross-reference survives grep and
will fire loud when that issue lands.
4. **test_multi_name_sni_list_only_first_entry_counted** — 3 ServerName
entries; only the first is counted, matching `extract_sni`'s
`list.first()` behavior.
5. **test_non_utf8_sni_finding_fires_when_sni_counts_at_capacity** —
fills `sni_counts` to `MAX_MAP_ENTRIES = 50_000` via 50k unique
ClientHellos through the public API, then sends a non-UTF-8 SNI
ClientHello. Verifies the new non-UTF-8 key is dropped at the cap but
the finding still fires. Pins a real invariant: finding emission is
independent of count insertion. Delta is filtered by
`summary.contains("non-UTF-8 bytes")` so the pin stays tight under
refactors.

## Why brute-force for test 5 (not a test helper)
Reaching `MAX_MAP_ENTRIES = 50_000` via the public API was chosen over
adding a `#[cfg(test)] pub(crate)` test helper on `TlsAnalyzer` to
expose internal state. The trade-off was validated with Perplexity
(the Rust community tends to accept slow boundary tests over test-only
production surface) and empirically: debug runtime ~650ms, release
~120ms. The constant is hand-duplicated in the test but self-corrects
via the length assertion (any change to the cap would loudly fail).

## RFC validation
Before applying the code-reviewer's fixes, RFC 6066 §3 was fetched
directly to confirm the relevant spec claims:
- `ServerNameList server_name_list<1..2^16-1>` — minimum 1 entry (empty
lists are RFC-violating wire form, used in test 1 only to exercise the
analyzer's defensive branch)
- `enum { host_name(0), (255) } NameType` — `(255)` reserves room for
future types (informs the #52 follow-up for non-zero NameType testing)
- RFC 6066 is **silent** on duplicate SNI extensions, so that case was
dropped from the follow-up list

## Test plan
- [x] `cargo test --test tls_analyzer_tests` — 18/18 pass (~650ms total,
test 5 is ~650ms of that)
- [x] `cargo test` — full suite green
- [x] `cargo clippy --all-targets -- -D warnings` — clean
- [x] `cargo fmt --check` — clean
- [x] Code review pass (`pr-test-analyzer`): 1 important finding (I1
silent false-positive), 1 important (I2 panic message link), 3
suggestions — all applied before PR creation
- [x] RFC 6066 §3 claims validated via direct RFC fetch before applying
review fixes
Zious11 added a commit that referenced this pull request Apr 8, 2026
Closes #51. PR #49 surfaced non-UTF-8 SNI bytes as a finding but
deliberately left a related RFC 6066 §3 violation unflagged: SNI
hostnames that are valid UTF-8 but contain non-ASCII codepoints,
e.g. raw U-labels like "café.example" or "пример.example". The spec
requires HostName to be ASCII (with internationalized names sent as
A-labels per RFC 5890 Punycode `xn--…` form), so a non-ASCII byte on
the wire is a real protocol violation.

Real-world false-positive risk
------------------------------
Validated via Perplexity that all major TLS clients auto-Punycode
internationalized hostnames before sending the SNI:

- rustls — rejects raw UTF-8 at the ServerName API level, requires
  Punycode upfront
- Chrome / BoringSSL — always Punycode (per RFC 3492 + 6066)
- Firefox / NSS — always Punycode (URL bar may display Unicode for
  user-friendliness, but the wire is always Punycode)
- curl / libcurl — auto-converts via libidn2 before calling OpenSSL
- OpenSSL raw — passes verbatim, but expects ASCII from the caller;
  applications using OpenSSL directly without IDNA prep are the only
  legitimate path to a raw U-label SNI on the wire

So the false-positive surface is small: a buggy custom client, an
attacker tool, or an OpenSSL-direct app that forgot to Punycode.
Severity is `Anomaly` / `Inconclusive` / `Low` — same rationale as
the non-UTF-8 finding.

SniValue enum
-------------
Renamed `Utf8(String)` → `Ascii(String)` (the old name was misleading
since ASCII is also valid UTF-8) and added a new variant:

    enum SniValue {
        Ascii(String),                              // RFC compliant
        NonAsciiUtf8 { hostname: String, hex: String }, // valid UTF-8, non-ASCII
        NonUtf8 { lossy: String, hex: String },     // invalid UTF-8 (existing)
    }

`extract_sni` now uses `s.is_ascii()` to split the UTF-8 case into
`Ascii` and `NonAsciiUtf8`. The hostname for `NonAsciiUtf8` is the
decoded String (always valid UTF-8 by definition); `hex` is the
lossless representation for forensic evidence.

`handle_client_hello` keys `sni_counts` on the hostname directly for
both Ascii and NonAsciiUtf8 (no collision risk for valid UTF-8) and
keeps the existing `<non-utf8:HEX>` tagged form for NonUtf8. The
finding summary uses `{hostname:?}` Debug formatter to escape any
control codepoints that might survive UTF-8 decoding (e.g. U+0085 NEL).
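A sketch of that split (illustrative; the real extract_sni also walks
the extension structure before decoding):

    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    let value = match std::str::from_utf8(bytes) {
        Ok(s) if s.is_ascii() => SniValue::Ascii(s.to_owned()),
        Ok(s) => SniValue::NonAsciiUtf8 { hostname: s.to_owned(), hex },
        Err(_) => SniValue::NonUtf8 {
            lossy: String::from_utf8_lossy(bytes).into_owned(),
            hex,
        },
    };
    // "café.example" decodes Ok but fails is_ascii(), landing in
    // NonAsciiUtf8 with hex 636166c3a92e6578616d706c65.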

Tests
-----
- Flipped `test_valid_utf8_non_ascii_sni_currently_not_flagged` to
  `test_valid_utf8_non_ascii_sni_emits_finding` — asserts the new
  finding fires for "café.example" with the right severity, RFC
  reference in summary, and hex evidence (`636166c3a92e6578616d706c65`).
- Added `test_cyrillic_sni_emits_non_ascii_finding` for "пример.example".
- Added `test_emoji_sni_emits_non_ascii_finding` for "🦀.example"
  (4-byte UTF-8 codepoint, all bytes ≥ 0x80).
- Added `test_punycode_a_label_does_not_emit_non_ascii_finding` for
  "xn--caf-dma.example" — pins that the RFC-compliant Punycode form
  is NOT flagged.
Zious11 added a commit that referenced this pull request Apr 8, 2026
## Summary
- Adds a new `Anomaly` / `Inconclusive` / `Low` finding for TLS SNI
hostnames that are valid UTF-8 but contain non-ASCII codepoints (e.g.
raw U-labels like `"café.example"`, `"пример.example"`, `"🦀.example"`).
RFC 6066 §3 requires `HostName` to be ASCII, with internationalized
names sent as A-labels per RFC 5890 (`xn--…` Punycode form). Closes #51.
- Renames `SniValue::Utf8(String)` → `SniValue::Ascii(String)` (the old
name was misleading — ASCII is also valid UTF-8) and adds a new
`NonAsciiUtf8 { hostname, hex }` variant. The existing `NonUtf8` variant
is unchanged.
- Flips the pin-test from #50
(`test_valid_utf8_non_ascii_sni_currently_not_flagged` →
`test_valid_utf8_non_ascii_sni_emits_finding`) and adds 3 new positive
cases (Cyrillic, emoji, Punycode regression).
- Code review (`code-reviewer`) applied pre-merge: zero critical, zero
important, all suggestions explicit defers.
- Follow-up #54 filed for the related ASCII-control-code gap (S1 —
pre-existing, out of #51's scope).

## Real-world false-positive risk
Validated with Perplexity that all major TLS clients auto-Punycode
internationalized hostnames before sending the SNI:

| Client | Behavior |
|---|---|
| **rustls** | Rejects raw UTF-8 at the `ServerName` API level, requires Punycode upfront |
| **Chrome / BoringSSL** | Always Punycode (per RFC 3492 + 6066) |
| **Firefox / NSS** | Always Punycode (URL bar may display Unicode for user-friendliness, but the wire is always Punycode) |
| **curl / libcurl** | Auto-converts via libidn2 before calling OpenSSL |
| **OpenSSL raw API** | Passes verbatim — but expects ASCII from the caller; only legitimate path to a raw U-label SNI is a custom application using OpenSSL directly without IDNA prep |

So the false-positive surface is small: a buggy custom client, an
OpenSSL-direct app that forgot to Punycode, or an attacker tool.
Severity is `Anomaly` / `Inconclusive` / `Low` — same rationale as the
non-UTF-8 finding from #49.

## Implementation
```rust
enum SniValue {
    Ascii(String),                                  // RFC compliant
    NonAsciiUtf8 { hostname: String, hex: String }, // valid UTF-8, non-ASCII (NEW)
    NonUtf8 { lossy: String, hex: String },         // invalid UTF-8 (existing)
}
```

`extract_sni` uses `s.is_ascii()` to split the UTF-8 case.
`handle_client_hello` uses an exhaustive `match` on the variants, so a
future variant addition (e.g. for #54's ASCII-control-code case) will
fail to compile until handled. Map-keying for `NonAsciiUtf8` is the bare
hostname (no `<non-ascii:HEX>` tagging needed — valid UTF-8 strings have
no collision risk because `from_utf8` is a bijection on well-formed
bytes).
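A sketch of the map keying described above, assuming the enum as shown
(the real `handle_client_hello` does more than derive a key):

```rust
let key = match &value {
    SniValue::Ascii(h) => h.clone(),
    SniValue::NonAsciiUtf8 { hostname, .. } => hostname.clone(),
    SniValue::NonUtf8 { hex, .. } => format!("<non-utf8:{hex}>"),
};
```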

## Findings shape
The new finding mirrors the existing `NonUtf8` finding pattern for
consistency:
- `category: Anomaly`, `verdict: Inconclusive`, `confidence: Low`
- Summary: `"TLS SNI contains non-ASCII characters (RFC 6066 requires
A-labels per RFC 5890): {hostname:?}"` — `{:?}` Debug formatter to
escape any control codepoints that might survive UTF-8 decoding (e.g.
U+0085 NEL)
- Evidence: `vec![format!("hex: {hex}")]` for forensic byte-level review
- `mitre_technique: None` (matches existing TLS findings; no clean
ATT&CK technique for protocol-format violations)

## Test plan
- [x] `cargo test --test tls_analyzer_tests` — 21/21 pass (~650ms; was
18 before)
- [x] `cargo test` — full suite green
- [x] `cargo clippy --all-targets -- -D warnings` — clean
- [x] `cargo fmt --check` — clean
- [x] Code review pass (`code-reviewer`): zero critical, zero important;
one defer-only suggestion (S1 — ASCII control codes) filed as #54
- [x] Real-world false-positive risk Perplexity-validated against
rustls, OpenSSL, Chrome/BoringSSL, Firefox/NSS, curl/libcurl
- [x] RFC 6066 §3 ASCII requirement and RFC 5890 A-label requirement
reconfirmed (already validated earlier this session via direct RFC
fetch)

## New tests
1. **test_valid_utf8_non_ascii_sni_emits_finding** (flipped from #50's
pin-test) — `"café.example"`, asserts category/verdict/confidence/RFC
mention/A-label mention/hex evidence (`636166c3a92e6578616d706c65`).
2. **test_cyrillic_sni_emits_non_ascii_finding** — `"пример.example"`
(Cyrillic 2-byte UTF-8 sequences).
3. **test_emoji_sni_emits_non_ascii_finding** — `"🦀.example"` (4-byte
UTF-8 codepoint).
4. **test_punycode_a_label_does_not_emit_non_ascii_finding** —
`"xn--caf-dma.example"` regression: the RFC-compliant Punycode form is
pure ASCII and must NOT be flagged.
Zious11 added a commit that referenced this pull request Apr 9, 2026
Establishes the architectural principle that the data layer (analyzers
and the Finding struct) holds raw bytes from network captures, while
each reporter (terminal, JSON, future CSV/SQLite/HTML/SIEM) is
responsible for all formatting that depends on its own output medium —
escaping, styling, truncation, localization.

The motivating consequence is terminal-safe escaping: an audit during
PR #49 follow-up found 7 unprotected interpolations in the HTTP
analyzer where attacker-controlled bytes could reach the operator's
terminal and be interpreted as ANSI escape sequences (CWE-117). PR #49
patched TLS by escaping at the construction site, which solved the
immediate vulnerability but destroyed forensic data (Cyrillic SNIs
became hex blobs in JSON output) and required every new analyzer to
remember the rule — the HTTP audit shows the rule never propagated.

The decision: escape at display time in the reporter, using stdlib
str::escape_default in the terminal reporter. Analyzers store raw
bytes; JSON output stays safe via serde_json's automatic RFC 8259
escaping; future reporters apply their own format's rules.

Validated against OWASP output-encoding guidance (CWE-117) and Rust
stdlib behavior via Perplexity 2026-04-08. Other formatting concerns
(truncation, cipher-name fallback, verdict localization) follow the
same principle but are explicitly deferred per YAGNI.

Implementation follows in subsequent commits.
Zious11 added a commit that referenced this pull request Apr 9, 2026
…::escape_default

An initial Perplexity answer (relied on during the first draft) claimed
str::escape_default preserves multi-byte UTF-8 while escaping only C0
control bytes. Empirical verification during plan self-review — a small
rustc-compiled program that fed ESC, BEL, DEL, Cyrillic, emoji, and
mixed content through str::escape_default — falsified this: the method
routes every character through char::escape_default, which escapes all
non-ASCII as \u{...}. A Cyrillic SNI like "пример.рф" would become
"\u{43f}\u{440}..." — the same UX problem as PR #49's Debug formatter.

The Perplexity answer had conflated str::escape_default (method on str,
operates on chars, escapes all non-ASCII) with std::ascii::escape_default
(free function, operates on bytes, uses \xhh). These are not
equivalent.
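A two-assertion check of the distinction (std only):

    let esc: String = "п".escape_default().collect();
    assert_eq!(esc, "\\u{43f}"); // str method escapes all non-ASCII
    let bytes: Vec<u8> = std::ascii::escape_default(0x1b).collect();
    assert_eq!(bytes, b"\\x1b".to_vec()); // byte function uses \xhh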

Corrections in this commit:

- Replaced "The mechanism: str::escape_default" with an 8-line custom
  helper that iterates chars() and gates char::escape_default() on
  c.is_ascii_control() || c == '\\'. Empirically verified against the same
  test inputs.
- Replaced the "char::escape_debug (per-char Debug)" alternative entry
  with "Stdlib str::escape_default", since they're functionally
  equivalent for our input space and the new entry correctly documents
  the rejection.
- Updated Rationale bullet to describe the custom helper.
- Updated Validation section to record the empirical check and the
  corrected Perplexity claim.
- Fixed an incorrect "\x1b" reference in the behavioral-changes
  section (char::escape_default uses \u{1b}, not \x1b).

No change to the layering principle, the decision to escape at
display time, the rollback of TLS PR #49, the HTTP analyzer impact,
or any of the file-level consequences. Only the escape primitive
description changed.
Zious11 added a commit that referenced this pull request Apr 9, 2026
Apply escape_for_terminal to every Finding.summary and every evidence
entry in the terminal reporter's rendering loop. This moves output
sanitization from the analyzer construction site (where PR #49 placed
it) to the display layer, per ADR 0003.

JSON output remains unchanged and is already safe via serde_json's
automatic RFC 8259 escaping. HTTP findings (path traversal, web shell,
admin panel, unusual method, etc.) — previously vulnerable to terminal
injection via attacker-controlled URIs — are fixed by this change with
zero analyzer-side code modifications.
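Illustrative placement in the render loop (Finding field names follow
this PR's description; the actual reporter code may differ):

    for finding in &findings {
        println!("{}", escape_for_terminal(&finding.summary));
        for item in &finding.evidence {
            println!("  {}", escape_for_terminal(item));
        }
    }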
Zious11 added a commit that referenced this pull request Apr 9, 2026
Three additions from the multi-agent PR review triage:

1. test_json_reporter_preserves_cyrillic_as_readable_unicode — pins
   the ADR 0003 user-visible claim that JSON consumers now see raw
   Cyrillic hostnames instead of the PR #49 hex-mangled form. Includes
   a round-trip serde_json::from_str check proving the Cyrillic
   survives the full trip.

2. test_terminal_reporter_escapes_esc_bytes_in_summary augmented with
   Cyrillic content in the summary payload. Previously only covered
   ASCII + ESC — now proves the render loop preserves Cyrillic while
   still escaping control bytes (catches accidental double-escape
   bugs in the for-findings loop).

3. test_cyrillic_sni_emits_non_ascii_finding augmented with raw-
   Cyrillic-substring assertions on Finding.summary. This directly
   guards the {hostname:?} → {hostname} rollback on the NonAsciiUtf8
   match arm in src/analyzer/tls.rs, which was otherwise only
   structurally similar (and untested) compared to the non-UTF-8
   branch that has the inverted dedicated test.
Zious11 added a commit that referenced this pull request Apr 9, 2026
## Summary

Establishes ADR 0003 (Reporting Pipeline Layering) as the architectural
rule for how attacker-controlled bytes flow through the reporting
pipeline. The data layer (analyzers + `Finding` struct) now stores raw
post-`from_utf8_lossy` bytes; each reporter (terminal, JSON, future
CSV/SQLite) escapes for its own output medium at render time. Reverses
PR #49's construction-site escaping in TLS and centrally fixes an
injection vulnerability in the HTTP analyzer's 7 findings.

## Motivating vulnerability

During PR #49 follow-up, an audit of HTTP finding construction at
`src/analyzer/http.rs:183, 209, 227, 242, 257, 271, 285` found that
path-traversal, web-shell, admin-panel, unusual-method, missing-Host,
long-URI, and empty-UA findings all interpolated `parsed.uri` /
`parsed.method` / `parsed.host` directly via `format!("{}", ...)`. An
attacker-supplied URI containing `\x1b[31mRED\x1b[0m` would reach the
operator's terminal unescaped (CWE-117).

PR #49 patched the TLS path with `{:?}` (Debug formatter), which worked
for terminal safety but (a) permanently destroyed forensic bytes in
`Finding.summary` — Cyrillic SNIs became `\u{43f}\u{440}...` in JSON
exports — and (b) was tribal knowledge that didn't propagate to HTTP.

## Solution

Move escaping from construction-site (analyzer) to display-time
(terminal reporter), using a custom helper `escape_for_terminal` (~15
lines) built on `char::escape_default` gated by
`char::is_ascii_control() || ('\u{80}'..='\u{9f}').contains(&c) || c ==
'\\'`. The gate preserves valid non-ASCII Unicode (Cyrillic, CJK, emoji)
while escaping C0, DEL, C1, and backslash. `str::escape_default` was
considered and rejected after empirical verification showed it mangles
all non-ASCII (see commit `45cf649` and ADR 0003 Validation section).
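A hedged reconstruction of the helper from that description (the merged code may differ in detail):

```rust
/// Escape C0 controls, DEL, C1 controls, and backslash for terminal
/// output; pass all other Unicode (Cyrillic, CJK, emoji) through raw.
fn escape_for_terminal(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        if c.is_ascii_control() || ('\u{80}'..='\u{9f}').contains(&c) || c == '\\' {
            out.extend(c.escape_default()); // e.g. ESC -> \u{1b}
        } else {
            out.push(c);
        }
    }
    out
}
```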

## Behavioral changes visible to users

- **JSON output preserves Cyrillic / CJK / emoji as raw UTF-8.** A TLS
SNI of `пример.рф` now appears readably in JSON exports instead of
`\u{43f}\u{440}...`. Downstream tooling gets the actual hostname.
- **HTTP findings gain terminal-safety.** An attacker who embeds
`\x1b[31m` in a URI can no longer recolor the operator's terminal
output. Zero HTTP analyzer code changes — the fix is centralized in the
terminal reporter.
- **Both C0 and C1 control codepoints are escaped at terminal render.**
C1 codepoints like U+009B (CSI) — encoded in valid UTF-8 as `0xC2 0x9B`
— are escaped to close the DEC S8C1T attack vector. Narrow but real
threat.

## Non-changes

- `Finding` struct shape (same fields, same serde derives)
- JSON reporter (already safe via `serde_json`'s automatic RFC 8259
escaping)
- HTTP analyzer source code (raw interpolation is now correct under ADR
0003)
- DNS analyzer (emits no findings)
- No new dependencies

## Test plan

- [x] `cargo test --all` — 166 tests pass (up from 163 pre-PR)
- [x] `cargo clippy --all-targets -- -D warnings` — clean
- [x] `cargo fmt -- --check` — clean
- [x] Unit tests on `escape_for_terminal` helper (11 total): ESC, BEL,
DEL, tab/newline/CR, backslash, printable ASCII, Cyrillic preservation,
emoji preservation, mixed content, empty string, C1 NEL+CSI, C1 range
boundaries
- [x] Reporter integration test: terminal reporter escapes ESC in
summary + evidence, preserves Cyrillic
- [x] Reporter integration test: JSON reporter preserves Cyrillic as
readable Unicode, round-trips via `serde_json::from_str`
- [x] End-to-end contract test covering all three layers (struct
preserves raw, terminal escapes, JSON escapes)
- [x] TLS analyzer test: raw ESC bytes survive to `Finding.summary` on
the non-UTF-8 path (inverted from the pre-PR contract)
- [x] TLS analyzer test: raw Cyrillic substring present in
`Finding.summary` on the non-ASCII path (guards the `{hostname:?}` →
`{hostname}` rollback)

## Multi-agent PR review

Three reviewer agents (code-reviewer, pr-test-analyzer,
comment-analyzer) reviewed this branch and returned APPROVED_WITH_NITS /
COVERAGE_ADEQUATE / APPROVED_WITH_NITS respectively. The code-reviewer's
one Important finding (C1 codepoint bypass + ADR reasoning error) was
applied in `7d2cd3c`; the pr-test-analyzer's top three coverage
suggestions (JSON Cyrillic, TLS non-ASCII rollback, terminal Cyrillic)
were applied in `b9cc820`. One coverage suggestion was deferred to #56
(HTTP analyzer end-to-end through terminal reporter).

## Commits

14 commits on top of `develop`. Notable:
- `f4c6098` + `45cf649` — ADR 0003 (original + correction)
- `1143d0b` — Implementation plan
- `1f7e556` + `755837b` — Escape helper + apply in render loop
- `0e42aad` — Roll back TLS Debug formatter
- `7e4f1df` — End-to-end layering contract test
- `7d2cd3c` + `b9cc820` — Post-review C1 fix + coverage additions

## Related

- Follow-up: #56 (HTTP end-to-end test)
- Supersedes the construction-site approach from PR #49
- References RFC 8259 §7 (JSON control byte escaping), CWE-117 (log
injection), OWASP output encoding guidance
