fix: surface non-UTF-8 SNI hostnames instead of dropping #49

Merged: Zious11 merged 2 commits into develop from worktree-tls-non-utf8-sni, Apr 8, 2026
Conversation

Zious11 (Owner) commented Apr 8, 2026

Summary

  • extract_sni() previously used String::from_utf8(...).ok() which silently dropped any TLS SNI hostname containing non-UTF-8 bytes. This is wrong for a forensics tool: per RFC 6066 §3 the HostName field must be ASCII (internationalized names use A-labels per RFC 5890), so non-UTF-8 SNI is a real protocol violation worth surfacing — even if usually a buggy client rather than a deliberate evasion.
  • New SniValue enum (Utf8(String) / NonUtf8 { lossy, hex }) preserves both a human-readable form and the lossless hex; a sketch follows this list.
  • handle_client_hello now keys sni_counts on a tagged <non-utf8:HEX> form for the NonUtf8 case (uniqueness-preserving) and emits one Anomaly / Inconclusive / Low finding per non-UTF-8 SNI, with the lossy form in the summary and hex in the evidence.
  • Closes #28 (TLS: surface non-UTF-8 SNI hostnames instead of silently dropping).
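A minimal sketch of the decode split described above, assuming the variant shapes listed; `decode_sni` is an illustrative helper name, not the PR's literal code:

```rust
enum SniValue {
    Utf8(String),                           // hostname decoded cleanly
    NonUtf8 { lossy: String, hex: String }, // display form + lossless hex
}

fn decode_sni(bytes: &[u8]) -> SniValue {
    match std::str::from_utf8(bytes) {
        Ok(s) => SniValue::Utf8(s.to_owned()),
        Err(_) => SniValue::NonUtf8 {
            lossy: String::from_utf8_lossy(bytes).into_owned(),
            hex: bytes.iter().map(|b| format!("{b:02x}")).collect(),
        },
    }
}
```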

Why hex-keyed counts (not lossy-keyed)

Using String::from_utf8_lossy as a sni_counts map key would let two distinct byte sequences collide if their U+FFFD-replaced forms align — bad for a forensic counter. Hex keying guarantees distinct byte sequences stay distinct, is terminal-safe (no ANSI injection vector), JSON-safe, and visually obvious to analysts.
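The collision is easy to demonstrate (a minimal std-only illustration):

```rust
// Two distinct invalid byte sequences collapse to the same lossy form...
assert_eq!(String::from_utf8_lossy(&[0xff]), String::from_utf8_lossy(&[0xfe]));
// ...while their hex-tagged keys stay distinct.
assert_ne!("<non-utf8:ff>", "<non-utf8:fe>");
```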

Severity rationale

  • ThreatCategory::Anomaly — it's a spec violation.
  • Verdict::Inconclusive + Confidence::Low — Perplexity validation + judgment: legitimate TLS clients occasionally produce malformed SNI from bugs; non-UTF-8 SNI is not a documented C2 / DGA technique on its own. Surface for analyst review, don't scream malicious.
  • mitre_technique: None — T1001.001 (Junk Data) and T1001.003 (Protocol Impersonation) are the closest candidates; neither is a clean fit. Matches the existing weak-cipher and SSLv3 findings in tls.rs, which also use None.

Out of scope (intentional)

  • Valid-UTF-8 but non-ASCII SNI (e.g. raw U-labels) — also an RFC 6066 violation, but a separate detection worth its own issue.
  • parse_error_count is intentionally NOT incremented — the TLS record parsed cleanly; only one extension field failed UTF-8 decoding. The counter elsewhere in the file specifically means "the record itself was unparseable".

Test plan

  • cargo test --test tls_analyzer_tests — all 12 tests pass, including the 2 new ones.
  • cargo test — full suite green.
  • cargo clippy --all-targets -- -D warnings — clean.
  • cargo fmt --check — clean.
  • Code review pass (pr-review-toolkit:code-reviewer) — no critical or important findings; all suggestions were explicit defers, with a follow-up issue to be filed for the remaining test-coverage gaps.

New tests

  • test_non_utf8_sni_emits_finding_and_counts_under_hex_key — builds a ClientHello with [0xff, 0xfe, "a.com"] SNI bytes, asserts one finding with the right severity/category, and verifies sni_counts contains the hex-tagged key <non-utf8:fffe612e636f6d> (derived by hand in the sketch after this list).
  • test_ascii_sni_does_not_emit_non_utf8_finding — regression: clean ASCII hostname must not trip the new finding.
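For reference, the hex tag in that first test can be derived by hand (a std-only sketch):

```rust
// 0xff 0xfe followed by "a.com" = 61 2e 63 6f 6d in ASCII.
let mut sni: Vec<u8> = vec![0xff, 0xfe];
sni.extend_from_slice(b"a.com");
let hex: String = sni.iter().map(|b| format!("{b:02x}")).collect();
assert_eq!(format!("<non-utf8:{hex}>"), "<non-utf8:fffe612e636f6d>");
```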

Closes #28. extract_sni() previously used String::from_utf8(...).ok()
which silently discarded any SNI hostname containing non-UTF-8 bytes.
For a forensics tool that's the wrong default — non-UTF-8 SNI is a
RFC 6066 §3 protocol violation (HostName MUST be ASCII; internationalized
names use A-labels per RFC 5890) and a possible (low-confidence)
malware/buggy-client indicator worth surfacing to analysts.

Approach
--------
- New SniValue enum: Utf8(String) for the clean case, NonUtf8 { lossy,
  hex } when from_utf8 fails. The lossy form is for human display
  (U+FFFD replacement); the hex form is the lossless evidence record.
- handle_client_hello keys sni_counts on a tagged "<non-utf8:HEX>" form
  for the NonUtf8 case rather than the lossy string. This is critical:
  using the lossy string as a key would let two distinct byte sequences
  collide if their U+FFFD-replaced forms happen to align — bad for a
  forensics counter.
- Emit one Anomaly / Inconclusive / Low finding per non-UTF-8 SNI, with
  the lossy form in the summary and the hex in the evidence vector. No
  MITRE technique attached (T1001.001/.003 don't fit cleanly; matches
  the existing weak-cipher and SSLv3 findings in this file which also
  use None).
- parse_error_count is intentionally NOT incremented — the TLS record
  parsed cleanly, only one extension field failed UTF-8 decoding.

Tests
-----
- New build_client_hello_raw_sni helper that takes &[u8] for the SNI,
  reusing the existing builder for the rest of the record.
- test_non_utf8_sni_emits_finding_and_counts_under_hex_key: builds a
  ClientHello with [0xff, 0xfe, "a.com"] as the SNI bytes, asserts
  one finding with the right severity/category/MITRE and verifies
  sni_counts contains the hex-tagged key.
- test_ascii_sni_does_not_emit_non_utf8_finding: regression — clean
  ASCII hostname must not trip the new finding.

Scope intentionally excludes valid-UTF-8-but-non-ASCII SNI (e.g. raw
U-labels). Those are also an RFC 6066 violation but are a separate
detection worth its own issue.

Copilot AI left a comment

Pull request overview

This PR updates the TLS analyzer to preserve and surface malformed (non-UTF-8) SNI hostnames instead of silently dropping them, improving forensic visibility into protocol violations (per RFC 6066) and ensuring distinct byte sequences remain distinguishable in aggregation.

Changes:

  • Introduces an internal SniValue enum to represent decoded SNI as either valid UTF-8 or non-UTF-8 (lossy + hex).
  • Updates extract_sni() / ClientHello handling to count non-UTF-8 SNI using a hex-tagged key and emit a low-confidence anomaly finding with hex evidence.
  • Adds tests for non-UTF-8 SNI counting + finding emission and a regression test ensuring normal ASCII SNI does not trigger the new finding.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/analyzer/tls.rs | Adds SniValue, changes SNI extraction to preserve non-UTF-8 bytes, counts them under a hex-tagged key, and emits a new finding. |
| tests/tls_analyzer_tests.rs | Adds a raw-SNI ClientHello builder and new tests covering non-UTF-8 SNI behavior. |


Comment thread on src/analyzer/tls.rs (marked outdated)
Addresses Copilot review on PR #49. String::from_utf8_lossy only
replaces invalid UTF-8 *sequences* with U+FFFD — it preserves
valid-but-control bytes like ESC (0x1b) verbatim. Interpolating the
lossy form into the finding summary with {} therefore left a terminal
injection vector: a malformed SNI like b"\x1b[31mpwnd" would render
in red on any terminal showing the finding.

Switch the lossy interpolation to {:?} (Debug formatter), which
escapes ESC and other control characters as the literal sequence
\u{1b}. The hex evidence is unchanged — that remains the lossless
record. No behavior change for SNIs whose lossy form is purely
printable text.

Verified the fix empirically: format!("{:?}", "\u{fffd}\x1b[31mpwnd")
contains no raw 0x1b byte in the output, while format!("{}", ...)
does.
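A std-only repro of that empirical check (illustrative):

    let lossy = "\u{fffd}\x1b[31mpwnd";
    let debug = format!("{lossy:?}");
    assert!(!debug.contains('\u{1b}')); // no raw ESC byte in the output
    assert!(debug.contains("\\u{1b}")); // the escaped form appears instead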

Tests
-----
- New test_non_utf8_sni_escapes_control_bytes_in_summary: builds a
  ClientHello with raw ESC + ANSI CSI in the SNI, asserts the finding
  summary does NOT contain a literal 0x1b byte but DOES contain the
  escaped "\u{1b}" sequence, and verifies the hex evidence still
  carries the original bytes losslessly.
Zious11 merged commit 628b646 into develop on Apr 8, 2026 (4 checks passed).
Zious11 deleted the worktree-tls-non-utf8-sni branch on April 8, 2026 at 16:08.
Zious11 added a commit that referenced this pull request Apr 8, 2026
Closes #50. Adds 5 pin-tests for SNI edge cases that the existing
non-UTF-8 SNI tests (from #49) don't reach. Each new test documents
the *current* analyzer behavior so a future refactor can't silently
change it.

New tests
---------
1. test_sni_extension_with_empty_hostname_list — SNI extension present
   but the ServerNameList has zero entries. extract_sni returns None,
   no finding, no count increment, handshake still parses. (Confirmed
   tls_parser accepts zero-entry lists at parse time.)
2. test_sni_with_empty_hostname_bytes — SNI hostname of b"". Decodes
   via from_utf8 as Ok(""), counted under the empty-string key, no
   finding. Degenerate but technically valid wire form.
3. test_valid_utf8_non_ascii_sni_currently_not_flagged — A raw U-label
   like "café.example" is also a RFC 6066 §3 violation (spec requires
   A-label Punycode form), but the current analyzer does not flag it.
   Pins current behavior; will be flipped when issue #51 lands.
4. test_multi_name_sni_list_only_first_entry_counted — SNI extension
   with 3 ServerName entries. Only the first is counted in sni_counts
   (matching extract_sni's list.first() behavior).
5. test_non_utf8_sni_finding_fires_when_sni_counts_at_capacity — fills
   sni_counts to MAX_MAP_ENTRIES (50k unique valid hostnames) via the
   public API, then sends a non-UTF-8 SNI ClientHello. Verifies that
   the new non-UTF-8 key is dropped at the cap but the finding still
   fires. This tests a real invariant: the finding push and the count
   increment are independent, and a refactor nesting one inside the
   other would silently break forensic visibility past the cap.
   Runtime ~660ms in debug builds (acceptable).
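The invariant, sketched with hypothetical surrounding code (names other
than sni_counts/MAX_MAP_ENTRIES are illustrative, not the analyzer's
literal code):

    // Count insertion respects the cap...
    if sni_counts.len() < MAX_MAP_ENTRIES {
        *sni_counts.entry(key).or_insert(0) += 1;
    }
    // ...but the finding push must not be nested inside that check.
    findings.push(non_utf8_finding);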

Builder refactor
----------------
Extracted build_client_hello_with_sni_list from build_client_hello_raw_sni
to support zero and multi-entry SNI lists. build_client_hello_raw_sni
now delegates to the new helper for a single entry, preserving the
existing public API and all prior tests.

Out of scope
------------
Issue #51 (flag valid-UTF-8 but non-ASCII SNI) is filed separately
and is the natural follow-up that would flip test 3's assertion.
Zious11 added a commit that referenced this pull request Apr 8, 2026
## Summary
- Adds 5 pin-tests for SNI edge cases that the existing non-UTF-8 SNI
tests (from #49) don't reach. Each new test documents the *current*
analyzer behavior so a future refactor can't silently change it. Closes
#50.
- Refactors the test fixture builder to support 0-or-more-entry SNI
lists: extracted `build_client_hello_with_sni_list` from
`build_client_hello_raw_sni`, preserving the existing API for prior
tests.
- One test (#5) intentionally exercises the `MAX_MAP_ENTRIES` boundary
by feeding 50,000 unique ClientHellos through the public API. Runtime
~650ms in debug (budget ~2s for CI). No production code changes — this
is test-only.
- Code review (pr-test-analyzer) applied pre-merge: RFC 6066 §3 bounds
verified, silent-false-positive guard added to test 1, test 5's
finding-delta tightened to filter specifically for non-UTF-8 findings,
comment drift fixed.
- S4 follow-up (non-zero `NameType`, max-length SNI, trailing bytes)
filed as #52.

## New tests
1. **test_sni_extension_with_empty_hostname_list** — `ServerNameList`
with zero entries. `extract_sni` returns `None`, no finding, no count,
handshake still parses. Guards against tls_parser drift via an explicit
`parse_error_count == 0` assertion (empty-list wire form is technically
RFC-violating per `server_name_list<1..2^16-1>`; a future tls_parser
strictness tightening would otherwise let this test pass for the wrong
reason).
2. **test_sni_with_empty_hostname_bytes** — one entry but the hostname
bytes are `b""`. Decodes via `from_utf8` as `Ok("")`, counted under the
empty-string key, no finding. Degenerate but valid wire form.
3. **test_valid_utf8_non_ascii_sni_currently_not_flagged** —
`"café.example"` (a raw U-label, also a RFC 6066 §3 violation per the
A-label requirement). Pins current no-finding behavior; the assertion
includes the explicit #51 URL so the cross-reference survives grep and
will fire loud when that issue lands.
4. **test_multi_name_sni_list_only_first_entry_counted** — 3 ServerName
entries; only the first is counted, matching `extract_sni`'s
`list.first()` behavior.
5. **test_non_utf8_sni_finding_fires_when_sni_counts_at_capacity** —
fills `sni_counts` to `MAX_MAP_ENTRIES = 50_000` via 50k unique
ClientHellos through the public API, then sends a non-UTF-8 SNI
ClientHello. Verifies the new non-UTF-8 key is dropped at the cap but
the finding still fires. Pins a real invariant: finding emission is
independent of count insertion. Delta is filtered by
`summary.contains("non-UTF-8 bytes")` so the pin stays tight under
refactors.

## Why brute-force for test 5 (not a test helper)
Reaching `MAX_MAP_ENTRIES = 50_000` via the public API was chosen over
adding a `#[cfg(test)] pub(crate)` test helper on `TlsAnalyzer` to
expose internal state. The trade-off was validated with Perplexity
(the Rust community tends to accept slow boundary tests over test-only
production surface) and empirically: debug runtime ~650ms, release
~120ms. The constant is hand-duplicated in the test but self-corrects
via the length assertion (any change to the cap would loudly fail).

## RFC validation
Before applying the code-reviewer's fixes, RFC 6066 §3 was fetched
directly to confirm the relevant spec claims:
- `ServerNameList server_name_list<1..2^16-1>` — minimum 1 entry (empty
lists are RFC-violating wire form, used in test 1 only to exercise the
analyzer's defensive branch)
- `enum { host_name(0), (255) } NameType` — `(255)` reserves room for
future types (informs the #52 follow-up for non-zero NameType testing)
- RFC 6066 is **silent** on duplicate SNI extensions, so that case was
dropped from the follow-up list

## Test plan
- [x] `cargo test --test tls_analyzer_tests` — 18/18 pass (~650ms total,
test 5 is ~650ms of that)
- [x] `cargo test` — full suite green
- [x] `cargo clippy --all-targets -- -D warnings` — clean
- [x] `cargo fmt --check` — clean
- [x] Code review pass (`pr-test-analyzer`): 1 important finding (I1
silent false-positive), 1 important (I2 panic message link), 3
suggestions — all applied before PR creation
- [x] RFC 6066 §3 claims validated via direct RFC fetch before applying
review fixes
Zious11 added a commit that referenced this pull request Apr 8, 2026
Closes #51. PR #49 surfaced non-UTF-8 SNI bytes as a finding but
deliberately left a related RFC 6066 §3 violation unflagged: SNI
hostnames that are valid UTF-8 but contain non-ASCII codepoints,
e.g. raw U-labels like "café.example" or "пример.example". The spec
requires HostName to be ASCII (with internationalized names sent as
A-labels per RFC 5890 Punycode `xn--…` form), so a non-ASCII byte on
the wire is a real protocol violation.

Real-world false-positive risk
------------------------------
Validated via Perplexity that all major TLS clients auto-Punycode
internationalized hostnames before sending the SNI:

- rustls — rejects raw UTF-8 at the ServerName API level, requires
  Punycode upfront
- Chrome / BoringSSL — always Punycode (per RFC 3492 + 6066)
- Firefox / NSS — always Punycode (URL bar may display Unicode for
  user-friendliness, but the wire is always Punycode)
- curl / libcurl — auto-converts via libidn2 before calling OpenSSL
- OpenSSL raw — passes verbatim, but expects ASCII from the caller;
  applications using OpenSSL directly without IDNA prep are the only
  legitimate path to a raw U-label SNI on the wire

So the false-positive surface is small: a buggy custom client, an
attacker tool, or an OpenSSL-direct app that forgot to Punycode.
Severity is `Anomaly` / `Inconclusive` / `Low` — same rationale as
the non-UTF-8 finding.

SniValue enum
-------------
Renamed `Utf8(String)` → `Ascii(String)` (the old name was misleading
since ASCII is also valid UTF-8) and added a new variant:

    enum SniValue {
        Ascii(String),                              // RFC compliant
        NonAsciiUtf8 { hostname: String, hex: String }, // valid UTF-8, non-ASCII
        NonUtf8 { lossy: String, hex: String },     // invalid UTF-8 (existing)
    }

`extract_sni` now uses `s.is_ascii()` to split the UTF-8 case into
`Ascii` and `NonAsciiUtf8`. The hostname for `NonAsciiUtf8` is the
decoded String (always valid UTF-8 by definition); `hex` is the
lossless representation for forensic evidence.

`handle_client_hello` keys `sni_counts` on the hostname directly for
both Ascii and NonAsciiUtf8 (no collision risk for valid UTF-8) and
keeps the existing `<non-utf8:HEX>` tagged form for NonUtf8. The
finding summary uses `{hostname:?}` Debug formatter to escape any
control codepoints that might survive UTF-8 decoding (e.g. U+0085 NEL).
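A sketch of that split (illustrative; the real extract_sni also walks
the extension structure before decoding):

    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    let value = match std::str::from_utf8(bytes) {
        Ok(s) if s.is_ascii() => SniValue::Ascii(s.to_owned()),
        Ok(s) => SniValue::NonAsciiUtf8 { hostname: s.to_owned(), hex },
        Err(_) => SniValue::NonUtf8 {
            lossy: String::from_utf8_lossy(bytes).into_owned(),
            hex,
        },
    };
    // "café.example" decodes Ok but fails is_ascii(), landing in
    // NonAsciiUtf8 with hex 636166c3a92e6578616d706c65.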

Tests
-----
- Flipped `test_valid_utf8_non_ascii_sni_currently_not_flagged` to
  `test_valid_utf8_non_ascii_sni_emits_finding` — asserts the new
  finding fires for "café.example" with the right severity, RFC
  reference in summary, and hex evidence (`636166c3a92e6578616d706c65`).
- Added `test_cyrillic_sni_emits_non_ascii_finding` for "пример.example".
- Added `test_emoji_sni_emits_non_ascii_finding` for "🦀.example"
  (4-byte UTF-8 codepoint, all bytes ≥ 0x80).
- Added `test_punycode_a_label_does_not_emit_non_ascii_finding` for
  "xn--caf-dma.example" — pins that the RFC-compliant Punycode form
  is NOT flagged.
Zious11 added a commit that referenced this pull request Apr 8, 2026
## Summary
- Adds a new `Anomaly` / `Inconclusive` / `Low` finding for TLS SNI
hostnames that are valid UTF-8 but contain non-ASCII codepoints (e.g.
raw U-labels like `"café.example"`, `"пример.example"`, `"🦀.example"`).
RFC 6066 §3 requires `HostName` to be ASCII, with internationalized
names sent as A-labels per RFC 5890 (`xn--…` Punycode form). Closes #51.
- Renames `SniValue::Utf8(String)` → `SniValue::Ascii(String)` (the old
name was misleading — ASCII is also valid UTF-8) and adds a new
`NonAsciiUtf8 { hostname, hex }` variant. The existing `NonUtf8` variant
is unchanged.
- Flips the pin-test from #50
(`test_valid_utf8_non_ascii_sni_currently_not_flagged` →
`test_valid_utf8_non_ascii_sni_emits_finding`) and adds 3 new positive
cases (Cyrillic, emoji, Punycode regression).
- Code review (`code-reviewer`) applied pre-merge: zero critical, zero
important, all suggestions explicit defers.
- Follow-up #54 filed for the related ASCII-control-code gap (S1 —
pre-existing, out of #51's scope).

## Real-world false-positive risk
Validated with Perplexity that all major TLS clients auto-Punycode
internationalized hostnames before sending the SNI:

| Client | Behavior |
|---|---|
| **rustls** | Rejects raw UTF-8 at the `ServerName` API level, requires Punycode upfront |
| **Chrome / BoringSSL** | Always Punycode (per RFC 3492 + 6066) |
| **Firefox / NSS** | Always Punycode (URL bar may display Unicode for user-friendliness, but the wire is always Punycode) |
| **curl / libcurl** | Auto-converts via libidn2 before calling OpenSSL |
| **OpenSSL raw API** | Passes verbatim — but expects ASCII from the caller; only legitimate path to a raw U-label SNI is a custom application using OpenSSL directly without IDNA prep |

So the false-positive surface is small: a buggy custom client, an
OpenSSL-direct app that forgot to Punycode, or an attacker tool.
Severity is `Anomaly` / `Inconclusive` / `Low` — same rationale as the
non-UTF-8 finding from #49.

## Implementation
```rust
enum SniValue {
    Ascii(String),                                  // RFC compliant
    NonAsciiUtf8 { hostname: String, hex: String }, // valid UTF-8, non-ASCII (NEW)
    NonUtf8 { lossy: String, hex: String },         // invalid UTF-8 (existing)
}
```

`extract_sni` uses `s.is_ascii()` to split the UTF-8 case.
`handle_client_hello` uses an exhaustive `match` on the variants, so a
future variant addition (e.g. for #54's ASCII-control-code case) will
fail to compile until handled. Map-keying for `NonAsciiUtf8` is the bare
hostname (no `<non-ascii:HEX>` tagging needed — valid UTF-8 strings have
no collision risk because `from_utf8` is a bijection on well-formed
bytes).
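A sketch of the map keying described above, assuming the enum as shown
(the real `handle_client_hello` does more than derive a key):

```rust
let key = match &value {
    SniValue::Ascii(h) => h.clone(),
    SniValue::NonAsciiUtf8 { hostname, .. } => hostname.clone(),
    SniValue::NonUtf8 { hex, .. } => format!("<non-utf8:{hex}>"),
};
```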

## Findings shape
The new finding mirrors the existing `NonUtf8` finding pattern for
consistency:
- `category: Anomaly`, `verdict: Inconclusive`, `confidence: Low`
- Summary: `"TLS SNI contains non-ASCII characters (RFC 6066 requires
A-labels per RFC 5890): {hostname:?}"` — `{:?}` Debug formatter to
escape any control codepoints that might survive UTF-8 decoding (e.g.
U+0085 NEL)
- Evidence: `vec![format!("hex: {hex}")]` for forensic byte-level review
- `mitre_technique: None` (matches existing TLS findings; no clean
ATT&CK technique for protocol-format violations)

## Test plan
- [x] `cargo test --test tls_analyzer_tests` — 21/21 pass (~650ms; was
18 before)
- [x] `cargo test` — full suite green
- [x] `cargo clippy --all-targets -- -D warnings` — clean
- [x] `cargo fmt --check` — clean
- [x] Code review pass (`code-reviewer`): zero critical, zero important;
one defer-only suggestion (S1 — ASCII control codes) filed as #54
- [x] Real-world false-positive risk Perplexity-validated against
rustls, OpenSSL, Chrome/BoringSSL, Firefox/NSS, curl/libcurl
- [x] RFC 6066 §3 ASCII requirement and RFC 5890 A-label requirement
reconfirmed (already validated earlier this session via direct RFC
fetch)

## New tests
1. **test_valid_utf8_non_ascii_sni_emits_finding** (flipped from #50's
pin-test) — `"café.example"`, asserts category/verdict/confidence/RFC
mention/A-label mention/hex evidence (`636166c3a92e6578616d706c65`).
2. **test_cyrillic_sni_emits_non_ascii_finding** — `"пример.example"`
(Cyrillic 2-byte UTF-8 sequences).
3. **test_emoji_sni_emits_non_ascii_finding** — `"🦀.example"` (4-byte
UTF-8 codepoint).
4. **test_punycode_a_label_does_not_emit_non_ascii_finding** —
`"xn--caf-dma.example"` regression: the RFC-compliant Punycode form is
pure ASCII and must NOT be flagged.
Zious11 added a commit that referenced this pull request Apr 9, 2026
Establishes the architectural principle that the data layer (analyzers
and the Finding struct) holds raw bytes from network captures, while
each reporter (terminal, JSON, future CSV/SQLite/HTML/SIEM) is
responsible for all formatting that depends on its own output medium —
escaping, styling, truncation, localization.

The motivating consequence is terminal-safe escaping: an audit during
PR #49 follow-up found 7 unprotected interpolations in the HTTP
analyzer where attacker-controlled bytes could reach the operator's
terminal and be interpreted as ANSI escape sequences (CWE-117). PR #49
patched TLS by escaping at the construction site, which solved the
immediate vulnerability but destroyed forensic data (Cyrillic SNIs
became hex blobs in JSON output) and required every new analyzer to
remember the rule — the HTTP audit shows the rule never propagated.

The decision: escape at display time in the reporter, using stdlib
str::escape_default in the terminal reporter. Analyzers store raw
bytes; JSON output stays safe via serde_json's automatic RFC 8259
escaping; future reporters apply their own format's rules.

Validated against OWASP output-encoding guidance (CWE-117) and Rust
stdlib behavior via Perplexity 2026-04-08. Other formatting concerns
(truncation, cipher-name fallback, verdict localization) follow the
same principle but are explicitly deferred per YAGNI.

Implementation follows in subsequent commits.
Zious11 added a commit that referenced this pull request Apr 9, 2026
…::escape_default

An initial Perplexity answer (relied on during the first draft) claimed
str::escape_default preserves multi-byte UTF-8 while escaping only C0
control bytes. Empirical verification during plan self-review — a small
rustc-compiled program that fed ESC, BEL, DEL, Cyrillic, emoji, and
mixed content through str::escape_default — falsified this: the method
routes every character through char::escape_default, which escapes all
non-ASCII as \u{...}. A Cyrillic SNI like "пример.рф" would become
"\u{43f}\u{440}..." — the same UX problem as PR #49's Debug formatter.

The Perplexity answer had conflated str::escape_default (method on str,
operates on chars, escapes all non-ASCII) with std::ascii::escape_default
(free function, operates on bytes, uses \xhh). These are not
equivalent.
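A two-assertion check of the distinction (std only):

    let esc: String = "п".escape_default().collect();
    assert_eq!(esc, "\\u{43f}"); // str method escapes all non-ASCII
    let bytes: Vec<u8> = std::ascii::escape_default(0x1b).collect();
    assert_eq!(bytes, b"\\x1b".to_vec()); // byte function uses \xhh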

Corrections in this commit:

- Replaced "The mechanism: str::escape_default" with an 8-line custom
  helper that iterates chars() and gates char::escape_default() on
  c.is_ascii_control() || c == '\\'. Empirically verified against the same
  test inputs.
- Replaced the "char::escape_debug (per-char Debug)" alternative entry
  with "Stdlib str::escape_default", since they're functionally
  equivalent for our input space and the new entry correctly documents
  the rejection.
- Updated Rationale bullet to describe the custom helper.
- Updated Validation section to record the empirical check and the
  corrected Perplexity claim.
- Fixed an incorrect "\x1b" reference in the behavioral-changes
  section (char::escape_default uses \u{1b}, not \x1b).

No change to the layering principle, the decision to escape at
display time, the rollback of TLS PR #49, the HTTP analyzer impact,
or any of the file-level consequences. Only the escape primitive
description changed.
Zious11 added a commit that referenced this pull request Apr 9, 2026
Apply escape_for_terminal to every Finding.summary and every evidence
entry in the terminal reporter's rendering loop. This moves output
sanitization from the analyzer construction site (where PR #49 placed
it) to the display layer, per ADR 0003.

JSON output remains unchanged and is already safe via serde_json's
automatic RFC 8259 escaping. HTTP findings (path traversal, web shell,
admin panel, unusual method, etc.) — previously vulnerable to terminal
injection via attacker-controlled URIs — are fixed by this change with
zero analyzer-side code modifications.
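Illustrative placement in the render loop (Finding field names follow
this PR's description; the actual reporter code may differ):

    for finding in &findings {
        println!("{}", escape_for_terminal(&finding.summary));
        for item in &finding.evidence {
            println!("  {}", escape_for_terminal(item));
        }
    }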
Zious11 added a commit that referenced this pull request Apr 9, 2026
Three additions from the multi-agent PR review triage:

1. test_json_reporter_preserves_cyrillic_as_readable_unicode — pins
   the ADR 0003 user-visible claim that JSON consumers now see raw
   Cyrillic hostnames instead of the PR #49 hex-mangled form. Includes
   a round-trip serde_json::from_str check proving the Cyrillic
   survives the full trip.

2. test_terminal_reporter_escapes_esc_bytes_in_summary augmented with
   Cyrillic content in the summary payload. Previously only covered
   ASCII + ESC — now proves the render loop preserves Cyrillic while
   still escaping control bytes (catches accidental double-escape
   bugs in the for-findings loop).

3. test_cyrillic_sni_emits_non_ascii_finding augmented with raw-
   Cyrillic-substring assertions on Finding.summary. This directly
   guards the {hostname:?} → {hostname} rollback on the NonAsciiUtf8
   match arm in src/analyzer/tls.rs, which was otherwise only
   structurally similar (and untested) compared to the non-UTF-8
   branch that has the inverted dedicated test.
Zious11 added a commit that referenced this pull request Apr 9, 2026
## Summary

Establishes ADR 0003 (Reporting Pipeline Layering) as the architectural
rule for how attacker-controlled bytes flow through the reporting
pipeline. The data layer (analyzers + `Finding` struct) now stores raw
post-`from_utf8_lossy` bytes; each reporter (terminal, JSON, future
CSV/SQLite) escapes for its own output medium at render time. Reverses
PR #49's construction-site escaping in TLS and centrally fixes an
injection vulnerability in the HTTP analyzer's 7 findings.

## Motivating vulnerability

During PR #49 follow-up, an audit of HTTP finding construction at
`src/analyzer/http.rs:183, 209, 227, 242, 257, 271, 285` found that
path-traversal, web-shell, admin-panel, unusual-method, missing-Host,
long-URI, and empty-UA findings all interpolated `parsed.uri` /
`parsed.method` / `parsed.host` directly via `format!("{}", ...)`. An
attacker-supplied URI containing `\x1b[31mRED\x1b[0m` would reach the
operator's terminal unescaped (CWE-117).

PR #49 patched the TLS path with `{:?}` (Debug formatter), which worked
for terminal safety but (a) permanently destroyed forensic bytes in
`Finding.summary` — Cyrillic SNIs became `\u{43f}\u{440}...` in JSON
exports — and (b) was tribal knowledge that didn't propagate to HTTP.

## Solution

Move escaping from construction-site (analyzer) to display-time
(terminal reporter), using a custom helper `escape_for_terminal` (~15
lines) built on `char::escape_default` gated by
`char::is_ascii_control() || ('\u{80}'..='\u{9f}').contains(&c) || c ==
'\\'`. The gate preserves valid non-ASCII Unicode (Cyrillic, CJK, emoji)
while escaping C0, DEL, C1, and backslash. `str::escape_default` was
considered and rejected after empirical verification showed it mangles
all non-ASCII (see commit `45cf649` and ADR 0003 Validation section).
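A hedged reconstruction of the helper from that description (the merged code may differ in detail):

```rust
/// Escape C0 controls, DEL, C1 controls, and backslash for terminal
/// output; pass all other Unicode (Cyrillic, CJK, emoji) through raw.
fn escape_for_terminal(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        if c.is_ascii_control() || ('\u{80}'..='\u{9f}').contains(&c) || c == '\\' {
            out.extend(c.escape_default()); // e.g. ESC -> \u{1b}
        } else {
            out.push(c);
        }
    }
    out
}
```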

## Behavioral changes visible to users

- **JSON output preserves Cyrillic / CJK / emoji as raw UTF-8.** A TLS
SNI of `пример.рф` now appears readably in JSON exports instead of
`\u{43f}\u{440}...`. Downstream tooling gets the actual hostname.
- **HTTP findings gain terminal-safety.** An attacker who embeds
`\x1b[31m` in a URI can no longer recolor the operator's terminal
output. Zero HTTP analyzer code changes — the fix is centralized in the
terminal reporter.
- **Both C0 and C1 control codepoints are escaped at terminal render.**
C1 codepoints like U+009B (CSI) — encoded in valid UTF-8 as `0xC2 0x9B`
— are escaped to close the DEC S8C1T attack vector. Narrow but real
threat.

## Non-changes

- `Finding` struct shape (same fields, same serde derives)
- JSON reporter (already safe via `serde_json`'s automatic RFC 8259
escaping)
- HTTP analyzer source code (raw interpolation is now correct under ADR
0003)
- DNS analyzer (emits no findings)
- No new dependencies

## Test plan

- [x] `cargo test --all` — 166 tests pass (up from 163 pre-PR)
- [x] `cargo clippy --all-targets -- -D warnings` — clean
- [x] `cargo fmt -- --check` — clean
- [x] Unit tests on `escape_for_terminal` helper (11 total): ESC, BEL,
DEL, tab/newline/CR, backslash, printable ASCII, Cyrillic preservation,
emoji preservation, mixed content, empty string, C1 NEL+CSI, C1 range
boundaries
- [x] Reporter integration test: terminal reporter escapes ESC in
summary + evidence, preserves Cyrillic
- [x] Reporter integration test: JSON reporter preserves Cyrillic as
readable Unicode, round-trips via `serde_json::from_str`
- [x] End-to-end contract test covering all three layers (struct
preserves raw, terminal escapes, JSON escapes)
- [x] TLS analyzer test: raw ESC bytes survive to `Finding.summary` on
the non-UTF-8 path (inverted from the pre-PR contract)
- [x] TLS analyzer test: raw Cyrillic substring present in
`Finding.summary` on the non-ASCII path (guards the `{hostname:?}` →
`{hostname}` rollback)

## Multi-agent PR review

Three reviewer agents (code-reviewer, pr-test-analyzer,
comment-analyzer) reviewed this branch and returned APPROVED_WITH_NITS /
COVERAGE_ADEQUATE / APPROVED_WITH_NITS respectively. The code-reviewer's
one Important finding (C1 codepoint bypass + ADR reasoning error) was
applied in `7d2cd3c`; the pr-test-analyzer's top three coverage
suggestions (JSON Cyrillic, TLS non-ASCII rollback, terminal Cyrillic)
were applied in `b9cc820`. One coverage suggestion was deferred to #56
(HTTP analyzer end-to-end through terminal reporter).

## Commits

14 commits on top of `develop`. Notable:
- `f4c6098` + `45cf649` — ADR 0003 (original + correction)
- `1143d0b` — Implementation plan
- `1f7e556` + `755837b` — Escape helper + apply in render loop
- `0e42aad` — Roll back TLS Debug formatter
- `7e4f1df` — End-to-end layering contract test
- `7d2cd3c` + `b9cc820` — Post-review C1 fix + coverage additions

## Related

- Follow-up: #56 (HTTP end-to-end test)
- Supersedes the construction-site approach from PR #49
- References RFC 8259 §7 (JSON control byte escaping), CWE-117 (log
injection), OWASP output encoding guidance
