feat(reporter): layered output sanitization + ADR 0003#57
Conversation
Establishes the architectural principle that the data layer (analyzers and the Finding struct) holds raw bytes from network captures, while each reporter (terminal, JSON, future CSV/SQLite/HTML/SIEM) is responsible for all formatting that depends on its own output medium — escaping, styling, truncation, localization. The motivating consequence is terminal-safe escaping: an audit during PR #49 follow-up found 7 unprotected interpolations in the HTTP analyzer where attacker-controlled bytes could reach the operator's terminal and be interpreted as ANSI escape sequences (CWE-117). PR #49 patched TLS by escaping at the construction site, which solved the immediate vulnerability but destroyed forensic data (Cyrillic SNIs became hex blobs in JSON output) and required every new analyzer to remember the rule — the HTTP audit shows the rule never propagated. The decision: escape at display time in the reporter, using stdlib str::escape_default in the terminal reporter. Analyzers store raw bytes; JSON output stays safe via serde_json's automatic RFC 8259 escaping; future reporters apply their own format's rules. Validated against OWASP output-encoding guidance (CWE-117) and Rust stdlib behavior via Perplexity 2026-04-08. Other formatting concerns (truncation, cipher-name fallback, verdict localization) follow the same principle but are explicitly deferred per YAGNI. Implementation follows in subsequent commits.
…::escape_default
An initial Perplexity answer (relied on during the first draft) claimed
str::escape_default preserves multi-byte UTF-8 while escaping only C0
control bytes. Empirical verification during plan self-review — a small
rustc-compiled program that fed ESC, BEL, DEL, Cyrillic, emoji, and
mixed content through str::escape_default — falsified this: the method
routes every character through char::escape_default, which escapes all
non-ASCII as \u{...}. A Cyrillic SNI like "пример.рф" would become
"\u{43f}\u{440}..." — the same UX problem as PR #49's Debug formatter.
The Perplexity answer had conflated str::escape_default (method on str,
operates on chars, escapes all non-ASCII) with std::ascii::escape_default
(free function, operates on bytes, uses \x{hh}). These are not
equivalent.
Corrections in this commit:
- Replaced "The mechanism: str::escape_default" with an 8-line custom
helper that iterates chars() and gates char::escape_default() on
is_ascii_control() || '\\'. Empirically verified against the same
test inputs.
- Replaced the "char::escape_debug (per-char Debug)" alternative entry
with "Stdlib str::escape_default", since they're functionally
equivalent for our input space and the new entry correctly documents
the rejection.
- Updated Rationale bullet to describe the custom helper.
- Updated Validation section to record the empirical check and the
corrected Perplexity claim.
- Fixed an incorrect "\x1b" reference in the behavioral-changes
section (char::escape_default uses \u{1b}, not \x1b).
No change to the layering principle, the decision to escape at
display time, the rollback of TLS PR #49, the HTTP analyzer impact,
or any of the file-level consequences. Only the escape primitive
description changed.
Seven bite-sized TDD tasks implementing the reporting pipeline layering decision from ADR 0003 (commits f4c6098 + 45cf649): Task 1: Cross-reference ADR 0003 from ADR 0002 Task 2: Finding::Display doc warning Task 3: escape_for_terminal helper + 9 unit tests Task 4: Apply helper in terminal reporter render loop Task 5: Roll back TLS construction-site {:?} escaping Task 6: End-to-end regression test (data layer + both reporters) Task 7: cargo test + clippy + fmt Each task has complete code blocks, exact commands, and expected outputs. Test assertions were pre-validated empirically with a small rustc-compiled program (see commit 45cf649 for the primitive correction). No placeholders, no speculative steps. Total impact: ~100 lines of production-code change, ~120 lines of test code, 6-7 commits beyond the ADR.
Point analyzer authors at the layering principle so they know output sanitization is the reporter's responsibility, not theirs.
Doc comment on the impl Display block explaining that the raw text representation is for logging/debugging only and may contain attacker-controlled control bytes. Terminal rendering should go through the terminal reporter, which escapes per ADR 0003.
Private helper in src/reporter/terminal.rs built on stdlib char::escape_default, gated by char::is_ascii_control || backslash. Escapes C0 control bytes, DEL, and backslash while preserving all valid non-ASCII Unicode (Cyrillic, CJK, emoji). Unit tested for ESC, BEL, DEL, tab/newline/CR short escapes, backslash, printable ASCII, Cyrillic, emoji, mixed content, and empty string. Not yet applied to the rendering loop — that's the next commit, so the helper can land with dedicated tests before it's integrated. See ADR 0003 for the layering rationale.
Apply escape_for_terminal to every Finding.summary and every evidence entry in the terminal reporter's rendering loop. This moves output sanitization from the analyzer construction site (where PR #49 placed it) to the display layer, per ADR 0003. JSON output remains unchanged and is already safe via serde_json's automatic RFC 8259 escaping. HTTP findings (path traversal, web shell, admin panel, unusual method, etc.) — previously vulnerable to terminal injection via attacker-controlled URIs — are fixed by this change with zero analyzer-side code modifications.
…ed_ev Minor clarity improvement from Task 4 code quality review. "safe" is ambiguous — the escaped string still contains the escape sequences in textual form, which is what the terminal wants. "escaped" names the action (control-byte escaping via char::escape_default) rather than a misleading safety property. Zero behavioral change.
Remove {:?} Debug-format interpolation in both NonAsciiUtf8 and NonUtf8
SNI finding arms. The data layer now stores raw post-from_utf8_lossy
bytes; the terminal reporter is responsible for escaping at render
time per ADR 0003.
Inverts test_non_utf8_sni_escapes_control_bytes_in_summary — renamed
to test_non_utf8_sni_preserves_raw_bytes_in_summary — to enforce the
new contract: raw ESC bytes MUST survive to Finding.summary. A
separate regression test (next commit) verifies the end-to-end
pipeline: terminal reporter escapes, JSON reporter preserves via
serde, struct holds the raw byte.
Behavioral change visible to JSON consumers: Cyrillic SNIs now appear
as readable Unicode ("пример.рф") instead of hex-escaped
placeholders ("\u{43f}\u{440}...").
Single Finding flows through the data layer and both reporters. Asserts:
1. Finding struct preserves raw ESC byte (forensic ground truth).
2. Terminal reporter escapes ESC to \u{1b} on display.
3. JSON reporter serializes ESC as \u001b per RFC 8259.
4. JSON round-trips back to the original raw byte (reversibility).
Any regression that re-introduces construction-site escaping, removes
the terminal reporter's helper, or changes the JSON path will fail
this test.
The escape_for_terminal helper gated only on char::is_ascii_control(), which matches C0 + DEL but NOT the C1 range (U+0080-U+009F). C1 codepoints like U+009B (CSI) and U+0085 (NEL) can appear in a valid UTF-8 String — they're encoded as two-byte sequences (e.g., 0xC2 0x9B for U+009B) that survive from_utf8_lossy. A terminal in DEC S8C1T mode interprets U+009B as the 8-bit equivalent of ESC[, so an attacker-supplied SNI or HTTP URI containing 0xC2 0x9B could bypass the escape and inject ANSI control sequences. Modern terminals in UTF-8 mode default to NOT interpreting C1 this way (verified via Perplexity for xterm, iTerm2, gnome-terminal, Windows Terminal, VS Code, tmux), but the DEC mode exists and some legacy terminals honor it. Escape C1 as cheap insurance. The ADR's Coverage of the threat model section contained a subtly wrong claim: 'a standalone byte in the 0x80-0x9f range cannot appear in a valid UTF-8 string'. That's correct for a single raw byte but irrelevant — the codepoint U+009B is a valid multi-byte UTF-8 sequence and can absolutely appear in a String. ADR updated to reflect the actual helper behavior and the corrected reasoning. Adds two new unit tests (escapes_c1_nel_and_csi, escapes_c1_range_boundaries) covering U+0080, U+0085, U+009B, U+009F, plus the U+00A0 (NBSP) boundary — which is just past C1 and must pass through unchanged.
Three additions from the multi-agent PR review triage: 1. test_json_reporter_preserves_cyrillic_as_readable_unicode — pins the ADR 0003 user-visible claim that JSON consumers now see raw Cyrillic hostnames instead of the PR #49 hex-mangled form. Includes a round-trip serde_json::from_str check proving the Cyrillic survives the full trip. 2. test_terminal_reporter_escapes_esc_bytes_in_summary augmented with Cyrillic content in the summary payload. Previously only covered ASCII + ESC — now proves the render loop preserves Cyrillic while still escaping control bytes (catches accidental double-escape bugs in the for-findings loop). 3. test_cyrillic_sni_emits_non_ascii_finding augmented with raw- Cyrillic-substring assertions on Finding.summary. This directly guards the {hostname:?} → {hostname} rollback on the NonAsciiUtf8 match arm in src/analyzer/tls.rs, which was otherwise only structurally similar (and untested) compared to the non-UTF-8 branch that has the inverted dedicated test.
There was a problem hiding this comment.
Pull request overview
Implements ADR 0003’s “layered reporting pipeline” rule by preserving attacker-controlled bytes in the data layer (Finding) while performing output-medium-specific escaping at render time (terminal), restoring forensic fidelity (Unicode preserved) and centralizing terminal-safety.
Changes:
- Add terminal-layer escaping (
escape_for_terminal) and apply it when rendering finding summaries/evidence. - Roll back TLS analyzer construction-site Debug escaping so
Finding.summarypreserves raw bytes again. - Add ADR 0003 + tests that pin the layering contract across data/terminal/JSON outputs.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/reporter/terminal.rs |
Adds escape_for_terminal and applies it to Finding.summary/evidence during terminal rendering. |
src/analyzer/tls.rs |
Removes {:?}-based construction-site escaping for SNI summaries to preserve raw bytes. |
src/findings.rs |
Documents that Finding’s Display is raw and not terminal-safe. |
tests/reporter_tests.rs |
Adds integration/contract tests for terminal escaping + JSON Unicode preservation/round-trip. |
tests/tls_analyzer_tests.rs |
Inverts TLS test contract to assert raw bytes are preserved in Finding.summary; adds Cyrillic regression guard. |
docs/adr/0003-reporting-pipeline-layering.md |
Introduces ADR 0003 describing the layering rule and escape approach. |
docs/adr/0002-modular-protocol-analyzers.md |
Adds a guideline pointing analyzer authors to ADR 0003 (no construction-site escaping). |
docs/superpowers/plans/2026-04-09-reporting-pipeline-layering.md |
Adds an implementation plan document for the ADR 0003 work. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The C1 control codepoint fix in commit 7d2cd3c updated the helper's predicate to include the U+0080..=U+009F range, but only patched one section of the ADR ('Coverage of the threat model') and left four other places describing the pre-C1 predicate. Copilot's review of #57 caught all five mismatches (four in ADR 0003, one in the plan). Fixes: - ADR 0003 'Immediate scope' section: code sample updated to include the C1 range check, and the 'helper escapes' bullet list gains an explicit C1 bullet pointing to the later threat-model paragraph. - ADR 0003 'Rationale' bullet: 'Escapes exactly the bytes that constitute the threat (C0 + DEL + backslash)' -> 'C0 + C1 + DEL + backslash'. Also bumps the line count estimate (~8 -> ~15) to reflect the helper's actual size after the C1 addition. - ADR 0003 'Consequences' table row for src/reporter/terminal.rs: description now mentions C1 controls as part of the required predicate. - Plan Architecture paragraph: clarifies that the helper gates on ASCII control + C1 control + backslash. Adds a forward reference to commit 7d2cd3c so future plan-followers know the C1 range was added post-plan and Task 3's code samples reflect the pre-C1 predicate for historical accuracy. - Plan 'File Structure' clarification: 'No files are created' was technically inaccurate since this plan doc and ADR 0003 are both new files; reworded to 'No new code or runtime files are created' with an explicit note that the two docs are prerequisites committed before the implementation tasks begin. Per Copilot on #57.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
TerminalReporter::render's ANALYZER SUMMARIES block was printing
detail values (serde_json::Value) via {val} direct interpolation,
relying on serde_json's Display impl to handle escaping. That works
for C0 + DEL (serde_json escapes them per RFC 8259 §7) but NOT for
C1 codepoints (U+0080-U+009F) — serde_json preserves those as raw
UTF-8 two-byte sequences per RFC 8259, which is valid but leaves a
C1 terminal injection vector open.
Empirically verified: a serde_json::Value::Array containing
"before\u{9b}31mafter" renders via Display to a JSON string that
contains the raw 0xC2 0x9B bytes (U+009B CSI). The helper fix in
commit 7d2cd3c closed this gap in the findings block but didn't
reach the analyzer summaries block.
Detail values include attacker-controlled strings from HTTP
top_hosts / top_user_agents / recent_uris and TLS top_snis — these
all come through String::from_utf8_lossy in the analyzers, so
embedded C0 + C1 bytes can appear in the serde_json::Value leaves
exposed to the terminal.
Fix: apply escape_for_terminal to val.to_string() before writing to
the output buffer, matching the findings-block pattern. The JSON
rendering path now has a defense-in-depth double layer:
serde_json's RFC 8259 escape for C0 (which produces \u001b that
escape_for_terminal passes through as printable ASCII), plus
escape_for_terminal's C1 gate catching any raw U+0080-U+009F bytes
in the JSON string.
Adds test_terminal_reporter_escapes_control_bytes_in_analyzer_summaries
covering the C0 and C1 cases through the full ANALYZER SUMMARIES
render path, plus a Cyrillic preservation assertion.
Also fixes the plan's Task 2 doc-comment snippet which still
mentioned str::escape_default — replaces with the correct
escape_for_terminal reference (Copilot round 2 comment).
Per Copilot round 2 review on #57.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
## Summary - Adds 3 end-to-end tests that drive `HttpAnalyzer::on_data` with crafted HTTP requests containing C1 CSI (U+009B) injection, then render through both `TerminalReporter` and `JsonReporter`, verifying ADR 0003's output sanitization contract. - Key discovery during issue validation: `httparse` rejects C0 control bytes (including ESC 0x1b) in URIs and header values, but **accepts** C1 codepoints (U+009B CSI = `0xC2 0x9B`) because their UTF-8 encoding uses high bytes (≥ 0x80). C1 CSI is the real viable injection vector through the HTTP analyzer. - Closes the coverage gap identified by `pr-test-analyzer` during the ADR 0003 PR review (#57). ### Tests added | Test | Vector | Path | |------|--------|------| | `test_http_finding_c1_csi_escaped_by_terminal_reporter` | C1 CSI in path-traversal URI | Finding → TerminalReporter | | `test_http_finding_c1_csi_in_json_reporter` | C1 CSI in path-traversal URI | Finding → JsonReporter round-trip | | `test_http_analyzer_summary_c1_csi_escaped_by_terminal_reporter` | C1 CSI in Host header | AnalyzerSummary → TerminalReporter | Closes #56 ## Test plan - [x] All 12 reporter tests pass (`cargo test --test reporter_tests`) - [x] Full suite passes (170+ tests, 0 failures) - [x] `cargo clippy --all-targets` clean - [x] `cargo fmt --check` clean - [x] Multi-agent PR review (test-analyzer, code-reviewer, comment-analyzer) — 0 critical issues
Summary
Establishes ADR 0003 (Reporting Pipeline Layering) as the architectural rule for how attacker-controlled bytes flow through the reporting pipeline. The data layer (analyzers +
Findingstruct) now stores raw post-from_utf8_lossybytes; each reporter (terminal, JSON, future CSV/SQLite) escapes for its own output medium at render time. Reverses PR #49's construction-site escaping in TLS and centrally fixes an injection vulnerability in the HTTP analyzer's 7 findings.Motivating vulnerability
During PR #49 follow-up, an audit of HTTP finding construction at
src/analyzer/http.rs:183, 209, 227, 242, 257, 271, 285found that path-traversal, web-shell, admin-panel, unusual-method, missing-Host, long-URI, and empty-UA findings all interpolatedparsed.uri/parsed.method/parsed.hostdirectly viaformat!("{}", ...). An attacker-supplied URI containing\x1b[31mRED\x1b[0mwould reach the operator's terminal unescaped (CWE-117).PR #49 patched the TLS path with
{:?}(Debug formatter), which worked for terminal safety but (a) permanently destroyed forensic bytes inFinding.summary— Cyrillic SNIs became\u{43f}\u{440}...in JSON exports — and (b) was tribal knowledge that didn't propagate to HTTP.Solution
Move escaping from construction-site (analyzer) to display-time (terminal reporter), using a custom helper
escape_for_terminal(~15 lines) built onchar::escape_defaultgated bychar::is_ascii_control() || ('\u{80}'..='\u{9f}').contains(&c) || c == '\\'. The gate preserves valid non-ASCII Unicode (Cyrillic, CJK, emoji) while escaping C0, DEL, C1, and backslash.str::escape_defaultwas considered and rejected after empirical verification showed it mangles all non-ASCII (see commit45cf649and ADR 0003 Validation section).Behavioral changes visible to users
пример.рфnow appears readably in JSON exports instead of\u{43f}\u{440}.... Downstream tooling gets the actual hostname.\x1b[31min a URI can no longer recolor the operator's terminal output. Zero HTTP analyzer code changes — the fix is centralized in the terminal reporter.0xC2 0x9B— are escaped to close the DEC S8C1T attack vector. Narrow but real threat.Non-changes
Findingstruct shape (same fields, same serde derives)serde_json's automatic RFC 8259 escaping)Test plan
cargo test --all— 166 tests pass (up from 163 pre-PR)cargo clippy --all-targets -- -D warnings— cleancargo fmt -- --check— cleanescape_for_terminalhelper (11 total): ESC, BEL, DEL, tab/newline/CR, backslash, printable ASCII, Cyrillic preservation, emoji preservation, mixed content, empty string, C1 NEL+CSI, C1 range boundariesserde_json::from_strFinding.summaryon the non-UTF-8 path (inverted from the pre-PR contract)Finding.summaryon the non-ASCII path (guards the{hostname:?}→{hostname}rollback)Multi-agent PR review
Three reviewer agents (code-reviewer, pr-test-analyzer, comment-analyzer) reviewed this branch and returned APPROVED_WITH_NITS / COVERAGE_ADEQUATE / APPROVED_WITH_NITS respectively. The code-reviewer's one Important finding (C1 codepoint bypass + ADR reasoning error) was applied in
7d2cd3c; the pr-test-analyzer's top three coverage suggestions (JSON Cyrillic, TLS non-ASCII rollback, terminal Cyrillic) were applied inb9cc820. One coverage suggestion was deferred to #56 (HTTP analyzer end-to-end through terminal reporter).Commits
14 commits on top of
develop. Notable:f4c6098+45cf649— ADR 0003 (original + correction)1143d0b— Implementation plan1f7e556+755837b— Escape helper + apply in render loop0e42aad— Roll back TLS Debug formatter7e4f1df— End-to-end layering contract test7d2cd3c+b9cc820— Post-review C1 fix + coverage additionsRelated