feat(compact): file compression tracking + secret masker (ROADMAP 7.6) by emal-avala · Pull Request #128 · avala-ai/agent-code

emal-avala · 2026-04-16T00:45:14Z

Summary

Lands the core primitives for ROADMAP 7.6 (Advanced History Compression). Self-contained, tested, no query-loop changes.

services::secret_masker — shared regex set for AWS keys, GitHub PATs, OpenAI-style sk- keys, PEM private keys, and generic api_key= / password: / auth_token= assignments. Idempotent mask().
Masker wired into every persistence boundary:
- compact::build_compact_summary_prompt — masks message text before the summarizer LLM sees it
- session::save_session_full — masks serialized JSON before disk write
- output_store::persist_if_large — masks large tool results before disk write (in-memory preview stays unmasked for agent use)
FileCompressionRecord, CompressionLevel, FileCompressionState in compact.rs:
- Four fidelity levels: Full / Partial / Summary / Excluded
- 12-byte SHA256 content hash for change detection — re-reading a modified file resets level to Full, unchanged re-reads preserve existing level
- PROTECTED_TURN_WINDOW = 2 — recently-read files are locked at Full and demote() refuses to compress them
- save() / load() to ~/.cache/agent-code/sessions/<id>.compression.json
Dep: sha2 = "0.10"

Deferred (ROADMAP 7.6 follow-ups)

Left for a follow-up PR that touches the query loop:

Populate FileCompressionState on every tool read (needs query-loop hook)
Wire the protected-turn check into the auto-compact summarizer
/compression slash command (UX)

Test plan

10 new secret_masker unit tests (each regex, idempotency, non-secret passthrough)
6 new compact unit tests (hash change detection, protected window, read-resets-on-change, read-preserves-on-unchanged, demote-refuses-protected, state roundtrip)
cargo test -p agent-code-lib --lib services::secret_masker — 10/10 pass
cargo test -p agent-code-lib --lib services::compact — 28/28 pass
cargo check -p agent-code-lib clean
Reviewer: confirm no behavior change for existing compaction paths (masker is applied but does not alter non-secret content)

chatgpt-codex-connector · 2026-04-16T00:45:21Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

emal-avala · 2026-04-16T00:56:19Z

Self-review summary

Ran a deep review per the request. Found and fixed one critical bug before merge.

🔴 Critical: credential regex corrupted session JSON (fixed in `9b48356`)

The credential pattern used ["']? at both ends to optionally match quoted values. The trailing ["']? could consume a stray closing quote that belonged to the surrounding JSON string literal — producing unparseable output.

Reproducer (added as regression tests in session.rs):

Input message: api_key=hunter2hunter2
Serialized JSON fragment: "text": "api_key=hunter2hunter2"
Buggy output: "text": "api_key=[REDACTED:credential] ← closing quote eaten
Result: serde_json::from_str fails with control character (\u0000-\u001F) found while parsing a string
Impact: every /resume of a session containing an unquoted inner credential would fail

Fix: split the credential pattern into two variants — quoted (requires both open + close delimiter) and unquoted (value char class excludes " and ', so matching stops at any surrounding delimiter instead of eating it).

Added two regression tests that round-trip masked session JSON through serde_json::from_str for multiple secret shapes. Both failed on becbd98, both pass on 9b48356.

✅ Other checks that came back clean

Area	Finding
Idempotency after fix	The replacement `${1}=[REDACTED:credential]` contains `[` and `:` which aren't in the value char class — can't re-match, stays idempotent.
Other regex patterns (AWS, GitHub PAT, OpenAI, PEM)	All have bounded or delimiter-safe forms. No trailing-quote consumption. PEM uses `(?s).*?` but is anchored to `-----END...-----` markers.
Double-masking interaction	`api_key=AKIAIOSFODNN7EXAMPLE` → AWS pattern runs first (`[REDACTED:aws_access_key]`), credential pattern then sees `api_key=[REDACTED...` and fails to match because `[` isn't in the value class. No cascade.
Callers of `persist_if_large`	Only `tools/executor.rs:275`. Preview is returned unmasked for in-memory agent use (correct per spec); disk copy is masked. No caller of `read_persisted` outside the module, so the mask asymmetry is invisible to the agent.
Callers of `save_session_full`	`schedule/executor.rs:135`, same signature as before. No breakage.
Public API breakage	Zero. All new items are additive. `persist_if_large_in` is `pub(crate)`, `serialize_masked` is `pub(crate)`.
Dead code	`FileCompressionState::{new, record_read, demote, save, load}` and `FileCompressionRecord::is_protected` are unused by the query loop yet — this is intentional (ROADMAP 7.6 follow-up) and all `pub` so clippy doesn't warn. Documented in the commit message.
HashMap<PathBuf, …> serde	PathBuf serializes as string, serde_json accepts string map keys. Roundtrip test (`compression_state_roundtrip`) validates this.
sha2 dep	Widely audited, stable 0.10. Used only for `hash_content()` (12-byte SHA256 slice, non-cryptographic use).
Thread safety	`LazyLock<Vec<Pattern>>` is `Sync`. `mask()` is pure read on the shared regex set.
Memory	`secret_masker::mask` clones + six `replace_all` passes. For megabyte-scale session JSON, transient overhead is tolerable (save is not a hot path).
Error paths	`serialize_masked` adds no new error sources beyond existing `serde_json::to_string_pretty` failure.
TOCTOU / file races	`save_session_full` unchanged behavior around `path.exists()` + read.
`output_store` write failure fallback	On write error, returns preview from raw `content` (not masked) — correct, preview spec always uses raw.

Test coverage

10 secret_masker unit tests (each regex + idempotency + non-secret passthrough)
6 compact tracking tests (hash detection, protected window, state roundtrip)
8 wire-up tests (masker actually invoked at each persistence boundary)
2 new regression tests (JSON round-trip after masking, multiple secret shapes)
Full agent-code-lib suite: 579/580 pass (1 unrelated pre-existing sandbox::tests::auto_detect_off_macos_is_noop failure — bwrap binary presence dependent, not touched by this PR)
cargo clippy --all-targets -- -D warnings ✓
cargo fmt --check ✓

Verdict: safe to merge after 9b48356 lands.

Adds the core primitives for advanced history compression: - `services::secret_masker` — shared regex set for AWS/GitHub/OpenAI keys, PEM private keys, and generic credential assignments. Applied at every persistence boundary: compaction LLM prompt, session.rs disk writes, output_store.rs large tool results. - `FileCompressionRecord`, `CompressionLevel`, `FileCompressionState` in services::compact. Tracks per-file fidelity (Full / Partial / Summary / Excluded), 12-byte SHA256 content hash for change detection, protected turn window so recently-read files are locked at Full, and save/load to ~/.cache/agent-code/sessions/<id>.compression.json. - hash_content() SHA256 helper; sha2 = "0.10" added to crates/lib. Not yet wired: protected-window check into the auto-compact path, query-loop integration that populates FileCompressionState on every tool read, and the /compression slash command. These require touching the query loop and are left as follow-ups per ROADMAP 7.6 checklist. Tests: 16 new unit tests (10 secret_masker, 6 compact tracking). All services tests pass.

…ries Adds 8 integration tests that verify the masker is actually invoked at each persistence site (not just that the masker module works in isolation): - compact.rs: build_compact_summary_prompt masks AWS keys and GitHub PATs from user AND assistant messages before the summary prompt is built. - session.rs: serialize_masked (new pub(crate) helper extracted from save_session_full) redacts secrets from the serialized JSON, preserves innocuous code, and keeps non-secret metadata intact. - output_store.rs: persist_if_large_in (new pub(crate) variant that takes a store directory) writes masked content to disk while keeping the in-memory preview unmasked for agent use. Covers both the specific AWS-key regex and the generic credential rule. Also formats secret_masker.rs with cargo fmt (rustfmt 1.8.0 prefers a different wrapping for the inline replace_all call).

The credential regex used a single `["']?` at both ends to optionally match quoted values. This was greedy and could consume a stray trailing quote that belonged to the surrounding string literal — catastrophically breaking session JSON when a message contained an unquoted inner secret like `api_key=hunter2hunter2`. Reproducer: Input (JSON fragment): "text": "api_key=hunter2hunter2" Buggy output: "text": "api_key=[REDACTED:credential] ^ closing quote eaten Result: serde_json::from_str fails with "control character (\u0000-\u001F) found while parsing a string" — every `/resume` of a session saved with such a message would fail. Fix: split the credential pattern into two — quoted (requires both open and close delimiter) and unquoted (value char class excludes `"` and `'`, so matching naturally stops at any surrounding string delimiter). Neither variant can eat a stray closing quote. Tests: - Two new regression tests in session.rs that assert the masked session JSON round-trips through serde_json::from_str for a variety of secret shapes (unquoted assignments, mixed quoting, URL-embedded credentials).

…-escape support Deep review surfaced two more gaps; both are fixed and covered by new regression tests. ### Gaps found 1. **Quoted credential pattern missed JSON-escaped forms.** Tool output often contains config-file fragments like `api_key = "xxx"`. When a message carrying that text is serialized into the session JSON, the inner `"` become `\"`. The quoted pattern required literal `"` on both sides, so these escaped pairs were not masked. Fix: allow an optional leading `\` before each quote (`\\?"..."` / `\\?'...'`). Still never consumes a stray surrounding JSON delimiter — the bound anchors are paired. 2. **URL-embedded credentials leaked entirely.** `postgres://user:hunter2hunter2@host/db`, `redis://:pw@host`, etc. never matched any pattern. Added a dedicated `url_credential` rule covering postgres/postgresql/mysql/mariadb/redis/rediss/ mongodb(+srv)/amqp/amqps/mqtt/mqtts/smtp/smtps/sftp/ssh/ldap/ ldaps/http/https. Password char class excludes `@`, whitespace, `"`, `'`, and `\` so it stops at the URL boundary and string delimiters. Scheme and username are preserved for debugging. ### New tests (27 added, all green) - secret_masker (+11): - `masks_single_quoted_credential` - `masks_mixed_quoted_and_unquoted_in_one_input` - `does_not_consume_surrounding_json_quote` (direct regression) - `does_not_mask_json_key_form` (structural invariant) - `empty_input_does_not_panic` - `strengthened_idempotency_across_split_pattern` - `masks_uppercase_env_var_style` - `masks_url_embedded_password_postgres` - `masks_url_embedded_password_redis_without_user` - `masks_url_embedded_password_inside_json_escape` - `does_not_mask_url_without_password` - session (+3): - `serialize_masked_redacts_secret_in_tool_result_block` — covers ContentBlock::ToolResult which isn't reached by the compact summary prompt path. - `serialize_masked_handles_many_messages_with_mixed_secrets` — 4-message stress with AWS key, URL password, unquoted token, quoted api_key. Verifies all are redacted and the JSON still round-trips. - `serialize_masked_is_idempotent_save_load_save` — re-saving a loaded session produces byte-identical JSON. - output_store (+2): - `persist_if_large_at_exact_threshold_passes_through` — guards the `<=` boundary against a future refactor to `<`. - `persist_if_large_at_threshold_plus_one_writes_to_disk` - compact (+3): - `compression_state_empty_roundtrip` - `compression_state_handles_unicode_paths` - `compression_state_demote_after_protection_window_expires` Totals: 598 agent-code-lib unit tests pass, clippy clean, fmt clean. (The single pre-existing `sandbox::tests::auto_detect_off_macos_is_noop` failure is unrelated and dependent on local `bwrap` presence.)

emal-avala · 2026-04-16T01:06:27Z

Second deep review pass — 2 more issues found and fixed (`5f02ba5`)

Took another careful pass specifically looking for edge cases the first review missed. Found two additional security gaps, fixed both, added regression tests.

🟠 Gap 1: JSON-escaped quotes defeated the quoted credential pattern

Scenario: tool output contains a config fragment like api_key = "xxx". When the message carrying that text is serialized into session JSON, the inner " get escaped to \". The previous quoted pattern looked for a literal " on both sides and so missed escaped pairs entirely — the secret would be written to disk unredacted.

Caught by: new serialize_masked_handles_many_messages_with_mixed_secrets stress test (4 messages, mixed secret shapes including a ToolResult block with api_key = \"secretprovidervalue\").

Fix: extended the quoted pattern to allow an optional leading \ before each quote — \\?"..." / \\?'...'. The paired anchoring still guarantees we never consume a stray surrounding delimiter.

🟠 Gap 2: URL-embedded credentials leaked entirely

Scenario: postgres://user:hunter2hunter2@host/db, redis://:pw@cache, mongodb+srv://user:pw@cluster/db — common in tool output from env, docker inspect, pg_dump, etc. None of the existing patterns matched these; the passwords were written to session files verbatim.

Caught by: same stress test above.

Fix: added a dedicated url_credential rule covering postgres(ql)? | mysql | mariadb | redis(s)? | mongodb(+srv)? | amqp(s)? | mqtt(s)? | smtp(s)? | sftp | ssh | ldap(s)? | https?. The password char class excludes @, whitespace, ", ', and \ — so the match stops at the URL boundary and never eats past a string delimiter (same safety property as the credential fix). Scheme and username are preserved in the replacement (${scheme}://${user}:[REDACTED:url_credential]@) for debuggability.

New tests this pass — 19 total

secret_masker (+11 unit tests)

Test	What it covers
`masks_single_quoted_credential`	`api_key = 'hunter2'` (single quotes)
`masks_mixed_quoted_and_unquoted_in_one_input`	Pattern ordering: both run against the same input
`does_not_consume_surrounding_json_quote`	Direct regression test for the v1 fix — parses output as JSON
`does_not_mask_json_key_form`	`"api_key": "..."` JSON form stays structurally valid
`empty_input_does_not_panic`	Sanity
`strengthened_idempotency_across_split_pattern`	3× mask round across cocktail of secret shapes
`masks_uppercase_env_var_style`	`API_KEY=...` case handling
`masks_url_embedded_password_postgres`	postgres:// URL
`masks_url_embedded_password_redis_without_user`	redis://:pw@ (no user)
`masks_url_embedded_password_inside_json_escape`	URL inside a JSON string value
`does_not_mask_url_without_password`	No false positive on plain `https://example.com`

session (+3 wire-up regression tests)

Test	What it covers
`serialize_masked_redacts_secret_in_tool_result_block`	Secrets in `ContentBlock::ToolResult` content (not reached by the compact summary prompt path)
`serialize_masked_handles_many_messages_with_mixed_secrets`	4-message stress: AWS key, URL password, unquoted token, quoted `api_key`. This is the test that found both gaps.
`serialize_masked_is_idempotent_save_load_save`	Re-saving a loaded session produces byte-identical JSON

output_store (+2 boundary tests)

Test	What it covers
`persist_if_large_at_exact_threshold_passes_through`	`<=` guard: exactly `INLINE_THRESHOLD` bytes → no disk write
`persist_if_large_at_threshold_plus_one_writes_to_disk`	One byte past → disk write fires

compact (+3 compression state tests)

Test	What it covers
`compression_state_empty_roundtrip`	Empty state serializes/deserializes
`compression_state_handles_unicode_paths`	PathBuf with `café/niño.rs` keys
`compression_state_demote_after_protection_window_expires`	Exact `PROTECTED_TURN_WINDOW = 2` boundary (turns 0, 1 fail; turn 2 succeeds)

Final verification

cargo fmt --check ✓
cargo clippy -D warnings ✓
Full lib suite: 598/599 pass — only unrelated pre-existing sandbox::tests::auto_detect_off_macos_is_noop (depends on local bwrap presence)
All new masker, session, output_store, compact tests green
Branch integrity vs main: clean (only the 7 intended files)

Verdict: safe to merge after CI reruns on 5f02ba5.

emal-avala added 3 commits April 15, 2026 17:56

emal-avala force-pushed the feat/compact-file-tracking-secret-masker branch from 9b48356 to 62bd27a Compare April 16, 2026 00:57

emal-avala merged commit 155aa18 into main Apr 16, 2026
13 of 14 checks passed

emal-avala deleted the feat/compact-file-tracking-secret-masker branch April 16, 2026 01:17

emal-avala mentioned this pull request Apr 16, 2026

chore: release v0.16.0 #132

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(compact): file compression tracking + secret masker (ROADMAP 7.6)#128

feat(compact): file compression tracking + secret masker (ROADMAP 7.6)#128
emal-avala merged 4 commits intomainfrom
feat/compact-file-tracking-secret-masker

emal-avala commented Apr 16, 2026

Uh oh!

chatgpt-codex-connector Bot commented Apr 16, 2026

Uh oh!

emal-avala commented Apr 16, 2026

Uh oh!

emal-avala commented Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

emal-avala commented Apr 16, 2026

Summary

Deferred (ROADMAP 7.6 follow-ups)

Test plan

Uh oh!

chatgpt-codex-connector Bot commented Apr 16, 2026

Uh oh!

emal-avala commented Apr 16, 2026

Self-review summary

🔴 Critical: credential regex corrupted session JSON (fixed in 9b48356)

✅ Other checks that came back clean

Test coverage

Uh oh!

emal-avala commented Apr 16, 2026

Second deep review pass — 2 more issues found and fixed (5f02ba5)

🟠 Gap 1: JSON-escaped quotes defeated the quoted credential pattern

🟠 Gap 2: URL-embedded credentials leaked entirely

New tests this pass — 19 total

Final verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

🔴 Critical: credential regex corrupted session JSON (fixed in `9b48356`)

Second deep review pass — 2 more issues found and fixed (`5f02ba5`)