Summary
hardenUnicodeText() in sanitize_content_core.cjs strips several invisible Unicode characters (U+200B–U+200F, U+2060, bidi controls) but omits U+2061 (FUNCTION APPLICATION), U+2062 (INVISIBLE TIMES), U+2063 (INVISIBLE SEPARATOR), and U+2064 (INVISIBLE PLUS). These Unicode Format (category Cf) characters are invisible in all standard text renderers including GitHub Markdown, are not removed by NFKC normalization, and pass through sanitizeContent() unchanged in v0.68.3. A secret or injection payload fragmented with these characters is byte-different from the plain pattern but visually identical, defeating static regex-based secret-detection patterns. The LLM threat-detection prompt also does not instruct the model to look for invisible-operator fragmentation before evaluating patterns.
Affected Area
Output sanitization / SafeOutputs write-path — actions/setup/js/sanitize_content_core.cjs, Step 3 of hardenUnicodeText(). This sits between agent-generated content and the GitHub write operations performed by safe-outputs jobs.
Reproduction Outline
- Deploy gh-aw v0.68.3 and locate sanitize_content_core.cjs at $RUNNER_TEMP/gh-aw/actions/.
- Construct a payload by inserting U+2061 between every character of a recognizable secret-like marker: MARKER.split('').join('\u2061').
- Pass the fragmented string through sanitizeContent(fragmented, {allowedAliases: []}).
- Observe that the output equals the input (byte-for-byte pass-through, confirmed by out === fragmented // true).
- Confirm NFKC does not help: 'A\u2061B'.normalize('NFKC') === 'A\u2061B' // true.
- Verify the rendered text is visually identical to the original marker while static regex matching against the plain pattern fails.
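The steps above can be sketched as a standalone Node.js script. This is a minimal sketch under stated assumptions: it does not import the real sanitizeContent() from the deployed sanitize_content_core.cjs. Instead, step3Strip is a hypothetical stand-in that applies only the Step 3 character class quoted in this report, and MARKER is a made-up placeholder rather than a real secret format.

```javascript
// Hypothetical placeholder marker; not a real credential format.
const MARKER = "EXAMPLE_SECRET_MARKER_0123456789AB";

// Insert U+2061 (FUNCTION APPLICATION) between every character.
const fragmented = MARKER.split("").join("\u2061");

// Assumption: stand-in for Step 3 of hardenUnicodeText() only, using the
// character class quoted in this report. The real function does more.
function step3Strip(text) {
  return text.replace(
    /[\u00AD\u034F\u200B\u200C\u200D\u200E\u200F\u2060\uFEFF]/g,
    ""
  );
}

const out = step3Strip(fragmented);

// A static detector scanning for the plain pattern.
// MARKER contains only [A-Z0-9_], so no regex escaping is needed here.
const plainPattern = new RegExp(MARKER);

console.log(out === fragmented);                          // true: passed through unchanged
console.log(fragmented.normalize("NFKC") === fragmented); // true: NFKC keeps U+2061
console.log(plainPattern.test(out));                      // false: detector misses it
console.log(plainPattern.test(out.replace(/[\u2061-\u2064]/g, ""))); // true once stripped
```

To reproduce against the real sanitizer, step3Strip would be replaced by the sanitizeContent(fragmented, {allowedAliases: []}) call from the deployed file.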
Observed Behavior
U+2061–U+2064 pass through sanitizeContent() unchanged. The Step 3 regex in hardenUnicodeText() (line 1089 in v0.68.3) is:
result = result.replace(/[\u00AD\u034F\u200B\u200C\u200D\u200E\u200F\u2060\uFEFF]/g, "");
U+2061–U+2064 are absent from this character class and from the bidi-control strip on line 1093. A 33-byte marker fragmented with U+2061 becomes 129 bytes in output (33 ASCII bytes plus 32 separators at 3 UTF-8 bytes each) but renders identically and does not match a regex scanning for the plain pattern.
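The byte counts and the gap in the character class can be checked with a short Node.js sketch. The marker here is a hypothetical 33-character ASCII string chosen only to match the size quoted above; step3Class copies the character class from the report without the g flag so that repeated test() calls carry no state.

```javascript
// Hypothetical 33-byte ASCII marker, used only to reproduce the arithmetic.
const marker = "A".repeat(33);
const fragmented = marker.split("").join("\u2061");

// U+2061 encodes as 3 bytes in UTF-8 (E2 81 A1), and 32 separators are inserted.
const markerBytes = Buffer.byteLength(marker, "utf8");
const fragmentedBytes = Buffer.byteLength(fragmented, "utf8");
console.log(markerBytes, fragmentedBytes); // 33 129

// Character class from the Step 3 regex as quoted in this report.
const step3Class = /[\u00AD\u034F\u200B\u200C\u200D\u200E\u200F\u2060\uFEFF]/;

// Collect the invisible operators that the class does not match.
const missing = ["\u2061", "\u2062", "\u2063", "\u2064"].filter(
  (ch) => !step3Class.test(ch)
);
console.log(missing.map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()));
```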
Expected Behavior
U+2061–U+2064 should be stripped by hardenUnicodeText(), consistent with the documented intent of "zero-width character handling" and "control character removal". The plaintext equivalent is then recovered so that downstream static and LLM-based detection can evaluate it accurately.
Security Relevance
The sanitizer is the primary defense between agent-produced content and GitHub API write operations. A bypass allows invisible-operator-fragmented secret patterns or prompt-injection payloads to reach LLM evaluators and safe-output consumers without being caught by static detection, and without visual indication to human reviewers. The bypass is deterministic and reproducible on any v0.68.3 runner.
Suggested Fix
Extend the Step 3 regex to cover U+2061–U+2064:
// Current:
result = result.replace(/[\u00AD\u034F\u200B\u200C\u200D\u200E\u200F\u2060\uFEFF]/g, "");
// Fixed:
result = result.replace(/[\u00AD\u034F\u200B-\u200F\u2060-\u2064\uFEFF]/g, "");
A broader approach — stripping all Unicode General Category Cf characters — would be more future-proof. Additionally, extend the threat_detection.md prompt's Secret Leak section to mention invisible-operator fragmentation (U+2061–U+2064) alongside existing encoded-representation and homoglyph checks.
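The two options can be compared with a small sketch. The names FIXED_STEP3, BROAD_STEP3, stripFixed, and stripBroad are hypothetical and exist only for illustration; they are not identifiers from sanitize_content_core.cjs. The broad variant relies on the Unicode property escape \p{Cf}, which requires the u flag. It keeps U+034F listed explicitly on the assumption that it must stay covered: U+034F has General Category Mn, not Cf, so \p{Cf} alone would drop it from the class.

```javascript
// Targeted fix from this report: ranges now include U+2061 to U+2064.
const FIXED_STEP3 = /[\u00AD\u034F\u200B-\u200F\u2060-\u2064\uFEFF]/g;

// Broader sketch: every General Category Cf character, plus U+034F,
// which is category Mn and therefore not covered by \p{Cf}.
const BROAD_STEP3 = /[\p{Cf}\u034F]/gu;

function stripFixed(text) {
  return text.replace(FIXED_STEP3, "");
}

function stripBroad(text) {
  return text.replace(BROAD_STEP3, "");
}

// One of each invisible operator, interleaved with visible letters.
const sample = "to\u2061k\u2062e\u2063n\u2064";
console.log(stripFixed(sample)); // token
console.log(stripBroad(sample)); // token

// U+206A (INHIBIT SYMMETRIC SWAPPING) is Cf but outside the fixed ranges,
// so only the broad variant removes it.
console.log(stripFixed("a\u206Ab") === "ab"); // false
console.log(stripBroad("a\u206Ab") === "ab"); // true
```

One trade-off to weigh before adopting the broad class: Cf also contains a few characters that render visibly, such as the Arabic number signs U+0600 to U+0605, so a blanket strip can alter legitimate text in some scripts.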
gh-aw version: v0.68.3
Original finding: https://github.com/githubnext/gh-aw-security/issues/1888