
sanitize_content_core: extend hardenUnicodeText to strip U+2061–U+2064 Invisible Mathematical Operators #28036

@szabta89

Description

Summary

hardenUnicodeText() in sanitize_content_core.cjs strips several invisible Unicode characters (U+200B–U+200F, U+2060, bidi controls) but omits U+2061 (FUNCTION APPLICATION), U+2062 (INVISIBLE TIMES), U+2063 (INVISIBLE SEPARATOR), and U+2064 (INVISIBLE PLUS). These Unicode Format (category Cf) characters are invisible in all standard text renderers including GitHub Markdown, are not removed by NFKC normalization, and pass through sanitizeContent() unchanged in v0.68.3. A secret or injection payload fragmented with these characters is byte-different from the plain pattern but visually identical, defeating static regex-based secret-detection patterns. The LLM threat-detection prompt also does not instruct the model to look for invisible-operator fragmentation before evaluating patterns.

Affected Area

Output sanitization / SafeOutputs write-path — actions/setup/js/sanitize_content_core.cjs, Step 3 of hardenUnicodeText(). This sits between agent-generated content and the GitHub write operations performed by safe-outputs jobs.

Reproduction Outline

  1. Deploy gh-aw v0.68.3; locate sanitize_content_core.cjs at $RUNNER_TEMP/gh-aw/actions/.
  2. Construct a payload by inserting U+2061 between every character of a recognizable secret-like marker: MARKER.split('').join('\u2061').
  3. Pass the fragmented string through sanitizeContent(fragmented, {allowedAliases: []}).
  4. Observe that the output equals the input (byte-for-byte pass-through, confirmed by out === fragmented // true).
  5. Confirm NFKC does not help: 'A\u2061B'.normalize('NFKC') === 'A\u2061B' // true.
  6. Verify the rendered text is visually identical to the original marker while static regex matching against the plain pattern fails.
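Steps 2–6 above can be reproduced in isolation, without the sanitizer, as a standalone Node.js sketch. `MARKER` below is a hypothetical secret-like string chosen for illustration, not an actual credential:

```javascript
// Hypothetical secret-like marker (illustration only, not a real credential).
const MARKER = "ghp_EXAMPLE_SECRET_MARKER_0123456789";

// Step 2: insert U+2061 FUNCTION APPLICATION between every character.
const fragmented = MARKER.split("").join("\u2061");

// Step 5: NFKC normalization does not remove U+2061 (it is a Format
// character, not a compatibility character), so the bytes are unchanged.
console.assert(fragmented.normalize("NFKC") === fragmented);

// Step 6: a static regex for the plain pattern no longer matches...
console.assert(!/ghp_EXAMPLE_SECRET_MARKER/.test(fragmented));

// ...yet stripping the invisible operators recovers the original string,
// which is exactly the step the v0.68.3 sanitizer omits.
console.assert(fragmented.replace(/[\u2061-\u2064]/g, "") === MARKER);
```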

Observed Behavior

U+2061–U+2064 pass through sanitizeContent() unchanged. The Step 3 regex in hardenUnicodeText() (line 1089 in v0.68.3) is:

result = result.replace(/[\u00AD\u034F\u200B\u200C\u200D\u200E\u200F\u2060\uFEFF]/g, "");

U+2061–U+2064 are absent from this character class and from the bidi-control strip on line 1093. A 33-byte marker fragmented with U+2061 becomes 129 bytes in output but renders identically and does not match a regex scanning for the plain pattern.
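The gap can be confirmed directly against the Step 3 character class, reproduced here outside the sanitizer:

```javascript
// The Step 3 character class from v0.68.3 (copied from the report above).
const step3 = /[\u00AD\u034F\u200B\u200C\u200D\u200E\u200F\u2060\uFEFF]/g;

// A string fragmented with all four invisible mathematical operators.
const input = "A\u2061B\u2062C\u2063D\u2064E";
const output = input.replace(step3, "");

// The invisible operators survive untouched: output equals input.
console.assert(output === input);

// By contrast, U+200B ZERO WIDTH SPACE, which the class does cover,
// is stripped as expected.
console.assert("A\u200BB".replace(step3, "") === "AB");
```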

Expected Behavior

U+2061–U+2064 should be stripped by hardenUnicodeText(), consistent with the documented intent of "zero-width character handling" and "control character removal", so that the plaintext equivalent is recovered and downstream static and LLM-based detection can evaluate it accurately.

Security Relevance

The sanitizer is the primary defense between agent-produced content and GitHub API write operations. A bypass allows invisible-operator-fragmented secret patterns or prompt-injection payloads to reach LLM evaluators and safe-output consumers without being caught by static detection, and without visual indication to human reviewers. The bypass is deterministic and reproducible on any v0.68.3 runner.

Suggested Fix

Extend the Step 3 regex to cover U+2061–U+2064:

// Current:
result = result.replace(/[\u00AD\u034F\u200B\u200C\u200D\u200E\u200F\u2060\uFEFF]/g, "");
// Fixed:
result = result.replace(/[\u00AD\u034F\u200B-\u200F\u2060-\u2064\uFEFF]/g, "");

A broader approach — stripping all Unicode General Category Cf characters — would be more future-proof. Additionally, extend the threat_detection.md prompt's Secret Leak section to mention invisible-operator fragmentation (U+2061–U+2064) alongside existing encoded-representation and homoglyph checks.
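A minimal sketch of the broader Cf-category approach, using a Unicode property escape (supported in modern Node.js regexes with the `u` flag). Note one caveat this introduces: U+200D ZERO WIDTH JOINER is also category Cf, so blanket stripping breaks emoji ZWJ sequences, and a real implementation may want to exempt it:

```javascript
// Sketch: remove every Unicode Format (General Category Cf) character.
// Caveat: this also strips U+200D ZERO WIDTH JOINER, which breaks emoji
// ZWJ sequences; exempt it if that matters for the content being sanitized.
function stripFormatChars(text) {
  return text.replace(/\p{Cf}/gu, "");
}

// U+2061-U+2064 are Cf, so operator fragmentation is undone:
console.assert(stripFormatChars("A\u2061B\u2062C") === "ABC");
// So are characters the current regex already targets (U+200B, U+FEFF):
console.assert(stripFormatChars("A\u200BB\uFEFFC") === "ABC");
```

One difference from the current hand-written class: U+034F COMBINING GRAPHEME JOINER is category Mn, not Cf, so the existing explicit strip for it would still be needed alongside `\p{Cf}`.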


gh-aw version: v0.68.3

Original finding: https://github.com/githubnext/gh-aw-security/issues/1888

