
sanitize_content_core: extend hardenUnicodeText to strip U+2061–U+2064 Invisible Mathematical Operators #28036

@szabta89

Description

Summary

hardenUnicodeText() in sanitize_content_core.cjs strips several invisible Unicode characters (U+200B–U+200F, U+2060, bidi controls) but omits U+2061 (FUNCTION APPLICATION), U+2062 (INVISIBLE TIMES), U+2063 (INVISIBLE SEPARATOR), and U+2064 (INVISIBLE PLUS). These Unicode Format (category Cf) characters are invisible in all standard text renderers including GitHub Markdown, are not removed by NFKC normalization, and pass through sanitizeContent() unchanged in v0.68.3. A secret or injection payload fragmented with these characters is byte-different from the plain pattern but visually identical, defeating static regex-based secret-detection patterns. The LLM threat-detection prompt also does not instruct the model to look for invisible-operator fragmentation before evaluating patterns.

Affected Area

Output sanitization / SafeOutputs write-path — actions/setup/js/sanitize_content_core.cjs, Step 3 of hardenUnicodeText(). This sits between agent-generated content and the GitHub write operations performed by safe-outputs jobs.

Reproduction Outline

  1. Deploy gh-aw v0.68.3; locate sanitize_content_core.cjs at $RUNNER_TEMP/gh-aw/actions/.
  2. Construct a payload by inserting U+2061 between every character of a recognizable secret-like marker: MARKER.split('').join('\u2061').
  3. Pass the fragmented string through sanitizeContent(fragmented, {allowedAliases: []}).
  4. Observe that the output equals the input (byte-for-byte pass-through, confirmed by out === fragmented // true).
  5. Confirm NFKC does not help: 'A\u2061B'.normalize('NFKC') === 'A\u2061B' // true.
  6. Verify the rendered text is visually identical to the original marker while static regex matching against the plain pattern fails.
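Steps 2–6 above can be reproduced in isolation, without the sanitizer, as a standalone Node.js sketch. `MARKER` below is a hypothetical secret-like string chosen for illustration, not an actual credential:

```javascript
// Hypothetical secret-like marker (illustration only, not a real credential).
const MARKER = "ghp_EXAMPLE_SECRET_MARKER_0123456789";

// Step 2: insert U+2061 FUNCTION APPLICATION between every character.
const fragmented = MARKER.split("").join("\u2061");

// Step 5: NFKC normalization does not remove U+2061 (it is a Format
// character, not a compatibility character), so the bytes are unchanged.
console.assert(fragmented.normalize("NFKC") === fragmented);

// Step 6: a static regex for the plain pattern no longer matches...
console.assert(!/ghp_EXAMPLE_SECRET_MARKER/.test(fragmented));

// ...yet stripping the invisible operators recovers the original string,
// which is exactly the step the v0.68.3 sanitizer omits.
console.assert(fragmented.replace(/[\u2061-\u2064]/g, "") === MARKER);
```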

Observed Behavior

U+2061–U+2064 pass through sanitizeContent() unchanged. The Step 3 regex in hardenUnicodeText() (line 1089 in v0.68.3) is:

result = result.replace(/[\u00AD\u034F\u200B\u200C\u200D\u200E\u200F\u2060\uFEFF]/g, "");

U+2061–U+2064 are absent from this character class and from the bidi-control strip on line 1093. A 33-byte marker fragmented with U+2061 becomes 129 bytes in output but renders identically and does not match a regex scanning for the plain pattern.
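The gap can be confirmed directly against the Step 3 character class, reproduced here outside the sanitizer:

```javascript
// The Step 3 character class from v0.68.3 (copied from the report above).
const step3 = /[\u00AD\u034F\u200B\u200C\u200D\u200E\u200F\u2060\uFEFF]/g;

// A string fragmented with all four invisible mathematical operators.
const input = "A\u2061B\u2062C\u2063D\u2064E";
const output = input.replace(step3, "");

// The invisible operators survive untouched: output equals input.
console.assert(output === input);

// By contrast, U+200B ZERO WIDTH SPACE, which the class does cover,
// is stripped as expected.
console.assert("A\u200BB".replace(step3, "") === "AB");
```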

Expected Behavior

U+2061–U+2064 should be stripped by hardenUnicodeText(), consistent with the documented intent of "zero-width character handling" and "control character removal", so that the plaintext equivalent is recovered and downstream static and LLM-based detection can evaluate it accurately.

Security Relevance

The sanitizer is the primary defense between agent-produced content and GitHub API write operations. A bypass allows invisible-operator-fragmented secret patterns or prompt-injection payloads to reach LLM evaluators and safe-output consumers without being caught by static detection, and without visual indication to human reviewers. The bypass is deterministic and reproducible on any v0.68.3 runner.

Suggested Fix

Extend the Step 3 regex to cover U+2061–U+2064:

// Current:
result = result.replace(/[\u00AD\u034F\u200B\u200C\u200D\u200E\u200F\u2060\uFEFF]/g, "");
// Fixed:
result = result.replace(/[\u00AD\u034F\u200B-\u200F\u2060-\u2064\uFEFF]/g, "");

A broader approach — stripping all Unicode General Category Cf characters — would be more future-proof. Additionally, extend the threat_detection.md prompt's Secret Leak section to mention invisible-operator fragmentation (U+2061–U+2064) alongside existing encoded-representation and homoglyph checks.
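A minimal sketch of the broader Cf-category approach, using a Unicode property escape (supported in modern Node.js regexes with the `u` flag). Note one caveat this introduces: U+200D ZERO WIDTH JOINER is also category Cf, so blanket stripping breaks emoji ZWJ sequences, and a real implementation may want to exempt it:

```javascript
// Sketch: remove every Unicode Format (General Category Cf) character.
// Caveat: this also strips U+200D ZERO WIDTH JOINER, which breaks emoji
// ZWJ sequences; exempt it if that matters for the content being sanitized.
function stripFormatChars(text) {
  return text.replace(/\p{Cf}/gu, "");
}

// U+2061-U+2064 are Cf, so operator fragmentation is undone:
console.assert(stripFormatChars("A\u2061B\u2062C") === "ABC");
// So are characters the current regex already targets (U+200B, U+FEFF):
console.assert(stripFormatChars("A\u200BB\uFEFFC") === "ABC");
```

One difference from the current hand-written class: U+034F COMBINING GRAPHEME JOINER is category Mn, not Cf, so the existing explicit strip for it would still be needed alongside `\p{Cf}`.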


gh-aw version: v0.68.3

Original finding: https://github.com/githubnext/gh-aw-security/issues/1888

