fix: strip U+2061–U+2064 invisible mathematical operators in hardenUnicodeText #28037

Merged
pelikhan merged 3 commits into main from copilot/extend-harden-unicode-text-stripping
Apr 23, 2026

Copilot AI commented Apr 23, 2026

U+2061–U+2064 (FUNCTION APPLICATION, INVISIBLE TIMES, INVISIBLE SEPARATOR, INVISIBLE PLUS) are Unicode Cf format characters that are invisible in all renderers, survive NFKC normalization, and were passing through sanitizeContent() unchanged. A secret fragmented by inserting one of these characters between each pair of adjacent characters is byte-distinct from the plain pattern but visually identical, so it defeats static regex detection.

Changes

  • sanitize_content_core.cjs — Extend the Step 3 regex in hardenUnicodeText() to cover U+2061–U+2064 via range notation:

    // Before
    result = result.replace(/[\u00AD\u034F\u200B\u200C\u200D\u200E\u200F\u2060\uFEFF]/g, "");
    // After
    result = result.replace(/[\u00AD\u034F\u200B-\u200F\u2060-\u2064\uFEFF]/g, "");
  • sanitize_content.test.cjs — Add tests for each of U+2061–U+2064 individually, a fragmentation-bypass scenario (marker split with \u2061 reassembles to plaintext), and multi-character removal.

  • threat_detection.md (both actions/setup/md/ and pkg/workflow/prompts/) — Add an Invisible Operator Fragmentation bullet under the Secret Leak section so the LLM evaluator is explicitly instructed to recognize this bypass pattern alongside encoded representations and homoglyph substitution.

…atical operators

Extends the Step 3 regex in hardenUnicodeText() to cover U+2061
(FUNCTION APPLICATION), U+2062 (INVISIBLE TIMES), U+2063 (INVISIBLE
SEPARATOR), and U+2064 (INVISIBLE PLUS). These Unicode Cf characters
are invisible in all renderers, not removed by NFKC normalization, and
were previously passing through sanitizeContent() unchanged — allowing
secret-like patterns fragmented with invisible operators to bypass static
regex detection while remaining visually identical.

Also updates both copies of threat_detection.md to instruct the LLM to
check for invisible-operator fragmentation alongside existing
encoded-representation and homoglyph checks.

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/94dfa88e-fdbd-4476-a118-3d070e17dbc0

Co-authored-by: szabta89 <1330202+szabta89@users.noreply.github.com>
Copilot AI changed the title [WIP] Update hardenUnicodeText to strip invisible mathematical operators fix: strip U+2061–U+2064 invisible mathematical operators in hardenUnicodeText Apr 23, 2026
Copilot AI requested a review from szabta89 April 23, 2026 11:53
@pelikhan pelikhan marked this pull request as ready for review April 23, 2026 12:11
Copilot AI review requested due to automatic review settings April 23, 2026 12:11

Copilot AI left a comment

Pull request overview

Updates Unicode hardening and related threat-detection guidance to prevent secrets/payloads from being visually preserved while byte-wise altered using invisible mathematical operator format characters (U+2061–U+2064).

Changes:

  • Extend hardenUnicodeText() to strip U+2061–U+2064 (invisible mathematical operators) alongside existing zero-width/format characters.
  • Add unit tests covering each operator and multi-operator removal, including a fragmentation bypass scenario.
  • Update threat detection prompt docs to explicitly call out “Invisible Operator Fragmentation” as a secret-leak evasion pattern.
| File | Description |
| --- | --- |
| actions/setup/js/sanitize_content_core.cjs | Expands the “strip invisible characters” regex to remove U+2061–U+2064 during Unicode hardening. |
| actions/setup/js/sanitize_content.test.cjs | Adds tests verifying removal of U+2061–U+2064 and a fragmentation-style bypass case. |
| pkg/workflow/prompts/threat_detection.md | Documents invisible-operator fragmentation as a secret leak detection heuristic. |
| actions/setup/md/threat_detection.md | Mirrors the same threat-detection documentation update for the setup action. |

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 1

Comment on lines +1771 to +1776
it("should strip U+2061-U+2064 used to fragment a secret-like marker", () => {
  // Simulate a secret fragmented with invisible operators to bypass static detection
  const marker = "SECRET";
  const fragmented = marker.split("").join("\u2061");
  const result = sanitizeContent(fragmented);
  expect(result).toBe(marker);

Copilot AI Apr 23, 2026

Test description says it strips U+2061–U+2064 fragmentation, but the test only inserts U+2061. Either adjust the title/assertions to match U+2061 specifically, or parameterize the test to cover U+2062–U+2064 as well so the name reflects what’s actually validated.

@github-actions github-actions Bot mentioned this pull request Apr 23, 2026
@github-actions

🧪 Test Quality Sentinel Report

Test Quality Score: 85/100

Excellent test quality

| Metric | Value |
| --- | --- |
| New/modified tests analyzed | 6 |
| ✅ Design tests (behavioral contracts) | 6 (100%) |
| ⚠️ Implementation tests (low value) | 0 (0%) |
| Tests with error/edge cases | 6 (100%) |
| Duplicate test clusters | 1 |
| Test inflation detected | Yes (38 test lines / 5 prod lines ≈ 7.6:1) |
| 🚨 Coding-guideline violations | None |

Test Classification Details

| Test | File | Classification | Issues Detected |
| --- | --- | --- | --- |
| should remove invisible mathematical operator FUNCTION APPLICATION (U+2061) | sanitize_content.test.cjs | ✅ Design | Part of per-character cluster (tests 1–4) |
| should remove invisible mathematical operator INVISIBLE TIMES (U+2062) | sanitize_content.test.cjs | ✅ Design | Part of per-character cluster (tests 1–4) |
| should remove invisible mathematical operator INVISIBLE SEPARATOR (U+2063) | sanitize_content.test.cjs | ✅ Design | Part of per-character cluster (tests 1–4) |
| should remove invisible mathematical operator INVISIBLE PLUS (U+2064) | sanitize_content.test.cjs | ✅ Design | Part of per-character cluster (tests 1–4) |
| should strip U+2061-U+2064 used to fragment a secret-like marker | sanitize_content.test.cjs | ✅ Design | Security-focused adversarial scenario |
| should remove multiple invisible mathematical operators | sanitize_content.test.cjs | ✅ Design | Multi-character combination coverage |

Flagged Tests — Requires Review

No tests require mandatory review. One advisory note is included below.

ℹ️ Per-character cluster (tests 1–4) — advisory only

Classification: Design tests, minor duplication
Observation: Four tests assert identical behavior (expect(sanitizeContent("Hello" + char + "World")).toBe("HelloWorld")) for U+2061–U+2064 individually. This pattern triggers the duplicate-cluster heuristic (3+ tests, same structure, different constants).
Impact on score: −5 pts from the Low Duplication component.
Optional consolidation: A single table-driven test iterating over the four code points would be equally expressive and more concise. However, explicit per-character tests are a legitimate and readable style for a security-hardening change, so this is purely advisory — no action required.


Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 0 tests — no Go test files changed
  • 🟨 JavaScript (*.test.cjs): 6 tests (vitest)

Scoring Breakdown

| Component | Raw | Weight | Score |
| --- | --- | --- | --- |
| Behavioral Coverage | 6/6 = 100% | 40 pts | 40 |
| Error/Edge Case Coverage | 6/6 = 100% | 30 pts | 30 |
| Low Duplication | 1 duplicate cluster | 20 pts | 15 |
| Proportional Growth | 38 test lines / 5 prod lines = 7.6:1 | 10 pts | 0 |
| Total | | | 85 |

Note on inflation: the 7.6:1 ratio reflects the deliberate choice to add one explicit test per new Unicode code point, which is appropriate for a security-hardening change. The penalty is applied mechanically per the rubric but does not reflect a real quality concern here.


Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). All 6 tests verify observable, behavioral contracts against adversarial Unicode input.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: §24834484682

🧪 Test quality analysis by Test Quality Sentinel · ● 327.8K ·

@github-actions github-actions Bot left a comment

✅ Test Quality Sentinel: 85/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). All 6 new tests verify observable behavioral contracts against adversarial invisible Unicode characters.

@pelikhan

@copilot review all comments

Copilot AI commented Apr 23, 2026

@copilot review all comments

The fragmentation test has been parameterized to run once for each of U+2061–U+2064, so the test name now accurately reflects what's validated. Commit b7b627a.

Copilot AI requested a review from pelikhan April 23, 2026 12:24
@pelikhan pelikhan merged commit e0c7b2a into main Apr 23, 2026
20 of 24 checks passed
@pelikhan pelikhan deleted the copilot/extend-harden-unicode-text-stripping branch April 23, 2026 12:31

Development

Successfully merging this pull request may close these issues.

sanitize_content_core: extend hardenUnicodeText to strip U+2061–U+2064 Invisible Mathematical Operators

4 participants