MaskXmlInvalidCharacters() corrupts supplementary Unicode characters in XmlLayoutSchemaLog4J #290

@N0tre3l

Summary
MaskXmlInvalidCharacters() in Transform.cs processes UTF-16 code units instead of Unicode code points. As a result, every supplementary character (U+10000–U+10FFFF), which .NET stores as a surrogate pair, has both of its code units replaced with the mask string (“?” in the example in this report). The XML log output is silently corrupted.

Root Cause
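The regex [^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD] matches one UTF-16 code unit (a .NET char) at a time. A supplementary character is stored as a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). Both values fall in the gap between \uD7FF and \uE000 that the character class excludes, so each surrogate is matched and replaced on its own.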
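
The behaviour described above can be reproduced outside the library. The following is a minimal sketch, assuming only the character class quoted in this section; the class name SurrogateMaskRepro and the Mask helper are hypothetical and are not taken from Transform.cs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SurrogateMaskRepro
{
    // Same character class as quoted in the Root Cause section.
    private static readonly Regex InvalidChars =
        new Regex(@"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]", RegexOptions.Compiled);

    // Hypothetical stand-in for the masking step: replace every match with the mask.
    public static string Mask(string text, string mask) => InvalidChars.Replace(text, mask);

    public static void Main()
    {
        // U+1F511 KEY is stored as the surrogate pair D83D DD11.
        string input = "admin\U0001F511";

        // Length counts UTF-16 code units: 5 for "admin" plus 2 for the pair.
        Console.WriteLine(input.Length); // 7

        // The regex reports each surrogate as a separate invalid character.
        foreach (Match m in InvalidChars.Matches(input))
        {
            Console.WriteLine($"U+{(int)m.Value[0]:X4} at index {m.Index}");
        }

        Console.WriteLine(Mask(input, "?")); // admin??
    }
}
```

The two matches (U+D83D at index 5 and U+DD11 at index 6) show that the regex has no notion of the pair forming a single code point.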

Impact

Silent corruption of non-BMP characters (emoji, CJK extensions, symbols)
Loss of original data in XmlLayoutSchemaLog4J XML output
Affects structured logs consumed by downstream systems

Example
Input: admin🔑
Output: admin?? (one “?” for each of the two surrogate code units)

Suggestion
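Use code-point-aware processing instead of a regex over individual UTF-16 code units: keep valid surrogate pairs intact (for example with char.IsSurrogatePair or XmlConvert.IsXmlSurrogatePair) and mask only unpaired surrogates and characters rejected by XmlConvert.IsXmlChar. XmlConvert.IsXmlChar on its own is not enough, because it checks a single char and does not accept surrogates.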
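
One possible shape for such a fix is sketched below. It is not a patch against the library: the class XmlCharMasker and the method MaskInvalidXmlChars are hypothetical names, and the sketch assumes a target framework where XmlConvert.IsXmlChar is available.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlCharMasker
{
    // Hypothetical helper, not part of the library: masks characters that are
    // invalid in XML 1.0 while keeping valid surrogate pairs intact.
    public static string MaskInvalidXmlChars(string text, string mask)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                // High + low surrogate: one supplementary code point
                // (U+10000-U+10FFFF), which XML 1.0 allows. Copy both units.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                // Valid BMP character.
                sb.Append(c);
            }
            else
            {
                // Control characters, U+FFFE/U+FFFF, and unpaired surrogates.
                sb.Append(mask);
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, MaskInvalidXmlChars("admin\U0001F511", "?") returns the input unchanged, "a\u0001b" becomes "a?b", and a lone "\uD83D" is still masked as "?". A loop is used here rather than an extended regex purely for readability; a regex that treats a high surrogate followed by a low surrogate as valid would be an alternative.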

Severity
Low (data integrity issue)
