Summary
MaskXmlInvalidCharacters() in Transform.cs processes UTF-16 code units instead of Unicode code points. This causes supplementary characters (U+10000–U+10FFFF) encoded as surrogate pairs to be replaced with “?”, resulting in silent data corruption in XML log output.
Root Cause
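The regex [^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD] is matched against individual UTF-16 code units (char values), not code points. A supplementary character is stored as a surrogate pair (a high surrogate in U+D800–U+DBFF followed by a low surrogate in U+DC00–U+DFFF), and both halves fall in the gap between \uD7FF and \uE000 that the character class excludes, so each half is matched and replaced separately.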
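The behaviour described above can be reproduced outside the library with a minimal sketch. The class and method names (SurrogateDemo, Mask) are hypothetical; only the character class is taken from this report, and the surrounding code is an assumption about how the regex is applied, not a copy of Transform.cs.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SurrogateDemo
{
    // Character class quoted in this report; everything else here is illustrative.
    private static readonly Regex InvalidChars =
        new Regex(@"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]");

    // Hypothetical stand-in for the masking step: replace every match with the mask.
    public static string Mask(string text, string mask)
    {
        return InvalidChars.Replace(text, mask);
    }

    public static void Main()
    {
        // U+1F511 (KEY) is stored as the surrogate pair D83D DD11.
        string input = "admin\uD83D\uDD11";

        Console.WriteLine(input.Length);     // 7: five letters plus two code units
        Console.WriteLine(Mask(input, "?")); // admin??
    }
}
```

The string has seven code units, and the regex sees the last two as separate, individually invalid characters.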
Impact
Silent corruption of non-BMP characters (emoji, CJK extensions, symbols)
Loss of original data in XmlLayoutSchemaLog4J XML output
Affects structured logs consumed by downstream systems
Example
Input: admin🔑
Output: admin?? (one "?" per surrogate code unit, since each half of the pair is replaced separately)
Suggestion
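Use code-point-aware processing instead of a regex over individual UTF-16 code units: pass well-formed surrogate pairs through unchanged and mask only lone surrogates and other characters that XML 1.0 disallows. Note that XmlConvert.IsXmlChar takes a single char, so it has to be combined with XmlConvert.IsXmlSurrogatePair (or an explicit surrogate-pair check) to handle supplementary characters.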
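One possible shape for such a fix is sketched below. It is an illustration under stated assumptions, not a proposed patch: the class and method names (XmlMasking, MaskByCodePoint, IsValidBmpXmlChar) are hypothetical, and the allowed ranges are those of the XML 1.0 Char production, which also permits U+10000–U+10FFFF.

```csharp
using System;
using System.Text;

public static class XmlMasking
{
    // Hypothetical replacement for the regex-based masking.
    // Well-formed surrogate pairs are copied through; lone surrogates are masked.
    public static string MaskByCodePoint(string text, string mask)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        var result = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // A valid pair always encodes U+10000..U+10FFFF, which XML 1.0 allows.
                result.Append(c).Append(text[i + 1]);
                i++; // skip the low surrogate that was just consumed
            }
            else if (IsValidBmpXmlChar(c))
            {
                result.Append(c);
            }
            else
            {
                // Control characters, U+FFFE/U+FFFF, and unpaired surrogates.
                result.Append(mask);
            }
        }
        return result.ToString();
    }

    // Same BMP ranges as the character class in the existing regex.
    private static bool IsValidBmpXmlChar(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= '\u0020' && c <= '\uD7FF')
            || (c >= '\uE000' && c <= '\uFFFD');
    }

    public static void Main()
    {
        // Key emoji, then a control character, then a lone high surrogate.
        string masked = MaskByCodePoint("admin\uD83D\uDD11\u0001\uD83D", "?");
        Console.WriteLine(masked == "admin\uD83D\uDD11??"); // True
    }
}
```

With this approach the input from the example above is preserved as-is, while genuinely invalid input is still masked one "?" per offending code unit. A single pass with a StringBuilder also avoids the regex entirely, though whether that matters for performance here has not been measured.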
Severity
Low (data integrity issue)