
Prototype: Implement \R (Unicode line ending escape, TR18 RL1.6)#35

Open
danmoseley wants to merge 4 commits into anynewline-lower-v2 from backslash-R
Prototype: Implement \R (Unicode line ending escape, TR18 RL1.6)

Summary

This implements \R as a parser-lowered construct that matches any Unicode line ending sequence, per Unicode TR18 RL1.6. It builds on top of the AnyNewLine PR (dotnet#124701) using the same lowering approach.

\R matches:

  • \r\n as a single atomic two-character unit
  • Any single Unicode newline: \n, \r, \v, \f, \x85 (NEL), \u2028 (LS), \u2029 (PS)

Equivalent to: (?>\r\n|[\n\v\f\r\x85\u2028\u2029])
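
To make the equivalence concrete, here is a minimal sketch that runs the expanded pattern on a runtime that does not yet know \R. The class and method names (LineEndingDemo, MatchLengths) are made up for illustration and are not part of this PR; only the pattern itself comes from the description above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LineEndingDemo
{
    // Expanded form of \R, copied from the description above. The string is
    // verbatim, so the escapes are interpreted by the regex parser.
    public const string Expanded = @"(?>\r\n|[\n\v\f\r\x85\u2028\u2029])";

    // Returns the length of every match, in the order the matches are found.
    public static int[] MatchLengths(string input) =>
        Regex.Matches(input, Expanded).Cast<Match>().Select(m => m.Length).ToArray();

    public static void Main()
    {
        // One instance of each line ending: \r\n, \n, \r, NEL, LS, PS, \v, \f.
        string input = "a\r\nb\nc\rd\u0085e\u2028f\u2029g\vh\f";

        // \r\n is consumed as one two-character match; the rest are single characters.
        Console.WriteLine(string.Join(",", MatchLengths(input))); // 2,1,1,1,1,1,1,1
    }
}
```

With this change applied, the same input would be expected to give the same result with the pattern @"\R".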

Design

Parser lowering -- \R is lowered during parsing into an atomic alternation, following the same pattern AnyNewLine uses for $, ^, \Z, and the dot (.). No engine changes are needed.

\R is independent of RegexOptions.AnyNewLine -- it's a standalone escape (like PCRE2, Perl, Java). It works with or without AnyNewLine, and their combination is useful: AnyNewLine controls what ., ^, $ treat as line breaks, while \R explicitly matches line endings.

ECMAScript compatibility -- In ECMAScript mode, \R means literal 'R' (preserved via fallthrough to ScanBasicBackslash). This matches existing .NET behavior where unrecognized escapes in ECMAScript mode are treated as literals.

NonBacktracking -- \R requires atomic semantics so that a match of \r\n cannot backtrack to just \r. The NonBacktracking engine is DFA-based, where alternation is a union rather than an ordered choice, so it cannot enforce atomicity. \R is therefore rejected with NotSupportedException, consistent with how (?>...) is rejected.
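
The atomicity requirement can be seen with the backtracking engine and the expanded pattern. The sketch below is illustrative only: AtomicityDemo and the non-atomic variant are hypothetical and exist purely for contrast. It does not exercise the NonBacktracking engine.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicityDemo
{
    private const string Body = @"\r\n|[\n\v\f\r\x85\u2028\u2029]";

    // Atomic group, as in the lowering described above, followed by a literal \n.
    public static readonly Regex Atomic = new Regex("(?>" + Body + @")\n");

    // Hypothetical non-atomic variant: the group may give back \n and retry with \r alone.
    public static readonly Regex NonAtomic = new Regex("(?:" + Body + @")\n");

    public static void Main()
    {
        // Atomic: \r\n is taken as a unit, so nothing is left for the trailing \n.
        Console.WriteLine(Atomic.IsMatch("\r\n"));     // False

        // Non-atomic: backtracks to \r, then the trailing \n matches.
        Console.WriteLine(NonAtomic.IsMatch("\r\n"));  // True

        // Atomic still matches when a second newline really is present.
        Console.WriteLine(Atomic.IsMatch("\r\n\n"));   // True
    }
}
```

The first result corresponds to the \R\n on \r\n case in the test list: an ordered, atomic choice is what prevents the split.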

RTL -- The \r\n concatenation children are reversed for RightToLeft mode (engine scans right-to-left). Tests confirm same-matches-reversed-order behavior.
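
The same-matches-reversed-order behavior can be sketched with the expanded pattern and RegexOptions.RightToLeft. RightToLeftDemo and Describe are hypothetical names. Here the parser handles the hand-written pattern text, so this shows only the observable behavior, not the node reversal done in the PR.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RightToLeftDemo
{
    public const string Expanded = @"(?>\r\n|[\n\v\f\r\x85\u2028\u2029])";

    // Formats each match as "index:length", in the order the engine reports them.
    public static string Describe(string input, RegexOptions options)
    {
        var parts = new List<string>();
        foreach (Match m in Regex.Matches(input, Expanded, options))
        {
            parts.Add(m.Index + ":" + m.Length);
        }
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        string input = "a\r\nb\nc";

        // Left to right: \r\n at index 1, then \n at index 4.
        Console.WriteLine(Describe(input, RegexOptions.None));        // 1:2 4:1

        // Right to left: the same two matches, reported in reverse order.
        Console.WriteLine(Describe(input, RegexOptions.RightToLeft)); // 4:1 1:2
    }
}
```

Note that \r\n is still reported as a single two-character match when scanning from the right.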

Breaking change analysis

Completely non-breaking:

  • In standard mode, \R previously threw RegexParseException (UnrecognizedEscape) -- this is error-to-feature, not a behavior change
  • In ECMAScript mode, \R = literal 'R' is preserved
  • [\R] (inside character class) still throws -- \R is a multi-character sequence
  • Zero instances of \R found in the real-world patterns dataset (18k+ patterns; all 598 \R-like hits were lowercase \r)
  • A BackslashR_ThrowsOnNetFramework test documents this: on .NET Framework, new Regex(@"\R") throws
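
For reference, the pre-existing behavior that the points above rely on can be checked on a runtime without this change. This is a sketch under that assumption; ExistingBehaviorDemo and ThrowsOnParse are hypothetical helpers. With this PR applied, the first call would be expected to print False instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExistingBehaviorDemo
{
    // True if constructing the pattern fails with a parse error.
    // Parse errors are reported as ArgumentException or a type derived from it.
    public static bool ThrowsOnParse(string pattern, RegexOptions options)
    {
        try
        {
            new Regex(pattern, options);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    public static void Main()
    {
        // Standard mode: \R is an unrecognized escape today.
        Console.WriteLine(ThrowsOnParse(@"\R", RegexOptions.None));             // True (before this change)

        // Inside a character class it throws as well, and continues to do so.
        Console.WriteLine(ThrowsOnParse(@"[\R]", RegexOptions.None));           // True

        // ECMAScript mode: \R is the literal letter 'R', not a line ending.
        Console.WriteLine(Regex.IsMatch("R", @"\R", RegexOptions.ECMAScript));  // True
        Console.WriteLine(Regex.IsMatch("\n", @"\R", RegexOptions.ECMAScript)); // False
    }
}
```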

Performance

Ad hoc benchmark (1000 lines, mixed newline types, 8090 chars, Compiled mode):

Pattern                                                  Matches  Mean (us)
\R                                                       1000     51
\R+                                                      1000     52
(\r\n|[\n\v\f\r\x85\u2028\u2029]) (manual equivalent)    1000     58
(\r\n|[\n\v\f\r\x85\u2028\u2029])+                       1000     61
(\r\n|[\n\r]) (common partial workaround)                600      28
\w+ (unrelated baseline)                                 1000     47

\R is roughly 12-15% faster than the hand-written equivalent in this run (51 vs 58 us, and 52 vs 61 us with +), likely due to avoiding capturing group overhead. Patterns not using \R are completely unaffected.

Test coverage (233 test cases)

  • All 8 Unicode newline types individually
  • Non-matches (letters, space, tab)
  • Atomicity: \R\n on \r\n -- no match (proves atomic behavior)
  • Multiple newlines with various orderings
  • Quantifiers: \R+, \R{2}, \R?, \R+? (lazy)
  • Capturing groups and backreferences: (\R)\1
  • Anchors: ^\R$, ^\R+$
  • All option combinations: Singleline, Multiline, AnyNewLine, RightToLeft, IgnoreCase, Compiled
  • Cross-option combos: Multiline+AnyNewLine, Singleline+RTL, AnyNewLine+RTL, Multiline+RTL
  • ECMAScript mode: \R = literal 'R' (5 cases)
  • Error cases: [\R] throws, NonBacktracking throws, .NET Framework throws

Files changed

  • RegexCharClass.cs: Added AnyNewLineClass constant
  • RegexParser.cs: Added case 'R' in ScanBackslash, added NewLineSequenceNode() method
  • Regex.Match.Tests.cs: 233 test cases across BackslashR, BackslashR_LiteralOnECMAScript, and standalone validation tests
  • Regex.Ctor.Tests.cs: NonBacktracking rejection test alongside other unsupported-construct tests

danmoseley and others added 3 commits March 3, 2026 19:39
Implement \R as a parser-lowered construct that matches any Unicode line
ending sequence: \r\n (as atomic unit), \n, \r, \v, \f, \x85, \u2028, \u2029.

Lowered to (?>\r\n|[\n\v\f\r\x85\u2028\u2029]) following the same pattern
used by AnyNewLine for anchors and dot. NonBacktracking gets plain alternation
(DFA is inherently atomic). ECMAScript mode preserves existing behavior
(\R = literal 'R').

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

NonBacktracking cannot support \R because its DFA alternation is union
(not ordered choice), so it cannot enforce the atomicity \R requires.
Added test verifying NotSupportedException is thrown.

Expanded test coverage: atomicity verification (\R\n), backreferences,
option combos (Multiline, Singleline, AnyNewLine, RightToLeft), lazy
quantifiers, empty string, alternation, character class rejection,
and .NET Framework backwards compat.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

In ECMAScript mode, \R is treated as literal 'R' (unrecognized escapes
become literals). This behavior is the same on .NET Framework, so there
is no reason to skip the test there.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>