Prototype: Implement \R (Unicode line ending escape, TR18 RL1.6) by danmoseley · Pull Request #35 · danmoseley/runtime

danmoseley · 2026-03-04T19:00:04Z

Prototype: Implement \R (Unicode line ending escape, TR18 RL1.6)

Summary

This implements \R as a parser-lowered construct that matches any Unicode line ending sequence, per Unicode TR18 RL1.6. It builds on top of the AnyNewLine PR (dotnet#124701) using the same lowering approach.

\R matches:

\r\n as a single atomic two-character unit
Any single Unicode newline: \n, \r, \v, \f, \x85 (NEL), \u2028 (LS), \u2029 (PS)

Equivalent to: (?>\r\n|[\n\v\f\r\x85\u2028\u2029])

Design

Parser lowering -- \R is lowered during parsing into an atomic alternation, following the same pattern AnyNewLine uses for $, ^, \Z, and the dot (.). No engine changes are needed.

\R is independent of RegexOptions.AnyNewLine -- it's a standalone escape (like PCRE2, Perl, Java). It works with or without AnyNewLine, and their combination is useful: AnyNewLine controls what ., ^, $ treat as line breaks, while \R explicitly matches line endings.

ECMAScript compatibility -- In ECMAScript mode, \R means literal 'R' (preserved via fallthrough to ScanBasicBackslash). This matches existing .NET behavior where unrecognized escapes in ECMAScript mode are treated as literals.

NonBacktracking -- \R requires atomic semantics to prevent backtracking from \r\n to just \r. The NonBacktracking engine is DFA-based, so alternation is a union rather than an ordered choice, and it cannot enforce atomicity. \R is therefore rejected with NotSupportedException, consistent with how (?>...) is rejected.

RTL -- The children of the \r\n concatenation are reversed in RightToLeft mode, because the engine scans right-to-left. Tests confirm that RightToLeft finds the same matches, in reverse order.

Breaking change analysis

Completely non-breaking:

In standard mode, \R previously threw RegexParseException (UnrecognizedEscape) -- this is error-to-feature, not a behavior change
In ECMAScript mode, \R = literal 'R' is preserved
[\R] (inside character class) still throws -- \R is a multi-character sequence
Zero instances of \R found in the real-world patterns dataset (18k+ patterns; all 598 \R-like hits were lowercase \r)
A BackslashR_ThrowsOnNetFramework test documents this: on .NET Framework, new Regex(@"\R") throws

Performance

Ad hoc benchmark (1000 lines, mixed newline types, 8090 chars, Compiled mode):

| Pattern | Matches | Mean (us) |
|---|---|---|
| `\R` | 1000 | 51 |
| `\R+` | 1000 | 52 |
| `(\r\n\|[\n\v\f\r\x85\u2028\u2029])` (manual equivalent) | 1000 | 58 |
| `(\r\n\|[\n\v\f\r\x85\u2028\u2029])+` | 1000 | 61 |
| `(\r\n\|[\n\r])` (common partial workaround) | 600 | 28 |
| `\w+` (unrelated baseline) | 1000 | 47 |

\R is roughly 12-15% faster than the hand-written equivalent in this run (51 vs 58 us, and 52 vs 61 us with +), likely because it avoids capturing-group overhead. Patterns that do not use \R are unaffected.

Test coverage (233 test cases)

All 8 Unicode newline types individually
Non-matches (letters, space, tab)
Atomicity: \R\n on \r\n -- no match (proves atomic behavior)
Multiple newlines with various orderings
Quantifiers: \R+, \R{2}, \R?, \R+? (lazy)
Capturing groups and backreferences: (\R)\1
Anchors: ^\R$, ^\R+$
All option combinations: Singleline, Multiline, AnyNewLine, RightToLeft, IgnoreCase, Compiled
Cross-option combos: Multiline+AnyNewLine, Singleline+RTL, AnyNewLine+RTL, Multiline+RTL
ECMAScript mode: \R = literal 'R' (5 cases)
Error cases: [\R] throws, NonBacktracking throws, .NET Framework throws

Files changed

RegexCharClass.cs: Added AnyNewLineClass constant
RegexParser.cs: Added case 'R' in ScanBackslash, added NewLineSequenceNode() method
Regex.Match.Tests.cs: 233 test cases across BackslashR, BackslashR_LiteralOnECMAScript, and standalone validation tests
Regex.Ctor.Tests.cs: NonBacktracking rejection test alongside other unsupported-construct tests

Implement \R as a parser-lowered construct that matches any Unicode line ending sequence: \r\n (as atomic unit), \n, \r, \v, \f, \x85, \u2028, \u2029. Lowered to (?>\r\n|[\n\v\f\r\x85\u2028\u2029]) following the same pattern used by AnyNewLine for anchors and dot. NonBacktracking gets plain alternation (DFA is inherently atomic). ECMAScript mode preserves existing behavior (\R = literal 'R'). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

NonBacktracking cannot support \R because its DFA alternation is union (not ordered choice), so it cannot enforce the atomicity \R requires. Added test verifying NotSupportedException is thrown. Expanded test coverage: atomicity verification (\R\n), backreferences, option combos (Multiline, Singleline, AnyNewLine, RightToLeft), lazy quantifiers, empty string, alternation, character class rejection, and .NET Framework backwards compat. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

In ECMAScript mode, \R is treated as literal 'R' (unrecognized escapes become literals). This behavior is the same on .NET Framework, so there is no reason to skip the test there. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danmoseley and others added 3 commits March 3, 2026 19:39

Add anchor and additional option combo tests for \R

a8fbfd9

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danmoseley mentioned this pull request Mar 5, 2026

Add RegexOptions.AnyNewLine via parser lowering dotnet/runtime#124701

Open

danmoseley wants to merge 4 commits into anynewline-lower-v2 from backslash-R
