Prototype: Implement \R (Unicode line ending escape, TR18 RL1.6) #35
Open
danmoseley wants to merge 4 commits into anynewline-lower-v2 from
Implement \R as a parser-lowered construct that matches any Unicode line ending sequence: \r\n (as atomic unit), \n, \r, \v, \f, \x85, \u2028, \u2029. Lowered to (?>\r\n|[\n\v\f\r\x85\u2028\u2029]) following the same pattern used by AnyNewLine for anchors and dot. NonBacktracking gets plain alternation (DFA is inherently atomic). ECMAScript mode preserves existing behavior (\R = literal 'R').
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

NonBacktracking cannot support \R because its DFA alternation is union (not ordered choice), so it cannot enforce the atomicity \R requires. Added test verifying NotSupportedException is thrown. Expanded test coverage: atomicity verification (\R\n), backreferences, option combos (Multiline, Singleline, AnyNewLine, RightToLeft), lazy quantifiers, empty string, alternation, character class rejection, and .NET Framework backwards compat.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

In ECMAScript mode, \R is treated as literal 'R' (unrecognized escapes become literals). This behavior is the same on .NET Framework, so there is no reason to skip the test there.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Prototype: Implement `\R` (Unicode line ending escape, TR18 RL1.6)

Summary

This implements `\R` as a parser-lowered construct that matches any Unicode line ending sequence, per Unicode TR18 RL1.6. It builds on top of the AnyNewLine PR (dotnet#124701) using the same lowering approach.

`\R` matches:

- `\r\n` as a single atomic two-character unit
- `\n`, `\r`, `\v`, `\f`, `\x85` (NEL), `\u2028` (LS), `\u2029` (PS)

Equivalent to:

`(?>\r\n|[\n\v\f\r\x85\u2028\u2029])`

Design
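As a concrete illustration of the lowered form, the sketch below runs the hand-written equivalent on the existing backtracking engine, so it does not need `\R` support. The non-atomic variant, the sample strings, and the variable names are assumptions added here for contrast and are not taken from the PR.

```csharp
using System;
using System.Text.RegularExpressions;

// Atomic form quoted in the description (the shape \R is lowered to).
const string Atomic = @"(?>\r\n|[\n\v\f\r\x85\u2028\u2029])";
// Non-atomic variant, for contrast only; not something the PR emits.
const string NonAtomic = @"(?:\r\n|[\n\v\f\r\x85\u2028\u2029])";

// Count line endings in a mixed input; \r\n is consumed as one unit.
int lineEndings = Regex.Matches("a\r\nb\nc\rd\u2028e", Atomic).Count;

// Require one more \n after the line ending, against the input "\r\n".
// The atomic group will not give back the \n it consumed, so this fails;
// the non-atomic group backtracks to a lone \r and lets the trailing \n match.
bool atomicMatches = Regex.IsMatch("\r\n", Atomic + @"\n");
bool nonAtomicMatches = Regex.IsMatch("\r\n", NonAtomic + @"\n");

Console.WriteLine($"{lineEndings} {atomicMatches} {nonAtomicMatches}");
```

On a current .NET runtime this should print `4 False True`: four line endings are found, and only the non-atomic variant matches `\r\n` when an extra `\n` is required.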
Parser lowering -- `\R` is lowered during parsing into an atomic alternation, following the same pattern used by AnyNewLine for `$`, `^`, `\Z`, and `.`. No engine changes are needed.

`\R` is independent of `RegexOptions.AnyNewLine` -- it's a standalone escape (like PCRE2, Perl, Java). It works with or without AnyNewLine, and their combination is useful: AnyNewLine controls what `.`, `^`, `$` treat as line breaks, while `\R` explicitly matches line endings.

ECMAScript compatibility -- In ECMAScript mode, `\R` means literal 'R' (preserved via fallthrough to `ScanBasicBackslash`). This matches existing .NET behavior where unrecognized escapes in ECMAScript mode are treated as literals.

NonBacktracking -- `\R` requires atomic semantics to prevent backtracking from `\r\n` to just `\r`. The NonBacktracking engine uses a DFA, where alternation is union (not ordered choice), so it genuinely cannot enforce atomicity. `\R` is rejected with `NotSupportedException`, consistent with how `(?>...)` is rejected.

RTL -- The `\r\n` concatenation children are reversed for RightToLeft mode (the engine scans right-to-left). Tests confirm that the same matches are found, in reverse order.

Breaking change analysis
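A minimal sketch of the compatibility behavior this analysis relies on. The ECMAScript checks should hold both before and after the change. `SupportsBackslashR` is a hypothetical probe, not part of the PR, and its result depends on whether the runtime includes this prototype.

```csharp
using System;
using System.Text.RegularExpressions;

// In ECMAScript mode an unrecognized escape such as \R is a literal 'R'.
bool matchesLetterR = Regex.IsMatch("R", @"^\R$", RegexOptions.ECMAScript);
bool matchesNewline = Regex.IsMatch("\n", @"\R", RegexOptions.ECMAScript);

// Hypothetical probe: does this runtime parse \R outside ECMAScript mode?
// Without the change the constructor throws; RegexParseException derives
// from ArgumentException, which is also what .NET Framework throws.
bool SupportsBackslashR()
{
    try
    {
        _ = new Regex(@"\R");
        return true;
    }
    catch (ArgumentException)
    {
        return false;
    }
}

Console.WriteLine($"ECMAScript \\R matches 'R': {matchesLetterR}");
Console.WriteLine($"ECMAScript \\R matches newline: {matchesNewline}");
Console.WriteLine($"Runtime accepts \\R: {SupportsBackslashR()}");
```

Code that already caught the parse exception for `\R` keeps compiling either way. Only the last line of output changes once the escape is recognized.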
Completely non-breaking:

- `\R` previously threw `RegexParseException` (UnrecognizedEscape) -- this is error-to-feature, not a behavior change
- In ECMAScript mode, `\R` = literal 'R' is preserved
- `[\R]` (inside character class) still throws -- `\R` is a multi-character sequence
- No `\R` found in the real-world patterns dataset (18k+ patterns; all 598 `\R`-like hits were lowercase `\r`)
- `BackslashR_ThrowsOnNetFramework` test documents this: on .NET Framework, `new Regex(@"\R")` throws

Performance
Ad hoc benchmark (1000 lines, mixed newline types, 8090 chars, Compiled mode):

- `\R`
- `\R+`
- `(\r\n|[\n\v\f\r\x85\u2028\u2029])` (manual equivalent)
- `(\r\n|[\n\v\f\r\x85\u2028\u2029])+`
- `(\r\n|[\n\r])` (common partial workaround)
- `\w+` (unrelated baseline)

`\R` is ~10% faster than the hand-written equivalent, likely due to avoiding capturing group overhead. Patterns not using `\R` are completely unaffected.

Test coverage (233 test cases)
- `\R\n` on `\r\n` -- no match (proves atomic behavior)
- `\R+`, `\R{2}`, `\R?`, `\R+?` (lazy)
- `(\R)\1`
- `^\R$`, `^\R+$`
- `\R` = literal 'R' (5 cases)
- `[\R]` throws, NonBacktracking throws, .NET Framework throws

Files changed
- `RegexCharClass.cs`: Added `AnyNewLineClass` constant
- `RegexParser.cs`: Added `case 'R'` in `ScanBackslash`, added `NewLineSequenceNode()` method
- `Regex.Match.Tests.cs`: 233 test cases across `BackslashR`, `BackslashR_LiteralOnECMAScript`, and standalone validation tests
- `Regex.Ctor.Tests.cs`: NonBacktracking rejection test alongside other unsupported-construct tests
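
The NonBacktracking rejection test sits alongside the existing unsupported-construct tests. The behavior it mirrors can be seen today with an explicit atomic group. This is a sketch that assumes .NET 7 or later (where `RegexOptions.NonBacktracking` exists); the pattern and variable name are illustrative and are not the PR's test code.

```csharp
using System;
using System.Text.RegularExpressions;

// Explicit atomic groups are not supported by the NonBacktracking engine,
// so construction is expected to fail with NotSupportedException.
bool rejected;
try
{
    _ = new Regex(@"(?>\r\n|[\n\r])", RegexOptions.NonBacktracking);
    rejected = false;
}
catch (NotSupportedException)
{
    rejected = true;
}

Console.WriteLine($"Atomic group rejected by NonBacktracking: {rejected}");
```

Per the description, `\R` under `RegexOptions.NonBacktracking` is meant to fail the same way, because it lowers to an atomic group.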