Skip to content

Remove redundant Atomic wrapper and fix shared-prefix extraction in regex reduction#126114

Merged
danmoseley merged 4 commits intodotnet:mainfrom
danmoseley:dev/danmose/regex-reduction-cleanup
Mar 26, 2026
Merged

Remove redundant Atomic wrapper and fix shared-prefix extraction in regex reduction#126114
danmoseley merged 4 commits intodotnet:mainfrom
danmoseley:dev/danmose/regex-reduction-cleanup

Conversation

@danmoseley
Copy link
Member

Two small regex tree reduction improvements found during analysis of 17,434 real-world patterns from Regex_RealWorldPatterns.json (#126104):

1. Remove redundant Atomic wrapper from non-backtrackable nodes

ReduceAtomic() already unwraps Atomic from Empty/Nothing children, but not from One/Notone/Set/Multi children that also inherently cannot backtrack. These arise when the user writes (?>...) wrapping content that reduces to a single fixed-length match, e.g. (?>\u591C[\u665A\u91CC\u95F4]). The Atomic wrapper adds unnecessary save/restore scaffolding. Affects 383 patterns (2.2%).

No measurable perf impact (JIT already eliminates the dead code), but produces a cleaner tree.

2. Fix ExtractCommonPrefixText to continue past non-text branches

The method bails out of the entire alternation when FindBranchOneOrMultiStart() returns null for any branch. This prevents factoring later consecutive text branches that share a prefix. For example in [^#;]|\\#|\\;, the Set branch causes early return, so branches \\# and \\; (which share prefix \) are never factored.

Changed return alternation to continue so the outer loop advances past non-text branches. The inner loop and singleton-run check already handle all edge cases correctly. Affects 1,090 patterns (6.3%), though most are 2-branch cases with negligible perf benefit. ~50 patterns with 4+ shared branches see modest improvement from enabling switch dispatch.

Note

This PR was generated with AI assistance (GitHub Copilot).

danmoseley and others added 2 commits March 25, 2026 11:18
ReduceAtomic() already unwraps Atomic from Empty/Nothing children, but
not from One/Notone/Set/Multi children that also inherently cannot
backtrack. The Atomic wrapper on these nodes adds useless save/restore
scaffolding in the interpreter and dead code in source-generated output.

Add cases to unwrap Atomic from these node kinds, following the same
pattern as the existing Empty/Nothing handling.

Affects 383 of 17,434 real-world patterns (2.2%) analyzed from
Regex_RealWorldPatterns.json.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ExtractCommonPrefixText() bails out of the entire alternation when
FindBranchOneOrMultiStart() returns null for any branch, even though
later branches might share a common text prefix that could be factored.

For example, in [^x]|ab|ac the first branch is a Set node (not text),
so the method returned early without factoring branches 2 and 3 which
share prefix 'a'. Changing 'return alternation' to 'continue' lets the
outer loop advance past non-text branches and factor any subsequent
consecutive text branches that share a prefix.

The inner loop already handles non-matching branches via break, and
singleton runs are correctly skipped by the existing check at
endingIndex - startingIndex <= 1.

Affects 1,090 of 17,434 real-world patterns (6.3%) analyzed from
Regex_RealWorldPatterns.json.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danmoseley
Copy link
Member Author

@EgorBot -linux_amd

@danmoseley
Copy link
Member Author

@MihuBot regexdiff

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves regex parse-time tree reduction in System.Text.RegularExpressions by removing redundant atomic nodes and by enabling shared-prefix factoring in alternations even when non-text branches appear earlier.

Changes:

  • Unwrap redundant Atomic nodes when the child node can’t backtrack (One, Notone, Set, Multi).
  • Fix ExtractCommonPrefixText to skip non-text branches (instead of bailing out) so later consecutive text branches can still be factored.
  • Add unit + functional tests covering both reductions.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs Removes redundant atomic wrappers for inherently non-backtrackable leaf nodes; fixes alternation shared-prefix extraction to skip non-text branches.
src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexReductionTests.cs Adds reduction-shape tests for redundant atomic removal and for shared-prefix factoring past non-text branches.
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Match.Tests.cs Adds behavioral match tests to ensure these reductions preserve matching semantics.

@danmoseley
Copy link
Member Author

Oops I meant mihubot

@danmoseley
Copy link
Member Author

@MihuBot benchmark Regex

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

@danmoseley danmoseley requested a review from Copilot March 25, 2026 19:04
@MihuBot
Copy link

MihuBot commented Mar 25, 2026

460 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs
"^\\s*\\[(([^#;]|\\\\#|\\\\;)+)\\]\\s*([#;].*)?$" (3059 uses)
[GeneratedRegex("^\\s*\\[(([^#;]|\\\\#|\\\\;)+)\\]\\s*([#;].*)?$")]
  /// ○ 1st capture group.<br/>
  ///     ○ Loop greedily at least once.<br/>
  ///         ○ 2nd capture group.<br/>
-   ///             ○ Match with 3 alternative expressions.<br/>
+   ///             ○ Match with 2 alternative expressions.<br/>
  ///                 ○ Match a character in the set [^#;].<br/>
-   ///                 ○ Match the string "\\#".<br/>
-   ///                 ○ Match the string "\\;".<br/>
+   ///                 ○ Match a sequence of expressions.<br/>
+   ///                     ○ Match '\\'.<br/>
+   ///                     ○ Match a character in the set [#;].<br/>
  /// ○ Match ']'.<br/>
  /// ○ Match a whitespace character greedily any number of times.<br/>
  /// ○ Optional (greedy).<br/>
                          //{
                              int capture_starting_pos1 = pos;
                              
-                               // Match with 3 alternative expressions.
+                               // Match with 2 alternative expressions.
                              //{
                                  int alternation_starting_pos = pos;
                                  int alternation_starting_capturepos = base.Crawlpos();
                                  
                                  // Branch 1
                                  //{
-                                       // Match the string "\\#".
-                                       if (!slice.StartsWith("\\#"))
-                                       {
-                                           goto AlternationBranch1;
-                                       }
-                                       
-                                       Utilities.StackPush(ref base.runstack!, ref stackpos, 1, alternation_starting_pos, alternation_starting_capturepos);
-                                       pos += 2;
-                                       slice = inputSpan.Slice(pos);
-                                       goto AlternationMatch;
-                                       
-                                       AlternationBranch1:
-                                       pos = alternation_starting_pos;
-                                       slice = inputSpan.Slice(pos);
-                                       UncaptureUntil(alternation_starting_capturepos);
-                                   //}
-                                   
-                                   // Branch 2
-                                   //{
-                                       // Match the string "\\;".
-                                       if (!slice.StartsWith("\\;"))
+                                       if ((uint)slice.Length < 2 ||
+                                           slice[0] != '\\' || // Match '\\'.
+                                           (((ch = slice[1]) != '#') & (ch != ';'))) // Match a character in the set [#;].
                                      {
                                          goto LoopIterationNoMatch;
                                      }
                                      
-                                       Utilities.StackPush(ref base.runstack!, ref stackpos, 2, alternation_starting_pos, alternation_starting_capturepos);
+                                       Utilities.StackPush(ref base.runstack!, ref stackpos, 1, alternation_starting_pos, alternation_starting_capturepos);
                                      pos += 2;
                                      slice = inputSpan.Slice(pos);
                                      goto AlternationMatch;
                                      case 0:
                                          goto AlternationBranch;
                                      case 1:
-                                           goto AlternationBranch1;
-                                       case 2:
                                          goto LoopIterationNoMatch;
                                  }
"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25 ..." (1964 uses)
[GeneratedRegex("^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$|^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\\-]*[a-zA-Z0-9])\\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\\-]*[A-Za-z0-9])$", RegexOptions.IgnoreCase)]
  ///         ○ Loop exactly 3 times.<br/>
  ///             ○ 1st capture group.<br/>
  ///                 ○ 2nd capture group.<br/>
-   ///                     ○ Match with 5 alternative expressions.<br/>
+   ///                     ○ Match with 4 alternative expressions.<br/>
  ///                         ○ Match a character in the set [0-9].<br/>
  ///                         ○ Match a sequence of expressions.<br/>
  ///                             ○ Match a character in the set [1-9].<br/>
  ///                             ○ Match a character in the set [0-9] exactly 2 times.<br/>
  ///                         ○ Match a sequence of expressions.<br/>
  ///                             ○ Match '2'.<br/>
-   ///                             ○ Match a character in the set [0-4].<br/>
-   ///                             ○ Match a character in the set [0-9].<br/>
-   ///                         ○ Match a sequence of expressions.<br/>
-   ///                             ○ Match the string "25".<br/>
-   ///                             ○ Match a character in the set [0-5].<br/>
+   ///                             ○ Match with 2 alternative expressions.<br/>
+   ///                                 ○ Match a sequence of expressions.<br/>
+   ///                                     ○ Match a character in the set [0-4].<br/>
+   ///                                     ○ Match a character in the set [0-9].<br/>
+   ///                                 ○ Match a sequence of expressions.<br/>
+   ///                                     ○ Match '5'.<br/>
+   ///                                     ○ Match a character in the set [0-5].<br/>
  ///                 ○ Match '.'.<br/>
  ///         ○ 3rd capture group.<br/>
-   ///             ○ Match with 5 alternative expressions.<br/>
+   ///             ○ Match with 4 alternative expressions.<br/>
  ///                 ○ Match a character in the set [0-9].<br/>
  ///                 ○ Match a sequence of expressions.<br/>
  ///                     ○ Match a character in the set [1-9].<br/>
  ///                     ○ Match a character in the set [0-9] exactly 2 times.<br/>
  ///                 ○ Match a sequence of expressions.<br/>
  ///                     ○ Match '2'.<br/>
-   ///                     ○ Match a character in the set [0-4].<br/>
-   ///                     ○ Match a character in the set [0-9].<br/>
-   ///                 ○ Match a sequence of expressions.<br/>
-   ///                     ○ Match the string "25".<br/>
-   ///                     ○ Match a character in the set [0-5].<br/>
+   ///                     ○ Match with 2 alternative expressions.<br/>
+   ///                         ○ Match a sequence of expressions.<br/>
+   ///                             ○ Match a character in the set [0-4].<br/>
+   ///                             ○ Match a character in the set [0-9].<br/>
+   ///                         ○ Match a sequence of expressions.<br/>
+   ///                             ○ Match '5'.<br/>
+   ///                             ○ Match a character in the set [0-5].<br/>
  ///         ○ Match if at the end of the string or if before an ending newline.<br/>
  ///     ○ Match a sequence of expressions.<br/>
  ///         ○ Loop greedily any number of times.<br/>
                                      //{
                                          int capture_starting_pos1 = pos;
                                          
-                                           // Match with 5 alternative expressions.
+                                           // Match with 4 alternative expressions.
                                          //{
                                              int alternation_starting_pos1 = pos;
                                              int alternation_starting_capturepos1 = base.Crawlpos();
                                              
                                              // Branch 3
                                              //{
-                                                   if ((uint)slice.Length < 3 ||
-                                                       slice[0] != '2' || // Match '2'.
-                                                       !char.IsBetween(slice[1], '0', '4') || // Match a character in the set [0-4].
-                                                       !char.IsAsciiDigit(slice[2])) // Match a character in the set [0-9].
-                                                   {
-                                                       goto AlternationBranch4;
-                                                   }
-                                                   
-                                                   Utilities.StackPush(ref base.runstack!, ref stackpos, 3, alternation_starting_pos1, alternation_starting_capturepos1);
-                                                   pos += 3;
-                                                   slice = inputSpan.Slice(pos);
-                                                   goto AlternationMatch1;
-                                                   
-                                                   AlternationBranch4:
-                                                   pos = alternation_starting_pos1;
-                                                   slice = inputSpan.Slice(pos);
-                                                   UncaptureUntil(alternation_starting_capturepos1);
-                                               //}
-                                               
-                                               // Branch 4
-                                               //{
-                                                   if ((uint)slice.Length < 3 ||
-                                                       !slice.StartsWith("25", StringComparison.OrdinalIgnoreCase) || // Match the string "25" (ordinal case-insensitive)
-                                                       !char.IsBetween(slice[2], '0', '5')) // Match a character in the set [0-5].
+                                                   // Match '2'.
+                                                   if (slice.IsEmpty || slice[0] != '2')
                                                  {
                                                      goto LoopIterationNoMatch;
                                                  }
                                                  
-                                                   Utilities.StackPush(ref base.runstack!, ref stackpos, 4, alternation_starting_pos1, alternation_starting_capturepos1);
-                                                   pos += 3;
-                                                   slice = inputSpan.Slice(pos);
+                                                   // Match with 2 alternative expressions.
+                                                   //{
+                                                       if ((uint)slice.Length < 2)
+                                                       {
+                                                           goto LoopIterationNoMatch;
+                                                       }
+                                                       
+                                                       switch (slice[1])
+                                                       {
+                                                           case '0' or '1' or '2' or '3' or '4':
+                                                               
+                                                               // Match a character in the set [0-9].
+                                                               if ((uint)slice.Length < 3 || !char.IsAsciiDigit(slice[2]))
+                                                               {
+                                                                   goto LoopIterationNoMatch;
+                                                               }
+                                                               
+                                                               pos += 3;
+                                                               slice = inputSpan.Slice(pos);
+                                                               break;
+                                                               
+                                                           case '5':
+                                                               
+                                                               // Match a character in the set [0-5].
+                                                               if ((uint)slice.Length < 3 || !char.IsBetween(slice[2], '0', '5'))
+                                                               {
+                                                                   goto LoopIterationNoMatch;
+                                                               }
+                                                               
+                                                               pos += 3;
+                                                               slice = inputSpan.Slice(pos);
+                                                               break;
+                                                               
+                                                           default:
+                                                               goto LoopIterationNoMatch;
+                                                       }
+                                                   //}
+                                                   
+                                                   Utilities.StackPush(ref base.runstack!, ref stackpos, 3, alternation_starting_pos1, alternation_starting_capturepos1);
                                                  goto AlternationMatch1;
                                              //}
                                              
                                                  case 2:
                                                      goto AlternationBranch3;
                                                  case 3:
-                                                       goto AlternationBranch4;
-                                                   case 4:
                                                      goto LoopIterationNoMatch;
                                              }
                                              
                              //{
                                  capture_starting_pos2 = pos;
                                  
-                                   // Match with 5 alternative expressions.
+                                   // Match with 4 alternative expressions.
                                  //{
                                      alternation_starting_pos2 = pos;
                                      alternation_starting_capturepos2 = base.Crawlpos();
                                          // Match a character in the set [0-9].
                                          if (slice.IsEmpty || !char.IsAsciiDigit(slice[0]))
                                          {
-                                               goto AlternationBranch5;
+                                               goto AlternationBranch4;
                                          }
                                          
                                          alternation_branch = 0;
                                          slice = inputSpan.Slice(pos);
                                          goto AlternationMatch2;
                                          
-                                           AlternationBranch5:
+                                           AlternationBranch4:
                                          pos = alternation_starting_pos2;
                                          slice = inputSpan.Slice(pos);
                                          UncaptureUntil(alternation_starting_capturepos2);
                                              !char.IsBetween(slice[0], '1', '9') || // Match a character in the set [1-9].
                                              !char.IsAsciiDigit(slice[1])) // Match a character in the set [0-9].
                                          {
-                                               goto AlternationBranch6;
+                                               goto AlternationBranch5;
                                          }
                                          
                                          alternation_branch = 1;
                                          slice = inputSpan.Slice(pos);
                                          goto AlternationMatch2;
                                          
-                                           AlternationBranch6:
+                                           AlternationBranch5:
                                          pos = alternation_starting_pos2;
                                          slice = inputSpan.Slice(pos);
                                          UncaptureUntil(alternation_starting_capturepos2);
                                              !char.IsAsciiDigit(slice[1]) || // Match a character in the set [0-9] exactly 2 times.
                                              !char.IsAsciiDigit(slice[2]))
                                          {
-                                               goto AlternationBranch7;
+                                               goto AlternationBranch6;
                                          }
                                          
                                          alternation_branch = 2;
                                          slice = inputSpan.Slice(pos);
                                          goto AlternationMatch2;
                                          
-                                           AlternationBranch7:
+                                           AlternationBranch6:
                                          pos = alternation_starting_pos2;
                                          slice = inputSpan.Slice(pos);
                                          UncaptureUntil(alternation_starting_capturepos2);
                                      
                                      // Branch 3
                                      //{
-                                           if ((uint)slice.Length < 3 ||
-                                               slice[0] != '2' || // Match '2'.
-                                               !char.IsBetween(slice[1], '0', '4') || // Match a character in the set [0-4].
-                                               !char.IsAsciiDigit(slice[2])) // Match a character in the set [0-9].
-                                           {
-                                               goto AlternationBranch8;
-                                           }
-                                           
-                                           alternation_branch = 3;
-                                           pos += 3;
-                                           slice = inputSpan.Slice(pos);
-                                           goto AlternationMatch2;
-                                           
-                                           AlternationBranch8:
-                                           pos = alternation_starting_pos2;
-                                           slice = inputSpan.Slice(pos);
-                                           UncaptureUntil(alternation_starting_capturepos2);
-                                       //}
-                                       
-                                       // Branch 4
-                                       //{
-                                           if ((uint)slice.Length < 3 ||
-                                               !slice.StartsWith("25", StringComparison.OrdinalIgnoreCase) || // Match the string "25" (ordinal case-insensitive)
-                                               !char.IsBetween(slice[2], '0', '5')) // Match a character in the set [0-5].
+                                           // Match '2'.
+                                           if (slice.IsEmpty || slice[0] != '2')
                                          {
                                              goto LoopBacktrack;
                                          }
                                          
-                                           alternation_branch = 4;
-                                           pos += 3;
-                                           slice = inputSpan.Slice(pos);
+                                           // Match with 2 alternative expressions.
+                                           //{
+                                               if ((uint)slice.Length < 2)
+                                               {
+                                                   goto LoopBacktrack;
+                                               }
+                                               
+                                               switch (slice[1])
+                                               {
+                                                   case '0' or '1' or '2' or '3' or '4':
+                                                       
+                                                       // Match a character in the set [0-9].
+                                                       if ((uint)slice.Length < 3 || !char.IsAsciiDigit(slice[2]))
+                                                       {
+                                                           goto LoopBacktrack;
+                                                       }
+                                                       
+                                                       pos += 3;
+                                                       slice = inputSpan.Slice(pos);
+                                                       break;
+                                                       
+                                                   case '5':
+                                                       
+                                                       // Match a character in the set [0-5].
+                                                       if ((uint)slice.Length < 3 || !char.IsBetween(slice[2], '0', '5'))
+                                                       {
+                                                           goto LoopBacktrack;
+                                                       }
+                                                       
+                                                       pos += 3;
+                                                       slice = inputSpan.Slice(pos);
+                                                       break;
+                                                       
+                                                   default:
+                                                       goto LoopBacktrack;
+                                               }
+                                           //}
+                                           
+                                           alternation_branch = 3;
                                          goto AlternationMatch2;
                                      //}
                                      
                                      switch (alternation_branch)
                                      {
                                          case 0:
-                                               goto AlternationBranch5;
+                                               goto AlternationBranch4;
                                          case 1:
-                                               goto AlternationBranch6;
+                                               goto AlternationBranch5;
                                          case 2:
-                                               goto AlternationBranch7;
+                                               goto AlternationBranch6;
                                          case 3:
-                                               goto AlternationBranch8;
-                                           case 4:
                                              goto LoopBacktrack;
                                      }
"^(?'protocol'\\w+\\:\\/\\/)?(?>(?'user'.*)@) ..." (283 uses)
[GeneratedRegex("^(?'protocol'\\w+\\:\\/\\/)?(?>(?'user'.*)@)?(?'endpoint'[^\\/:]+)(?>\\:(?'port'\\d+))?[\\/:](?'identifier'.*?)\\/?(?>\\.git)?$")]
  ///     ○ Match a character other than '\n' lazily any number of times.<br/>
  /// ○ Match '/' greedily, optionally.<br/>
  /// ○ Optional (greedy).<br/>
-   ///     ○ Atomic group.<br/>
-   ///         ○ Match the string ".git".<br/>
+   ///     ○ Match the string ".git".<br/>
  /// ○ Match if at the end of the string or if before an ending newline.<br/>
  /// </code>
  /// </remarks>
                  int loop_iteration2 = 0;
                  int loop_iteration3 = 0;
                  int stackpos = 0;
+                   int startingStackpos = 0;
                  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
                  
                  // Match if at the beginning of the string.
                  //}
                  
                  // Optional (greedy).
-                   //{
+                   {
+                       startingStackpos = stackpos;
                      loop_iteration3 = 0;
                      
                      LoopBody3:
                      pos = base.runstack![--stackpos];
                      UncaptureUntil(base.runstack![--stackpos]);
                      slice = inputSpan.Slice(pos);
-                       LoopEnd3:;
-                   //}
+                       LoopEnd3:
+                       stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
+                   }
                  
                  // Match if at the end of the string or if before an ending newline.
                  if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
                  {
-                       goto LoopIterationNoMatch3;
+                       goto CharLoopBacktrack2;
                  }
                  
                  // The input matched.
"(?<timeOfDay>凌晨|清晨|早上|早|上午|中午|下午|午后|晚上|夜里|夜晚 ..." (186 uses)
[GeneratedRegex("(?<timeOfDay>凌晨|清晨|早上|早|上午|中午|下午|午后|晚上|夜里|夜晚|半夜|夜间|深夜|傍晚|晚)", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]
  ///             ○ Match '上' atomically, optionally.<br/>
  ///         ○ Match a sequence of expressions.<br/>
  ///             ○ Match '夜'.<br/>
-   ///             ○ Atomic group.<br/>
-   ///                 ○ Match a character in the set [\u665A\u91CC\u95F4].<br/>
+   ///             ○ Match a character in the set [\u665A\u91CC\u95F4].<br/>
  ///         ○ Match the string "半夜".<br/>
  ///         ○ Match the string "深夜".<br/>
  ///         ○ Match the string "傍晚".<br/>
"^(?'protocol'\\w+)?(\\:\\/\\/)?(?>(?'user'.* ..." (125 uses)
[GeneratedRegex("^(?'protocol'\\w+)?(\\:\\/\\/)?(?>(?'user'.*)@)?(?'endpoint'[^\\/:]+)(?>\\:(?'port'\\d+))?[\\/:](?'identifier'.*?)\\/?(?>\\.git)?$")]
  ///     ○ Match a character other than '\n' lazily any number of times.<br/>
  /// ○ Match '/' greedily, optionally.<br/>
  /// ○ Optional (greedy).<br/>
-   ///     ○ Atomic group.<br/>
-   ///         ○ Match the string ".git".<br/>
+   ///     ○ Match the string ".git".<br/>
  /// ○ Match if at the end of the string or if before an ending newline.<br/>
  /// </code>
  /// </remarks>
                  int loop_iteration3 = 0;
                  int loop_iteration4 = 0;
                  int stackpos = 0;
+                   int startingStackpos = 0;
                  ReadOnlySpan<char> slice = inputSpan.Slice(pos);
                  
                  // Match if at the beginning of the string.
                  //}
                  
                  // Optional (greedy).
-                   //{
+                   {
+                       startingStackpos = stackpos;
                      loop_iteration4 = 0;
                      
                      LoopBody4:
                      pos = base.runstack![--stackpos];
                      UncaptureUntil(base.runstack![--stackpos]);
                      slice = inputSpan.Slice(pos);
-                       LoopEnd4:;
-                   //}
+                       LoopEnd4:
+                       stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
+                   }
                  
                  // Match if at the end of the string or if before an ending newline.
                  if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
                  {
-                       goto LoopIterationNoMatch4;
+                       goto CharLoopBacktrack3;
                  }
                  
                  // The input matched.

For more diff examples, see https://gist.github.com/MihuBot/726ab2347ad8984e604bb47e9b493b8b

Sample source code for further analysis
const string JsonPath = "RegexResults-1833.json";
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FKDmJWKA");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

@danmoseley
Copy link
Member Author

Diffs mostly improvements.

The "remove unnecessary atomic" did increase some diffs slightly. See "^(?'protocol'\\w+)?(\\:\\/\\/)?(?>(?'user'.* ..." (125 uses) above. I think it is correct either way, it is causing some redundant bookkeeping code to be omitted. Suggest potentially follow up later to remove that.

@danmoseley
Copy link
Member Author

Looking into it now, maybe we can push to this PR as well

@MihuBot
Copy link

MihuBot commented Mar 25, 2026

@danmoseley
Copy link
Member Author

Note

This analysis was AI/Copilot-generated based on investigation of the regexdiff results.

regexdiff analysis

The regexdiff shows 460 of 18,857 patterns with source generator changes. These fall into two categories, both beneficial:

1. Shared-prefix factoring (from the ExtractCommonPrefixText fix)

Alternations that previously bailed out on a non-text branch now correctly skip past it and continue extracting common prefixes from the remaining branches. This produces the expected prefix-factored code.

2. Bonus auto-atomicization of loops (from the ReduceAtomic change)

This is the more interesting one. Stripping Atomic(Multi)Multi unblocks a downstream optimization in FindAndMakeLoopsAtomic.

That method (line ~2154) has an allowlist of child node types eligible for auto-atomicization:

if (loopChild.Kind is
        RegexNodeKind.Boundary or RegexNodeKind.ECMABoundary or
        RegexNodeKind.Multi or
        RegexNodeKind.One or RegexNodeKind.Notone or RegexNodeKind.Set &&
    !MayContainBacktracking(node.Child(0)))

The while loop that walks to the last descendant (line ~2143) only traverses through Capture and Concatenate — not Atomic. So Loop(0,1, Atomic(Multi)) stopped at the Atomic node, which isn't in the allowlist, and the loop was not auto-atomicized.

After our change, Loop(0,1, Multi) exposes the Multi directly. It passes the allowlist, MakeLoopAtomic() fires, and the loop gets wrapped in Atomic(Loop(...)).

The generated code difference for e.g. (?>\.git)?$ is:

Before (non-atomic loop) After (atomic loop)
Scope //{ ... //} (no block) { ... } (real block)
Stack No save/restore startingStackpos save/restore
$ failure goto LoopIterationNoMatch4 (bounces through iteration unwind, tries 0 iterations, then backtracks) goto CharLoopBacktrack3 (direct backtrack past loop)

The try-0-iterations bounce in the "before" path is dead work for patterns like (?>\.git)?$ — if .git matched (advancing 4 chars), the position before .git can't satisfy $. The "after" path skips this.

The startingStackpos save/restore is the loop's own per-iteration bookkeeping cleanup (the loop pushes pos/captures per iteration regardless of child type). It's needed so the direct goto CharLoopBacktrack3 doesn't leave orphan stack entries.

Net effect: the ReduceAtomic change is a strict improvement — it both simplifies the tree (removing a redundant node) and unblocks auto-atomicization that the Atomic wrapper was ironically preventing.

@danmoseley
Copy link
Member Author

Need to understand perf results. There's a few regressions, want to be sure they're noise

@danmoseley
Copy link
Member Author

@MihuBot benchmark Regex

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@MihuBot
Copy link

MihuBot commented Mar 26, 2026

@danmoseley
Copy link
Member Author

Note

This analysis was generated with Copilot assistance.

Benchmark analysis: two runs, same commit — all deltas are noise

Ran @MihuBot benchmark Regex twice on the same commit to check reproducibility. Also independently verified that the source generator produces byte-identical output for every benchmark pattern × options combination before and after the PR (by running the generator on both code paths locally and comparing character counts).

Run 1 outliers vs Run 2:

Benchmark Run 1 Run 2 Verdict
Uri_IsNotMatch Compiled 3.53x 0.97 Vanished
MatchesWords IC,Compiled 1.23 0.79 Flipped to improvement
SplitWords IC,Compiled 1.27 0.86 Flipped to improvement
ReplaceWords IC,Compiled 1.22 0.99 Vanished
IsMatch#11 NonBacktracking 0.69 1.02 Vanished (Main had 30ns error on 70ns measurement)
(?i)Twain NonBacktracking 0.86 0.99 Vanished
IP_IsMatch Compiled 0.81 0.94 Unstable

New outliers in Run 2 (not in Run 1):

Benchmark Run 1 Run 2
IP_IsNotMatch None 1.00 1.09
\p{L} NonBacktracking 1.00 0.90
\p{Ll} NonBacktracking 1.00 1.08
MatchesWords Compiled 0.99 1.07

The only outlier that appeared in both runs is SliceSlice IgnoreCase (~1.12 both times), but SliceSlice IC,Compiled is 1.00 in both runs, and SliceSlice IC,NonBacktracking went from 0.92 to 1.01 — so this is likely warm-up noise in the interpreted engine across ~8,884 simple word patterns whose source gen output is confirmed identical.

Conclusion: No real performance impact. Every pattern tested produces the same regex tree before and after the PR, so any deltas are JIT/machine noise.

@danmoseley
Copy link
Member Author

@MihaZupan the comment just above #126114 (comment) kicked off mihubot with garbage. should it check it's at the start of a line or has non junk after it or something...

@danmoseley
Copy link
Member Author

/ba-g -- opened internal issue for the "_cffi_backend" Python error. The other one is some android offline device. regardless, unrelated

@danmoseley danmoseley merged commit a5ae058 into dotnet:main Mar 26, 2026
82 of 89 checks passed
@danmoseley danmoseley deleted the dev/danmose/regex-reduction-cleanup branch March 26, 2026 03:26
@MihuBot
Copy link

MihuBot commented Mar 26, 2026

@MihaZupan
Copy link
Member

MihaZupan commented Mar 26, 2026

It already checks if the mention was in a code or quote block. I might have forgotten to check for inline code.
Might also throw the args at an LLM to decide if it's likely not a real command.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants