Remove redundant Atomic wrapper and fix shared-prefix extraction in regex reduction #126114
ReduceAtomic() already unwraps Atomic from Empty/Nothing children, but not from One/Notone/Set/Multi children that also inherently cannot backtrack. The Atomic wrapper on these nodes adds useless save/restore scaffolding in the interpreter and dead code in source-generated output. Add cases to unwrap Atomic from these node kinds, following the same pattern as the existing Empty/Nothing handling. Affects 383 of 17,434 real-world patterns (2.2%) analyzed from Regex_RealWorldPatterns.json.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ExtractCommonPrefixText() bails out of the entire alternation when FindBranchOneOrMultiStart() returns null for any branch, even though later branches might share a common text prefix that could be factored. For example, in [^x]|ab|ac the first branch is a Set node (not text), so the method returned early without factoring branches 2 and 3, which share prefix 'a'. Changing 'return alternation' to 'continue' lets the outer loop advance past non-text branches and factor any subsequent consecutive text branches that share a prefix. The inner loop already handles non-matching branches via break, and singleton runs are correctly skipped by the existing check at endingIndex - startingIndex <= 1. Affects 1,090 of 17,434 real-world patterns (6.3%) analyzed from Regex_RealWorldPatterns.json.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
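The effect of switching from 'return alternation' to 'continue' can be sketched with a small stand-alone model. This is not the RegexNode implementation: the class name, the method name, the skipNonText flag, and the convention that a null entry stands for a non-text branch are all hypothetical, and the model only compares first characters rather than extracting a full common prefix.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified model of the outer/inner loop described above.
// A null entry stands for a non-text branch (e.g. a Set such as [^x]).
public static class PrefixFactoringSketch
{
    // Returns the [Start, End) ranges of consecutive text branches that share
    // the same first character and are therefore candidates for factoring.
    public static List<(int Start, int End)> FindFactorableRuns(
        IReadOnlyList<string?> branches, bool skipNonText)
    {
        var runs = new List<(int Start, int End)>();
        for (int start = 0; start < branches.Count - 1; start++)
        {
            string? text = branches[start];
            if (string.IsNullOrEmpty(text))
            {
                if (skipNonText)
                {
                    continue; // new behavior: advance past the non-text branch
                }
                return runs;  // old behavior: give up on the whole alternation
            }

            // Extend the run while the following branches are text with the same first char.
            int end = start + 1;
            for (; end < branches.Count; end++)
            {
                string? next = branches[end];
                if (string.IsNullOrEmpty(next) || next[0] != text[0])
                {
                    break;
                }
            }

            if (end - start <= 1)
            {
                continue; // singleton run: nothing to factor
            }

            runs.Add((start, end));
            start = end - 1; // the loop increment then resumes just after the run
        }
        return runs;
    }
}
```

For the [^x]|ab|ac example, calling FindFactorableRuns with the branches { null, "ab", "ac" } finds no run when skipNonText is false and finds the single run covering branches 2 and 3 when it is true.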
|
@EgorBot -linux_amd |
|
@MihuBot regexdiff |
Pull request overview
This PR improves regex parse-time tree reduction in System.Text.RegularExpressions by removing redundant atomic nodes and by enabling shared-prefix factoring in alternations even when non-text branches appear earlier.
Changes:
- Unwrap redundant Atomic nodes when the child node can’t backtrack (One, Notone, Set, Multi).
- Fix ExtractCommonPrefixText to skip non-text branches (instead of bailing out) so later consecutive text branches can still be factored.
- Add unit + functional tests covering both reductions.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs | Removes redundant atomic wrappers for inherently non-backtrackable leaf nodes; fixes alternation shared-prefix extraction to skip non-text branches. |
| src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexReductionTests.cs | Adds reduction-shape tests for redundant atomic removal and for shared-prefix factoring past non-text branches. |
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Match.Tests.cs | Adds behavioral match tests to ensure these reductions preserve matching semantics. |
|
Oops I meant mihubot |
|
@MihuBot benchmark Regex |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
460 out of 18857 patterns have generated source code changes.
Examples of GeneratedRegex source diffs
"^\\s*\\[(([^#;]|\\\\#|\\\\;)+)\\]\\s*([#;].*)?$" (3059 uses)
[GeneratedRegex("^\\s*\\[(([^#;]|\\\\#|\\\\;)+)\\]\\s*([#;].*)?$")]
/// ○ 1st capture group.<br/>
/// ○ Loop greedily at least once.<br/>
/// ○ 2nd capture group.<br/>
- /// ○ Match with 3 alternative expressions.<br/>
+ /// ○ Match with 2 alternative expressions.<br/>
/// ○ Match a character in the set [^#;].<br/>
- /// ○ Match the string "\\#".<br/>
- /// ○ Match the string "\\;".<br/>
+ /// ○ Match a sequence of expressions.<br/>
+ /// ○ Match '\\'.<br/>
+ /// ○ Match a character in the set [#;].<br/>
/// ○ Match ']'.<br/>
/// ○ Match a whitespace character greedily any number of times.<br/>
/// ○ Optional (greedy).<br/>
//{
int capture_starting_pos1 = pos;
- // Match with 3 alternative expressions.
+ // Match with 2 alternative expressions.
//{
int alternation_starting_pos = pos;
int alternation_starting_capturepos = base.Crawlpos();
// Branch 1
//{
- // Match the string "\\#".
- if (!slice.StartsWith("\\#"))
- {
- goto AlternationBranch1;
- }
-
- Utilities.StackPush(ref base.runstack!, ref stackpos, 1, alternation_starting_pos, alternation_starting_capturepos);
- pos += 2;
- slice = inputSpan.Slice(pos);
- goto AlternationMatch;
-
- AlternationBranch1:
- pos = alternation_starting_pos;
- slice = inputSpan.Slice(pos);
- UncaptureUntil(alternation_starting_capturepos);
- //}
-
- // Branch 2
- //{
- // Match the string "\\;".
- if (!slice.StartsWith("\\;"))
+ if ((uint)slice.Length < 2 ||
+ slice[0] != '\\' || // Match '\\'.
+ (((ch = slice[1]) != '#') & (ch != ';'))) // Match a character in the set [#;].
{
goto LoopIterationNoMatch;
}
- Utilities.StackPush(ref base.runstack!, ref stackpos, 2, alternation_starting_pos, alternation_starting_capturepos);
+ Utilities.StackPush(ref base.runstack!, ref stackpos, 1, alternation_starting_pos, alternation_starting_capturepos);
pos += 2;
slice = inputSpan.Slice(pos);
goto AlternationMatch;
case 0:
goto AlternationBranch;
case 1:
- goto AlternationBranch1;
- case 2:
goto LoopIterationNoMatch;
}
"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25 ..." (1964 uses)
[GeneratedRegex("^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$|^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\\-]*[a-zA-Z0-9])\\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\\-]*[A-Za-z0-9])$", RegexOptions.IgnoreCase)]
/// ○ Loop exactly 3 times.<br/>
/// ○ 1st capture group.<br/>
/// ○ 2nd capture group.<br/>
- /// ○ Match with 5 alternative expressions.<br/>
+ /// ○ Match with 4 alternative expressions.<br/>
/// ○ Match a character in the set [0-9].<br/>
/// ○ Match a sequence of expressions.<br/>
/// ○ Match a character in the set [1-9].<br/>
/// ○ Match a character in the set [0-9] exactly 2 times.<br/>
/// ○ Match a sequence of expressions.<br/>
/// ○ Match '2'.<br/>
- /// ○ Match a character in the set [0-4].<br/>
- /// ○ Match a character in the set [0-9].<br/>
- /// ○ Match a sequence of expressions.<br/>
- /// ○ Match the string "25".<br/>
- /// ○ Match a character in the set [0-5].<br/>
+ /// ○ Match with 2 alternative expressions.<br/>
+ /// ○ Match a sequence of expressions.<br/>
+ /// ○ Match a character in the set [0-4].<br/>
+ /// ○ Match a character in the set [0-9].<br/>
+ /// ○ Match a sequence of expressions.<br/>
+ /// ○ Match '5'.<br/>
+ /// ○ Match a character in the set [0-5].<br/>
/// ○ Match '.'.<br/>
/// ○ 3rd capture group.<br/>
- /// ○ Match with 5 alternative expressions.<br/>
+ /// ○ Match with 4 alternative expressions.<br/>
/// ○ Match a character in the set [0-9].<br/>
/// ○ Match a sequence of expressions.<br/>
/// ○ Match a character in the set [1-9].<br/>
/// ○ Match a character in the set [0-9] exactly 2 times.<br/>
/// ○ Match a sequence of expressions.<br/>
/// ○ Match '2'.<br/>
- /// ○ Match a character in the set [0-4].<br/>
- /// ○ Match a character in the set [0-9].<br/>
- /// ○ Match a sequence of expressions.<br/>
- /// ○ Match the string "25".<br/>
- /// ○ Match a character in the set [0-5].<br/>
+ /// ○ Match with 2 alternative expressions.<br/>
+ /// ○ Match a sequence of expressions.<br/>
+ /// ○ Match a character in the set [0-4].<br/>
+ /// ○ Match a character in the set [0-9].<br/>
+ /// ○ Match a sequence of expressions.<br/>
+ /// ○ Match '5'.<br/>
+ /// ○ Match a character in the set [0-5].<br/>
/// ○ Match if at the end of the string or if before an ending newline.<br/>
/// ○ Match a sequence of expressions.<br/>
/// ○ Loop greedily any number of times.<br/>
//{
int capture_starting_pos1 = pos;
- // Match with 5 alternative expressions.
+ // Match with 4 alternative expressions.
//{
int alternation_starting_pos1 = pos;
int alternation_starting_capturepos1 = base.Crawlpos();
// Branch 3
//{
- if ((uint)slice.Length < 3 ||
- slice[0] != '2' || // Match '2'.
- !char.IsBetween(slice[1], '0', '4') || // Match a character in the set [0-4].
- !char.IsAsciiDigit(slice[2])) // Match a character in the set [0-9].
- {
- goto AlternationBranch4;
- }
-
- Utilities.StackPush(ref base.runstack!, ref stackpos, 3, alternation_starting_pos1, alternation_starting_capturepos1);
- pos += 3;
- slice = inputSpan.Slice(pos);
- goto AlternationMatch1;
-
- AlternationBranch4:
- pos = alternation_starting_pos1;
- slice = inputSpan.Slice(pos);
- UncaptureUntil(alternation_starting_capturepos1);
- //}
-
- // Branch 4
- //{
- if ((uint)slice.Length < 3 ||
- !slice.StartsWith("25", StringComparison.OrdinalIgnoreCase) || // Match the string "25" (ordinal case-insensitive)
- !char.IsBetween(slice[2], '0', '5')) // Match a character in the set [0-5].
+ // Match '2'.
+ if (slice.IsEmpty || slice[0] != '2')
{
goto LoopIterationNoMatch;
}
- Utilities.StackPush(ref base.runstack!, ref stackpos, 4, alternation_starting_pos1, alternation_starting_capturepos1);
- pos += 3;
- slice = inputSpan.Slice(pos);
+ // Match with 2 alternative expressions.
+ //{
+ if ((uint)slice.Length < 2)
+ {
+ goto LoopIterationNoMatch;
+ }
+
+ switch (slice[1])
+ {
+ case '0' or '1' or '2' or '3' or '4':
+
+ // Match a character in the set [0-9].
+ if ((uint)slice.Length < 3 || !char.IsAsciiDigit(slice[2]))
+ {
+ goto LoopIterationNoMatch;
+ }
+
+ pos += 3;
+ slice = inputSpan.Slice(pos);
+ break;
+
+ case '5':
+
+ // Match a character in the set [0-5].
+ if ((uint)slice.Length < 3 || !char.IsBetween(slice[2], '0', '5'))
+ {
+ goto LoopIterationNoMatch;
+ }
+
+ pos += 3;
+ slice = inputSpan.Slice(pos);
+ break;
+
+ default:
+ goto LoopIterationNoMatch;
+ }
+ //}
+
+ Utilities.StackPush(ref base.runstack!, ref stackpos, 3, alternation_starting_pos1, alternation_starting_capturepos1);
goto AlternationMatch1;
//}
case 2:
goto AlternationBranch3;
case 3:
- goto AlternationBranch4;
- case 4:
goto LoopIterationNoMatch;
}
//{
capture_starting_pos2 = pos;
- // Match with 5 alternative expressions.
+ // Match with 4 alternative expressions.
//{
alternation_starting_pos2 = pos;
alternation_starting_capturepos2 = base.Crawlpos();
// Match a character in the set [0-9].
if (slice.IsEmpty || !char.IsAsciiDigit(slice[0]))
{
- goto AlternationBranch5;
+ goto AlternationBranch4;
}
alternation_branch = 0;
slice = inputSpan.Slice(pos);
goto AlternationMatch2;
- AlternationBranch5:
+ AlternationBranch4:
pos = alternation_starting_pos2;
slice = inputSpan.Slice(pos);
UncaptureUntil(alternation_starting_capturepos2);
!char.IsBetween(slice[0], '1', '9') || // Match a character in the set [1-9].
!char.IsAsciiDigit(slice[1])) // Match a character in the set [0-9].
{
- goto AlternationBranch6;
+ goto AlternationBranch5;
}
alternation_branch = 1;
slice = inputSpan.Slice(pos);
goto AlternationMatch2;
- AlternationBranch6:
+ AlternationBranch5:
pos = alternation_starting_pos2;
slice = inputSpan.Slice(pos);
UncaptureUntil(alternation_starting_capturepos2);
!char.IsAsciiDigit(slice[1]) || // Match a character in the set [0-9] exactly 2 times.
!char.IsAsciiDigit(slice[2]))
{
- goto AlternationBranch7;
+ goto AlternationBranch6;
}
alternation_branch = 2;
slice = inputSpan.Slice(pos);
goto AlternationMatch2;
- AlternationBranch7:
+ AlternationBranch6:
pos = alternation_starting_pos2;
slice = inputSpan.Slice(pos);
UncaptureUntil(alternation_starting_capturepos2);
// Branch 3
//{
- if ((uint)slice.Length < 3 ||
- slice[0] != '2' || // Match '2'.
- !char.IsBetween(slice[1], '0', '4') || // Match a character in the set [0-4].
- !char.IsAsciiDigit(slice[2])) // Match a character in the set [0-9].
- {
- goto AlternationBranch8;
- }
-
- alternation_branch = 3;
- pos += 3;
- slice = inputSpan.Slice(pos);
- goto AlternationMatch2;
-
- AlternationBranch8:
- pos = alternation_starting_pos2;
- slice = inputSpan.Slice(pos);
- UncaptureUntil(alternation_starting_capturepos2);
- //}
-
- // Branch 4
- //{
- if ((uint)slice.Length < 3 ||
- !slice.StartsWith("25", StringComparison.OrdinalIgnoreCase) || // Match the string "25" (ordinal case-insensitive)
- !char.IsBetween(slice[2], '0', '5')) // Match a character in the set [0-5].
+ // Match '2'.
+ if (slice.IsEmpty || slice[0] != '2')
{
goto LoopBacktrack;
}
- alternation_branch = 4;
- pos += 3;
- slice = inputSpan.Slice(pos);
+ // Match with 2 alternative expressions.
+ //{
+ if ((uint)slice.Length < 2)
+ {
+ goto LoopBacktrack;
+ }
+
+ switch (slice[1])
+ {
+ case '0' or '1' or '2' or '3' or '4':
+
+ // Match a character in the set [0-9].
+ if ((uint)slice.Length < 3 || !char.IsAsciiDigit(slice[2]))
+ {
+ goto LoopBacktrack;
+ }
+
+ pos += 3;
+ slice = inputSpan.Slice(pos);
+ break;
+
+ case '5':
+
+ // Match a character in the set [0-5].
+ if ((uint)slice.Length < 3 || !char.IsBetween(slice[2], '0', '5'))
+ {
+ goto LoopBacktrack;
+ }
+
+ pos += 3;
+ slice = inputSpan.Slice(pos);
+ break;
+
+ default:
+ goto LoopBacktrack;
+ }
+ //}
+
+ alternation_branch = 3;
goto AlternationMatch2;
//}
switch (alternation_branch)
{
case 0:
- goto AlternationBranch5;
+ goto AlternationBranch4;
case 1:
- goto AlternationBranch6;
+ goto AlternationBranch5;
case 2:
- goto AlternationBranch7;
+ goto AlternationBranch6;
case 3:
- goto AlternationBranch8;
- case 4:
goto LoopBacktrack;
}
"^(?'protocol'\\w+\\:\\/\\/)?(?>(?'user'.*)@) ..." (283 uses)
[GeneratedRegex("^(?'protocol'\\w+\\:\\/\\/)?(?>(?'user'.*)@)?(?'endpoint'[^\\/:]+)(?>\\:(?'port'\\d+))?[\\/:](?'identifier'.*?)\\/?(?>\\.git)?$")]
/// ○ Match a character other than '\n' lazily any number of times.<br/>
/// ○ Match '/' greedily, optionally.<br/>
/// ○ Optional (greedy).<br/>
- /// ○ Atomic group.<br/>
- /// ○ Match the string ".git".<br/>
+ /// ○ Match the string ".git".<br/>
/// ○ Match if at the end of the string or if before an ending newline.<br/>
/// </code>
/// </remarks>
int loop_iteration2 = 0;
int loop_iteration3 = 0;
int stackpos = 0;
+ int startingStackpos = 0;
ReadOnlySpan<char> slice = inputSpan.Slice(pos);
// Match if at the beginning of the string.
//}
// Optional (greedy).
- //{
+ {
+ startingStackpos = stackpos;
loop_iteration3 = 0;
LoopBody3:
pos = base.runstack![--stackpos];
UncaptureUntil(base.runstack![--stackpos]);
slice = inputSpan.Slice(pos);
- LoopEnd3:;
- //}
+ LoopEnd3:
+ stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
+ }
// Match if at the end of the string or if before an ending newline.
if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
{
- goto LoopIterationNoMatch3;
+ goto CharLoopBacktrack2;
}
// The input matched.
"(?<timeOfDay>凌晨|清晨|早上|早|上午|中午|下午|午后|晚上|夜里|夜晚 ..." (186 uses)
[GeneratedRegex("(?<timeOfDay>凌晨|清晨|早上|早|上午|中午|下午|午后|晚上|夜里|夜晚|半夜|夜间|深夜|傍晚|晚)", RegexOptions.ExplicitCapture | RegexOptions.Singleline)]
/// ○ Match '上' atomically, optionally.<br/>
/// ○ Match a sequence of expressions.<br/>
/// ○ Match '夜'.<br/>
- /// ○ Atomic group.<br/>
- /// ○ Match a character in the set [\u665A\u91CC\u95F4].<br/>
+ /// ○ Match a character in the set [\u665A\u91CC\u95F4].<br/>
/// ○ Match the string "半夜".<br/>
/// ○ Match the string "深夜".<br/>
/// ○ Match the string "傍晚".<br/>
"^(?'protocol'\\w+)?(\\:\\/\\/)?(?>(?'user'.* ..." (125 uses)
[GeneratedRegex("^(?'protocol'\\w+)?(\\:\\/\\/)?(?>(?'user'.*)@)?(?'endpoint'[^\\/:]+)(?>\\:(?'port'\\d+))?[\\/:](?'identifier'.*?)\\/?(?>\\.git)?$")]
/// ○ Match a character other than '\n' lazily any number of times.<br/>
/// ○ Match '/' greedily, optionally.<br/>
/// ○ Optional (greedy).<br/>
- /// ○ Atomic group.<br/>
- /// ○ Match the string ".git".<br/>
+ /// ○ Match the string ".git".<br/>
/// ○ Match if at the end of the string or if before an ending newline.<br/>
/// </code>
/// </remarks>
int loop_iteration3 = 0;
int loop_iteration4 = 0;
int stackpos = 0;
+ int startingStackpos = 0;
ReadOnlySpan<char> slice = inputSpan.Slice(pos);
// Match if at the beginning of the string.
//}
// Optional (greedy).
- //{
+ {
+ startingStackpos = stackpos;
loop_iteration4 = 0;
LoopBody4:
pos = base.runstack![--stackpos];
UncaptureUntil(base.runstack![--stackpos]);
slice = inputSpan.Slice(pos);
- LoopEnd4:;
- //}
+ LoopEnd4:
+ stackpos = startingStackpos; // Ensure any remaining backtracking state is removed.
+ }
// Match if at the end of the string or if before an ending newline.
if (pos < inputSpan.Length - 1 || ((uint)pos < (uint)inputSpan.Length && inputSpan[pos] != '\n'))
{
- goto LoopIterationNoMatch4;
+ goto CharLoopBacktrack3;
}
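// The input matched.
For more diff examples, see https://gist.github.com/MihuBot/726ab2347ad8984e604bb47e9b493b8b
Sample source code for further analysis
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Text.RegularExpressions;

const string JsonPath = "RegexResults-1833.json";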
if (!File.Exists(JsonPath))
{
await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FKDmJWKA");
using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");
record KnownPattern(string Pattern, RegexOptions Options, int Count);
sealed class RegexEntry
{
public required KnownPattern Regex { get; set; }
public required string MainSource { get; set; }
public required string PrSource { get; set; }
public string? FullDiff { get; set; }
public string? ShortDiff { get; set; }
public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
} |
|
Diffs mostly improvements. The "remove unnecessary atomic" did increase some diffs slightly. See |
|
Looking into it now, maybe we can push to this PR as well |
|
See benchmark results at https://gist.github.com/MihuBot/c3889fd8bfe4d9796ee6e0eb15983467 |
|
Note
This analysis was AI/Copilot-generated based on investigation of the regexdiff results.
regexdiff analysis
The regexdiff shows 460 of 18,857 patterns with source generator changes. These fall into two categories, both beneficial:
1. Shared-prefix factoring (from the
|
| | Before (non-atomic loop) | After (atomic loop) |
|---|---|---|
| Scope | //{ ... //} (no block) | { ... } (real block) |
| Stack | No save/restore | startingStackpos save/restore |
| $ failure | goto LoopIterationNoMatch4 (bounces through iteration unwind, tries 0 iterations, then backtracks) | goto CharLoopBacktrack3 (direct backtrack past loop) |
The try-0-iterations bounce in the "before" path is dead work for patterns like (?>\.git)?$ — if .git matched (advancing 4 chars), the position before .git can't satisfy $. The "after" path skips this.
The startingStackpos save/restore is the loop's own per-iteration bookkeeping cleanup (the loop pushes pos/captures per iteration regardless of child type). It's needed so the direct goto CharLoopBacktrack3 doesn't leave orphan stack entries.
Net effect: the ReduceAtomic change is a strict improvement — it both simplifies the tree (removing a redundant node) and unblocks auto-atomicization that the Atomic wrapper was ironically preventing.
|
Need to understand perf results. There's a few regressions, want to be sure they're noise |
|
@MihuBot benchmark Regex |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
See benchmark results at https://gist.github.com/MihuBot/cc4a1f1cc13cad8ab0b73d7d0e9fd115 |
|
Note
This analysis was generated with Copilot assistance.
Benchmark analysis: two runs, same commit; all deltas are noise
Ran
Run 1 outliers vs Run 2:
New outliers in Run 2 (not in Run 1):
The only outlier that appeared in both runs is SliceSlice IgnoreCase (~1.12 both times), but SliceSlice IC,Compiled is 1.00 in both runs, and SliceSlice IC,NonBacktracking went from 0.92 to 1.01, so this is likely warm-up noise in the interpreted engine across ~8,884 simple word patterns whose source gen output is confirmed identical.
Conclusion: No real performance impact. Every pattern tested produces the same regex tree before and after the PR, so any deltas are JIT/machine noise. |
|
@MihaZupan the comment just above #126114 (comment) kicked off mihubot with garbage. should it check it's at the start of a line or has non junk after it or something... |
|
/ba-g -- opened internal issue for the "_cffi_backend" Python error. The other one is some android offline device. regardless, unrelated |
|
See benchmark results at https://gist.github.com/MihuBot/7f79014ef9c224452264941fbb810a00 |
|
It already checks if the mention was in a code or quote block. I might have forgotten to check for inline code. |
Two small regex tree reduction improvements found during analysis of 17,434 real-world patterns from Regex_RealWorldPatterns.json (#126104):

1. Remove redundant Atomic wrapper from non-backtrackable nodes
ReduceAtomic() already unwraps Atomic from Empty/Nothing children, but not from One/Notone/Set/Multi children that also inherently cannot backtrack. These arise when the user writes (?>...) wrapping content that reduces to a single fixed-length match, e.g. (?>\u591C[\u665A\u91CC\u95F4]). The Atomic wrapper adds unnecessary save/restore scaffolding. Affects 383 patterns (2.2%).
No measurable perf impact (JIT already eliminates the dead code), but produces a cleaner tree.

2. Fix ExtractCommonPrefixText to continue past non-text branches
The method bails out of the entire alternation when FindBranchOneOrMultiStart() returns null for any branch. This prevents factoring later consecutive text branches that share a prefix. For example, in [^#;]|\\#|\\;, the Set branch causes early return, so branches \\# and \\; (which share prefix \) are never factored.
Changed return alternation to continue so the outer loop advances past non-text branches. The inner loop and singleton-run check already handle all edge cases correctly. Affects 1,090 patterns (6.3%), though most are 2-branch cases with negligible perf benefit. ~50 patterns with 4+ shared branches see modest improvement from enabling switch dispatch.

Note
This PR was generated with AI assistance (GitHub Copilot).
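
Both reductions are meant to change only the shape of the tree, not what a pattern matches. A minimal way to spot-check that from outside the library is to compare an original pattern against a hand-written equivalent using the public Regex API. The helper below is a sketch under that assumption: the class and method names are hypothetical, the "reduced" patterns are written by hand rather than taken from the reducer, and the check says nothing about the internal tree itself.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: compares two patterns by their first match on each input.
public static class ReductionEquivalenceCheck
{
    // Returns true when both patterns agree on success, index, and value
    // of the first match for every supplied input.
    public static bool SameMatches(
        string patternA, string patternB, RegexOptions options, params string[] inputs)
    {
        var a = new Regex(patternA, options);
        var b = new Regex(patternB, options);
        foreach (string input in inputs)
        {
            Match ma = a.Match(input);
            Match mb = b.Match(input);
            if (ma.Success != mb.Success || ma.Index != mb.Index || ma.Value != mb.Value)
            {
                return false;
            }
        }
        return true;
    }
}
```

For example, SameMatches(@"(?>\.git)?$", @"(?:\.git)?$", RegexOptions.None, "repo.git", "repo") exercises the atomic-wrapper case, and SameMatches(@"^(?:[^x]|ab|ac)$", @"^(?:[^x]|a[bc])$", RegexOptions.None, "ab", "ac", "q", "x") exercises the shared-prefix case. The functional tests added in Regex.Match.Tests.cs remain the authoritative coverage.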