Fix BOL anchor not writing back updated position in TryFindNextPossibleStartingPosition#125280
Conversation
|
I'll let this run since it's going, but I assume it will produce the exact same as this with that change, perf: 1000 matches,
After the fix, I guess |
|
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions |
When the beginning-of-line (BOL) anchor in TryFindNextPossibleStartingPosition finds a newline via IndexOf and advances pos, the updated position was never written back to base.runtextpos. This caused the match engine to retry from the original position, negating the vectorized IndexOf optimization for patterns like bare '^' with Multiline. Fix both the source generator (RegexGenerator.Emitter.cs) and the IL compiler (RegexCompiler.cs) to set base.runtextpos = pos after the BOL position update. Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
|
@copilot use the performance benchmark skill to validate the improvement from this change |
@EgorBot -linux_amd -osx_arm64 using System.Text;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);
public class Bench
{
private string _input = default!;
private Regex _bolCompiled = default!;
[Params(5, 80, 320, 1280)]
public int Gap { get; set; }
[GlobalSetup]
public void Setup()
{
var sb = new StringBuilder();
for (int i = 0; i < 1000; i++)
{
sb.Append('a', Gap);
sb.Append('\n');
}
_input = sb.ToString();
_bolCompiled = new Regex("^", RegexOptions.Multiline | RegexOptions.Compiled);
}
[Benchmark]
public int BolCount() => _bolCompiled.Count(_input);
} |
There was a problem hiding this comment.
Pull request overview
Fixes an issue in the regex “find next starting position” anchor handling where the computed position bump for ^ (BOL with Multiline) wasn’t persisted back to base.runtextpos, causing the caller to retry from the original position and lose the intended skip.
Changes:
- Update the IL-emitting compiler (
RegexCompiler.cs) to store the advancedposback intobase.runtextposfor the BOL optimization path. - Update the source generator emitter (
RegexGenerator.Emitter.cs) to emitbase.runtextpos = pos;after advancing past the next newline for BOL.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs | Persists the BOL-advanced pos into RuntextposField so NoSearch mode uses the updated starting position. |
| src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs | Emits base.runtextpos = pos; after BOL-based pos advancement so generated code doesn’t lose the skip. |
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Show resolved
Hide resolved
|
@MihuBot regexdiff |
|
Maybe we should have a skill in this repo that explains @MihuBot commands like "benchmark regex" and "regexdiff" and such. I don't see anything for that. @MihaZupan ? |
|
Oh whoops, seeems like it started a jitdiff because of me mentioning it.. |
|
172 out of 18857 patterns have generated source code changes. Examples of GeneratedRegex source diffs"^" (5871 uses)[GeneratedRegex("^", RegexOptions.Multiline)] {
goto NoMatchFound;
}
+ base.runtextpos = pos;
}
return true;"^#include <([^>]+)>\\s*$" (2606 uses)[GeneratedRegex("^#include <([^>]+)>\\s*$", RegexOptions.IgnoreCase | RegexOptions.Multiline)] {
goto NoMatchFound;
}
+ base.runtextpos = pos;
}
// The pattern has the literal "#include" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence."^-+ *BEGIN (?<keyName>\\w+( \\w+)*) PRIVATE ..." (1964 uses)[GeneratedRegex("^-+ *BEGIN (?<keyName>\\w+( \\w+)*) PRIVATE KEY *-+\\r?\\n(Proc-Type: 4,ENCRYPTED\\r?\\nDEK-Info: (?<cipherName>[A-Z0-9-]+),(?<salt>[A-F0-9]+)\\r?\\n\\r?\\n)?(?<data>([a-zA-Z0-9/+=]{1,80}\\r?\\n)+)-+ *END \\k<keyName> PRIVATE KEY *-+", RegexOptions.Multiline)] {
goto NoMatchFound;
}
+ base.runtextpos = pos;
}
// The pattern begins with a character in the set \-."^ *> ?" (823 uses)[GeneratedRegex("^ *> ?", RegexOptions.Multiline)] {
goto NoMatchFound;
}
+ base.runtextpos = pos;
}
// The pattern begins with a character in the set [ >]."^ +$" (823 uses)[GeneratedRegex("^ +$", RegexOptions.Multiline)] {
goto NoMatchFound;
}
+ base.runtextpos = pos;
}
// The pattern begins with a character in the set ."^ {4}" (823 uses)[GeneratedRegex("^ {4}", RegexOptions.Multiline)] {
goto NoMatchFound;
}
+ base.runtextpos = pos;
}
// The pattern has the literal " " at the beginning of the pattern. Find the next occurrence."^( *)((?:[*+-]|\\d+\\.)) [^\\n]*(?:\\n(?!\\1 ..." (823 uses)[GeneratedRegex("^( *)((?:[*+-]|\\d+\\.)) [^\\n]*(?:\\n(?!\\1(?:[*+-]|\\d+\\.) )[^\\n]*)*", RegexOptions.Multiline)] {
goto NoMatchFound;
}
+ base.runtextpos = pos;
}
// The pattern begins with a character in the set [ *+\-\d]."^(\\d+)\\.(\\d+)\\.(\\d+)" (599 uses)[GeneratedRegex("^(\\d+)\\.(\\d+)\\.(\\d+)", RegexOptions.Multiline)] {
goto NoMatchFound;
}
+ base.runtextpos = pos;
}
// The pattern begins with a Unicode digit."\r\n ^\r\n [\\x20\\t]* ..." (569 uses)[GeneratedRegex("\r\n ^\r\n [\\x20\\t]*\r\n \\w+ [\\x20\\t]+\r\n (?<frame>\r\n (?<type> [^\\x20\\t]+ ) \\.\r\n (?<method> [^\\x20\\t]+? ) [\\x20\\t]*\r\n (?<params> \\( ( [\\x20\\t]* \\)\r\n | (?<pt> .+?) [\\x20\\t]+ (?<pn> .+?)\r\n (, [\\x20\\t]* (?<pt> .+?) [\\x20\\t]+ (?<pn> .+?) )* \\) ) )\r\n ( [\\x20\\t]+\r\n ( # Microsoft .NET stack traces\r\n \\w+ [\\x20\\t]+\r\n (?<file> [a-z] \\: .+? )\r\n \\: \\w+ [\\x20\\t]+\r\n (?<line> [0-9]+ ) \\p{P}?\r\n | # Mono stack traces\r\n \\[0x[0-9a-f]+\\] [\\x20\\t]+ \\w+ [\\x20\\t]+\r\n <(?<file> [^>]+ )>\r\n :(?<line> [0-9]+ )\r\n )\r\n )?\r\n )\r\n \\s*\r\n $", RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant)] {
goto NoMatchFound;
}
+ base.runtextpos = pos;
}
// The pattern begins with a character in the set [\t \w]."^LD_LIBRARY_PATH=(.*)$" (526 uses)[GeneratedRegex("^LD_LIBRARY_PATH=(.*)$", RegexOptions.Multiline)] {
goto NoMatchFound;
}
+ base.runtextpos = pos;
}
// The pattern has the literal "LD_LIBRARY_PATH=" at the beginning of the pattern. Find the next occurrence.For more diff examples, see https://gist.github.com/MihuBot/d7021d5c0016c70a0e532293f19acdab JIT assembly changesFor a list of JIT diff regressions, see Regressions.md Sample source code for further analysisconst string JsonPath = "RegexResults-1804.json";
if (!File.Exists(JsonPath))
{
await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FIih1ySA");
using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");
record KnownPattern(string Pattern, RegexOptions Options, int Count);
sealed class RegexEntry
{
public required KnownPattern Regex { get; set; }
public required string MainSource { get; set; }
public required string PrSource { get; set; }
public string? FullDiff { get; set; }
public string? ShortDiff { get; set; }
public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
} |
|
@copilot, benchmark results are ready: Please analyze the results and act accordingly. |
Benchmark results confirm the fix provides significant improvements that scale with inter-match gap size: Linux AMD (EPYC 9V45):
macOS ARM64 (Apple M4):
The speedup scales linearly with gap size as expected — without the fix, the vectorized Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
|
As above, this fixes quite everyday patterns like |
We do have a skill for Egor's bot for benchmarks. MihuBot's arguments are simpler (no benchmark source), documented here. I don't think teaching copilot how to trigger it via comments would be too useful, but giving the agent the ability to invoke it and wait for results as it's working (i.e. via REST API) could be interesting. |
Description
The BOL (
^withMultiline) handler inTryFindNextPossibleStartingPositionuses vectorizedIndexOf('\n')to advanceposto the next line start, but never writes the result back tobase.runtextpos. For patterns where BOL is the only optimization (e.g., bare^), the method returnstruewithNoSearchmode and the caller retries from the original position — negating the skip entirely.The generated code was:
RegexGenerator.Emitter.cs): Emitbase.runtextpos = pos;after the BOL position advance and length checkRegexCompiler.cs): Emit equivalentLdthis(); Ldloc(pos); Stfld(RuntextposField);All 30,817 functional and 1,034 unit regex tests pass.
Benchmark Results
EgorBot benchmarks confirm the fix on both x64 and ARM64 (
Regex.Count()on 1000 matches of^withMultiline):Linux AMD (EPYC 9V45):
macOS ARM64 (Apple M4):
Speedup scales linearly with inter-match gap size as expected — without the fix, the vectorized
IndexOf('\n')skip was computed but never written back, causing character-by-character scanning.💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.