Fix BOL anchor not writing back updated position in TryFindNextPossibleStartingPosition#125280

Merged: stephentoub merged 2 commits into main from copilot/fix-regex-pattern-performance on Mar 7, 2026

Contributor

Copilot AI commented Mar 6, 2026

Description

The BOL (^ with Multiline) handler in TryFindNextPossibleStartingPosition uses vectorized IndexOf('\n') to advance pos to the next line start, but never writes the result back to base.runtextpos. For patterns where BOL is the only optimization (e.g., bare ^), the method returns true with NoSearch mode and the caller retries from the original position — negating the skip entirely.

The generated code was:

int pos = base.runtextpos;

if (pos > 0 && inputSpan[pos - 1] != '\n')
{
    int newlinePos = inputSpan.Slice(pos).IndexOf('\n');
    if ((uint)newlinePos > inputSpan.Length - pos - 1)
        goto NoMatchFound;
    pos += newlinePos + 1;
    // ← base.runtextpos never set
}

return true;

  • Source generator (RegexGenerator.Emitter.cs): Emit base.runtextpos = pos; after the BOL position advance and length check
  • IL compiler (RegexCompiler.cs): Emit equivalent Ldthis(); Ldloc(pos); Stfld(RuntextposField);

All 30,817 functional and 1,034 unit regex tests pass.
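
The effect of the missing write-back can be reproduced outside the regex engine. The following is a minimal sketch, not the actual generated code: TryFindNextLineStart, its runtextpos parameter, and the writeBack flag are hypothetical stand-ins for the BOL step and for base.runtextpos.

```csharp
using System;

// Hypothetical stand-in for the BOL step of TryFindNextPossibleStartingPosition.
// 'runtextpos' plays the role of base.runtextpos; 'writeBack' toggles the fix.
static bool TryFindNextLineStart(ReadOnlySpan<char> inputSpan, ref int runtextpos, bool writeBack)
{
    int pos = runtextpos;

    // Not at the start of the input and not just after '\n': skip to the next line start.
    if (pos > 0 && inputSpan[pos - 1] != '\n')
    {
        int newlinePos = inputSpan.Slice(pos).IndexOf('\n');
        if ((uint)newlinePos > inputSpan.Length - pos - 1)
        {
            return false; // no later line start (NoMatchFound in the generated code)
        }

        pos += newlinePos + 1;

        if (writeBack)
        {
            runtextpos = pos; // the statement this PR adds: base.runtextpos = pos;
        }
    }

    return true;
}

string input = "abc\ndef";

int fixedPos = 1;
bool fixedFound = TryFindNextLineStart(input, ref fixedPos, writeBack: true);

int stalePos = 1;
bool staleFound = TryFindNextLineStart(input, ref stalePos, writeBack: false);

Console.WriteLine($"with write-back: {fixedPos}, without: {stalePos}");
```

With the write-back, the caller resumes at index 4 (the start of "def"). Without it, the caller still sees index 1 and has to step forward one character at a time, which is the behavior the description attributes to main.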

Benchmark Results

EgorBot benchmarks confirm the fix on both x64 and ARM64 (Regex.Count() on 1000 matches of ^ with Multiline):

Linux AMD (EPYC 9V45):

| Gap (chars) | PR | main | Speedup |
| --- | --- | --- | --- |
| 5 | 13.97 μs | 32.51 μs | 2.3x |
| 80 | 14.22 μs | 447.66 μs | 31x |
| 320 | 16.40 μs | 1,763.62 μs | 108x |
| 1280 | 33.08 μs | 11,002.82 μs | 333x |

macOS ARM64 (Apple M4):

| Gap (chars) | PR | main | Speedup |
| --- | --- | --- | --- |
| 5 | 9.73 μs | 25.84 μs | 2.7x |
| 80 | 11.87 μs | 422.31 μs | 36x |
| 320 | 19.20 μs | 3,005.44 μs | 157x |
| 1280 | 61.44 μs | 36,555.42 μs | 595x |

Speedup scales linearly with inter-match gap size as expected — without the fix, the vectorized IndexOf('\n') skip was computed but never written back, causing character-by-character scanning.
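
The Speedup column can be rechecked from the reported means. This is a small sketch that only divides the main time by the PR time for the Linux AMD rows above; the Speedup helper and the tuple layout are illustrative names, not part of the benchmark harness.

```csharp
using System;

// Mean times in microseconds, copied from the Linux AMD table: (gap, PR, main).
(int Gap, double Pr, double Main)[] linuxAmd =
{
    (5, 13.97, 32.51),
    (80, 14.22, 447.66),
    (320, 16.40, 1763.62),
    (1280, 33.08, 11002.82),
};

// Speedup is simply the main time divided by the PR time.
static double Speedup(double prMicroseconds, double mainMicroseconds) =>
    mainMicroseconds / prMicroseconds;

foreach (var row in linuxAmd)
{
    Console.WriteLine($"gap {row.Gap}: {Speedup(row.Pr, row.Main):F1}x");
}
```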


@danmoseley
Member

I'll let this run since it's going, but I assume it will produce exactly the same change as de0549f.

With that change, perf:

1000 matches, Regex.Count(), varying inter-match gap (ns/match):

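| Gap (chars) | Engine | ^ Before | ^ After | (?=\n) (control) | Improvement |
| --- | --- | --- | --- | --- | --- |
| 5 | Compiled | 30 | 44 | 42 | (noise) |
| 5 | SourceGenerated | -- | 53 | 53 | (noise) |
| 80 | Compiled | 341 | 44 | 45 | 8x |
| 80 | SourceGenerated | -- | 56 | 55 | ~6x |
| 320 | Compiled | 2,008 | -- | -- | ~87x |
| 1,280 | Compiled | 16,805 | 74 | 73 | 227x |
| 1,280 | SourceGenerated | -- | 87 | 85 | ~193x |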

After the fix, ^ matches (?=\n) performance for both Compiled and SourceGenerated engines across all gap sizes. It is also now faster than (?<=\n|\A) (which has no FindFirstChar skip and must check every position).
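
For reference, here is a small sketch of how the three patterns relate. The sample input and the MatchPositions helper are made up for illustration: ^ with Multiline and (?<=\n|\A) match at the same positions (line starts), while the (?=\n) control matches just before each newline.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: collects the index of every match of 'pattern' in 'input'.
static int[] MatchPositions(string pattern, RegexOptions options, string input) =>
    Regex.Matches(input, pattern, options).Select(m => m.Index).ToArray();

string sample = "ab\ncd\nef"; // made-up input with three lines

int[] bol = MatchPositions("^", RegexOptions.Multiline, sample);
int[] lookbehind = MatchPositions(@"(?<=\n|\A)", RegexOptions.None, sample);
int[] control = MatchPositions(@"(?=\n)", RegexOptions.None, sample);

Console.WriteLine(string.Join(",", bol));        // 0,3,6
Console.WriteLine(string.Join(",", lookbehind)); // 0,3,6
Console.WriteLine(string.Join(",", control));    // 2,5
```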

I guess (?<=\n|\A) could be lowered to ^ but that's not very interesting.

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

When the beginning-of-line (BOL) anchor in TryFindNextPossibleStartingPosition
finds a newline via IndexOf and advances pos, the updated position was never
written back to base.runtextpos. This caused the match engine to retry from
the original position, negating the vectorized IndexOf optimization for
patterns like bare '^' with Multiline.

Fix both the source generator (RegexGenerator.Emitter.cs) and the IL compiler
(RegexCompiler.cs) to set base.runtextpos = pos after the BOL position update.

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix regex pattern performance issue with '^' construct" to "Fix BOL anchor not writing back updated position in TryFindNextPossibleStartingPosition" on Mar 6, 2026
@stephentoub
Member

@copilot use the performance benchmark skill to validate the improvement from this change

@stephentoub stephentoub marked this pull request as ready for review March 6, 2026 22:17
Copilot AI review requested due to automatic review settings March 6, 2026 22:17
@stephentoub stephentoub enabled auto-merge (squash) March 6, 2026 22:17
Contributor Author

Copilot AI commented Mar 6, 2026

> @copilot use the performance benchmark skill to validate the improvement from this change

@EgorBot -linux_amd -osx_arm64

using System.Text;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private string _input = default!;
    private Regex _bolCompiled = default!;

    [Params(5, 80, 320, 1280)]
    public int Gap { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var sb = new StringBuilder();
        for (int i = 0; i < 1000; i++)
        {
            sb.Append('a', Gap);
            sb.Append('\n');
        }
        _input = sb.ToString();
        _bolCompiled = new Regex("^", RegexOptions.Multiline | RegexOptions.Compiled);
    }

    [Benchmark]
    public int BolCount() => _bolCompiled.Count(_input);
}

Contributor

Copilot AI left a comment

Pull request overview

Fixes an issue in the regex “find next starting position” anchor handling where the computed position bump for ^ (BOL with Multiline) wasn’t persisted back to base.runtextpos, causing the caller to retry from the original position and lose the intended skip.

Changes:

  • Update the IL-emitting compiler (RegexCompiler.cs) to store the advanced pos back into base.runtextpos for the BOL optimization path.
  • Update the source generator emitter (RegexGenerator.Emitter.cs) to emit base.runtextpos = pos; after advancing past the next newline for BOL.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs | Persists the BOL-advanced pos into RuntextposField so NoSearch mode uses the updated starting position. |
| src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs | Emits base.runtextpos = pos; after BOL-based pos advancement so generated code doesn’t lose the skip. |

@danmoseley
Member

@MihuBot regexdiff

@danmoseley
Member

Maybe we should have a skill in this repo that explains @MihuBot commands like "benchmark regex" and "regexdiff" and such. I don't see anything for that. @MihaZupan ?

@danmoseley
Member

Oh whoops, seems like it started a jitdiff because of me mentioning it.

danmoseley approved these changes Mar 6, 2026
@MihuBot

MihuBot commented Mar 6, 2026

172 out of 18857 patterns have generated source code changes.

Examples of GeneratedRegex source diffs
"^" (5871 uses)
[GeneratedRegex("^", RegexOptions.Multiline)]
      {
          goto NoMatchFound;
      }
+       base.runtextpos = pos;
  }
  
  return true;
"^#include <([^>]+)>\\s*$" (2606 uses)
[GeneratedRegex("^#include <([^>]+)>\\s*$", RegexOptions.IgnoreCase | RegexOptions.Multiline)]
      {
          goto NoMatchFound;
      }
+       base.runtextpos = pos;
  }
  
  // The pattern has the literal "#include" ordinal case-insensitive at the beginning of the pattern. Find the next occurrence.
"^-+ *BEGIN (?<keyName>\\w+( \\w+)*) PRIVATE ..." (1964 uses)
[GeneratedRegex("^-+ *BEGIN (?<keyName>\\w+( \\w+)*) PRIVATE KEY *-+\\r?\\n(Proc-Type: 4,ENCRYPTED\\r?\\nDEK-Info: (?<cipherName>[A-Z0-9-]+),(?<salt>[A-F0-9]+)\\r?\\n\\r?\\n)?(?<data>([a-zA-Z0-9/+=]{1,80}\\r?\\n)+)-+ *END \\k<keyName> PRIVATE KEY *-+", RegexOptions.Multiline)]
      {
          goto NoMatchFound;
      }
+       base.runtextpos = pos;
  }
  
  // The pattern begins with a character in the set \-.
"^ *> ?" (823 uses)
[GeneratedRegex("^ *> ?", RegexOptions.Multiline)]
      {
          goto NoMatchFound;
      }
+       base.runtextpos = pos;
  }
  
  // The pattern begins with a character in the set [ >].
"^ +$" (823 uses)
[GeneratedRegex("^ +$", RegexOptions.Multiline)]
      {
          goto NoMatchFound;
      }
+       base.runtextpos = pos;
  }
  
  // The pattern begins with a character in the set  .
"^ {4}" (823 uses)
[GeneratedRegex("^ {4}", RegexOptions.Multiline)]
      {
          goto NoMatchFound;
      }
+       base.runtextpos = pos;
  }
  
  // The pattern has the literal "    " at the beginning of the pattern. Find the next occurrence.
"^( *)((?:[*+-]|\\d+\\.)) [^\\n]*(?:\\n(?!\\1 ..." (823 uses)
[GeneratedRegex("^( *)((?:[*+-]|\\d+\\.)) [^\\n]*(?:\\n(?!\\1(?:[*+-]|\\d+\\.) )[^\\n]*)*", RegexOptions.Multiline)]
      {
          goto NoMatchFound;
      }
+       base.runtextpos = pos;
  }
  
  // The pattern begins with a character in the set [ *+\-\d].
"^(\\d+)\\.(\\d+)\\.(\\d+)" (599 uses)
[GeneratedRegex("^(\\d+)\\.(\\d+)\\.(\\d+)", RegexOptions.Multiline)]
      {
          goto NoMatchFound;
      }
+       base.runtextpos = pos;
  }
  
  // The pattern begins with a Unicode digit.
"\r\n ^\r\n [\\x20\\t]* ..." (569 uses)
[GeneratedRegex("\r\n            ^\r\n            [\\x20\\t]*\r\n            \\w+ [\\x20\\t]+\r\n            (?<frame>\r\n                (?<type> [^\\x20\\t]+ ) \\.\r\n                (?<method> [^\\x20\\t]+? ) [\\x20\\t]*\r\n                (?<params>  \\( ( [\\x20\\t]* \\)\r\n                               |                    (?<pt> .+?) [\\x20\\t]+ (?<pn> .+?)\r\n                                 (, [\\x20\\t]* (?<pt> .+?) [\\x20\\t]+ (?<pn> .+?) )* \\) ) )\r\n                ( [\\x20\\t]+\r\n                    ( # Microsoft .NET stack traces\r\n                    \\w+ [\\x20\\t]+\r\n                    (?<file> [a-z] \\: .+? )\r\n                    \\: \\w+ [\\x20\\t]+\r\n                    (?<line> [0-9]+ ) \\p{P}?\r\n                    | # Mono stack traces\r\n                    \\[0x[0-9a-f]+\\] [\\x20\\t]+ \\w+ [\\x20\\t]+\r\n                    <(?<file> [^>]+ )>\r\n                    :(?<line> [0-9]+ )\r\n                    )\r\n                )?\r\n            )\r\n            \\s*\r\n            $", RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant)]
      {
          goto NoMatchFound;
      }
+       base.runtextpos = pos;
  }
  
  // The pattern begins with a character in the set [\t \w].
"^LD_LIBRARY_PATH=(.*)$" (526 uses)
[GeneratedRegex("^LD_LIBRARY_PATH=(.*)$", RegexOptions.Multiline)]
      {
          goto NoMatchFound;
      }
+       base.runtextpos = pos;
  }
  
  // The pattern has the literal "LD_LIBRARY_PATH=" at the beginning of the pattern. Find the next occurrence.

For more diff examples, see https://gist.github.com/MihuBot/d7021d5c0016c70a0e532293f19acdab

JIT assembly changes
Total bytes of base: 55609071
Total bytes of diff: 55610219
Total bytes of delta: 1148 (0.00 % of base)
Total relative delta: 4.41
    diff is a regression.
    relative diff is a regression.

For a list of JIT diff regressions, see Regressions.md

Sample source code for further analysis
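// Usings needed in addition to the SDK's implicit usings (System, System.IO, System.Linq, System.Net.Http).
using System.IO.Compression;
using System.Text.Json;
using System.Text.RegularExpressions;

const string JsonPath = "RegexResults-1804.json";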
if (!File.Exists(JsonPath))
{
    await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FIih1ySA");
    using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
    archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}

using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");



record KnownPattern(string Pattern, RegexOptions Options, int Count);

sealed class RegexEntry
{
    public required KnownPattern Regex { get; set; }
    public required string MainSource { get; set; }
    public required string PrSource { get; set; }
    public string? FullDiff { get; set; }
    public string? ShortDiff { get; set; }
    public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
    public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}

@EgorBo
Member

EgorBo commented Mar 6, 2026

@copilot, benchmark results are ready:

Please analyze the results and act accordingly.
NOTE: some benchmarks may be flaky or bi-modal, so use your judgment when interpreting small differences.

Contributor Author

Copilot AI commented Mar 6, 2026

> @copilot, benchmark results are ready:

Benchmark results confirm the fix provides significant improvements that scale with inter-match gap size:

Linux AMD (EPYC 9V45):

| Gap | PR | main | Speedup |
| --- | --- | --- | --- |
| 5 | 13.97 μs | 32.51 μs | 2.3x |
| 80 | 14.22 μs | 447.66 μs | 31x |
| 320 | 16.40 μs | 1,763.62 μs | 108x |
| 1280 | 33.08 μs | 11,002.82 μs | 333x |

macOS ARM64 (Apple M4):

| Gap | PR | main | Speedup |
| --- | --- | --- | --- |
| 5 | 9.73 μs | 25.84 μs | 2.7x |
| 80 | 11.87 μs | 422.31 μs | 36x |
| 320 | 19.20 μs | 3,005.44 μs | 157x |
| 1280 | 61.44 μs | 36,555.42 μs | 595x |

The speedup scales linearly with gap size as expected — without the fix, the vectorized IndexOf('\n') skip was computed but never written back, causing character-by-character scanning. With the fix, the skip is properly applied. Results are consistent with @danmoseley's earlier measurements and confirm the fix on both x64 and ARM64.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • bot.egorbo.com
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node /home/REDACTED/work/_temp/ghcca-node/node/bin/node --enable-source-maps /home/REDACTED/work/_temp/copilot-developer-action-main/dist/index.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@danmoseley
Member

As above, this fixes quite everyday patterns like ^LD_LIBRARY_PATH=(.*)$ where there IS a nice string to search fast for. But if I understand right, until this fix, that could be arbitrarily impacted by wasting the find next position advance. So I guess depending on the text, this might significantly improve such patterns too. Just maybe not as much as ^
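
As a concrete illustration of that kind of pattern, here is a minimal usage sketch. The input text is made up. Per the regexdiff output above, the generated code for this pattern first runs the BOL skip and then searches for the literal "LD_LIBRARY_PATH=".

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up environment dump; the variable of interest is not on the first line.
string env = "PATH=/usr/bin\nLD_LIBRARY_PATH=/opt/lib\nHOME=/root";

// With Multiline, ^ and $ anchor to line boundaries instead of the whole input.
Match libPath = Regex.Match(env, "^LD_LIBRARY_PATH=(.*)$", RegexOptions.Multiline);

Console.WriteLine(libPath.Groups[1].Value); // /opt/lib
```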

@MihaZupan
Member

MihaZupan commented Mar 7, 2026

> Maybe we should have a skill in this repo that explains @MihuBot commands like "benchmark regex" and "regexdiff" and such

We do have a skill for Egor's bot for benchmarks. MihuBot's arguments are simpler (no benchmark source), documented here. I don't think teaching copilot how to trigger it via comments would be too useful, but giving the agent the ability to invoke it and wait for results as it's working (i.e. via REST API) could be interesting.
