Skip to content

Fix regex parser to reject quantifiers on anchors (e.g., ^*, $+, \b?)#120638

Closed
Copilot wants to merge 3 commits intomainfrom
copilot/fix-regex-parsing-issue
Closed

Fix regex parser to reject quantifiers on anchors (e.g., ^*, $+, \b?)#120638
Copilot wants to merge 3 commits intomainfrom
copilot/fix-regex-parsing-issue

Conversation

Copy link
Contributor

Copilot AI commented Oct 11, 2025

Fixes #18880

Summary

The .NET regex parser was incorrectly accepting patterns where anchors are quantified, such as ^*, $+, \b?, etc. These patterns are meaningless since anchors are zero-width assertions and quantifying them has no practical effect. Most major regex engines (JavaScript, Python, Java, PCRE) correctly reject these patterns with errors like "The preceding token is not quantifiable."

Note: The Rust regex engine does permit quantified anchors and treats them as valid patterns with specific matching behavior. This change aligns .NET with the majority of regex engines rather than Rust's permissive approach.

Changes

Core Implementation

RegexNode.cs

  • Added IsAnchor property to identify anchor node kinds that cannot be quantified
  • Covers all anchor types: ^, $, \A, \Z, \z, \b, \B, \G, and internal UpdateBumpalong

RegexParser.cs

  • Added validation before calling MakeQuantifier to check if the node is an anchor
  • Throws QuantifierAfterNothing error with appropriate message when quantifiers are applied to anchors
  • Handles all quantifier types: *, +, ?, {n}, {n,m}

Test Updates

New Tests (RegexParserTests.cs)

  • Added 30+ test cases covering all anchor types with various quantifiers
  • Tests cover single-char quantifiers (*, +, ?) and range quantifiers ({n}, {n,m})
  • Tests verify correct error type and offset reporting

Updated Existing Tests

  • Updated RegexParserTests.netcoreapp.cs to expect errors for \b*, \b+, \b?, \A{10}
  • Commented out patterns in RegexRustTests.cs (patterns like (?m)^*, (?m)$+) that were valid in Rust but are now invalid in .NET
  • Commented out invalid patterns in Regex.Match.Tests.cs (lookbehind patterns with quantified anchors)
  • Updated Regex.KnownPattern.Tests.cs to filter out real-world patterns containing quantified anchors

Example

// Before: These patterns were incorrectly accepted
var regex = new Regex(@"^*_.$");
string input = "THIS_LINE_HAS_MANY_UNDERSCORES_L";
string result = regex.Match(input).Groups[0].Value; // Returns "_L"

// After: Correctly throws RegexParseException
var regex = new Regex(@"^*_.$"); 
// RegexParseException: Invalid pattern '^*_.$' at offset 2. Quantifier '*' following nothing.

Breaking Change

This is a breaking change for code that relies on the current behavior of accepting quantified anchors. However:

  • The existing behavior is meaningless and likely unintentional in user code
  • This aligns .NET with the majority of mainstream regex engines (JavaScript, Python, Java, PCRE)
  • The impact is expected to be minimal given the patterns have no practical use

Fixes #29174

Original prompt

This section details on the original issue you should resolve

<issue_title>Regex should fail to parse "^*"</issue_title>
<issue_description>I have the following C# regular expression

var regex = new Regex(@"^*_.$");
string input = "THIS_LINE_HAS_MANY_UNDERSCORES_L";

string result = regex.Match(input).Groups[0].Value;
Console.WriteLine(result);

Even though to the best of my knowledge the regex is invalid (* has no element to match on), it outputs the following (which is really just the _.$ part):

_L

Is ^ treated as a character class by the .NET Regex engine?</issue_description>

Comments on the Issue (you are @copilot in this section)

@danmoseley Other engines seem to give "The preceding token is not quantifiable" which seems appropriate. The question is whether this is worth fixing (given cost/risk of doing so, and potential to break someone who would be surprised) @danmoseley > @danmosemsft it'a bug! fix it!

We cannot fix all bugs. We allocate engineering time based on considerations like customer impact, cost of fix and so forth. This has presumably been an issue for a long time and so the impact seems low. That being said, we'd likely accept a PR for it if you're offering to make one. It would need tests and also care to make sure the compiled and interpreted paths behave the same (I'm guessing this is easy because it's at the parsing stage)

It seems to me that if other engines fail on this, and it's very unlikely anyone is relying on it, we could make it an error. @ViktorHofer as area expert, thoughts?</comment_new>
<comment_new>@CyrusNajmabadi
If this changes, please let me know and I'll update the Roslyn side :-)</comment_new>
<comment_new>@
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.</comment_new>
<comment_new>@danmoseley
would we take a PR to error on this syntax? if so, seems worth keeping open for that.

even if not, perhaps we should add a test to ensure the behavior doesn't change. at least it's the same for all engines (makes sense as it's the parser I guess)

    static void Main()
    {
        string input = "^abc_d";
        Console.WriteLine(Regex.Match(input, @"^*_.$").Groups[0].Value);
        Console.WriteLine(Regex.Match(input, @"^*_.$", RegexOptions.Compiled).Groups[0].Value);
        Console.WriteLine(Regex.Match(input, @"^*_.$", RegexOptions.NonBacktracking).Groups[0].Value);
        Console.WriteLine(GeneratedRegex().Match(input).Groups[0].Value);
    }

    [GeneratedRegex(@"^*_.$")]
    private static partial Regex GeneratedRegex();
_d
_d
_d
_d
```</body></comment_new>
</comments>

Fixes #29174


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI and others added 2 commits October 11, 2025 18:12
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix regex parsing issue for invalid expression Fix regex parser to reject quantifiers on anchors (e.g., ^*, $+, \b?) Oct 11, 2025
Copilot AI requested a review from stephentoub October 11, 2025 18:38
@stephentoub
Copy link
Member

Commented out invalid patterns in RegexRustTests.cs (patterns like (?m)^*, (?m)$+)

@copilot, does this mean there are mainstream engines that permit this?

Copy link
Contributor Author

Copilot AI commented Oct 11, 2025

Commented out invalid patterns in RegexRustTests.cs (patterns like (?m)^*, (?m)$+)

@copilot, does this mean there are mainstream engines that permit this?

Yes, the Rust regex engine does permit quantified anchors. These test cases were ported from the Rust regex library test suite (as noted in the file header), and they demonstrate that Rust treats patterns like (?m)^* and (?m)$+ as valid, with specific match behavior.

For example, (?m)^* in Rust matches zero or more beginning-of-line positions and returns matches at every position (including between characters), while (?m)^+ matches one or more consecutive beginning-of-line positions.

However, most other mainstream engines (JavaScript, Python, Java, PCRE) reject these patterns as errors. My change aligns .NET with the majority of engines rather than with Rust's permissive behavior.

@stephentoub
Copy link
Member

Thanks. In that case, we should leave the behavior as-is. Having the quantifier on the anchor does not hurt anything, is logically valid (it's just applying the semantics of the quantifier to the anchor, which while not particularly useful, follows reasonable rules), and is supported by other engines. Since we already have tests exercising the behavior, we can just consider the issue closed.

@stephentoub stephentoub deleted the copilot/fix-regex-parsing-issue branch October 17, 2025 12:48
@github-actions github-actions bot locked and limited conversation to collaborators Nov 17, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regex should fail to parse "^*" [Linux/ARM32] System.Security.Cryptography.Algorithms.Tests fail

2 participants