Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DotNet 6 Preview 7: Regex.Match and Regex.IsMatch wrong results with startat parameter overload #58794

Closed
fitdev opened this issue Sep 8, 2021 · 7 comments
Labels
area-System.Text.RegularExpressions untriaged New issue has not been triaged by the area owner

Comments

@fitdev
Copy link

fitdev commented Sep 8, 2021

Description

When using

Regex.Match(string input, int startat)
Regex.IsMatch(string input, int startat)

It finds no match, contrary to using:

Regex.Match(input.Substring(startat))
Regex.IsMatch(input.Substring(startat))

in which case match is found.

Shouldn't overloads taking in int startat parameter behave the same as if input.Substring(startat) was passed instead?

I don't want to use Substring workaround since this leads to unnecessary allocations and copying. In that reagrd it would be nice if Regex methods would have Span-based overloads.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.RegularExpressions untriaged New issue has not been triaged by the area owner labels Sep 8, 2021
@ghost
Copy link

ghost commented Sep 8, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

When using

Regex.Match(string input, int startat)
Regex.IsMatch(string input, int startat)

It finds no match, contrary to using:

Regex.Match(input.Substring(startat))
Regex.IsMatch(input.Substring(startat))

in which case match is found.

Shouldn't overloads taking in int startat parameter behave the same as if input.Substring(startat) was passed instead?

I don't want to use Substring workaround since this leads to unnecessary allocations and copying. In that reagrd it would be nice if Regex methods would have Span-based overloads.

Author: fitdev
Assignees: -
Labels:

area-System.Text.RegularExpressions, untriaged

Milestone: -

@eerhardt
Copy link
Member

eerhardt commented Sep 8, 2021

@fitdev - can you provide a program that reproduces the issue with expected and actual results?

@stephentoub
Copy link
Member

Shouldn't overloads taking in int startat parameter behave the same as if input.Substring(startat) was passed instead?

Not necessarily. For example, if the pattern contains constructs that look before the current position (e.g. a lookbehind), there's a large difference between using startat to start matching in the middle of a string (where the engine is able to look at the content of the input before the startat) and trimming off everything that came before startat (in which case the engine is unable to look there). Another example is anchors, where an anchor can have a very specific meaning about its position relative to the beginning of the input.

@stephentoub
Copy link
Member

In that reagrd it would be nice if Regex methods would have Span-based overloads.

That's tracked by #23602, which also discusses why from a public API perspective it'd be possible to add an IsMatch(ReadOnlySpan<char>) but much more challenging to add a Match(ReadOnlySpan<char>).

@stephentoub
Copy link
Member

stephentoub commented Sep 8, 2021

don't want to use Substring workaround since this leads to unnecessary allocations and copying

Have you tried using the overload that takes the beginning position and a length, rather than a startat position? (Yes, it's confusing from an API perspective that two overloads differ in the meaning of an int in the same position, but that's the case.)

@fitdev
Copy link
Author

fitdev commented Sep 8, 2021

@stephentoub Thank you very much for your fast response and detailed explanation.

Have you tried using the overload that takes the beginning position and a length, rather than a startat position?

Yes I have and noticed that it behaves as expected. However the problem is that there is such an overload only for Regex.Match but no overload for Regex.IsMatch, which is the one I really wanted to use, since it does not allocate a Match object which I don't really need.

So I guess I will have to stick with input.Substring(startat) for now, since it seems to have less of an overhead than using Regex.Match, though I may be wrong about this and would appreciate your correction here since you are the performance expert (loved your recent in-depth blog post)!

One last point then: the docs should then be improved and clearly explain this subtle behavior difference, both between using input.Substring(startat) and Regex.[Is]Match(input, startat) APIs and between Regex.IsMatch(input, startat) and Regex.Match(input, beginning, length) APIs (startatandbeginning` seem synonymous yet have different meanings as you explained).

@stephentoub
Copy link
Member

the docs should then be improved and clearly explain this subtle behavior difference

If they don't already, I agree that'd be useful. Note the docs are open source; if you'd like to submit a PR to do so, that'd be welcome.
https://github.com/dotnet/dotnet-api-docs/blob/main/xml/System.Text.RegularExpressions/Regex.xml

That said, they do try to make this clear, in the details Remarks section for both overloads, e.g.
"beginning always defines the index of the leftmost character to include in the search, and length defines the maximum number of characters to search. Together, they define the range of the search."

So I guess I will have to stick with input.Substring(startat) for now, since it seems to have less of an overhead than using Regex.Match

I'd need to measure, but it's likely to depend on how long the input string and how much you're trimming.

there is such an overload only for Regex.Match but no overload for Regex.IsMatch, which is the one I really wanted to use

Please feel free to open an issue proposing such an overload. Seems reasonable for Match and IsMatch to have a consistent set of overloads.

loved your recent in-depth blog post

Thanks. Glad you enjoyed it.

@ghost ghost locked as resolved and limited conversation to collaborators Oct 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.RegularExpressions untriaged New issue has not been triaged by the area owner
Projects
None yet
Development

No branches or pull requests

3 participants