Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix lazy loop IndexOf optimization #68400

Merged
merged 1 commit into from Apr 22, 2022
Merged

Conversation

stephentoub
Copy link
Member

This optimization wasn't kicking in for the most desirable cases due to a flaw in the logic. The optimization is meant to use IndexOf{Any} in a situation like <[^>]*?> where a lazy loop is consuming input until it sees some character. In this example, we'd want to IndexOf('>'), but the optimization wasn't kicking in because it saw that the not'd character matched the subsequent literal, and gave up. In such a case, we don't actually want to give up, we just can't short-circuit the whole match if we find the loop character (if they don't overlap, then finding the loop character means the loop ends before getting to something that could match the rest of the pattern and can thus immediately fail).

This optimization wasn't kicking in for the most desirable cases due to a flaw in the logic.  The optimization is meant to use IndexOf{Any} in a situation like `<[^>]*?>` where a lazy loop is consuming input until it sees some character.  In this example, we'd want to `IndexOf('>')`, but the optimization wasn't kicking in because it saw that the not'd character matched the subsequent literal, and gave up.  In such a case, we don't actually want to give up, we just can't short-circuit the whole match if we find the loop character (if they don't overlap, then finding the loop character means the loop ends before getting to something that could match the rest of the pattern and can thus immediately fail).
@ghost
Copy link

ghost commented Apr 22, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This optimization wasn't kicking in for the most desirable cases due to a flaw in the logic. The optimization is meant to use IndexOf{Any} in a situation like <[^>]*?> where a lazy loop is consuming input until it sees some character. In this example, we'd want to IndexOf('>'), but the optimization wasn't kicking in because it saw that the not'd character matched the subsequent literal, and gave up. In such a case, we don't actually want to give up, we just can't short-circuit the whole match if we find the loop character (if they don't overlap, then finding the loop character means the loop ends before getting to something that could match the rest of the pattern and can thus immediately fail).

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance

Milestone: 7.0.0

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we still have #63300 open and I plan to work on it soon, but would it be worth adding some tests on this PR to ensure some pattern examples are using IndexOf as expected? If you don't want to do that now, let's just add a note on that issue so that I remember to add a test case for this PR along the rest of the tests.

@stephentoub
Copy link
Member Author

let's just add a note on that issue so that I remember to add a test case for this PR along the rest of the tests.

Will do. Thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants