Commit
Fix boundary handling in Regex auto-atomicity optimization (#79088)
This fixes a regression introduced between .NET Core 3.1 and .NET 5, when we added the auto-atomicity optimization. The optimization is based on the premise that if a loop is followed by something that can't possibly match anything the loop could give up when backtracking, then backtracking into that loop is wasted effort, and the loop can be made atomic. For example, given the expression `a*b`, there's nothing the `a*` loop could give up when backtracking that would also match `b`, so it can be transformed into `(?>a*)b`.

As part of this, we also factor in word boundaries, but we're too aggressive in our handling of them. If you have `a+\b`, there's nothing that `a+` could give up that would enable the `\b` to match. However, if you have `a*\b`, then because `\b` is a zero-width assertion, the `a*` can give up everything it matched, leaving the empty match, and the `\b` can then succeed at the position where the loop started. Yet we're erroneously still converting that to `(?>a*)\b`. The fix is to constrain that part of the optimization to require a non-zero minimum bound on the loop in order to make it atomic.

While it's simple to repro the incorrect handling here, it's rare to hit in real-world use: word boundary anchors are used to demarcate words, so this issue really only affects cases of empty words, and it's unusual for someone to write an expression that tries to identify empty words.
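The difference between the two loop bounds can be observed directly. The following is a minimal sketch in Python rather than .NET, meant only to illustrate the regex semantics described above, not the actual optimizer code. The input strings and the helper `first_span` are assumptions chosen for illustration, and atomicity is emulated with the lookahead-plus-backreference idiom `(?=(a*))\1` so the sketch does not depend on native `(?>...)` support.

```python
import re

# Patterns as a user would write them (the loop may backtrack).
LOOP_STAR = re.compile(r"a*\b")
LOOP_PLUS = re.compile(r"a+\b")

# Atomic equivalents. (?=(X))\1 emulates (?>X): the lookahead commits to
# one greedy match of X, and the backreference consumes exactly that text,
# so nothing can be given back afterwards.
ATOMIC_STAR = re.compile(r"(?=(a*))\1\b")
ATOMIC_PLUS = re.compile(r"(?=(a+))\1\b")


def first_span(pattern, text):
    """Return the (start, end) of the first match, or None (hypothetical helper)."""
    match = pattern.search(text)
    return match.span() if match else None


if __name__ == "__main__":
    for text in ("aab", "aa b", "b"):
        print(
            repr(text),
            "a*\\b:", first_span(LOOP_STAR, text),
            "atomic:", first_span(ATOMIC_STAR, text),
            "| a+\\b:", first_span(LOOP_PLUS, text),
            "atomic:", first_span(ATOMIC_PLUS, text),
        )
```

On the input `aab`, the backtracking `a*\b` finds an empty match at index 0, because the loop gives back both characters and `\b` holds at the start of the word. The atomic form cannot give anything back and only finds the boundary at index 3. The `a+` variants agree on all three inputs, which is consistent with restricting the optimization to loops with a non-zero minimum bound.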
1 parent 8841dd2, commit 41f57b7
Showing 3 changed files with 36 additions and 10 deletions.