Regex - Support Possessive Quantifiers #24381
Comments
|
I personally prefer |
|
Even though Regex doesn't accept that specific notation, you can achieve exactly the same by using Atomic groups, which are supported. As @TheConstructor points out, you could write the example above by doing |
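As a minimal sketch of the atomic-group workaround described above, the following compares a greedy pattern with its atomic-group counterpart, using the `D++[A-Z]+` example from the issue description. The class and field names are hypothetical, chosen only for illustration; `(?>...)` is the atomic group syntax that .NET's `Regex` already supports.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class AtomicGroupDemo
{
    // Greedy: D+ may give back a 'D' so that [A-Z]+ still has a letter to match.
    public static readonly Regex Greedy = new Regex(@"D+[A-Z]+");

    // Atomic group: once (?>D+) has consumed the run of 'D's it never gives any back,
    // which is the behaviour the issue asks for under the spelling D++.
    public static readonly Regex Atomic = new Regex(@"(?>D+)[A-Z]+");
}
```

With these definitions, `AtomicGroupDemo.Greedy.IsMatch("DDDD")` is true because the greedy quantifier backtracks by one character, while `AtomicGroupDemo.Atomic.IsMatch("DDDD")` is false. Both match `"DDDDE"`.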
|
@joperezr would you accept a PR that tries to translate |
|
Our Regex engine at parse time already tries to make groups atomic if possible, so today if you have a pattern like
This would require some consideration, as it would be a breaking change (today, this pattern throws an exception at construction time, since we don't allow nested quantifiers like that unless you use parentheses). I think we should do this only if it is justified by this being a highly used feature, and I don't think that's the case. @stephentoub I suppose that the data you have for those millions of patterns are all used in .NET and hence won't be using this feature, but do we also have some data of non-.NET patterns to see how many of them use possessive quantifiers? |
Right, the .NET syntax doesn't support the possessive quantifier syntactic sugar, so none of the patterns in our collection will use them, since they all parse correctly. You could look through the multilingual corpus of regexes in the links from #62971 to see how popular possessive quantifiers are there. |
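The construction-time failure mentioned above can be observed with a small probe. This is a sketch, not part of the engine: `NestedQuantifierDemo` and `Parses` are hypothetical names, and the code only assumes that the `Regex` constructor reports parse errors as `ArgumentException` (or a subclass of it).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class NestedQuantifierDemo
{
    // Returns true if the .NET Regex constructor accepts the pattern.
    public static bool Parses(string pattern)
    {
        try
        {
            _ = new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Parse failures surface as ArgumentException or a type derived from it.
            return false;
        }
    }
}
```

Under the current behaviour, `NestedQuantifierDemo.Parses("D++[A-Z]+")` returns false, while the parenthesized form `"(D+)+[A-Z]+"` and the atomic form `"(?>D+)[A-Z]+"` both parse.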
I put together this short LINQPad script to get an upper bound from the data set mentioned in #62971:

```csharp
void Main()
{
    // One JSON document per line; the file is expected next to the query file.
    var lines = File.ReadLines(Path.Combine(Path.GetDirectoryName(LINQPad.Util.CurrentQueryPath), "uniq-regexes-8.json"));
    var count = 0L;
    var useCount_IStype_to_nPosts = new Dictionary<string, long>();
    var useCount_registry_to_nModules = new Dictionary<string, long>();
    foreach (var line in lines)
    {
        try
        {
            var pattern = JsonSerializer.Deserialize<RegPattern>(line);
            // Plain substring check: counts any pattern containing "++", "?+" or "*+".
            if (pattern.pattern.Contains("++") || pattern.pattern.Contains("?+") || pattern.pattern.Contains("*+"))
            {
                //pattern.Dump(pattern.pattern, collapseTo: 0);
                MergeDictionaries(useCount_IStype_to_nPosts, pattern.useCount_IStype_to_nPosts);
                MergeDictionaries(useCount_registry_to_nModules, pattern.useCount_registry_to_nModules);
                count++;
            }
        }
        catch (JsonException e)
        {
            // Show the offending line and keep going.
            e.Dump(line, collapseTo: 0);
        }
    }
    count.Dump();
    useCount_IStype_to_nPosts.Dump("useCount_IStype_to_nPosts");
    useCount_registry_to_nModules.Dump("useCount_registry_to_nModules");
}

record RegPattern(string pattern, string[] supportedLangs, string type, Dictionary<string, long> useCount_IStype_to_nPosts, Dictionary<string, long> useCount_registry_to_nModules);

// Adds every count from source onto the running totals in target.
private static void MergeDictionaries(IDictionary<string, long> target, IReadOnlyDictionary<string, long> source)
{
    foreach (var (key, value) in source)
    {
        target.TryGetValue(key, out var targetValue);
        target[key] = targetValue + value;
    }
}
```

It finds 2088 unique patterns with useCount_IStype_to_nPosts
and useCount_registry_to_nModules
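One reason the count is only an upper bound is that the substring check also flags patterns where the second `+` is not a possessive marker, for example an escaped plus followed by a quantifier, or a plus inside a character class. The sketch below illustrates this; `UpperBoundDemo`, `LooksPossessive` and `ParsesInDotNet` are hypothetical names, and the sample patterns are made up for illustration rather than taken from the data set.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class UpperBoundDemo
{
    // Same substring heuristic as the LINQPad script.
    public static bool LooksPossessive(string pattern) =>
        pattern.Contains("++") || pattern.Contains("?+") || pattern.Contains("*+");

    // A pattern that .NET accepts today cannot be relying on possessive quantifiers.
    public static bool ParsesInDotNet(string pattern)
    {
        try
        {
            _ = new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }
}
```

For instance, `@"a\++"` (an `a` followed by one or more literal plus signs) and `@"[*+]"` (a character class) are both flagged by `LooksPossessive` yet parse in .NET today, so they are false positives of the heuristic. A true possessive pattern such as `"a++"` is flagged and does not parse.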
What I didn't realize, but the paper found, is that possessive quantifiers using

I do realize that supporting this would change the behaviour of the regular expression engine, but I can hardly see how replacing an exception with an actual implementation would lead to severely broken programs. After all, why would you have a pattern in your program that always throws an exception? And why would that exception be driving your program? There is of course a certain chance of accidental usage in new source code.

Lastly, possessive quantifiers are among the lesser performance hogs (compared to regular quantifiers, recursive patterns or balancing groups), as they rule out backtracking. They offer a more concise way of eliminating backtracking than atomic groups, and they are (in my opinion) easier to maintain, as the intent is directly visible at the respective quantifier.

I understand that they are not high up on your priority list, but I would like to have the opportunity to submit a PR for review. |
|
I'm not concerned about taking something that fails to parse and making it valid; that is, for example, the case for pretty much every new C# feature. My concern would be if there's anything that today is valid with one meaning and this would introduce an ambiguity, or if it would lead to any kind of inconsistencies, or if it would cause any meaningful slowdown in the parser. If adding this syntactic sugar doesn't harm anything existing, I'd personally be ok seeing it added, but it's not a priority for our team to implement. |
|
@stephentoub since it sounds like we'd take a change for this in principle, can we keep this open in case a community member is interested? |
|
If it's doable in a non-breaking manner (breaking being, e.g., something that parses one way today and changes tomorrow), sure. |
Current popular regex engines like `java.util.regex` or PCRE support greedy, lazy and possessive quantifiers. The current .NET regex engine supports only the former two. Though possessive quantifiers are syntactic sugar and can be mimicked with atomic grouping today, consider supporting them, as they have gained popularity over the last years.

Abstract:
Possessive quantifiers work the same as greedy quantifiers but without backtracking on the input string. That means that the pattern `D++[A-Z]+` matches the input string `DDDDE` but not `DDDD`.