Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex - Support Possessive Quantifiers #24381

Open
ViktorHofer opened this issue Dec 7, 2017 · 9 comments
Open

Regex - Support Possessive Quantifiers #24381

ViktorHofer opened this issue Dec 7, 2017 · 9 comments
Labels
area-System.Text.RegularExpressions enhancement Product code improvement that does NOT require public API changes/additions help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@ViktorHofer
Copy link
Member

Current popular regex engines like java.util.regex or PCRE support greedy, lazy and possessive quantifiers. The current .NET regex engine does only support the former two. Though possessive quantifiers are syntactic sugar and can be mimicked with atomic grouping today, consider supporting them as they gained popularity over the last years.

Abstract:
Possessive quantifiers work the same as greedy quantifiers but without backtracking on the input string. That means that the following pattern D++[A-Z]+ matches the input string DDDDE but not DDDD.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@joperezr joperezr moved this from Needs triage to Future in Triage POD for Meta, Reflection, etc Nov 24, 2021
@TheConstructor
Copy link

I personally prefer abc*+ over (?>abc*), as it allows switching behaviors quicker and brings the behavior-change closer to the affected operation (the single * in this case).

@joperezr
Copy link
Member

joperezr commented Jul 5, 2022

Even though Regex doesn't accept that specific notation, you can achieve exactly the same by using Atomic groups, which are supported. As @TheConstructor points out, you could write the example above by doing (?>D+)[A-Z]+. Given there hasn't been a lot of interaction on this issue in the past 5 years, I will go ahead and close it for now until we get more feedback about supporting these quantifier's notation.

@joperezr joperezr closed this as completed Jul 5, 2022
@TheConstructor
Copy link

TheConstructor commented Jul 5, 2022

@joperezr would you accept a PR that tries to translate x*+ to (?>x*)?

@joperezr
Copy link
Member

joperezr commented Jul 5, 2022

Our Regex engine at parse time already tries to make groups atomic if possible, so today if you have a pattern like x* we are already translating that to instead be (?>x*), since that is much more performant as it doesn't backtrack. This, however, doesn't happen if the loop can't automatically be made atomic, for example, a pattern like x*\w won't automatically turn the x into an atomic group, since that might conflict with the \w as inputs like xxxx won't match that pattern anymore.

would you accept a PR that tries to translate x*+ to (?>x*)?

This would require some consideration as it would be a breaking change (Today, this pattern will throw an exception at construction time since today we don't allow nested quantifiers like that unless you use parentheses). I think we should do this only if it is justified by this being a highly used feature, and I don't think that's the case. @stephentoub I suppose that the data you have for those millions of patterns are all used in .NET and hence won't be using this feature, but do we also have some data of non-.NET patterns to see how many of them use possessive quantifiers?

@stephentoub
Copy link
Member

@stephentoub I suppose that the data you have for those millions of patterns are all used in .NET and hence won't be using this feature, but do we also have some data of non-.NET patterns to see how many of them use possessive quantifiers?

Right, the .NET syntax doesn't support the possessive quantifer syntactic sugar, so none of the patterns in our collection will use them, since they all parse correctly.

You could look through the multilingual corpus of regexes in the links from #62971 to see how popular possessive quantifiers are there.

@TheConstructor
Copy link

@joperezr

This would require some consideration as it would be a breaking change (Today, this pattern will throw an exception at construction time since today we don't allow nested quantifiers like that unless you use parentheses). I think we should do this only if it is justified by this being a highly used feature, and I don't think that's the case. @stephentoub I suppose that the data you have for those millions of patterns are all used in .NET and hence won't be using this feature, but do we also have some data of non-.NET patterns to see how many of them use possessive quantifiers?

I put together this short LINQPad-script to get an upper-boundary from the data-set mentioned in #62971:

async Task Main()
{
	var lines = File.ReadLines(Path.Combine(Path.GetDirectoryName(LINQPad.Util.CurrentQueryPath), "uniq-regexes-8.json"));
	var count = 0L;
	var useCount_IStype_to_nPosts = new Dictionary<string, long>();
	var useCount_registry_to_nModules = new Dictionary<string, long>();
	foreach(var line in lines)
	{
		try
		{
			var pattern = JsonSerializer.Deserialize<RegPattern>(line);
			if (pattern.pattern.Contains("++") || pattern.pattern.Contains("?+") || pattern.pattern.Contains("*+"))
			{
				//pattern.Dump(pattern.pattern, collapseTo: 0);
				MergeDictionaries(useCount_IStype_to_nPosts, pattern.useCount_IStype_to_nPosts);
				MergeDictionaries(useCount_registry_to_nModules, pattern.useCount_registry_to_nModules);
				count++;
			}
		}
		catch(JsonException e)
		{
			e.Dump(line, collapseTo: 0);
		}
	}
	count.Dump();
	useCount_IStype_to_nPosts.Dump("useCount_IStype_to_nPosts");
	useCount_registry_to_nModules.Dump("useCount_registry_to_nModules");
}

record RegPattern(string pattern, string[] supportedLangs, string type, Dictionary<string, long> useCount_IStype_to_nPosts, Dictionary<string, long> useCount_registry_to_nModules);

private static void MergeDictionaries(IDictionary<string, long> target, IReadOnlyDictionary<string, long> source)
{
	foreach(var (key,value) in source)
	{
		target.TryGetValue(key, out var targetValue);
		target[key] = targetValue + value;
	}
}

It finds 2088 unique pattern with useCount_IStype_to_nPosts

Key Value
StackOverflowRegexSource 179
RegExLibRegexSource 1

and useCount_registry_to_nModules

Key Value
packagist 889
npm 5830
cpan 293
pypi 159
rubygems 554
maven 113
godoc 146
crates.io 15

What I didn't realize, but the paper found, is that possessive quantifiers using + are supported in Java, PHP, Ruby and Perl, not supported in JavaScript and Go, but worst: the + is seemingly ignored in Rust (see paper page 7 or 449).

I do realize, that supporting this would change the behaviour of the Regular Expression Engine, but I can hardly see where replacing an exception by an actual implementation would yield to severely broken programs. In the end, why would you have a pattern, that always throws an exception in your program? And why is this exception driving your program? There is of course a certain chance of accidental usage in new source code.

Lastly possessive quantifiers are one of the lesser performance hogs (compared to regular quantifiers, recursive patterns or balancing groups), as they rule out backtracking. Having them offers a more concise way of eliminating backtracking than atomic groups, and possessive quantifiers are (in my opinion) easier to maintain, as the are directly obvious at the respective quantifier.

I understand, that they are not high up on your priority list, but I would like to have the opportunity to submit a PR into review.

@stephentoub
Copy link
Member

I'm not concerned about taking something that fails to parse and making it valid; that is, for example, the case for pretty much every new C# feature. My concern would be if there's anything that today is valid with one meaning and this would introduce an ambiguity, or if it would lead to any kind of inconsistencies, or if it would cause any meaningful slowdown in the parser. If adding this syntactic sugar doesn't harm anything existing, I'd personally be ok seeing it added, but it's not a priority for our team to implement.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 17, 2022
@danmoseley
Copy link
Member

@stephentoub since it sounds like we'd take a change for this in principle, can we keep this open in case a community members is interested?

@stephentoub
Copy link
Member

If it's doable in non-breaking manner (e.g. something that parses one way today changes tomorrow), sure.

@danmoseley danmoseley reopened this May 25, 2023
@danmoseley danmoseley added the help wanted [up-for-grabs] Good issue for external contributors label May 25, 2023
@dotnet dotnet unlocked this conversation Nov 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Text.RegularExpressions enhancement Product code improvement that does NOT require public API changes/additions help wanted [up-for-grabs] Good issue for external contributors
Projects
No open projects
Development

No branches or pull requests

6 participants