Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Text.RegularExpressions work planned for .NET 7 #62758

Closed
40 of 48 tasks
Tracked by #57209
joperezr opened this issue Dec 14, 2021 · 12 comments
Closed
40 of 48 tasks
Tracked by #57209

System.Text.RegularExpressions work planned for .NET 7 #62758

joperezr opened this issue Dec 14, 2021 · 12 comments
Assignees
Labels
area-System.Text.RegularExpressions Epic Groups multiple user stories. Can be grouped under a theme. Priority:1 Work that is critical for the release, but we could probably ship without Team:Libraries
Milestone

Comments

@joperezr
Copy link
Member

joperezr commented Dec 14, 2021

This issue captures the planned work for .NET 7. This list is expected to change throughout the release cycle according to ongoing planning and discussions, with possible additions and subtractions to the scope.

Summary

There are currently several efforts for fixing/improving our RegularExpressions engines. This includes test coverage efforts, perf-enhancing efforts, but a lot of new feature work as well. Even though we have individual issues for most of these efforts, this is meant to serve as an all-up tracking issue which shows all of these efforts and their category. Some issues could probably be placed in more than one category, but I tried to pick whichever category had the highest order bit for each. The list here will be updated with new work as it comes up, as well as with expected shipping milestones when we think some of them will land.

Planned for .NET 7

Main Regex Investments

API Proposals

General Perf Improvements

General Improvements

Improve Test Coverage

Non-Backtracking Engine Fixes and Improvements

Source Generator Fixes and Style Improvements

Performance Regressions That Need Fixing

Performance Investigations

cc: @stephentoub @danmoseley @olsaarik @veanes

@joperezr joperezr added area-System.Text.RegularExpressions tracking This issue is tracking the completion of other related issues. labels Dec 14, 2021
@joperezr joperezr added this to the 7.0.0 milestone Dec 14, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Dec 14, 2021
@ghost
Copy link

ghost commented Dec 14, 2021

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Summary

There are currently several efforts for fixing/improving our RegularExpressions engines. This includes test coverage efforts, perf-enhancing efforts, but a lot of new feature work as well. Even though we have individual issues for most of these efforts, this is meant to serve as an all-up tracking issue which shows all of these efforts and their category. Some issues could probably be placed in more than one category, but I tried to pick whichever category had the highest order bit for each. The list here will be updated with new work as it comes up, as well as with expected shipping milestones when we think some of them will land.

Main Regex Investments

API Proposals

General Perf Improvements

General Improvements

Improve Test Coverage

Non-Backtracking Engine Fixes and Improvements

Source Generator Fixes and Style Improvements

Performance Regressions That Need Fixing

Performance Investigations

cc: @stephentoub @danmoseley @olsaarik @veanes

Author: joperezr
Assignees: -
Labels:

area-System.Text.RegularExpressions, tracking

Milestone: 7.0.0

@joperezr joperezr removed the untriaged New issue has not been triaged by the area owner label Dec 14, 2021
@joperezr joperezr self-assigned this Dec 14, 2021
@danmoseley
Copy link
Member

thanks for pulling this together.

A user story is essentially a significant unit of user visible value that can be completed independently.

So this feels like it can probably actually be several issues each labeled with "user story" eg

  • Predictable regular expression performance, using a non-backtracking engine
  • Faster, no-JIT regular expressions, using a source generator
  • Higher performance regular expression matching // random perf stuff
  • More robust regular expression engines // for stuff like test coverage improvements, general fixes and robustness?

All the stories would ultimately be parented by a single Theme, not sure what that will be yet.

@jeffhandley jeffhandley added Team:Libraries Priority:1 Work that is critical for the release, but we could probably ship without Epic Groups multiple user stories. Can be grouped under a theme. and removed tracking This issue is tracking the completion of other related issues. labels Jan 10, 2022
@jeffhandley jeffhandley changed the title Planned work for System.Text.RegularExpressions for 7.0.0 System.Text.RegularExpressions work planned for .NET 7 Jan 14, 2022
@huan086
Copy link

huan086 commented Jun 16, 2022

Feature request: can a more helpful message be shown when an OutOfMemoeyException is thrown in the regex source generator? Currently it shows OutOfMemoeyException and the generic message that source generation has failed and the code may not work correctly. It would be helpful to link to best practices to simplify the regex, and another link to show how to increase memory limit

Background: I have a 1 million character regex which is a word list in different languages, each word separated by |. It works (slowly) in regex interpreted mode. Hoping to increase performance using source generator but ran into OOM

@stephentoub
Copy link
Member

I have a 1 million character regex which is a word list in different languages

Are you able to share your regex? It'd be interesting to see.

@huan086
Copy link

huan086 commented Jun 21, 2022

I have a 1 million character regex which is a word list in different languages

Are you able to share your regex? It'd be interesting to see.

This regex labels each location name it finds with the Geoname ID. The interpreted regex typically runs within 5 seconds for a sentence to 60 seconds for paragraph of about 500 words.

Using a regex instead of a loop and IndexOf to take advantage of matching the longest name. E.g. finding the location "New York City" and "New York" will only match "New York City".

Cutting down the number of locations to 1/100 of the list below allows the RegexGenerator to run successfully, creating a source file that is 14MB

https://1drv.ms/t/s!AoUkMG56ao1eq_A4tUTHghOmnndp8A?e=zFPSgH

@joperezr
Copy link
Member Author

How common is it for names of Geoname Id's to be substrings of other geoname Id's? I want to try to understand why you think Regex is the best solution to your problem.

@huan086
Copy link

huan086 commented Jun 23, 2022

How common is it for names of Geoname Id's to be substrings of other geoname Id's? I want to try to understand why you think Regex is the best solution to your problem.

You are right, turns out Regex isn't the correct solution. A loop on every location name, finding all positions of the location name in text repeatedly using IndexOf, and handling the few cases where locations are substrings of other locations by looking at the positions matched before, runs faster. The Aho–Corasick algorithm mention in https://stackoverflow.com/a/498349/263003 would be the optimization I could use.

Originally I thought finding a set of strings within an input text is most straightforward with a Regex.

Back to the original comment I made, it would be helpful to offer some advice when bad regex pattern creates OOM.

@joperezr
Copy link
Member Author

joperezr commented Jun 27, 2022

it would be helpful to offer some advice when bad regex pattern creates OOM.

That's good feedback, and we can think of something to solve that. I do think it is very rare to have 1 million character RegEx patterns, so I do think that OOM (at least in most of our engines) should be rare and more of an edge case. For .NET 7, we are introducing a new engine: The Non-Backtracking engine, which uses a DFA and NFA based algorithms underneath, which works mainly as a way to ensure linear processing times in terms of the input. This new engine does have a mechanism similar to what you suggest, and will in fact throw an exception when you call the constructor if your pattern is too large and could cause an OOM exception. We can probably think about sharing this concept of catching potential problematic patterns at construction time for the other engines. For more info on the new non-backtracking engine, and for knowledge on the rest of the improvements we've been working on for .NET 7, I would suggest checking out @stephentoub's regex blog post.

@jzabroski
Copy link
Contributor

jzabroski commented Jul 18, 2022

@stephentoub I read through your blog post, as @joperezr suggested, and I think there's a feature missing here that I am somewhat surprised didn't get even punted to a .NET 8 backlog somewhere, or at least deliberately listed as "out of scope" for time constraints:

Using Regex named groups, create structured DTOs to save the match data:

// Define DTO that holds match results
public partial class RegexDto
{
}

// Define regex
[RegexGenerator("(?<DayOfWeek>Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)", nameof(RegexDto))]
private static partial Regex MyCoolRegex();

// Would generate partial class equal to...
public partial class RegexDto
{
  public Group DayOfWeek { get; set; } 
}

// Perform match operation 
var regexDto = MyCoolRegex().Match(someText);

// New school approach to reading groups
var dayOfWeek = regexDto.DayOfWeek.Success
    ? Enum.Parse(typeof(DayOfWeek), regexDto.DayOfWeek.Value)
    : (DayOfWeek?)null);
// Old school approach to reading groups
var dayOfWeek = regexDto.Groups["DayOfWeek"].Success
    ? Enum.Parse(typeof(DayOfWeek), regexDto.Groups["DayOfWeek"].Value)
    : (DayOfWeek?)null);

The Visual Studio IntelliSense algorithm already has some basic knowledge around this, but it's currently strings-only. Here is a screenshot of some stupid regex I wrote, to munge some odd SQL tool data format:

image

Additional benefit would be possible eventual removal of intermediary allocations to GroupCollection and CaptureCollection.

Additional benefit would be that IntelliSense doesnt catch you if you copy-paste Regex code from somewhere else, and have a group name that doesnt exist in the regex itself.

Additionally, something I have always found rather silly about how Visual Studio pretty prints regexes, is that it doesn't provide an UI to visually aid in seeing each group as a standalone value. It is thus very hard to see whether parens are balanced or not. I've never understood why it just doesn't work like this, with underlines showing the named group zones:

image

Having a slightly different background color for groups would also be pretty helpful. - like Benjamin Moore Yellow Sorbet or Benjamin Moore Sun Dance.

The only reason I can think of, is analysis paralysis, where developers wondered how to visualize nested named groups. But I'd be willing to bet that most regex's in the wild are relatively flat in nature, as most people tend to think of "Named Groups as Parser Tokens", including myself. - Anything not in a named group is just "epsilon" or non-material "whitespace". For example, by far the most common use I have for regex's is a simple way to match filenames. I'm guessing this is one of the main uses people have for regexes: matching things like file name extension plus some filename prefix plus some formatted date stamp/datetime stamp munged somewhere before the extension, or some kind of monotonically increasing sequence id.

@jzabroski
Copy link
Contributor

The Aho–Corasick algorithm mention in https://stackoverflow.com/a/498349/263003 would be the optimization I could use.

This is actually mentioned above: #62447

In general, it would be nice if Aho-Corasick itself were a utility class. I had written a post somewhere on reddit on my experiments creating very fast needle in haystack string search using a parallel, C#-based, Aho-Corasick. The actual use case was for something like this. Except, I rather prefer an interface that models the data structure first, and therefore it built a Trie object, vs. generated a switch statement that sliced through the string as though it were a flattened trie. Having a Trie-centric approach would be more useful for things like IDEs where you'd want to be able to redo an algorithm after some inserts.

@stephentoub
Copy link
Member

@joperezr, @jeffhandley, I've never been clear on the purpose of these epic issues. Does this need to remain open? As these are just links to other issues, they don't track anything themselves that isn't tracked elsewhere.

@joperezr
Copy link
Member Author

Yes this can be closed. This was more for my personal tracking and planning, but I agree that the remaining work is tracked by individual issues so no point any longer on keeping this open.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.RegularExpressions Epic Groups multiple user stories. Can be grouped under a theme. Priority:1 Work that is critical for the release, but we could probably ship without Team:Libraries
Projects
None yet
Development

No branches or pull requests

6 participants