
[API Proposal]: RegexOptions.Constrained #57891

Closed
stephentoub opened this issue Aug 22, 2021 · 27 comments · Fixed by #60607
Assignees: stephentoub
Labels: api-approved (API was approved in API review, it can be implemented), area-System.Text.RegularExpressions
Milestone: 7.0.0

Comments

@stephentoub
Member

Background and motivation

The .NET regex engine today is an NFA/backtracking-based engine. For .NET 7, we're exploring adding an opt-in DFA-based engine, enabled via a new RegexOptions enum value.

API Proposal

namespace System.Text.RegularExpressions
{
     public enum RegexOptions
     {
          ...
          Constrained = 0x0400
     }
}

Other name choices could include DFA, Deterministic, Safe, Predictable, ...

API Usage

var r = new Regex(..., RegexOptions.Constrained);

Risks

It is opt-in, as the engine will almost certainly have some limitations on what patterns are supported (e.g. lookbehinds), some impact on capture semantics, etc.

@stephentoub stephentoub added area-System.Text.RegularExpressions api-ready-for-review API is ready for review, it is NOT ready for implementation labels Aug 22, 2021
@stephentoub stephentoub added this to the 7.0.0 milestone Aug 22, 2021
@ghost

ghost commented Aug 22, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.


@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Aug 22, 2021
@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Aug 22, 2021
@davidfowl
Member

Do you envision an analyzer that could flag literals with patterns that wouldn't support the constrained enum?

@stephentoub
Member Author

Yes, but we're not there yet. My expectation is an analyzer would also be aligned with a regex source generator (which will necessitate factoring out the parser into an internal reusable component) and wouldn't be specific to this particular option, flagging invalid option combinations and invalid patterns based on the supplied options.

@TonyValenti

I vote for the name "Deterministic".

@svick
Contributor

svick commented Aug 22, 2021

@TonyValenti I don't like that option, because to me, it would imply that the default option is somehow nondeterministic, i.e. that it is not guaranteed to produce the same result every time. (Yes, it is implemented using a nondeterministic finite automaton, but that doesn't make the code itself nondeterministic.)

@TonyValenti

@svick good point! What is the advantage of a DFA vs NFA? Maybe it shouldn't be an option and the engine should pick after analyzing the regex.

@stephentoub
Member Author

What is the advantage of a DFA vs NFA?

The former can provide guaranteed non-exponential bounds on execution time.

Maybe it shouldn't be an option and the engine should pick after analyzing the regex.

For anything other than IsMatch, it needs to be explicit as it's not only limited in what patterns it can support, it changes the execution model in a way that can influence observable results.

@GrabYourPitchforks
Member

@stephentoub Do you imagine any sort of security guarantee that could be made here, e.g. that it's allowable to pass potentially hostile regex patterns as long as it's in constrained mode? If so, I imagine it would be interesting to expose an API "can this pattern be matched in constrained mode?".

@TonyValenti

@GrabYourPitchforks Related to that, I have often wanted a Regex.TryParse method. Today I have to wrap a new(...) inside of a try and catch the exception.

@stephentoub
Member Author

I think it'd be reasonable to have an IsValid(string, RegexOptions) method that takes both a pattern and options and returns whether the combination is acceptable... that could mean the pattern is bad in general, the options themselves are invalid, or the pattern is invalid with those options.
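As a point of comparison (not the proposed .NET API), Go's standard regexp package already exposes this validate-without-throwing shape: Compile returns an error for a bad pattern rather than requiring an exception handler around construction. A minimal sketch of the "TryParse" idea in Go:

```go
package main

import (
	"fmt"
	"regexp"
)

// tryParse mirrors the requested "TryParse"/"IsValid" shape: it reports
// whether a pattern is valid instead of throwing during construction.
func tryParse(pattern string) (*regexp.Regexp, bool) {
	re, err := regexp.Compile(pattern)
	return re, err == nil
}

func main() {
	// Unbalanced parenthesis: invalid, but no exception to catch.
	if _, ok := tryParse(`a(b`); !ok {
		fmt.Println("invalid pattern")
	}
	// Valid pattern: use the compiled regex directly.
	if re, ok := tryParse(`a(bc|.e)`); ok {
		fmt.Println(re.MatchString("abe")) // true
	}
}
```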

@GrabYourPitchforks
Member

The reason I asked about hostile inputs specifically is that we have a bunch of customers (think web-based editors) who allow arbitrary users to enter custom regex patterns for searching text. Right now they have to lock down the regex through a combination of manual pattern inspection and reliance on the default timeout. If we could give them something guaranteed safer, along with an API to use it successfully, I'm sure they'd be interested.

@terrajobst terrajobst added api-needs-work API needs work before it is approved, it is NOT ready for implementation and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Aug 24, 2021
@terrajobst
Member

terrajobst commented Aug 24, 2021

Video

  • It seems we need to find a name that makes it clear what it does; also it seems we want most people to pick this for new regexes, assuming they can live with the caveats.
  • We should discuss this with @stephentoub and others
namespace System.Text.RegularExpressions
{
    public enum RegexOptions
    {
        Constrained = 0x0400
    }
}

@terrajobst terrajobst added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-needs-work API needs work before it is approved, it is NOT ready for implementation labels Aug 24, 2021
@terrajobst
Member

(flipping back to api-ready-for-review so it's on our radar again)

@am11
Member

am11 commented Aug 24, 2021

It is opt-in as the engine will almost certainly have some limitations on what patterns are supported (e.g. look behinds), impact on capture semantics, etc.

Is it a correct takeaway that the future regex source generator will inherit the same limitations as the proposed DFA (no lookbehinds, no sub-expressions/captures), or would it be possible to fully simulate the NFA in the source generator? If the latter is possible, would there still be a reason (performance benefits, design simplicity?) for consumers to prefer the DFA over the existing NFA or a future regex srcgen?

@TonyValenti

Call it RegexOptions.Optimized.

Everyone will use it.

@TonyValenti

And if you don't go with Optimized, I would suggest picking a name that conveys you are gaining a benefit (Optimized, Minimized, etc.) as opposed to losing something.

Also, if there is a TryParse equivalent, the regex constructor could silently call it with the "Optimized" parameter. If it succeeds, it can use the optimized regex; if not, it can fall back to the default.

@stephentoub
Member Author

Call it RegexOptions.Optimized.

No :) That's not what it is.

Let's take a step back...

In the regex world, engines are generally divided into one of two camps, deemed NFA and DFA (while not necessarily strictly tied to the theoretical meaning of those terms in implementation). NFA-based engines process regular expressions more closely aligned with how you might think of them working. Imagine a regex "abc|def|ghi". My brain at least thinks of that being processed as "try to match abc, if that fails try to match def, if that fails try to match ghi". That's basically what an NFA-based engine does, albeit with lots of possible optimizations on top of it. It tries one thing, and if that fails it backs up and tries the next, and if that fails it backs up and tries the next. Etc. That backing up is called backtracking.

NFA-based engines are able to track a lot of information about the current state of processing, which is why, for example, the .NET implementation is able to easily track all captures for a given group within a loop. They can be very efficient... as long as there's not a lot of backtracking involved. Once there is a lot of backtracking, the worst-case efficiency can blow up to exponential cost in the length of the input; this is often referred to as "excessive backtracking" or "catastrophic backtracking", and often occurs when there are loops inside of loops, e.g. try using regex101.com to match the regex (a+)+$ against the input aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab. For most of the languages available to choose from (PCRE, Python, Java, etc.), it'll halt processing early with a message like "Catastrophic backtracking has been detected and the execution of your expression has been halted." .NET is in the same camp, and it's why Regex exposes a timeout capability. Also note that if you choose Golang from the language list, it doesn't produce such a warning: that's because it uses RE2, which is a DFA-based engine...

NFA-based engines are called that because they're logically represented as non-deterministic finite automata, or NFAs. There's a node in a graph for each state of the match, with one or more transitions to other nodes based on the character next consumed from the input. If I had the expression a(bc|.e), you can think of that as a node for a with a transition on the character b to a node for bc and a transition on any character to a node for .e. Of course, b fits the description of "any character", so when we process b, we can logically end up in two states at once. Since we're not dealing with quantum computing here, NFA-based engines employ backtracking: follow one path, and then if that doesn't work out, go back and try the other path.

DFA-based engines instead effectively transform the graph into one where there's only one possible transition for a given character out of any node, with the new node representing all possible current states at once. This means we no longer have backtracking, which means we no longer have the worst-case exponential cliff. But as you can imagine, there are other issues. For one, that graph transformation can result in a huge number of nodes, which directly translates into potentially huge memory consumption. To counteract that, DFA-based engines might lazily compute the graph and throw away nodes to make space for new ones; while you still don't have processing that's exponential in the length of the input, you can significantly increase the typical cost of processing a regex. You also lose the ability to appropriately track certain information... like where a match actually started. For this reason, there are typically other costs involved: to determine the actual text that matched, some DFA-based engines will run the regex to find the end of the match, and then, when one is successfully found, invert the pattern and re-run the regex in reverse from the ending point in order to find the starting point. And engines are often forced to limit what patterns are acceptable.

I'm waving my hands here, but that's the general gist.
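The "two states at once" idea can be made concrete without any backtracking at all. Below is a minimal Go sketch (not .NET's implementation, just an illustration) that simulates the hand-built NFA for a(bc|.e) by stepping a set of possible states through the input, one character at a time:

```go
package main

import "fmt"

// States of a hand-built NFA for a(bc|.e):
//   0 -a-> 1; 1 -b-> 2 (the "bc" branch); 1 -any-> 3 (the ".e" branch);
//   2 -c-> accept; 3 -e-> accept.
const accept = 4

// step advances every currently-possible state on one input character,
// returning the new set of possible states. No path is ever revisited.
func step(states map[int]bool, ch rune) map[int]bool {
	next := map[int]bool{}
	for s := range states {
		switch {
		case s == 0 && ch == 'a':
			next[1] = true
		case s == 1:
			if ch == 'b' {
				next[2] = true // took the "bc" branch
			}
			next[3] = true // '.' matches any char, so also the ".e" branch
		case s == 2 && ch == 'c':
			next[accept] = true
		case s == 3 && ch == 'e':
			next[accept] = true
		}
	}
	return next
}

// matches reports whether the whole text matches a(bc|.e).
func matches(text string) bool {
	states := map[int]bool{0: true}
	for _, ch := range text {
		states = step(states, ch)
	}
	return states[accept]
}

func main() {
	fmt.Println(matches("abc")) // true  (via the "bc" branch)
	fmt.Println(matches("abe")) // true  ('b' also matches '.', so ".e" works)
	fmt.Println(matches("abd")) // false
}
```

Note how the input "ab..." leaves the simulation in states {2, 3} simultaneously; a DFA construction would simply give that state-set its own single node.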

So, .NET currently has an NFA-based engine, which works quite well (and as of .NET 5, very competitively from a performance perspective), but because it does employ backtracking for some patterns, developers need to be careful not to allow arbitrary regex patterns to be supplied, or it could be a DoS vector (in particular if no timeout is supplied), and if a developer themselves authors a pattern that could suffer from catastrophic backtracking, similarly needs to be careful about accepting arbitrary inputs to that regex.

For .NET 7, we've been working with Microsoft Research (MSR) on a new DFA-based implementation, and it's coming along quite well. Now, to the specific questions asked on this thread...

Do you imagine any sort of security guarantee that could be made here, e.g. that it's allowable to pass potentially hostile regex patterns as long as it's in constrained mode?

That is a goal. The primary benefit of this mode is that it can place guarantees on worst-case execution time in the length of the input.

also it seems we want most people to pick this for new regexes

I'm not convinced of that yet. The caveats will include limits on what can be in the pattern (though we will hopefully be able to reduce those to a bare minimum), changes in results (if there are multiple ways a pattern could match input, you might get different answers with the new engine than you do with the current engine), changes in performance characteristics (it's very likely some not-unusual patterns will run slower on average), and changes in capture semantics (functionality like capturing all versions of a capture in a loop will not work).

If a regex is being exposed to untrusted patterns, yes. If a regex is being exposed to untrusted inputs, it would depend on the pattern. If everything is trusted, there's no real benefit to the new engine other than if it ends up running faster for a given expression; it might in some cases, especially if there's backtracking, and will likely be slower in others. Guidance here can evolve as we learn more about what exactly we'll ship and what its characteristics end up being.

Is it a correct takeaway that the future regex source generator will inherit same limitations as the proposed DFA

No. The plan for the regex source generator is to do the same thing as RegexOptions.Compiled does, except emitting C# rather than IL. If you use the source generator with existing options, you'll get C# akin to the source you get today. If you use the source generator with the new RegexOptions.Constrained (or whatever we call it) option, you'd get source generated for the DFA mode (though that hasn't been implemented yet for any form of compilation).

Also, if there is a TryParse equivalent, the regex constructor could silently call it with the "Optimized" parameter. It if succeeds, it can use the optimized regex and if not, it can fall back to the default.

As noted, it's not just about what patterns it supports, but also about the behavior of execution. For IsMatch, we can under the covers switch between engines if that's deemed fruitful and safe. But for Match, Matches, Replace, Split, etc., those execution characteristics have observable behavior impact, and it'd be a breaking change in many situations to switch engine.

I also want to reiterate that this is not an "Optimized" mode, so we should not call it that ;-) It provides guarantees around worst-case behavior, and it might end up making some patterns run much faster on average (especially if there's backtracking involved, e.g. a pattern with lots of alternation and/or loops), but it can also consume significantly more memory and make some patterns run slower.

@jkotas
Member

jkotas commented Aug 25, 2021

What is the code size of the new regex engine? Do we need to worry about making it naturally trimmable? The proposed API won't make it naturally trimmable.

@stephentoub
Member Author

What is the code size of the new regex engine? Do we need to worry about making it naturally trimmable? The proposed API won't make it naturally trimmable.

Right now it's a few hundred K, though I expect that to come down some as common parts are refactored out, dead code from previous incarnations is eliminated, etc. As an option it'll be trimmable to the same extent RegexOptions.Compiled is trimmable today: if you don't pass any RegexOptions, we can trim it away, but if you do, we wouldn't be able to. I do expect it could be trimmed away if you instead use the source generator, just as I'd expect that for the current interpreter and in-memory compiler implementations; that should be a goal for the not-yet-written source generator, including adding new APIs if necessary to enable that.

What do you suggest instead?

@jkotas
Member

jkotas commented Aug 25, 2021

We can consider exposing the new regex engine via a new set of factory APIs instead of a flag. That would make it naturally trimmable. I am not sure whether it is worth it or what it would look like exactly.

@danmoseley
Member

cc @veanes @OlliSaarikivi

@danmoseley
Member

We can consider exposing the new regex engine via new set of factory APIs instead of a flag.

Or as the basis for a new, lighter-weight regex API, more similar to e.g. the PCRE API, without the rich match/group objects (which in some cases the new engine can't support). This API would also be suitable for being fully span-based.

@veanes
Contributor

veanes commented Aug 25, 2021

This name for the option came up in another discussion thread.
It captures quite precisely what the new backend is all about:
RegexOptions.Nonbacktracking
(I could imagine later that RegexOptions.Nonbacktracking|RegexOptions.Compiled for example would cause code generation -- as of now RegexOptions.Nonbacktracking|RegexOptions.Compiled would throw NotSupportedException.)

@stephentoub
Member Author

RegexOptions.Nonbacktracking

Yup, I'd be ok with NonBacktracking.

as of now RegexOptions.Nonbacktracking|RegexOptions.Compiled would throw NotSupportedException

In .NET Core prior to Compiled being implemented, it was just a nop. Seems we could do the same thing here.

@danmoseley
Member

In .NET Core prior to Compiled being implemented, it was just a nop. Seems we could do the same thing here.

Yup, unless we anticipate an observable change in behavior (other than perf) - I assume we don't.

BTW, right now Compiled effectively is used to mean "prioritize throughput and you can assume a JIT exists" and that would continue to be its meaning with the DFA engine. Conversely "prioritize throughput but you can't assume a JIT exists" would use the source generator.

@veanes
Contributor

veanes commented Aug 25, 2021

Sounds good. I'll make RegexOptions.Compiled behave as noop in DFA (Nonbacktracking) mode.

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Aug 31, 2021
@terrajobst
Member

terrajobst commented Aug 31, 2021

  • NonBacktracking seems like a fine name; it doesn't describe exactly what we do (which is good, because it gives the implementation sufficient flexibility) while it still gives the developer a good indication of the resulting trade-offs.
namespace System.Text.RegularExpressions
{
     public enum RegexOptions
     {
          ...
          NonBacktracking = 0x0400
     }
}

@stephentoub stephentoub self-assigned this Sep 9, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Nov 18, 2021