
Regex performance compare with other programming languages #23683

Closed
doggy8088 opened this issue Sep 29, 2017 · 61 comments
Assignees
Labels
area-System.Text.RegularExpressions enhancement Product code improvement that does NOT require public API changes/additions
Milestone

Comments

@doggy8088

There is a mariomka/regex-benchmark repo that runs a regex benchmark across different programming languages. I'm just wondering why C# Regex performance comes in last, way slower than the other languages. Is there any way to speed up Regex performance in .NET, or is there a reason why .NET Regex is that slow?

@Clockwork-Muse
Contributor

Note: I would have expected to achieve about what Java is able to get, at least.

.... the fact that IP addresses are so wildly different across the board suggests some engine differences, or something. Certainly, specifying RegexOptions.ECMAScript cuts the non-ip numbers by about 1/3, which would be telling (still a bad showing, though).

@danmoseley
Member

Thank you for pointing us to this benchmark. @ViktorHofer made a change dotnet/corefx#24158 post 2.0 which should help substantially. They are already passing in the compiled flag, but it is being ignored.

This should give us performance similar to desktop .NET Framework. We would like to do much better though. @vancem has suggested we could create a "lightweight" regex codepath used for the 80% case where the regexes are "simple". The ones in this benchmark look like they might not be "simple enough" however.

We would welcome collaboration on regex performance. Are either of you interested?

@vancem
Contributor

vancem commented Sep 29, 2017

The ones in this benchmark look like they might not be "simple enough" however.

The benchmark lists three expressions

Email: [\w\.+-]+@[\w\.-]+\.[\w\.-]+
URI: [\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?
IPv4: (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])
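These three patterns can be timed with a harness roughly like the following (a sketch only; the tiny input string here is invented, while the repo's actual harness reads a large text file and reports "elapsed - match count"):

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public class Bench
{
    // Time one pattern over an input and report "elapsed - match count",
    // mimicking the shape of the benchmark's output.
    public static int Measure(string name, string pattern, string input)
    {
        var regex = new Regex(pattern, RegexOptions.Compiled);
        var sw = Stopwatch.StartNew();
        int count = regex.Matches(input).Count;
        sw.Stop();
        Console.WriteLine($"{name}: {sw.Elapsed.TotalMilliseconds:F2} - {count}");
        return count;
    }

    static void Main()
    {
        string input = "write to user@example.com or see http://example.com/page?x=1";
        Measure("email", @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+", input);
        Measure("uri", @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?", input);
    }
}
```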

The only one I would not expect the 'simplified' parser to handle is the IPv4 example, because it uses the {3} expression to repeat something 3 times (if you had expanded it out three times, it would fit my conception of 'simplified'). Of course, the whole strategy I laid out was that we would add more complex features as usage dictates.

Personally I don't think the IPv4 regex is a great example from a software engineering perspective. It is trying to keep the numbers < 256, and that is painful to do in a regex. Instead I would expect people to use a simpler pattern like (\d+)\.(\d+)\.(\d+)\.(\d+), convert the strings to numbers, and do the numeric validation as a semantic check.
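A sketch of that split approach (a hypothetical helper, not code from the thread): match the shape with the trivial pattern, then range-check the octets in ordinary code:

```csharp
using System.Text.RegularExpressions;

public static class IPv4
{
    static readonly Regex Shape = new Regex(@"^(\d+)\.(\d+)\.(\d+)\.(\d+)$");

    public static bool IsValid(string s)
    {
        Match m = Shape.Match(s);
        if (!m.Success) return false;
        for (int i = 1; i <= 4; i++)
        {
            // The semantic check replaces the painful
            // 25[0-5]|2[0-4][0-9]|... alternation in the regex.
            if (!int.TryParse(m.Groups[i].Value, out int octet) || octet > 255)
                return false;
        }
        return true;
    }
}
```

Here `IPv4.IsValid("192.168.0.1")` is true, while `IPv4.IsValid("256.1.1.1")` fails the range check.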

@Clockwork-Muse
Contributor

@danmosemsft said:

Thank you for pointing us to this benchmark. @ViktorHofer made a change dotnet/corefx#24158 post 2.0 which should help substantially. They are already passing in the compiled flag, but it is being ignored.

This should give us performance similar to desktop .NET Framework. We would like to do much better though. @vancem has suggested we could create a "lightweight" regex codepath used for the 80% case where the regexes are "simple". The ones in this benchmark look like they might not be "simple enough" however.

It lists it as .NET Core, but desktop doesn't do any better (what I'm on at the moment).

@danmoseley
Member

@Clockwork-Muse that is interesting. Usually compiled helps. This will need some profiling.

@EgorBo
Member

EgorBo commented Sep 29, 2017

btw, on Mono RegexOptions.Compiled is currently ignored due to a bug (it explains why it's so slow in the benchmark).

@vancem
Contributor

vancem commented Sep 30, 2017

One final thing on RegEx. Today when we say 'compiled' we mean generating IL at RUNTIME (using Reflection.Emit). Better than that would be to generate the IL (frankly, C#) at COMPILE TIME. This was always a stumbling block before because we did not have ways of injecting tools at build time without users having to do something painful, but NuGet packages support this and we should take advantage of it (frankly, my expectation is that > 90% of all RegEx patterns are known at COMPILE TIME). It would be straightforward to generate C# code (it would even be readable) that would recognize the pattern and would be SUPER FAST (and would have a nice comment that shows the Regex it recognizes).
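For context, .NET 7 later shipped essentially this idea as the regex source generator; a minimal usage sketch (this post-dates the comment above):

```csharp
using System.Text.RegularExpressions;

public static partial class Patterns
{
    // The source generator emits the matcher as readable C# at build time
    // (instead of Reflection.Emit at runtime), so there is no startup cost
    // and the generated code can be inspected like any other source.
    [GeneratedRegex(@"[\w\.+-]+@[\w\.-]+\.[\w\.-]+")]
    public static partial Regex Email();
}
```

Calling `Patterns.Email().IsMatch("user@example.com")` then runs the generated matcher.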

@tupunco

tupunco commented Sep 30, 2017

In many other languages, 'compile' just refers to the parsing step for a regex, but .NET additionally provides a Compiled option, which is easy to use incorrectly.

I made a serious mistake: I used to think regex engines were all much the same. After careful testing today, I found that this is not true.
The C# benchmark time is spent in the Match process, rather than the Parser process.
It seems the C# regex engine's implementation is really complicated, and its performance needs to be optimized.

@ViktorHofer ViktorHofer self-assigned this Oct 2, 2017
@ViktorHofer
Member

ViktorHofer commented Oct 2, 2017

Better than that would be to generate the IL (frankly C#) at COMPILE TIME

Just some additional information.
In .NET Framework we support this by calling Regex.CompileToAssembly(..) (http://referencesource.microsoft.com/#System/regex/system/text/regularexpressions/Regex.cs,1193), which produces an assembly that can be referenced at compile time. With that you get IntelliSense and all the other benefits, BUT there is still pain involved, as this feature isn't really dynamic at all.
In .NET Core we currently don't support this feature and throw PlatformNotSupportedException (with the latest nightlies) when trying to use the API. We do support IL generation at runtime now, but only in netcoreapp applications, not UWP.

What @vancem suggested with NuGet packages sounds interesting. If we want to, we could even publish common patterns like email addresses, domains, IP addresses and others as compiled regex assemblies. But that's something for future discussions, as we would first need to bring back the mentioned API in .NET Core (which isn't trivial).
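For reference, a minimal sketch of the CompileToAssembly flow mentioned above (it runs on .NET Framework only; the names Precompiled and EmailRegex are invented for the example):

```csharp
using System.Reflection;
using System.Text.RegularExpressions;

class PrecompileDemo
{
    static void Main()
    {
        // Describe the regex to persist; the generated type will be
        // Precompiled.EmailRegex inside Precompiled.Regexes.dll.
        var info = new RegexCompilationInfo(
            @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+",
            RegexOptions.IgnoreCase,
            "EmailRegex",
            "Precompiled",
            true);

        // Writes the assembly to disk; on .NET Core this API throws
        // PlatformNotSupportedException, as noted above.
        Regex.CompileToAssembly(new[] { info }, new AssemblyName("Precompiled.Regexes"));
    }
}
```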

@danmoseley
Member

Costing covers making a plan.

@danmoseley
Member

We have made improvements here but do not plan more for 2.1 that isn't already tracked. Moving to Future to collect future work

@jods4

jods4 commented Aug 13, 2018

Some specific use-cases would benefit greatly from faster Regex.
For example: dev tools/extensions tend to use them extensively to find patterns in code (think // TODO), filter filenames and so on.

The best-performing Regex engine is actually a reusable C library that includes a JIT; it may be worth looking into integrating it rather than writing your own Regex JIT.
Under BSD license, github mirror:
https://github.com/luvit/pcre2

I know every regex engine has its own flavor, but PCRE is compatible with Perl regexes which are rather broad feature-wise!

@vancem's idea of referencing a NuGet package to precompile all static regexes would be awesome for startup time.

EDIT BTW if you want to do some comparisons, someone wrote a mixed assembly wrapper around PCRE, here: https://github.com/ltrzesniewski/pcre-net

@ViktorHofer
Member

Thanks, we know about both pcre and pcre-net. We had offline discussions about adopting pcre but nothing concrete yet.

@iSazonov
Contributor

iSazonov commented Oct 11, 2018

  1. Need to mention ripgrep https://blog.burntsushi.net/ripgrep/
    It uses the Rust regex library (based on RE2) and has other optimizations (like an Intel Teddy re-implementation).
    It is worth studying.

  2. In the PowerShell repo we discovered that Select-String cmdlet performance is very bad. There are two reasons for this.

  • The slow Regex CoreFX engine
  • Tons of string allocations.
    Regex only works with strings, and we have to read files line by line, which leads to a huge number of string allocations. We really need to enhance Regex to work not only with strings but also with ReadOnlySpan and buffers (perhaps incrementally).
    It seems the enhancement doesn't require rewriting the whole Regex engine and could be implemented in the near future. Thoughts?

frankly, my expectation is that > 90% of all RegEx patterns are known at COMPILE TIME

@vancem's expectation seems to be true. I would add that we can expect most regex patterns to be simple, because complex patterns are difficult to write and difficult to debug. (A developer can always break a complex pattern into pieces and apply them sequentially without loss of performance.) If so, a developer could choose a simpler and faster implementation of the RegEx engine. I mean that we could have several different implementations and allow the developer to choose the option that best suits their situation.
The options could be a simple string search, a fast but feature-limited DFA, and a full-featured NFA engine.

@danmoseley
Member

@ViktorHofer would your proposed Span-based API likely fix the string allocation problem? If so, was the reason it is on pause that it was feasible to implement for the interpreted codepath but rather expensive to implement for the compiled codepath? If so, I wonder if the benefit is sufficiently large that we could implement it only for interpreted, and throw (for now) for compiled.

@iSazonov just curious, do your regexes tend to be compiled?

@iSazonov
Contributor

@danmosemsft If your question is about the Select-String cmdlet, then I would expect that we need to compile regex patterns. (But I see in the cmdlet code that we are not doing this, and it should be fixed.)

@danmoseley
Member

Compiling a pattern is not necessarily the right choice. I believe they can't be unloaded once compiled.

@iSazonov
Contributor

In general, yes, but PowerShell can internally compile script blocks for performance, so compiling a regex is not a problem. We could get a performance boost for large files, especially with TC.

@iSazonov
Contributor

Taking a closer look at ripgrep, I discovered that it uses heuristics to select one of 7 (!) search methods: https://github.com/rust-lang/regex/blob/77140e7c1628ac31026d2421c6a4c3b0eb19506c/src/literal/mod.rs#L39
It will definitely take a lot of work for the .NET Core library to catch up with the search leaders.

It seems the Span.Sort() API was already approved. Did the MSFT team consider a Span.Match()/IsMatch() option?
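Span-based overloads along these lines did eventually land; a sketch using the shape they took in .NET 7 (well after this comment):

```csharp
using System;
using System.Text.RegularExpressions;

class SpanDemo
{
    static void Main()
    {
        // Match against a slice of a larger buffer without allocating a substring.
        ReadOnlySpan<char> line = "status: error 404 not found".AsSpan();

        bool hasCode = Regex.IsMatch(line, @"\b\d{3}\b");
        Console.WriteLine(hasCode); // True

        foreach (ValueMatch m in Regex.EnumerateMatches(line, @"\b\d{3}\b"))
        {
            // ValueMatch is a ref struct: no Match/Group objects are allocated.
            Console.WriteLine(line.Slice(m.Index, m.Length).ToString()); // 404
        }
    }
}
```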

@GSPP

GSPP commented May 6, 2019

The .NET Regex engine executes expressions through backtracking. Some other engines build a finite state automaton instead. It is possible to match a string in O(N) time as opposed to possibly exponential time with backtracking.

Two downsides:

  1. Engineering work
  2. This does not support all features (e.g. backreferences)

Most regex expressions that occur in practice are supported. FSA matching is extremely fast.

If we care about regex performance we should not try to squeeze out a few percent here or there. The switch to an FSA implementation (for supported expressions, which are most of them) can provide an exponential speedup.
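The exponential case is easy to reproduce in a backtracking engine (a toy demo, not from the comment; the input is kept short so it terminates quickly):

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class BacktrackDemo
{
    static void Main()
    {
        // "^(a+)+$" forces a backtracking engine to try roughly 2^n ways of
        // partitioning the run of 'a's once the trailing '!' rules out any
        // match; an FSA-based engine rejects the same input in linear time.
        string input = new string('a', 24) + "!";
        var sw = Stopwatch.StartNew();
        bool matched = Regex.IsMatch(input, @"^(a+)+$");
        sw.Stop();
        Console.WriteLine($"matched={matched} in {sw.ElapsedMilliseconds} ms");
    }
}
```

Growing the input by one character roughly doubles the match time, which is the pathology FSA matching avoids.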

Also, compilation is very important as a constant factor optimization. It was added recently to .NET Core.

@iSazonov
Contributor

iSazonov commented May 7, 2019

I also think that we could have several engines and choose the right one after regex parsing.

@danmoseley
Member

danmoseley commented May 8, 2019

I do agree regex performance has fallen behind. The way forward may not be major engineering work in the existing Regex engine. As well as the cost there is some risk of breaking behavior (possibly necessarily).

My thinking was we may want to do a combination of

  1. selected tweaks to the existing engine, such as maybe API that takes span.
  2. offer an alternative engine based on existing code, possibly behind a new API (the existing API has some inefficiencies, such as creating objects for groups/matches/captures, and some idiosyncrasies, such as .NET being the only engine that supports a captures collection as far as I know; it also returns all matches at once)

For 2. I did some experimentation with PCRE as discussed here
ltrzesniewski/pcre-net#12
https://github.com/dotnet/corefxlab/compare/master...danmosemsft:pcre?expand=1
Although in that case I stuck it behind the existing API, I began to think it would make better sense to offer a smaller, more optimized API that better matched PCRE semantics.

@danmoseley
Member

cc @ltrzesniewski

@GSPP

GSPP commented May 8, 2019

The best quality of implementation would be to upgrade the existing API. Exact same syntax and API surface. It might make sense to add a parallel API surface that is more optimized (less allocations etc.).

I understand it's attractive to just drop the existing API and create something new in a greenfield way. The better solution for customers, though, is to avoid that fragmentation and just fix the engine. This requires thorough compatibility work but surely this is possible. Does Microsoft not have access to large code bodies that use Regex? It must be possible to compile a compatibility test suite.

If effort was not a concern we'd create a new fast engine for supported expressions (FSA based with no backtracking) and keep the old engine for unsupported expressions. Almost all practically occurring expressions are formed from a rather simple syntax subset. This would be completely transparent to existing code. Customers do not need to learn new ways of doing things.

@danmoseley
Member

just fix the engine

This is not cheap. Writing a new regex engine that is competitive is a big investment and would take time. Where possible it would be good to build on the work others have done and spend our dev dollars in other places. Making it fully compatible is likely an opposing requirement anyway.

Almost all practically occurring expressions are formed from a rather simple syntax subset.

You are quite right - we did an audit of a large quantity of code and this is correct. You can get an idea of the nuances of existing behavior by looking at the tests I disabled in the branch above to get tests passing against PCRE.

@ltrzesniewski
Contributor

I began to think it would make better sense to offer a smaller, more optimized API that better matched PCRE semantics.

That's understandable if performance is the only thing you're going after, but IMO that would be a missed opportunity. PCRE has a very rich feature set which you can leverage to do things that are just not possible in other engines. That's why I decided to support everything in PCRE.NET - it gives the user so many tools.

There's one feature I left out because I deemed it unnecessary and I had provided a suitable equivalent. And sure enough, just yesterday someone posted an issue asking me to implement it. People want these features. :)

Of course, you could provide simpler APIs that would use faster code paths alongside more feature-complete APIs to get the best of both worlds.

Writing a different API for each regex engine makes sense from a feature availability standpoint.

The .NET Regex engine executes expressions through backtracking. Some other engines build a finite state automaton instead. It is possible to match a string in O(N) time as opposed to possibly exponential time with backtracking.

PCRE (and thus PCRE.NET) provides both implementations. But in the case of PCRE, the O(N) algorithm is not necessarily faster. Read here for more details, but in short:

It is substantially slower than the standard algorithm. This is partly because it has to search for all possible matches, but is also because it is less susceptible to optimization.

Also, PCRE doesn't implement compiled regexes with this algorithm. Engines like RE2 that are designed from the ground up with this algorithm in mind are probably faster.
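For reference, basic PCRE.NET usage looks roughly like this (a sketch based on the pcre-net repo; treat the exact API shape as an assumption, and the PCRE.NET NuGet package must be referenced):

```csharp
using PCRE; // from the PCRE.NET NuGet package

class PcreDemo
{
    static void Main()
    {
        // PcreRegex exposes the usual IsMatch/Match surface on top of PCRE2.
        var regex = new PcreRegex(@"[\w.+-]+@[\w.-]+\.[\w.-]+");
        System.Console.WriteLine(regex.IsMatch("user@example.com"));
    }
}
```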

@keith-hall

I wonder whether it could help to make changes to the "regular expression pattern parser" rather than in the matching engine itself. Sort of like static analysis for regex patterns, and rewriting the patterns to get maximum performance from the existing engine without changing the match results.

Specifically, I'm thinking of functionality like "unrolling the loop", or automatically wrapping some subexpressions in an atomic group to curb unnecessary backtracking if there is no chance of it finding a match anyway. (Please forgive me for the lack of concrete examples here, I am on mobile atm.)
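A toy illustration of the atomic-group idea (my example, not from the comment above):

```csharp
using System;
using System.Text.RegularExpressions;

class AtomicDemo
{
    static void Main()
    {
        // On "aaaa" (no 'b'), "a+b" backtracks: after the greedy run of 'a's
        // fails to find 'b', the engine gives back one 'a' at a time and
        // retries. "(?>a+)b" makes the run atomic, so it fails immediately.
        // Both patterns produce identical match results on every input.
        bool naive  = Regex.IsMatch("aaaa", "a+b");
        bool atomic = Regex.IsMatch("aaaa", "(?>a+)b");
        Console.WriteLine($"{naive} {atomic}"); // False False
    }
}
```

A pattern rewriter could apply such a transformation automatically wherever it can prove the retries can never succeed.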

I guess this would only really help when the original author of the expression didn't already try to optimize it, or decided it would be more maintainable in a simpler form etc., but it could perhaps be useful for cases where users can provide their own patterns, what do you think? I imagine the development effort would be a lot smaller than trying to rewrite the match engine or support third party engines? It could even be part of a separate API that developers could opt-in to, in case that is a concern.

That said, one thing I would find useful, and am considering working on myself when/if time permits (and submitting as a NuGet package), would be a "regex AST". (Currently, the patterns that are parsed are stored in classes marked as internal.) This could power the "static analysis" I suggested earlier, and have the advantage of being able to "rewrite" a pattern originally written for one engine to be suitable for use in another. (i.e. whether \h means horizontal whitespace or a hexadecimal digit etc.)
The reason I mention this, is that if multiple engines will be supported in future, I imagine it would be beneficial to have this functionality as part of the framework. :)

@GSPP

GSPP commented May 9, 2019

I like the idea of having a new regex engine in .NET. In that sense I like the work proposed in ltrzesniewski/pcre-net#12.

This would not be affiliated with the official .NET libraries, right? It could be an independent project.

If such a powerful library exists then I agree that there is limited need for improving Regex. It seems better to recommend to users: If performance is not a concern use the most convenient tool. If performance is important use the alternative engine. It's just a NuGet package away.

As a .NET user I'd rather see the engineering time spent elsewhere given these considerations.

@ltrzesniewski
Contributor

I like the idea of having a new regex engine in .NET. In that sense I like the work proposed in ltrzesniewski/pcre-net#12.

PCRE.NET no longer uses C++/CLI, I've rewritten the interop part entirely so I could target netstandard2.0.
It also now supports Windows, Linux and OS X.

I left that issue open for the perf aspect, on which I haven't worked much.

@danmoseley
Member

Note that there have been some big improvements made to Regex perf in master (since 3.1 forked). Perhaps someone is interested in measuring with those.

@iSazonov
Contributor

@danmosemsft Could you please point to the PRs?

@danmoseley
Member

@iSazonov these. I do not know whether @stephentoub plans more work.

https://github.com/dotnet/runtime/pulls?q=is%3Apr+label%3Aarea-System.Text.RegularExpressions+is%3Aclosed+author%3Astephentoub

It would be fun to get more benchmark data, maybe I can do the mariomka benchmark at the top when I have access to that machine at home.

@danmoseley
Member

These significant recent wins suggest to me that the existing engine is not at the limit of where we can get it, and that it may be less important to offer a way to plug another engine in behind the existing API (because the existing API is faster now).

I think the future is likely a combination of both

  1. continue optimizing existing engine, possibly adding new API for Span, etc.
  2. encourage and possibly contribute to alternative .NET API such as eg PCRE.NET

@danmoseley
Member

cc @pgovind fyi.

@stephentoub
Member

I do not know whether @stephentoub plans more work.

I do.

recent wins suggest to me the existing engine is not at the limit of where we can get it

I agree. There are still things that can be done to make further meaningful improvements beyond what's already been done for .NET 5.

@pgovind
Contributor

pgovind commented Jan 4, 2020

Cool. I'll try to run the benchmarks next week on my machine to see where we're at.

@stephentoub
Member

(been staying off GitHub while on vacation, but I'll be putting up another significant PR early next week)

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@stephentoub
Member

For what it's worth, here are my results (on Ubuntu) using the benchmark at the top of this issue, targeting 2.2 and with the compile flag switched back on (it looks like they removed it).

@danmosemsft, can you run your test again, but using .NET 5 master?

@danmoseley
Member

danmoseley commented May 4, 2020

Pasted from mariomka/regex-benchmark#14 (comment)

| Language | Email (ms) | URI (ms) | IPv4 (ms) | Total (ms) |
|---|---|---|---|---|
| PHP | 26.57 | 24.27 | 8.20 | 59.05 |
| C PCRE2 | 29.40 | 25.70 | 7.15 | 62.25 |
| Rust | 30.90 | 29.14 | 8.29 | 68.33 |
| C++ Boost | 62.97 | 62.49 | 23.13 | 148.58 |
| Javascript | 95.32 | 67.16 | 2.19 | 164.68 |
| Perl | 151.18 | 90.58 | 33.98 | 275.74 |
| Crystal | 172.62 | 148.42 | 19.01 | 340.04 |
| C# .Net Core 50 ECMA | 215.35 | 172.11 | 17.35 | 404.80 |
| D ldc | 223.99 | 221.51 | 6.71 | 452.21 |
| C# .Net Core 50 | 314.79 | 254.98 | 17.40 | 587.17 |
| D dmd | 292.91 | 303.55 | 8.84 | 605.29 |
| Ruby | 365.49 | 324.14 | 59.95 | 749.58 |
| Kotlin | 237.89 | 300.62 | 408.42 | 946.93 |
| Java | 233.96 | 317.21 | 429.29 | 980.46 |
| Python 2 | 317.74 | 210.41 | 521.91 | 1050.06 |
| Go | 411.11 | 399.17 | 634.96 | 1445.24 |
| C++ STL | 604.55 | 488.13 | 422.20 | 1514.88 |
| C# .Net Core 31 | 1504.76 | 1256.17 | 104.65 | 2865.57 |
| C# Mono | 3923.71 | 3301.79 | 218.56 | 7444.07 |

I also offered a PR to add RegexOptions.ECMAScript (i.e., to give the ECMA results above) since none of the C, Rust, C++ and Java versions are doing Unicode matching.

@iSazonov
Contributor

iSazonov commented May 4, 2020

since none of the C, Rust, C++ and Java versions are doing Unicode matching.

I wonder if Rust does not understand UTF-8.

@danmoseley
Member

@iSazonov see mariomka/regex-benchmark#26

@danmoseley
Member

Note that #35824 should take this down substantially further. But still a fair bit off Rust

@EgorBo
Member

EgorBo commented May 8, 2020

I've tested recent Mono (llvm-jit mode) vs. a daily .NET 5 build on this benchmark.
Mono is 30% slower than CoreCLR.

@danmoseley
Member

@EgorBo that would be recent Mono with all of @stephentoub's improvements to the library, I assume. How does it compare with/without RegexOptions.Compiled? Are both regressed, or is it specific to ref-emit?

@EgorBo
Member

EgorBo commented May 9, 2020

@EgorBo that would be recent Mono with all of @stephentoub's improvements to the library, I assume. How does it compare with/without RegexOptions.Compiled? Are both regressed, or is it specific to ref-emit?

Yes, it's a mono built from dotnet/runtime master so it uses all the latest Regex changes.
Numbers for "Compiled" mode:

coreclr:
7.1213 - 5
51.5561 - 92
54.7056 - 5301

mono-llvm-jit:
13.63 - 5
86.0223 - 92
87.768 - 5301

mono-jit:
13.1713 - 5
101.0694 - 92
103.0671 - 5301

For non-Compiled mode the difference is smaller between coreclr and mono-jit-llvm (~30%)
NOTE: Mono can be faster in AOT-LLVM mode (because it runs more LLVM optimizations than LLVM in JIT mode), but it looks like it's broken at the moment; I will try again once I fix it.

@danmoseley
Member

Thanks. What's your interpretation of the numbers - is this hitting an area in mono that is known to have a perf gap?

@EgorBo
Member

EgorBo commented May 9, 2020

Thanks. What's your interpretation of the numbers - is this hitting an area in mono that is known to have a perf gap?

Not sure; I had a quick profiler run (via Xcode Instruments), and it looks like a lot of time is spent in the GC.

@eerhardt eerhardt removed the untriaged New issue has not been triaged by the area owner label Jul 8, 2020
@eerhardt
Member

eerhardt commented Jul 8, 2020

We did a lot of perf work in this area in 5.0. Can this issue be closed now?

@danmoseley
Member

I think so - we're doing very substantially better on both regex-redux and the mariomka benchmark. If there's more work, it should probably be driven by a new benchmark or insights. Also, @EgorBo there's a Mono-specific issue; that should get its own issue, not least because it would be different engineers working on it. (Perhaps it is important for Blazor? @marek-safar)

@eerhardt
Member

eerhardt commented Jul 8, 2020

Perhaps it is important for Blazor?

One of the 3 Blazor test apps we are working with uses Regex -

https://github.com/SteveSandersonMS/CarChecker/blob/e9c7e7114e7936e38389074f1b5005e4b27a0e8e/Shared/VehiclePart.cs#L43-L50

Note that on WASM, RuntimeFeature.IsDynamicCodeCompiled always returns false, which means the Regex will never use the IL Compiler and instead will always be interpreted.

// if the compile option is set, then compile the code
if (RuntimeFeature.IsDynamicCodeCompiled && UseOptionC())
{
    factory = Compile(pattern, _code!, options, matchTimeout != InfiniteMatchTimeout);
    _code = null;
}

@marek-safar
Contributor

I'm not aware of any regex perf problems/asks for browser

@dotnet dotnet locked as resolved and limited conversation to collaborators Dec 20, 2020