Regex performance compare with other programming languages #23683
Note I would have expected to achieve about what Java is able to get, at least. The fact that IP addresses are so wildly different across the board suggests some engine differences, or something. Certainly, specifying
Thank you for pointing us to this benchmark. @ViktorHofer made a change post 2.0 (dotnet/corefx#24158) that should help substantially: the benchmark is already passing in the Compiled flag, but it is currently being ignored. This should give us performance similar to the desktop .NET Framework. We would like to do much better, though. @vancem has suggested we could create a "lightweight" regex codepath used for the 80% case where the regexes are "simple". The ones in this benchmark look like they might not be "simple enough", however. We would welcome collaboration on regex performance. Are either of you interested?
The benchmark lists three expressions
The only one I would not expect the 'simplified' parser to handle is the IPv4 example, because it uses the {3} expression to repeat something 3 times (if you had expanded it out three times, it would fit my conception of 'simplified'). Of course, the whole strategy I laid out was that we would add more complex features as usage dictates. Personally I don't think the IPv4 regex is a great example from a software engineering perspective. It is trying to keep the numbers < 256, and that is painful to do in a regex. Instead I would expect people to use a simpler pattern like (\d+)\.(\d+)\.(\d+)\.(\d+), convert the strings to numbers, and do the numeric validation as a semantic check.
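To make that suggestion concrete, here is a minimal sketch of the approach (the class and method names are hypothetical): match the structure with a simple pattern, then validate the octet ranges numerically instead of encoding them in the regex.

```csharp
using System;
using System.Text.RegularExpressions;

static class Ipv4Check
{
    // Simple structural pattern; the 0..255 range check is done in code,
    // not in the regex itself.
    private static readonly Regex Pattern =
        new Regex(@"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$");

    public static bool IsValidIpv4(string s)
    {
        Match m = Pattern.Match(s);
        if (!m.Success) return false;
        for (int i = 1; i <= 4; i++)
        {
            // Semantic check: each octet must parse and be <= 255.
            if (!int.TryParse(m.Groups[i].Value, out int octet) || octet > 255)
                return false;
        }
        return true;
    }
}
```

With this split, `IsValidIpv4("192.168.0.1")` succeeds while `IsValidIpv4("256.1.1.1")` fails, and the pattern stays readable.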
@danmosemsft said:
It lists it as .NET Core, but desktop (what I'm on at the moment) doesn't do any better.
@Clockwork-Muse that is interesting. Usually compiled helps. This will need some profiling.
btw, on Mono RegexOptions.Compiled is currently ignored due to a bug (which explains why it's so slow in the benchmark).
One final thing on RegEx. Today when we say 'compiled' we mean generating IL at RUNTIME (using Reflection.Emit). Better than that would be to generate the IL (frankly, C#) at COMPILE TIME. This was always a stumbling block before because we did not have ways of injecting tools at build time without users having to do something painful, but NuGet packages support this and we should take advantage of it (frankly, my expectation is that > 90% of all RegEx patterns are known at COMPILE TIME). It would be straightforward to generate C# code (it would even be readable) that would recognize the pattern and would be SUPER FAST (and would have a nice comment that shows the regex it recognizes).
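As an illustration of the idea (this is a hand-written sketch, not actual generator output), compile-time-generated C# for a trivial pattern such as ^[0-9]+$ could be as simple as:

```csharp
// Recognizes the pattern ^[0-9]+$ (hypothetical generated code;
// a real generator would emit this from the pattern at build time).
static class DigitsMatcher
{
    public static bool IsMatch(string input)
    {
        if (input.Length == 0) return false;
        foreach (char c in input)
        {
            // Any non-ASCII-digit character fails the match.
            if (c < '0' || c > '9') return false;
        }
        return true;
    }
}
```

(For context: this idea later shipped in .NET 7 as the [GeneratedRegex] source generator.)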
I have made a
Just some additional information. What @vancem suggested with NuGet packages sounds interesting. If we want to, we could even publish common patterns like email addresses, domains, IP addresses and others as compiled regex assemblies. But that's something for future discussions, as we would first need to bring back the mentioned API in .NET Core (which isn't trivial).
Costing covers making a plan. |
We have made improvements here but do not plan more for 2.1 beyond what is already tracked. Moving to Future to collect future work.
Some specific use cases would benefit greatly from faster regex. The best-performing regex engine is actually a reusable C library that includes a JIT; maybe it's worth looking into integrating it rather than writing your own regex JIT. I know every regex engine has its own flavor, but PCRE is compatible with Perl regexes, which are rather broad feature-wise! @vancem's idea of referencing a NuGet package to precompile all static regexes would be awesome when it comes to startup time. EDIT: BTW, if you want to do some comparisons, someone wrote a mixed-assembly wrapper around PCRE here: https://github.com/ltrzesniewski/pcre-net
Thanks, we know about both PCRE and PCRE.NET. We have had offline discussions about adopting PCRE, but nothing concrete yet.
The @vancem expectation seems to be true. I would add that we can expect most regex patterns to be simple, because complex patterns are difficult to write and difficult to debug. (A developer can always break a complex pattern into pieces and apply them consistently without loss of performance.) If so, a developer could choose a simpler and faster implementation of the regex engine. I mean that we could have several different implementations and allow the developer to choose the option that best suits their situation.
@ViktorHofer would your proposed Span-based API likely fix the string allocation problem? If so, was the reason it is on pause that it was feasible to implement for the interpreted codepath but rather expensive for the compiled codepath? If so, I wonder whether the benefit is sufficiently large that we could implement it only for interpreted, and throw (for now) for compiled. @iSazonov just curious, do your regexes tend to be compiled?
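The allocation problem a Span-based API would address can be illustrated like this (a sketch; at the time of this discussion there was no span overload, so matching a slice of a larger buffer forces a string copy):

```csharp
using System;
using System.Text.RegularExpressions;

class SpanAllocationDemo
{
    static void Main()
    {
        string buffer = "id=12345;name=test";
        ReadOnlySpan<char> value = buffer.AsSpan(3, 5); // "12345"

        // Without a span-based overload, the slice must be materialized
        // into a new string just to run the match. This per-call
        // allocation is exactly what the proposed API would eliminate.
        bool ok = Regex.IsMatch(value.ToString(), @"^\d+$");
        Console.WriteLine(ok);
    }
}
```

(An overload of this shape, Regex.IsMatch(ReadOnlySpan&lt;char&gt;), did eventually ship in .NET 7.)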
@danmosemsft If your question is about the Select-String cmdlet, then I would expect that we need to compile regex patterns. (But I see in the cmdlet code that we are not doing this; it should be fixed.)
Compiling a pattern is not necessarily the right choice. I believe they can't be unloaded once compiled.
In general, yes, but PowerShell can internally compile script blocks for performance, so compiling a regex is not a problem. We could get a performance boost for large files, especially with TC.
Taking a closer look at ripgrep, I discovered that heuristics are used there to select one of 7 (!) search methods: https://github.com/rust-lang/regex/blob/77140e7c1628ac31026d2421c6a4c3b0eb19506c/src/literal/mod.rs#L39 It seems the Span.Sort() API was already approved. Did the MSFT team consider a Span.Match()/IsMatch() option?
The .NET Regex engine executes expressions through backtracking. Some other engines build a finite state automaton instead. It is possible to match a string in O(N) time as opposed to possibly exponential time with backtracking. Two downsides:
Most regex expressions that occur in practice are supported, and FSA matching is extremely fast. If we care about regex performance, we should not try to squeeze out a few percent here or there. The switch to an FSA implementation (for supported expressions, which are most of them) can provide an exponential speedup. Also, compilation is very important as a constant-factor optimization; it was added recently to .NET Core.
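The exponential-backtracking hazard described above is easy to demonstrate with a classic pathological pattern (a sketch; the timeout value is arbitrary):

```csharp
using System;
using System.Text.RegularExpressions;

class BacktrackDemo
{
    static void Main()
    {
        // Nested quantifier: on a failing match, the engine retries
        // exponentially many ways of splitting the 'a' run between
        // the two + operators.
        var evil = new Regex(@"^(a+)+$",
            RegexOptions.None, TimeSpan.FromMilliseconds(500));

        string input = new string('a', 40) + "b"; // can never match

        try
        {
            evil.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            // A backtracking engine blows up here; an FSA-based
            // engine would reject this input in O(N) time.
            Console.WriteLine("timed out");
        }
    }
}
```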
I also think that we could have several engines and choose the right one after regex parsing.
I do agree regex performance has fallen behind. The way forward may not be major engineering work in the existing Regex engine. As well as the cost, there is some risk of breaking behavior (possibly necessarily). My thinking was we may want to do a combination of
For 2., I did some experimentation with PCRE, as discussed here.
The best quality of implementation would be to upgrade the existing API: exact same syntax and API surface. It might make sense to add a parallel API surface that is more optimized (fewer allocations, etc.). I understand it's attractive to just drop the existing API and create something new in a greenfield way. The better solution for customers, though, is to avoid that fragmentation and just fix the engine. This requires thorough compatibility work, but surely it is possible. Does Microsoft not have access to large code bodies that use Regex? It must be possible to compile a compatibility test suite. If effort were not a concern, we'd create a new fast engine for supported expressions (FSA-based, with no backtracking) and keep the old engine for unsupported expressions. Almost all practically occurring expressions are formed from a rather simple syntax subset. This would be completely transparent to existing code; customers would not need to learn new ways of doing things.
This is not cheap. Writing a new regex engine that is competitive is a big investment and would take time. Where possible it would be good to build on the work others have done and spend our dev dollars in other places. Making it fully compatible is likely an opposing requirement anyway.
You are quite right; we did an audit of a large quantity of code and this is correct. You can get an idea of the nuances of existing behavior by looking at the tests I disabled in the branch above to get tests passing against PCRE.
That's understandable if performance is the only thing you're going after, but IMO that would be a missed opportunity. PCRE has a very rich feature set which you can leverage to do things that are just not possible in other engines. That's why I decided to support everything in PCRE.NET: it gives the user so many tools. There's one feature I left out because I deemed it unnecessary and had provided a suitable equivalent. And sure enough, just yesterday someone posted an issue asking me to implement it. People want these features. :) Of course, you could provide simpler APIs that use faster code paths alongside more feature-complete APIs to get the best of both worlds. Writing a different API for each regex engine makes sense from a feature-availability standpoint.
PCRE (and thus PCRE.NET) provide both implementations. But in the case of PCRE, the O(N) algorithm is not necessarily faster. Read here for more details, but in short:
Also, PCRE doesn't implement compiled regexes with this algorithm. Engines like RE2 that are designed from the ground up with this algorithm in mind are probably faster.
I wonder whether it could help to make changes to the "regular expression pattern parser" rather than in the matching engine itself. Sort of like static analysis for regex patterns: rewriting the patterns to get maximum performance from the existing engine without changing the match results. Specifically, I'm thinking of functionality like "unrolling the loop", or automatically wrapping some subexpressions in an atomic group to curb unnecessary backtracking when there is no chance of finding a match anyway. (Please forgive the lack of concrete examples here; I am on mobile atm.)

I guess this would only really help when the original author of the expression didn't already try to optimize it, or decided it would be more maintainable in a simpler form, etc., but it could perhaps be useful for cases where users can provide their own patterns. What do you think? I imagine the development effort would be a lot smaller than trying to rewrite the match engine or support third-party engines. It could even be part of a separate API that developers could opt into, in case that is a concern.

That said, one thing I would find useful, and am considering working on myself when/if time permits (and submitting as a NuGet package), would be a "regex AST". (Currently, the patterns that are parsed are stored in classes marked as
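As a concrete instance of the rewriting idea: wrapping a troublesome subexpression in an atomic group (?>...) tells the engine to discard backtracking positions inside it once it has matched, turning an exponential failure into a fast one. A small sketch:

```csharp
using System;
using System.Text.RegularExpressions;

class AtomicGroupDemo
{
    static void Main()
    {
        string input = new string('a', 30) + "!"; // can never match

        // Unrewritten form: ^(a+)+$ backtracks catastrophically
        // on this input.
        // Rewritten form: the atomic group forbids re-entering the
        // nested quantifier after it matches, so when $ fails at '!',
        // the whole match fails immediately.
        var rewritten = new Regex(@"^(?>(a+)+)$",
            RegexOptions.None, TimeSpan.FromSeconds(5));

        Console.WriteLine(rewritten.IsMatch(input)); // False, and quickly
    }
}
```

The rewrite preserves the match results here because nothing after the group can succeed with a different split of the 'a' run; detecting when that holds is exactly the static analysis being proposed.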
I like the idea of having a new regex engine in .NET. In that sense I like the work proposed in ltrzesniewski/pcre-net#12. This would not be affiliated with the official .NET libraries, right? It could be an independent project. If such a powerful library exists, then I agree there is limited need for improving the built-in Regex. As a .NET user I'd rather see the engineering time spent elsewhere, given these considerations.
PCRE.NET no longer uses C++/CLI, I've rewritten the interop part entirely so I could target netstandard2.0. I left that issue open for the perf aspect, on which I haven't worked much. |
Note that there have been some big improvements made to Regex perf in master (since 3.1 forked). Perhaps someone is interested in measuring with those. |
@danmosemsft Could you please point to the PRs?
@iSazonov these. I do not know whether @stephentoub plans more work. It would be fun to get more benchmark data; maybe I can do the mariomka benchmark at the top when I have access to that machine at home.
These significant recent wins suggest to me that the existing engine is not at the limit of where we can get it, and also that it may be less important to offer a way to plug in another engine behind the existing API (because the existing API is faster now). I think the future is likely a combination of both
cc @pgovind fyi. |
I do.
I agree. There are still things that can be done to make further meaningful improvements beyond what's already been done for .NET 5. |
Cool. I'll try to run the benchmarks next week on my machine to see where we're at. |
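For anyone wanting to reproduce such measurements, a minimal BenchmarkDotNet harness might look like this (a sketch: it assumes the BenchmarkDotNet NuGet package, and the email pattern is a stand-in resembling the one in the mariomka suite):

```csharp
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class RegexBench
{
    // Same pattern with and without RegexOptions.Compiled,
    // to measure what the compiled codepath buys.
    private static readonly Regex Interpreted =
        new Regex(@"[\w\.+-]+@[\w\.-]+\.[\w\.-]+");
    private static readonly Regex Compiled =
        new Regex(@"[\w\.+-]+@[\w\.-]+\.[\w\.-]+", RegexOptions.Compiled);

    private readonly string _input =
        "contact us at someone@example.com or admin@test.org";

    [Benchmark(Baseline = true)]
    public int InterpretedMatches() => Interpreted.Matches(_input).Count;

    [Benchmark]
    public int CompiledMatches() => Compiled.Matches(_input).Count;
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<RegexBench>();
}
```

The real benchmark runs the patterns over a large input file; this sketch only shows the harness shape.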
(been staying off GitHub while on vacation, but I'll be putting up another significant PR early next week) |
@danmosemsft, can you run your test again, but using .NET 5 master? |
Pasted from mariomka/regex-benchmark#14 (comment):

| Language | Email | URI | IP | Total |
| --- | --- | --- | --- | --- |
| PHP | 26.57 | 24.27 | 8.20 | 59.05 |

I also offered a PR to add RegexOptions.ECMAScript (i.e., to give the ECMA results above), since none of the C, Rust, C++, and Java versions are doing Unicode matching.
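The Unicode point is easy to see in isolation: by default .NET's \d matches any Unicode decimal digit, while RegexOptions.ECMAScript restricts it to ASCII 0-9, which is what the C, Rust, and C++ versions effectively do. A small illustration:

```csharp
using System;
using System.Text.RegularExpressions;

class EcmaScriptDemo
{
    static void Main()
    {
        string arabicIndicDigit = "\u0661"; // ARABIC-INDIC DIGIT ONE

        // Default: \d is Unicode-aware and matches U+0661.
        Console.WriteLine(Regex.IsMatch(arabicIndicDigit, @"\d")); // True

        // ECMAScript mode: \d means [0-9] only.
        Console.WriteLine(Regex.IsMatch(arabicIndicDigit, @"\d",
            RegexOptions.ECMAScript)); // False
    }
}
```

Unicode-aware character classes cost extra work per character, which is part of why cross-language comparisons need the semantics aligned.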
I wonder if the Rust version does not understand UTF-8.
Note that #35824 should take this down substantially further, but still a fair bit off Rust.
I've tested recent Mono (llvm-jit mode) vs daily .net 5 on this benchmark |
@EgorBo that would be recent Mono with all of @stephentoub's improvements to the library, I assume. How does it compare with/without RegexOptions.Compiled? Are both regressed, or is it specific to ref-emit?
Yes, it's a mono built from
For non-Compiled mode the difference is smaller between coreclr and mono-jit-llvm (~30%) |
Thanks. What's your interpretation of the numbers - is this hitting an area in mono that is known to have a perf gap? |
Not sure; I had a quick profiler run (via Xcode Instruments) and it looks like a lot of time is spent in GC.
We did a lot of perf work in this area in 5.0. Can this issue be closed now? |
I think so - we're doing very substantially better on both regex-redux and the mariomka benchmark. If there's more work, it should probably be driven by a new benchmark or new insights. Also, @EgorBo, the Mono-specific issue should have its own issue, not least because different engineers would be working on it. (Perhaps it is important for Blazor? @marek-safar)
One of the 3 Blazor test apps we are working with uses Regex. Note that on WASM: runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs, lines 76 to 81 in 54a09d2
I'm not aware of any regex perf problems/asks for browser |
There is a mariomka/regex-benchmark repo that runs regex benchmarks for different programming languages. I'm just wondering why C# Regex performance is the slowest, way slower than in any other programming language. Is there any way to speed up Regex performance in .NET, or is there any reason why .NET Regex is that slow?