Introduce Regex.Replace Span API & optimize Regex-Redux benchmark #25097

ViktorHofer · 2018-02-20T03:29:00Z

Rationale

Today we have the ability to replace certain parts of a string identified by a regex pattern by calling Regex.Replace(string input, string pattern, string replacement). This does work well for small input strings but if you chain multiple Replace calls together and pass a large input string to it the number of string allocations are immensely high.

Example

In the regex-redux benchmark there are 3 huge sets of allocations that dominate the rest.

There is a cascade of replacements on a 50M character string, which creates 500MB of string garbage, presumably in LOH
Its input file is read into a 100MB string. (double byte - 50MB file)
There are 100MB of StringBuilder allocations.

API Proposal

namespace System.Text.RegularExpressions
{
    public class Regex
    {
        // Returns amount of bytes written into the buffer
        public static int Replace(ReadOnlySpan<char> input, Span<char> buffer, string pattern, string replacement, RegexOptions options = RegexOptions.None, TimeSpan matchTimeout = default);
    }
}

Notes

This API would help us remove unnecessary string allocations in the regex-redux benchmark (http://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexredux&lang=csharpcore&id=9).
The Regex.Replace implementation uses Match internally to check the pattern on the input string. Can we also change the Match code to operate on Spans or do we need to allocate a string in that step?

Sample Usage

// Sample implementation for regex-redux
const int InputFileSize = 1024 * 1024 * 50; // 50 MB input file

// Borrow buffers
 char[] buffer1Arr = ArrayPool<char>.Shared.Rent(InputFileSize);
 char[] buffer2Arr = ArrayPool<char>.Shared.Rent(InputFileSize);
 Span<char> buffer1 = new Span<char>(buffer1Arr);
 Span<char> buffer2 = new Span<char>(buffer2Arr);

int bufferBytesWritten = Console.In.ReadBlock(buffer1);
bufferBytesWritten = Regex.Replace(buffer1.Slice(0, bufferBytesWritten).AsReadOnlySpan(), buffer2, "tHa[Nt]", "<4>");
bufferBytesWritten = Regex.Replace(buffer2.Slice(0, bufferBytesWritten).AsReadOnlySpan(), buffer1, "aND|caN|Ha[DS]|WaS", "<3>");
bufferBytesWritten = Regex.Replace(buffer1.Slice(0, bufferBytesWritten).AsReadOnlySpan(), buffer2, "a[NSt]|BY", "<2>");
bufferBytesWritten = Regex.Replace(buffer2.Slice(0, bufferBytesWritten).AsReadOnlySpan(), buffer1, "<[^>]*>", "|");
bufferBytesWritten = Regex.Replace(buffer1.Slice(0, bufferBytesWritten).AsReadOnlySpan(), buffer2, "\\|[^|][^|]*\\|", "-");

// Return buffers
ArrayPool<char>.Shared.Return(buffer1Arr);
ArrayPool<char>.Shared.Return(buffer2Arr);

return bufferBytesWritten;

Relates to https://github.com/dotnet/corefx/issues/27124

cc @stephentoub @danmosemsft @vancem @jkotas @jfree

The text was updated successfully, but these errors were encountered:

danmoseley · 2018-02-20T04:30:49Z

Current allocations are about 100MB initial string, 500MB intermediate strings, and about 100MB String builders (already pooled). The scheme above should cut this down to about 2 X100MB.

danmoseley · 2018-02-20T07:00:30Z

Tricky part looks like what to do with Regex.Match, with Regex.Replace uses, as the string tendrils go pretty deep. I almost think it might be easiest to try the refactoring to use Span/Memory and see how far it can be done. It would likely remove strings from most of the insides of S.T.RE. That would help inform what APi is appropriate/possible.

jkotas · 2018-02-20T07:40:59Z

Rent char[] buffer from ArrayPool (size of the input file => 1024 * 1024 * 50 => char[50M]) and put it into a Span.

The current ArrayPool implementation does not pool buffers this big.

ViktorHofer · 2018-02-20T11:22:54Z

What about making the pool size configurable by a setting switch in AppDomain.Current.SetData?

stephentoub · 2018-02-20T11:40:18Z

Regex.Replace overload added and impl changed

I don't understand the proposal...

What does this return? It's allocating a result rather than writing into a destination span?

Is it's built on top of match, how are you going to deal with it creating heap-based objects that store the input, which would now include a span? Pin the span(s) and pass pointers around?

What are the specific allocations being avoided here?

ViktorHofer · 2018-02-20T14:31:37Z

I updated my initial post to provide a better understanding. PTAL.

Is it's built on top of match, how are you going to deal with it creating heap-based objects that store the input, which would now include a span? Pin the span(s) and pass pointers around?

That's the tricky part!

jkotas · 2018-02-20T14:44:30Z

There is a cascade of replacements on a 50M character string, which creates 500MB of string garbage, presumably in LOH

Do you have an estimate on how much faster is the benchmark going to run if you get rid of garbage?

jkotas · 2018-02-20T14:48:09Z

char[] buffer1Arr = ArrayPool<char>.Shared.Rent(InputFileSize);
char[] buffer2Arr = ArrayPool<char>.Shared.Rent(InputFileSize);

Pooling is only worth it if there is something running a loop. Does the benchmark run in a loop? It does not look like it to me.

ViktorHofer · 2018-02-20T14:48:40Z

Do you have an estimate on how much faster is the benchmark going to run if you get rid of garbage?

GC Heap Analyzer shows 825ms was paused in GC. The benchmark currently takes ~15 seconds.

danmoseley · 2018-02-20T14:48:50Z

There is ~800ms on ~15 sec scenario spent in GC. I would imagine some part of that.

ViktorHofer · 2018-02-20T14:51:55Z

Pooling is only worth it if there is something running a loop. Does the benchmark run in a loop? It does not look like it to me.

The benchmark does five Regex.Replace calls subsequently (a chain of pattern replacements). Even though it's currently not designed as, this is the loop scenario.

jkotas · 2018-02-20T14:56:00Z

five Regex.Replace calls subsequently (a chain of pattern replacements)

That's a streaming scenario, given that the dataset is 50MB. What you would really want to make it go really fast is a streaming RegEx that you pass a stream in and it gives you a stream back.

danmoseley · 2018-02-20T16:57:03Z

@jkotas interesting, how would a streaming API look? Note that the first replace removes newlines, so all subsequent regexes must operate on a single logical buffer. It would need a clever engine indeed to match against fragments of the buffer such that it had the same result.

BTW, a .Replace that accepts a ROS as input seems uncontroversial - that would eliminate the string and StringBuilder involved in reading from the file..?

jkotas · 2018-02-20T17:43:48Z

how would a streaming API look?

Using the new shiny pipelines, it can be something like this:

public class RegexReplacePipeline
{
    public RegexReplacePipeline(string pattern, string replacement, RegexOptions options = RegexOptions.None, TimeSpan matchTimeout = default);

    public PipeReader Create(PipeReader reader, PipeOptions options);
}

This API would give you a great control over scheduling the 5 chained regex replace operations. E.g. it allows you to set it up such that second thread will start processing the second replace as soon as the first thread running the first replace starts producing output.

danmoseley · 2018-02-20T17:47:57Z

Interesting. The regex work itself, in the general case, would need the entire input buffer to start work,
right? (Since regexes in general operate on the entire input -- notwithstanding some exotic regex engine changes)

jkotas · 2018-02-20T18:54:08Z

The point of streaming regex engine is that it does not need to have the whole buffer available all the time. I understand that there may be non-trivial changes require in the current corefx regex engine to make it compatible streaming.

BTW: Anybody happens to know whether there are good streaming engines out there? I would thing that grep or find/replace in Visual Studio may be implemented as streaming.

danmoseley · 2018-02-20T20:01:31Z

PCRE has partial matching as does Boost (here). The idea seems to be if it says partial match, you feed it another block, and ask it to try again at the start position of the partial match.

I could imagine that we could do such a thing - possibly substantial work. We would want to actually consume a stream, rather than just flag a partial match and let the caller deal with ti.

danmoseley · 2018-02-20T20:03:59Z

A first implementation of a regex over stream could be done that was efficient for the case where there are reasonable length input lines and RegexOptions.Multiline is off (the default). It would read a line at a time from the stream. Of course that's fairly trivial to achieve with the current API. The problem with regex-redux is everything is one huge line.

danmoseley · 2018-02-20T21:09:50Z

@powercode in #23602 you mentioned

This would really help Select-String in PowerShell. Tried to parallelize it and ran into GC death on string allocations for the inputs.

Can you clarify what API you feel would improve that? Regex.Match creates strings, as you know - in the Match objects. Is merely accepting a Span still a win?

powercode · 2018-02-21T11:08:48Z

Since our use case is the grep scenario, the big cost comes from allocating strings for each line in a file.
The most common case is finding text in a directory structure, which means allocating a string for every line in every file in every directory :)
That can be a very large number of strings. My hope was to get to a scenario where we could do memory mapping and map Spans for doing the matching, and only allocate strings for the lines where there is an actual match.

If there can be less allocations within the Match function, that is a bonus.

So my scenario would benefit the most from having a Span for the match input.

I haven't worked with the Span APIs yet, so I feel a bit reluctant to provide suggestions for API design, but as long as the scenario of matching every line in a file is considered, I am happy.
We will probably parallelize on the file level, but maybe it would be interesting to be able to find matches in huge files in parallel too.

By the way, PowerShell is currently lacking a sed like cmdlet, and the Replacement API suggestions above would certainly make the implementation of that a lot easier.

4creators · 2018-02-21T19:42:54Z

BTW: Anybody happens to know whether there are good streaming engines out there?

One of the best streaming, high performance regex libraries is Hyperscan by Intel, although, it does not support complete regex syntax. It is licensed under BSD 3 clause license.

ViktorHofer · 2018-03-01T14:17:12Z

Thank you all for your feedback. I will close this work item in favor of https://github.com/dotnet/corefx/issues/24145 where I'm currently drafting an API proposal for Span overloads on the existing API set.

ViktorHofer changed the title ~~Introduce Regex.Replace Span API & optimize Regex-Redux benchmar~~ Introduce Regex.Replace Span API & optimize Regex-Redux benchmark Feb 20, 2018

ViktorHofer self-assigned this Feb 20, 2018

ViktorHofer closed this as completed Mar 1, 2018

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 2.1.0 milestone Jan 31, 2020

ghost locked as resolved and limited conversation to collaborators Dec 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce Regex.Replace Span API & optimize Regex-Redux benchmark #25097

Introduce Regex.Replace Span API & optimize Regex-Redux benchmark #25097

ViktorHofer commented Feb 20, 2018

danmoseley commented Feb 20, 2018

danmoseley commented Feb 20, 2018

jkotas commented Feb 20, 2018

ViktorHofer commented Feb 20, 2018

stephentoub commented Feb 20, 2018

ViktorHofer commented Feb 20, 2018

jkotas commented Feb 20, 2018

jkotas commented Feb 20, 2018

ViktorHofer commented Feb 20, 2018

danmoseley commented Feb 20, 2018

ViktorHofer commented Feb 20, 2018

jkotas commented Feb 20, 2018

danmoseley commented Feb 20, 2018 •

edited

jkotas commented Feb 20, 2018 •

edited

danmoseley commented Feb 20, 2018

jkotas commented Feb 20, 2018

danmoseley commented Feb 20, 2018

danmoseley commented Feb 20, 2018

danmoseley commented Feb 20, 2018

powercode commented Feb 21, 2018 •

edited

4creators commented Feb 21, 2018

ViktorHofer commented Mar 1, 2018

Introduce Regex.Replace Span API & optimize Regex-Redux benchmark #25097

Introduce Regex.Replace Span API & optimize Regex-Redux benchmark #25097

Comments

ViktorHofer commented Feb 20, 2018

Rationale

Example

API Proposal

Notes

Sample Usage

danmoseley commented Feb 20, 2018

danmoseley commented Feb 20, 2018

jkotas commented Feb 20, 2018

ViktorHofer commented Feb 20, 2018

stephentoub commented Feb 20, 2018

ViktorHofer commented Feb 20, 2018

jkotas commented Feb 20, 2018

jkotas commented Feb 20, 2018

ViktorHofer commented Feb 20, 2018

danmoseley commented Feb 20, 2018

ViktorHofer commented Feb 20, 2018

jkotas commented Feb 20, 2018

danmoseley commented Feb 20, 2018 • edited

jkotas commented Feb 20, 2018 • edited

danmoseley commented Feb 20, 2018

jkotas commented Feb 20, 2018

danmoseley commented Feb 20, 2018

danmoseley commented Feb 20, 2018

danmoseley commented Feb 20, 2018

powercode commented Feb 21, 2018 • edited

4creators commented Feb 21, 2018

ViktorHofer commented Mar 1, 2018

danmoseley commented Feb 20, 2018 •

edited

jkotas commented Feb 20, 2018 •

edited

powercode commented Feb 21, 2018 •

edited