Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Regex.Replace Span API & optimize Regex-Redux benchmark #25097

Closed
ViktorHofer opened this issue Feb 20, 2018 · 22 comments
Closed

Introduce Regex.Replace Span API & optimize Regex-Redux benchmark #25097

ViktorHofer opened this issue Feb 20, 2018 · 22 comments
Assignees
Labels
area-System.Text.RegularExpressions enhancement Product code improvement that does NOT require public API changes/additions
Milestone

Comments

@ViktorHofer
Copy link
Member

Rationale

Today we have the ability to replace certain parts of a string identified by a regex pattern by calling Regex.Replace(string input, string pattern, string replacement). This does work well for small input strings but if you chain multiple Replace calls together and pass a large input string to it the number of string allocations are immensely high.

Example

In the regex-redux benchmark there are 3 huge sets of allocations that dominate the rest.

  • There is a cascade of replacements on a 50M character string, which creates 500MB of string garbage, presumably in LOH
    image
  • Its input file is read into a 100MB string. (double byte - 50MB file)
  • There are 100MB of StringBuilder allocations.

API Proposal

namespace System.Text.RegularExpressions
{
    public class Regex
    {
        // Returns amount of bytes written into the buffer
        public static int Replace(ReadOnlySpan<char> input, Span<char> buffer, string pattern, string replacement, RegexOptions options = RegexOptions.None, TimeSpan matchTimeout = default);
    }
}

Notes

This API would help us remove unnecessary string allocations in the regex-redux benchmark (http://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexredux&lang=csharpcore&id=9).
The Regex.Replace implementation uses Match internally to check the pattern on the input string. Can we also change the Match code to operate on Spans or do we need to allocate a string in that step?

Sample Usage

// Sample implementation for regex-redux
const int InputFileSize = 1024 * 1024 * 50; // 50 MB input file

// Borrow buffers
 char[] buffer1Arr = ArrayPool<char>.Shared.Rent(InputFileSize);
 char[] buffer2Arr = ArrayPool<char>.Shared.Rent(InputFileSize);
 Span<char> buffer1 = new Span<char>(buffer1Arr);
 Span<char> buffer2 = new Span<char>(buffer2Arr);

int bufferBytesWritten = Console.In.ReadBlock(buffer1);
bufferBytesWritten = Regex.Replace(buffer1.Slice(0, bufferBytesWritten).AsReadOnlySpan(), buffer2, "tHa[Nt]", "<4>");
bufferBytesWritten = Regex.Replace(buffer2.Slice(0, bufferBytesWritten).AsReadOnlySpan(), buffer1, "aND|caN|Ha[DS]|WaS", "<3>");
bufferBytesWritten = Regex.Replace(buffer1.Slice(0, bufferBytesWritten).AsReadOnlySpan(), buffer2, "a[NSt]|BY", "<2>");
bufferBytesWritten = Regex.Replace(buffer2.Slice(0, bufferBytesWritten).AsReadOnlySpan(), buffer1, "<[^>]*>", "|");
bufferBytesWritten = Regex.Replace(buffer1.Slice(0, bufferBytesWritten).AsReadOnlySpan(), buffer2, "\\|[^|][^|]*\\|", "-");

// Return buffers
ArrayPool<char>.Shared.Return(buffer1Arr);
ArrayPool<char>.Shared.Return(buffer2Arr);

return bufferBytesWritten;

Relates to https://github.com/dotnet/corefx/issues/27124

cc @stephentoub @danmosemsft @vancem @jkotas @jfree

@ViktorHofer ViktorHofer changed the title Introduce Regex.Replace Span API & optimize Regex-Redux benchmar Introduce Regex.Replace Span API & optimize Regex-Redux benchmark Feb 20, 2018
@ViktorHofer ViktorHofer self-assigned this Feb 20, 2018
@danmoseley
Copy link
Member

Current allocations are about 100MB initial string, 500MB intermediate strings, and about 100MB String builders (already pooled). The scheme above should cut this down to about 2 X100MB.

@danmoseley
Copy link
Member

Tricky part looks like what to do with Regex.Match, with Regex.Replace uses, as the string tendrils go pretty deep. I almost think it might be easiest to try the refactoring to use Span/Memory and see how far it can be done. It would likely remove strings from most of the insides of S.T.RE. That would help inform what APi is appropriate/possible.

@jkotas
Copy link
Member

jkotas commented Feb 20, 2018

Rent char[] buffer from ArrayPool (size of the input file => 1024 * 1024 * 50 => char[50M]) and put it into a Span.

The current ArrayPool implementation does not pool buffers this big.

@ViktorHofer
Copy link
Member Author

What about making the pool size configurable by a setting switch in AppDomain.Current.SetData?

@stephentoub
Copy link
Member

Regex.Replace overload added and impl changed

I don't understand the proposal...

What does this return? It's allocating a result rather than writing into a destination span?

Is it's built on top of match, how are you going to deal with it creating heap-based objects that store the input, which would now include a span? Pin the span(s) and pass pointers around?

What are the specific allocations being avoided here?

@ViktorHofer
Copy link
Member Author

I updated my initial post to provide a better understanding. PTAL.

Is it's built on top of match, how are you going to deal with it creating heap-based objects that store the input, which would now include a span? Pin the span(s) and pass pointers around?

That's the tricky part!

@jkotas
Copy link
Member

jkotas commented Feb 20, 2018

There is a cascade of replacements on a 50M character string, which creates 500MB of string garbage, presumably in LOH

Do you have an estimate on how much faster is the benchmark going to run if you get rid of garbage?

@jkotas
Copy link
Member

jkotas commented Feb 20, 2018

char[] buffer1Arr = ArrayPool<char>.Shared.Rent(InputFileSize);
char[] buffer2Arr = ArrayPool<char>.Shared.Rent(InputFileSize);

Pooling is only worth it if there is something running a loop. Does the benchmark run in a loop? It does not look like it to me.

@ViktorHofer
Copy link
Member Author

Do you have an estimate on how much faster is the benchmark going to run if you get rid of garbage?

GC Heap Analyzer shows 825ms was paused in GC. The benchmark currently takes ~15 seconds.

@danmoseley
Copy link
Member

There is ~800ms on ~15 sec scenario spent in GC. I would imagine some part of that.

@ViktorHofer
Copy link
Member Author

Pooling is only worth it if there is something running a loop. Does the benchmark run in a loop? It does not look like it to me.

The benchmark does five Regex.Replace calls subsequently (a chain of pattern replacements). Even though it's currently not designed as, this is the loop scenario.

@jkotas
Copy link
Member

jkotas commented Feb 20, 2018

five Regex.Replace calls subsequently (a chain of pattern replacements)

That's a streaming scenario, given that the dataset is 50MB. What you would really want to make it go really fast is a streaming RegEx that you pass a stream in and it gives you a stream back.

@danmoseley
Copy link
Member

danmoseley commented Feb 20, 2018

@jkotas interesting, how would a streaming API look? Note that the first replace removes newlines, so all subsequent regexes must operate on a single logical buffer. It would need a clever engine indeed to match against fragments of the buffer such that it had the same result.

BTW, a .Replace that accepts a ROS as input seems uncontroversial - that would eliminate the string and StringBuilder involved in reading from the file..?

@jkotas
Copy link
Member

jkotas commented Feb 20, 2018

how would a streaming API look?

Using the new shiny pipelines, it can be something like this:

public class RegexReplacePipeline
{
    public RegexReplacePipeline(string pattern, string replacement, RegexOptions options = RegexOptions.None, TimeSpan matchTimeout = default);

    public PipeReader Create(PipeReader reader, PipeOptions options);
}

This API would give you a great control over scheduling the 5 chained regex replace operations. E.g. it allows you to set it up such that second thread will start processing the second replace as soon as the first thread running the first replace starts producing output.

@danmoseley
Copy link
Member

Interesting. The regex work itself, in the general case, would need the entire input buffer to start work,
right? (Since regexes in general operate on the entire input -- notwithstanding some exotic regex engine changes)

@jkotas
Copy link
Member

jkotas commented Feb 20, 2018

The point of streaming regex engine is that it does not need to have the whole buffer available all the time. I understand that there may be non-trivial changes require in the current corefx regex engine to make it compatible streaming.

BTW: Anybody happens to know whether there are good streaming engines out there? I would thing that grep or find/replace in Visual Studio may be implemented as streaming.

@danmoseley
Copy link
Member

PCRE has partial matching as does Boost (here). The idea seems to be if it says partial match, you feed it another block, and ask it to try again at the start position of the partial match.

I could imagine that we could do such a thing - possibly substantial work. We would want to actually consume a stream, rather than just flag a partial match and let the caller deal with ti.

@danmoseley
Copy link
Member

A first implementation of a regex over stream could be done that was efficient for the case where there are reasonable length input lines and RegexOptions.Multiline is off (the default). It would read a line at a time from the stream. Of course that's fairly trivial to achieve with the current API. The problem with regex-redux is everything is one huge line.

@danmoseley
Copy link
Member

@powercode in #23602 you mentioned

This would really help Select-String in PowerShell. Tried to parallelize it and ran into GC death on string allocations for the inputs.

Can you clarify what API you feel would improve that? Regex.Match creates strings, as you know - in the Match objects. Is merely accepting a Span still a win?

@powercode
Copy link

powercode commented Feb 21, 2018

Since our use case is the grep scenario, the big cost comes from allocating strings for each line in a file.
The most common case is finding text in a directory structure, which means allocating a string for every line in every file in every directory :)
That can be a very large number of strings. My hope was to get to a scenario where we could do memory mapping and map Spans for doing the matching, and only allocate strings for the lines where there is an actual match.

If there can be less allocations within the Match function, that is a bonus.

So my scenario would benefit the most from having a Span for the match input.

I haven't worked with the Span APIs yet, so I feel a bit reluctant to provide suggestions for API design, but as long as the scenario of matching every line in a file is considered, I am happy.
We will probably parallelize on the file level, but maybe it would be interesting to be able to find matches in huge files in parallel too.

By the way, PowerShell is currently lacking a sed like cmdlet, and the Replacement API suggestions above would certainly make the implementation of that a lot easier.

@4creators
Copy link
Contributor

BTW: Anybody happens to know whether there are good streaming engines out there?

One of the best streaming, high performance regex libraries is Hyperscan by Intel, although, it does not support complete regex syntax. It is licensed under BSD 3 clause license.

@ViktorHofer
Copy link
Member Author

Thank you all for your feedback. I will close this work item in favor of https://github.com/dotnet/corefx/issues/24145 where I'm currently drafting an API proposal for Span overloads on the existing API set.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 18, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.RegularExpressions enhancement Product code improvement that does NOT require public API changes/additions
Projects
None yet
Development

No branches or pull requests

7 participants