RegexPrefix struct, RegexFCD buffer to ValueListBuilder & ArrayPool#27314
Conversation
| <Compile Include="System\Text\RegularExpressions\CompiledRegexRunner.cs" /> | ||
| <Compile Include="System\Text\RegularExpressions\RegexCompiler.cs" /> | ||
| <Compile Include="System\Text\RegularExpressions\RegexLWCGCompiler.cs" /> | ||
| <Compile Include="System\Text\RegularExpressions\Runner\CompiledRegexRunnerFactory.cs" /> |
There was a problem hiding this comment.
Coding conventions says that directory should match the namespace. There is no System.Text.RegularExpressions.Runner namespace...
There was a problem hiding this comment.
I didn't know that we are enforcing that. Folders make sense here (as we don't have an sub namespaces) but I'm fine with flattening it back.
| <Compile Include="$(CommonPath)\CoreLib\System\Collections\Generic\ValueListBuilder.cs"> | ||
| <Link>Common\System\Collections\Generic\ValueListBuilder.cs</Link> | ||
| </Compile> | ||
| <Compile Include="$(CommonPath)\CoreLib\System\Collections\HashtableExtensions.cs"> |
There was a problem hiding this comment.
It is not used by anything else. Why move it to common?
4673b38 to
b2204db
Compare
b2204db to
9acd77c
Compare
|
Thanks Jan. Updated the PR. Also cc @stephentoub for perf |
|
YOu might grab the comments in https://github.com/dotnet/corefx/compare/master...danmosemsft:regex.commenting?expand=1 also. |
| <PropertyGroup Condition="'$(Configuration)|$(Platform)' == 'uap-Windows_NT-Release|AnyCPU'" /> | ||
| <ItemGroup> | ||
| <Compile Include="System\Text\RegularExpressions\HashtableExtensions.cs" /> | ||
| <Compile Include="System\Collections\Generic\ValueListBuilder.Pop.cs" /> |
There was a problem hiding this comment.
This should probably merge into the rest of VLB.
There was a problem hiding this comment.
Stephen advised to keep the Pop method out
There was a problem hiding this comment.
Hmm I believe it was Stephen, or Jan. But that was before we had the Indexer in it. I don't have a strong opinion here...
There was a problem hiding this comment.
IMHO, Pop as an operation doesn't really belong on the type; it's just how Regex happens to use it. My original suggestion had actually been (or what I'd intended) for Regex to just have a static helper that took a ValueListBuilder by ref, rather than having it be a partial class instance method like this.
| <Compile Include="System\Text\RegularExpressions\RegexCode.cs" /> | ||
| <Compile Include="System\Text\RegularExpressions\RegexCollectionDebuggerProxy.cs" /> | ||
| <Compile Include="System\Text\RegularExpressions\RegexCompilationInfo.cs" /> | ||
| <Compile Include="System\Text\RegularExpressions\RegexFCD.cs" /> |
There was a problem hiding this comment.
Did you figure out what FCD stands for? Probably First Child D... ?
There was a problem hiding this comment.
hmm. not really. comments in its file:
// This RegexFCD class is internal to the Regex package.
// It builds a bunch of FC information (RegexFC) about
| } | ||
|
|
||
| internal sealed class RegexFC | ||
| internal struct RegexFC : IDisposable |
There was a problem hiding this comment.
IDisposable may not be a good fit for a struct especially when forgetting to dispose is not a resourcel eak.
There was a problem hiding this comment.
You are not using it either you are calling Dispose directly.
There was a problem hiding this comment.
it's not a ref type but I agree, we don't need it here. I had a different implementation for ValueListBuilder which detects if the type implements IDisposable and then calls it but there was a perf regression with the type is IDisposable check (reflection).
|
Updated with RegexFC ArrayPool usage. With our test data the RegexFC's buffer is always < 32 which means it never grows. As it was implemented as a growing stack I did the same here. I looked at the code but am not 100% sure if it could ever get larger than its initial size. Not really important here. The more important change is that I now rent RegexFC (struct) arrays from ArrayPool. I don't know if we pass a managed type to ArrayPool anywhere else in Core but after talking with Stephen I believe the change is safe. To avoid memory leaks I clear the reference to RegexCharClass when disposing. |
|
|
||
| // Return rented buffers | ||
| RegexFC[] fcSpan = ArrayPool<RegexFC>.Shared.Rent(StackBufferSize); | ||
| var fcStack = new ValueListBuilder<RegexFC>(fcSpan, clearArray: true); |
There was a problem hiding this comment.
why is clearArray important? we almost never clear on returning.
There was a problem hiding this comment.
the clearArray here is wrong as I do it manually some lines below but in general clearArray is important if you store managed objects with references to other objects (in our case RegexFC --> RegexCharClass). We usually use ArrayPool in combination with primitive types like int & char.
There was a problem hiding this comment.
I accidently merged two commits, don't wonder :)
9c63373 to
4f7a83d
Compare
|
OK so far but please please next time you do a big refactoring keep each refactoring into its own commit. I don't care how small the commits are. Eg., move A to B, rename C to D, change E to struct, split F into partial classes G and H. The split into partial classes is particularly difficult to review because I would have to manually set up the diff between old and new files. I did not see anything much but doing it like that makes it possible to be much more confident and do the review FAR more quickly and accurately. Ideally you would have the refactoring first and behavior changes in separate PR's. Since you've started this though, let's at least leave them in separate commits (ie, rebase not squash) so that we have some hope of bisecting if a problem emerges. When refactoring good test coverage is critical. As we noted offline the block code coverage of regex is very good, much of the unhit blocks are debug only code. Of course by the nature of regexes, block coverage is hardly sufficient, but it could be worse. |
|
Great to see the improvements though! |
|
Thanks Dan, I appreciate the feedback (as discussed offline) and will try to avoid this behavior in future. I will now continue with the next candidate and will open a new PR after this one is approved and merged (not squashed) into master. |
| { | ||
| return _length; | ||
| } | ||
| return Text.Substring(Index + Length, Text.Length - Index - Length); |
There was a problem hiding this comment.
How about slicing rather than substringing? It looks like at least one of the callers then appends the result to a StringBuilder, so if you do it as a slice, you can avoid the intermediary string.
| { | ||
| return _index; | ||
| } | ||
| return Text.Substring(0, Index); |
There was a problem hiding this comment.
Same slicing comment/question.
|
|
||
| int[] matches = _matches[groupnum]; | ||
|
|
||
| return Text.Substring(matches[(c - 1) * 2], matches[(c * 2) - 1]); |
There was a problem hiding this comment.
Same slicing comment/question.
| /// </summary> | ||
| internal sealed class SharedReference | ||
| { | ||
| private WeakReference _ref = new WeakReference(null); |
| internal sealed class SharedReference | ||
| { | ||
| private WeakReference _ref = new WeakReference(null); | ||
| private int _locked; |
| { | ||
| private RegexRunner _ref; | ||
| private object _obj; | ||
| private int _locked; |
|
The first commit is intended to just do refactoring / code styling stuff. The 2nd and 3rd commits are about performance of the RegexFCD and RegexFC types and about making RegexPrefix a struct. I touched a lot of files in the first commit but intend to work on them in separate PRs as there are tons of possible improvements there. E.g. the slicing you pointed out. |
| internal object Get() | ||
| { | ||
| // try to obtain the lock | ||
| if (0 == Interlocked.Exchange(ref _locked, 1)) |
There was a problem hiding this comment.
Why not just use a normal lock?
There was a problem hiding this comment.
No idea. I think that change was made by @justinvp. Maybe he can answer?
There was a problem hiding this comment.
I didn't change it. This is how it was when the sources were added to corefx.
There was a problem hiding this comment.
thanks and sorry for the noise!
4f7a83d to
e77c9fc
Compare
| // Create/rent buffers | ||
| Span<int> intSpan = stackalloc int[StackBufferSize]; | ||
| var intStack = new ValueListBuilder<int>(intSpan); | ||
| RegexFC[] fcSpan = ArrayPool<RegexFC>.Shared.Rent(StackBufferSize); |
There was a problem hiding this comment.
We're renting an array of StackBufferSize? I'd have thought we'd start with stack space of StackBufferSize. I guess we can't as these aren't primitives?
There was a problem hiding this comment.
correct. it's a managed type which can't be stackalloced. I don't know why these restriction even apply to ref structs that always live on the stack. maybe @AndyAyersMS can answer?
There was a problem hiding this comment.
Right, stackalloc works for primitive/blitable types only.
There was a problem hiding this comment.
Stackalloc storage can't contain gc refs. Among other reasons, there is no mechanism for gc reporting from stackalloc buffers.
On the surface, the idea of allowing GC refs in a stackalloc buffer seems viable, if we can get sufficient safety guarantees. Say if the stackalloc is zero-init and immediately span wrapped and there is no safe way to view it as raw memory. With the right implementation (say a jit intrisnic Span ctor) the jit could know the type of the elements and create a new kind of gc info report for it. But we'd have to look into it in more detail to see if other problems crop up.
There was a problem hiding this comment.
Sounds interesting. Andy could you please open an issue for that? I think you could describe the matter far better than me.
There was a problem hiding this comment.
https://github.com/dotnet/csharplang/blob/master/proposals/fixed-sized-buffers.md and related issues are pretty much this. @VSadov has been working on this recently.
There was a problem hiding this comment.
There is a proposal and a prototype where you could just say
RegexFC[StackBufferSize] fcSpan = default;
... just use fcSpan as an indexable array-like struct ...In reply to: 170391575 [](ancestors = 170391575)
There was a problem hiding this comment.
It could be useful in scenarios like this, but for now, I guess, array renting is a solution.
In reply to: 170393377 [](ancestors = 170393377,170391575)
There was a problem hiding this comment.
Supporting GC refs in span-wrapped stackallocs is similar to the fixed-sized buffer proposal but also seems slightly different, in perhaps interesting ways (?). It would allow for fixed or variable sized buffers and would be stack only, instead of supporting only fixed sized buffers but being storage agnostic. Would not require any language changes, just a new magic span constructor that the jit turns into stackalloc under the covers...
There was a problem hiding this comment.
Yes. The fixed size arrays are a syntactical sugar for ordinary struct types of appropriate size. You can have locals, parameters, fields of those types. They can even nest.
GC-tracked stackalloc allows for variable number of elements, but values must be stack-only. That would be basically the tradeoff.
In reply to: 170400977 [](ancestors = 170400977)
| private Span<T> _span; | ||
| private T[] _arrayFromPool; | ||
| private int _pos; | ||
| private readonly bool _clearArray; |
There was a problem hiding this comment.
We should not be making ValueListBuilder more expensive by adding cruft like this to it.
There was a problem hiding this comment.
As I'm not using these changes here I will revert them.
There was a problem hiding this comment.
I already reverted the change but I wouldn't call that addition cruft. In cases where you want to use ValueListBuilder with types that need to cleaned up, the clearArray overload is handy. In my case I didn't use it as it is faster to partially clean it up instead of using Array.Clear.
There was a problem hiding this comment.
Point is it's not pay for play though, it makes everyone use more stack
There was a problem hiding this comment.
You cannot really build common helper classes like these to support everything that one might potentially want to do. It kills their value. They need to stay focused on the mainstream use.
| { | ||
| [MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
| public T Pop() | ||
| public ref T Pop() |
There was a problem hiding this comment.
This is pretty dangerous. Very easy to make a mistake by using the value after it has been overwritten.
There was a problem hiding this comment.
I'm not using the Pop function with the ref keyword, I just wanted to add the possibility to do so. I understand the concerns. As I'm not using it either here, I will revert them.
There was a problem hiding this comment.
If you do not actually needs this, this change should be undone.
1379969 to
6e3c949
Compare
|
Thanks, I addressed feedback. PTAL/approve for merging. |
e15cf7d to
aa746e0
Compare
|
fixed failing tests |
|
@dotnet-bot test Linux x64 Release Build |
|
@danmosemsft @stephentoub I think this PR is now ready to be merged. Please review/approve. |
|
@dotnet/dnceng Not the first time that I notice that some/all legs mentioned in the previous comment restart when posting a new comment. This leg restarted: https://ci3.dot.net/job/dotnet_corefx/job/master/job/osx-TGroup_netcoreapp+CGroup_Debug+AGroup_x64+TestOuter_false_prtest/9152/ after posting #27314 (comment) |
|
@ViktorHofer Huh. That seems strange. We'll look into it. |
aa746e0 to
0a73fc4
Compare
0a73fc4 to
5201ee7
Compare
| if (0 != (anchors & Eol)) sb.Append(", Eol"); | ||
| if (0 != (anchors & End)) sb.Append(", End"); | ||
| if (0 != (anchors & EndZ)) sb.Append(", EndZ"); | ||
| if (0 != (anchors & Beginning)) |
There was a problem hiding this comment.
VS did that change...
| RegexFC child = PopFC(); | ||
| RegexFC cumul = TopFC(); | ||
| RegexFC child = FCPop(); | ||
| RegexFC cumul = _fcStack[_fcStack.Count - 1]; |
There was a problem hiding this comment.
Why did you change all of these away from the helper method?
There was a problem hiding this comment.
When I was trying a different data model I probably removed those. When I prepared the PR today for review today I noticed that I should revert these changes but was too lazy. Will do that now ;)
stephentoub
left a comment
There was a problem hiding this comment.
LGTM. Do we have any tests that stress using regexes in parallel? If not, it'd be good to add some, now that we're using pooled objects.
|
Thanks Stephen! Yes I have some parallel tests locally but we don't have any in corefx yet. I will create an issue to track that. |
| } | ||
|
|
||
| /// <summary> | ||
| /// Return rented buffers. Clear RegexFC buffer manually as only |
There was a problem hiding this comment.
What code is the "Clear RegexFC buffer manually" referring to?
…otnet/corefx#27314) * Change RegexPrefix to struct & use ValueListBuilder in RegexFCD * Move ValueStringConstruction from the caller to the callee * Store RegexFC in List<T>, better perf for RegexPrefix access Commit migrated from dotnet/corefx@7520e80
Updated results from yesterday (couldn't repro the better number in commit 2).
before:
after:
Results
After the first regex construction:
~ 8% increased throughput
~ 4% less allocation