Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

RegexPrefix struct, RegexFCD buffer to ValueListBuilder & ArrayPool#27314

Merged
ViktorHofer merged 7 commits into
dotnet:masterfrom
ViktorHofer:RegexRefactoring
Mar 1, 2018
Merged

RegexPrefix struct, RegexFCD buffer to ValueListBuilder & ArrayPool#27314
ViktorHofer merged 7 commits into
dotnet:masterfrom
ViktorHofer:RegexRefactoring

Conversation

@ViktorHofer
Copy link
Copy Markdown
Member

@ViktorHofer ViktorHofer commented Feb 21, 2018

Updated results from yesterday (couldn't repro the better number in commit 2).

before:

Method Mean Error StdDev Gen 0 Allocated
RegexCtor 339.5 ms 0.9025 ms 0.8442 ms 201000.0000 402.07 MB

after:

Method Mean Error StdDev Gen 0 Allocated
RegexCtor 312.0 ms 1.447 ms 1.1258 ms 189125.0000 378.36 MB

Results

After the first regex construction:
~ 8% increased throughput
~ 4% less allocation

<Compile Include="System\Text\RegularExpressions\CompiledRegexRunner.cs" />
<Compile Include="System\Text\RegularExpressions\RegexCompiler.cs" />
<Compile Include="System\Text\RegularExpressions\RegexLWCGCompiler.cs" />
<Compile Include="System\Text\RegularExpressions\Runner\CompiledRegexRunnerFactory.cs" />
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coding conventions says that directory should match the namespace. There is no System.Text.RegularExpressions.Runner namespace...

Copy link
Copy Markdown
Member Author

@ViktorHofer ViktorHofer Feb 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know that we are enforcing that. Folders make sense here (as we don't have an sub namespaces) but I'm fine with flattening it back.

<Compile Include="$(CommonPath)\CoreLib\System\Collections\Generic\ValueListBuilder.cs">
<Link>Common\System\Collections\Generic\ValueListBuilder.cs</Link>
</Compile>
<Compile Include="$(CommonPath)\CoreLib\System\Collections\HashtableExtensions.cs">
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not used by anything else. Why move it to common?

@ViktorHofer ViktorHofer force-pushed the RegexRefactoring branch 3 times, most recently from 4673b38 to b2204db Compare February 21, 2018 05:34
@ViktorHofer ViktorHofer changed the title Move HashtableExtensions to common, Regex refactoring Regex refactoring & perf improvements Feb 21, 2018
@ViktorHofer
Copy link
Copy Markdown
Member Author

Thanks Jan. Updated the PR. Also cc @stephentoub for perf

@danmoseley
Copy link
Copy Markdown
Member

<PropertyGroup Condition="'$(Configuration)|$(Platform)' == 'uap-Windows_NT-Release|AnyCPU'" />
<ItemGroup>
<Compile Include="System\Text\RegularExpressions\HashtableExtensions.cs" />
<Compile Include="System\Collections\Generic\ValueListBuilder.Pop.cs" />
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably merge into the rest of VLB.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stephen advised to keep the Pop method out

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I believe it was Stephen, or Jan. But that was before we had the Indexer in it. I don't have a strong opinion here...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, Pop as an operation doesn't really belong on the type; it's just how Regex happens to use it. My original suggestion had actually been (or what I'd intended) for Regex to just have a static helper that took a ValueListBuilder by ref, rather than having it be a partial class instance method like this.

<Compile Include="System\Text\RegularExpressions\RegexCode.cs" />
<Compile Include="System\Text\RegularExpressions\RegexCollectionDebuggerProxy.cs" />
<Compile Include="System\Text\RegularExpressions\RegexCompilationInfo.cs" />
<Compile Include="System\Text\RegularExpressions\RegexFCD.cs" />
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you figure out what FCD stands for? Probably First Child D... ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. not really. comments in its file:

// This RegexFCD class is internal to the Regex package.
// It builds a bunch of FC information (RegexFC) about

}

internal sealed class RegexFC
internal struct RegexFC : IDisposable
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDisposable may not be a good fit for a struct especially when forgetting to dispose is not a resourcel eak.

#27250 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are not using it either you are calling Dispose directly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not a ref type but I agree, we don't need it here. I had a different implementation for ValueListBuilder which detects if the type implements IDisposable and then calls it but there was a perf regression with the type is IDisposable check (reflection).

@ViktorHofer
Copy link
Copy Markdown
Member Author

Updated with RegexFC ArrayPool usage.

With our test data the RegexFC's buffer is always < 32 which means it never grows. As it was implemented as a growing stack I did the same here. I looked at the code but am not 100% sure if it could ever get larger than its initial size. Not really important here.

The more important change is that I now rent RegexFC (struct) arrays from ArrayPool. I don't know if we pass a managed type to ArrayPool anywhere else in Core but after talking with Stephen I believe the change is safe. To avoid memory leaks I clear the reference to RegexCharClass when disposing.


// Return rented buffers
RegexFC[] fcSpan = ArrayPool<RegexFC>.Shared.Rent(StackBufferSize);
var fcStack = new ValueListBuilder<RegexFC>(fcSpan, clearArray: true);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is clearArray important? we almost never clear on returning.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the clearArray here is wrong as I do it manually some lines below but in general clearArray is important if you store managed objects with references to other objects (in our case RegexFC --> RegexCharClass). We usually use ArrayPool in combination with primitive types like int & char.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I accidently merged two commits, don't wonder :)

@danmoseley
Copy link
Copy Markdown
Member

danmoseley commented Feb 21, 2018

OK so far but please please next time you do a big refactoring keep each refactoring into its own commit. I don't care how small the commits are. Eg., move A to B, rename C to D, change E to struct, split F into partial classes G and H. The split into partial classes is particularly difficult to review because I would have to manually set up the diff between old and new files. I did not see anything much but doing it like that makes it possible to be much more confident and do the review FAR more quickly and accurately.

Ideally you would have the refactoring first and behavior changes in separate PR's. Since you've started this though, let's at least leave them in separate commits (ie, rebase not squash) so that we have some hope of bisecting if a problem emerges.

When refactoring good test coverage is critical. As we noted offline the block code coverage of regex is very good, much of the unhit blocks are debug only code. Of course by the nature of regexes, block coverage is hardly sufficient, but it could be worse.

@danmoseley
Copy link
Copy Markdown
Member

Great to see the improvements though!

@ViktorHofer
Copy link
Copy Markdown
Member Author

ViktorHofer commented Feb 22, 2018

Thanks Dan, I appreciate the feedback (as discussed offline) and will try to avoid this behavior in future. I will now continue with the next candidate and will open a new PR after this one is approved and merged (not squashed) into master.

{
return _length;
}
return Text.Substring(Index + Length, Text.Length - Index - Length);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about slicing rather than substringing? It looks like at least one of the callers then appends the result to a StringBuilder, so if you do it as a slice, you can avoid the intermediary string.

Copy link
Copy Markdown
Member Author

@ViktorHofer ViktorHofer Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks!

{
return _index;
}
return Text.Substring(0, Index);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same slicing comment/question.


int[] matches = _matches[groupnum];

return Text.Substring(matches[(c - 1) * 2], matches[(c * 2) - 1]);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same slicing comment/question.

/// </summary>
internal sealed class SharedReference
{
private WeakReference _ref = new WeakReference(null);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readonly

internal sealed class SharedReference
{
private WeakReference _ref = new WeakReference(null);
private int _locked;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_locked should be volatile

{
private RegexRunner _ref;
private object _obj;
private int _locked;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_locked should be volatile

@ViktorHofer
Copy link
Copy Markdown
Member Author

@stephentoub

The first commit is intended to just do refactoring / code styling stuff. The 2nd and 3rd commits are about performance of the RegexFCD and RegexFC types and about making RegexPrefix a struct.

I touched a lot of files in the first commit but intend to work on them in separate PRs as there are tons of possible improvements there. E.g. the slicing you pointed out.

internal object Get()
{
// try to obtain the lock
if (0 == Interlocked.Exchange(ref _locked, 1))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use a normal lock?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea. I think that change was made by @justinvp. Maybe he can answer?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't change it. This is how it was when the sources were added to corefx.

cc07fbf#diff-18063ba409511b10aec57135398e72d3R1056

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks and sorry for the noise!

// Create/rent buffers
Span<int> intSpan = stackalloc int[StackBufferSize];
var intStack = new ValueListBuilder<int>(intSpan);
RegexFC[] fcSpan = ArrayPool<RegexFC>.Shared.Rent(StackBufferSize);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're renting an array of StackBufferSize? I'd have thought we'd start with stack space of StackBufferSize. I guess we can't as these aren't primitives?

Copy link
Copy Markdown
Member Author

@ViktorHofer ViktorHofer Feb 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct. it's a managed type which can't be stackalloced. I don't know why these restriction even apply to ref structs that always live on the stack. maybe @AndyAyersMS can answer?

Copy link
Copy Markdown
Member

@jkotas jkotas Feb 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, stackalloc works for primitive/blitable types only.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stackalloc storage can't contain gc refs. Among other reasons, there is no mechanism for gc reporting from stackalloc buffers.

On the surface, the idea of allowing GC refs in a stackalloc buffer seems viable, if we can get sufficient safety guarantees. Say if the stackalloc is zero-init and immediately span wrapped and there is no safe way to view it as raw memory. With the right implementation (say a jit intrisnic Span ctor) the jit could know the type of the elements and create a new kind of gc info report for it. But we'd have to look into it in more detail to see if other problems crop up.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds interesting. Andy could you please open an issue for that? I think you could describe the matter far better than me.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/dotnet/csharplang/blob/master/proposals/fixed-sized-buffers.md and related issues are pretty much this. @VSadov has been working on this recently.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a proposal and a prototype where you could just say

RegexFC[StackBufferSize] fcSpan = default; 

... just use fcSpan  as an indexable array-like struct ...

In reply to: 170391575 [](ancestors = 170391575)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be useful in scenarios like this, but for now, I guess, array renting is a solution.


In reply to: 170393377 [](ancestors = 170393377,170391575)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supporting GC refs in span-wrapped stackallocs is similar to the fixed-sized buffer proposal but also seems slightly different, in perhaps interesting ways (?). It would allow for fixed or variable sized buffers and would be stack only, instead of supporting only fixed sized buffers but being storage agnostic. Would not require any language changes, just a new magic span constructor that the jit turns into stackalloc under the covers...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The fixed size arrays are a syntactical sugar for ordinary struct types of appropriate size. You can have locals, parameters, fields of those types. They can even nest.
GC-tracked stackalloc allows for variable number of elements, but values must be stack-only. That would be basically the tradeoff.


In reply to: 170400977 [](ancestors = 170400977)

private Span<T> _span;
private T[] _arrayFromPool;
private int _pos;
private readonly bool _clearArray;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not be making ValueListBuilder more expensive by adding cruft like this to it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I'm not using these changes here I will revert them.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already reverted the change but I wouldn't call that addition cruft. In cases where you want to use ValueListBuilder with types that need to cleaned up, the clearArray overload is handy. In my case I didn't use it as it is faster to partially clean it up instead of using Array.Clear.

Copy link
Copy Markdown
Member

@danmoseley danmoseley Feb 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point is it's not pay for play though, it makes everyone use more stack

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot really build common helper classes like these to support everything that one might potentially want to do. It kills their value. They need to stay focused on the mainstream use.

{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public T Pop()
public ref T Pop()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty dangerous. Very easy to make a mistake by using the value after it has been overwritten.

Copy link
Copy Markdown
Member Author

@ViktorHofer ViktorHofer Feb 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not using the Pop function with the ref keyword, I just wanted to add the possibility to do so. I understand the concerns. As I'm not using it either here, I will revert them.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do not actually needs this, this change should be undone.

@ViktorHofer
Copy link
Copy Markdown
Member Author

Thanks, I addressed feedback. PTAL/approve for merging.

@ViktorHofer
Copy link
Copy Markdown
Member Author

fixed failing tests

@ViktorHofer
Copy link
Copy Markdown
Member Author

@dotnet-bot test Linux x64 Release Build
@dotnet-bot test OSX x64 Debug Build

@ViktorHofer
Copy link
Copy Markdown
Member Author

@danmosemsft @stephentoub I think this PR is now ready to be merged. Please review/approve.

@ViktorHofer
Copy link
Copy Markdown
Member Author

ViktorHofer commented Feb 27, 2018

@dotnet/dnceng Not the first time that I notice that some/all legs mentioned in the previous comment restart when posting a new comment.

This leg restarted: https://ci3.dot.net/job/dotnet_corefx/job/master/job/osx-TGroup_netcoreapp+CGroup_Debug+AGroup_x64+TestOuter_false_prtest/9152/ after posting #27314 (comment)

@jonfortescue
Copy link
Copy Markdown
Contributor

@ViktorHofer Huh. That seems strange. We'll look into it.

if (0 != (anchors & Eol)) sb.Append(", Eol");
if (0 != (anchors & End)) sb.Append(", End");
if (0 != (anchors & EndZ)) sb.Append(", EndZ");
if (0 != (anchors & Beginning))
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VS did that change...

RegexFC child = PopFC();
RegexFC cumul = TopFC();
RegexFC child = FCPop();
RegexFC cumul = _fcStack[_fcStack.Count - 1];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change all of these away from the helper method?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I was trying a different data model I probably removed those. When I prepared the PR today for review today I noticed that I should revert these changes but was too lazy. Will do that now ;)

Copy link
Copy Markdown
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Do we have any tests that stress using regexes in parallel? If not, it'd be good to add some, now that we're using pooled objects.

@ViktorHofer
Copy link
Copy Markdown
Member Author

Thanks Stephen! Yes I have some parallel tests locally but we don't have any in corefx yet. I will create an issue to track that.

}

/// <summary>
/// Return rented buffers. Clear RegexFC buffer manually as only
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What code is the "Clear RegexFC buffer manually" referring to?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks.

@ViktorHofer ViktorHofer merged commit 7520e80 into dotnet:master Mar 1, 2018
@ViktorHofer ViktorHofer deleted the RegexRefactoring branch March 1, 2018 14:08
@karelz karelz added this to the 2.1.0 milestone Mar 10, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…otnet/corefx#27314)

* Change RegexPrefix to struct & use ValueListBuilder in RegexFCD

* Move ValueStringConstruction from the caller to the callee

* Store RegexFC in List<T>, better perf for RegexPrefix access


Commit migrated from dotnet/corefx@7520e80
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants