Conversation
Do you have any tests that show how well this works?
I'm still manually testing it. It works, but I'd love to have a real-world scenario to test against. Would also love suggestions...
Note that I have run CoreFX tests against this change.
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
I'll take a look tomorrow, but I'm including @vancem as he originally implemented the pinnable buffer. He may have comments.
Some more details on my thought process: I tried to come up with something that would eventually clear most of the buffers without creating a lot of callbacks or complicated logic (this attempt has one callback per data type). If you're heavily using a particular bucket stack, you should be able to both keep it stocked and not permanently rot items at the bottom of the stack. Given the size of the stacks, it should take around 8 Gen2 GCs to completely clear a completely idle stack. My assumption is that letting the frequency of Gen2 collections drive the speed of clearing was the best way to go. You can also manually force the pressure down by kicking GCs as a workaround if we're not being aggressive enough for a given end-user scenario. I'm also presuming there is no rational way to clear the thread locals, and that it is enough to reduce the long-term overhead to one buffer per bucket per data type (i.e. the thread local).
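The aging scheme described above can be sketched roughly like this (a simplified illustration only, not the PR's code; the class and member names here are hypothetical):

```csharp
using System;

// Hypothetical sketch: each Gen2 GC gives an idle stack one "strike";
// after enough strikes the top buffer is dropped, so a completely idle
// stack of depth N drains over roughly N Gen2 GCs.
internal sealed class TrimmableBufferStack<T>
{
    private readonly T[][] _arrays;
    private int _count;
    private int _idleGen2Count;

    public TrimmableBufferStack(int depth) => _arrays = new T[depth][];

    // Invoked from a Gen2 GC callback.
    public void OnGen2Gc()
    {
        // Only trim if the stack has sat untouched across a couple of GCs.
        if (++_idleGen2Count >= 2 && _count > 0)
        {
            _arrays[--_count] = null; // let the GC reclaim the buffer
        }
    }

    // Renting or returning resets the idle clock.
    public void MarkActive() => _idleGen2Count = 0;
}
```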
You can do something like what … Not saying you should do that, just pointing out an approach. You'd need to measure, but I'd be concerned about the potential perf impact to the fast path of renting/returning that first array.
Related, any measurements for how it affects rent/return performance?
Are we sure we want to do this? We had similar logic in our memory cache and ended up deleting it aspnet/Caching#221 (comment). Before we merge something like this we should performance test our ASP.NET scenarios to understand the impact.
So would I, which is why I didn't cross that bridge. Thanks for the pattern.
I ran through some benchmarks @stephentoub used and put them here: https://gist.github.com/JeremyKuhne/2b503df20e61f143341d32d8c247837f
It didn't have any impact (any variance was within the range of what I got within multiple runs of a specific implementation). (Run on a 4-core i7-4770K.) Sample before run:
Sample after run:
We want to do something. Customers are already falling over (dotnet/corefx#25841), and we expect that with our expanded ArrayPool usage this is going to get worse, to the point that we have to hotfix 2.1 if we don't. Expectation is that we'll do a more comprehensive change post 2.1 where we tie in closely/directly with the GC. This PR was my best first attempt at a targeted fix. :)
Yep. Can someone help with this? :)
You can associate
Yeah, that's a better approach. Although, looking at the implementation again now, the previously mentioned approach also wouldn't have indirections on the hot path... essentially the CWT is the previously proposed solution, just without the need for the finalizer to clean up, as that's the whole purpose of the CWT.
I'm not sure I'm following. Can you rough it out a bit? Are you suggesting I also keep a single ConditionalWeakTable and stick a reference to the array there as well as in the thread local (as the key)? (Sorry, a little under the weather, so I'm running a little slow.)
Right. That then gives you the ability to enumerate the CWT, where the keys are the T[][]s from each thread... you wouldn't be able to null out the T[][] on the thread, but you could null out the T[]s it contains.
Ah, ok, that was the mental block I had. I'll try this and see what difference it makes. One of the other assumptions I was making was that we would have a limited number of threads that never terminate over a long period, but that is probably naive.
Using the ConditionalWeakTable made things consistently a little slower, by up to 5-10% in some cases:
Here is essentially what I tried:

[ThreadStatic]
private static T[][] t_tlsBuckets;

private static ConditionalWeakTable<T[][], uint[]> s_tlsStashTimes = new ConditionalWeakTable<T[][], uint[]>();

// ... push times into the weak table every time we stash an array ...

public bool Trim()
{
    uint tickCount = (uint)Environment.TickCount;

    foreach (var bucket in _buckets)
    {
        bucket?.Trim(tickCount);
    }

    foreach (var tlsBuckets in s_tlsStashTimes)
    {
        T[][] buckets = tlsBuckets.Key;
        uint[] bucketTicks = tlsBuckets.Value;
        for (int i = 0; i < NumBuckets; i++)
        {
            uint stockTicks = bucketTicks[i];
            if (stockTicks > tickCount || (tickCount - stockTicks) > TrimAfterTicks)
            {
                // We've wrapped the tick count or elapsed enough time since the
                // first item went into this thread local bucket. Clear it.
                buckets[i] = null;
            }
        }
    }

    return true;
}

It is interesting that it was that dramatic in some cases, as the cleanup path never gets hit in these metrics.
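As an aside on the guard in the loop above: Environment.TickCount wraps (roughly every 49.7 days when viewed as a uint), and the predicate accounts for that. A hedged restatement of just that test (the ShouldTrim name is mine):

```csharp
using System;

// Restates the staleness test from the Trim loop above. Unsigned
// subtraction computes elapsed time correctly even across a single wrap;
// the explicit stashTicks > nowTicks test additionally treats anything
// stashed before a wrap as stale, which errs on the side of trimming.
static bool ShouldTrim(uint stashTicks, uint nowTicks, uint trimAfterTicks)
{
    return stashTicks > nowTicks || (nowTicks - stashTicks) > trimAfterTicks;
}
```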
What does this mean? It should only be adding to the CWT when the T[][] is first created for each thread, here:
coreclr/src/mscorlib/shared/System/Buffers/TlsOverPerCoreLockedStacksArrayPool.cs (line 183 in a346837)
Does this comment mean you're mutating the CWT on every return?
/// </summary>
internal sealed class Gen2GcCallback : CriticalFinalizerObject
{
    public Gen2GcCallback()
private?
I think it can be? This is just a straight copy of the existing class that was sitting in the PinnableBufferCache.cs file.
Yes, make it private. I wrote this code originally and it was a mistake not to make it private.
Done
T[] prev = tlsBuckets[bucketIndex];
tlsBuckets[bucketIndex] = array;
s_tlsStashTimes.GetValue(tlsBuckets, (T[][] key) => new uint[NumBuckets])[bucketIndex] = (uint)Environment.TickCount;
if (prev != null)
{
PerCoreLockedStacks bucket = _buckets[bucketIndex] ?? CreatePerCoreLockedStacks(bucketIndex);
bucket.TryPush(prev);
} I'm not sure how we'd age it properly without updating the timestamp every time we put something back. For the PerCoreLockedStacks I did it just for the bottom of the stack. I could keep one time for each set of buckets, but I don't think that would help much. It seems desirable to drop just the idle buckets in any case. I could also just track the "big" buckets? (>= 4K? I suspect that 4K is a big source of the real world weight, but maybe the PCLS cleanup is enough at that size?) |
@JeremyKuhne - It sounds like you are getting more than enough advice about implementation details, so I will refrain there. I have a question:
And a suggestion.
This is especially important since this is a problem we are seeing in the field (leakage), and we probably will need to iterate on the heuristics.
I just broke it into separate files so I could use Gen2GCCallback in my implementation. I put the files where they are supposed to be since I was touching them.
Yes, I'll add logging. Thanks for the pointers.
This should unblock "Devirtualize ArrayPool.Shared" #15743 (comment)?
namespace System
{
    [EventSource(Guid = "38ed3633-5e3f-5989-bf25-f0b1b3318c9b", Name = "Microsoft-DotNETRuntime-PinnableBufferCache-System")]
    internal sealed class PinnableBufferCacheEventSource : EventSource
This is also a nice-to-have change unrelated to the ArrayPool problem.
There is a perf implication here though. Do we really need to roll this one back?
- Nobody missed this tracing in CoreCLR.
- We do not have test coverage for it.
- Tracing-related changes have been the root cause of quite a few issues lately.
I can pull the content, but what about the explicit guid construction?
Didn't inherit from EventSource before, so would be no cost? (Presumably we would want to keep the Guid for 2.2 though?)
Didn't inherit from EventSource before so would be no cost?
Derp, missed that. :)
I have given it a quick look and it looks OK.
I think so
LGTM. Thanks!
I'll put the tests up in CoreFX tomorrow (and link to this). Thanks for all the input!
* Simple trim of ArrayPool buffers

  Trim ArrayPool buffers on Gen2 GC if the buffer stack hasn't been emptied for a while. If you haven't pulled all of the buffers in the past 10 seconds, let loose the top buffer on the stack and give the stack another 2 seconds of potential life. When the stack gets its bottom buffer returned, the clock resets.
* Collect thread locals as well
* Add event
* Incorporate memory pressure into trimming.

  Idea is that we normally give buckets a minute of age time unless the pressure starts to ramp up. As it ramps up we'll trim more off the stacks. If it gets really high we'll consider stale to be 10s instead of 1 min.
* Add implementation back for PinnableBufferCacheEventSource.
* Remove security attribute. Fix GetMemoryInfo signature
* Always use Tls* for shared pools. Add environment variable switch
* Add guid to PinnableBufferCacheEventSource
* Address feedback
  - move setting code to CLRConfig
  - add constructor to PBCES
  - trim large arrays more aggressively
  - tweak names (ticks to ms, etc.)
  - interlock creating the cleanup callback
  - fix project file

  Rent/return perf numbers are unchanged
* Remove static constructor. Inline Unsafe.SizeOf
* Fix spacing issue
* Trim all thread locals when memory pressure is high. Move constants inline.
* Undo formatting changes
* Add back the internal call
* Put the right bits back *sigh*
* Missing the line feed
* Add event for trim polling
* Undo PinnableBufferCacheEventSource reimplementation

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
@@ -48,6 +48,15 @@ internal sealed partial class TlsOverPerCoreLockedStacksArrayPool<T> : ArrayPool
[ThreadStatic]
private static T[][] t_tlsBuckets;

private int _callbackCreated;

private readonly static bool s_TrimBuffers = GetTrimBuffers();
Nit: s_trimBuffers
Fixing
/// <summary>
/// Used to keep track of all thread local buckets for trimming if needed
/// </summary>
private static readonly ConditionalWeakTable<T[][], object> s_AllTlsBuckets = s_TrimBuffers ? new ConditionalWeakTable<T[][], object>() : null;
Nit: s_allTlsBuckets
Fixing
s_AllTlsBuckets.Add(tlsBuckets, null);
if (Interlocked.Exchange(ref _callbackCreated, 1) != 1)
{
    Gen2GcCallback.Register(Gen2GcCallbackFunc, this);
I may have already asked this, but we're now going to end up with one of these Gen2 callback object things per T. We decided that's ok?
Gen2GcCallback is pretty efficient (it holds just the callback and a (weak) pointer to 'this'). The main 'overhead' is that the object is finalizable. You could imagine fixing that by sharing the finalizable object, but that is something that arguably should be done in the implementation of Gen2GcCallback.
I think it is OK to wait and see (if it is a problem we can fix it).
Also we may ditch Gen2GcCallback (since we really want to do this as we APPROACH a Gen2 GC, not after it), which is another reason not to invest a lot of effort in it now.
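For context, the resurrection trick Gen2GcCallback relies on can be sketched like this (a simplified illustration; the real class in PinnableBufferCache.cs differs in details, and the names below are partly hypothetical):

```csharp
using System;
using System.Runtime.ConstrainedExecution;
using System.Runtime.InteropServices;

// A finalizable object with no roots: after a GC collects it, the finalizer
// runs, invokes the callback, and re-registers the object for finalization
// so it fires again after a later GC. A weak handle to the target keeps the
// callback from rooting the pool itself.
internal sealed class Gen2GcCallbackSketch : CriticalFinalizerObject
{
    private readonly Func<object, bool> _callback;
    private GCHandle _weakTarget;

    private Gen2GcCallbackSketch(Func<object, bool> callback, object target)
    {
        _callback = callback;
        _weakTarget = GCHandle.Alloc(target, GCHandleType.Weak);
    }

    public static void Register(Func<object, bool> callback, object target) =>
        new Gen2GcCallbackSketch(callback, target); // alive only via the finalizer queue

    ~Gen2GcCallbackSketch()
    {
        object target = _weakTarget.Target;
        if (target != null && _callback(target))
        {
            GC.ReRegisterForFinalize(this); // survive until a future GC
        }
        else
        {
            _weakTarget.Free(); // target died or callback asked to stop
        }
    }
}
```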
if (log.IsEnabled() && buffer != null) | ||
{ | ||
log.BufferTrimmed(buffer.GetHashCode(), buffer.Length, Id); |
This might be wrong: it's possible a thread grabbed the buffer at the same time we did here, in which case we'll report having trimmed a buffer even though we didn't. I assume we're ok with that?
Yes, unless you can think of a way to do this without incurring a hit. I'll clarify in comments.
Actually, interlocked here isn't a bad idea since we're in cleanup.
Actually, interlocked here isn't a bad idea since we're in cleanup.
That might reduce the chances, but it won't remove them, unless you also used an interlocked on the rent path, which we don't want to do.
What do you think? Worth the interlocked for a slimmer chance of false positive?
Probably not, but if you wanted to, I'd suggest changing it to something like:
if (log.IsEnabled())
{
    T[] buffer = Interlocked.Exchange(ref buckets[i], null);
    if (buffer != null)
    {
        log.BufferTrimmed(buffer.GetHashCode(), buffer.Length, Id);
    }
}
else
{
    buckets[i] = null;
}
Or, actually, better yet:
if (log.IsEnabled())
{
    foreach (KeyValuePair<T[][], object> tlsBuckets in s_AllTlsBuckets)
    {
        T[][] buckets = tlsBuckets.Key;
        for (int i = 0; i < buckets.Length; i++)
        {
            T[] bucket = Interlocked.Exchange(ref buckets[i], null);
            if (bucket != null)
            {
                log.BufferTrimmed(bucket.GetHashCode(), bucket.Length, Id);
            }
        }
    }
}
else
{
    foreach (KeyValuePair<T[][], object> tlsBuckets in s_AllTlsBuckets)
    {
        T[][] buckets = tlsBuckets.Key;
        Array.Clear(buckets, 0, buckets.Length);
    }
}
foreach (KeyValuePair<T[][], object> tlsBuckets in s_AllTlsBuckets)
{
    T[][] buckets = tlsBuckets.Key;
    for (int i = 0; i < NumBuckets; i++)
Why NumBuckets rather than buckets.Length?
Using buckets.Length here would be faster (the bounds checks would be eliminated).
Changing
if (pressure == MemoryPressure.High)
{
    // Under high pressure, release all thread locals
This is a little counter-intuitive to me. The array stored in the TLS slot is going to be the most recently used one; presumably we're more interested in keeping that one around than ones in the per-core stacks. But maybe this is the best we can do right now?
Best for now. We didn't want to risk a perf hit from GetTicks or code complexity at this point.
// enabling/disabling for now.
return true;
#else
return CLRConfig.GetBoolValueWithFallbacks("System.Buffers.TrimBuffers", "DOTNET_SYSTEM_BUFFERS_TRIMBUFFERS", defaultValue: true);
Could we use a name that's specific to ArrayPool, or even to ArrayPool.Shared, rather than for the general System.Buffers namespace?
Sure
lock (this)
{
    uint trimTicks = pressure == MemoryPressure.High ? StackHighTrimAfterMS : StackTrimAfterMS;
Nit: this can be moved to before the lock
Moving
Trim ArrayPool buffers on Gen2 GC if the buffer stack hasn't been emptied for a while. If you haven't pulled all of the buffers in the past 10 seconds, let loose the top buffer on the stack and give the stack another 2 seconds of potential life. When the stack gets its bottom buffer returned, the clock resets.
This is taking cues from PinnableBufferCache, which has been split into separate files and tweaked for coding standards. Trying to introduce as little overhead/complexity as possible...
If this is a reasonable approach we may want to consider unifying on the Tls pool. I can potentially make a similar change to the other pool implementation.
cc: @danmosemsft