This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Simple trim of ArrayPool buffers #17078

Merged
merged 18 commits into dotnet:master on Apr 11, 2018

Conversation

JeremyKuhne
Member

@JeremyKuhne JeremyKuhne commented Mar 20, 2018

Trim ArrayPool buffers on Gen2 GC if the buffer stack hasn't been emptied for a while. If you haven't pulled all of the buffers in the past 10 seconds, let loose the top buffer on the stack and give the stack another 2 seconds of potential life. When the stack gets its bottom buffer returned, the clock resets.
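
For illustration, a minimal sketch of that heuristic for a single locked stack (the type, field, and parameter names here are hypothetical, not the actual pool internals):

// Hypothetical sketch of the per-stack trim heuristic described above; not the
// actual TlsOverPerCoreLockedStacksArrayPool internals. Tick-count wraparound
// is ignored for brevity.
internal sealed class LockedStackSketch<T>
{
    private readonly T[][] _arrays = new T[8][]; // pooled arrays used as a stack; capacity is arbitrary here
    private int _count;                          // arrays currently on the stack
    private uint _firstItemMS;                   // set when the bottom array is returned

    public void Push(T[] array, uint tickCount)
    {
        lock (this)
        {
            if (_count == 0)
                _firstItemMS = tickCount; // bottom buffer returned: the clock resets
            if (_count < _arrays.Length)
                _arrays[_count++] = array;
        }
    }

    // Called from a Gen2 GC callback.
    public void Trim(uint tickCount, uint trimAfterMS, uint extraLifeMS)
    {
        lock (this)
        {
            // Skip if empty, or if the stack has been fully drained recently enough.
            if (_count == 0 || tickCount - _firstItemMS <= trimAfterMS)
                return;

            // Let loose the top buffer and give the rest a bit more potential life.
            _arrays[--_count] = null;
            _firstItemMS += extraLifeMS;
        }
    }
}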

This is taking cues from PinnableBufferCache, which has been split into separate files and tweaked for coding standards. Trying to introduce as little overhead/complexity as possible...

If this is a reasonable approach we may want to consider unifying on the Tls pool. I can potentially make a similar change to the other pool implementation.

cc: @danmosemsft

@JeremyKuhne
Member Author

JeremyKuhne commented Mar 20, 2018

@jkotas jkotas requested a review from brianrob March 20, 2018 23:14
@jkotas
Member

jkotas commented Mar 20, 2018

Do you have any tests that show how well this works?

@JeremyKuhne
Member Author

Do you have any tests that show how well this works?

I'm still manually testing it. It works, but I'd love to have a real-world scenario to test against. Would also love suggestions...

@JeremyKuhne
Member Author

Note that I have run CoreFX tests against this change.

@JeremyKuhne
Member Author

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

16:57:42 FATAL: Remote call on JNLP4-connect connection from 131.107.19.193/131.107.19.193:49703 failed. The channel is closing down or has closed down
16:57:42 java.nio.channels.ClosedChannelException
16:57:42 	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
16:57:42 	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
16:57:42 	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)
16:57:42 	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
16:57:42 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
16:57:42 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
16:57:42 	at java.lang.Thread.run(Thread.java:748)

@Maoni0
Member

Maoni0 commented Mar 21, 2018

I'll take a look tomorrow, but including @vancem as he originally implemented the pinnable buffer. He may have comments.

@JeremyKuhne
Member Author

Some more details on my thought process:

I tried to come up with something that would eventually clear most of the buffers without creating a lot of callbacks or complicated logic (this attempt has one callback per data type). If you're heavily using a particular bucket stack, you should be able to both keep it stocked and not permanently rot items at the bottom of the stack. Given the size of the stacks, it should take around 8 Gen2 GCs to fully clear a completely idle stack. My assumption is that letting the frequency of Gen2 collections drive the speed of clearing is the best way to go. You can also force the pressure down by manually kicking off GCs as a workaround if we're not being aggressive enough for a given end-user scenario.

I'm also presuming there is no rational way to clear the thread locals and that it is enough to reduce the long-term overhead to one buffer per bucket per data type (e.g. the thread local).

@stephentoub
Member

I'm also presuming there is no rational way to clear the thread locals

You can do something like what ThreadLocal<T> does: rather than storing the array directly in the [ThreadStatic] slot, you store a wrapper object that's also a node in a linked list or an entry in a dictionary or some such thing, which lets you clear out the wrapped slot. But then you also need to be able to remove the entry from the list/dictionary when the thread goes away, which necessitates some other finalizable object in a ThreadStatic that can remove the entry when its finalizer runs.

Not saying you should do that, just pointing out an approach. You'd need to measure, but I'd be concerned about the potential perf impact to the fast path of renting/returning that first array.
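
A rough sketch of that wrapper pattern, using hypothetical names (nothing here is from the PR):

// Sketch of the pattern described above (all names are hypothetical). The
// [ThreadStatic] slot holds a wrapper rather than the array itself, a shared
// table tracks every wrapper so its contents can be cleared from another
// thread, and a finalizable companion removes the table entry when the thread
// goes away.
using System;
using System.Collections.Concurrent;

internal sealed class ClearableThreadLocal<T> where T : class
{
    private static readonly ConcurrentDictionary<ClearableThreadLocal<T>, byte> s_entries =
        new ConcurrentDictionary<ClearableThreadLocal<T>, byte>();

    [ThreadStatic] private static ClearableThreadLocal<T> t_wrapper;
    [ThreadStatic] private static Cleaner t_cleaner; // kept alive for the thread's lifetime

    public T Value;

    public static ClearableThreadLocal<T> GetForCurrentThread()
    {
        ClearableThreadLocal<T> wrapper = t_wrapper;
        if (wrapper == null)
        {
            t_wrapper = wrapper = new ClearableThreadLocal<T>();
            t_cleaner = new Cleaner(wrapper);
            s_entries.TryAdd(wrapper, 0);
        }
        return wrapper;
    }

    // Clears the wrapped values on all threads without touching their [ThreadStatic] slots.
    public static void ClearAll()
    {
        foreach (ClearableThreadLocal<T> wrapper in s_entries.Keys)
            wrapper.Value = null;
    }

    private sealed class Cleaner
    {
        private readonly ClearableThreadLocal<T> _wrapper;
        public Cleaner(ClearableThreadLocal<T> wrapper) => _wrapper = wrapper;

        // Runs after the owning thread dies and its thread statics become unreachable.
        ~Cleaner() => s_entries.TryRemove(_wrapper, out _);
    }
}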

@stephentoub
Member

Do you have any tests that show how well this works?

Related, any measurements for how it affects rent/return performance?

@davidfowl
Member

Are we sure we want to do this? We had similar logic in our memory cache and ended up deleting it: aspnet/Caching#221 (comment).

Before we merge something like this we should performance test our ASP.NET scenarios to understand the impact.

@JeremyKuhne
Member Author

I'd be concerned potentially about the perf impact to the fast path of renting/returning that first array.

So would I, which is why I didn't cross that bridge. Thanks for the pattern.

Related, any measurements for how it affects rent/return performance?

I ran through some benchmarks @stephentoub used and put them here: https://gist.github.com/JeremyKuhne/2b503df20e61f143341d32d8c247837f It didn't have any impact (any variance was within the range of what I got across multiple runs of a specific implementation). (Run on a 4-core i7-4770K.)

Sample before run:

 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:1   , NoYield    ) => Time: 00:00:00.1080318, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:1   , LocalYield ) => Time: 00:00:00.3951266, GC0/1/2:    49 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:1   , GlobalYield) => Time: 00:00:00.3813456, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:256 , NoYield    ) => Time: 00:00:01.5944227, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:256 , LocalYield ) => Time: 00:00:02.0088442, GC0/1/2:    49 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:256 , GlobalYield) => Time: 00:00:01.8930700, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:4096, NoYield    ) => Time: 00:00:23.9250718, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:4096, LocalYield ) => Time: 00:00:23.9242211, GC0/1/2:    49 /     2 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:4096, GlobalYield) => Time: 00:00:23.7578841, GC0/1/2:    24 /     2 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:1   , NoYield    ) => Time: 00:00:00.0494962, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:1   , LocalYield ) => Time: 00:00:00.1786280, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:1   , GlobalYield) => Time: 00:00:00.1839980, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:256 , NoYield    ) => Time: 00:00:00.8073318, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:256 , LocalYield ) => Time: 00:00:00.9102544, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:256 , GlobalYield) => Time: 00:00:00.9044040, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:4096, NoYield    ) => Time: 00:00:11.9661146, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:4096, LocalYield ) => Time: 00:00:12.1780141, GC0/1/2:    24 /     1 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:4096, GlobalYield) => Time: 00:00:11.8455675, GC0/1/2:    12 /     1 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:1   , NoYield    ) => Time: 00:00:00.0184491, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:1   , LocalYield ) => Time: 00:00:00.0761201, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:1   , GlobalYield) => Time: 00:00:00.0894149, GC0/1/2:     6 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:256 , NoYield    ) => Time: 00:00:00.3901609, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:256 , LocalYield ) => Time: 00:00:00.5147964, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:256 , GlobalYield) => Time: 00:00:00.5077650, GC0/1/2:     6 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:4096, NoYield    ) => Time: 00:00:05.9948266, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:4096, LocalYield ) => Time: 00:00:06.0055983, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:4096, GlobalYield) => Time: 00:00:05.9375291, GC0/1/2:     6 /     0 /   0

Sample after run:

 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:1   , NoYield    ) => Time: 00:00:00.1195675, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:1   , LocalYield ) => Time: 00:00:00.3908064, GC0/1/2:    49 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:1   , GlobalYield) => Time: 00:00:00.3978614, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:256 , NoYield    ) => Time: 00:00:01.6054813, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:256 , LocalYield ) => Time: 00:00:01.7660182, GC0/1/2:    49 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:256 , GlobalYield) => Time: 00:00:01.6995702, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:4096, NoYield    ) => Time: 00:00:23.9186584, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:4096, LocalYield ) => Time: 00:00:24.1897661, GC0/1/2:    49 /     2 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:4096, GlobalYield) => Time: 00:00:23.6598582, GC0/1/2:    24 /     2 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:1   , NoYield    ) => Time: 00:00:00.0489640, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:1   , LocalYield ) => Time: 00:00:00.2111178, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:1   , GlobalYield) => Time: 00:00:00.1973525, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:256 , NoYield    ) => Time: 00:00:00.8012958, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:256 , LocalYield ) => Time: 00:00:00.9155817, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:256 , GlobalYield) => Time: 00:00:00.9124632, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:4096, NoYield    ) => Time: 00:00:11.9451898, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:4096, LocalYield ) => Time: 00:00:11.9470689, GC0/1/2:    24 /     1 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:4096, GlobalYield) => Time: 00:00:12.0564793, GC0/1/2:    12 /     1 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:1   , NoYield    ) => Time: 00:00:00.0180293, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:1   , LocalYield ) => Time: 00:00:00.1049536, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:1   , GlobalYield) => Time: 00:00:00.1014732, GC0/1/2:     6 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:256 , NoYield    ) => Time: 00:00:00.3998225, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:256 , LocalYield ) => Time: 00:00:00.4733777, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:256 , GlobalYield) => Time: 00:00:00.4913669, GC0/1/2:     6 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:4096, NoYield    ) => Time: 00:00:06.0523083, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:4096, LocalYield ) => Time: 00:00:05.8928103, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:4096, GlobalYield) => Time: 00:00:05.9300920, GC0/1/2:     6 /     0 /   0

Are we sure we want to do this?

We want to do something. Customers are already falling over (dotnet/corefx#25841), and we expect that with our expanded ArrayPool usage this is going to get worse, to the point that we'd have to hotfix 2.1 if we don't. The expectation is that we'll do a more comprehensive change post-2.1 where we tie in closely/directly with the GC. This PR was my best first attempt at a targeted fix. :)

Before we merge something like this we should performance test our ASP.NET scenarios to understand the impact.

Yep. Can someone help with this? :)

@jkotas
Member

jkotas commented Mar 21, 2018

I'm also presuming there is no rational way to clear the thread locals

You can associate T[][] t_tlsBuckets with extra state using ConditionalWeakTable. It will give you:

  • No extra indirections on the hot path
  • A way to enumerate all cached arrays (ConditionalWeakTable has a nice lock-free enumerator)
  • A space to store extra bookkeeping information (Value in ConditionalWeakTable)

@stephentoub
Member

stephentoub commented Mar 21, 2018

You can associate T[][] t_tlsBuckets with extra state using ConditionalWeakTable

Yeah, that's a better approach. Although, looking at the implementation again now, the previously mentioned approach also wouldn't have indirections on the hot path... essentially the CWT is the previously proposed solution, just without the need for the finalizer to clean up, as that's the whole purpose of the CWT.

@JeremyKuhne
Member Author

JeremyKuhne commented Mar 21, 2018

You can associate T[][] t_tlsBuckets with extra state using ConditionalWeakTable

Yeah, that's a better approach.

I'm not sure I'm following; can you rough it out a bit? Are you suggesting I also keep a single ConditionalWeakTable and stick a reference to the array there as well as in the thread local (as the key)? (Sorry, a little under the weather, so I'm running a little slow.)

@stephentoub
Member

Are you suggesting I also keep a single ConditionalWeakTable and stick a reference to the array there as well as in the thread local (as the key)?

Right. That then gives you the ability to enumerate the CWT, where the keys are the T[][]s from each thread... you wouldn't be able to null out the T[][] on the thread, but you could null out the T[]s it contains.

@JeremyKuhne
Member Author

the keys are the T[][]s

Ah, ok, that was the mental block I had. I'll try this and see what difference it makes.

One of the other assumptions I was making was that we would have a limited number of threads that never terminate over a long period, but that is probably naive.

@JeremyKuhne
Member Author

Using the ConditionalWeakTable made things consistently a little slower, by up to 5-10% in some cases:

 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:1   , NoYield    ) => Time: 00:00:00.1760916, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:1   , LocalYield ) => Time: 00:00:00.4494561, GC0/1/2:    49 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:1   , GlobalYield) => Time: 00:00:00.4065898, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:256 , NoYield    ) => Time: 00:00:01.6498429, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:256 , LocalYield ) => Time: 00:00:02.0986796, GC0/1/2:    49 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:256 , GlobalYield) => Time: 00:00:01.9819159, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:4096, NoYield    ) => Time: 00:00:23.9929715, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:4096, LocalYield ) => Time: 00:00:24.3178888, GC0/1/2:    49 /     2 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:4 ,Work/Arr:4096, GlobalYield) => Time: 00:00:24.2158338, GC0/1/2:    24 /     2 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:1   , NoYield    ) => Time: 00:00:00.0850347, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:1   , LocalYield ) => Time: 00:00:00.2207018, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:1   , GlobalYield) => Time: 00:00:00.2217390, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:256 , NoYield    ) => Time: 00:00:00.8228801, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:256 , LocalYield ) => Time: 00:00:00.9999651, GC0/1/2:    24 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:256 , GlobalYield) => Time: 00:00:00.9657191, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:4096, NoYield    ) => Time: 00:00:12.1093532, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:4096, LocalYield ) => Time: 00:00:12.7586391, GC0/1/2:    24 /     1 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:2 ,Work/Arr:4096, GlobalYield) => Time: 00:00:12.3465542, GC0/1/2:    12 /     1 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:1   , NoYield    ) => Time: 00:00:00.0257345, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:1   , LocalYield ) => Time: 00:00:00.1541054, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:1   , GlobalYield) => Time: 00:00:00.0953629, GC0/1/2:     6 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:256 , NoYield    ) => Time: 00:00:00.3994841, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:256 , LocalYield ) => Time: 00:00:00.5433442, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:256 , GlobalYield) => Time: 00:00:00.4899406, GC0/1/2:     6 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:4096, NoYield    ) => Time: 00:00:06.1813209, GC0/1/2:     0 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:4096, LocalYield ) => Time: 00:00:06.0829667, GC0/1/2:    12 /     0 /   0
 TlsOverPerCoreLockedStacksArrayPool`1(#Arr:1 ,Work/Arr:4096, GlobalYield) => Time: 00:00:06.0007151, GC0/1/2:     6 /     0 /   0

Here is essentially what I tried:

        [ThreadStatic]
        private static T[][] t_tlsBuckets;
        private static ConditionalWeakTable<T[][], uint[]> s_tlsStashTimes = new ConditionalWeakTable<T[][], uint[]>();

        // ... push times into the weak table every time we stash an array ...

        public bool Trim()
        {
            uint tickCount = (uint)Environment.TickCount;

            foreach (var bucket in _buckets)
            {
                bucket?.Trim(tickCount);
            }

            foreach (var tlsBuckets in s_tlsStashTimes)
            {
                T[][] buckets = tlsBuckets.Key;
                uint[] bucketTicks = tlsBuckets.Value;

                for (int i = 0; i < NumBuckets; i++)
                {
                    uint stockTicks = bucketTicks[i];
                    if (stockTicks > tickCount || (tickCount - stockTicks) > TrimAfterTicks)
                    {
                        // We've wrapped the tick count or elapsed enough time since the
                        // first item went into this thread local bucket. Clear it.
                        buckets[i] = null;
                    }
                }
            }

            return true;
        }       

It is interesting that the impact was that dramatic in some cases, given that the cleanup path never gets hit in these benchmarks.

@stephentoub
Copy link
Member

// ... push times into the weak table every time we stash an array ...

What does this mean? It should only be adding to the CWT when the T[][] is first created for each thread here:

t_tlsBuckets = tlsBuckets = new T[NumBuckets][];

Does this comment mean you're mutating the CWT on every return?

/// </summary>
internal sealed class Gen2GcCallback : CriticalFinalizerObject
{
public Gen2GcCallback()
Member

private?

Member Author

I think it can be? This is just a straight copy of the existing class that was sitting in the PinnableBufferCache.cs file.


Yes, make it private. I wrote this code originally, and it was a mistake not to make it private.

Member Author

Done

@JeremyKuhne
Member Author

Does this comment mean you're mutating the CWT on every return?

 T[] prev = tlsBuckets[bucketIndex];
 tlsBuckets[bucketIndex] = array;
 s_tlsStashTimes.GetValue(tlsBuckets, (T[][] key) => new uint[NumBuckets])[bucketIndex] = (uint)Environment.TickCount;
 if (prev != null)
 {
     PerCoreLockedStacks bucket = _buckets[bucketIndex] ?? CreatePerCoreLockedStacks(bucketIndex);
     bucket.TryPush(prev);
 }

I'm not sure how we'd age it properly without updating the timestamp every time we put something back. For the PerCoreLockedStacks I did it just for the bottom of the stack.

I could keep one time for each set of buckets, but I don't think that would help much. It seems desirable to drop just the idle buckets in any case. I could also just track the "big" buckets? (>= 4K? I suspect that 4K is a big source of the real world weight, but maybe the PCLS cleanup is enough at that size?)

@vancem

vancem commented Mar 21, 2018

@JeremyKuhne - It sounds like you are getting more than enough advice about implementation details, so I will refrain there.

I have a question:

  • I don't actually understand what is going on with the PinnableBufferCache.cs code; I see you pulled out selected pieces. Why pull this file in? It is effectively unrelated cleanup (because it is no longer needed in CoreFX?).

And a suggestion.

  • ArrayPools already have an EventSource. You should be logging things. If the PinnableBufferCache taught me anything, it is that dynamic systems can display very surprising behavior, and without some sort of detailed logging available, you are screwed. Since all the new logic only comes into play rarely, we should be generous with our logging. These really should be on a separate EventSource keyword so you can turn on just the 'slow' events without turning on the 'expensive' ones (the rent/release); see the sketch below.

This is especially important since this is a problem we are seeing in the field (leakage), and we probably will need to iterate on the heuristics.
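
As a rough illustration of the keyword split being suggested (the names and event IDs here are made up, not the actual ArrayPool EventSource):

// Illustrative only: a separate keyword lets listeners enable the rare trim
// events without also turning on the hot-path rent/return events.
using System.Diagnostics.Tracing;

[EventSource(Name = "Example-ArrayPoolDemo")]
internal sealed class ArrayPoolDemoEventSource : EventSource
{
    public static readonly ArrayPoolDemoEventSource Log = new ArrayPoolDemoEventSource();

    public static class Keywords
    {
        public const EventKeywords RentReturn = (EventKeywords)0x1; // expensive, high-volume events
        public const EventKeywords Trim       = (EventKeywords)0x2; // rare 'slow path' events
    }

    [Event(1, Level = EventLevel.Verbose, Keywords = Keywords.RentReturn)]
    public void BufferRented(int bufferId, int bufferSize) =>
        WriteEvent(1, bufferId, bufferSize);

    [Event(2, Level = EventLevel.Informational, Keywords = Keywords.Trim)]
    public void BufferTrimmed(int bufferId, int bufferSize, int poolId) =>
        WriteEvent(2, bufferId, bufferSize, poolId);
}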

@JeremyKuhne
Member Author

I don't actually understand what is going on with the PinnableBufferCache.cs code; I see you pulled out selected pieces. Why pull this file in? It is effectively unrelated cleanup (because it is no longer needed in CoreFX?).

I just broke it into separate files so I could use Gen2GCCallback in my implementation. I put the files where they are supposed to be since I was touching them.

You should be logging things.

Yes, I'll add logging. Thanks for the pointers.

@benaadams
Member

benaadams commented Apr 10, 2018

This should unblock "Devirtualize ArrayPool.Shared" #15743 (comment)?

namespace System
{
[EventSource(Guid = "38ed3633-5e3f-5989-bf25-f0b1b3318c9b", Name = "Microsoft-DotNETRuntime-PinnableBufferCache-System")]
internal sealed class PinnableBufferCacheEventSource : EventSource
Member

This is also nice-to-have change unrelated to the ArrayPool problem.

Member Author

There is a perf implication here though: do we really need to roll this one back?

Member

@jkotas jkotas Apr 10, 2018

  • Nobody missed this tracing in CoreCLR.
  • We do not have test coverage for it.
  • Tracing-related changes have been the root cause of quite a few issues lately.

Member Author

I can pull the content, but what about the explicit guid construction?

Member

Didn't inherit from EventSource before so would be no cost? (Presumably would want to keep Guid for 2.2 though?)

Member Author

Didn't inherit from EventSource before so would be no cost?

Derp, missed that. :)

@vancem

vancem commented Apr 10, 2018

I have given it a quick look and it looks OK.

@JeremyKuhne
Member Author

This should unblock "Devirtualize ArrayPool.Shared" #15743 (comment)?

I think so

Member

@jkotas jkotas left a comment

LGTM. Thanks!

@JeremyKuhne JeremyKuhne merged commit a2508f0 into dotnet:master Apr 11, 2018
@JeremyKuhne
Member Author

I'll put the tests up in CoreFX tomorrow (and link to this). Thanks for all the input!

dotnet-bot pushed a commit to dotnet/corert that referenced this pull request Apr 11, 2018
* Simple trim of ArrayPool buffers

Trim ArrayPool buffers on Gen2 GC if the buffer stack hasn't been emptied for a while. If you haven't pulled all of the buffers in the past 10 seconds, let loose the top buffer on the stack and give the stack another 2 seconds of potential life. When the stack gets its bottom buffer returned, the clock resets.

* Collect thread locals as well

* Add event

* Incorporate memory pressure into trimming.

Idea is that we normally give buckets a minute of age time unless the pressure starts to ramp up. As it ramps up we'll trim more off the stacks. If it gets really high we'll consider stale to be 10s instead of 1 min.

* Add implementation back for PinnableBufferCacheEventSource.

* Remove security attribute.
Fix GetMemoryInfo signature

* Always use Tls* for shared pools
Add environment variable switch

* Add guid to PinnableBufferCacheEventSource

* Address feedback

- move setting code to CLRConfig
- add constructor to PBCES
- trim large arrays more aggressively
- tweak names (ticks to ms, etc.)
- interlock creating the cleanup callback
- fix project file

Rent/return perf numbers are unchanged

* Remove static constructor
Inline Unsafe.SizeOf

* Fix spacing issue

* Trim all thread locals when memory pressure is high.
Move constants inline.

* Undo formatting changes

* Add back the internal call

* Put the right bits back *sigh*

* Missing the line feed

* Add event for trim polling

* Undo PinnableBufferCacheEventSource reimplementation

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jkotas pushed a commit to dotnet/corert that referenced this pull request Apr 11, 2018
@@ -48,6 +48,15 @@ internal sealed partial class TlsOverPerCoreLockedStacksArrayPool<T> : ArrayPool
[ThreadStatic]
private static T[][] t_tlsBuckets;

private int _callbackCreated;

private readonly static bool s_TrimBuffers = GetTrimBuffers();
Member

Nit: s_trimBuffers

Member Author

Fixing

/// <summary>
/// Used to keep track of all thread local buckets for trimming if needed
/// </summary>
private static readonly ConditionalWeakTable<T[][], object> s_AllTlsBuckets = s_TrimBuffers ? new ConditionalWeakTable<T[][], object>() : null;
Member

Nit: s_allTlsBuckets

Member Author

Fixing

s_AllTlsBuckets.Add(tlsBuckets, null);
if (Interlocked.Exchange(ref _callbackCreated, 1) != 1)
{
Gen2GcCallback.Register(Gen2GcCallbackFunc, this);
Member

I may have already asked this, but we're now going to end up with one of these Gen2 callback object things per T. We decided that's ok?


Gen2GcCallback is pretty efficient (it holds just the callback and a (weak) pointer to 'this'). The main 'overhead' is that the object is finalizable. You could imagine fixing that by sharing the finalizable object, but that is something that arguably should be done in the implementation of Gen2GcCallback.

I think it is OK to wait and see (if it is a problem we can fix it).

Also, we may ditch Gen2GcCallback (since we really want to do this as we APPROACH a Gen2 GC, not after it), which is another reason not to invest a lot of effort in it now.
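
For readers unfamiliar with the mechanism: a simplified sketch of how a Gen2 GC callback object like this works (the real Gen2GcCallback also holds a weak GCHandle to its target; this version is trimmed for illustration):

// Simplified sketch of the Gen2 GC callback pattern discussed above.
using System;
using System.Runtime.ConstrainedExecution;

internal sealed class Gen2GcCallbackSketch : CriticalFinalizerObject
{
    private readonly Func<bool> _callback;

    private Gen2GcCallbackSketch(Func<bool> callback) => _callback = callback;

    // Intentionally unrooted: the only way this object is ever "seen" again is
    // when the finalizer thread runs after a GC that collected it. Once it has
    // been promoted to gen2, that effectively means after gen2 collections.
    public static void Register(Func<bool> callback) => new Gen2GcCallbackSketch(callback);

    ~Gen2GcCallbackSketch()
    {
        // Invoke the callback; if it wants more notifications, resurrect by
        // re-registering for finalization so we run again after a later GC.
        if (_callback())
        {
            GC.ReRegisterForFinalize(this);
        }
    }
}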


if (log.IsEnabled() && buffer != null)
{
log.BufferTrimmed(buffer.GetHashCode(), buffer.Length, Id);
Member

@stephentoub stephentoub Apr 11, 2018

This might be wrong: it's possible a thread grabbed the buffer at the same time we did here, in which case we'll report having trimmed a buffer even though we didn't. I assume we're ok with that?

Member Author

Yes, unless you can think of a way to do this without incurring a hit. I'll clarify in comments.

Member Author

Actually, interlocked here isn't a bad idea since we're in cleanup.

Member

Actually, interlocked here isn't a bad idea since we're in cleanup.

That might reduce the chances, but it won't remove them, unless you also used an interlocked on the rent path, which we don't want to do.

Member Author

What do you think? Worth the interlocked for a slimmer chance of false positive?

Member

@stephentoub stephentoub Apr 11, 2018

Probably not, but if you wanted to, I'd suggest changing it to something like:

if (log.IsEnabled())
{
    T[] buffer = Interlocked.Exchange(ref buckets[i], null);
    if (buffer != null)
    {
        log.BufferTrimmed(buffer.GetHashCode(), buffer.Length, Id);
    }
}
else
{
    buckets[i] = null;
}

Member

Or, actually, better yet:

if (log.IsEnabled())
{
    foreach (KeyValuePair<T[][], object> tlsBuckets in s_AllTlsBuckets)
    {
        T[][] buckets = tlsBuckets.Key;
        for (int i = 0; i < buckets.Length; i++)
        {
            T[] bucket = Interlocked.Exchange(ref buckets[i], null);
            if (bucket != null)
            {
                log.BufferTrimmed(bucket.GetHashCode(), bucket.Length, Id);
            }
        }
    }
}
else
{
    foreach (KeyValuePair<T[][], object> tlsBuckets in s_AllTlsBuckets)
    {
         T[][] buckets = tlsBuckets.Key;
         Array.Clear(buckets, 0, buckets.Length);
    }
}

foreach (KeyValuePair<T[][], object> tlsBuckets in s_AllTlsBuckets)
{
T[][] buckets = tlsBuckets.Key;
for (int i = 0; i < NumBuckets; i++)
Member

Why NumBuckets rather than buckets.Length?

Member

Using buckets.Length here would be faster (the bounds checks would be eliminated)
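
For illustration (a general JIT pattern, not code from this PR), comparing the loop index against the array's own Length lets the JIT prove the access is in range and drop the per-iteration bounds check:

T[][] buckets = tlsBuckets.Key;
for (int i = 0; i < buckets.Length; i++)  // bounds check on buckets[i] can be elided
{
    buckets[i] = null;
}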

Member Author

Changing


if (pressure == MemoryPressure.High)
{
// Under high pressure, release all thread locals
Member

This is a little counter-intuitive to me. The array stored in the TLS slot is going to be the most recently used one; presumably we're more interested in keeping that one around than ones in the per-core stacks. But maybe this is the best we can do right now?

Member Author

Best for now. We didn't want to risk a perf hit from GetTicks or code complexity at this point.

// enabling/disabling for now.
return true;
#else
return CLRConfig.GetBoolValueWithFallbacks("System.Buffers.TrimBuffers", "DOTNET_SYSTEM_BUFFERS_TRIMBUFFERS", defaultValue: true);
Member

Could we use a name that's specific to ArrayPool, or even to ArrayPool.Shared, rather than for the general System.Buffers namespace?

Member Author

Sure


lock (this)
{
uint trimTicks = pressure == MemoryPressure.High ? StackHighTrimAfterMS : StackTrimAfterMS;
Member

Nit: this can be moved to before the lock

Member Author

Moving

dotnet-bot pushed a commit to dotnet/corefx that referenced this pull request Apr 11, 2018
Anipik pushed a commit to dotnet/corefx that referenced this pull request Apr 11, 2018
dotnet-bot pushed a commit to dotnet/corefx that referenced this pull request Apr 11, 2018