
Perf followup for Pinned Object Heap #34215

Merged · 6 commits · Apr 1, 2020

Conversation

VSadov (Member) commented Mar 27, 2020:

A follow-up for items tracked by #13739

Verified, adjusted, or otherwise addressed the following items:

  • NUM_POH_ALIST
    Since POH covers the whole possible range of object sizes, we need a few more buckets than in SOH or LOH, to cover the combined SOH and LOH range (see the sketch after this list).
  • static_data static_data_table
    POH behaves similarly to LOH here. For example, it is not very likely that a GC will reduce the generation size (since there is no compaction and objects are relatively long-lived).
    The LOH tuning seems reasonable enough.
  • initial allocation sizes (also min segment size)
    I did not see obvious reasons for these to differ from LOH.
    We could possibly be ok with a lower initial/minimum segment size, but that makes very little difference, since this is not about committed size; we commit in smaller chunks (2 OS pages) anyway.
  • size_t gc_heap::get_uoh_seg_size (size_t size)
    Same as above.
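
To make the bucket-count reasoning concrete, here is a rough C# sketch (an illustration only, not the runtime's actual code; the base bucket size and the maximum size to cover are assumed values): with power-of-two buckets and a final catch-all, covering a wider size range simply requires a few more buckets.

    using System;
    using System.Numerics;

    class BucketCountSketch
    {
        // Assumed values for illustration; the real constants (NUM_POH_ALIST,
        // BASE_LOH_ALIST, ...) live in gc.cpp and may differ.
        const ulong BaseBucketSize = 64 * 1024;     // bucket 0 holds sizes below this
        const ulong MaxCoveredSize = 1UL << 34;     // assumed largest size to distinguish

        // Bucket 0 covers sizes below the base, each following bucket doubles the
        // range it covers, and the last bucket catches everything larger, so the
        // count grows by one per power of two between the base and the maximum.
        static int BucketsNeeded(ulong maxSize, ulong baseSize) =>
            (int)(BitOperations.Log2(maxSize) - BitOperations.Log2(baseSize)) + 2;

        static void Main()
        {
            Console.WriteLine(BucketsNeeded(MaxCoveredSize, BaseBucketSize));   // wide, combined SOH+LOH-like range
            Console.WriteLine(BucketsNeeded(1UL << 26, BaseBucketSize));        // narrower, LOH-like range
        }
    }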

Also looked in the debugger and VTune for any obvious perf issues.
Apart from a few minor tweaks, it all looked as expected.


// skip buckets that cannot possibly fit "size" and return the next one
// there is always such bucket since the last one fits everything
unsigned int first_suitable_bucket(size_t size)
Member commented:

have you actually observed any perf difference doing this? I doubt it.

VSadov (Member, Author) commented Mar 27, 2020:

On a microbenchmark doing a lot of POH allocations this becomes noticeable. I saw about a 10% improvement from these changes.

The hot spots that I saw: taking the msl lock, iterating through buckets, and splitting a free chunk in two (when the chunk is big, make_unused_array may cause cache misses).
The others are hard to avoid or improve, though. This one was fairly easy.
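
For context on the bucket-iteration hot spot, here is a C# sketch of the general idea (not the actual gc.cpp change; the base size and bucket count are assumptions): the first bucket that could possibly hold a fitting item can be computed directly from the highest set bit of the requested size instead of testing buckets one by one.

    using System;
    using System.Numerics;

    static class FirstSuitableBucketSketch
    {
        // Assumed values for illustration; the real constants live in gc.cpp.
        const int BaseBucketBits = 16;   // bucket 0 holds sizes below 64 KB == 1 << 16
        const int NumBuckets = 12;

        // Skip buckets whose items are all too small to fit "size" and return the
        // next one; the last bucket is a catch-all, so the index is clamped to it.
        static int FirstSuitableBucket(ulong size)
        {
            int highestBit = BitOperations.Log2(size | 1);   // "| 1" guards against size == 0
            int bucket = Math.Max(0, highestBit - BaseBucketBits + 1);
            return Math.Min(bucket, NumBuckets - 1);
        }

        static void Main()
        {
            foreach (ulong size in new ulong[] { 1000, 64 * 1024, 100 * 1024, 10_000_000 })
                Console.WriteLine($"{size,10} -> bucket {FirstSuitableBucket(size)}");
        }
    }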

Maoni0 (Member) commented Mar 27, 2020:

Can you share your benchmark? What sort of perf testing was done on this in general? Are you running the default perf scenarios with the infra? This new scenario should be added to gcperfsim.


VSadov (Member, Author) commented:

An example:

    private static void AllocateUninitializedPinned()
    {
        System.Console.Write("AllocateUninitializedPinned: ");
        var r = new Random(1234);
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iters; i++)
        {
            int size = r.Next(2, 10000);
            var arr = GC.AllocateUninitializedArray<long>(size, pinned: true);
            Add(arr);   // no-op when not retaining; stores into a static array when retaining
        }

        sw.Stop();
        System.Console.WriteLine(sw.ElapsedMilliseconds);
    }

I had a variety of these: not retaining the arrays, retaining them until after the iteration is over, or retaining them randomly in a static array, like:

    private static void Add(long[] arr)
    {
        if (r.Next(100) > 50)
            staticArray[arr.Length & 2047] = arr;
    }
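
(The snippets above reference a few members that were not shown; something like the following declarations is assumed so they compile, with values chosen arbitrarily for illustration:)

    private const int iters = 1_000_000;                        // assumed iteration count
    private static readonly Random r = new Random(42);          // assumed; used by Add
    private static long[][] staticArray = new long[2048][];     // assumed; sized to match "& 2047"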

The scenario that did not retain anything was the most sensitive to the cost of the allocation codepath.

VSadov (Member, Author) commented Mar 27, 2020:

I ran the standard infra with these changes, but I mostly see noise. If there are improvements, they are too small in the scenarios measured.

Yes, we should add POH scenarios to the infra, at least for regression testing.
We may need to settle on which scenarios to look at: how big the allocations are, how much is retained, etc.
If we track scenarios for LOH, the POH scenarios could be very similar, except for size.

Member commented:

LOH and POH are not necessarily that similar: LOH is compacted by default for hardlimit and POH can't be compacted, and LOH in general has much larger object sizes. Let's sync up next week on what scenarios we want to test.

VSadov (Member, Author) commented:

Yes. For regression testing we need specific scenarios.

The tweaks I did here were just things that stood out from looking at the code and at VTune.

{
    return (Interlocked::CompareExchange(&spin_lock->lock, 0, -1) < 0);
}

inline
static void enter_spin_lock (GCSpinLock* spin_lock)
Member commented:

What's the justification for doing this? Also, I would not do this in this PR.

VSadov (Member, Author) commented:

We take locks very frequently and they are nearly always uncontended. The codepath that handles the contention is very large. It seems wasteful to inline that into callers when the code is not expected to run.
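
To illustrate the fast-path/slow-path split being described, here is a C# sketch of the pattern (not the gc.cpp change itself; the names and the spin strategy are assumptions): the uncontended compare-exchange stays small enough to inline into callers, while the contention handling lives in a separate, non-inlined method.

    using System;
    using System.Runtime.CompilerServices;
    using System.Threading;

    struct SpinLockSketch
    {
        private int _lock;   // 0 == free, 1 == taken (representation assumed for this sketch)

        // Fast path: a single compare-exchange; small enough to inline into callers.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public void Enter()
        {
            if (Interlocked.CompareExchange(ref _lock, 1, 0) != 0)
                EnterContended();   // rarely taken; kept out of the inlined code
        }

        // Slow path: the (comparatively large) contention-handling code is not inlined,
        // so callers do not pay for it when the lock is uncontended.
        [MethodImpl(MethodImplOptions.NoInlining)]
        private void EnterContended()
        {
            var spinner = new SpinWait();
            while (Interlocked.CompareExchange(ref _lock, 1, 0) != 0)
                spinner.SpinOnce();
        }

        public void Exit() => Volatile.Write(ref _lock, 0);
    }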

VSadov (Member, Author) commented Mar 27, 2020:

I tried to reduce the cost of taking these locks in the most common scenarios. These locks are particularly expensive in POH allocations.

VSadov (Member, Author) commented Mar 31, 2020:

@Maoni0 - I think I have addressed all the PR concerns. Please take a look. Thanks!

@@ -3982,26 +4012,25 @@ class gc_heap
#endif //SYNCHRONIZATION_STATS

#define NUM_LOH_ALIST (7)
#define BASE_LOH_ALIST (64*1024)
// bucket 0 contains sizes less than 64*1024
#define BASE_LOH_ALIST_BITS (14)
Member commented:

this should be 15

VSadov (Member, Author) commented:

Right!!! Because it is the highest set bit in 64*1024 - 1 (zero-based, as in BitScanReverse). I'll add a comment as well.
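
To make the bit math concrete (a quick C# check using only the constants quoted above): the zero-based index of the highest set bit of 64*1024 - 1 is 15, which is the suggested BASE_LOH_ALIST_BITS value.

    using System;
    using System.Numerics;

    class BaseLohAlistBitsCheck
    {
        static void Main()
        {
            // Bucket 0 contains sizes less than 64*1024, i.e. sizes up to 64*1024 - 1.
            // The zero-based index of its highest set bit (what BitScanReverse reports)
            // is 15, so BASE_LOH_ALIST_BITS should be 15 rather than 14.
            Console.WriteLine(BitOperations.Log2(64u * 1024 - 1));   // prints 15
        }
    }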

Maoni0 (Member) commented Apr 1, 2020:

the rest LGTM!

ghost commented Apr 1, 2020:

Hello @VSadov!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started!

Maoni0 (Member) left a review comment:

LGTM!

ghost merged commit 45ed8fd into dotnet:master on Apr 1, 2020
VSadov mentioned this pull request on Apr 7, 2020
ghost locked the conversation as resolved and limited it to collaborators on Dec 10, 2020