Perf followup for Pinned Object Heap #34215
Conversation
```cpp
// skip buckets that cannot possibly fit "size" and return the next one
// there is always such bucket since the last one fits everything
unsigned int first_suitable_bucket(size_t size)
```
> unsigned int first_suitable_bucket(size_t size)
have you actually observed any perf difference doing this? I doubt it.
On a microbenchmark doing a lot of POH allocations this becomes noticeable. I saw about 10% improvement from these changes.
The hot spots that I saw: taking the msl lock, iterating through buckets, and splitting a free chunk in two (when the chunk is big, make_unused_array may cause cache misses).
The others are hard to avoid or improve though. This one was fairly easy.
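For illustration, the bucket-skipping idea can be sketched as below. This is an assumption-laden sketch, not the actual gc.cpp change: the names `NUM_BUCKETS`, `BASE_BITS`, and `highest_set_bit` are made up for the example, and the mapping simply derives the first candidate bucket from the highest set bit of the requested size instead of scanning buckets linearly.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch only - names and constants (NUM_BUCKETS, BASE_BITS,
// highest_set_bit) are assumptions for the example, not the actual gc.cpp code.
static const unsigned NUM_BUCKETS = 7;   // like NUM_LOH_ALIST
static const unsigned BASE_BITS   = 15;  // highest bit of (64*1024 - 1);
                                         // bucket 0 holds sizes below 64*1024

// zero-based index of the highest set bit, as BitScanReverse would report it
static unsigned highest_set_bit(size_t v)
{
    unsigned idx = 0;
    while (v >>= 1)
        idx++;
    return idx;
}

// map a size directly to the first bucket that could possibly fit it,
// instead of iterating over buckets one by one
static unsigned first_suitable_bucket(size_t size)
{
    unsigned hb = highest_set_bit(size);
    if (hb <= BASE_BITS)
        return 0;                          // size < 64*1024
    unsigned bucket = hb - BASE_BITS;      // one bucket per extra power of two
    // the last bucket fits everything
    return bucket < NUM_BUCKETS ? bucket : NUM_BUCKETS - 1;
}
```

With this shape the per-allocation cost is a single bit scan rather than a loop whose length grows with the number of buckets.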
can you share your benchmark? what sort of perf testing was done on this in general? are you running the default perf scenarios with the infra? this new scenario should be added to gcperfsim.
In reply to: 399581040
An example:

```csharp
private static void AllocateUninitializedPinned()
{
    System.Console.Write("AllocateUninitializedPinned: ");
    var r = new Random(1234);
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < iters; i++)
    {
        int size = r.Next(2, 10000);
        var arr = GC.AllocateUninitializedArray<long>(size, pinned: true);
        Add(arr); // no-op when not retaining, store in a static list when we do
    }
    sw.Stop();
    System.Console.WriteLine(sw.ElapsedMilliseconds);
}
```
I had a variety of these - with not retaining arrays, retaining until after the iteration is over, or retaining randomly in an array, like:
```csharp
private static void Add(long[] arr)
{
    if (r.Next(100) > 50)
        staticArray[arr.Length & 2047] = arr;
}
```
The not retaining anything scenario was the most sensitive to the cost of the allocation codepath.
I ran the standard infra with these changes, but I mostly see noise. If there are improvements, they are too small in the scenarios measured.
Yes, we should add POH scenarios to infra. At least for regression testing.
We may need to settle on what scenario to look at - how big the allocations are, how much is retained, etc.
If we track scenarios for LOH, POH scenarios could be very similar, except for size.
LOH and POH are not necessarily that similar - LOH is compacted by default for hardlimit and POH can't be compacted; LOH has in general much larger object sizes. let's sync up next week on what scenarios we want to test.
Yes. For regression testing we need specific scenarios.
The tweaks I did here were just for something that stood out from looking at the code and VTune.
src/coreclr/src/gc/gc.cpp (Outdated)

```cpp
{
    return (Interlocked::CompareExchange(&spin_lock->lock, 0, -1) < 0);
}

inline
static void enter_spin_lock (GCSpinLock* spin_lock)
```
what's the justification of doing this? also I would not do this with this PR.
We take locks very frequently and they are nearly always uncontended. The codepath that handles the contention is very large. It seems wasteful to inline that into callers when the code is not expected to run.
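A minimal sketch of the fast-path/slow-path split being described, using `std::atomic` instead of the runtime's `Interlocked` wrapper. The names, the `noinline` attribute (GCC/Clang syntax), and the structure are illustrative assumptions, not the actual `GCSpinLock` code; the free state of `-1` and held state of `0` follow the `CompareExchange(&spin_lock->lock, 0, -1)` call in the diff above.

```cpp
#include <atomic>
#include <cassert>

// Illustrative spinlock sketch (not the real GCSpinLock):
// -1 means free, 0 means held.
struct SpinLock
{
    std::atomic<int> lock{-1};
};

// Contended path: large and cold, deliberately kept out of line so it is
// not inlined into every caller.
__attribute__((noinline))
static void enter_spin_lock_slow(SpinLock* sl)
{
    int expected;
    do
    {
        expected = -1;
        // the real code would spin, yield, and back off here
    } while (!sl->lock.compare_exchange_weak(expected, 0));
}

// Uncontended fast path: a single compare-exchange, taken nearly always,
// cheap enough to inline everywhere.
inline void enter_spin_lock(SpinLock* sl)
{
    int expected = -1;
    if (!sl->lock.compare_exchange_strong(expected, 0))
        enter_spin_lock_slow(sl);
}

inline void leave_spin_lock(SpinLock* sl)
{
    sl->lock.store(-1);
}
```

The point of the split is exactly what the comment says: callers inline only one CAS, while the rarely-executed contention handling stays in a single out-of-line copy.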
I tried to reduce the cost of taking these locks in most common scenarios. These locks are particularly expensive in POH allocations.
@Maoni0 - I think I have addressed all the PR concerns. Please take a look. Thanks!
src/coreclr/src/gc/gcpriv.h (Outdated)

```
@@ -3982,26 +4012,25 @@ class gc_heap
#endif //SYNCHRONIZATION_STATS

#define NUM_LOH_ALIST (7)
#define BASE_LOH_ALIST (64*1024)
// bucket 0 contains sizes less than 64*1024
#define BASE_LOH_ALIST_BITS (14)
```
this should be 15
Right!!! Because it is the highest bit in 64*1024 - 1 (zero-based, as in BitScanReverse). I'll add a comment as well.
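The arithmetic can be checked directly. `highest_bit` below is an illustrative stand-in for BitScanReverse (not runtime code): the zero-based highest set bit of 64*1024 - 1 is 15, while 64*1024 itself already has bit 16 set - hence the constant should be 15, not 14.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for BitScanReverse: zero-based index of the
// highest set bit of a nonzero value.
static unsigned highest_bit(uint32_t v)
{
    unsigned i = 0;
    while (v >>= 1)
        i++;
    return i;
}
```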
the rest LGTM!
LGTM!
A follow up for items tracked by #13739
Verified, adjusted or otherwise addressed the following items:
Since POH covers the whole possible range of object sizes, we need a few more buckets than in SOH or LOH - to cover the combined SOH and LOH range.
POH has behaviors similar to LOH here - for example, it is not very likely that a GC will reduce the generation size (since there is no compaction and objects are relatively long-lived).
LOH tuning seems to be reasonable enough.
I did not see obvious reasons for the above to be different from LOH.
It is possible we could be ok with a lower initial/minimum segment size, but that seems to make very little difference, since this is not about committed size. We commit in smaller chunks (2 OS pages) anyway.
Same as above.
Also looked in debugger and VTune for any obvious perf issues.
Apart from a few minor tweaks, it all looked as expected.