Better default for L3 cache size on win-arm64 and lin-arm64 #64645

Closed
wants to merge 9 commits

Conversation

EgorBo
Member

@EgorBo EgorBo commented Feb 1, 2022

While we work on addressing the L3 cache issue (#60166) for both win-arm64 and linux-arm64 (osx-arm64 is already fine, see #64576), I think it makes sense to at least use the existing heuristic (based on logical core count) as a default when it predicts more than what we actually found (e.g. only L2). The heuristic looks like this:

int predictedCacheSize = Math.Min(4096, Math.Max(256, (int)logicalCPUs * 128)) * 1024;

(it's not mine; it already existed in the code for cases where we can't get any cache info at all)
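To make the shape concrete, here is roughly what it predicts for a few core counts (an illustrative C++ sketch of the formula above, not the exact runtime code):

#include <algorithm>
#include <cstddef>

// Sketch of the fallback above: predicted last-level cache size in bytes.
// E.g. 2 cores -> 256 KB, 8 cores -> 1 MB, 16 cores -> 2 MB, 32+ cores -> 4 MB (the cap).
static size_t PredictedCacheSize(int logicalCPUs)
{
    return (size_t)std::min(4096, std::max(256, logicalCPUs * 128)) * 1024;
}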

Same heuristic but visualized:
[chart: predicted cache size vs. logical core count]

I think it's better than the current default; it won't hurt small CPUs and won't report something gigantic for an 80-core CPU, etc.

Gen0 size is currently calculated as ((L3size * 3) * 5 / 8), where 5/8 is a general heuristic and the * 3 is arm64-specific, see:

#if defined(TARGET_ARM64)
// Bigger gen0 size helps arm64 targets
maxSize = maxTrueSize * 3;
#endif
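A quick worked example of that formula (just arithmetic), which is where the red/green numbers below come from:

// gen0 = (L3size * 3) * 5 / 8, i.e. L3 * 15/8 on arm64:
//   256 KB L3  ->  256 KB * 15 / 8 =  480 KB gen0   (the current default)
//   1.5 MB L3  -> 1536 KB * 15 / 8 = 2880 KB ~ 2.8 MB gen0 (with this heuristic)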

Also, here is a graph for Gen0 size -- RPS for Plaintext-MVC benchmark (the most GC-bound we have in PerfLab currently):
[graph: RPS vs. Gen0 size for Plaintext-MVC]

The red line is where we're now: 256 KB L3 -> 480 KB Gen0 -> 380k RPS.
The green line is what we'll have with this heuristic: 1.5 MB L3 -> ~2.8 MB Gen0 -> 920k RPS.

For this specific benchmark the best RPS (1086k) corresponds to ~16 MB Gen0 (the L3 cache function would have to report something between 12 MB and 16 MB).

database-fortunes benchmark:
[graph: RPS vs. Gen0 size for database-fortunes]

I also ran a couple of simple micro-benchmarks locally with workstation GC, and performance seems to level off once Gen0 is at least 4 MB.

@Maoni0 @mangod9 @jkotas

@ghost

ghost commented Feb 1, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.


Author: EgorBo
Assignees: EgorBo
Labels: area-GC-coreclr
Milestone: -

@mangod9
Member

mangod9 commented Feb 1, 2022

This looks good for now, but based on your measurements should we default to at least 4 MB? @Maoni0?

@Maoni0
Member

Maoni0 commented Feb 1, 2022

On a 64-proc machine,

logicalCPUs * std::min(1536, std::max(256, (int)logicalCPUs * 128)) * 1024;

would return 96 MB. That's way too large. Why are we doing the logicalCPUs * part? This is supposed to be per CPU.
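Spelling out that arithmetic (based on the expression above):

//   64 logical CPUs: 64 * std::min(1536, std::max(256, 64 * 128)) * 1024
//                  = 64 * 1536 KB
//                  = 96 MB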

@EgorBo
Member Author

EgorBo commented Feb 1, 2022

Added database-fortunes aspnet benchmark. Will run more.

why are we doing the logicalCPUs * part? this is supposed to be per CPU.

I think it's just a heuristic like "if a cpu has more than 8 cores it's most likely something powerful"

logicalCPUs * std::min(1536, std::max(256, (int)logicalCPUs * 128)) * 1024;

Oops, in the formula in the issue description I forgot about the leading logicalCPUs *, and without it the formula actually reports something meaningful 😄 E.g. it never goes above 1.5 MB for >= 12 cores. It actually makes sense(?)

@EgorBo
Member Author

EgorBo commented Feb 2, 2022

@Maoni0 I've just changed the formula; now it's:

[chart: updated heuristic, predicted cache size vs. logical core count]

Max Gen0 size is 7.5 MB (for systems with > 30 cores); let me know if you want a smaller value.
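For reference, that ceiling is just the gen0 formula applied to the heuristic's cap (rough arithmetic, assuming the updated formula caps the predicted L3 at 4 MB):

//   predicted L3 cap: 4096 KB (reached at >= 32 logical CPUs)
//   gen0 max = 4096 KB * 3 * 5 / 8 = 7680 KB = 7.5 MB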

@EgorBo
Member Author

EgorBo commented Feb 2, 2022

I am currently running all the aspnet benchmarks we have in PerfLab via crank; so far the best (or rather, optimal) results are when Gen0 is between 6 MB and 16 MB.

@tannergooding
Member

I'm not familiar with the logic here, so apologies in advance for the question/explanation of thoughts if this has already been considered, etc...

Why are we predicting the L3 size rather than just getting it from the OS (such as GetLogicalProcessorInformationEx) or from the CPUID info (x86/x64)?

There is a range of hardware and configurations here, and as core counts and layouts increase there are a lot more interesting details than just "how much L3 exists". So it seems we are potentially missing loads of important information by not pulling the relevant info from the OS/hardware.

For example, let's consider just:

  • Ryzen 3950X - 4 x 16 MB of L3 cache with 16-way associativity
  • Ryzen 5950X - 2 x 32 MB of L3 cache with 16-way associativity

The Ryzen CPUs are composed of CCX modules, where each CCX has its own share of the cores, L1/L2/L3 cache, etc. The CCXs are technically distinct units and communicate with each other over the Infinity Fabric. While communicating over the Infinity Fabric is possible and fast, it's also slower than accessing resources on the same CCX. Likewise, while two separate cores on the same CCX can communicate, it is slower than accessing the resources that are directly meant for that core. And finally, hyperthreading basically splits the resources of a single core in half, with each thread getting roughly half of the resources available to it, so this is something that can be important to consider as well.

So while both of these CPUs provide 64MB of L3 cache and both have 16 cores/32 threads, the performance and considerations for the L3 cache here are quite a bit different. In both setups, each core has roughly 4MB of L3 to itself and each thread roughly 2MB. However, on the 3950X each core has access to an additional 12MB of L3 at a "medium" speed and the other 48MB of L3 over Infinity Fabric at an additional cost ("slow" speed). The 5950X, on the other hand, has access to 28MB of L3 in the same CCX at a "medium" speed and the other 32MB over Infinity Fabric at an additional cost ("slow" speed).

There have been several articles and deep dives on the Ryzen architecture, including https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5, which shows some of the profiled core-to-core latencies and how same-core access is extremely fast (6ns), accessing other cores on the same CCX is about 2-3x slower (~17ns), and accessing over Infinity Fabric is up to 4-5x slower than that (~80ns).

There are similar considerations in Intel's Alder Lake with its performance/efficiency-core split and in other upcoming CPUs like the Zen "3D" parts, which will have up to 192MB of L3 accessible.


With the introduction of the performance/efficiency-core split and CPUs with many cores/threads, there are also a lot of considerations that come into play around thread scheduling that I think it would be good for us to be considering and designing around.

https://www.intel.com/content/www/us/en/developer/articles/guide/alder-lake-developer-guide.html goes a bit in depth on some of the considerations. This specific article is somewhat game-focused, but many of the rules/guidelines are reiterated in the Intel and AMD optimization guides and are more generally applicable.

It calls out a lot of things that I don't believe we are accounting for today, like how caches are split between and accessible by resources (called out above), or how hyper-threads share resources, so scheduling work onto the main thread of each core before scheduling onto the secondary threads is important (some of which is expected to be handled by the OS, but which advanced usage scenarios may also take advantage of or provide additional hints around).

@EgorBo
Member Author

EgorBo commented Feb 2, 2022

I'm not familiar with the logic here

I agree that L3 size alone is a questionable metric without additional context like how many cores share it, etc., but the current problem is that for Windows-arm64 and Linux-arm64 there is no way (that we're aware of) to get any information about L3 at all. E.g. on Windows, GetLogicalProcessorInformation only reports L1-L2 (the Windows team is helping us atm), and the same is true on Linux. On macOS we have everything we need from sysctl: L3 size, how many performance cores share it, etc.

@tannergooding
Member

I agree that L3 size alone is a questionable metric without additional context like how many cores share it, etc., but the current problem is that for Windows-arm64 and Linux-arm64 there is no way (that we're aware of) to get any information about L3 at all. E.g. on Windows, GetLogicalProcessorInformation only reports L1-L2 (the Windows team is helping us atm), and the same is true on Linux. On macOS we have everything we need from sysctl: L3 size, how many performance cores share it, etc.

Sorry, are you saying that GetLogicalProcessorInformation (and the Ex variant) report L1/L2/L3 on x86/x64 but not for Arm64, so this is an Arm64-only issue (that is, we do use the relevant OS APIs on x86/x64 and for Arm64 on macOS, and this is a workaround just for Arm64 on Windows/Linux)?

I'm notably not seeing the same on any of my 3 ARM64 devices (Surface Pro X, Samsung GalaxyBook2, or the Qualcomm ECS Liva dev box). A simple C++ app using GetLogicalProcessorInformationEx reports the L1, L2, and L3 caches as far as I can see.
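For anyone who wants to reproduce that check, a minimal sketch of such an app (illustrative only, error handling mostly omitted; not the runtime's implementation):

#include <windows.h>
#include <cstdio>
#include <vector>

int main()
{
    // Ask for the required buffer size, then fetch all cache relationships.
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationCache, nullptr, &len);
    std::vector<BYTE> buffer(len);
    if (!GetLogicalProcessorInformationEx(
            RelationCache,
            reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer.data()),
            &len))
    {
        return 1;
    }

    // Records are variable-length, so advance by each record's Size field.
    for (DWORD offset = 0; offset < len; )
    {
        auto* info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer.data() + offset);
        if (info->Relationship == RelationCache)
        {
            printf("L%u cache: %u KB\n",
                   (unsigned)info->Cache.Level,
                   (unsigned)(info->Cache.CacheSize / 1024));
        }
        offset += info->Size;
    }
    return 0;
}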

@EgorBo
Member Author

EgorBo commented Feb 2, 2022

Sorry, are you saying that GetLogicalProcessorInformation (and the Ex variant) report L1/L2/L3 on x86/x64 but not for Arm64, so this is an Arm64-only issue (that is, we do use the relevant OS APIs on x86/x64 and for Arm64 on macOS, and this is a workaround just for Arm64 on Windows/Linux)?

I'm notably not seeing the same on any of my 3 ARM64 devices (Surface Pro X, Samsung GalaxyBook2, or the Qualcomm ECS Liva dev box). A simple C++ app using GetLogicalProcessorInformationEx reports the L1, L2, and L3 caches as far as I can see.

Exactly, it doesn't report L3 on our Windows 11 arm64 machines with lots-of-cores hardware; the Windows team is aware.
Also, there is no reliable way to get it on Linux at all, and we have some partners helping us there as well. Meanwhile, we need something better than 256 KB as the reported last-level cache (=> Gen0 size) on a machine with 30 cores.

So it's a reasonable workaround until we find a 100% reliable way to get the cache size, or switch to some other method of calculating the Gen0 size.

@EgorBo
Member Author

EgorBo commented Feb 2, 2022

More aspnet/TechEmpower benchmarks from PerfLab:
RPS vs Gen0 and P90 vs Gen0. P90 is not a super reliable metric; for consistent results each run would have to be much longer, but we can still see some patterns.

Vertical axis: RPS or P90 (ms)
Horizontal axis: Gen0 size, MB. If you want to project the source L3 size from Gen0, use the L3 = (Gen0 * 8) / 15 formula.

[graphs: RPS vs. Gen0 size and P90 vs. Gen0 size for several TechEmpower benchmarks]

So far, the optimal results are between 6 MB and 16 MB for Gen0; 7.5 MB, as this PR proposes, sounds like a good default.
Also, as I noted here, the max Gen0 produced by the heuristic is 7.5 MB, so no more huge values.

@Maoni0 does it look good now, while we keep looking for a better solution?
CI failures aren't related.

Plaintext-MVC baseline vs this PR (tested binaries):

| load                   |        base |          PR |          |
| ---------------------- | ----------- | ----------- | -------- |
| CPU Usage (%)          |           4 |           7 |  +75.00% |
| Cores usage (%)        |         118 |         197 |  +66.95% |
| Working Set (MB)       |          37 |          37 |    0.00% |
| Private Memory (MB)    |         358 |         358 |    0.00% |
| Start Time (ms)        |           0 |           0 |          |
| First Request (ms)     |         302 |         299 |   -0.99% |
| Requests/sec           |     448,574 |   1,008,072 | +124.73% |
| Requests               |   6,766,783 |  15,202,961 | +124.67% |
| Mean latency (ms)      |       10.61 |        3.10 |  -70.78% |
| Max latency (ms)       |      287.39 |       97.52 |  -66.07% |
| Bad responses          |           0 |           0 |          |
| Socket errors          |           0 |           0 |          |
| Read throughput (MB/s) |       56.47 |      126.90 | +124.72% |
| Latency 50th (ms)      |        5.76 |        2.44 |  -57.64% |
| Latency 75th (ms)      |       11.31 |        3.60 |  -68.17% |
| Latency 90th (ms)      |       23.21 |        9.80 |  -57.78% |
| Latency 99th (ms)      |        0.00 |        0.00 |          |

@AntonLapounov
Member

I am currently running all the aspnet benchmarks we have in PerfLab via crank; so far the best (or rather, optimal) results are when Gen0 is between 6 MB and 16 MB.

What is the actual L3 size on those machines? If it is 32 MiB, then our 5/8 factor may be inadequate even if we remove 3x scaling. I am afraid you are optimizing for very specific hardware. To change the formula, we need to run tests on more than one type of hardware.

@EgorBo
Member Author

EgorBo commented Feb 3, 2022

I am currently running all the aspnet benchmarks we have in PerfLab via crank; so far the best (or rather, optimal) results are when Gen0 is between 6 MB and 16 MB.

What is the actual L3 size on those machines? If it is 32 MiB, then our 5/8 factor may be inadequate even if we remove 3x scaling. I am afraid you are optimizing for very specific hardware. To change the formula, we need to run tests on more than one type of hardware.

The machines have 32 MB of L3; the heuristic reports 4 MB, which results in 7.5 MB for Gen0 (the max possible size for this heuristic). For these benchmarks on this CPU it produces the best "RPS / working set size" ratio. It can be decreased down to 2 MB L3 (3.75 MB Gen0) without losing much of the benefit (~10%) if 7.5 MB is too much.

This PR is not a scientific paper; it just tries to use a reasonable default that is much better than what we have now: 256 KB (480 KB Gen0). It noticeably improves all GC-intensive benchmarks, even for desktop scenarios. I propose we merge it so we have better ground for the upcoming Preview 2; the L3 cache issue was found ~3 months ago.

@EgorBo
Member Author

EgorBo commented Feb 3, 2022

This PR increases the working set from ~170 MB to ~370 MB, while the same benchmark reports 440 MB on our Xeon. So values beyond 8 MB Gen0 dramatically increase the working set without much benefit (e.g. Gen0 = 28 MB == 1 GB of working set).

@AntonLapounov
Member

How do we know this formula change does not negatively affect other types of hardware? For instance, in the case of 8 cores the reported L3 size is changed by this PR from 8 MiB to just 1 MiB, which is a significant reduction. Your finding that the optimal Gen0 size is between 1/5 and 1/2 of the L3 size (instead of the currently used 3*5/8 factor) for this particular hardware is quite interesting; however, I think we should also test some other types of hardware before changing the general formula.

@EgorBo
Member Author

EgorBo commented Feb 3, 2022

How do we know this formula change does not negatively affect other types of hardware? For instance, in the case of 8 cores the reported L3 size is changed by this PR from 8 MiB to just 1 MiB, which is a significant reduction. Your finding that the optimal Gen0 size is between 1/5 and 1/2 of the L3 size (instead of the currently used 3*5/8 factor) for this particular hardware is quite interesting; however, I think we should also test some other types of hardware before changing the general formula.

I think we currently never use that formula at all and just rely on whatever comes from the API, which is mostly something small (always small on Linux-arm64).
We only use the formula on Linux in case /sys/devices/system/cpu/cpu0/cache/index0/size didn't even report L2 (which has never happened on any linux-arm64 machine I tested).

@tannergooding
Member

tannergooding commented Feb 3, 2022

We only use the formula on Linux in case /sys/devices/system/cpu/cpu0/cache/index0/size didn't even report L2

Where is the logic for /sys/devices/* that handles knowing the type/level of cache (and number of indices) so we know what values we're actually using?

I only see logic that queries cpu0 and ignores any differences between cores (which is particularly bad for non-homogeneous systems, which many ARM64 CPUs and the new Alder Lake CPUs are). It also looks to hardcode itself to index0 through index4 without checking how many cache indices are present.

  • Notably we also only appear to use this as a "fallback" path and prefer the sysconf data instead, which reports a "lot" less information and means we can't tune ourselves as well

I'd expect the logic to actually do things like:

  • See how many cpu* entries are under /sys/devices/system/cpu/
  • For each cpu*, see how many cache indices are under /sys/devices/system/cpu/cpu*/cache/
  • For each index*, see what the level is and query other relevant information (/sys/devices/system/cpu/cpu*/cache/index*/level)
    • There is a lot of information/data points here and it covers things like associativity, size, type, what cores have access to it, etc (all the same stuff GetLogicalProcessorInformationEx reports when it is working as intended)

For reference, this is all documented here: https://github.com/torvalds/linux/blob/master/Documentation/ABI/testing/sysfs-devices-system-cpu
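Something along these lines would be a rough userland sketch of that enumeration (illustrative only, not the runtime's code; it assumes the sysfs attributes documented at the link above are present):

#include <filesystem>
#include <fstream>
#include <string>
#include <cstdio>

namespace fs = std::filesystem;

static bool ReadSysfs(const fs::path& path, std::string& value)
{
    std::ifstream f(path);
    return static_cast<bool>(f) && static_cast<bool>(std::getline(f, value));
}

int main()
{
    // Walk every cpu* entry instead of assuming cpu0, and every index* entry
    // instead of assuming a fixed set of cache indices.
    for (int cpu = 0; fs::exists("/sys/devices/system/cpu/cpu" + std::to_string(cpu)); cpu++)
    {
        fs::path cache = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/cache";
        for (int index = 0; fs::exists(cache / ("index" + std::to_string(index))); index++)
        {
            fs::path dir = cache / ("index" + std::to_string(index));
            std::string level, type, size, shared;
            ReadSysfs(dir / "level", level);              // 1, 2, 3, ...
            ReadSysfs(dir / "type", type);                // Data / Instruction / Unified
            ReadSysfs(dir / "size", size);                // e.g. "32K", "1024K"
            ReadSysfs(dir / "shared_cpu_list", shared);   // which cpus share this cache
            printf("cpu%d index%d: L%s %s %s (shared by cpus %s)\n",
                   cpu, index, level.c_str(), type.c_str(), size.c_str(), shared.c_str());
        }
    }
    return 0;
}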

@EgorBo
Member Author

EgorBo commented Feb 3, 2022

The problem is that /sys/devices just doesn't say anything about L3 at all on any of the Linux-arm64 setups we tested, making it useless. Maybe we can rely on some bits of it for the heuristic, but the heuristic itself is supposed to be a last-chance fallback.

@tannergooding
Member

The problem is that /sys/devices just doesn't say anything about L3 at all on any of the Linux-arm64 setups we tested, making it useless. Maybe we can rely on some bits of it for the heuristic, but the heuristic itself is supposed to be a last-chance fallback.

And can't we rely on it not being reported as the signal that says "do the fallback"?

This is another case where on my own boxes (both WSL and directly running Linux natively -- Ubuntu 20.04.03 LTS), I am seeing the numbers accurately reported for ARM64.

@tannergooding
Member

Also noting that there are some chips, such as the Raspberry Pi, which may have no L3, and assuming they have one may be incorrect/suboptimal.

@EgorBo
Member Author

EgorBo commented Feb 3, 2022

(both WSL and directly running Linux natively -- Ubuntu 20.04.03 LTS), I am seeing the numbers accurately reported for ARM64.

Interesting, what kind of hardware do you use for it?

Also, if it reports L3 correctly then the heuristic won't be used; it is highly unlikely that its value will be bigger than the real one (maybe +/- 0.5 MB).

@EgorBo
Member Author

EgorBo commented Feb 3, 2022

Also noting that there are some chips, such as the Raspberry PI, which may have no L3 and assuming it has one may also be incorrect/unoptimal.

Let's not keep the current value for the sake of the Raspberry Pi ;-)
Also, for a 4-core Raspberry Pi we report a 512 KB cache, which definitely won't hurt it. You're also unlikely to use server GC on it or use it for GC-intensive workloads.

@tannergooding
Member

Interesting, what kind of hardware do you use for it?

Raspberry Pi - (Booting Ubuntu) Reports no L3, because it doesn't have an L3
Surface Pro X - (Only tried WSL) Reports 4MB L3
GalaxyBook 2 - (WSL and Dual Booting Ubuntu) Reports 2MB L3
LIVA QC710 - (Only tried WSL) Reports 1MB L3

@EgorBo
Member Author

EgorBo commented Feb 3, 2022

LIVA QC710

Thanks for the data. I assume all of them (except the Pi) use popular Qualcomm chips where the cache is reported via a special register accessible by the kernel; I even have a snippet somewhere with raw arm asm. In this issue, though, we mostly care about custom server/cloud hardware. The heuristic won't hurt any of the devices you listed: if L3 is reported correctly then it will be bigger than what the heuristic predicts.

@AntonLapounov
Member

@EgorBo Have you been testing server GC only? I am wondering whether the optimal range for workstation GC might be different.

@EgorBo
Member Author

EgorBo commented Feb 4, 2022

@EgorBo Have you been testing server GC only? I am wondering whether the optimal range for workstation GC might be different.

not these, will schedule a run, but e.g. these #64576 were workstation ones.

Another interesting metric is "RPS divided by working set size" vs "Gen0 size":
[graph: RPS / working set size vs. Gen0 size]

@mangod9
Member

mangod9 commented Apr 25, 2022

@EgorBo does this need more thought or is it ready to merge?


cacheSize = logicalCPUs * std::min(1536, std::max(256, (int)logicalCPUs * 128)) * 1024;
}
// It is currently expected to be missing cache size info
Member


A lot of information in this comment is no longer relevant. Could you please update this comment to only include what is still relevant?

@EgorBo
Member Author

EgorBo commented Jun 28, 2022

Closing since, apparently, this was overtaken by #71029.

@EgorBo EgorBo closed this Jun 28, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 28, 2022