
Performance regression: 6x slower array allocation on Alpine #41398

Closed
adamsitnik opened this issue Aug 26, 2020 · 36 comments · Fixed by #41532

@adamsitnik
Member

adamsitnik commented Aug 26, 2020

From the data I got from @danmosemsft, collected by running the dotnet/performance microbenchmarks on Alpine 3.11 via WSL2, it looks like allocating arrays of both value and reference types has become 6 times slower compared to 3.1.

Initially I thought it was just an outlier, but I can see the same pattern for other collections that internally use arrays (Queue, List, Stack, etc.). The regression is specific to Alpine; Ubuntu 18.04 (with and without WSL2) is fine.

@jkotas @janvorli who would be the best person to investigate that?

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --filter 'System.Collections.CtorGivenSize<Int32>.Array'

System.Collections.CtorGivenSize&lt;Int32&gt;.Array(Size: 512)

| Conclusion | Base (ns) | Diff (ns) | Base/Diff | Modality | Operating System | Bit | Processor Name | Base Runtime | Diff Runtime |
|---|---|---|---|---|---|---|---|---|---|
| Same | 181.79 | 183.96 | 0.99 | | Windows 10.0.18363.1016 | Arm | Microsoft SQ1 3.0 GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20413.9 |
| Same | 92.89 | 94.47 | 0.98 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20404.3 |
| Same | 96.05 | 94.36 | 1.02 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |
| Same | 114.74 | 111.94 | 1.03 | | Windows 10.0.19041.450 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20413.9 |
| Same | 80.49 | 79.98 | 1.01 | | Windows 10.0.19041.450 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.9 |
| Same | 67.30 | 67.66 | 0.99 | bimodal | Windows 10.0.19042 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |
| Same | 86.10 | 79.17 | 1.09 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.14 |
| Same | 97.50 | 98.77 | 0.99 | | Windows 10.0.18363.959 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20420.14 |
| Slower | 127.02 | 150.46 | 0.84 | bimodal | Windows 10.0.19041.450 | X86 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.5 |
| Slower | 193.61 | 287.83 | 0.67 | bimodal | ubuntu 18.04 | Arm64 | Unknown processor | .NET Core 3.1.6 | 6.0.100-alpha.1.20421.6 |
| Same | 99.85 | 103.42 | 0.97 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20403.23 |
| Slower | 138.73 | 151.37 | 0.92 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20404.2 |
| Slower | 72.85 | 515.56 | 0.14 | | alpine 3.11 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 6.0.100-alpha.1.20421.6 |
| Slower | 78.85 | 90.76 | 0.87 | | ubuntu 18.04 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |
@adamsitnik added this to the 5.0.0 milestone Aug 26, 2020
@Dotnet-GitSync-Bot added the untriaged label (New issue has not been triaged by the area owner) Aug 26, 2020
@adamsitnik changed the title from "Performance regression: 5x slower array allocation on Alpine" to "Performance regression: 6x slower array allocation on Alpine" Aug 26, 2020
@danmoseley
Member

@billwert

@ghost

ghost commented Aug 26, 2020

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@jkotas
Member

jkotas commented Aug 26, 2020

These tests are measuring raw GC allocation speed. This is a GC performance regression.

@mangod9 Could you please find somebody to investigate this?

@mangod9
Member

mangod9 commented Aug 26, 2020

Ok, will have someone take a look. @adamsitnik, is the regression specific to Alpine on WSL2, or Alpine in general?

@adamsitnik
Member Author

Is the regression specifically with Alpine on WSL2, or Alpine in general?

@mangod9 The benchmarks were executed with Alpine on WSL2; I don't have "Alpine only" results to be able to answer. FWIW, the Ubuntu on WSL2 results are fine, so it might be a general Alpine problem.

@jkotas
Member

jkotas commented Aug 26, 2020

It repros without WSL2 too.

The problem is that the GC is running something like 100x more often than it should. It is likely a problem in the budget computation. One of the places to check is PAL_GetLogicalProcessorCacheSizeFromOS.

@mangod9
Member

mangod9 commented Aug 26, 2020

Ok, thanks for the context. We will look into whether there is any Alpine-specific behavior leading to this.

@mangod9 removed the untriaged label (New issue has not been triaged by the area owner) Aug 26, 2020
@mangod9
Member

mangod9 commented Aug 27, 2020

Here is a simple repro:

```csharp
using System.Diagnostics;

namespace System.Collections
{
    class Program
    {
        static void Main(string[] args)
        {
            Stopwatch sw = Stopwatch.StartNew();

            // Allocate one million short-lived arrays; only gen0 GCs should occur.
            for (int i = 0; i < 1000000; i++)
            {
                var foo = new int[512];
            }

            sw.Stop();
            Console.WriteLine(System.GC.CollectionCount(0));
            Console.WriteLine(sw.ElapsedTicks);
        }
    }
}
```

| | 3.1 (Alpine) | 5.0 p8 (Alpine) |
|---|---|---|
| Gen0 Collections | 330 | 12653 |
| ElapsedTicks | 68689200 | 368227200 |

@Maoni0
Member

Maoni0 commented Aug 27, 2020

Yeah, this is clearly related to calculating the minimal gen0 budget: this test allocates only temporary objects, so it will only trigger gen0 GCs, which always use the min gen0 budget. The min gen0 budget is calculated in get_gen0_min_size, so something in that method changed.
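
To illustrate the failure mode, here is a simplified sketch (not the actual get_gen0_min_size code from gc.cpp; the function name, constants, and clamping are hypothetical) of how a budget derived from the reported cache size collapses when the OS query returns 0:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch only: the real budget computation in gc.cpp is more
// involved. The point is that a reported cache size of 0 (the Alpine case)
// clamps the gen0 budget to its floor, so gen0 fills up and the GC runs far
// more often than it should.
size_t gen0_min_size_sketch(size_t cacheSizePerCpu)
{
    const size_t floorBudget   = 256 * 1024;        // hypothetical lower bound
    const size_t ceilingBudget = 64 * 1024 * 1024;  // hypothetical upper bound
    return std::min(std::max(cacheSizePerCpu, floorBudget), ceilingBudget);
}
```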

@mangod9
Member

mangod9 commented Aug 28, 2020

I investigated this more, and it appears that none of the _SC_LEVEL1_DCACHE_SIZE family of constants are defined for Alpine (musl), so PAL_GetLogicalProcessorCacheSizeFromOS will always return 0.

I wonder if #34488 caused the regression, since I notice this case is missing compared to 3.1.
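
For reference, the sysconf-based query looks roughly like the sketch below (illustrative, not the literal PAL source). The _SC_LEVEL*_CACHE_SIZE names are glibc extensions, so on musl the guarded branches compile out and only 0 can be reported:

```cpp
#include <stddef.h>
#include <unistd.h>

// Simplified sketch: query each cache level via sysconf and keep the largest.
// musl does not define the _SC_LEVEL*_CACHE_SIZE constants, so on Alpine all
// of the branches below compile out and the function returns 0.
static size_t GetCacheSizeFromSysconfSketch()
{
    long size = 0;
#ifdef _SC_LEVEL1_DCACHE_SIZE
    long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    if (l1 > size) size = l1;
#endif
#ifdef _SC_LEVEL2_CACHE_SIZE
    long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);
    if (l2 > size) size = l2;
#endif
#ifdef _SC_LEVEL3_CACHE_SIZE
    long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);
    if (l3 > size) size = l3;
#endif
    return size > 0 ? (size_t)size : 0;
}
```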

@jkotas
Member

jkotas commented Aug 28, 2020

I think you are right. It means that we had a performance bug for Alpine here even before. The code that I have deleted in #34488 produced provably wrong results for AMD and the logic was questionable for Intel too.

Would you like me to work on getting this fixed?

@mangod9
Member

mangod9 commented Aug 28, 2020

I notice that you had an aborted PR, #34484, to fix GetCacheSizeFromCpuId. Does that need to be resurrected? If so, I think I can get that code back.

@jkotas
Member

jkotas commented Aug 28, 2020

#34484 fixed just one of the multiple different bugs that this code had. I think it needs to be written more or less from scratch using current editions of Intel and AMD manuals.

@janvorli
Member

We can also gather this information from /sys/devices/system/cpu without doing Intel/AMD-specific kung-fu. There is, for example, /sys/devices/system/cpu/cpu0/cache/index0/size for the level 1 cache of cpu0, /sys/devices/system/cpu/cpu0/cache/index1/size for level 2, and /sys/devices/system/cpu/cpu0/cache/index2/size for level 3.
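
A minimal sketch of that approach (illustrative only, not the ARM64 code path referenced below): each /sys/devices/system/cpu/cpu0/cache/indexN directory exposes a size file with values like "32K" or "8192K" (plus a level file identifying which cache level it describes), so the largest reported size can be found with a simple scan:

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Illustrative sketch: scan cpu0's cache index directories and return the
// largest reported size in bytes. Assumes the usual "<number>K"/"<number>M"
// format of the sysfs size files.
static size_t GetLargestCacheSizeFromSysfsSketch()
{
    size_t largest = 0;
    for (int index = 0; ; index++)
    {
        std::string dir =
            "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(index);
        std::ifstream sizeFile(dir + "/size");
        if (!sizeFile.is_open())
            break;                        // no more index directories

        std::string text;
        sizeFile >> text;                 // e.g. "32K" or "8192K"
        if (text.empty())
            continue;

        size_t value = std::stoul(text);  // numeric prefix
        char suffix = text.back();
        if (suffix == 'K' || suffix == 'k')
            value *= 1024;
        else if (suffix == 'M' || suffix == 'm')
            value *= 1024 * 1024;

        if (value > largest)
            largest = value;
    }
    return largest;
}
```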

@jkotas
Member

jkotas commented Aug 28, 2020

We have this code for ARM64, so we can just remove the ifdef around it and use it unconditionally.

@mangod9
Member

mangod9 commented Aug 28, 2020

Yeah, just removing the ifdef seems like the best option for 5.0. Will create a PR; do we need to rethink this for 6.0?

@jkotas
Member

jkotas commented Aug 28, 2020

Agree. We have a duplicate of this code in gcenv.unix.cpp for the standalone GC, so please remove it there as well.

I think it is fine as a permanent fix ... until we discover the next problem with this.

@am11
Member

am11 commented Aug 28, 2020

imo, the better fix would be:

```diff
- #if defined(HOST_ARM64)
+ #if defined(TARGET_LINUX)
```

to keep the support for non-Linux Unix-like operating systems intact (macOS, FreeBSD, SunOS, and so forth).

@mangod9
Member

mangod9 commented Aug 29, 2020

Reopening so it can be ported to .NET 5.

@mangod9 mangod9 reopened this Aug 29, 2020
@mangod9
Member

mangod9 commented Aug 29, 2020

I have validated that the simplified repro shows low gen0 collection counts. I am unable to get the benchmarks working on WSL2 (will investigate). @adamsitnik, if you could please validate that the benchmarks show perf comparable to 3.1, that would be great (I guess we need to wait for a daily build anyway). Thx.

@adamsitnik
Member Author

I am unable to get the benchmarks working on WSL2 (will investigate)

@mangod9 what error are you getting? You should be able to run the python script and get 3.1 results:

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 --filter 'System.Collections.CtorGivenSize<Int32>.Array'

And then run the benchmarks as 5.0 using your local build of dotnet/runtime by specifying the path to corerun:

python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Collections.CtorGivenSize<Int32>.Array' \
 --corerun './runtime/artifacts/bin/testhost/net5.0-Alpine-Release-x64/shared/Microsoft.NETCore.App/5.0.0/corerun'

You can read more about benchmarking local builds of dotnet/runtime here

if you could please validate that the benchmarks are showing comparable perf to 3.1 that would be great

I currently don't have access to an Alpine machine, but as soon as #41547 gets propagated to the SDK, I can spawn a new Azure VM and run the benchmarks.

@mangod9
Member

mangod9 commented Aug 31, 2020

The container was missing libgdiplus on Alpine, but I was able to resolve it by adding the package from http://dl-3.alpinelinux.org/alpine/edge/testing/.

I was able to validate that the fix brings the perf back in line with 3.1:

| Alpine | 3.1 | 5.0 | 6.0 |
|---|---|---|---|
| CtorGivenSize.Array | 80.30 ns | 737.6 ns | 87.62 ns |

@mangod9
Member

mangod9 commented Aug 31, 2020

Ported to 5.0 with PR: #41547

@mangod9 closed this as completed Aug 31, 2020
@danmoseley
Member

Thanks @adamsitnik and @mangod9 for resolving this; I'm very happy that my data turned out to be rather useful!

@adamsitnik
Member Author

@danmosemsft @mangod9 I wonder how we could prevent a similar problem in the future. Perhaps we should add some asserts to the method to ensure that it never returns 0 again?
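
For example, a guard of roughly this shape (the placement, helper name, and message are hypothetical; the runtime has its own assert macros) would flag a zero result in debug builds instead of letting the GC budgets silently collapse:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical check: fail fast in debug builds if the cache size query
// comes back as 0 rather than quietly shrinking the gen0 budget.
static size_t ValidateCacheSizeSketch(size_t reportedCacheSize)
{
    assert(reportedCacheSize != 0 && "cache size query returned 0");
    return reportedCacheSize;
}
```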

@mangod9
Member

mangod9 commented Sep 1, 2020

@danmosemsft Yeah, it was very helpful in pinpointing where the issue might be. @adamsitnik Yeah, I will create a separate issue to track how we can add tests/asserts for this.

@danmoseley
Member

@Lxiamail @billwert @DrewScoggins this is evidence, I think, that regular Alpine runs in the lab are important. This would have justified servicing, I think. I'm not sure where we left that conversation, @billwert?

@Lxiamail
Member

Lxiamail commented Sep 1, 2020

@Lxiamail @billwert @DrewScoggins this is evidence I think that regular Alpine runs in the lab are important. This would have justified servicing, I think. I'm not sure where we left that conversation @billwert ?

@danmosemsft Yes, we have an email discussion going about adding additional OS coverage in the perf lab. We will look into @adamsitnik's finalized report of the exercise's data, and I'm trying to get .NET Core OS usage telemetry data. Hopefully we can identify the commonly used OSes and add them to the perf lab.

@danmoseley
Member

@Lxiamail sounds good, thanks! Telemetry aside, I think if we can only do 2 Linux flavors, Alpine probably needs to be one of those because of its unique characteristics.

@Lxiamail
Member

Lxiamail commented Sep 1, 2020

Telemetry aside, I think if we can only do 2 Linux flavors, Alpine probably needs to be one of those because of its unique characteristics.
@danmosemsft I'm not familiar with the unique characteristics of different Linux OSes. It would be great if someone could articulate the unique Alpine characteristics we are referring to. From PR #41547, I didn't see any Alpine-specific runtime code change.

@Lxiamail
Member

Lxiamail commented Sep 1, 2020

Just as a side note, https://stackshare.io/stackups/alpine-linux-vs-ubuntu shows Alpine has far fewer users than Ubuntu. If we only use OS telemetry to pick the Linux OSes for the perf lab, Alpine may not rank near the top.

@mangod9
Member

mangod9 commented Sep 1, 2020

Alpine is different because it uses musl libc instead of glibc, so it differs from other distros. For this particular fix, constants such as _SC_LEVEL1_DCACHE_SIZE are not defined by musl, hence the fallback to a different method of retrieving the cache size was mostly Alpine-specific (there are several such subtle differences on Alpine).

@danmoseley
Member

Right, anything related to libc can have different characteristics. Beyond musl, Alpine is the most "size optimized" distro we currently support, so it may have unique characteristics elsewhere because of those optimizations (e.g. possibly missing bits of /proc; I don't know, just an example). Plus, folks who deploy to Alpine are presumably particularly sensitive to perf (or at least size), so we should particularly care about perf on Alpine.

cc @richlander in case he wants to add anything.

@Lxiamail
Member

Lxiamail commented Sep 1, 2020

@mangod9 @danmosemsft thanks for the info! Looks like we definitely should add Alpine to the perf lab. Are there any other OSes with unique characteristics that should be considered for monitoring in the perf lab?

@danmoseley
Member

Others are better positioned to answer that one; off the top of my head, I cannot remember Linux regressions that wouldn't show up in either Ubuntu or Alpine.

@mangod9
Member

mangod9 commented Sep 1, 2020

Adding Alpine is a good start; we can always re-evaluate if we find any other distro-specific issues.

@ghost locked as resolved and limited the conversation to collaborators Dec 7, 2020