
Performance regression: 6x slower array allocation on Alpine #41398

Closed
adamsitnik opened this issue Aug 26, 2020 · 36 comments · Fixed by #41532

@adamsitnik
Member

adamsitnik commented Aug 26, 2020

From the data I got from @danmosemsft, collected by running the dotnet/performance microbenchmarks on Alpine 3.11 via WSL2, it looks like allocating arrays of both value and reference types has become 6 times slower compared to 3.1.

Initially I thought it was just an outlier, but I can see the same pattern for other collections that internally use arrays (Queue, List, Stack, etc.). The regression is specific to Alpine; Ubuntu 18.04 (with and without WSL2) is fine.

@jkotas @janvorli who would be the best person to investigate that?

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --filter 'System.Collections.CtorGivenSize<Int32>.Array'

System.Collections.CtorGivenSize&lt;Int32&gt;.Array(Size: 512)

| Conclusion | Base (ns) | Diff (ns) | Base/Diff | Modality | Operating System | Bit | Processor Name | Base Runtime | Diff Runtime |
|---|---|---|---|---|---|---|---|---|---|
| Same | 181.79 | 183.96 | 0.99 | | Windows 10.0.18363.1016 | Arm | Microsoft SQ1 3.0 GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20413.9 |
| Same | 92.89 | 94.47 | 0.98 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20404.3 |
| Same | 96.05 | 94.36 | 1.02 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |
| Same | 114.74 | 111.94 | 1.03 | | Windows 10.0.19041.450 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20413.9 |
| Same | 80.49 | 79.98 | 1.01 | | Windows 10.0.19041.450 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.9 |
| Same | 67.30 | 67.66 | 0.99 | bimodal | Windows 10.0.19042 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |
| Same | 86.10 | 79.17 | 1.09 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.14 |
| Same | 97.50 | 98.77 | 0.99 | | Windows 10.0.18363.959 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20420.14 |
| Slower | 127.02 | 150.46 | 0.84 | bimodal | Windows 10.0.19041.450 | X86 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20419.5 |
| Slower | 193.61 | 287.83 | 0.67 | bimodal | ubuntu 18.04 | Arm64 | Unknown processor | .NET Core 3.1.6 | 6.0.100-alpha.1.20421.6 |
| Same | 99.85 | 103.42 | 0.97 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | 5.0.100-rc.1.20403.23 |
| Slower | 138.73 | 151.37 | 0.92 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | 5.0.100-rc.1.20404.2 |
| Slower | 72.85 | 515.56 | 0.14 | | alpine 3.11 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 6.0.100-alpha.1.20421.6 |
| Slower | 78.85 | 90.76 | 0.87 | | ubuntu 18.04 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | .NET Core 3.1.6 | 5.0.100-rc.1.20418.3 |
@adamsitnik added this to the 5.0.0 milestone Aug 26, 2020
@Dotnet-GitSync-Bot added the untriaged label (New issue has not been triaged by the area owner) Aug 26, 2020
@adamsitnik changed the title from "Performance regression: 5x slower array allocation on Alpine" to "Performance regression: 6x slower array allocation on Alpine" Aug 26, 2020
@danmoseley
Member

@billwert

@ghost

ghost commented Aug 26, 2020

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@jkotas
Member

jkotas commented Aug 26, 2020

These tests are measuring raw GC allocation speed. This is a GC performance regression.

@mangod9 Could you please find somebody to investigate this?

@mangod9
Member

mangod9 commented Aug 26, 2020

Ok, will have someone take a look. @adamsitnik, is the regression specific to Alpine on WSL2, or Alpine in general?

@adamsitnik
Member Author

Is the regression specifically with Alpine on WSL2, or Alpine in general?

@mangod9 The benchmarks were executed with Alpine on WSL2; I don't have "Alpine only" results to be able to answer. FWIW, the Ubuntu on WSL2 results are fine, so it might be a general Alpine problem.

@jkotas
Member

jkotas commented Aug 26, 2020

It repros without WSL2 too.

The problem is that the GC is running something like 100x more often than it should. It is likely a problem in the budget computation. One of the places to check is PAL_GetLogicalProcessorCacheSizeFromOS.

@mangod9
Member

mangod9 commented Aug 26, 2020

Ok, thanks for the context. We will look into whether there is any Alpine-specific behavior leading to this.

@mangod9 removed the untriaged label (New issue has not been triaged by the area owner) Aug 26, 2020
@mangod9
Member

mangod9 commented Aug 27, 2020

Here is a simple repro:

```csharp
using System.Diagnostics;

namespace System.Collections
{
    class Program
    {
        static void Main(string[] args)
        {
            Stopwatch sw = Stopwatch.StartNew();

            // Allocate one million short-lived arrays; only gen0 GCs should occur.
            for (int i = 0; i < 1000000; i++)
            {
                var foo = new int[512];
            }

            sw.Stop();
            Console.WriteLine(System.GC.CollectionCount(0));
            Console.WriteLine(sw.ElapsedTicks);
        }
    }
}
```

| | 3.1 (Alpine) | 5.0 p8 (Alpine) |
|---|---|---|
| Gen0 Collections | 330 | 12653 |
| ElapsedTicks | 68689200 | 368227200 |

@Maoni0
Member

Maoni0 commented Aug 27, 2020

Yeah, this is clearly related to calculating the minimal gen0 budget: this test allocates only temporary objects, so it will only trigger gen0 GCs, which always use the min gen0 budget. The min gen0 budget is calculated in get_gen0_min_size, so something in that method changed.
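
To illustrate the failure mode, here is a simplified sketch (not the actual get_gen0_min_size code from gc.cpp; the function name, constants, and clamping are hypothetical) of how a budget derived from the reported cache size collapses when the OS query returns 0:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch only: the real budget computation in gc.cpp is more
// involved. The point is that a reported cache size of 0 (the Alpine case)
// clamps the gen0 budget to its floor, so gen0 fills up and the GC runs far
// more often than it should.
size_t gen0_min_size_sketch(size_t cacheSizePerCpu)
{
    const size_t floorBudget   = 256 * 1024;        // hypothetical lower bound
    const size_t ceilingBudget = 64 * 1024 * 1024;  // hypothetical upper bound
    return std::min(std::max(cacheSizePerCpu, floorBudget), ceilingBudget);
}
```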

@mangod9
Member

mangod9 commented Aug 28, 2020

I investigated this more, and it appears that none of the _SC_LEVEL1_DCACHE_SIZE family of constants are defined for Alpine (musl), so PAL_GetLogicalProcessorCacheSizeFromOS will always return 0.

I wonder if #34488 caused the regression, since I notice this case is missing compared to 3.1.
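
For reference, the sysconf-based query looks roughly like the sketch below (illustrative, not the literal PAL source). The _SC_LEVEL*_CACHE_SIZE names are glibc extensions, so on musl the guarded branches compile out and only 0 can be reported:

```cpp
#include <stddef.h>
#include <unistd.h>

// Simplified sketch: query each cache level via sysconf and keep the largest.
// musl does not define the _SC_LEVEL*_CACHE_SIZE constants, so on Alpine all
// of the branches below compile out and the function returns 0.
static size_t GetCacheSizeFromSysconfSketch()
{
    long size = 0;
#ifdef _SC_LEVEL1_DCACHE_SIZE
    long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    if (l1 > size) size = l1;
#endif
#ifdef _SC_LEVEL2_CACHE_SIZE
    long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);
    if (l2 > size) size = l2;
#endif
#ifdef _SC_LEVEL3_CACHE_SIZE
    long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);
    if (l3 > size) size = l3;
#endif
    return size > 0 ? (size_t)size : 0;
}
```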

@jkotas
Member

jkotas commented Aug 28, 2020

I think you are right. It means that we had a performance bug for Alpine here even before. The code that I have deleted in #34488 produced provably wrong results for AMD and the logic was questionable for Intel too.

Would you like me to work on getting this fixed?

@mangod9
Member

mangod9 commented Aug 28, 2020

I notice that you had an aborted PR, #34484, to fix GetCacheSizeFromCpuId. Does that need to be resurrected? If so, I think I can get that code back.

@jkotas
Member

jkotas commented Aug 28, 2020

#34484 fixed just one of the multiple different bugs that this code had. I think it needs to be written more or less from scratch using current editions of Intel and AMD manuals.

@janvorli
Member

We can also gather this information from /sys/devices/system/cpu without doing Intel/AMD-specific kung-fu. There is, for example, /sys/devices/system/cpu/cpu0/cache/index0/size for the level 1 cache of cpu0, /sys/devices/system/cpu/cpu0/cache/index1/size for level 2, and /sys/devices/system/cpu/cpu0/cache/index2/size for level 3.
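
A minimal sketch of that approach (illustrative only, not the ARM64 code path referenced below): each /sys/devices/system/cpu/cpu0/cache/indexN directory exposes a size file with values like "32K" or "8192K" (plus a level file identifying which cache level it describes), so the largest reported size can be found with a simple scan:

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Illustrative sketch: scan cpu0's cache index directories and return the
// largest reported size in bytes. Assumes the usual "<number>K"/"<number>M"
// format of the sysfs size files.
static size_t GetLargestCacheSizeFromSysfsSketch()
{
    size_t largest = 0;
    for (int index = 0; ; index++)
    {
        std::string dir =
            "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(index);
        std::ifstream sizeFile(dir + "/size");
        if (!sizeFile.is_open())
            break;                        // no more index directories

        std::string text;
        sizeFile >> text;                 // e.g. "32K" or "8192K"
        if (text.empty())
            continue;

        size_t value = std::stoul(text);  // numeric prefix
        char suffix = text.back();
        if (suffix == 'K' || suffix == 'k')
            value *= 1024;
        else if (suffix == 'M' || suffix == 'm')
            value *= 1024 * 1024;

        if (value > largest)
            largest = value;
    }
    return largest;
}
```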

@jkotas
Member

jkotas commented Aug 28, 2020

We have this code for ARM64, so we can just remove the ifdef around it and use it unconditionally.

@mangod9
Member

mangod9 commented Aug 28, 2020

Yeah, just removing the ifdef seems like the best option for 5.0. Will create a PR; do we need to rethink this for 6.0?

@jkotas
Member

jkotas commented Aug 28, 2020

Agree. We have a duplicate of this code in gcenv.unix.cpp for the standalone GC, so please remove it there as well.

I think it is fine as a permanent fix ... until we discover the next problem with this.

@am11
Member

am11 commented Aug 28, 2020

imo, the better fix would be:

```diff
- #if defined(HOST_ARM64)
+ #if defined(TARGET_LINUX)
```

to keep the support for non-Linux Unix-like operating systems intact (macOS, FreeBSD, SunOS, and so forth).

@mangod9
Member

mangod9 commented Aug 29, 2020

Reopening so it can be ported to .NET 5.

@mangod9 mangod9 reopened this Aug 29, 2020
@mangod9
Member

mangod9 commented Aug 29, 2020

I have validated that the simplified repro shows low gen0 collection counts. I am unable to get the benchmarks working on WSL2 (will investigate). @adamsitnik, if you could please validate that the benchmarks show perf comparable to 3.1, that would be great (I guess we need to wait for a daily build anyway). Thx.

@adamsitnik
Member Author

I am unable to get the benchmarks working on WSL2 (will investigate)

@mangod9 what error are you getting? You should be able to run the python script and get 3.1 results:

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 --filter 'System.Collections.CtorGivenSize<Int32>.Array'

And then run the benchmarks as 5.0 using your local build of dotnet/runtime by specifying the path to corerun:

python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Collections.CtorGivenSize<Int32>.Array' \
 --corerun './runtime/artifacts/bin/testhost/net5.0-Alpine-Release-x64/shared/Microsoft.NETCore.App/5.0.0/corerun'

You can read more about benchmarking local builds of dotnet/runtime here

if you could please validate that the benchmarks are showing comparable perf to 3.1 that would be great

I currently don't have access to an Alpine machine, but as soon as #41547 gets propagated to the SDK, I can spawn a new Azure VM and run the benchmarks.

@mangod9
Member

mangod9 commented Aug 31, 2020

The container was missing libgdiplus on Alpine, but I was able to resolve it by adding the package from http://dl-3.alpinelinux.org/alpine/edge/testing/.

I was able to validate that the fix brings the perf back in line with 3.1:

| Alpine | 3.1 | 5.0 | 6.0 |
|---|---|---|---|
| CtorGivenSize.Array | 80.30 ns | 737.6 ns | 87.62 ns |

@mangod9
Member

mangod9 commented Aug 31, 2020

Ported to 5.0 with PR: #41547

@mangod9 closed this as completed Aug 31, 2020
@danmoseley
Member

Thanks @adamsitnik and @mangod9 for resolving this; I'm very happy that my data turned out to be rather useful!

@adamsitnik
Member Author

@danmosemsft @mangod9 I wonder how we could prevent a similar problem in the future. Perhaps we should add some asserts to the method to ensure that it never returns 0 again?
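
For example, a guard of roughly this shape (the placement, helper name, and message are hypothetical; the runtime has its own assert macros) would flag a zero result in debug builds instead of letting the GC budgets silently collapse:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical check: fail fast in debug builds if the cache size query
// comes back as 0 rather than quietly shrinking the gen0 budget.
static size_t ValidateCacheSizeSketch(size_t reportedCacheSize)
{
    assert(reportedCacheSize != 0 && "cache size query returned 0");
    return reportedCacheSize;
}
```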

@mangod9
Member

mangod9 commented Sep 1, 2020

@danmosemsft Yeah, it was very helpful in pinpointing where the issue might be. @adamsitnik Yeah, I will create a separate issue to track how we can add tests/asserts for this.

@danmoseley
Member

@Lxiamail @billwert @DrewScoggins this is evidence, I think, that regular Alpine runs in the lab are important. This would have justified servicing, I think. I'm not sure where we left that conversation, @billwert?

@Lxiamail
Member

Lxiamail commented Sep 1, 2020

@Lxiamail @billwert @DrewScoggins this is evidence I think that regular Alpine runs in the lab are important. This would have justified servicing, I think. I'm not sure where we left that conversation @billwert ?

@danmosemsft Yes, we have an email discussion going about adding additional OS coverage in the perf lab. We will look into @adamsitnik's finalized report of the exercise's data, and I'm trying to get .NET Core OS usage telemetry data. Hopefully we can identify the commonly used OSes and add them to the perf lab.

@danmoseley
Member

@Lxiamail sounds good, thanks! Telemetry aside, I think if we can only do 2 Linux flavors, Alpine probably needs to be one of those because of its unique characteristics.

@Lxiamail
Member

Lxiamail commented Sep 1, 2020

Telemetry aside, I think if we can only do 2 Linux flavors, Alpine probably needs to be one of those because of its unique characteristics.
@danmosemsft I'm not familiar with the unique characteristics of different Linux OSes. It would be great if someone could articulate the unique Alpine characteristics we are referring to. From PR #41547, I didn't see any Alpine-specific runtime code change.

@Lxiamail
Member

Lxiamail commented Sep 1, 2020

Just as a side note, https://stackshare.io/stackups/alpine-linux-vs-ubuntu shows Alpine has far fewer users than Ubuntu. If we only use OS telemetry to pick the Linux OSes for the perf lab, Alpine may not rank near the top.

@mangod9
Member

mangod9 commented Sep 1, 2020

Alpine is different because it uses musl libc instead of glibc, so it differs from other distros. For this particular fix, constants such as _SC_LEVEL1_DCACHE_SIZE are not defined by musl, hence the fallback to a different method of retrieving the cache size was mostly Alpine-specific (there are several such subtle differences on Alpine).

@danmoseley
Member

Right, anything related to libc can have different characteristics. Beyond musl, Alpine is the most "size optimized" distro we currently support, so it may have unique characteristics elsewhere because of those optimizations (e.g. possibly missing bits of /proc; I don't know, just an example). Plus, folks who deploy to Alpine are presumably particularly sensitive to perf (or at least size), so we should particularly care about perf on Alpine.

cc @richlander in case he wants to add anything.

@Lxiamail
Member

Lxiamail commented Sep 1, 2020

@mangod9 @danmosemsft thanks for the info! Looks like we definitely should add Alpine to the perf lab. Are there any other OSes with unique characteristics that should be considered for monitoring in the perf lab?

@danmoseley
Member

Others are better positioned to answer that one; off the top of my head, I cannot remember Linux regressions that wouldn't show up in either Ubuntu or Alpine.

@mangod9
Member

mangod9 commented Sep 1, 2020

Adding Alpine is a good start; we can always re-evaluate if we find any other distro-specific issues.

@ghost locked as resolved and limited the conversation to collaborators Dec 7, 2020