Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ARM] Potential codegen issues on Graviton3 #100705

Closed
redknightlois opened this issue Apr 5, 2024 · 21 comments
Closed

[ARM] Potential codegen issues on Graviton3 #100705

redknightlois opened this issue Apr 5, 2024 · 21 comments
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Milestone

Comments

@redknightlois
Copy link

redknightlois commented Apr 5, 2024

Description

We have included Graviton3 chipsets on our testing matrix and found many different tests that error consistently. However, those errors are not triggering when run in the following platforms:

  • Linux x64
  • Windows x64
  • Windows x86

Moreover, when run continuously errors only trigger after 2 conditions:

  • It is run in arm64 (reproduced on Graviton3)
  • A set amount of iterations happened (always beyond 128)
  • It is run in release mode

At first we suspected that we were doing something wrong on our end [its possible we are corrupting the heap somehow] but the deterministic nature of the failure and repeatability after it triggers for the first time makes us suspect it is something more permanent than a heap corruption.

I include several different ways to reproduce the errors.

Reproduction Steps

Download any of the following reproductions:

https://github.com/redknightlois/ravendb/tree/numerical-difference
https://github.com/redknightlois/ravendb/tree/failed-datetime-comparison
https://github.com/redknightlois/ravendb/tree/failed-patching

Run them in order to ensure that optimizations will kick in:

cd test/Tryouts
dotnet run -c Release

Observe the errors after a set amount of loops.

Expected behavior

They should execute the 1000 loops without printing any exception.

Actual behavior

Trigger exception repeatedly after a set amount of loops.

Regression?

No response

Known Workarounds

None

Configuration

.NET SDK:
 Version:           8.0.103
 Commit:            6a90b4b4bc
 Workload version:  8.0.100-manifests.e99a2be4

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  22.04
 OS Platform: Linux
 RID:         ubuntu.22.04-arm64
 Base Path:   /usr/lib/dotnet/sdk/8.0.103/

.NET workloads installed:
 Workload version: 8.0.100-manifests.e99a2be4
There are no installed workloads to display.

Host:
  Version:      8.0.3
  Architecture: arm64
  Commit:       9f4b1f5d66

.NET SDKs installed:
  8.0.103 [/usr/lib/dotnet/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 7.0.17 [/usr/lib/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 8.0.3 [/usr/lib/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 7.0.17 [/usr/lib/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 8.0.3 [/usr/lib/dotnet/shared/Microsoft.NETCore.App]

Other information

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Apr 5, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 5, 2024
@redknightlois redknightlois changed the title [ARM] Potential codegen issues on Cortex M3 [ARM] Potential codegen issues on Graviton3 Apr 5, 2024
@jkotas jkotas added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Apr 5, 2024
@EgorBo
Copy link
Member

EgorBo commented Apr 5, 2024

Would be nice to have more details - does it fail on other ARM64 machines, which tests and what is the expected/actual inputs/outputs? Is it floating point related? E.g. we do not guarantee that all floating-point related operations e.g. casts to integers, edge cases in Math.* calls are always 100% the same for the same input on all architectures/platforms. Would be nice to confirm that the issue is indeed on dotnet/runtime side.

@redknightlois
Copy link
Author

redknightlois commented Apr 6, 2024

Some are binding errors at the runtime level which points into heap corruptions. One is showing 2 floating point numbers which is clearly not a floating point issue, more like getting the data from a different memory location.

I only have Graviton3 from AWS on ARM64, it is consistently failing on all of the machines provisioned. All the tests execute properly on Linux and Windows on 32-bits and 64-bit Intel.

That's why I provided 3 different ways to trigger the behavior with completely different error behaviors. But since, they all trigger after a number of passes happen (and the number is consistent across runs) and they all run successfully run under Debug build, it leads into suspect that after a certain optimization is performed, the errors trigger.

@ayende
Copy link
Contributor

ayende commented Apr 10, 2024

Okay, I have a much smaller scope for reproducing this.
Here is the relevant code (running as Tryout/Program.cs):

using System;
using System.Diagnostics;
using System.IO;
using System.Net.Http.Json;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;
using Amazon.S3.Model;
using Tests.Infrastructure;
using Raven.Server.Utils;
using SlowTests.Corax;
using SlowTests.Sharding.Cluster;
using Xunit;
using FastTests.Voron.Util;
using Jint;
using Jint.Native;
using Jint.Native.Function;
using Jint.Native.Object;
using Raven.Server.Config;
using Raven.Server.Documents;
using Raven.Server.Documents.Patch;
using Raven.Server.Documents.Queries.LuceneIntegration;
using Sparrow.Json;

namespace Tryouts;

public static class Program
{
    static Program()
    {
        XunitLogging.RedirectStreams = false;
    }

    public class MyKey : ScriptRunnerCache.Key
    {
        public override void GenerateScript(ScriptRunner runner)
        {
            runner.AddScript("function project(p) { return { Long: p.Long, Double: p.Double }; } ");
        }

        public override bool Equals(object obj)
        {
            return ReferenceEquals(this, obj);
        }

        public override int GetHashCode()
        {
            return RuntimeHelpers.GetHashCode(this);
        }
    }

    public static void Main(string[] args)
    {

        using var ctx = JsonOperationContext.ShortTermSingleUse();

        var ms = new MemoryStream("{'Long': 9007199254740990, 'Double': 1.7976931348623157E+308}"u8.ToArray());
        var doc = new Document { Data = ctx.ReadForMemoryAsync(ms, "test").Result };

        var func = Jint.Engine.PrepareScript("function project(p) { return { Long: p.Long, Double: p.Double }; } ", strict: false);
        var engine = new Jint.Engine();
        var jsonStr = (Function)engine.GetValue("JSON").AsObject().Get("stringify");
        engine.Execute(func);

        var method = engine.GetValue("project");
        const string expected = "{\"Long\":9007199254740990,\"Double\":1.7976931348623157E+308}";


        for (int i = 0; i < 100_000; i++)
        {
            var v = method.Call(new BlittableObjectInstance(engine, null, doc.Data, doc));

            var j = JsBlittableBridge.Translate(ctx, engine, (ObjectInstance)v).ToString();
            if (j != expected)
            {
                Console.WriteLine(i + " " + j);
                return;
            }
        }

        Console.WriteLine("Done");
    }
}

To make things easier, what this does is execute this funciton:

function project(p) { return { Long: p.Long, Double: p.Double }; }

Over this input:

{'Long': 9007199254740990, 'Double': 1.7976931348623157E+308}

Will sometimes return:

ubuntu@ip-172-31-47-129:~/ravendb/test/Tryouts$ dotnet run -c release
15862 {"Long":9007199254740990.0,"Double":9223372036854775807}
ubuntu@ip-172-31-47-129:~/ravendb/test/Tryouts$ dotnet run -c release
18067 {"Long":9007199254740990.0,"Double":9223372036854775807}
ubuntu@ip-172-31-47-129:~/ravendb/test/Tryouts$ dotnet run -c release
20129 {"Long":9007199254740990.0,"Double":9223372036854775807}

Note that the number of iterations is different.

I am assuming that this is tier JIT running and a bug happening there somewhere.

I wasn't able to simplify this further, I'm afraid.

When running this with debug mode, it completes successfully each time.

ubuntu@ip-172-31-47-129:~/ravendb/test/Tryouts$ dotnet run -c debug
Done
ubuntu@ip-172-31-47-129:~/ravendb/test/Tryouts$ dotnet run -c debug
Done

Once this happens, it is entirely reproducible.

@redknightlois
Copy link
Author

redknightlois commented Apr 11, 2024

We dont see the error immediately because of tiered compilation, but it fails immediately on the first iteration when disabled.

  <PropertyGroup>
    <TieredCompilation>false</TieredCompilation>
  </PropertyGroup>

This with a simplifier reproduction that barely needs anything confirms the suspicion that the error is related to code generation on ARM64 platform.

When we disable quick jit even if tiered compilation is on, we also fail on first iteration.

  <PropertyGroup>
    <TieredCompilation>true</TieredCompilation>
     <TieredCompilationQuickJit>false</TieredCompilationQuickJit>
  </PropertyGroup>

However, when we disable only quick jit for loops we revert to fail after a set amount of iterations again.

  <PropertyGroup>
    <TieredCompilation>true</TieredCompilation>
     <TieredCompilationQuickJit>true</TieredCompilationQuickJit>
     <TieredCompilationQuickJitForLoops>false</TieredCompilationQuickJitForLoops>
  </PropertyGroup>

@AndyAyersMS
Copy link
Member

@EgorBo can you see if the repro above works for us? If not I can look later this week.

@ayende
Copy link
Contributor

ayende commented Apr 18, 2024

Some more information on this.

A user is reporting that running RavenDB or Mac M2 will crash with:

2024-04-16 13:28:58 assertion failed [block != nullptr]: BasicBlock requested for unrecognized address
2024-04-16 13:28:58 (BuilderBase.h:550 block_for_offset)

When running docker run ravendb/ravendb:5.4.116-ubuntu.22.04-arm64v8

When setting DOTNET_TC_QuickJitForLoops=0, it goes further and fails with:

ravendb exited with code 139

@ayende
Copy link
Contributor

ayende commented Apr 18, 2024

We tried with:

DOTNET_TieredCompilation=0
DOTNET_TC_QuickJit=0
DOTNET_TC_QuickJitForLoops=0

And the user is getting:

ravendb      | Running non-interactive.

ravendb      | assertion failed [block != nullptr]: BasicBlock requested for unrecognized address

ravendb      | (BuilderBase.h:550 block_for_offset)

ravendb      |

ravendb exited with code 133

Is there any way for us to extract additional information here?

@EgorBo
Copy link
Member

EgorBo commented Apr 18, 2024

@EgorBo can you see if the repro above works for us? If not I can look later this week.

I managed to repro it using dotnet run from SDK, but it never reproduces on locally build runtime (either net9 or net8, either Release or Checked) so I am a bit puzzled, the project is quite huge - does it spawn processes internally, does it try to build some managed code dynamically on run?

PS: it doesn't even repro when I copy my locally built net8.0 bits (from release branch) to the .net 8.0 sdk's folder

@jkotas
Copy link
Member

jkotas commented Apr 18, 2024

assertion failed [block != nullptr]: BasicBlock requested for unrecognized address
(BuilderBase.h:550 block_for_offset)

What is the component that contains this assert check? (There is no assert check like this in .NET runtime.)

@filipnavara
Copy link
Member

filipnavara commented Apr 18, 2024

What is the component that contains this assert check?

It's from the Rosetta emulation layer on macOS.

@EgorBo
Copy link
Member

EgorBo commented Apr 18, 2024

PS: My comment about dotnet run was for the problem with the repro printing:

{'Long': 9007199254740990, 'Double': 1.7976931348623157E+308}

instead of:

Done

In Release, not for the Rosetta issue which I guess is a separate issue - I was running the repro on Linux-arm64

@ayende
Copy link
Contributor

ayende commented Apr 19, 2024

When running dotnet run for the Tryout project, there shouldn't be any code being generated.
This is running a single threaded code and most of the code there is never invoked.

Note that I don't know if both of those issues are related. They are both with ARM64 and the env vars make it looks like it has the same impact, but may be two different items.

@AndyAyersMS
Copy link
Member

I take it this doesn't (yet) work on windows arm64?

Starting to run 0
Default HTTP Compression Algorithm: Gzip
Default HTTP Pooled Connection Idle Timeout:
Default HTTP Version Policy:
Max number of concurrent tests is: 4
System.TypeInitializationException: The type initializer for 'Raven.Server.Documents.Operations.OperationsStorage' threw an exception.
 ---> System.TypeInitializationException: The type initializer for 'Voron.StorageEnvironment' threw an exception.
 ---> System.TypeInitializationException: The type initializer for 'Sparrow.Server.Platform.Pal' threw an exception.
 ---> Sparrow.Utils.IncorrectDllException: librvnpal version might be invalid, missing or not usable on current platform. Initialization error could also be caused by missing 'Microsoft Visual C++ 2015 Redistributable Package' (or newer). It can be downloaded from https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads. Arch: Arm64, OSDesc: Microsoft Windows 10.0.22635
 ---> System.BadImageFormatException: An attempt was made to load a program with an incorrect format. (0x8007000B)
   at System.Runtime.InteropServices.NativeLibrary.LoadLibraryByName(String libraryName, Assembly assembly, Nullable`1 searchPath, Boolean throwOnError)
   at Sparrow.Utils.DynamicNativeLibraryResolver.Resolver(String libraryName, Assembly assembly, Nullable`1 dllImportSearchPath) in C:\repos\ravendb\src\Sparrow\Utils\DynamicNativeLibraryResolver.cs:line 70
   at System.Runtime.InteropServices.NativeLibrary.LoadLibraryCallbackStub(String libraryName, Assembly assembly, Boolean hasDllImportSearchPathFlags, UInt32 dllImportSearchPathFlags)
   at Sparrow.Server.Platform.Pal.rvn_get_pal_ver()
   at Sparrow.Server.Platform.Pal..cctor() in C:\repos\ravendb\src\Sparrow.Server\Platform\Pal.cs:line 24

I'll try it under WSL2.

@AndyAyersMS
Copy link
Member

I can repro (I think) on my volterra under WSL2:

Starting to run 169
Starting to run 170
Xunit.Sdk.EqualException: Assert.Equal() Failure: Values differ
Expected: 1.7976931348623157E+308
Actual:   9.2233720368547758E+18
   at Xunit.Assert.Equal[T](T expected, T actual, IEqualityComparer`1 comparer) in /_/src/xunit.assert/Asserts/EqualityAsserts.cs:line 148
   at Xunit.Assert.Equal[T](T expected, T actual) in /_/src/xunit.assert/Asserts/EqualityAsserts.cs:line 83
   at SlowTests.Issues.RavenDB_14036.CanExtractStoredNumberFieldsUsingJsProjection(Options options) in /home/andya/repos/ravendb/test/SlowTests/Issues/RavenDB-14036.cs:line 133
   at Tryouts.Program.Main(String[] args) in /home/andya/repos/ravendb/test/Tryouts/Program.cs:line 49

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Apr 19, 2024

Above was with 8.0.4.

@EgorBo this also no longer repros for me if I drop in a locally built checked libclrjit.so. Am going to see if release does likewise.

It does.

Going to try an SPMI collection and then diff replay using the retail jit and the ones I've built locally.

@AndyAyersMS
Copy link
Member

Three methods with diffs, all are

image

Impacted methods are

<WriteNumber>g__GuessNumberType|15_1
WriteNumber
WriteJsonValue

though without assembly info this is probably not enough to figure out where to disable opts (I can retry with a checked runtime and jit I suppose, if you all want to try crafting a workaround).

Looks like this is #92201. The fix was ported to 8.0 in #100372 but hasn't made it into a release yet. Looks like it should appear in 8.0.5, due the middle of May.

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Apr 19, 2024
@redknightlois
Copy link
Author

We have a customer and our own test servers impacted by this. Is there any way we could get the pre-release servicing build in order for those systems to be upgraded?

@jkotas
Copy link
Member

jkotas commented Apr 22, 2024

We do not publish official pre-release servicing builds. The servicing builds contain security fixes that we cannot disclose publicly before the release.

You may want to create your own build using the same docker image that is used for official builds https://github.com/dotnet/runtime/blob/release/8.0/docs/workflow/building/coreclr/linux-instructions.md#docker-images ,

@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Apr 26, 2024
@JulieLeeMSFT
Copy link
Member

We do not publish official pre-release servicing builds. The servicing builds contain security fixes that we cannot disclose publicly before the release.

You may want to create your own build using the same docker image that is used for official builds https://github.com/dotnet/runtime/blob/release/8.0/docs/workflow/building/coreclr/linux-instructions.md#docker-images ,

@redknightlois, please let us know if either your own build or an official servicing build fixes the problem.

@JulieLeeMSFT JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Apr 26, 2024
@redknightlois
Copy link
Author

As soon as the official servicing build is out I will install and check.

@dotnet-policy-service dotnet-policy-service bot added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Apr 26, 2024
@redknightlois
Copy link
Author

Confirm 8.0.5 solves the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
Projects
None yet
Development

No branches or pull requests

7 participants