Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mono?] Cannot run host tests due to xunit version mismatch #60550

Open
uweigand opened this issue Oct 18, 2021 · 26 comments
Open

[mono?] Cannot run host tests due to xunit version mismatch #60550

uweigand opened this issue Oct 18, 2021 · 26 comments
Labels
arch-s390x Related to s390x architecture (unsupported) area-Build-mono
Milestone

Comments

@uweigand
Copy link
Contributor

Description

Attempting to run the host tests on a Mono-based linux-s390x runtime fails with:

[ERROR] FATAL UNHANDLED EXCEPTION: System.TypeLoadException: Could not load type of field 'Xunit.ConsoleClient.ConsoleRunner:completionMessages' (3) due to: Could not load file or assembly 'xunit.runner.utility.netcoreapp10, Version=2.4.2.0, Culture=neutral, PublicKeyToken=8d05b1bb7a6fdb6c' or one of its dependencies.

Reproduction Steps

Run ./build host.tests --test on the current release/6.0 branch on linux-s390x.

Expected behavior

Test execution starts.

Actual behavior

Test fails to start with the error message shown above.

Regression?

Yes. This worked before 491ed9a .

Known Workarounds

Revert XUnitVersion back to 2.4.1. (This also requires moving to back-level versions of MicrosoftDotNetXUnitExtensionsVersion and MicrosoftDotNetXUnitConsoleRunnerVersion.)

Configuration

Runtime from current .NET 6 branch built on linux-s390x.

Other information

The root cause of the issue appears to be that the host tests (e.g. Microsoft.NET.HostModel.AppHost.Tests) depend on both

          "xunit.runner.console": "2.4.2-pre.9",

and

          "xunit.runner.visualstudio": "2.4.2"

Both of these packages bring their own copy of xunit.runner.utility.netcoreapp10.dll, but in two different versions. xunit.runner.console version 2.4.2-pre.9 brings version 2.4.2 of that DLL, while xunit.runner.visualstudio version 2.4.2 brings version 2.4.1 of that DLL.

During restore, it looks like one of those copies "wins", and this happens to be the older one (2.4.1). This causes the Mono loader error while loading dependencies of xunit.runner.console, which expects the newer version (2.4.2).

I tried to re-create the issue on Intel, but there I'm not seeing the error, even though the overall situation looks identical. So maybe part of the root cause may also involve the Mono loader - maybe this is more strict in checking version matches than the CoreCLR loader for some reason?

CC @steveisok @directhex @maryamariyan

@dotnet-issue-labeler dotnet-issue-labeler bot added area-Host untriaged New issue has not been triaged by the area owner labels Oct 18, 2021
@steveisok
Copy link
Member

/cc @akoeplinger

@akoeplinger
Copy link
Member

Can you try if you see the issue on main too or just release/6.0?

@uweigand
Copy link
Contributor Author

Can you try if you see the issue on main too or just release/6.0?

On main, I don't get that far: here the problem is that it tries to build the tests as net7.0 code, but using the host SDK, which doesn't even support net7.0 yet ... Not sure what's going on there.

@lambdageek
Copy link
Member

@uweigand can you run with MONO_LOG_LEVEL=debug MONO_LOG_MASK=asm on s390x - the loader should indicate if it found the assembly but rejected it for some reason

@lambdageek lambdageek added arch-s390x Related to s390x architecture (unsupported) area-AssemblyLoader-mono and removed area-Host labels Oct 18, 2021
@lambdageek
Copy link
Member

To the best of our ability we've tried to match mono's loader to the the assembly binder behavior in CoreCLR with respect to versions

  • mono
    if (result && assembly_names_compare_versions (wanted_name, candidate_name, -1) > 0)
  • coreclr
    bool IsCompatibleAssemblyVersion(/* in */ AssemblyName *pRequestedName,
    /* in */ AssemblyName *pFoundName)
    {
    AssemblyVersion *pRequestedVersion = pRequestedName->GetVersion();
    AssemblyVersion *pFoundVersion = pFoundName->GetVersion();
    if (!pRequestedVersion->HasMajor())
    {
    // An unspecified requested version component matches any value for the same component in the found version,
    // regardless of lesser-order version components
    return true;
    }
    if (!pFoundVersion->HasMajor() || pRequestedVersion->GetMajor() > pFoundVersion->GetMajor())
    {
    // - A specific requested version component does not match an unspecified value for the same component in
    // the found version, regardless of lesser-order version components
    // - Or, the requested version is greater than the found version
    return false;
    }
    if (pRequestedVersion->GetMajor() < pFoundVersion->GetMajor())
    {
    // The requested version is less than the found version
    return true;
    }
    if (!pRequestedVersion->HasMinor())
    {
    return true;
    }
    if (!pFoundVersion->HasMinor() || pRequestedVersion->GetMinor() > pFoundVersion->GetMinor())
    {
    return false;
    }
    if (pRequestedVersion->GetMinor() < pFoundVersion->GetMinor())
    {
    return true;
    }
    if (!pRequestedVersion->HasBuild())
    {
    return true;
    }
    if (!pFoundVersion->HasBuild() || pRequestedVersion->GetBuild() > pFoundVersion->GetBuild())
    {
    return false;
    }
    if (pRequestedVersion->GetBuild() < pFoundVersion->GetBuild())
    {
    return true;
    }
    if (!pRequestedVersion->HasRevision())
    {
    return true;
    }
    if (!pFoundVersion->HasRevision() || pRequestedVersion->GetRevision() > pFoundVersion->GetRevision())
    {
    return false;
    }
    return true;
    }

It's possible there's some non-determinism that causes Mono to load the older xunit assembly and CoreCLR to load the newer one.

I'm not sure whether xunit is using ALCs here - it's also possible we're either looking in the wrong ALC or doing some unintended sharing.

@uweigand
Copy link
Contributor Author

@uweigand can you run with MONO_LOG_LEVEL=debug MONO_LOG_MASK=asm on s390x - the loader should indicate if it found the assembly but rejected it for some reason

Well, it rejects it because of the version:

Mono: Request to load xunit.runner.utility.netcoreapp10 in alc 0x2aa31299db0
Mono: netcore preload hook: did not find 'xunit.runner.utility.netcoreapp10'.
Mono: ApplicationBase is /home/uweigand/runtime-net6/artifacts/bin/Microsoft.NET.HostModel.AppHost.Tests/Debug/net6.0/
Mono: Assembly Loader probing location: '/home/uweigand/runtime-net6/artifacts/bin/Microsoft.NET.HostModel.AppHost.Tests/Debug/net6.0/xunit.runner.utility.netcoreapp10.dll'.
Mono: Image addref xunit.runner.utility.netcoreapp10[0x2aa313fb790] (asmctx DEFAULT) -> /home/uweigand/runtime-net6/artifacts/bin/Microsoft.NET.HostModel.AppHost.Tests/Debug/net6.0/xunit.runner.utility.netcoreapp10.dll[0x2aa31464e10]: 2
Mono: Predicate: wanted = xunit.runner.utility.netcoreapp10, Version=2.4.2.0, Culture=neutral, PublicKeyToken=8d05b1bb7a6fdb6c
Mono: Predicate: candidate = xunit.runner.utility.netcoreapp10, Version=2.4.1.0, Culture=neutral, PublicKeyToken=8d05b1bb7a6fdb6c
Mono: Predicate: candidate and wanted names don't match, returning FALSE
Mono: Predicate returned FALSE, skipping 'xunit.runner.utility.netcoreapp10' (/home/uweigand/runtime-net6/artifacts/bin/Microsoft.NET.HostModel.AppHost.Tests/Debug/net6.0/xunit.runner.utility.netcoreapp10.dll)

Mono: Unloading image /home/uweigand/runtime-net6/artifacts/bin/Microsoft.NET.HostModel.AppHost.Tests/Debug/net6.0/xunit.runner.utility.netcoreapp10.dll [0x2aa31464e10].
Mono: The following assembly referenced from /home/uweigand/.nuget/packages/xunit.runner.console/2.4.2-pre.9/tools/netcoreapp2.0/xunit.console.dll could not be loaded:
     Assembly:   xunit.runner.utility.netcoreapp10    (assemblyref_index=10)
     Version:    2.4.2.0
     Public Key: 8d05b1bb7a6fdb6c

Mono: Failed to load assembly xunit.console[0x2aa3138c150].

It also seems to be the default ALC according to those logs.

As to:

It's possible there's some non-determinism that causes Mono to load the older xunit assembly and CoreCLR to load the newer one.

As far as I can see, at execution time there is just one assembly present. During restore, files from the two dependent packages are copied over, and given that each of those two packages contains a copy of that assembly (just at different versions), at the end of the restore process, one copy actually made it into the application directory. And that is the one the Mono loader finds (I don't think it even could find any other version at this point), but it is has the wrong version.

@lambdageek
Copy link
Member

As far as I can see, at execution time there is just one assembly present. During restore, files from the two dependent packages are copied over, and given that each of those two packages contains a copy of that assembly (just at different versions), at the end of the restore process, one copy actually made it into the application directory. And that is the one the Mono loader finds (I don't think it even could find any other version at this point), but it is has the wrong version.

yea doesn't seem like the loader has a lot of options here. It would be nice to confirm that the coreclr loader isn't being less strict in this case for some reason, but it seems like an issue with the build.

@uweigand
Copy link
Contributor Author

As far as I can see, at execution time there is just one assembly present. During restore, files from the two dependent packages are copied over, and given that each of those two packages contains a copy of that assembly (just at different versions), at the end of the restore process, one copy actually made it into the application directory. And that is the one the Mono loader finds (I don't think it even could find any other version at this point), but it is has the wrong version.

yea doesn't seem like the loader has a lot of options here. It would be nice to confirm that the coreclr loader isn't being less strict in this case for some reason, but it seems like an issue with the build.

I did confirm that in my Intel build (using CoreCLR), where the tests succeed, we also have the older version of xunit.runner.utility.netcoreapp10.dll in the application directory, so there doesn't appear to be any difference during restore. It's just that tests executes successfully anyway ...

@uweigand
Copy link
Contributor Author

Given that the error is triggered by Could not load type of field 'Xunit.ConsoleClient.ConsoleRunner:completionMessages', maybe the underlying difference is actually that CoreCLR doesn't even try to load that assembly? I think it is OK for a field to refer to some type that isn't currently loaded, and that load is only triggered in some cases. I think we already saw a similar case where Mono tries (and fails) to load the assembly defining a type, where CoreCLR simply wouldn't even load it at that point ...

@lambdageek
Copy link
Member

lambdageek commented Oct 18, 2021

I think it is OK for a field to refer to some type that isn't currently loaded, and that load is only triggered in some cases. I think we already saw a similar case where Mono tries (and fails) to load the assembly defining a type, where CoreCLR simply wouldn't even load it at that point ...

If that's the case this is probably a hard problem.

completionMessages is an instance field with type [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentDictionary`2<string,[xunit.runner.utility.netcoreapp10]Xunit.ExecutionSummary>. I think mono triggers assembly loading just trying to assign a type to the field (with mono_field_resolve_type called from mono_class_setup_fields). This happens even before we can determine if the field is a reference or value type. (You could imagine trying to avoid loading a type of a generic instantiation if we know that the generic type definition is a reference type). Ultimately it's because mono_metadata_parse_type_checked can't create the generic instance without triggering a load.

@uweigand
Copy link
Contributor Author

As far as I can see, at execution time there is just one assembly present. During restore, files from the two dependent packages are copied over, and given that each of those two packages contains a copy of that assembly (just at different versions), at the end of the restore process, one copy actually made it into the application directory. And that is the one the Mono loader finds (I don't think it even could find any other version at this point), but it is has the wrong version.

yea doesn't seem like the loader has a lot of options here. It would be nice to confirm that the coreclr loader isn't being less strict in this case for some reason, but it seems like an issue with the build.

I did confirm that in my Intel build (using CoreCLR), where the tests succeed, we also have the older version of xunit.runner.utility.netcoreapp10.dll in the application directory, so there doesn't appear to be any difference during restore. It's just that tests executes successfully anyway ...

Huh, that's interesting. While the copy in the application directory is indeed the same, on Intel I see (just using strace) the test execution directly use the copy in the .nuget package directory - which is of course the correct version:

[pid 62768] openat(AT_FDCWD, "/home/uweigand/.nuget/packages/xunit.runner.console/2.4.2-pre.9/tools/netcoreapp2.0/xunit.runner.utility.netcoreapp10.dll", O_RDONLY) = 34

versus on s390x:

[pid 1726846] openat(AT_FDCWD, "/home/uweigand/runtime-net6/artifacts/bin/Microsoft.NET.HostModel.AppHost.Tests/Debug/net6.0/xunit.runner.utility.netcoreapp10.dll", O_RDONLY) = 5

I'm not sure how this happens -- is there a way to create traces for the CoreCLR loader as well?

@lambdageek
Copy link
Member

I'm not sure how this happens -- is there a way to create traces for the CoreCLR loader as well?

https://docs.microsoft.com/en-us/dotnet/core/dependency-loading/collect-details

@uweigand
Copy link
Contributor Author

uweigand commented Oct 19, 2021

It turns out that the CoreCLR loader doesn't find the xunit.runner.utility.netcoreapp10.dll assembly either. Instead, the correct version of that library, when running on CoreCLR, is actually provided by a Resolving event handler that is installed by the Main routine of Xunit.ConsoleClient (via AssemblyHelper.SubscribeResolveForAssembly):

namespace Xunit.ConsoleClient
{
    public class Program
    {
        [STAThread]
        public static int Main(string[] args)
        {
            // This code must not contain any references to code in any external assembly (or any code that references any code in any
            // other assembly) until AFTER the creation of the AssemblyHelper.
            var consoleLock = new object();
            var internalDiagnosticsMessageSink = DiagnosticMessageSink.ForInternalDiagnostics(consoleLock, args.Contains("-internaldiagnostics"), args.Contains("-nocolor"));

            using (AssemblyHelper.SubscribeResolveForAssembly(typeof(Program), internalDiagnosticsMessageSink))
                return new ConsoleRunner(consoleLock).EntryPoint(args);
        }
    }
}

Unfortunately, when running under Mono, we never even get to installing that Resolving event handler. This is because the runtime aborts already during compilation of that very Main routine! It looks like mere reference to the ConsoleRunner type triggers an attempted load of the defining assembly, but that cannot be resolved until after the event handler was installed:

#0  mono_field_resolve_type (field=0x2aa0023a268, error=0x3ffffff9f00) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class.c:6431
#1  0x000003fffd0dc3c8 in mono_class_setup_fields (klass=0x2aa002a4b50) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class-init.c:322
#2  0x000003fffd0e6c2e in init_sizes_with_info (klass=0x2aa002a4b50, cached_info=0x0) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class-init.c:1754
#3  0x000003fffd0dcb74 in mono_class_init_internal (klass=0x2aa002a4b50) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class-init.c:2804
#4  0x000003fffd386a02 in mono_method_to_ir (cfg=0x2aa0023b110, method=0x2aa00035cd8, start_bblock=0x2aa00265fa0, end_bblock=0x2aa002660e0, return_var=0x0, inline_args=0x0, 
    inline_offset=0, is_virtual_call=0) at /home/uweigand/runtime-net6/src/mono/mono/mini/method-to-ir.c:8698
#5  0x000003fffd339418 in mini_method_compile (method=0x2aa00035cd8, opts=374417919, flags=JIT_FLAG_RUN_CCTORS, parts=0, aot_method_index=-1)
    at /home/uweigand/runtime-net6/src/mono/mono/mini/mini.c:3460

(method at this point is Xunit.ConsoleClient.Program:Main)

This is the mono_class_init_internal call in the MONO_CEE_NEWOBJ case of mono_method_to_ir, used to implement the new ConsoleRunner expression. It looks like for some reason CoreCLR is able to compile this without resolving all the field types, but Mono currently is not ... Any suggestion what to do here?

@lambdageek
Copy link
Member

It turns out that the CoreCLR loader doesn't find the xunit.runner.utility.netcoreapp10.dll assembly either. Instead, the correct version of that library, when running on CoreCLR, is actually provided by a Resolving event handler that is installed by the Main routine of Xunit.ConsoleClient (via AssemblyHelper.SubscribeResolveForAssembly):

namespace Xunit.ConsoleClient
{
    public class Program
    {
        [STAThread]
        public static int Main(string[] args)
        {
            // This code must not contain any references to code in any external assembly (or any code that references any code in any
            // other assembly) until AFTER the creation of the AssemblyHelper.
            var consoleLock = new object();
            var internalDiagnosticsMessageSink = DiagnosticMessageSink.ForInternalDiagnostics(consoleLock, args.Contains("-internaldiagnostics"), args.Contains("-nocolor"));

            using (AssemblyHelper.SubscribeResolveForAssembly(typeof(Program), internalDiagnosticsMessageSink))
                return new ConsoleRunner(consoleLock).EntryPoint(args);
        }
    }
}

Unfortunately, when running under Mono, we never even get to installing that Resolving event handler. This is because the runtime aborts already during compilation of that very Main routine! It looks like mere reference to the ConsoleRunner type triggers an attempted load of the defining assembly, but that cannot be resolved until after the event handler was installed:

#0  mono_field_resolve_type (field=0x2aa0023a268, error=0x3ffffff9f00) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class.c:6431
#1  0x000003fffd0dc3c8 in mono_class_setup_fields (klass=0x2aa002a4b50) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class-init.c:322
#2  0x000003fffd0e6c2e in init_sizes_with_info (klass=0x2aa002a4b50, cached_info=0x0) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class-init.c:1754
#3  0x000003fffd0dcb74 in mono_class_init_internal (klass=0x2aa002a4b50) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class-init.c:2804
#4  0x000003fffd386a02 in mono_method_to_ir (cfg=0x2aa0023b110, method=0x2aa00035cd8, start_bblock=0x2aa00265fa0, end_bblock=0x2aa002660e0, return_var=0x0, inline_args=0x0, 
    inline_offset=0, is_virtual_call=0) at /home/uweigand/runtime-net6/src/mono/mono/mini/method-to-ir.c:8698
#5  0x000003fffd339418 in mini_method_compile (method=0x2aa00035cd8, opts=374417919, flags=JIT_FLAG_RUN_CCTORS, parts=0, aot_method_index=-1)
    at /home/uweigand/runtime-net6/src/mono/mono/mini/mini.c:3460

(method at this point is Xunit.ConsoleClient.Program:Main)

This is the mono_class_init_internal call in the MONO_CEE_NEWOBJ case of mono_method_to_ir, used to implement the new ConsoleRunner expression. It looks like for some reason CoreCLR is able to compile this without resolving all the field types, but Mono currently is not ... Any suggestion what to do here?

Talked with some CoreCLR JIT folks. Apparently they always generate essentially a call to a helper and pass it the class token:

call CORINFO_HELP_NEWSFAST(TypeHandlePtr) 

So for reference types they might not have to do any class intialization (or at least instance size calculation) at JIT time.

To do the same on Mono we would need some variant initialization path that does enough work to initialize the methods (or at least the .ctor being called) but not compute the field layout. I'm not sure how feasible that is. The initialization order is pretty fragile.

@uweigand
Copy link
Contributor Author

I've tried to work around the problem, following the initial assumption, by avoiding installing the xunit.runner.visualstudio package. But it turns out this doesn't work. First of all, the PackageReference on xunit.runner.visualstudio is generated automatically by the Arcade SDK, but even when hacking Arcade to remove that, the tests still do not work.

In fact, it turns out my initial assumption (that this is due to a conflict installing two copies of the same file originating from different packages) was actually wrong. Only xunit.runner.visualstudio installs those files into the application directory. The xunit.runner.console package doesn't install any files into the application directory, instead the console runner is being executed directly from the nuget package directory. It seems the intent has always been for the console runner to satisfy its dependencies solely due to the Resolving event handler described above.

It also turns out that with CoreCLR, the xunit.runner.utility.netcoreapp10.dll assembly as a dependency of xunit.runner.console is never used from the application directory, even if it has the matching version. That is because it is not mentioned in any deps.json file during execution of the console runner, and therefore the CoreCLR loader apparently never finds it via file system lookup at all. On the other hand, the Mono loader falls back to scanning the "Application Base" directory and will load the assembly from there -- that turns out to be the assembly actually installed by and intended for xunit.runner.visualstudio, but if the version matches, everything just happens to work.

It seems this behavior of the Mono loader (scanning the "Application Base" directory -- this may be a relic of Mono's -mostly removed- appdomain handling?) is itself an incompatibility with CoreCLR, and that actually causes a couple of failures in the host.tests suite where the test relies on an assembly not being loaded if it isn't mentioned in deps.json. It's just that while the versions happened to match, these two incompatibilities "canceled" each other and the test execution happened to work ...

Talked with some CoreCLR JIT folks. Apparently they always generate essentially a call to a helper and pass it the class token:

Looking in more detail at the Mono loader implementation of MONO_CEE_NEWOBJ, it seems that there is already at least one case where Mono also delegates the actual allocation to a helper routine (MONO_JIT_ICALL_mono_helper_newobj_mscorlib), but I don't understand that code enough to see whether this could be generalized similarly to what CoreCLR does. In particular, it's not immediately obvious to me which fields of the Mono class object may be accessed before the class was fully loaded, and if there is some way we'd be able to verify no such access happens if we were to change the MONO_CEE_NEWOBJ support to delay class loading.

@lambdageek
Copy link
Member

Looking in more detail at the Mono loader implementation of MONO_CEE_NEWOBJ, it seems that there is already at least one case where Mono also delegates the actual allocation to a helper routine (MONO_JIT_ICALL_mono_helper_newobj_mscorlib), but I don't understand that code enough to see whether this could be generalized similarly to what CoreCLR does. In particular, it's not immediately obvious to me which fields of the Mono class object may be accessed before the class was fully loaded, and if there is some way we'd be able to verify no such access happens if we were to change the MONO_CEE_NEWOBJ support to delay class loading.

I'm quite pessimistic that we can make it work. One thing we could try: instead of always calling mono_class_init_internal (cmethod->klass), check whether the class is already initialized. If it's initialized and it's not a corlib class, always delegate to a helper icall to do the allocation. I think it will have poor performance, but at least we can validate that delaying initialization would resolve the issue. (If we had a tiered JIT this might be reasonable in general - on the hot paths the class would already be initialized so we'd get better code on the second tier.)

But actually, looking at the Main method in the console runner, we would also have to make sure that the call EntryPoint instruction doesn't try to initialize the class, either:

      IL_0034:  ldloc.0
      IL_0035:  newobj     instance void Xunit.ConsoleClient.ConsoleRunner::.ctor(object)
      IL_003a:  ldarg.0
      IL_003b:  call       instance int32 Xunit.ConsoleClient.ConsoleRunner::EntryPoint(string[])
      IL_0040:  stloc.3
      IL_0041:  leave.s    IL_004d

On the other hand, the Mono loader falls back to scanning the "Application Base" directory and will load the assembly from there -- that turns out to be the assembly actually installed by and intended for xunit.runner.visualstudio, but if the version matches, everything just happens to work.

I think this is worth a separate issue.

@lambdageek
Copy link
Member

One thing we could try: instead of always calling mono_class_init_internal (cmethod->klass), check whether the class is already initialized. If it's initialized and it's not a corlib class, always delegate to a helper icall to do the allocation.

I had an idea of how we could refine this a little. We add some flags that the loader can set to tell the JIT "do something different with this class". And we define a single "something different" right now: always delay constructor calls to a JIT icall.

  1. Add a bit to MonoClass:special_jit_handling, and add a PROP_SPECIAL_JIT_FLAGS to InfrequentDataKind. The only flag will be CTOR_NEEDS_ICALL.
  2. In class-init.c, add an array of struct SpecialClassDef { const char *assembly_name; const char *name_space; const char *name; uint32_t flags; }. Initially the array just has {"xunit.console.runner", "Xunit.ConsoleClient", "ConsoleRunner", CTOR_NEEDS_ICALL}. Modify the class creation to set the special_jit_handling bit when it's loading that class.
  3. In method-to-ir, check the bit and emit a jit icall that just gets the token of the ctor and returns some MonoObject* instead of the normal allocation and construction code.

@uweigand I'm going to try to make a prototype with this approach. It should at least be enough to see if we need to do something about call EntryPoint, too.

@uweigand
Copy link
Contributor Author

@lambdageek thanks for working on this! I'm wondering if we can avoid the hard-coded list by just detecting the failure during allocation - always try the current approach first, and if it fails to initialize the class, then instead of throwing an exception switch to emitting the icall and trying again there.

I think this is worth a separate issue.

Agreed, I'll open one with more details from the host.tests failures I was seeing.

@lambdageek
Copy link
Member

lambdageek commented Oct 28, 2021

@lambdageek thanks for working on this! I'm wondering if we can avoid the hard-coded list by just detecting the failure during allocation - always try the current approach first, and if it fails to initialize the class, then instead of throwing an exception switch to emitting the icall and trying again there.

I don't think that will work. When we call mono_class_init_internal for the first time, when it fails to initialize the class, it will set MonoClass:has_failure which pushes a lot of runtime code onto error paths. We don't have a recovery mechanism - once a class has an initialization failure, we don't have a way to unset the bit.

Thinking about the right long term fix might be one of:

  • I don't think we would be able to use our IR newobj (which needs class initialization) by somehow delaying loading the field types (which we need to compute the instance size and GC descriptor). I don't think there's enough info on a typespec like .field private initonly [System.Collections.Concurrent]System.Collections.Concurrent.ConcurrentDictionary`2<string,[xunit.runner.utility.netcoreapp10]Xunit.ExecutionSummary> completionMessages to tell that this is a reference field. Even if we could get the instance size and GC descriptor, we would still need to compute the runtime vtable which will need method signatures - which probably will trigger more class loading.
  • So instead we need some way of speculatively initializing a class at JIT-time and falling back to an icall if it doesn't work. We actually already have a similar issue when initialization of a class triggers execution of managed code which may deadlock (see [mono][loader] Calling assembly resolve while holding the loader lock can lead to deadlocks #51864). So it might be that we can fix both issues with this approach. But it's going to be pretty invasive - we will need to watch for MonoClass:has_unloaded_dependencies on a lot of initialization paths. It won't be a simple fix.

@lambdageek
Copy link
Member

@uweigand Could you try #60987 and let me know if it works.

@uweigand
Copy link
Contributor Author

@uweigand Could you try #60987 and let me know if it works.

Thanks. Unfortunately, it still doesn't work - it looks like you were right about the call EntryPoint. We now get the type load failure a bit later, during MONO_CEE_CALL:

#0  mono_field_resolve_type (field=0x2aa002348d0, error=0x3ffffff9f00) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class.c:6430
#1  0x000003fffd0dbb38 in mono_class_setup_fields (klass=0x2aa00275b00) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class-init.c:322
#2  0x000003fffd0e65be in init_sizes_with_info (klass=0x2aa00275b00, cached_info=0x0) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class-init.c:1790
#3  0x000003fffd0dc2e4 in mono_class_init_internal (klass=0x2aa00275b00) at /home/uweigand/runtime-net6/src/mono/mono/metadata/class-init.c:2840
#4  0x000003fffd3753ee in mono_method_to_ir (cfg=0x2aa00235700, method=0x2aa0007d108, start_bblock=0x2aa00236310, end_bblock=0x2aa00236450, return_var=0x0, inline_args=0x0, 
    inline_offset=0, is_virtual_call=0) at /home/uweigand/runtime-net6/src/mono/mono/mini/method-to-ir.c:7270
#5  0x000003fffd338ef8 in mini_method_compile (method=0x2aa0007d108, opts=374417919, flags=JIT_FLAG_RUN_CCTORS, parts=0, aot_method_index=-1)
    at /home/uweigand/runtime-net6/src/mono/mono/mini/mini.c:3460

Line 7270 in the patched sources corresponds to this:
https://github.com/dotnet/runtime/blob/release/6.0/src/mono/mono/mini/method-to-ir.c#L7249
cmethod at this point is Xunit.ConsoleClient.ConsoleRunner:EntryPoint.

Also, I realized that even if it did work, it still wouldn't help to just apply the patch to the runtime sources - the xunit tests are run using the host SDK, so we'd have to get that fix in there as well.

Maybe it would be easier after all to put a workaround for this particular issue in xunit, i.e. replace this:

            using (AssemblyHelper.SubscribeResolveForAssembly(typeof(Program), internalDiagnosticsMessageSink))
                return new ConsoleRunner(consoleLock).EntryPoint(args);

with something along the lines of:

            using (AssemblyHelper.SubscribeResolveForAssembly(typeof(Program), internalDiagnosticsMessageSink))
                return CallEntryPointHelper(args);

and then do both the allocation and the actual call in the helper.

(Unfortunately I wasn't able to test this so far - it doesn't seem to be (easily?) possible to even build xunit v2 on Linux at all ...)

lambdageek added a commit to lambdageek/xunit that referenced this issue Nov 1, 2021
On platforms where Mono is the dotnet SDK runtime (Linux s390x, currently),
work around a limitation of the Mono JIT by moving the instantiation of
ConsoleRunner to a separate function from the call to
SubscribeResolveForAssembly.

Unlike CoreCLR, when the Mono JIT is compiling Main and sees a call to new
ConsoleRunner, it needs to load the types of the fields of ConsoleRunner.  Some
of those fields come from assemblies that can only be found using the resolve
helper.  But the helper is not installed until the JIT has finished compiling
Main and starts executing it.  This is a fundamental limitation of the Mono JIT
and is unlikely to be fixed in the short term.

The workaround is to move the code that depends on the assembly resolve to a
separate non-inlinable function.  As a result, Mono with JIT and invoke Main
without trying to load and initialize ConsoleRunner until after the assembly resolve
event handler is installed.

Fixes dotnet/runtime#60550
@lambdageek
Copy link
Member

lambdageek commented Nov 1, 2021

@uweigand I updated #60987 to also address the ApplicationBase loader issue, and it immediately reproduced the xunit issue 😁 . I added some code for the call instruction to also avoid class initialization. (And also in the interpreter).

I think this might be less of a hack than I initially thought - in particular I wonder if this might have a benefit for app startup. I wonder if it's possible to more broadly avoid class initialization until an instance is created . At least in cases where nothing difficult is going on (ie non-generic reference types without static constructors)

@lambdageek
Copy link
Member

@uweigand I updated #60987 to also address the ApplicationBase loader issue, and it immediately reproduced the xunit issue 😁 . I added some code for the call instruction to also avoid class initialization. (And also in the interpreter).

And removing ApplicationBase broke some tests - for reasons I don't understand yet - as we observed it's not clear that coreclr ever probes ApplicationBase and yet it seems necessary in Mono. I checked the other properties that the host might pass to the runtime (APP_PATHS and PLATFORM_RESOURCE_ROOTS) and they are either unset or empty.

Questions/Next steps:

  1. I think we may need to go back to upstreaming a fix to xunit. It would be great if we had a self-contained repro that we could show to the xunit maintainers. I'm not sure what is the best way to construct such a thing.

  2. It looks like for some tests we use a fork of the console runner https://github.com/dotnet/arcade/tree/main/src/Microsoft.DotNet.XUnitConsoleRunner but not, apparently, the host tests? Do you know how the installer tests end up invoking the upstream console runner, @uweigand ? I'm having trouble locating the responsible code.

@uweigand
Copy link
Contributor Author

uweigand commented Nov 5, 2021

Do you know how the installer tests end up invoking the upstream console runner, @uweigand ? I'm having trouble locating the responsible code.

I don't think the installer tests do anything special, they just use the "normal" arcade logic for test projects, which always uses the upstream console runner.

For the library tests, in contrast, we have src/libraries/Directory.Build.props do:

  <Import Project="$(RepositoryEngineeringDir)testing\tests.props" Condition="'$(EnableTestSupport)' == 'true'" />

and that file eng/testing/tests.props has logic that appears to be responsible for overriding arcade and doing something special. This file is not used by the installer tests.

@steveisok
Copy link
Member

steveisok commented Nov 5, 2021

I really wonder if we need a package reference to xunit.runner.visualstudio at all for 390x? @uweigand have you tried to exclude it?

@uweigand
Copy link
Contributor Author

uweigand commented Nov 7, 2021

I really wonder if we need a package reference to xunit.runner.visualstudio at all for 390x? @uweigand have you tried to exclude it?

Unfortunately that doesn't help, see #60550 (comment)

lambdageek added a commit to lambdageek/runtime that referenced this issue Nov 17, 2021
Add a rarely used MonoClass property for "special jit
flags" (`m_class_has_special_jit_flags` and `mono_class_get_special_jit_flags`)
that is set at class loading time on some classes known to the runtime.

Add an `enum MonoSpecialJitClassFlags` with the various values.  Currently
there's a single value `MONO_SPECIAL_JIT_USE_ICALL_NEWOBJ`.

When loading `Xunit.ConsoleClient.ConsoleRunner` from `xunit.console.dll`, set
the `USE_ICALL_NEWOBJ` flag.

The JIT handling of this flag is not yet implemented.

The `USE_ICALL_NEWOBJ` should tell the JIT to avoid initializing the class at
JIT-time, and to delegte new instance creation to an icall that will initialize
the class, allocate the memory and invoke the constructor.

This is part of a workaround for dotnet#60550
lambdageek added a commit to lambdageek/runtime that referenced this issue Nov 17, 2021
Avoid calling `mono_class_init_internal` for classes marked with
`MONO_SPECIAL_JIT_USE_ICALL_NEWOBJ` until runtime, rather than JIT-time.

Related to dotnet#60550
@steveisok steveisok removed the untriaged New issue has not been triaged by the area owner label Jul 5, 2022
@steveisok steveisok added this to the Future milestone Jul 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-s390x Related to s390x architecture (unsupported) area-Build-mono
Projects
None yet
Development

No branches or pull requests

4 participants