Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to initialize CoreCLR, HRESULT: 0x80004005 when deploying on sandstorm #2534

Open
gkhanna79 opened this issue Nov 12, 2016 · 42 comments
Assignees
Milestone

Comments

@gkhanna79
Copy link
Member

Moved from github.com/dotnet/cli/issues/4111 on behalf of @pgrm

Steps to reproduce

Expected behavior

At least it would be great if I could see the current dotnet version. Actually the yeoman generator default website should show up, but at least in the debug log, I should be able to see the dotnet version

Actual behavior

...** SANDSTORM SUPERVISOR: Starting up grain.
dotnet --version
Failed to initialize CoreCLR, HRESULT: 0x80004005

Environment data

dotnet --info output:

.NET Command Line Tools (1.0.0-preview2-003121)

Product Information:
 Version:            1.0.0-preview2-003121
 Commit SHA-1 hash:  1e9d529bc5

Runtime Environment:
 OS Name:     debian
 OS Version:  8
 OS Platform: Linux
 RID:         debian.8-x64

However, this dotnet --info was executed inside the .sandstorm/build.sh script where dotnet works just fine, and not inside .sandstorm/launcher.sh script where the problems show up.

I think the problem is, because sandstorm containers don't have access to /proc/self/exe which was introduced with https://github.com/dotnet/coreclr/issues/1774. I tried to fake that file as discussed in sandstorm-io/sandstorm#1582. I've set the link to point to /usr/local/bin/dotnet as well as /opt/dotnet/dotnet but in both cases I'm getting the same error.

Maybe I'm pointing it to the wrong file, or there is something more missing to make it work. Unfortunately the error message is hard to understand so some help would really be appreciated. Thx

@gkhanna79
Copy link
Member Author

CC @schellap

@ramarag
Copy link
Member

ramarag commented Mar 18, 2017

@pgrm can you please provide the trace generated when run with export COREHOST_TRACE=1.

@ramarag
Copy link
Member

ramarag commented Apr 21, 2017

@pgrm can you check if https://github.com/dotnet/coreclr/issues/9014#issuecomment-290677198 solves your issue

@gkhanna79
Copy link
Member Author

Closing the issue due to lack of details. Please reopen if the issue persists with the latest runtime.

@msftgits msftgits transferred this issue from dotnet/core-setup Jan 30, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 30, 2020
@ocdtrekkie
Copy link

I have gotten back to this issue on modern .NET Core 3.1, and would love some help troubleshooting. Creating a hack for /dev similar to as @ramarag mentioned did not help me.

Failed to create CoreCLR, HRESULT: 0x80004005

Host (useful for support):
  Version: 3.1.2
  Commit:  916b5cba26

.NET Core SDKs installed:
  3.1.200 [/usr/share/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.App 3.1.2 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 3.1.2 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs:
  https://aka.ms/dotnet-download
** HTTP-BRIDGE: App server exited with status code: 137

@ocdtrekkie
Copy link

@gkhanna79 Can we reopen this for troubleshooting? I've got a .NET developer who would be interested in being able to play with .NET Core inside a Sandstorm sandbox.

@ocdtrekkie
Copy link

This is the log with COREHOST_TRACE=1:
log.txt

@ocdtrekkie
Copy link

New test repo is here: https://github.com/ocdtrekkie/vagrant-spk-dotnet-test1

@janvorli
Copy link
Member

@ocdtrekkie neither @gkhanna79 nor @ramarag are on the .NET Core team anymore.
Would you be able to get me a strace output (ideally via a gist, as it is going to be very large)? That might show us what's failing during the initialization.

@janvorli janvorli reopened this Mar 23, 2020
@janvorli janvorli assigned janvorli and unassigned ramarag Mar 23, 2020
@janvorli janvorli modified the milestones: 2.1.0, 5.0 Mar 23, 2020
@ocdtrekkie
Copy link

How would be the best way to generate that for you? (I'm a bit if a beginner in the Linux space as much as I am new to .NET Core.)

I was wondering if they were still around as I did notice a sparse GitHub contribution history, and it's been a few years.

@janvorli
Copy link
Member

To get the strace log, you can execute it as follows:

strace -o strace.log dotnet --version

The log will get generated into the strace.log file.

@ocdtrekkie
Copy link

@janvorli
Copy link
Member

@ocdtrekkie I can see the code is looking for several proc file system files that do not exist exist based on the log close to the failure:

  • /proc/self/maps
  • /proc/mounts
  • /dev/shm/
  • /proc/{PID}/stat

Only the /proc/self/maps and /proc/{PID}/stat is something that our runtime explicitly tries to open and we fail if we don't find it.

@ocdtrekkie
Copy link

I basically have to figure out how to fake things in /proc, so let me see what I can manage there. I think we have an open issue to add /dev/shm too.

Thanks! This lets me know what I need to accomplish.

@zenhack
Copy link

zenhack commented Mar 24, 2020

Spoofing /proc/self/maps and /proc/{PID}/stat is going to be a lot harder than /proc/self/exe. It would be helpful to know what these are actually being used for; ideally we'd find a way for .NET to still work if /proc isn't available.

@janvorli
Copy link
Member

the /proc/{PID}/stat is used to get a start time of a process and the /proc/self/maps to get a list of shared libraries and executable in the current process.

@zenhack
Copy link

zenhack commented Mar 27, 2020

Does .NET care about the address ranges in /proc/self/maps? I'm wondering if it would be possible to generate the information statically, but if it does care about the addresses that won't work because of ASLR.

I think with /proc/{PID}/stat the only non-obvious part is working out which pid we need to supply.

Ideally .NET would just not have hard dependencies on stuff in proc though. Unfortunately just providing procfs in sandstorm isn't something I'm open to; its absence has protected us from LPE vulnerabilities in the past, and I don't think any other runtime has given us much trouble. We bind mount specifically /proc/cpuinfo, which many things want, but most apps are able to get by without much else that lives in /proc.

I wonder how much of .NET's use of /proc is actually necessary; the issue with /proc/self/exe suggests that at least some of these have easy alternatives (in that particular case the program name can be fetched from argv[0]). @janvorli, how open would you be to migrating .NET functionality that uses /proc to more portable alternatives?

@janvorli
Copy link
Member

@zenhack I am sorry, I've missed your last question and discovered it today. Wherever we can use non- /proc based approach, I'd be more than happy to use it. However, I am not aware of ones that we could easily replace.
But in .NET 5.0, we now primarily use getauxval(AT_EXECFN) to get the current exe and only fall back to /proc/self/exe if that doesn't work.

@janvorli janvorli modified the milestones: 5.0.0, Future Jul 15, 2020
@pfeigl
Copy link

pfeigl commented Aug 25, 2020

Getting rid of /proc/ would actually allow to run .NET self-contained executables to work within most of the shared hostings out there which provide SSH access.

Those are mostly running in some kind of chroot or similar reduced environments where access to /proc/ is not available.

@zenhack
Copy link

zenhack commented Sep 4, 2020

I tried updating @ocdtrekkie's scripts to use the latest .NET 5 preview:

https://github.com/ocdtrekkie/vagrant-spk-dotnet-test1/pull/1/files

It doesn't appear to have fixed the /proc/self/exe issue; if I remove the symlink
from those scripts I get:

Tracing enabled @ Fri Sep  4 21:35:54 2020 GMT
--- Invoked dotnet [version: 5.0.0-preview.8.20407.11, commit hash: bf456654f9a4f9a86c15d9d50095ff29cde5f0a4] main = {
dotnet
--info
}
realpath(/proc/self/exe) failed: Input/output error
Failed to resolve full path of the current executable [/proc/self/exe]

But if I keep the workaround for that, then I get the same error as before; full log:

log.txt

So the situation seems unchanged with .NET 5.


A couple thoughts:

  • Re: /proc/{PID}/stat, does the runtime use this to get the status of its own start time, or of other processes? If the former, reading /proc/self/stat would make it easier to work around.
  • I'm also curious to know what the information in these files is being used for -- what does the runtime actually do with the startup time and the list of shared libs?

@janvorli
Copy link
Member

janvorli commented Sep 4, 2020

We are using /proc/self/exe only if getauxval(AT_EXECFN) returns NULL. Is it possible that it happens in your environment?

Regarding the proc/{PID}/stat, we use it for other processes too for diagnostics interprocess communication. We use the time obtained from it for process id disambiguation (see GetProcessIdDisambiguationKey). The point is that a process id can get recycled during the lifetime of a .NET app.

@zenhack
Copy link

zenhack commented Sep 4, 2020

And /proc/self/maps?

(getauxval returning null seems plausible, and it's a secondary concern anyway).

@janvorli
Copy link
Member

janvorli commented Sep 4, 2020

/proc/{PID}/maps is used to get a list of currently loaded shared libraries for a given process for debugger interface purposes and to create a crash dump when .NET app crashes and creating crashdump is enabled. I don't see /proc/self/maps being used in our sources now.

@zenhack
Copy link

zenhack commented Sep 4, 2020

Hm, both of those things seem like they could be non-fatal errors; how invasive a change would it be to just disable the relevant functionality if /proc isn't available, rather that aborting? Not being able to launch a debugger doesn't seem like it should prevent an app from running at all.

@janvorli
Copy link
Member

janvorli commented Sep 5, 2020

The /proc/{PID}/maps is actually not opened at startup, only the /proc/{PID}/stat is. But that one can be disabled by setting env var COMPlus_EnableDiagnostics=0. This disables all the diagnostics stuff. Maybe that would actually get you going.

@zenhack
Copy link

zenhack commented Sep 5, 2020

No dice. Did an strace, it appears that it is trying to open /proc/self/maps on startup; here's the end of that log:

open("/proc/self/maps", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/proc/self/maps", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
futex(0x7fc9ad7341a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "Failed to create CoreCLR, HRESUL"..., 45Failed to create CoreCLR, HRESULT: 0x80004005) = 45
write(2, "\n", 1
)                       = 1
munmap(0x7fc9acf1c000, 2499424)         = 0
exit_group(-2147450743)                 = ?

Adding the full log as an attachment. (I think this is the same as what we observed with @ocdtrekkie's earlier attempt)

log-strace.txt

@janvorli
Copy link
Member

janvorli commented Sep 5, 2020

Ah, I have realized that I was not testing it with the dotnet host, but rather the corerun that is a simple host that predates dotnet and we use it for running coreclr tests. With that one, executing a dotnet app didn't touch anything in the /proc file system, according to my strace log.

Looking at the dotnet host, it really uses /proc/self/maps and /proc/self/exe (it doesn't try to use getauxval(AT_EXECFN)). We can easily fix the /proc/self/exe. The /proc/self/maps is used to find whether a library is already loaded or not if dlopen fails with just the library name. The comment says that it was found that on some systems, dlopen finds that a library was loaded only if given absolute path. I'm not sure how dlopen behaves on your system.

@zenhack
Copy link

zenhack commented Sep 5, 2020

The exact rules for resolution on linux are documented in the man page: https://linux.die.net/man/3/dlopen

It looks like it was previously failing to find libnuma and libttng-ust. Through a combination of adding a couple directories to LD_LIBRARY_PATH and installing some packages, I got it to find those, but it's still trying to open /proc/self/maps. Full strace log is attached, I scanned through it and it doesn't look like it's failing to find any of the libraries it looks for.

log-strace2.txt

@janvorli
Copy link
Member

janvorli commented Sep 7, 2020

@zenhack so it turned out the code that opens /self/proc/maps in the dotnet host that I was referring to is used only when a 3rd party library is trying to use .NET hosting API, so it is definitely not your case.
So I've ran strace with -k option to see where the call to open that file comes from and surprisingly, it was not from our code. Our code just calls pthread_getattr_np and that function from libpthread internally opens /self/proc/maps:

open("/proc/self/maps", O_RDONLY|O_CLOEXEC) = 10
 > /lib/x86_64-linux-gnu/libc-2.23.so(_IO_file_open+0x1f) [0x79a5f]
 > /lib/x86_64-linux-gnu/libc-2.23.so(_IO_file_fopen+0x205) [0x79d35]
 > /lib/x86_64-linux-gnu/libc-2.23.so(fgets+0x264) [0x6dd34]
 > /lib/x86_64-linux-gnu/libpthread-2.23.so(pthread_getattr_np+0x17a) [0x93ba]
 > /home/janvorli/dotnet50latest/shared/Microsoft.NETCore.App/5.0.0-rc.1.20430.6/libcoreclr.so() [0x4e8cc9]
 > /home/janvorli/dotnet50latest/shared/Microsoft.NETCore.App/5.0.0-rc.1.20430.6/libcoreclr.so(_ZN6Thread10InitThreadEi+0x12c) [0x17a91c]
 > /home/janvorli/dotnet50latest/shared/Microsoft.NETCore.App/5.0.0-rc.1.20430.6/libcoreclr.so(_Z11SetupThreadi+0x18b) [0x17997b]
 > /home/janvorli/dotnet50latest/shared/Microsoft.NETCore.App/5.0.0-rc.1.20430.6/libcoreclr.so() [0x4a37f9]
 > /home/janvorli/dotnet50latest/shared/Microsoft.NETCore.App/5.0.0-rc.1.20430.6/libcoreclr.so() [0x4a2cd9]
 > /home/janvorli/dotnet50latest/shared/Microsoft.NETCore.App/5.0.0-rc.1.20430.6/libcoreclr.so() [0x4a2b3d]
 > /home/janvorli/dotnet50latest/shared/Microsoft.NETCore.App/5.0.0-rc.1.20430.6/libcoreclr.so(_ZN8CorHost25StartEv+0x6e) [0xca33e]
 > /home/janvorli/dotnet50latest/shared/Microsoft.NETCore.App/5.0.0-rc.1.20430.6/libcoreclr.so(coreclr_initialize+0x3c0) [0x74d10]
 > /home/janvorli/dotnet50latest/shared/Microsoft.NETCore.App/5.0.0-rc.1.20430.6/libhostpolicy.so(_ZN9coreclr_t6createERKSsPKcS3_RK22coreclr_property_bag_tRSt10unique_ptrIS_St14default_deleteIS_EE+0x320) [0x8ee0]
 > /home/janvorli/dotnet50latest/shared/Microsoft.NETCore.App/5.0.0-rc.1.20430.6/libhostpolicy.so(_ZN12_GLOBAL__N_114create_coreclrEv+0x135) [0x18b15]
 > /home/janvorli/dotnet50latest/shared/Microsoft.NETCore.App/5.0.0-rc.1.20430.6/libhostpolicy.so(corehost_main+0xe1) [0x18531]
 > /home/janvorli/dotnet50latest/host/fxr/5.0.0-rc.1.20430.6/libhostfxr.so(_ZN10fx_muxer_t24handle_exec_host_commandERKSsRK19host_startup_info_tS1_RKSt13unordered_mapI13known_optionsSt6vectorISsSaISsEE18known_options_hashSt8equal_toIS6_ESaISt4pairIKS6_S9_EEEiPPKci11host_mode_tPciPi+0x6d2) [0x1d422]
 > /home/janvorli/dotnet50latest/host/fxr/5.0.0-rc.1.20430.6/libhostfxr.so(_ZN10fx_muxer_t10handle_cliERK19host_startup_info_tiPPKcRKSs+0x6d5) [0x1c6f5]
 > /home/janvorli/dotnet50latest/host/fxr/5.0.0-rc.1.20430.6/libhostfxr.so(_ZN10fx_muxer_t7executeESsiPPKcRK19host_startup_info_tPciPi+0x260) [0x1bb40]
 > /home/janvorli/dotnet50latest/host/fxr/5.0.0-rc.1.20430.6/libhostfxr.so(hostfxr_main_startupinfo+0x94) [0x18434]
 > /home/janvorli/dotnet50latest/dotnet(_Z9exe_startiPPKc+0x3aa) [0x109aa]
 > /home/janvorli/dotnet50latest/dotnet(main+0x90) [0x10e90]
 > /lib/x86_64-linux-gnu/libc-2.23.so(__libc_start_main+0xf0) [0x20830]
 > /home/janvorli/dotnet50latest/dotnet(_start+0x29) [0x4eca]

@zenhack
Copy link

zenhack commented Sep 7, 2020

The man page suggests that this can happen if you call it on the main thread:

In addition, if thread refers to the main thread, then pthread_getattr_np() can fail because  of
errors  from  various  underlying calls: fopen(3), if /proc/self/maps can't be opened; and getr‐
limit(2), if the RLIMIT_STACK resource limit is not supported.

Peeking at the implementation confirms this:

https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_getattr_np.c;h=ce437205e41dc05653e435f6188768cccdd91c99;hb=HEAD#l76

So I guess that gets us back to: what is this information needed for, and can we work around it?

@zenhack
Copy link

zenhack commented Sep 7, 2020

(I had a look at the musl libc implementation and it looks like it doesn't have this problem -- so maybe I'll try it out on alpine)

@janvorli
Copy link
Member

janvorli commented Sep 7, 2020

We use the pthread_getattr_np to get stack ranges for each thread. This is used for several purposes:

  • Sanity checks that specific addresses are on stack, but these happen only in debug builds of the runtime, not the release ones.
  • Verifying that there is a sufficient amount of stack left for executing some runtime functions. We use that at various places, one place for example determines whether it can use stackalloc or whether it needs to fall back to malloc. Another place is in Microsoft.Extensions.DependencyInjection.
  • Internal diagnostics purposes
  • Debugger interface purposes

@zenhack
Copy link

zenhack commented Sep 7, 2020

I tried getting this working with the alpine/musl build (using docker-spk, since vagrant-spk is tied to debian) and can confirm that we don't hit the /proc/self/maps issue with the musl build. However, it appears musl's implementation of realpath() works by opening the path you pass to it, and then calling readlink(/proc/self/fd/), which obviously doesn't work without /proc mounted. So it trips up doing this for /proc/self/exe even with the symlink workaround, since /proc/self/fd isn't there either -- and we can't just add it, since if the fd gets reused (and it does) future results will be wrong.

I asked in #musl on freenode, and I guess realpath() is difficult to implement without race conditions if you want to avoid /proc. But it works on glibc.

Avoiding /proc/self/exe is a no-brainer. Beyond that, I can see a couple possible ways to proceed:

  1. Avoid pthread_getattr_np on the main thread (it looks like other threads should be ok).
  2. Try to convince the glibc devs to get pthread_getattr_np working on the main thread without /proc.

I don't know how hard (1) would be. you can get the total size of the stack with getrlimit(RSTACK_SIZE), but that doesn't help without knowing the stack's base address, and I'm not sure how to get that.

@zenhack
Copy link

zenhack commented Sep 8, 2020

A small update: the musl maintainer expressed interest in getting realpath() working without /proc, so that's another possible way forward.

@zenhack
Copy link

zenhack commented Sep 8, 2020

Ok, I swapped in a working realpath() implementation and got a bit further, but it's getting a NullReferenceException on startup. s
trace log attached.

log.txt

@janvorli
Copy link
Member

janvorli commented Sep 8, 2020

Unfortunately this NullReferenceException is in an area I know nothing about. Dumping the exception here from the log for better visibility for others:

System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.DotNet.Cli.Utils.Muxer..ctor()
   at Microsoft.DotNet.Cli.Utils.MSBuildForwardingAppWithoutLogging..ctor(IEnumerable`1 argsToForward, String msbuildPath)
   at Microsoft.DotNet.Tools.MSBuild.MSBuildForwardingApp..ctor(IEnumerable`1 argsToForward, String msbuildPath)
   at Microsoft.DotNet.Tools.Run.RunCommand.EnsureProjectIsBuilt()
   at Microsoft.DotNet.Tools.Run.RunCommand.Execute()
   at Microsoft.DotNet.Tools.Run.RunCommand.Run(String[] args)
   at Microsoft.DotNet.Cli.Program.ProcessArgs(String[] args, ITelemetry telemetryClient)
   at Microsoft.DotNet.Cli.Program.Main(String[] args)

However, I've just found a github issue in the sdk repo that mentions exactly the same exception:
dotnet/sdk#4165 and there is a reference to a comment in another issue that fixed the problem for the reporter of the issue.

@jkotas
Copy link
Member

jkotas commented Sep 8, 2020

It is likely that Process.GetCurrentProcess().MainModule.FileName at https://github.com/dotnet/sdk/blob/b1223209644d900702287faea8e9b71f95ec49f8/src/Cli/Microsoft.DotNet.Cli.Utils/Muxer.cs#L39 uses proc file system.

It is an example of situation where #40862 would help.

@zenhack
Copy link

zenhack commented Sep 8, 2020

Hm, it looks like the issue in that case was that libproc was not being found, but that's a MacOS thing with no equivalent on Linux, so an analogous solution seems unlikely. Per the issue description, the function that is throwing the exception is:

        public Muxer()
        {
            _muxerPath = Process.GetCurrentProcess().MainModule.FileName;
        }

GetCurrentProcess() is a one-liner which is a direct call to new Process(...), so I don't think that can return null (I don't actually know C#, so am just working off of the hopefully-not-too-wrong assumption that most stuff works like Java...). I don't see anything called MainModule in Process.cs though.

It seems reasonable to believe that this is due to procfs being absent, though I'm not exactly sure what part of procfs it's trying to access.

@jkotas
Copy link
Member

jkotas commented Sep 8, 2020

Process.GetCurrentProcess().MainModule.FileName will end up calling https://github.com/dotnet/runtime/blob/master/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/ProcessManager.Linux.cs that has heavy dependency on procfs.

@zenhack
Copy link

zenhack commented Sep 8, 2020

So I realized I'm probably making this harder than I need to by doing dotnet run in the sandbox; instead I built a self-contained publish and ran that in the sandbox, and that actually seems to work, modulo the /proc/self/exe hackery and swapping out the realpath() implementation. So hopefully at some point alpine will ship a version of musl with the the realpath() fix, and if you guys can fix the /proc/self/exe thing, I think we're good to go...

@janvorli
Copy link
Member

janvorli commented Sep 8, 2020

I was just about to suggest trying self contained publish, so I am glad it worked.

@janvorli
Copy link
Member

janvorli commented Nov 7, 2022

I was going to close this issue, but I have found that /proc/self/exe is also explicitly used by libunwind to get the current executable path. The dl_iterate_phdr returns empty string in the dlpi_name and the libunwind needs to locate the executable then in order to read exception handling data from it.
So we could update libunwind to try to use the getauxval(AT_EXECFN) and only fall back to the /proc/self/exe if it didn't work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants