Failed to create CoreCLR, HRESULT: 0x8007001F on MacOSX #10870

Closed
nbkolchin opened this issue Mar 15, 2020 · 21 comments

Comments

@nbkolchin

nbkolchin commented Mar 15, 2020

related: #10737

MacOSX: 10.15.3, dotnet installed via brew cask.

Any dotnet cli command in MacOSX fails after starting Windows inside Parallels Desktop. After stopping Parallels, dotnet commands run normally.

bash-3.2$ dotnet --info
Failed to create CoreCLR, HRESULT: 0x8007001F

Host (useful for support):
  Version: 3.1.2
  Commit:  916b5cba26

.NET Core SDKs installed:
  3.1.102 [/usr/local/share/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.App 3.1.2 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 3.1.2 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs:
  https://aka.ms/dotnet-download
bash-3.2$ dotnet fsi
Failed to create CoreCLR, HRESULT: 0x8007001F
bash-3.2$ dotnet run
Failed to create CoreCLR, HRESULT: 0x8007001F

Gist with COREHOST_TRACE: https://gist.github.com/nbkolchin/667e96211531eb277bdb320fcc23fc90

@am11
Member

am11 commented Mar 16, 2020

Decoding this error:

**********************************************************************
** Visual Studio 2019 Developer Command Prompt v16.4.2
** Copyright (c) 2019 Microsoft Corporation
**********************************************************************

C:\Program Files (x86)\Microsoft Visual Studio\2019\Community>certutil -error 0x8007001F
0x8007001f (WIN32: 31 ERROR_GEN_FAILURE) -- 2147942431 (-2147024865)
Error message text: A device attached to the system is not functioning.
CertUtil: -error command completed successfully.

reveals that it is ERROR_GEN_FAILURE, and it might be related to CoreCLR not being able to allocate enough memory when Parallels is running on the system. Here is a similar discussion about FreeBSD getting the same error when VirtualBox is running on the host: dotnet/runtime#6353 (comment).
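
For readers without certutil at hand, the same decoding can be sketched by hand. This is a minimal illustration that assumes the standard HRESULT bit layout (severity bit, 11-bit facility, 16-bit code); the helper name `decode_hresult` is made up for this example and is not part of any .NET tooling.

```python
def decode_hresult(value):
    """Split a 32-bit HRESULT into its standard bit fields."""
    value &= 0xFFFFFFFF
    return {
        "failure": bool(value >> 31),        # severity bit: 1 means failure
        "facility": (value >> 16) & 0x7FF,   # 11-bit facility; 7 is FACILITY_WIN32
        "code": value & 0xFFFF,              # low 16 bits: the wrapped Win32 error code
        "unsigned": value,                   # value as an unsigned 32-bit integer
        "signed": value - (1 << 32) if value >> 31 else value,
    }


fields = decode_hresult(0x8007001F)
print(fields)
```

The facility comes out as 7 (Win32) and the code as 31, which is the ERROR_GEN_FAILURE value shown in the certutil output above.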

cc @janvorli

@nbkolchin
Author

I closed the Linux VM and dotnet runs normally. However, this is a bit strange and doesn't look like a memory problem according to vm_stat:

bash-3.2$ vm_stat
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                               15427.
Pages active:                            170566.
Pages inactive:                          168086.
Pages speculative:                         1870.
Pages throttled:                              0.
Pages wired down:                       3307031.
Pages purgeable:                            164.
"Translation faults":                2652199561.
Pages copy-on-write:                  214559536.
Pages zero filled:                   1575362268.
Pages reactivated:                     98113039.
Pages purged:                          55526875.
File-backed pages:                       124505.
Anonymous pages:                         216017.
Pages stored in compressor:             2147461.
Pages occupied by compressor:            530806.
Decompressions:                       104749354.
Compressions:                         144644094.
Pageins:                              170120013.
Pageouts:                               3434494.
Swapins:                               13571981.
Swapouts:                              14414055.
bash-3.2$ dotnet fsi
Failed to create CoreCLR, HRESULT: 0x8007001F

------

bash-3.2$ vm_stat
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                                7033.
Pages active:                            647789.
Pages inactive:                          621903.
Pages speculative:                        24851.
Pages throttled:                              0.
Pages wired down:                       2229157.
Pages purgeable:                          35747.
"Translation faults":                2663584840.
Pages copy-on-write:                  214862336.
Pages zero filled:                   1582131837.
Pages reactivated:                     99356682.
Pages purged:                          55918274.
File-backed pages:                       405872.
Anonymous pages:                         888671.
Pages stored in compressor:             2101202.
Pages occupied by compressor:            663028.
Decompressions:                       106543683.
Compressions:                         146861763.
Pageins:                              171531754.
Pageouts:                               3465199.
Swapins:                               13873146.
Swapouts:                              14637865.
bash-3.2$ dotnet --version
3.1.102

In the second sample the system has less free memory, but dotnet runs normally...
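
To compare two vm_stat dumps like the ones above without reading them line by line, something along these lines may help. It is only a sketch: `parse_vm_stat` is a hypothetical helper, and the two sample strings are trimmed excerpts of the dumps above, not the full output.

```python
import re


def parse_vm_stat(text):
    """Parse vm_stat output into {counter name: page count}."""
    stats = {}
    for line in text.splitlines():
        # Counter lines look like 'Pages free:      15427.'; some names are quoted.
        match = re.match(r'^"?([^":]+)"?:\s+(\d+)\.$', line.strip())
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


failing = parse_vm_stat("""Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                               15427.
Pages wired down:                       3307031.
""")
working = parse_vm_stat("""Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                                7033.
Pages wired down:                       2229157.
""")

# Print how each counter differs between the failing and the working sample.
for name in failing:
    print(name, failing[name] - working[name])
```

In these excerpts the counter that moves most between the two runs is "Pages wired down", not "Pages free".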

P.S. I'm a bit curious why Blender, Libreoffice, Android Studio and R-Studio are able to start at the same time, but dotnet --version fails...

@janvorli
Member

ERROR_GEN_FAILURE can mean a lot of things. It means that something non-specific has failed during CoreCLR PAL initialization; it doesn't have to be related to memory at all.
@nbkolchin would you be willing to try an instrumented version of libcoreclr.dylib that I would create for you and that would print the exact failure location to the console?

@nbkolchin
Author

Yes. How can I get the "instrumented version"?

P.S. I'm not able to reproduce the problem after a system reboot, but I will try over the holidays.

@janvorli
Member

How I can get "instrumented version"?

I would create one for you and share it via e.g. OneDrive. But let's first wait to see if you can repro the issue again.

@nbkolchin
Author

It is hard to reproduce. But...

bash-3.2$ dotnet --version
3.1.200
bash-3.2$ dotnet --version
Failed to create CoreCLR, HRESULT: 0x8007001F

Notice the updated dotnet version.

@janvorli
Member

@nbkolchin what did the updated version change? Is it less frequent than with the 3.1.102 you reported a week ago?

@nbkolchin
Author

Nothing changed. The problem is hard to reproduce but exists in all tested versions.

@nbkolchin
Author

The problem is actually more serious than I thought. Published .NET applications also don't work.

bash-3.2$ ./app
Failed to create CoreCLR, HRESULT: 0x8007001F

This makes .NET not a viable solution for cross-platform development, and we have bet a lot on it...

@janvorli
Member

It is not a surprise, as the issue seems to be in the coreclr initialization and that's shared between the dotnet tool execution and any app execution. So it seems you can repro it quite well. Then it seems it would make sense to try to figure out the exact reason for the failure using an instrumented libcoreclr.dylib.
Let me prepare one and share it with you.

@janvorli
Member

@nbkolchin you can get a modified libcoreclr.dylib here: https://1drv.ms/u/s!AkLV4wRkyHYhyRiRIChDXxs8RvQJ?e=NnoGZ0
After downloading it, you need to ungzip it as follows:

gunzip libcoreclr.dylib.gz

Then the easiest way to try it is to use the published version of your app and replace the libcoreclr.dylib in its directory with the one you've just downloaded and ungzipped.

When you run the application, it will still fail, but it should display a different error code that corresponds to the step in the runtime initialization that has failed, instead of just this generic failure code. Please share the error code (HRESULT) with me and I'll look it up.

@nbkolchin
Author

bash-3.2$ ./app
Failed to create CoreCLR, HRESULT: 0x8007FF02

@janvorli
Member

Hmm, that's interesting. So this is the function that fails:
https://github.com/dotnet/runtime/blob/26eb70b5720fdf925d8ccf47007bcaeaafef321e/src/coreclr/src/pal/src/thread/process.cpp#L3484-L3531

There are three possible failure points there:

  • mmap failing to map a single page - it seems very unlikely that the system would not have a single free memory page for dotnet
  • mlock failing to lock the single page - there is a limit to how much memory a process can lock. It can be controlled e.g. by the ulimit shell command.
  • pthread_mutex_init failing to initialize a mutex. I also don't see how that could happen.

So I wonder if it is the mlock failing. Can you please share the output of ulimit -a when starting dotnet is failing?

I will also share one more instrumented version of libcoreclr.dylib with you that will report clearly which of those three failures happened and what was the exact error.
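
As a rough way to check the first two steps outside the runtime, a probe along the following lines could be used. This is a sketch under assumptions, not the code from process.cpp: it maps one anonymous page, tries to mlock it through libc via ctypes, and reports the errno. The function name `try_mlock_one_page` is hypothetical, and the pthread_mutex_init step is not covered.

```python
import ctypes
import mmap
import os


def try_mlock_one_page():
    """Map one anonymous page and try to lock it; return (ok, errno)."""
    libc = ctypes.CDLL(None, use_errno=True)  # libc of the current process (Unix only)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int
    libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.munlock.restype = ctypes.c_int

    size = mmap.PAGESIZE
    region = mmap.mmap(-1, size)               # anonymous mapping of a single page
    view = ctypes.c_char.from_buffer(region)   # only used to obtain the page address
    try:
        addr = ctypes.addressof(view)
        if libc.mlock(addr, size) != 0:
            return False, ctypes.get_errno()
        libc.munlock(addr, size)
        return True, 0
    finally:
        del view                               # release the buffer export before closing
        region.close()


ok, err = try_mlock_one_page()
print("mlock ok" if ok else "mlock failed: " + os.strerror(err))
```

If mlock is the failing step, the printed errno text should narrow down whether a per-process limit or something system-wide is involved.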

@nbkolchin
Author

bash-3.2$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2784
virtual memory          (kbytes, -v) unlimited
bash-3.2$ ./app
Failed to create CoreCLR, HRESULT: 0x8007FF02
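
The "max locked memory" row above is the per-process limit. The same value can be read programmatically, which may be handy when the failing process is not started from an interactive shell. A small sketch using Python's standard `resource` module; the helper name `memlock_limit` is made up for this example.

```python
import resource


def memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK values as readable strings."""
    def fmt(value):
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return str(value // 1024) + " kbytes"

    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return fmt(soft), fmt(hard)


soft, hard = memlock_limit()
print("max locked memory: soft=" + soft + ", hard=" + hard)
```

With the limit reported as unlimited, as in the ulimit output above, a per-process cap would not explain a failing mlock.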

@nbkolchin
Author

nbkolchin commented Mar 24, 2020

I've tracked the problem to the mlock() call. See https://gist.github.com/nbkolchin/067fb7adc45f2a084915a1ec17ed5e61

bash-3.2$ g++ test.cpp
bash-3.2$ ./a.out
mlock failed: Resource temporarily unavailable (4096)
FAILURE

But

bash-3.2$ vm_stat
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                               74806.
Pages active:                            184860.
Pages inactive:                          184282.
Pages speculative:                          238.
Pages throttled:                              0.
Pages wired down:                       3440295.
Pages purgeable:                          12373.
"Translation faults":                4935133219.
Pages copy-on-write:                  731013643.
Pages zero filled:                   2776350982.
Pages reactivated:                    232121984.
Pages purged:                          94058094.
File-backed pages:                       108974.
Anonymous pages:                         260406.
Pages stored in compressor:             3071434.
Pages occupied by compressor:            309349.
Decompressions:                       247162531.
Compressions:                         331500515.
Pageins:                              328983977.
Pageouts:                               5876590.
Swapins:                               46780653.
Swapouts:                              49548215.

@janvorli
Member

It seems that Parallels consumes all the wired memory allowed. Wired memory is memory that's locked (not allowed to be swapped to disk). It seems that the system-wide limit is set to 80% of the physical memory, and it is this limit that causes mlock to fail with EAGAIN.

Looking at the vm_stat results that you reported before, I can see that without Parallels the "Pages wired down" is 2229157, and with Parallels it is 3307031.
What is the amount of total physical memory on your Mac? Is it 16GB?

@nbkolchin
Author

Yes 16GB.

I understand that this is weird MacOSX behaviour. Forcing some pages to swap (i.e. running an application that eats some memory) makes mlock() work again without stopping Parallels or anything else. For example:

bash-3.2$ dotnet --version
Failed to create CoreCLR, HRESULT: 0x8007001F
bash-3.2$ ./eatram 1073741824
bash-3.2$ dotnet --version
3.1.200
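
The source of eatram is not shown in this thread. As an assumption about what such a tool might do, a minimal sketch is to allocate the requested number of bytes and write to every page so the memory is actually committed. The name `eat_ram` and the small demo size are placeholders; the session above used 1073741824 bytes (1 GiB).

```python
import mmap


def eat_ram(num_bytes):
    """Allocate num_bytes and touch one byte per page so the pages become resident."""
    buf = bytearray(num_bytes)
    for offset in range(0, num_bytes, mmap.PAGESIZE):
        buf[offset] = 1
    return buf


# Small demo size (16 MiB); pass 1073741824 to mirror the command above.
touched = eat_ram(16 * 1024 * 1024)
print(len(touched))
```

The memory is released when the buffer goes out of scope or the process exits.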

@janvorli
Member

Maybe there is something else running that consumes a lot of wired pages. Even without parallels, the 2229157 pages means that around 8.5GB of wired pages are consumed, which seems to be a lot and close to the limit that is approx 12.8GB on your device. On my Mac, the vm_stat reports only around 1.23GB of wired pages.
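
The arithmetic behind those numbers can be sketched as follows, assuming the 4096-byte page size printed by vm_stat, 16GB of physical memory, and the 80% limit guessed earlier in this thread (the limit is an estimate, not a documented value).

```python
PAGE_SIZE = 4096       # bytes, from the vm_stat header
GIB = 1024 ** 3


def pages_to_gib(pages):
    """Convert a vm_stat page count to GiB."""
    return pages * PAGE_SIZE / GIB


wired_without = pages_to_gib(2229157)   # about 8.50 GiB, Parallels stopped
wired_with = pages_to_gib(3307031)      # about 12.62 GiB, Parallels running
assumed_limit = 0.8 * 16                # 12.8, the guessed 80% of 16GB

print(round(wired_without, 2), round(wired_with, 2), assumed_limit)
print("headroom with Parallels:", round(assumed_limit - wired_with, 2))
```

With Parallels running, the wired total sits within a fraction of a GiB of the assumed limit.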

I've found the following script at https://apple.stackexchange.com/questions/349037/30gb-out-of-32gb-being-used-for-wired-memory. It uses kextstat, so it lists wired memory per loaded kernel extension rather than per process, sorted by size:

kextstat | awk 'NR==1{ printf "%10s %s\n", $5, $6; } NR!=1{ printf "%10d %s\n", $5, $6; }' | sort -n

@marcpopMSFT marcpopMSFT added the untriaged Request triage from a team member label Apr 6, 2020
@sfoslund
Member

sfoslund commented Apr 7, 2020

@janvorli does this issue belong in dotnet/runtime?

@sfoslund sfoslund removed the untriaged Request triage from a team member label Apr 7, 2020
@sfoslund sfoslund added this to the Discussion milestone Apr 7, 2020
@sfoslund sfoslund removed their assignment Apr 7, 2020
@janvorli
Member

janvorli commented Apr 9, 2020

@sfoslund I am sorry for missing your message. Yes, it belongs there.

@sfoslund
Member

sfoslund commented Apr 9, 2020

This issue was moved to dotnet/runtime#34793

@sfoslund sfoslund removed this from the Discussion milestone Apr 9, 2020
@sfoslund sfoslund closed this as completed Apr 9, 2020

5 participants