GC deadlock under Linux (works in Windows) #47700
Tagging subscribers to this area: @dotnet/gc
Issue Details
Description
We're working on moving our application from Windows to Linux and we're consistently running into what seems to be a deadlock in GC.
I see a thread on the Thread Pool trying to get more space; it enters GC and then gets stuck.
Meanwhile the Finalizer thread and all GC threads are stuck in a similar manner.
This happens pretty consistently for us.
Configuration
Regression?
This was broken on .NET Core 3.1. We wanted to see if it got fixed in .NET 5, but it doesn't look like there's any difference.
Other information
I captured full stack traces and dumps. I understand they're needed to diagnose the issue, and I can send all of that privately.
My guess is that the GC is trying to suspend all threads for marking (all GC threads are in the mark phase), but some thread is waiting on something else and can't be suspended. Not sure.
So I've found that it is a glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=25847. The reporter of that issue analyzed the problem very diligently and proposed a fix to glibc. They hit it in a C# application too, and other people have reported hitting it in the Occam runtime and in Python. The proposed fix resolved it for all of them. So I believe that if you can use a distro with an older glibc (the bug was introduced in glibc 2.27) that has a different implementation of condition variables (Ubuntu 16.04, CentOS 7, Debian 9, …), this problem should not occur.
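To check quickly whether a given machine or container falls in the range described above, something like the following Python sketch could be used. It assumes the 2.27 threshold quoted in this thread is correct, and it only compares version numbers, so it cannot tell whether a distro has backported a fix. The helper names `parse_glibc_version` and `may_be_affected` are made up for this example.

```python
import platform
import re

# Assumption: first affected release as stated in this thread (not re-verified).
FIRST_AFFECTED = (2, 27)


def parse_glibc_version(text):
    """Extract (major, minor) from a string such as 'glibc 2.28' or '2.28'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def may_be_affected(text):
    """True if the version is at or above the threshold quoted in the thread."""
    return parse_glibc_version(text) >= FIRST_AFFECTED


if __name__ == "__main__":
    # platform.libc_ver() returns (lib, version); both are empty strings
    # when the C library cannot be determined (e.g. on non-glibc systems).
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print(f"glibc {version}: may be affected = {may_be_affected(version)}")
    else:
        print("could not determine a glibc version on this system")
```

Running it inside the image the application is deployed in gives the most relevant answer, since the host's glibc is not the one the container uses.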
So far the issue hasn't manifested under Ubuntu 16.04 or Debian 9. It seems the issue is outside the control of the .NET dev teams and can be closed.
@slav thank you for confirming that!
This is a known issue in glibc 2.27+ in the workstealing of pthread condvars, tracked on their side at https://sourceware.org/bugzilla/show_bug.cgi?id=25847. As of now, no shipping OS has a patched glibc. Possible workarounds:
# Sample ASP.NET Core Dockerfile that builds glibc with the patch for https://sourceware.org/bugzilla/show_bug.cgi?id=25847
# The critical parts are the RUN step that builds the patched glibc and the COPY of the resulting libc into the final image.
FROM mcr.microsoft.com/dotnet/sdk:5.0 AS build

RUN echo "deb-src http://deb.debian.org/debian buster main" >> /etc/apt/sources.list \
    && echo "deb-src http://security.debian.org/debian-security buster/updates main" >> /etc/apt/sources.list \
    && echo "deb-src http://deb.debian.org/debian buster-updates main" >> /etc/apt/sources.list \
    && apt-get update \
    && apt-get install -y --no-install-recommends \
        dpkg-dev devscripts \
    && apt-get source glibc \
    && apt-get build-dep -y glibc \
    && cd /glibc-* \
    # Apply patch for https://sourceware.org/bugzilla/show_bug.cgi?id=25847
    && curl "https://sourceware.org/bugzilla/attachment.cgi?id=12484&action=diff&collapsed=&headers=1&format=raw" | \
        patch nptl/pthread_cond_wait.c \
    # Disable tests (some fail when run in a container)
    && sed -i 's/\(RUN_TESTSUITE = \)yes/\1no/' debian/rules \
    # Build glibc
    && debuild -b -uc -us \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /source
COPY *.sln .
COPY aspnetapp/*.csproj ./aspnetapp/
RUN dotnet restore
COPY aspnetapp/. ./aspnetapp/
WORKDIR /source/aspnetapp
RUN dotnet publish -c release -o /app --no-restore

FROM mcr.microsoft.com/dotnet/aspnet:5.0
WORKDIR /app
COPY --from=build /app ./
# Replace the image's libc with the patched build
COPY --from=build /glibc-2.28/build-tree/amd64-libc/libc.so /lib/x86_64-linux-gnu/libc-2.28.so
ENTRYPOINT ["dotnet", "aspnetapp.dll"]
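After building an image like the one above, it can be useful to confirm which libc file the running process has actually mapped. The following Python sketch parses the contents of `/proc/<pid>/maps` (standard Linux layout: address, perms, offset, dev, inode, pathname) and lists the libc paths it finds. The function name `mapped_libc_paths` is made up for this example, and the check only shows the path in use; it cannot tell whether that file contains the patch.

```python
import re

# Matches file names such as "libc-2.28.so" or "libc.so.6",
# but not other libraries like "libcrypto.so.1.1".
_LIBC_NAME = re.compile(r"libc(-[\d.]+)?\.so(\.\d+)?")


def mapped_libc_paths(maps_text):
    """Return the distinct libc file paths named in /proc/<pid>/maps content."""
    paths = []
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        name = path.rsplit("/", 1)[-1]
        if _LIBC_NAME.fullmatch(name) and path not in paths:
            paths.append(path)
    return paths


if __name__ == "__main__":
    import sys

    # Pass the PID of the dotnet process; defaults to this Python process.
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    with open(f"/proc/{pid}/maps") as handle:
        for libc_path in mapped_libc_paths(handle.read()):
            print(libc_path)
```

For the Dockerfile above, the expected path would be `/lib/x86_64-linux-gnu/libc-2.28.so`; comparing that file's checksum against the one produced in the build stage is one way to confirm the copy took effect.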