[release/7.0] Fix pthread_cond_wait race on macOS #82893

Merged
carlossanlop merged 1 commit into release/7.0 from backport/pr-82709-to-release/7.0
Mar 8, 2023

Conversation

github-actions[bot]
Contributor

@github-actions github-actions bot commented Mar 2, 2023

Backport of #82709 to release/7.0

/cc @janvorli

Customer Impact

Applications compiled with NativeAOT can hang intermittently at startup on macOS. We observed this with our own crossgen2 in the CI.
The problem is caused by the macOS implementation of pthread_cond_broadcast not behaving as documented in a race condition. There is a tiny window within which a pthread_cond_wait is not woken by the pthread_cond_broadcast when the latter is invoked without the related mutex held.

Testing

Stress testing by repeatedly running crossgen2 compiled with NativeAOT on macOS without any arguments. Without the fix, it hung within tens or hundreds of thousands of iterations. With the fix, it ran without a hang for 5.5 million iterations.

Risk

Very low. The change just moves the pthread_cond_broadcast call inside the mutex, and the documentation for that function says it should not matter whether it is called with the mutex held or not.

The native runtime event implementations for nativeaot and GC use
pthread_cond_wait to wait for the event and pthread_cond_broadcast
to signal that the event was set. While this usage of
pthread_cond_broadcast conforms with the documentation, it turns out
that glibc before 2.25 had a race in the implementation that can
cause the pthread_cond_broadcast to go unnoticed and the waiter to
wait forever. It turns out that the macOS implementation has the
same issue.
The fix for the issue is to call pthread_cond_broadcast while the
related mutex is taken.
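
A minimal sketch of the pattern described above, assuming a manual-reset event built on a mutex, a condition variable, and a boolean flag. The names `sketch_event`, `sketch_event_set`, `sketch_event_wait`, and `sketch_run` are hypothetical and are not taken from the runtime sources; the actual change is not shown in this conversation. The point illustrated is only that pthread_cond_broadcast is issued before the mutex is released.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical manual-reset event; illustrative names, not the runtime's. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            is_set;
} sketch_event;

static int sketch_event_init(sketch_event *e) {
    if (pthread_mutex_init(&e->mutex, NULL) != 0) return -1;
    if (pthread_cond_init(&e->cond, NULL) != 0) {
        pthread_mutex_destroy(&e->mutex);
        return -1;
    }
    e->is_set = false;
    return 0;
}

static void sketch_event_destroy(sketch_event *e) {
    pthread_cond_destroy(&e->cond);
    pthread_mutex_destroy(&e->mutex);
}

static void sketch_event_set(sketch_event *e) {
    pthread_mutex_lock(&e->mutex);
    e->is_set = true;
    /* Broadcast while the mutex is still held, as the fix describes. */
    pthread_cond_broadcast(&e->cond);
    pthread_mutex_unlock(&e->mutex);
}

static void sketch_event_wait(sketch_event *e) {
    pthread_mutex_lock(&e->mutex);
    /* Re-check the flag in a loop to tolerate spurious wakeups. */
    while (!e->is_set) {
        pthread_cond_wait(&e->cond, &e->mutex);
    }
    pthread_mutex_unlock(&e->mutex);
}

static void *sketch_waiter(void *arg) {
    sketch_event_wait((sketch_event *)arg);
    return NULL;
}

/* Starts waiter_count waiters, sets the event, and returns how many
   waiters were joined, or -1 on invalid input or init failure. */
static int sketch_run(int waiter_count) {
    enum { MAX_WAITERS = 16 };
    pthread_t threads[MAX_WAITERS];
    sketch_event ev;
    int started = 0;
    int joined = 0;

    if (waiter_count < 0 || waiter_count > MAX_WAITERS) return -1;
    if (sketch_event_init(&ev) != 0) return -1;

    for (int i = 0; i < waiter_count; i++) {
        if (pthread_create(&threads[started], NULL, sketch_waiter, &ev) != 0) break;
        started++;
    }

    sketch_event_set(&ev);

    for (int i = 0; i < started; i++) {
        if (pthread_join(threads[i], NULL) == 0) joined++;
    }

    sketch_event_destroy(&ev);
    return joined;
}
```

Calling `sketch_run(4)` from a test program should return 4 once all waiters have been released. A single run like this only exercises the pattern; it would not be expected to reproduce the rare race reported here.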

This change fixes intermittent crossgen2 hangs with the nativeaot build of
crossgen2 reported in #81570. I was able to repro the hang locally in
tens of thousands of iterations of running crossgen2 without any arguments
(the hang occurs when server GC creates threads). With this fix,
it ran without problems over the weekend, passing 5.5 million iterations.
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@janvorli janvorli requested a review from jkotas March 2, 2023 14:18
@janvorli janvorli self-assigned this Mar 2, 2023
@janvorli janvorli added this to the 7.0.x milestone Mar 2, 2023
@janvorli janvorli added the Servicing-consider Issue for next servicing release review label Mar 2, 2023
@ghost

ghost commented Mar 2, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

IMPORTANT: Is this backport for a servicing release? If so and this change touches code that ships in a NuGet package, please make certain that you have added any necessary package authoring and gotten it explicitly reviewed.

Author: github-actions[bot]
Assignees: janvorli
Labels:

area-NativeAOT-coreclr

Milestone: -

Member

@jeffschwMSFT jeffschwMSFT left a comment

approved. we will take for consideration in 7.0.x

@rbhanda rbhanda modified the milestones: 7.0.x, 7.0.5 Mar 7, 2023
@rbhanda rbhanda added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Mar 7, 2023
@carlossanlop
Member

Approved by Tactics.
Signed-off by area owners.
CI is green.
No OOB changes needed (native).
Ready to merge. :shipit:

@carlossanlop carlossanlop merged commit 43b0192 into release/7.0 Mar 8, 2023
@carlossanlop carlossanlop deleted the backport/pr-82709-to-release/7.0 branch March 8, 2023 18:42
@ghost ghost locked as resolved and limited conversation to collaborators Apr 7, 2023
Labels
area-NativeAOT-coreclr Servicing-approved Approved for servicing release
5 participants