glibc segfaulting on P10 with OpenBLAS #1780
According to the core dump:
Steps to recreate on P10: and then link the test.
The code is segfaulting when loading libquadmath, an indirect dependency from libgfortran:
So the initialization code for libquadmath is doing something wrong there...
FYI, this works with at-next-15.0-0-alpha2.
More info: the issue is not likely on libquadmath itself, it's just triggered by it. libquadmath has a constructor that is run at load time that registers new printf formats. In the process a call to glibc @ malloc.c:3076
The instruction that causes the segfault:
I tried reproducing the issue without linking to openblas, but using the same dependencies:
But the executable above works fine. I ran both programs side by side and single-stepped close to where the issue happens, but couldn't spot anything out of the ordinary. I suspect this could be an issue with TLS initialization in glibc.
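For readers unfamiliar with the load-time mechanism described above, here is a minimal sketch of a library constructor that registers a custom printf format through glibc's `register_printf_specifier`. It is not libquadmath's code: the `%W` specifier and the names `print_tagged`, `tagged_arginfo`, `register_tagged_format` and `format_tagged` are hypothetical and chosen only for illustration. It assumes glibc and a GCC-compatible compiler.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <printf.h>   /* glibc-specific: register_printf_specifier, PA_INT */
#include <stddef.h>
#include <stdio.h>

/* Hypothetical handler: prints an int argument as "<value>". */
static int print_tagged(FILE *stream, const struct printf_info *info,
                        const void *const *args)
{
    (void)info;
    /* args[0] points at the argument fetched by printf. */
    int value = *(const int *)args[0];
    return fprintf(stream, "<%d>", value);
}

/* Tells printf that the specifier consumes one int argument. */
static int tagged_arginfo(const struct printf_info *info, size_t n,
                          int *argtypes, int *size)
{
    (void)info;
    (void)size;
    if (n > 0)
        argtypes[0] = PA_INT;
    return 1;
}

/* Runs when the object is loaded, before main(). Registration happens
 * inside glibc, which may allocate memory for its handler tables, so
 * this code depends on glibc's own state being set up correctly. */
__attribute__((constructor))
static void register_tagged_format(void)
{
    int rc = register_printf_specifier('W', print_tagged, tagged_arginfo);
    assert(rc == 0);
    (void)rc;
}

/* The compiler does not know about %W, so silence its format checks. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wformat"
#pragma GCC diagnostic ignored "-Wformat-extra-args"
int format_tagged(char *buf, size_t len, int value)
{
    return snprintf(buf, len, "%W", value);
}
#pragma GCC diagnostic pop
```

Because the constructor runs during loading, a failure inside it shows up as a crash before `main()` is reached, which matches the shape of the traceback discussed in this thread.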
I was able to replicate this exact segfault traceback when running one of the tests shipped in our internal OpenBLAS distribution. After I extract and build OpenBLAS on a P10 (using at14.0), the executable is found at OpenBLAS/lapack-netlib/TESTING/EIG/xeigtstz, and the testing environment runs this test when I type "make complex16" in OpenBLAS/lapack-netlib/TESTING. Interestingly, the other three data types ("single", "double", and "complex") do not run into this segfault. Only double-precision complex has the problem.
The TLS access was failing because the offset used in the instruction was incorrect. The same code in the power9-tuned glibc from AT 14 shows that the load should instead be
This struct is defined in glibc this way (
struct _Unwind_Exception
{
  _Unwind_Exception_Class exception_class;
  _Unwind_Exception_Cleanup_Fn exception_cleanup;
  _Unwind_Word private_1;
  _Unwind_Word private_2;
  /* @@@ The IA-64 ABI says that this structure must be double-word aligned.
     Taking that literally does not make much sense generically. Instead we
     provide the maximum alignment required by any type for the machine. */
} __attribute__((__aligned__));
When no specific value is given to attribute
But the code above may need to be changed in upstream glibc. I'll start a discussion there.
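To illustrate why a changed default for `__attribute__((__aligned__))` can shift field offsets, here is a small sketch. It is not glibc's actual layout: the struct names, the enclosing `thread_*` structs, and the alignment values 16 and 32 are assumptions picked only to make the effect visible. The two explicit alignments stand in for two builds whose compilers disagree on the default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same four 8-byte fields, as if built with a 16-byte default alignment. */
struct exc_a16 {
    uint64_t exception_class;
    uint64_t exception_cleanup;
    uint64_t private_1;
    uint64_t private_2;
} __attribute__((__aligned__(16)));

/* Same fields, as if built with a 32-byte default alignment. */
struct exc_a32 {
    uint64_t exception_class;
    uint64_t exception_cleanup;
    uint64_t private_1;
    uint64_t private_2;
} __attribute__((__aligned__(32)));

/* Hypothetical enclosing structures that embed the struct above. */
struct thread_a16 {
    uint64_t header;
    struct exc_a16 exc;
    uint64_t field_after;
};

struct thread_a32 {
    uint64_t header;
    struct exc_a32 exc;
    uint64_t field_after;
};

static_assert(_Alignof(struct exc_a16) == 16, "explicit 16-byte alignment");
static_assert(_Alignof(struct exc_a32) == 32, "explicit 32-byte alignment");

/* Offset of the field that follows the embedded struct in each layout. */
size_t field_offset_align16(void)
{
    return offsetof(struct thread_a16, field_after);
}

size_t field_offset_align32(void)
{
    return offsetof(struct thread_a32, field_after);
}
```

With these definitions `field_offset_align16()` is 48 and `field_offset_align32()` is 64, because the larger alignment adds padding before the embedded struct. Code compiled against one layout that reads memory laid out by the other would use the wrong offset, which is the kind of mismatch described in this thread.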
A bit more explanation on the problem: the issue happens only when linking against the P10 glibc multilib (compiled with
So there was likely a mismatch between the memory the loader initialized and what glibc was trying to access. The change to the default alignment value has been reverted in upstream GCC (gcc-mirror/gcc@a37b5bc) and will be backported to the GCC 10 branch (followed by AT 14). There is also discussion on a fix for glibc so that The fixes should land in AT 14 over the next few weeks, and will be released as part of a future AT 14.0-2.
The upstream fix has also been backported to the glibc 2.32 branch and will be merged to the ibm branch before the next AT 14 release.
Report from @RajalakshmiS:
Build and install OpenBLAS.
Then