Seg fault in Cholesky #181

Closed
haimav opened this Issue Sep 13, 2016 · 30 comments

Contributor

haimav commented Sep 13, 2016

In my larger code, the following code generates a seg fault in Elemental.

    El::Identity(C, 100, 100);
    El::Cholesky(El::LOWER, C);

(original code was more complex, but even this generated the seg fault).

Here is a stack trace:

Program received signal SIGSEGV, Segmentation fault.
0x00007fff8af9a2da in stack_not_16_byte_aligned_error () from /usr/lib/system/libdyld.dylib
(gdb) where
#0 0x00007fff8af9a2da in stack_not_16_byte_aligned_error () from /usr/lib/system/libdyld.dylib
#1 0x00007fff5fbfcb80 in ?? ()
#2 0x00000001028d2398 in ?? () from /usr/local/lib/libEl.dylib
#3 0x0000000000137ad6 in ?? ()
#4 0x00000001012cbf6e in El::Matrix::operator()(El::Range, El::Range) () from /usr/local/lib/libEl.dylib
#5 0x000000010152ecfc in void El::cholesky::UVar3(El::Matrix&) () from /usr/local/lib/libEl.dylib
#6 0x0000000101537f15 in void El::cholesky::UVar3(El::AbstractDistMatrix&) () from /usr/local/lib/libEl.dylib
#7 0x00000001000a01b6 in skylark::ml::feature_map_precond_t<El::DistMatrix<double, (El::DistNS::Dist)0, (El::DistNS::Dist)2, (El::DistWrapNS::DistWrap)0> >::feature_map_precond_t<skylark::ml::kernel_container_t, El::DistMatrix<double, (El::DistNS::Dist)0, (El::DistNS::Dist)2, (El::DistWrapNS::DistWrap)0> > (this=0x105757490, k=..., lambda=, X=..., s=, context=..., params=...) at /Users/haimav/Coding/libskylark/ml/krr.hpp:385
#8 0x00000001000a44c1 in skylark::ml::FasterKernelRidge<double, skylark::ml::kernel_container_t> (direction=, k=..., X=..., Y=..., lambda=0.01, A=..., s=50, context=..., params=...) at /Users/haimav/Coding/libskylark/ml/krr.hpp:501
#9 0x000000010012a280 in skylark::ml::FasterKernelRLSC<double, int, skylark::ml::kernel_container_t> (direction=COLUMNS, k=..., X=..., L=..., lambda=0.01, A=..., rcoding=..., s=50, context=..., params=...) at /Users/haimav/Coding/libskylark/ml/rlsc.hpp:244

#10 0x000000010012f243 in execute_classification (context=...) at /Users/haimav/Coding/libskylark/examples/kernel_regression.cpp:401
#11 0x00000001001309b9 in main (argc=9, argv=0x7fff5fbff978) at /Users/haimav/Coding/libskylark/examples/kernel_regression.cpp:882

Member

poulson commented Sep 13, 2016

None of my current builds seem to show any of these symptoms (with tests/lapack_like/Cholesky passing all tests with --uplo L and --uplo U and various numbers of MPI processes). Would you mind providing a bit more information?

Contributor

haimav commented Sep 13, 2016

Just try a program that contains only

    El::Initialize(argc, argv);

    El::Matrix<double> C;
    El::Identity(C, 1000, 1000);
    El::Cholesky(El::LOWER, C);

and you get the seg fault, at least on my Mac.

Member

rhl- commented Sep 13, 2016

Is this on master or a release?

Member

jeffhammond commented Sep 13, 2016

stack_not_16_byte_aligned_error looks like some kind of Mac toolchain issue.

Member

poulson commented Sep 14, 2016

I unfortunately don't have a personal Mac to test this on, but extensive tests on my Linux box have not turned anything up.

Member

poulson commented Sep 17, 2016

Has anyone seen this behavior on any other system? A large number of users making regular use of El::Cholesky, in addition to none of my tests turning anything up, suggests that this is indeed a toolchain mismatch/issue as suggested by @jeffhammond

Contributor

haimav commented Sep 17, 2016

Maybe it is, but I don't have any other Mac to test it on. Also, there is no reason why the toolchain would suddenly become mismatched; it was working fine until recently.

Member

poulson commented Sep 17, 2016

For what it's worth, Cholesky has not been modified in quite some time.

Member

poulson commented Sep 19, 2016

@haimav Am I understanding the discussion in xdata-skylark/libskylark#36 properly that the issue was in the implementation of Skylark's Gram and not Elemental's Cholesky?

Contributor

haimav commented Sep 19, 2016

@poulson No, that was a separate issue.

Member

poulson commented Sep 19, 2016

Thanks; could you share which compiler (and version) and MPI implementation (and version) were being used? If necessary I will buy a Mac to debug this.

Contributor

haimav commented Sep 19, 2016

I have just updated gcc and I am recompiling Elemental. Will update you if it works...

Member

poulson commented Sep 20, 2016

I bought a Macbook yesterday, compiled GCC 6.2.0 from scratch, MPICH 3.2 on top of it, and Elemental HEAD on top of those in Debug mode and tests/lapack_like/Cholesky passes all tests I can throw at it. What version of GCC and/or MPICH/OpenMPI are your errors occurring with?

Contributor

haimav commented Sep 20, 2016

I am actually using Homebrew to install GCC and MPI, and used it to compile GCC 5.2.0.

Member

poulson commented Sep 20, 2016

Thanks; I can look into that toolchain tonight. My guess is that this is a homebrew compatibility issue. It would be good to verify that the same version of GCC and MPI implementation (is it MPICH or OpenMPI?) was used for each component.

Member

jeffhammond commented Sep 20, 2016

I homebrewed GCC 6.2.0 yesterday and can test that myself.

I wish I had the kind of money that let me buy a new computer just to debug GitHub issues 😄

Member

jeffhammond commented Sep 21, 2016

I too am unable to reproduce.

I saw the following issue:

/var/folders/tz/3sxkhvt90632mzr6fxm1cd0h0000gp/T//ccRiuRgy.s:235666:11: warning: section "__const_coal" is deprecated
        .section __DATA,__const_coal,coalesced
                 ^      ~~~~~~~~~~~~
/var/folders/tz/3sxkhvt90632mzr6fxm1cd0h0000gp/T//ccRiuRgy.s:235666:11: note: change section name to "__const"
        .section __DATA,__const_coal,coalesced
                 ^      ~~~~~~~~~~~~

This was solved exactly as described on http://stackoverflow.com/questions/39502921/warning-section-const-coal-is-deprecated-error-after-updating-xcode-to-la.

The test I ran successfully was:

    #include "El.hpp"
    int main(int argc, char* argv[]){
        El::Initialize(argc, argv);
        El::Matrix<double> C;
        El::Identity(C, 1000, 1000);
        El::Cholesky(El::LOWER, C);
    }

compiled with:

    /opt/mpich/dev/gcc/default/bin/mpicxx \
    -I$HOME/Work/Elemental/git/install-gcc/include \
    -std=c++11 toy.cc  \
    -L$HOME/Work/Elemental/git/install-gcc/lib -lEl \
    -Wl,-rpath -Wl,$HOME/Work/Elemental/git/install-gcc/lib

I am running OS X 10.11.6 with the following GCC and MPICH.

$ g++-6 -v
Using built-in specs.
COLLECT_GCC=g++-6
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc/6.2.0/libexec/gcc/x86_64-apple-darwin15.6.0/6.2.0/lto-wrapper
Target: x86_64-apple-darwin15.6.0
Configured with: ../configure --build=x86_64-apple-darwin15.6.0 --prefix=/usr/local/Cellar/gcc/6.2.0 --libdir=/usr/local/Cellar/gcc/6.2.0/lib/gcc/6 --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-6 --with-gmp=/usr/local/opt/gmp --with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc --with-isl=/usr/local/opt/isl --with-system-zlib --enable-libstdcxx-time=yes --enable-stage1-checking --enable-checking=release --enable-lto --with-build-config=bootstrap-debug --disable-werror --with-pkgversion='Homebrew gcc 6.2.0 --without-multilib' --with-bugurl=https://github.com/Homebrew/homebrew/issues --enable-plugin --disable-nls --disable-multilib
Thread model: posix
gcc version 6.2.0 (Homebrew gcc 6.2.0 --without-multilib) 
$ /opt/mpich/dev/gcc/default/bin/mpichversion 
MPICH Version:      3.3a1
MPICH Release date: unreleased development copy
MPICH Device:       ch3:nemesis
MPICH configure:    CC=gcc-6 CXX=g++-6 FC=gfortran-6 F77=gfortran-6 --enable-cxx --enable-fortran --enable-threads=runtime --enable-g=dbg --with-pm=hydra --prefix=/opt/mpich/dev/gcc/default --enable-wrapper-rpath --disable-static --enable-shared
MPICH CC:   gcc-6    -g -O2
MPICH CXX:  g++-6   -g
MPICH F77:  gfortran-6   -g
MPICH FC:   gfortran-6   -g
MPICH Custom Information: 

Member

poulson commented Sep 21, 2016

Thanks for looking into this Jeff. I'm compiling with Homebrew's GCC 6 right now (with MPICH 3.2 manually built on top of said compiler) and hope to contribute another datapoint.

EDIT: All tests pass for me.

Member

rhl- commented Oct 13, 2016

Since we can't replicate it, I'm going to close it. Let's reopen if something changes.

rhl- closed this Oct 13, 2016

Member

poulson commented Oct 31, 2016

I have been running into similar segfaults on Mac OS X El Capitan with Release builds (but not Debug builds). I believe that the following discussion might be relevant:
https://trac.macports.org/ticket/44596#comment:35

poulson reopened this Oct 31, 2016

Member

poulson commented Oct 31, 2016

To nail this down a bit further: I only observe the alignment errors with homebrew's GCC when building in Release mode; the issue seems to disappear when compiling with LLVM, so my current hypothesis is that this is caused by a faulty GCC toolchain (similar to that discussed in Macports).

Member

poulson commented Nov 13, 2016

I believe that this issue is due to a bug in GCC not always forcing the stack to be aligned to 16-byte boundaries on OS X when compiling with -O3. A similar issue was reported about 8 years ago: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35271

The following output from lldb shows that a movdqa between an SSE register (xmm0) and the stack (%rsp) that is not 16-byte aligned is at fault:

Process 86524 stopped
* thread #1: tid = 0x23d403, 0x00007fffb178c506 libdyld.dylib`stack_not_16_byte_aligned_error, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x00007fffb178c506 libdyld.dylib`stack_not_16_byte_aligned_error
libdyld.dylib`stack_not_16_byte_aligned_error:
->  0x7fffb178c506 <+0>: movdqa %xmm0, (%rsp)
    0x7fffb178c50b <+5>: int3   

libdyld.dylib`_dyld_func_lookup:
    0x7fffb178c50c <+0>: pushq  %rbp
    0x7fffb178c50d <+1>: movq   %rsp, %rbp
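
The alignment condition behind this fault can be made concrete with a small helper. `movdqa` requires its memory operand to sit on a 16-byte boundary, which amounts to the low four bits of the address being zero. The following is only a sketch: the function names are hypothetical and are not part of Elemental, GCC, or dyld.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of bytes by which the address p lies past the
// previous 16-byte boundary. A result of 0 means p is 16-byte aligned.
inline std::size_t misalignment_16(const void* p) {
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) & 0xF);
}

// Hypothetical helper: true when p satisfies the alignment movdqa expects.
inline bool is_16_byte_aligned(const void* p) {
    return misalignment_16(p) == 0;
}
```

Applied to a buffer declared with `alignas(16)`, `is_16_byte_aligned` returns true for the buffer's start and false for the address 8 bytes into it; the same mask applied to the stack pointer value is the condition that fails in the trace above.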

poulson added a commit that referenced this issue Nov 14, 2016

Temporarily addressing Issue #181 with an error message and fixing a typo that leads to Debug builds not properly compiling

poulson closed this Nov 19, 2016

Contributor

haimav commented Nov 19, 2016

So, on OS X we should compile with -O2?

Member

poulson commented Nov 19, 2016

Due to what is almost certainly a GCC optimization bug on OS X, the better recommendation would be to compile with Clang. But -O2 would also work (if you're okay with the performance hit).

EDIT: If you choose to go with a Release build with GCC and an -O2 optimization level, you will need to add the extra "I know what I'm doing" CMake flag detailed in 6fd612e
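
The combination discussed here (GCC rather than Clang, targeting OS X, with optimization enabled) can be detected at compile time from predefined macros. The following is a minimal sketch with hypothetical function names; it is not the check added in 6fd612e. Note that `__OPTIMIZE__` is defined for any optimizing build, so it cannot tell -O2 from -O3 and the guard is deliberately conservative.

```cpp
// Hypothetical helpers built only on compiler-predefined macros.

// True when compiling for an Apple platform.
constexpr bool is_apple_target() {
#if defined(__APPLE__)
    return true;
#else
    return false;
#endif
}

// True for GCC itself; Clang also defines __GNUC__, so it is excluded here.
constexpr bool is_real_gcc() {
#if defined(__GNUC__) && !defined(__clang__)
    return true;
#else
    return false;
#endif
}

// True when the compiler is optimizing at any level (-O1 and above).
constexpr bool is_optimized_build() {
#if defined(__OPTIMIZE__)
    return true;
#else
    return false;
#endif
}

// True for the configuration that this thread associates with the crash.
constexpr bool is_suspect_toolchain() {
    return is_apple_target() && is_real_gcc() && is_optimized_build();
}
```

A build could use `static_assert(!is_suspect_toolchain(), "...")` to stop early, or simply print a warning at startup; which of the two is appropriate depends on whether -O2 builds should remain allowed.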

Member

poulson commented Nov 19, 2016

If anyone can come up with a Minimum Reproducible Example (ideally not depending on Elemental), then I would be willing to shepherd the bug report through GCC.

jwakely commented Dec 2, 2016

A bug that was fixed 8 years ago is unlikely to be the problem, and should not have recurred.

Before reporting a new GCC bug, try building with ubsan, and asan, and see if they find any problems. A minimal reproducer is ideal, but not necessary. It should be enough to provide preprocessed source for the translation unit that segfaults, and details of the compiler flags that cause apparent miscompilation. See https://gcc.gnu.org/bugs/ (and please read it twice).

Member

poulson commented Dec 3, 2016

@jwakely I completely agree that this particular bug is unlikely to be the issue, but my guess is that there is a bug that is similar in spirit.

For what it's worth, I'm going down the path of running ubsan, but it is unfortunately known to be broken in the system clang on Sierra, and I hit a compiler bug from the git head in Elemental after manually compiling clang:

1.	/Users/poulson/Source/Elemental/include/El/macros/Instantiate.h:96:1 <Spelling=/Users/poulson/Source/Elemental/src/blas_like/level1/ColumnMinAbs.cpp:290:67>: current parser token ';'
2.	/Users/poulson/Source/Elemental/src/blas_like/level1/ColumnMinAbs.cpp:12:1: parsing namespace 'El'
3.	/Users/poulson/Source/Elemental/src/blas_like/level1/ColumnMinAbs.cpp:56:6: instantiating function definition 'El::ColumnMinAbs<float, El::DistNS::Dist::MC, El::DistNS::Dist::STAR>'
4.	/Users/poulson/Source/Elemental/include/El/core/DistMatrix/Element/VC_STAR.hpp:153:67: instantiating class definition 'El::DistMatrix<float, El::DistNS::Dist::MC, El::DistNS::Dist::STAR, El::DistWrapNS::DistWrap::ELEMENT>'
5.	/Users/poulson/Source/Elemental/include/El/core/DistMatrix/Element/STAR_MC.hpp:21:7: instantiating class definition 'El::DistMatrix<float, El::DistNS::Dist::STAR, El::DistNS::Dist::MC, El::DistWrapNS::DistWrap::ELEMENT>'
6.	/Users/poulson/Source/Elemental/include/El/core/DistMatrix/Element/STAR_MC.hpp:21:7: LLVM IR generation of declaration 'El::DistMatrix'
clang-4.0: error: unable to execute command: Abort trap: 6
clang-4.0: error: clang frontend command failed due to signal (use -v to see invocation)
clang version 4.0.0 (trunk 288506)
Target: x86_64-apple-darwin16.1.0
Thread model: posix
InstalledDir: /Users/poulson/Source/build/bin
clang-4.0: note: diagnostic msg: PLEASE submit a bug report to http://llvm.org/bugs/ and include the crash backtrace, preprocessed source, and associated run script.
clang-4.0: note: diagnostic msg: 
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang-4.0: note: diagnostic msg: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/ColumnMinAbs-946391.cpp
clang-4.0: note: diagnostic msg: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/ColumnMinAbs-946391.sh
clang-4.0: note: diagnostic msg: Crash backtrace is located in
clang-4.0: note: diagnostic msg: /Users/poulson/Library/Logs/DiagnosticReports/clang-4.0_<YYYY-MM-DD-HHMMSS>_<hostname>.crash
clang-4.0: note: diagnostic msg: (choose the .crash file that corresponds to your crash)
clang-4.0: note: diagnostic msg: 

********************
make[2]: *** [CMakeFiles/El.dir/src/blas_like/level1/ColumnMinAbs.cpp.o] Error 254
make[1]: *** [CMakeFiles/El.dir/all] Error 2
make: *** [all] Error 2

LLVM's user registration is currently down, but at some point I can start working up this compiler bug inception stack.

EDIT: In the meantime, I'll try GCC's ubsan on a Linux machine.

jwakely commented Dec 4, 2016

What about GCC's ubsan? Oh, just saw your edit. It works on OS X too.

Member

poulson commented Dec 4, 2016

GCC 5 ubsan is clean on Linux. I can try on OS X as well.

Member

poulson commented Dec 12, 2016

For what it's worth, there seems to be no issue with Homebrew's GCC 4.9 on OS X Sierra (with any optimization level).
