Compiler bus error during build #418

Closed
davidssmith opened this issue Feb 28, 2018 · 9 comments

Comments

@davidssmith

I'm getting a compiler bus error during build, and I really don't know where to start debugging it. I have tried nuking the package directory and rebuilding, and I've tried a different compiler, but I got bus errors on both, so I'm thinking it's an MXNet issue.

g++ -std=c++11 -c -DMSHADOW_FORCE_STREAM -Wall -Wsign-compare -O3 -DNDEBUG=1 -I/gpfs22/home/dss/.julia/v0.6/MXNet/deps/src/mxnet/mshadow/ -I/gpfs22/home/dss/.julia/v0.6/MXNet/deps/src/mxnet/dmlc-core/include -fPIC -I/gpfs22/home/dss/.julia/v0.6/MXNet/deps/src/mxnet/nnvm/include -I/gpfs22/home/dss/.julia/v0.6/MXNet/deps/src/mxnet/dlpack/include -Iinclude -funroll-loops -Wno-unused-variable -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -DINTERFACE64 -msse3 -I/opt/easybuild/software/Core/CUDA/8.0.61/include -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSHADOW_USE_PASCAL=0 -DMXNET_USE_OPENCV=0 -fopenmp -DMSHADOW_USE_CUDNN=1 -DMXNET_USE_LAPACK -I/gpfs22/home/dss/.julia/v0.6/MXNet/deps/src/mxnet/cub -DMXNET_USE_LIBJPEG_TURBO=0 -MMD -c src/operator/contrib/multibox_detection.cc -o build/src/operator/contrib/multibox_detection.o
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/deformable_convolution.o] Error 4
make: *** Waiting for unfinished jobs....
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/dequantize.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/deformable_psroi_pooling.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/multibox_detection.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/count_sketch.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/ifft.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/ctc_loss.o] Error 4
g++: internal compiler error: Bus error (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
make: *** [build/src/operator/contrib/fft.o] Error 4
================================================[ ERROR: MXNet ]=================================================

LoadError: failed process: Process(`make -j8 USE_BLAS=openblas 'MSHADOW_LDFLAGS=-lm /gpfs22/home/dss/julia-d386e40c17/bin/../lib/julia/libopenblas64_.so'`, ProcessExited(2)) [2]
while loading /gpfs22/home/dss/.julia/v0.6/MXNet/deps/build.jl, in expression starting on line 81

=================================================================================================================

shell> g++ --version
g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)
@iblislin
Member

iblislin commented Mar 1, 2018

gcc 4.4 is quite old.

> I've tried a different compiler

Which one have you tried?
gcc 5.x and 6.x work for me.

@davidssmith
Author

I don't have root, so I can only load modules on this cluster. I have gcc 5 in my path:

[dss@gpu0025 ~]$ which gcc
/opt/easybuild/software/Core/GCCcore/5.4.0/bin/gcc
[dss@gpu0025 ~]$ gcc --version
gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[dss@gpu0025 ~]$ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

shell> gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

julia> ENV["PATH"]
"/opt/easybuild/software/Compiler/GCC/5.4.0-2.26/LLVM/3.9.0/bin:/opt/easybuild/software/Compiler/GCC/5.4.0-2.26/git/2.12.2/bin:/opt/easybuild/software/Compiler/GCC/5.4.0-2.26/Perl/5.24.0/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/gettext/0.19.8/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/ncurses/6.0/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/libxml2/2.9.4/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/XZ/5.2.2/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/expat/2.2.0/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/cURL/7.49.1/bin:/opt/easybuild/software/Compiler/GCCcore/5.4.0/binutils/2.26/bin:/opt/easybuild/software/Core/GCCcore/5.4.0/bin:/opt/easybuild/software/Core/CUDA/8.0.61:/opt/easybuild/software/Core/CUDA/8.0.61/bin:/usr/scheduler/slurm/sbin:/usr/scheduler/slurm/bin:/usr/lpp/mmfs/bin:/usr/local/bin:/usr/local/common/bin:/usr/bin:/bin:/usr/scheduler/slurm/sbin:/usr/scheduler/slurm/bin:/usr/lpp/mmfs/bin:/usr/local/bin:/usr/local/common/bin:/usr/bin:/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/var/cfengine/bin:/home/dss/bin:/var/cfengine/bin"

but the build script is not finding it and is instead using /usr/bin/gcc, which is version 4.4.7.

How can I override the gcc used by the build?
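
To see why the login shell and Julia's shell> mode disagree, it can help to list every gcc on the search path in lookup order from both places. The helper below is a minimal sketch in POSIX sh; the name list_on_path is made up for illustration and is not part of MXNet.jl or Julia.

```shell
# list_on_path NAME [PATHLIST]
# Print every executable called NAME found on a colon-separated search path,
# in lookup order. PATHLIST defaults to the current $PATH.
# The body runs in a subshell, so the IFS change does not leak to the caller.
list_on_path() (
    name=$1
    IFS=:
    for dir in ${2:-$PATH}; do
        candidate=${dir:-.}/$name   # an empty PATH entry means the current directory
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done
)

# The first line printed for each name is what a bare `gcc` or `g++` resolves to.
list_on_path gcc
list_on_path g++
```

If the first entry differs between the two sessions, the build is probably inheriting a different environment than the interactive shell.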

@iblislin
Member

iblislin commented Mar 2, 2018

You can change CC and CXX in this file:

~/.julia/v0.6/MXNet/deps/src/mxnet/make/config.mk

For example, set them to /opt/easybuild/software/Core/GCCcore/5.4.0/bin/gcc and /opt/easybuild/software/Core/GCCcore/5.4.0/bin/g++, respectively.

I will add a patch later so that users can configure this from Julia's REPL.
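
As a rough sketch, the edit can be scripted. The helper below assumes the file holds lines of the form "CC = gcc" or "export CC = gcc"; that layout is an assumption, so check the actual file first. The name set_make_var is hypothetical, and the demonstration runs on a scratch file rather than the real config.mk.

```shell
# set_make_var FILE VAR VALUE
# Rewrite a "VAR = ..." or "export VAR = ..." line in FILE so it assigns VALUE,
# or append "VAR = VALUE" if no such line exists. A FILE.bak backup is kept.
# VALUE must not contain "|" or "&", which are special in the sed expression.
set_make_var() {
    file=$1
    var=$2
    value=$3
    if grep -Eq "^(export[[:space:]]+)?${var}[[:space:]]*=" "$file"; then
        sed -E -i.bak "s|^(export[[:space:]]+)?${var}[[:space:]]*=.*|\1${var} = ${value}|" "$file"
    else
        cp "$file" "$file.bak"
        printf '%s = %s\n' "$var" "$value" >> "$file"
    fi
}

# Demonstration on a scratch file standing in for make/config.mk.
scratch=$(mktemp)
printf 'export CC = gcc\nexport CXX = g++\n' > "$scratch"
set_make_var "$scratch" CC  /opt/easybuild/software/Core/GCCcore/5.4.0/bin/gcc
set_make_var "$scratch" CXX /opt/easybuild/software/Core/GCCcore/5.4.0/bin/g++
cat "$scratch"
```

To apply it for real, pass the config.mk path above instead of the scratch file, then rebuild the package.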

@iblislin
Member

iblislin commented Mar 2, 2018

Please check out this patch: https://github.com/dmlc/MXNet.jl/pull/419/files

iblislin added a commit that referenced this issue Mar 4, 2018
* build: propagate CC/CXX into config.mk

See: #418 (comment)

* update doc
@davidssmith
Author

It compiles now, but when I start Julia and issue using MXNet at the REPL, it crashes the REPL. I'm looking into it to make sure I applied the patch correctly.

@davidssmith
Author

Here is the beginning of the error message, just in case it helps.

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.2 (2017-12-13 18:08 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-pc-linux-gnu

julia> using MXNet

signal (11): Segmentation fault
while loading no file, in expression starting on line 0
free at /usr/lib/x86_64-linux-gnu/libjemalloc.so.1 (unknown line)
_ZN5mxnet2op12OperatorTuneIlE8demangleB5cxx11EPKc at /home/dss/.julia/v0.6/MXNet/deps/usr/lib/libmxnet.so (unknown line)
unknown function (ip: 0x7f18050c58bd)
unknown function (ip: 0x7f184d9f9ad9)
unknown function (ip: 0x7f184d9f9bea)
unknown function (ip: 0x7f184d9febf5)
_dl_catch_error at /build/glibc-itYbWN/glibc-2.26/elf/dl-error-skeleton.c:198
unknown function (ip: 0x7f184d9fe148)
dlopen_doit at /build/glibc-itYbWN/glibc-2.26/dlfcn/dlopen.c:66
_dl_catch_error at /build/glibc-itYbWN/glibc-2.26/elf/dl-error-skeleton.c:198
_dlerror_run at /build/glibc-itYbWN/glibc-2.26/dlfcn/dlerror.c:163
__dlopen at /build/glibc-itYbWN/glibc-2.26/dlfcn/dlopen.c:87
jl_load_dynamic_library_ at /buildworker/worker/package_linux64/build/src/dlload.c:189
jl_get_library at /buildworker/worker/package_linux64/build/src/runtime_ccall.cpp:159
emit_a_ccall at /buildworker/worker/package_linux64/build/src/ccall.cpp:2074
emit_ccall at /buildworker/worker/package_linux64/build/src/ccall.cpp:1899
emit_expr at /buildworker/worker/package_linux64/build/src/codegen.cpp:4156
emit_assignment at /buildworker/worker/package_linux64/build/src/codegen.cpp:3853 [inlined]
emit_expr at /buildworker/worker/package_linux64/build/src/codegen.cpp:4159
emit_stmtpos at /buildworker/worker/package_linux64/build/src/codegen.cpp:4064 [inlined]
emit_function at /buildworker/worker/package_linux64/build/src/codegen.cpp:6248
jl_compile_linfo at /buildworker/worker/package_linux64/build/src/codegen.cpp:1256
emit_invoke at /buildworker/worker/package_linux64/build/src/codegen.cpp:3400 [inlined]
emit_expr at /buildworker/worker/package_linux64/build/src/codegen.cpp:4135
emit_stmtpos at /buildworker/worker/package_linux64/build/src/codegen.cpp:4064 [inlined]
emit_function at /buildworker/worker/package_linux64/build/src/codegen.cpp:6248
jl_compile_linfo at /buildworker/worker/package_linux64/build/src/codegen.cpp:1256

@iblislin
Member

iblislin commented Mar 7, 2018

Oh... jemalloc. I ran into a similar issue on Arch Linux.
You can try disabling it in ~/.julia/v0.6/MXNet/deps/src/mxnet/make/config.mk
(note that it's not ~/.julia/v0.6/MXNet/deps/src/mxnet/config.mk; that file will be overwritten by build.jl).

Set USE_JEMALLOC to 0.
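
Because there are two config.mk files and only one of them survives a rebuild, it may be worth checking which value each one carries before building again. This is a minimal sketch: get_make_var is a hypothetical helper, and it only understands plain "VAR = value" or "export VAR = value" lines.

```shell
# get_make_var FILE VAR
# Print the value of the last plain assignment to VAR in FILE (in make, the
# last assignment wins). Prints nothing if VAR is never assigned.
get_make_var() {
    sed -n -E "s/^(export[[:space:]]+)?$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Report USE_JEMALLOC for whichever of the two files exist.
for f in "$HOME/.julia/v0.6/MXNet/deps/src/mxnet/make/config.mk" \
         "$HOME/.julia/v0.6/MXNet/deps/src/mxnet/config.mk"; do
    if [ -f "$f" ]; then
        printf '%s: USE_JEMALLOC=%s\n' "$f" "$(get_make_var "$f" USE_JEMALLOC)"
    fi
done
```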

@davidssmith
Author

Success! I wasn't able to change that file without git complaining that I need to stash, so I uninstalled jemalloc and was able to compile. Now all but one test passes. I doubt it is related, but I'm including the error message in case it is.

SymbolicNode Test: Error During Test                                  
  Got an exception of type MXNet.mx.MXError outside of a @test        
  Cannot find argument 'a', Possible Arguments:                       
  ----------------                                                    
  kernel : Shape(tuple), required                                     
      Convolution kernel size: (w,), (h, w) or (d, h, w)              
  stride : Shape(tuple), optional, default=[]                         
      Convolution stride: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
  dilate : Shape(tuple), optional, default=[]                         
      Convolution dilate: (w,), (h, w) or (d, h, w). Defaults to 1 for each dimension.
  pad : Shape(tuple), optional, default=[]                            
      Zero pad for convolution: (w,), (h, w) or (d, h, w). Defaults to no padding.
  num_filter : int (non-negative), required                           
      Convolution filter(channel) number                              
  num_group : int (non-negative), optional, default=1                 
      Number of group partitions.                                     
  workspace : long (non-negative), optional, default=1024             
      Maximum temporary workspace allowed (MB) in convolution.This parameter has two usages. When CUDNN is not used, it determines the effective batch size of the convolution kernel. When CUDNN is used, it controls the maximum temporary storage used for tuning the best CUDNN kernel when `limited_workspace` strategy is used.
  no_bias : boolean, optional, default=0
      Whether to disable bias parameter.
  cudnn_tune : {None, 'fastest', 'limited_workspace', 'off'},optional, default='None'
      Whether to pick convolution algo by running performance test.
  cudnn_off : boolean, optional, default=0
      Turn off cudnn for this layer.
  layout : {None, 'NCDHW', 'NCHW', 'NCW', 'NDHWC', 'NHWC'},optional, default='None'
      Set layout for input, output and weight. Empty for                                                                                     
      default layout: NCW for 1d, NCHW for 2d and NCDHW for 3d.
  , in operator Convolution(name="", a="a", kernel="(1, 1)", num_filter="1")
  Stacktrace:                      
   [1] macro expansion at /home/dss/.julia/v0.6/MXNet/src/base.jl:77 [inlined]
   [2] set_attr(::MXNet.mx.SymbolicNode, ::Symbol, ::String) at /home/dss/.julia/v0.6/MXNet/src/symbolic-node.jl:232
   [3] _create_atomic_symbol at /home/dss/.julia/v0.6/MXNet/src/symbolic-node.jl:825 [inlined]
   [4] #Convolution#5492(::Array{Any,1}, ::Function, ::Type{MXNet.mx.SymbolicNode}, ::MXNet.mx.SymbolicNode, ::Vararg{MXNet.mx.SymbolicNode,N} where N) at /home/dss/.julia/v0.6/MXNet/src/symbolic-node.jl:903
   [5] (::MXNet.mx.#kw##Convolution)(::Array{Any,1}, ::MXNet.mx.#Convolution, ::Type{MXNet.mx.SymbolicNode}, ::MXNet.mx.SymbolicNode, ::Vararg{MXNet.mx.SymbolicNode,N} where N) at ./<missing>:0
   [6] #Convolution#5496(::Array{Any,1}, ::Function, ::MXNet.mx.SymbolicNode, ::Vararg{MXNet.mx.SymbolicNode,N} where N) at /home/dss/.julia/v0.6/MXNet/src/symbolic-node.jl:924
   [7] (::MXNet.mx.#kw##Convolution)(::Array{Any,1}, ::MXNet.mx.#Convolution, ::MXNet.mx.SymbolicNode) at ./<missing>:0
   [8] test_attrs() at /home/dss/.julia/v0.6/MXNet/test/unittest/symbolic-node.jl:140
   [9] macro expansion at /home/dss/.julia/v0.6/MXNet/test/unittest/symbolic-node.jl:535 [inlined]
   [10] macro expansion at ./test.jl:860 [inlined]
   [11] anonymous at ./<missing>:?
   [12] include_from_node1(::String) at ./loading.jl:576
   [13] include(::String) at ./sysimg.jl:14
   [14] collect_to!(::Array{Module,1}, ::Base.Generator{Array{String,1},##3#6{String}}, ::Int64, ::Int64) at ./array.jl:508
   [15] _collect(::Array{String,1}, ::Base.Generator{Array{String,1},##3#6{String}}, ::Base.EltypeUnknown, ::Base.HasShape) at ./array.jl:489
   [16] test_dir(::String) at /home/dss/.julia/v0.6/MXNet/test/runtests.jl:9
   [17] macro expansion at /home/dss/.julia/v0.6/MXNet/test/runtests.jl:18 [inlined]
   [18] macro expansion at ./test.jl:860 [inlined]
   [19] anonymous at ./<missing>:?
   [20] include_from_node1(::String) at ./loading.jl:576
   [21] include(::String) at ./sysimg.jl:14

@iblislin
Member

iblislin commented Mar 7, 2018

That error is from recent upstream changes.
See apache/mxnet#9677 (comment).

@iblislin iblislin closed this as completed Mar 7, 2018
iblislin added a commit that referenced this issue Mar 8, 2018
iblislin added a commit that referenced this issue Mar 8, 2018
* build: propagate USE_JEMALLOC

see #418 (comment)