Segfault running MNIST example lenet-stn.jl #369

Open
rickhg12hs opened this issue Dec 10, 2017 · 18 comments · May be fixed by #371

@rickhg12hs

The lenet.jl example seems to run OK, but lenet-stn.jl segfaults.

$ julia -e 'versioninfo(); include("lenet-stn.jl")'
Julia Version 0.6.1
Commit 0d7248e2ff (2017-10-24 22:15 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 13 MB allocated on CPU0
INFO: Start training...

signal (11): Segmentation fault
while loading /home/rick/tmp/mnist/MXNet/lenet-stn.jl, in expression starting on line 64
Segmentation fault (core dumped)
@phinzphinz

phinzphinz commented Dec 10, 2017

I have exactly the same problem with this versioninfo():

Julia Version 0.6.1
Commit 0d7248e2ff (2017-10-24 22:15 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-7600 CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Prescott)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

It always segfaults when I start training; it is not a problem of the lenet-stn.jl file. THIS IS REALLY ANNOYING. I have tried everything for the whole weekend: installing MXNet.jl both ways, reinstalling Julia, compiling Julia from source, recompiling incubator-mxnet MANY times with different configurations, and I have even reinstalled my Debian twice (a fresh install wiping the whole SSD), but the problem is still there.

My Debian version is 9.3.
I installed CUDA with the runfile cuda_9.0.176_384.81_linux.run
and cuDNN with both methods (1: using the three .deb files from NVIDIA, and 2: copying the relevant files from cudnn-9.0-linux-x64-v7.tgz as described here). The CUDA samples work without any problems, and the cuDNN samples work as well, so I think I have set up CUDA and cuDNN correctly. I have a ZOTAC 1080 TI GPU, but I have not been able to use it yet because of this problem :( .
I have tried so many things that I cannot say this for sure, but I think the problem was not there when I disabled CUDA in the incubator-mxnet/make/config.mk file!!! So I think it has something to do with CUDA support. If libmxnet.so is built with CUDA support, it segfaults. My last try was this config.mk for incubator-mxnet:


#-------------------------------------------------------------------------------
#  Template configuration for compiling mxnet
#
#  If you want to change the configuration, please use the following
#  steps. Assume you are on the root directory of mxnet. First copy the this
#  file so that any local changes will be ignored by git
#
#  $ cp make/config.mk .
#
#  Next modify the according entries, and then compile by
#
#  $ make
#
#  or build in parallel with 8 threads
#
#  $ make -j8
#-------------------------------------------------------------------------------

#---------------------
# choice of compiler
#--------------------

export CC = gcc
export CXX = g++
export NVCC = nvcc

# whether compile with options for MXNet developer
DEV = 0

# whether compile with debug
DEBUG = 1

# whether compile with profiler
USE_PROFILER =

# whether to turn on signal handler (e.g. segfault logger)
USE_SIGNAL_HANDLER = 1

# the additional link flags you want to add
ADD_LDFLAGS =

# the additional compile flags you want to add
ADD_CFLAGS =

#---------------------------------------------
# matrix computation libraries for CPU/GPU
#---------------------------------------------

# whether use CUDA during compile
USE_CUDA = 1

# add the path to CUDA library to link and compile flag
# if you have already add them to environment variable, leave it as NONE
# USE_CUDA_PATH = /usr/local/cuda
USE_CUDA_PATH = /usr/local/cuda-9.0/

# whether use CuDNN R3 library
USE_CUDNN = 0

#whether to use NCCL library
USE_NCCL = 0
#add the path to NCCL library
USE_NCCL_PATH = NONE

# whether use opencv during compilation
# you can disable it, however, you will not able to use
# imbin iterator
USE_OPENCV = 0

#whether use libjpeg-turbo for image decode without OpenCV wrapper
USE_LIBJPEG_TURBO = 0
#add the path to libjpeg-turbo library
USE_LIBJPEG_TURBO_PATH = NONE

# use openmp for parallelization
USE_OPENMP = 1

# MKL ML Library for Intel CPU/Xeon Phi
# Please refer to MKL_README.md for details

# MKL ML Library folder, need to be root for /usr/local
# Change to User Home directory for standard user
# For USE_BLAS!=mkl only
MKLML_ROOT=/usr/local

# whether use MKL2017 library
USE_MKL2017 = 0

# whether use MKL2017 experimental feature for high performance
# Prerequisite USE_MKL2017=1
USE_MKL2017_EXPERIMENTAL = 0

# whether use NNPACK library
USE_NNPACK = 0

# choose the version of blas you want to use
# can be: mkl, blas, atlas, openblas
# in default use atlas for linux while apple for osx
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S), Darwin)
USE_BLAS = apple
else
USE_BLAS = atlas
endif

# whether use lapack during compilation
# only effective when compiled with blas versions openblas/apple/atlas/mkl
USE_LAPACK = 0

# path to lapack library in case of a non-standard installation
USE_LAPACK_PATH =

# by default, disable lapack when using MKL
# switch on when there is a full installation of MKL available (not just MKL2017/MKL_ML)
ifeq ($(USE_BLAS), mkl)
USE_LAPACK = 0
endif

# add path to intel library, you may need it for MKL, if you did not add the path
# to environment variable
USE_INTEL_PATH = NONE

# If use MKL only for BLAS, choose static link automatically to allow python wrapper
ifeq ($(USE_MKL2017), 0)
ifeq ($(USE_BLAS), mkl)
USE_STATIC_MKL = 1
endif
else
USE_STATIC_MKL = NONE
endif

#----------------------------
# Settings for power and arm arch
#----------------------------
ARCH := $(shell uname -a)
ifneq (,$(filter $(ARCH), armv6l armv7l powerpc64le ppc64le aarch64))
	USE_SSE=0
else
	USE_SSE=1
endif

#----------------------------
# distributed computing
#----------------------------

# whether or not to enable multi-machine supporting
USE_DIST_KVSTORE = 0

# whether or not allow to read and write HDFS directly. If yes, then hadoop is
# required
USE_HDFS = 0

# path to libjvm.so. required if USE_HDFS=1
LIBJVM=$(JAVA_HOME)/jre/lib/amd64/server

# whether or not allow to read and write AWS S3 directly. If yes, then
# libcurl4-openssl-dev is required, it can be installed on Ubuntu by
# sudo apt-get install -y libcurl4-openssl-dev
USE_S3 = 0

#----------------------------
# performance settings
#----------------------------
# Use operator tuning
USE_OPERATOR_TUNING = 1

# Use gperftools if found
USE_GPERFTOOLS = 0

# Use JEMalloc if found, and not using gperftools
USE_JEMALLOC = 0

#----------------------------
# additional operators
#----------------------------

# path to folders containing projects specific operators that you don't want to put in src/operators
EXTRA_OPERATORS =

#----------------------------
# other features
#----------------------------

# Create C++ interface package
USE_CPP_PACKAGE = 0

#----------------------------
# plugins
#----------------------------

# whether to use caffe integration. This requires installing caffe.
# You also need to add CAFFE_PATH/build/lib to your LD_LIBRARY_PATH
# CAFFE_PATH = $(HOME)/caffe
# MXNET_PLUGINS += plugin/caffe/caffe.mk

# whether to use torch integration. This requires installing torch.
# You also need to add TORCH_PATH/install/lib to your LD_LIBRARY_PATH
# TORCH_PATH = $(HOME)/torch
# MXNET_PLUGINS += plugin/torch/torch.mk

# WARPCTC_PATH = $(HOME)/warp-ctc
# MXNET_PLUGINS += plugin/warpctc/warpctc.mk

# whether to use sframe integration. This requires build sframe
# git@github.com:dato-code/SFrame.git
# SFRAME_PATH = $(HOME)/SFrame
# MXNET_PLUGINS += plugin/sframe/plugin.mk

And it gives a bit more info about the segfault. Running julia lenet-stn.jl returns

--2017-12-10 17:29:38--  http://data.mxnet.io/mxnet/data/mnist.zip
Resolving data.mxnet.io (data.mxnet.io)... 54.208.175.7
Connecting to data.mxnet.io (data.mxnet.io)|54.208.175.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11595270 (11M) [application/zip]
Saving to: 'mnist.zip'

mnist.zip                                            100%[=====================================================================================================================>]  11.06M   314KB/s    in 36s     

2017-12-10 17:30:14 (318 KB/s) - 'mnist.zip' saved [11595270/11595270]

Archive:  mnist.zip
  inflating: t10k-images-idx3-ubyte  
  inflating: t10k-labels-idx1-ubyte  
  inflating: train-images-idx3-ubyte  
  inflating: train-labels-idx1-ubyte  
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 13 MB allocated on CPU0
INFO: Start training...

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /opt/incubator-mxnet/lib/libmxnet.so(_ZN5mxnet15segfault_loggerEi+0x44) [0x7fe091483e1e]
[bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x33030) [0x7fe0f2b59030]
[bt] (2) /opt/incubator-mxnet/lib/libmxnet.so(_ZN7mshadow24BilinearSamplingBackwardIfEEvRKNS_6TensorINS_3cpuELi4ET_EERKNS1_IS2_Li3ES3_EES6_S6_+0x683) [0x7fe091358084]
[bt] (3) /opt/incubator-mxnet/lib/libmxnet.so(_ZN5mxnet2op20SpatialTransformerOpIN7mshadow3cpuEfE8BackwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS9_EESD_SD_RKS8_INS_9OpReqTypeESaISE_EESD_SD_+0x403) [0x7fe0913404bf]
[bt] (4) /opt/incubator-mxnet/lib/libmxnet.so(_ZN5mxnet2op13OperatorState8BackwardERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS6_EERKS5_INS_9OpReqTypeESaISB_EESA_+0x473) [0x7fe090d6f641]
[bt] (5) /opt/incubator-mxnet/lib/libmxnet.so(_ZN5mxnet2op16LegacyOpBackwardERKNS_10OpStatePtrERKNS_9OpContextERKSt6vectorINS_5TBlobESaIS8_EERKS7_INS_9OpReqTypeESaISD_EESC_+0x4b) [0x7fe090d695dc]
[bt] (6) /opt/incubator-mxnet/lib/libmxnet.so(_ZNSt17_Function_handlerIFvRKN5mxnet10OpStatePtrERKNS0_9OpContextERKSt6vectorINS0_5TBlobESaIS8_EERKS7_INS0_9OpReqTypeESaISD_EESC_EPSI_E9_M_invokeERKSt9_Any_dataS3_S6_SC_SH_SC_+0x91) [0x7fe090d74ea3]
[bt] (7) /opt/incubator-mxnet/lib/libmxnet.so(_ZNKSt8functionIFvRKN5mxnet10OpStatePtrERKNS0_9OpContextERKSt6vectorINS0_5TBlobESaIS8_EERKS7_INS0_9OpReqTypeESaISD_EESC_EEclES3_S6_SC_SH_SC_+0xa6) [0x7fe090dbc372]
[bt] (8) /opt/incubator-mxnet/lib/libmxnet.so(_ZN5mxnet4exec23StatefulComputeExecutor3RunENS_10RunContextEb+0x91) [0x7fe09140dbdf]
[bt] (9) /opt/incubator-mxnet/lib/libmxnet.so(+0x3ced91d) [0x7fe0913f691d]

I hope that this helps to solve the problem. I am busy this week, so I can only do more tests next weekend.

@phinzphinz

Furthermore, I think it is an MXNet.jl or Julia related issue, because one time (I don't know the config.mk configuration anymore, but it was with a manual ENV["MXNET_HOME"]=... setting) it worked a bit: I could train on the GPU, but only until I loaded using Plots. After using Plots, training behaved buggily with the SGD optimizer (the MSE() exploded after a few steps and then all weights were NA), but the Adagrad optimizer still worked normally. When I put using Plots at the beginning, Julia already complained about train_provider = mx.ArrayDataProvider(:data=>trainx, :linreg_label=>trainy, batch_size=100000,shuffle=true); I think the error was something about read-only memory? After a fresh Debian installation, it didn't even work anymore without using Plots, but always segfaulted when training.
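
(For anyone else rebuilding libmxnet repeatedly: a minimal sketch, assuming MXNet.jl's build script honors ENV["MXNET_HOME"], of pointing the package at a local build and then checking which libmxnet.so actually got loaded. The /opt/incubator-mxnet path is just the build location from the stack trace above; adjust it to your checkout.)

# point MXNet.jl at an existing libmxnet build instead of letting Pkg build its own
ENV["MXNET_HOME"] = "/opt/incubator-mxnet"   # example path; adjust to your checkout
Pkg.build("MXNet")

# after loading the package, confirm which shared library was actually picked up
using MXNet
filter(p -> contains(p, "libmxnet"), Libdl.dllist())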

@iblislin
Member

Here is my gdb trace:

Thread 37 "julia" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff35a15700 (LWP 13819)]
0x00007fff83e77ba0 in mshadow::BilinearSamplingBackward<float> (input_grad=..., grid_src_data=..., output_grad=..., 
    input_data=...) at src/operator/spatial_transformer.cc:120
120                   *(g_input + data_index + 1) += *(grad + grad_index) * top_left_y_w
(gdb) bt
#0  0x00007fff83e77ba0 in mshadow::BilinearSamplingBackward<float> (input_grad=..., grid_src_data=..., output_grad=..., 
    input_data=...) at src/operator/spatial_transformer.cc:120
#1  0x00007fff83e5f18c in mxnet::op::SpatialTransformerOp<mshadow::cpu, float>::Backward (this=0x38bcd30, ctx=..., 
    out_grad=std::vector of length 1, capacity 1 = {...}, in_data=std::vector of length 2, capacity 2 = {...}, 
    out_data=std::vector of length 3, capacity 3 = {...}, req=std::vector of length 2, capacity 2 = {...}, 
    in_grad=std::vector of length 2, capacity 2 = {...}, aux_args=std::vector of length 0, capacity 0)
    at src/operator/./spatial_transformer-inl.h:136

I guess there is something wrong with the shape or indexing.

@iblislin
Member

(gdb) p grad
$1 = (const float *) 0x7fff251e6f90
(gdb) p top_left_y_w
$2 = 0.376614928
(gdb) p grad_index
$3 = 0
(gdb) p *(grad + grad_index)                                                                                              
$4 = 0.00177509966
(gdb) p g_input + data_index + 1
$5 = (float *) 0x80032442cf50
(gdb) p g_input
$6 = (float *) 0x7fff2442cf50
(gdb) p data_index
$7 = 4294967295

oh.. data_index is weird....
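
That value is a giveaway: 4294967295 is exactly what -1 looks like in an unsigned 32-bit index. My guess at the index arithmetic, illustrated in Julia (this is not the libmxnet code): the STN sampling coordinate is normalized to [-1, 1]; if it drifts slightly outside that range, the floored source index becomes -1, and stored in an unsigned index type it wraps around, so the gradient accumulation writes far outside the input buffer.

# illustration of the suspected index arithmetic, not libmxnet code
w = 28                                  # MNIST input width
x = -1.05                               # sampling coordinate just outside [-1, 1]
src_x = (x + 1) * (w - 1) / 2           # map [-1, 1] onto [0, w-1]  => -0.675
top_left_x = floor(Int, src_x)          # -1
reinterpret(UInt32, Int32(top_left_x))  # 0xffffffff == 4294967295, the data_index above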

@iblislin
Member

I can reproduce the segfault by changing the optimizer to adam, which is the same as in our code.

% ./train_mnist.py --network lenet --add_stn --optimizer adam
INFO:root:start with arguments Namespace(add_stn=True, batch_size=64, disp_batches=100, dtype='float32', gpus=None, kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='lenet', num_classes=10, num_epochs=20, num_examples=60000, num_layers=None, optimizer='adam', test_io=0, top_k=0, wd=0.0001)

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2559619) [0x7f642acdd619]
[bt] (1) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f645935b4b0]
[bt] (2) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2527f9d) [0x7f642acabf9d]
[bt] (3) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x252a9f6) [0x7f642acae9f6]
[bt] (4) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2203f87) [0x7f642a987f87]
[bt] (5) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fea13b) [0x7f642a76e13b]
[bt] (6) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fee562) [0x7f642a772562]
[bt] (7) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fd0cbd) [0x7f642a754cbd]
[bt] (8) /home/iblis/venv/py3/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x1fd48c1) [0x7f642a7588c1]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f6454474c80]

so... simply switching to SGD works:

diff --git a/examples/mnist/lenet-stn.jl b/examples/mnist/lenet-stn.jl
index 23ca9de..60f2def 100644
--- a/examples/mnist/lenet-stn.jl
+++ b/examples/mnist/lenet-stn.jl
@@ -57,6 +57,6 @@
 model = mx.FeedForward(lenet, context=mx.cpu())

 # optimizer
-optimizer = mx.ADAM(lr=0.01, weight_decay=0.00001)
+optimizer = mx.SGD(lr=0.1, momentum=.9)

 # fit parameters

iblislin added a commit that referenced this issue Dec 11, 2017
make its optimizer configured same as Python's

fix #369
iblislin linked a pull request Dec 11, 2017 that will close this issue
@rickhg12hs
Author

So, does this mean there is something wrong in libmxnet.so?

@iblislin
Member

@rickhg12hs it seems ADAM makes some values go negative, and then libmxnet.so blows up.

@iblislin
Member

iblislin commented Dec 12, 2017

So, does this mean there is something wrong in libmxnet.so?

well, not exactly, IMO.
Maybe libmxnet should protect itself from accepting negative input,
or... maybe ADAM is too aggressive in this case.
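
Something like a bounds check in the backward accumulation would be that kind of protection. A sketch of the idea in Julia (not the libmxnet code): only accumulate into the input gradient when the sampled source pixel actually lies inside the input, so an out-of-range coordinate is dropped instead of turning into an out-of-bounds write.

# sketch of the guard idea, not libmxnet code
function accumulate_grad!(g_input::AbstractMatrix, y::Integer, x::Integer, g)
    h, w = size(g_input)
    if 1 <= y <= h && 1 <= x <= w       # skip out-of-range sampling locations
        g_input[y, x] += g
    end
    return g_input
end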

@rickhg12hs
Author

rickhg12hs commented Dec 12, 2017

lenet-stn.jl runs without segfaulting after the pull request edits. The accuracy after several epochs is horrible, but that is a separate issue (maybe).

@iblislin
Member

I changed the momentum to 0.1 and set n_epoch=15 (doing early stopping as a kind of regularization), and then it works fine.

optimizer = mx.SGD(lr=0.1, momentum=.1)

@iblislin
Member

🤔 ignore my post,
I'm tuning other configs.

@iblislin
Member

try this? 8e99fa9

@iblislin
Member

I got this on my machine:

INFO: == Epoch 020/020 ==========
INFO: ## Training summary
INFO:           accuracy = 0.9965
INFO:               time = 5.2912 seconds
INFO: ## Validation summary
INFO:           accuracy = 0.9917
INFO: Finish training on MXNet.mx.Context[GPU0]

@rickhg12hs
Author

rickhg12hs commented Dec 12, 2017

Using the edits in 8e99fa9, I get a segfault.

$ /usr/local/src/julia/julia/julia ./lenet-stn.jl 
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 14 MB allocated on CPU0
INFO: Start training...

signal (11): Segmentation fault
while loading /home/rick/tmp/mnist/MXNet/lenet-stn.jl, in expression starting on line 70
Segmentation fault (core dumped)

@iblislin
Member

hmm, I believe it's a bug in libmxnet now.
My GPU build invokes cuDNN, and it works without a segfault in all cases.

@iblislin
Member

I reported this issue to upstream: apache/mxnet#9050

@adrianloy

I think it is a bug in the STN layer. I also had some issues with it: I train a model using the simple_bind API, and sometimes I get segfaults, sometimes not. It seems to depend on the random parameter initialization. The GDB stack trace told me it was in the BilinearSamplingBackward method, the same as mentioned here before.

@iblislin
Member

@adrianloy do you have a GPU and can you try out cuDNN?
