Debugging libnd4j

Adam Gibson edited this page Apr 27, 2016 · 5 revisions

Introduction

So, you've got a crash in libnd4j, now what? Valgrind and GDB are great tools for debugging memory corruption in native code, and they tolerate the JVM relatively well. One drawback is that they work only on Linux, OS X, and a few other platforms, but not on Windows.

Installation

Valgrind and GDB can be installed with most package managers, for example:

$ apt-get install valgrind gdb
$ yum install valgrind gdb
$ brew install valgrind gdb

Note: On OS X El Capitan you cannot use brew, as the stable version of valgrind only supports earlier versions of the OS. Instead, install valgrind from head as follows. It takes a while to check out everything.

svn co svn://svn.valgrind.org/valgrind/trunk valgrind
cd valgrind
./autogen.sh
./configure
make
make install

There is no longer a brew formula for gdb. To install gdb, use MacPorts (port install gdb), and start the MacPorts build of gdb with the command "ggdb".

Typical Workflow

  1. If you can reproduce the bug without OpenMP or any other optimization flags, make sure to remove all of them from the CMakeLists.txt files, and rebuild everything. Ideally we would like to leave only the -g flag to produce debugging information, for example:

    set(CMAKE_CXX_FLAGS "-Wall -g")
  2. Run the JVM inside Valgrind's own virtual machine. Because running a VM inside a VM gets a bit confusing, make sure to run Java without the JIT compiler, for example:

    $ valgrind --track-origins=yes --error-limit=no java -Djava.compiler=NONE -cp <the usual ...>

    The --track-origins=yes flag gives us more information about where uninitialised values come from. Also specify --error-limit=no so that Valgrind keeps reporting errors after the large number of them produced by the JVM itself.

    It's also possible to have the Maven Surefire Plugin execute it for us by first creating a script named something like valgrindJava in your path:

    #!/bin/sh
    valgrind --track-origins=yes --error-limit=no java -Djava.compiler=NONE "$@"

    And by executing Maven on your test this way:

    mvn test -Djvm=valgrindJava -Dtest=Nd4jTestsC#testName* <the usual ...>
    
  3. Wait for a while... Valgrind will report a lot of "Invalid write of size 4" errors caused by the JVM. Simply ignore them. At some point, just before the crash, you should see something interesting like this:

    ==14335== Conditional jump or move depends on uninitialised value(s)
    ==14335==    at 0x2BBF9637: shape::tadOffset(int, int*, int*, int) (shape.h:1452)
    ==14335==    by 0x2BC0C4A7: functions::broadcast::Broadcast<float>::exec(float*, int*, float*, int*, float*, int*, int*, int) (broadcasting.h:330)
    ==14335==    by 0x2BC012DF: NativeOpExcutioner<float>::execBroadcast(int, float*, int*, float*, int*, float*, int*, int*, int) (NativeOpExcutioner.h:110)
    ==14335==    by 0x2BBFF01A: NativeOps::execBroadcastFloat(long long*, int, long long, long long, long long, long long, long long, long long, long long, int) (NativeOps.cpp:778)
    ==14335==    by 0x2E6F71F3: Java_org_nd4j_nativeblas_NativeOps_execBroadcastFloat (in /tmp/javacpp29699219924451/libjniNativeOps.so)
    ==14335==    by 0x8080773: ???
    ==14335==    by 0x807398C: ???
    ==14335==    by 0x807370F: ???
    ==14335==    by 0x80737E3: ???
    ==14335==    by 0x807370F: ???
    ==14335==    by 0x807370F: ???
    ==14335==    by 0x80737E3: ???
    ==14335==  Uninitialised value was created by a heap allocation
    ==14335==    at 0x4C28D06: malloc (vg_replace_malloc.c:299)
    ==14335==    by 0x2BBF9B03: shape::createShapeInfo(int*, int*, int) (shape.h:1566)
    ==14335==    by 0x2BBFB008: shape::squeezeDimensions(int*, int**, int*, bool*, bool*, int, int) (shape.h:2253)
    ==14335==    by 0x2BC0C3D1: functions::broadcast::Broadcast<float>::exec(float*, int*, float*, int*, float*, int*, int*, int) (broadcasting.h:303)
    ==14335==    by 0x2BC012DF: NativeOpExcutioner<float>::execBroadcast(int, float*, int*, float*, int*, float*, int*, int*, int) (NativeOpExcutioner.h:110)
    ==14335==    by 0x2BBFF01A: NativeOps::execBroadcastFloat(long long*, int, long long, long long, long long, long long, long long, long long, long long, int) (NativeOps.cpp:778)
    ==14335==    by 0x2E6F71F3: Java_org_nd4j_nativeblas_NativeOps_execBroadcastFloat (in /tmp/javacpp29699219924451/libjniNativeOps.so)
    ==14335==    by 0x8080773: ???
    ==14335==    by 0x807398C: ???
    ==14335==    by 0x807370F: ???
    ==14335==    by 0x80737E3: ???
    ==14335==    by 0x807370F: ???
    ==14335== 
    
  4. Correct the problem. In this case, for example, replace malloc() with calloc() to have the memory initialized to all zeros.

  5. Rerun the problematic code with Valgrind to confirm that the error is gone. We might end up with another kind of crash though, for example SIGFPE, which most likely indicates a division by zero, something that is probably easier to diagnose.

  6. So, let's fire up GDB, either within your favorite IDE, or on the command line:

    $ gdb -ex run --args java -Djava.compiler=NONE -cp <the usual ...>

    Now, the JVM will produce a few "Program received signal SIGSEGV, Segmentation fault." messages. Simply ignore those by entering the "continue" command.

    Again, it's possible to have the Maven Surefire Plugin execute it for us by first creating a script named something like gdbJava in your path:

    #!/bin/sh
    gdb -ex run --args java -Djava.compiler=NONE "$@"

    And by executing Maven on your test this way:

    mvn test -Djvm=gdbJava -Dtest=Nd4jTestsC#testName* <the usual ...>
    
  7. With a bit of luck, the culprit of the crash will eventually show up, and we can poke around, for example:

    Program received signal SIGFPE, Arithmetic exception.
    0x00007fffd07f8e88 in shape::tensorsAlongDimension (shapeInfo=0x7ffff04b6170, dimension=0x7ffff074b080, dimensionLength=1)
        at .../libnd4j/include/shape.h:3422
    3422	                  / shape::prod(tensorShape, dimensionLength);
    (gdb) print dimensionLength
    $1 = 1
    (gdb) print (int[1])*tensorShape
    $2 = {0}
    

    Here, for some reason, the content of tensorShape ends up being 0... Time to look at the logic in the code!

Using the debugger with an external project

In CLion, when you create a new project, use this CMakeLists.txt to start (the normal environment variables such as LIBND4J_HOME still apply):

cmake_minimum_required(VERSION 3.5)
project(yourprojectname)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -O0")

set(SOURCE_FILES main.cpp)
link_directories($ENV{LIBND4J_HOME}/blasbuild/cpu/blas)
include_directories($ENV{LIBND4J_HOME}/include)
link_libraries(nd4j)
add_executable(testdimensioncollapse ${SOURCE_FILES})