Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: Catch exceptions in CudaGridExecutioner#flushQueue and rethrow with help message #6493

Closed
DrChainsaw opened this issue Sep 26, 2018 · 6 comments

Comments

@DrChainsaw
Copy link

commented Sep 26, 2018

Sorry for confusing title, here is the story:

It seems like CudaGridExecutioner sometimes queues up operations which to be executed at some later point in time. This might (and sometimes do) cause exceptions for "past faults" to be thrown when doing some later operation, making it hard to understand what the origin of the fault is.

For instance, an exception for trying to access an element out of bounds could be thrown when calling Nd4j.create far from where the fault was made (resulting in the Nd4j.create call being blamed for something like "Op.x.length not equals to Op.y.length").

Inspecting the stacktrace might raise the user's suspicions that there error was delayed, but it seems like a low hanging fruit to just put a try-catch block around the execution of the queued op and have the catch rethrow with a clear message to the user that the exception might been delayed, preferably with some instructions on what to do should that be the case (e.g. try with native CPU backend).

@AlexDBlack AlexDBlack added the ND4J label Sep 27, 2018

@Charele

This comment has been minimized.

Copy link

commented Sep 27, 2018

CudaGridExecutioner?

@DrChainsaw DrChainsaw changed the title Suggestion: Catch exceptions in CudaGridManager#flushQueue and rethrow with help message Suggestion: Catch exceptions in CudaGridExecutioner#flushQueue and rethrow with help message Sep 27, 2018

@DrChainsaw

This comment has been minimized.

Copy link
Author

commented Sep 27, 2018

Yes I mean CudaGridExecutioner. I guess my mind was somewhere else (twice?!) when I typed. Corrected now. Thanks for pointing it out.

@AlexDBlack

This comment has been minimized.

Copy link
Contributor

commented Oct 3, 2018

@DrChainsaw Did you have a test case or a way to reproduce this?
I'm sure I've seen this sort of thing before, but I all of the intentional errors I've tried so far have been caught by op/argument validation...

@DrChainsaw

This comment has been minimized.

Copy link
Author

commented Oct 3, 2018

Here is one which triggers 100% on my setup on 1.0.0-beta2. It fails at a different place compared to what I saw in the real program, but it seems to be doing roughly the same thing (dequeuing an op and then executing it).

I haven't dug into the internals but in case the queueing of execution is due to some low level stuff I guess it could behave differently on another setup. I'm using windows 10 with a GTX 980 ti and driver version 398.36.

Btw, is the class javadoc for CudaGridExecutioner really correct? Not that I'm running production code, but is there a choice besides CPU backend for those who do?

Testcase:

import org.junit.Test;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.exception.ND4JIllegalStateException;
import org.nd4j.linalg.factory.Nd4j;

public class DelayedExceptionTest {

    /**
     * Throws an exception for invalid assign after an action which is not related to the actual error
     */
    @Test
    public void dueToInvalidAssign() {

        final INDArray arr1 = Nd4j.create(1, 2, 3, 4);
        final INDArray arr2 = Nd4j.create(1, 2, 3, 5);

        System.out.println("This should throw an exception I guess...");
        arr1.assign(arr2);

        // Fails right away as well
        //((CudaGridExecutioner)Nd4j.getExecutioner()).flushQueue();

        System.out.println("No, problem? Ok, lets go on with the program");
        Nd4j.create(1,2,3);

        try {
            System.out.println("Ok then, and now I need to do this");
            Nd4j.ones(3, 4, 5, 6);
        } catch (ND4JIllegalStateException e) {
            throw new IllegalStateException("PairwiseTransform?? Wuuuut??!", e);
        }
        System.out.println("YMMW, try do more stuff if you reach this point to make the buffer flush (or just flush it yourself)");
    }
}

Output:

This should throw an exception I guess...
No, problem? Ok, lets go on with the program
Ok then, and now I need to do this

java.lang.IllegalStateException: PairwiseTransform?? Wuuuut??!
...
Caused by: org.nd4j.linalg.exception.ND4JIllegalStateException: X, Y and Z arguments should have the same length for PairwiseTransform
	at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.invoke(CudaExecutioner.java:1621)
	at org.nd4j.linalg.jcublas.ops.executioner.CudaGridExecutioner.pushToGrid(CudaGridExecutioner.java:245)
	at org.nd4j.linalg.jcublas.ops.executioner.CudaGridExecutioner.processAsGridOp(CudaGridExecutioner.java:327)
	at org.nd4j.linalg.jcublas.ops.executioner.CudaGridExecutioner.exec(CudaGridExecutioner.java:132)
	at org.nd4j.linalg.api.ndarray.BaseNDArray.assign(BaseNDArray.java:4038)
	at org.nd4j.linalg.factory.BaseNDArrayFactory.ones(BaseNDArrayFactory.java:1069)
	at org.nd4j.linalg.factory.Nd4j.ones(Nd4j.java:5244)
	at ampcontrol.model.training.model.evolve.transfer.DelayedExceptionTest.dueToInvalidAssign(DelayedExceptionTest.java:30)
	... 27 more


Process finished with exit code -1
@Charele

This comment has been minimized.

Copy link

commented Oct 4, 2018

yes, I understand your mean, the exception message will point to other line number,
not the real line number.

@AlexDBlack AlexDBlack self-assigned this Oct 15, 2018

AlexDBlack added a commit that referenced this issue Oct 15, 2018
AlexDBlack added a commit that referenced this issue Oct 15, 2018
Misc DL4J/Nd4J fixes (#6581)
* #6565 Fix GraphBuilder.removeVertex for non-modifiable lists

* #6562 fix Cnn3DLossLayer javadoc

* #6512 OldSoftMax validation for invalid rank inputs

* #6493 Warn users about delayed exceptions with CUDA grid executioner

* #6493 Warn users about delayed exceptions with CUDA grid executioner
@lock

This comment has been minimized.

Copy link

commented Nov 14, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Nov 14, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
3 participants
You can’t perform that action at this time.