
YOLO models trained with COCO and Darknet53 return CUDA memory error (MXNet engine) #78

Closed
danhlephuoc opened this issue May 30, 2020 · 10 comments
Labels
bug Something isn't working

Comments

@danhlephuoc

danhlephuoc commented May 30, 2020

Description

The ObjectDetection example with the pre-trained YOLO models (dataset=coco, backbone=darknet53) returns the error:

"MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered"

Expected Behavior

The ObjectDetection example should work with the different pre-trained YOLO models. Note that the YOLO models trained on Pascal VOC work just fine.

Error Message

[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
Loading: 100% |████████████████████████████████████████|
[11:32:04] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.6.0. Attempting to upgrade...
[11:32:04] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
model yolo
[11:32:13] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[WARNING]
ai.djl.engine.EngineException: MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
Stack trace:
File "/codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h", line 81

at ai.djl.mxnet.jna.JnaUtils.checkCall (JnaUtils.java:1788)
at ai.djl.mxnet.jna.JnaUtils.syncCopyToCPU (JnaUtils.java:473)
at ai.djl.mxnet.engine.MxNDArray.toByteBuffer (MxNDArray.java:283)
at ai.djl.ndarray.NDArray.toIntArray (NDArray.java:279)
at ai.djl.modality.cv.translator.YoloTranslator.processOutput (YoloTranslator.java:40)
at ai.djl.modality.cv.translator.YoloTranslator.processOutput (YoloTranslator.java:26)
at ai.djl.inference.Predictor.processOutputs (Predictor.java:202)
at ai.djl.inference.Predictor.batchPredict (Predictor.java:160)
at ai.djl.inference.Predictor.predict (Predictor.java:112)
at ai.djl.examples.inference.ObjectDetectionBench.predict (ObjectDetectionBench.java:71)
at ai.djl.examples.inference.ObjectDetectionBench.main (ObjectDetectionBench.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16.461 s
[INFO] Finished at: 2020-05-30T11:32:18+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project examples: An exception occured while executing the Java class. MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
[ERROR] Stack trace:
[ERROR] File "/codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h", line 81
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[11:32:18] src/resource.cc:279: Ignore CUDA Error [11:32:18] src/storage/./pooled_storage_manager.h:97: CUDA: an illegal memory access was encountered

[11:32:18] src/engine/threaded_engine_perdevice.cc:275: Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
[11:32:18] src/engine/threaded_engine_perdevice.cc:275: Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
[11:32:18] src/engine/threaded_engine_perdevice.cc:275: Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
[11:32:18] src/engine/threaded_engine_perdevice.cc:275: Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered

terminate called after throwing an instance of 'dmlc::Error'
what(): [11:32:18] src/storage/./pooled_storage_manager.h:97: CUDA: an illegal memory access was encountered

Aborted (core dumped)

How to Reproduce?

  1. Change the Criteria configuration in the 'predict' method of ai.djl.examples.inference.ObjectDetection.java (a fuller sketch of the change follows these steps):

from

    .optFilter("backbone", "resnet50")

to

    .optFilter("dataset", "coco")
    .optFilter("imageSize", "416")
    .optFilter("backbone", "darknet53")

  2. Run:

    mvn exec:java -Dexec.mainClass="ai.djl.examples.inference.ObjectDetection"
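
For context, a minimal sketch of the modified predict method, based on the 0.5.x-era DJL example (package and class names are from DJL; surrounding code may differ between versions, and `img` is whatever input image the example loads):

    import java.awt.image.BufferedImage;
    import java.io.IOException;

    import ai.djl.Application;
    import ai.djl.ModelException;
    import ai.djl.inference.Predictor;
    import ai.djl.modality.cv.output.DetectedObjects;
    import ai.djl.repository.zoo.Criteria;
    import ai.djl.repository.zoo.ModelZoo;
    import ai.djl.repository.zoo.ZooModel;
    import ai.djl.training.util.ProgressBar;
    import ai.djl.translate.TranslateException;

    public static DetectedObjects predict(BufferedImage img)
            throws IOException, ModelException, TranslateException {
        Criteria<BufferedImage, DetectedObjects> criteria =
                Criteria.builder()
                        .optApplication(Application.CV.OBJECT_DETECTION)
                        .setTypes(BufferedImage.class, DetectedObjects.class)
                        // replaces the default .optFilter("backbone", "resnet50")
                        .optFilter("dataset", "coco")
                        .optFilter("imageSize", "416")
                        .optFilter("backbone", "darknet53")
                        .optProgress(new ProgressBar())
                        .build();

        // load the matched model from the zoo and run a single prediction
        try (ZooModel<BufferedImage, DetectedObjects> model = ModelZoo.loadModel(criteria);
                Predictor<BufferedImage, DetectedObjects> predictor = model.newPredictor()) {
            return predictor.predict(img);
        }
    }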

Environment Info

Ubuntu, CUDA 10.2, GPU V100

@danhlephuoc danhlephuoc added the bug Something isn't working label May 30, 2020
@lanking520
Contributor

lanking520 commented May 30, 2020

Thanks for reporting the issue! Could you please try the following:

Add the snapshot repository "https://oss.sonatype.org/content/repositories/snapshots/" to your Maven repositories.

And use the 0.6.0-SNAPSHOT version.

We recently fixed an issue related to this. It might be caused by an NDArray that is on CPU rather than on GPU, so the GPU cannot find the pointer, which causes the crash.
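
A sketch of what that change might look like in the examples pom.xml (the repository id is arbitrary, and only the DJL artifacts your project already declares need the version bump):

    <repositories>
        <repository>
            <id>sonatype-snapshots</id>
            <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>api</artifactId>
            <version>0.6.0-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>ai.djl.mxnet</groupId>
            <artifactId>mxnet-engine</artifactId>
            <version>0.6.0-SNAPSHOT</version>
        </dependency>
    </dependencies>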

@danhlephuoc
Author

I changed to the 0.6.0-SNAPSHOT version and got the same error. Btw, I tested on my Mac Pro without a GPU, and the YOLO models with Darknet53 and COCO work fine on CPU.

@frankfliu
Contributor

@danhlephuoc Is it possible for you to share your repo?
Are you using multiple GPUs for the training? Can you try limiting it to a single GPU?

The commit lanking520 mentioned is here:
0ca79f4

It adds error checking to identify where the device mismatch happens. You still need to fix the NDArray creation.
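
One way to pin the process to a single GPU (assuming the V100 is device 0) before launching the example:

    export CUDA_VISIBLE_DEVICES=0
    mvn exec:java -Dexec.mainClass="ai.djl.examples.inference.ObjectDetection"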

@danhlephuoc
Author

danhlephuoc commented May 30, 2020

@frankfliu: I've forked the repo and changed the code to trigger the error at https://github.com/danhlephuoc/djl.git ; just clone it and run the following command to reproduce the error:

mvn exec:java -Dexec.mainClass="ai.djl.examples.inference.ObjectDetection"

I already limited it to one GPU via CUDA_VISIBLE_DEVICES, but the error persists.

@stu1130
Contributor

stu1130 commented May 30, 2020

I can reproduce the issue with a single GPU.
When I used export MXNET_ENGINE_TYPE=NaiveEngine, I saw:

Exception in thread "main" ai.djl.engine.EngineException: MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
Stack trace:
  File "/codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h", line 81

        at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1788)
        at ai.djl.mxnet.jna.JnaUtils.cachedOpInvoke(JnaUtils.java:1757)
        at ai.djl.mxnet.engine.CachedOp.forward(CachedOp.java:133)
        at ai.djl.mxnet.engine.MxSymbolBlock.forward(MxSymbolBlock.java:145)
        at ai.djl.nn.Block.forward(Block.java:116)
        at ai.djl.inference.Predictor.predict(Predictor.java:117)
        at ai.djl.inference.Predictor.batchPredict(Predictor.java:157)
        at ai.djl.inference.Predictor.predict(Predictor.java:112)
        at ai.djl.examples.inference.ObjectDetection.predict(ObjectDetection.java:68)
        at ai.djl.examples.inference.ObjectDetection.main(ObjectDetection.java:47)
[18:09:57] src/resource.cc:279: Ignore CUDA Error [18:09:57] src/storage/./pooled_storage_manager.h:97: CUDA: an illegal memory access was encountered

So it could be a problem with our symbolic model.

@stu1130
Contributor

stu1130 commented Jul 28, 2020

I have tried the same model with the same libmxnet using the MXNet Python API, and it worked fine. The next step is to dive deeper into our CachedOp.

@stu1130
Contributor

stu1130 commented Jul 29, 2020

@danhlephuoc Hi, after digging deeper, the root cause is that the input image is too large, which causes GPU OOM. When I reduced the size of the input image, it worked perfectly. In addition, I tried MXNet Python with the same image size, and it failed as well. The PR 4629a6c fixes it.
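
For anyone hitting the same limit before the fix lands, a minimal workaround sketch that downscales the input with plain java.awt before prediction (the file path and the 416x416 target are illustrative, matching this model's imageSize filter):

    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    // Downscale the image on the CPU so the GPU never has to hold the
    // full-resolution input tensor.
    BufferedImage original = ImageIO.read(new File("large_input.jpg")); // illustrative path
    BufferedImage resized = new BufferedImage(416, 416, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = resized.createGraphics();
    g.drawImage(original, 0, 0, 416, 416, null); // scales to 416x416
    g.dispose();
    // pass 'resized' to predictor.predict(...) instead of 'original'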

@stu1130
Contributor

stu1130 commented Jul 29, 2020

Feel free to reopen the issue if you have any other questions.

@stu1130 stu1130 closed this as completed Jul 29, 2020
@stu1130
Contributor

stu1130 commented Jul 31, 2020

I think I found the problem, @danhlephuoc. When I tried the original gluoncv model, it works with images as large as 1000 * 1000. To run on DJL, though, we have to hybridize the model, and the hybridized model fails to execute on the GPU after upgrading MXNet from 1.6 to 1.7. I created a minimal reproducible script: apache/mxnet#18834. I will keep you posted.

@stu1130 stu1130 reopened this Jul 31, 2020
@stu1130
Contributor

stu1130 commented Jan 4, 2021

As it is an MXNet issue, I will close this one and update our MXNet artifact once it is fixed.

@stu1130 stu1130 closed this as completed Jan 4, 2021
Lokiiiiii pushed a commit to Lokiiiiii/djl that referenced this issue Oct 10, 2023