
YOLO models trained with COCO and Darknet53 return CUDA memory error (MXNet engine) #78

Closed
danhlephuoc opened this issue May 30, 2020 · 10 comments
Labels
bug Something isn't working

Comments

@danhlephuoc

danhlephuoc commented May 30, 2020

Description

The ObjectDetection example with the pre-trained YOLO models (dataset=coco, backbone=darknet53) returns the error:

"MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered"

Expected Behavior

The ObjectDetection example should work with the different pre-trained YOLO models. Note that the YOLO models trained on Pascal VOC work just fine.

Error Message

[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
Loading: 100% |████████████████████████████████████████|
[11:32:04] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.6.0. Attempting to upgrade...
[11:32:04] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
model yolo
[11:32:13] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[WARNING]
ai.djl.engine.EngineException: MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
Stack trace:
File "/codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h", line 81

at ai.djl.mxnet.jna.JnaUtils.checkCall (JnaUtils.java:1788)
at ai.djl.mxnet.jna.JnaUtils.syncCopyToCPU (JnaUtils.java:473)
at ai.djl.mxnet.engine.MxNDArray.toByteBuffer (MxNDArray.java:283)
at ai.djl.ndarray.NDArray.toIntArray (NDArray.java:279)
at ai.djl.modality.cv.translator.YoloTranslator.processOutput (YoloTranslator.java:40)
at ai.djl.modality.cv.translator.YoloTranslator.processOutput (YoloTranslator.java:26)
at ai.djl.inference.Predictor.processOutputs (Predictor.java:202)
at ai.djl.inference.Predictor.batchPredict (Predictor.java:160)
at ai.djl.inference.Predictor.predict (Predictor.java:112)
at ai.djl.examples.inference.ObjectDetectionBench.predict (ObjectDetectionBench.java:71)
at ai.djl.examples.inference.ObjectDetectionBench.main (ObjectDetectionBench.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16.461 s
[INFO] Finished at: 2020-05-30T11:32:18+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project examples: An exception occured while executing the Java class. MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
[ERROR] Stack trace:
[ERROR] File "/codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h", line 81
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[11:32:18] src/resource.cc:279: Ignore CUDA Error [11:32:18] src/storage/./pooled_storage_manager.h:97: CUDA: an illegal memory access was encountered

[11:32:18] src/engine/threaded_engine_perdevice.cc:275: Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
[11:32:18] src/engine/threaded_engine_perdevice.cc:275: Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
[11:32:18] src/engine/threaded_engine_perdevice.cc:275: Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
[11:32:18] src/engine/threaded_engine_perdevice.cc:275: Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered

terminate called after throwing an instance of 'dmlc::Error'
what(): [11:32:18] src/storage/./pooled_storage_manager.h:97: CUDA: an illegal memory access was encountered

Aborted (core dumped)

How to Reproduce?

  1. Change the Criteria configuration in the 'predict' method of ai.djl.examples.inference.ObjectDetection.java (a fuller sketch of the change follows these steps):

from

    .optFilter("backbone", "resnet50")

to

    .optFilter("dataset", "coco")
    .optFilter("imageSize", "416")
    .optFilter("backbone", "darknet53")

  2. Run:

    mvn exec:java -Dexec.mainClass="ai.djl.examples.inference.ObjectDetection"
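
For context, a minimal sketch of the modified predict method, based on the 0.5.x-era DJL example (package and class names are from DJL; surrounding code may differ between versions, and `img` is whatever input image the example loads):

    import java.awt.image.BufferedImage;
    import java.io.IOException;

    import ai.djl.Application;
    import ai.djl.ModelException;
    import ai.djl.inference.Predictor;
    import ai.djl.modality.cv.output.DetectedObjects;
    import ai.djl.repository.zoo.Criteria;
    import ai.djl.repository.zoo.ModelZoo;
    import ai.djl.repository.zoo.ZooModel;
    import ai.djl.training.util.ProgressBar;
    import ai.djl.translate.TranslateException;

    public static DetectedObjects predict(BufferedImage img)
            throws IOException, ModelException, TranslateException {
        Criteria<BufferedImage, DetectedObjects> criteria =
                Criteria.builder()
                        .optApplication(Application.CV.OBJECT_DETECTION)
                        .setTypes(BufferedImage.class, DetectedObjects.class)
                        // replaces the default .optFilter("backbone", "resnet50")
                        .optFilter("dataset", "coco")
                        .optFilter("imageSize", "416")
                        .optFilter("backbone", "darknet53")
                        .optProgress(new ProgressBar())
                        .build();

        // load the matched model from the zoo and run a single prediction
        try (ZooModel<BufferedImage, DetectedObjects> model = ModelZoo.loadModel(criteria);
                Predictor<BufferedImage, DetectedObjects> predictor = model.newPredictor()) {
            return predictor.predict(img);
        }
    }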

Environment Info

Ubuntu, CUDA 10.2, GPU V100

@danhlephuoc danhlephuoc added the bug Something isn't working label May 30, 2020
@lanking520
Contributor

lanking520 commented May 30, 2020

Thanks for reporting the issue! Could you please try the following:

Add the snapshot repository "https://oss.sonatype.org/content/repositories/snapshots/" to your Maven repositories.

And use the 0.6.0-SNAPSHOT version.

We recently fixed an issue related to this. It might be caused by an NDArray that is on CPU rather than on GPU, so the GPU cannot find the pointer, which causes the crash.
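
A sketch of what that change might look like in the examples pom.xml (the repository id is arbitrary, and only the DJL artifacts your project already declares need the version bump):

    <repositories>
        <repository>
            <id>sonatype-snapshots</id>
            <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>api</artifactId>
            <version>0.6.0-SNAPSHOT</version>
        </dependency>
        <dependency>
            <groupId>ai.djl.mxnet</groupId>
            <artifactId>mxnet-engine</artifactId>
            <version>0.6.0-SNAPSHOT</version>
        </dependency>
    </dependencies>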

@danhlephuoc
Author

I changed to the 0.6.0-SNAPSHOT version and got the same error. Btw, I tested on my Mac Pro without a GPU, and the YOLO models with Darknet53 and COCO work fine on CPU.

@frankfliu
Contributor

@danhlephuoc Is it possible for you to share your repo?
Are you using multiple GPUs for the training? Can you try limiting it to a single GPU?

The commit lanking520 mentioned is here:
0ca79f4

It adds error checking to identify where the device mismatch happens. You still need to fix the NDArray creation.
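
One way to pin the process to a single GPU (assuming the V100 is device 0) before launching the example:

    export CUDA_VISIBLE_DEVICES=0
    mvn exec:java -Dexec.mainClass="ai.djl.examples.inference.ObjectDetection"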

@danhlephuoc
Author

danhlephuoc commented May 30, 2020

@frankfliu: I've forked the repo and changed the code to trigger the error at https://github.com/danhlephuoc/djl.git ; just clone it and run the following command to reproduce the error:

mvn exec:java -Dexec.mainClass="ai.djl.examples.inference.ObjectDetection"

I already limited it to one GPU via CUDA_VISIBLE_DEVICES, but the error persists.

@stu1130
Contributor

stu1130 commented May 30, 2020

I can reproduce the issue with a single GPU.
When I used export MXNET_ENGINE_TYPE=NaiveEngine, I saw:

Exception in thread "main" ai.djl.engine.EngineException: MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
Stack trace:
  File "/codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h", line 81

        at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1788)
        at ai.djl.mxnet.jna.JnaUtils.cachedOpInvoke(JnaUtils.java:1757)
        at ai.djl.mxnet.engine.CachedOp.forward(CachedOp.java:133)
        at ai.djl.mxnet.engine.MxSymbolBlock.forward(MxSymbolBlock.java:145)
        at ai.djl.nn.Block.forward(Block.java:116)
        at ai.djl.inference.Predictor.predict(Predictor.java:117)
        at ai.djl.inference.Predictor.batchPredict(Predictor.java:157)
        at ai.djl.inference.Predictor.predict(Predictor.java:112)
        at ai.djl.examples.inference.ObjectDetection.predict(ObjectDetection.java:68)
        at ai.djl.examples.inference.ObjectDetection.main(ObjectDetection.java:47)
[18:09:57] src/resource.cc:279: Ignore CUDA Error [18:09:57] src/storage/./pooled_storage_manager.h:97: CUDA: an illegal memory access was encountered

So it could be a problem with our symbolic model.

@stu1130
Contributor

stu1130 commented Jul 28, 2020

I have tried the same model with the same libmxnet using the MXNet Python API, and it worked fine. The next step is to dive deeper into our CachedOp.

@stu1130
Contributor

stu1130 commented Jul 29, 2020

@danhlephuoc Hi, after digging deeper, the root cause is that the input image is too large, which causes GPU OOM. When I reduced the size of the input image, it worked perfectly. In addition, I tried MXNet Python with the same image size, and it failed as well. The PR 4629a6c fixes it.
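
For anyone hitting the same limit before the fix lands, a minimal workaround sketch that downscales the input with plain java.awt before prediction (the file path and the 416x416 target are illustrative, matching this model's imageSize filter):

    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    // Downscale the image on the CPU so the GPU never has to hold the
    // full-resolution input tensor.
    BufferedImage original = ImageIO.read(new File("large_input.jpg")); // illustrative path
    BufferedImage resized = new BufferedImage(416, 416, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = resized.createGraphics();
    g.drawImage(original, 0, 0, 416, 416, null); // scales to 416x416
    g.dispose();
    // pass 'resized' to predictor.predict(...) instead of 'original'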

@stu1130
Contributor

stu1130 commented Jul 29, 2020

Feel free to reopen the issue if you have any other questions.

@stu1130 stu1130 closed this as completed Jul 29, 2020
@stu1130
Contributor

stu1130 commented Jul 31, 2020

I think I found the problem, @danhlephuoc. When I tried the original gluoncv model, it works with images as large as 1000 * 1000. To run on DJL, though, we have to hybridize the model, and the hybridized model fails to execute on the GPU after upgrading MXNet from 1.6 to 1.7. I created a minimal reproducible script: apache/mxnet#18834. I will keep you posted.

@stu1130 stu1130 reopened this Jul 31, 2020
@stu1130
Contributor

stu1130 commented Jan 4, 2021

As it is an MXNet issue, I will close this one and update our MXNet artifact once it is fixed.

@stu1130 stu1130 closed this as completed Jan 4, 2021
Lokiiiiii pushed a commit to Lokiiiiii/djl that referenced this issue Oct 10, 2023