YOLO Models trained with Coco and Darknet53 return Cuda Memory error (MxNet engine) #78
Comments
Thanks for reporting the issue! Could you please try the following: add the snapshot repository "https://oss.sonatype.org/content/repositories/snapshots/" to your Maven repositories and use version 0.6.0-SNAPSHOT. We recently fixed an issue related to this. It might be caused by an NDArray that is on the CPU rather than on the GPU, so the GPU cannot find the pointer, which causes the crash.
I changed to the 0.6.0-SNAPSHOT version and got the same error. By the way, I tested on my MacPro without a GPU, and the YOLO models with Darknet53 and COCO work fine on CPU.
@danhlephuoc is it possible for you to share your repo? The commit lanking520 mentioned adds an error check to identify where the device mismatch happens. You still need to fix the NDArray creation.
@frankfliu: I've forked the repo and changed the code to trigger the error at https://github.com/danhlephuoc/djl.git. Just clone it and run the following command to reproduce the error:
mvn exec:java -Dexec.mainClass="ai.djl.examples.inference.ObjectDetection"
I already limited it to 1 GPU via CUDA_VISIBLE_DEVICES, but the error remains.
I can reproduce the issue with a single GPU.
So it could be a problem with our symbolic model.
I tried the same model with the same libmxnet using MXNet Python, and it worked fine. The next step is to dive deeper into our CachedOp.
@danhlephuoc Hi, after digging deeper I found the root cause: the image is too large, which causes a GPU OOM. I reduced the size of the input image and it works perfectly. In addition, I tried MXNet Python with the same image size and it failed as well. PR 4629a6c fixes it.
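For anyone who wants to try the workaround from this comment (shrinking the input image before inference), here is a minimal sketch using only java.awt. The class name `ImageDownscaler`, the method `downscale`, and the `maxSide` parameter are hypothetical and are not part of DJL or of the PR mentioned above; how the resized image is then handed to the predictor depends on the DJL version in use.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Hypothetical helper (not a DJL class): shrinks an image before inference. */
final class ImageDownscaler {

    private ImageDownscaler() {}

    /**
     * Returns the image scaled so that its longer side is at most maxSide pixels,
     * preserving the aspect ratio. Images that already fit are returned unchanged.
     */
    static BufferedImage downscale(BufferedImage src, int maxSide) {
        if (maxSide <= 0) {
            throw new IllegalArgumentException("maxSide must be positive");
        }
        int w = src.getWidth();
        int h = src.getHeight();
        int longer = Math.max(w, h);
        if (longer <= maxSide) {
            return src; // already small enough, nothing to do
        }
        double scale = (double) maxSide / longer;
        int newW = Math.max(1, (int) Math.round(w * scale));
        int newH = Math.max(1, (int) Math.round(h * scale));

        BufferedImage dst = new BufferedImage(newW, newH, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Bilinear interpolation gives smoother results than the default when shrinking.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, newW, newH, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

Calling `ImageDownscaler.downscale(image, 416)` would cap the longer side at 416 pixels, matching the `imageSize` filter used in the reproduction steps; that limit is only an illustrative choice, not a verified threshold for the OOM.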
Feel free to reopen the issue if you have any other questions.
I think I found the problem, @danhlephuoc. When I tried the original gluoncv model, it worked with images as large as 1000 * 1000. To run it on DJL, we have to hybridize the model. The hybridized model fails to execute on the GPU with the current MXNet, after we upgraded MXNet from 1.6 to 1.7. I created a minimal reproducible script in apache/mxnet#18834. I will keep you posted.
As this is an MXNet issue, we will close it and update our MXNet artifact once it is fixed.
Description
The ObjectDetection example with the pre-trained YOLO models (dataset=coco, backbone=darknet53) returns an error:
"MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered"
Expected Behavior
The ObjectDetection example should work with the different pre-trained YOLO models. Note that YOLO models trained with Pascal VOC work just fine.
Error Message
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
Loading: 100% |████████████████████████████████████████|
[11:32:04] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.6.0. Attempting to upgrade...
[11:32:04] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
model yolo
[11:32:13] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[WARNING]
ai.djl.engine.EngineException: MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
Stack trace:
File "/codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h", line 81
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16.461 s
[INFO] Finished at: 2020-05-30T11:32:18+02:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project examples: An exception occured while executing the Java class. MXNet engine call failed: CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
[ERROR] Stack trace:
[ERROR] File "/codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h", line 81
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[11:32:18] src/resource.cc:279: Ignore CUDA Error [11:32:18] src/storage/./pooled_storage_manager.h:97: CUDA: an illegal memory access was encountered
[[[[11:32:18] 11:32:18] src/engine/threaded_engine_perdevice.cc11:32:18src/engine/threaded_engine_perdevice.cc] src/engine/threaded_engine_perdevice.cc:27511:32:18:275:: 275: Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
] src/engine/threaded_engine_perdevice.cc:275Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
: : Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
Ignore CUDA Error [11:32:18] /codebuild/output/src546137840/src/git-codecommit.us-west-2.amazonaws.com/v1/repos/AWS-MXNet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:203: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
terminate called after throwing an instance of 'dmlc::Error'
what(): [11:32:18] src/storage/./pooled_storage_manager.h:97: CUDA: an illegal memory access was encountered
Aborted (core dumped)
How to Reproduce?
Change the model filters in the ObjectDetection example from
.optFilter("backbone", "resnet50")
to
.optFilter("dataset", "coco")
.optFilter("imageSize","416")
.optFilter("backbone", "darknet53")
mvn exec:java -Dexec.mainClass="ai.djl.examples.inference.ObjectDetection"
Environment Info
Ubuntu, CUDA 10.2, GPU V100