This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

CUDA: an illegal memory access was encountered on hybridized yolo model #18834

Open
stu1130 opened this issue Jul 31, 2020 · 17 comments
Labels
Bug CUDA Gluon v1.x Targeting v1.x branch

Comments

@stu1130
Contributor

stu1130 commented Jul 31, 2020

Steps to reproduce

  • Build mxnet-cu101 from source, based on the mxnet 1.7.0 branch
  • Use gluoncv 0.7

from gluoncv import model_zoo, data, utils
from matplotlib import pyplot as plt
import mxnet as mx
from mxnet import gluon 

net = model_zoo.get_model('yolo3_darknet53_coco', pretrained=True, ctx=mx.gpu())
net.hybridize()
x = mx.nd.random.uniform(shape=(1, 3, 1000, 1000), ctx=mx.gpu())
_, scores, _ = net(x)
print(scores.shape)
net.export("yolo")

deserialized_net = gluon.nn.SymbolBlock.imports("yolo-symbol.json", ['data'], "yolo-0000.params", ctx=mx.gpu())
image = mx.nd.random.normal(shape=(1, 3, 1000, 1000), ctx=mx.gpu())
print(deserialized_net(image))

Running this fails with:

CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
[21:53:46] src/resource.cc:279: Ignore CUDA Error [21:53:46] src/storage/./pooled_storage_manager.h:97: CUDA: an illegal memory access was encountered

[21:53:46] src/resource.cc:331: Ignore CUDA Error [21:53:46] src/common/random_generator.cu:70: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered

It works fine on CPU, and on GPU with the mxnet-cu101 1.6 pip wheel.
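CUDA errors such as "illegal memory access" are usually reported asynchronously, so the lines in the log above may not point at the operator that actually faulted. One way to narrow this down is to rerun the script with synchronous execution. The sketch below is only an illustration: the helper names and the script path are hypothetical, while MXNET_ENGINE_TYPE=NaiveEngine and CUDA_LAUNCH_BLOCKING=1 are the documented MXNet and CUDA environment variables for this purpose.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of the environment with synchronous-execution settings."""
    env = dict(os.environ if base is None else base)
    # MXNet: execute operators one at a time instead of scheduling them asynchronously.
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    # CUDA: make kernel launches blocking so an error surfaces at the faulting call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_repro(script):
    """Run a repro script (path supplied by the caller) under the debug environment."""
    return subprocess.run([sys.executable, script], env=debug_env(), check=False)
```

Calling run_repro with the path of the saved reproduction script should then fail closer to the offending operator, at the cost of much slower execution.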

@szha
Member

szha commented Aug 4, 2020

@stu1130 could you help verify if this still happens with the master branch?

@stu1130
Contributor Author

stu1130 commented Aug 4, 2020

@szha Sure, but when I built mxnet from source I ran into this error:

+++ patchelf --set-rpath '$ORIGIN' --force-rpath libopenblas.so
/home/ubuntu/mxnet_master/tools/dependencies/openblas.sh: line 35: patchelf: command not found

This is the command I used.

tools/staticbuild/build.sh cu101

@leezu
Contributor

leezu commented Aug 4, 2020

@stu1130 how about installing patchelf? Just run apt install patchelf as per the documentation in https://github.com/apache/incubator-mxnet/tree/master/tools/staticbuild
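To catch a missing tool before a long static build starts, a pre-flight check along these lines may help. It is a minimal sketch: the function name is hypothetical, and patchelf is the only tool listed because it is the only one this thread shows to be required. The staticbuild documentation linked above is the authoritative list.

```python
import shutil


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Only patchelf is known from this thread; extend the list as needed.
REQUIRED_TOOLS = ["patchelf"]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("All listed build tools were found.")
```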

@stu1130
Contributor Author

stu1130 commented Aug 4, 2020

@leezu thanks! It works

@stu1130
Contributor Author

stu1130 commented Aug 5, 2020

It seems gluoncv 0.7 doesn't work with mxnet 2.0. Is gluoncv 0.7 the latest version I can get?

Traceback (most recent call last):
  File "demo_yolo.py", line 10, in <module>
    from gluoncv import model_zoo, data, utils
  File "/home/ubuntu/.local/lib/python3.5/site-packages/gluoncv/__init__.py", line 8, in <module>
    from . import data
  File "/home/ubuntu/.local/lib/python3.5/site-packages/gluoncv/data/__init__.py", line 4, in <module>
    from . import transforms
  File "/home/ubuntu/.local/lib/python3.5/site-packages/gluoncv/data/transforms/__init__.py", line 5, in <module>
    from . import image
  File "/home/ubuntu/.local/lib/python3.5/site-packages/gluoncv/data/transforms/image.py", line 6, in <module>
    from mxnet import nd
ImportError: cannot import name 'nd'

@szha
Member

szha commented Aug 5, 2020

That is surprising as I can still import nd from a build on the master branch.

@troyliu0105
Contributor

@szha
I built v1.8.0 and v1.7.0 to reproduce this issue, and I can confirm that it still happens.
It works fine with the prebuilt 1.6.0.post0 package from PyPI.

I have the same issue in my own code: with hybridize turned on it produces this error, but with it turned off everything works fine.

@leezu
Contributor

leezu commented Jan 1, 2021

@troyliu0105 It just means that you managed to corrupt your Python profile. Try uninstalling all mxnet packages (you probably have multiple installed?) and installing again.
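A quick way to check for multiple installed mxnet packages is to list every distribution whose name starts with "mxnet" in the active environment. This is a minimal sketch with a hypothetical helper name; it assumes Python 3.8 or newer, where importlib.metadata is part of the standard library.

```python
from importlib import metadata


def installed_mxnet_distributions():
    """Return sorted (name, version) pairs for installed distributions named mxnet*."""
    found = set()
    for dist in metadata.distributions():
        # Name can be missing for a broken install, so fall back to an empty string.
        name = dist.metadata.get("Name") or ""
        if name.lower().startswith("mxnet"):
            found.add((name, dist.version))
    return sorted(found)


if __name__ == "__main__":
    for name, version in installed_mxnet_distributions():
        print(name, version)
```

More than one entry (for example a CPU build alongside a CUDA build) would suggest the packages are shadowing each other.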

@troyliu0105
Contributor

@leezu Actually, I created a separate Anaconda environment and a separate copy of the source directory for each build, so I think they are isolated.

@leezu
Contributor

leezu commented Jan 1, 2021

Sorry, but if you are hitting the import error, it means that your environment is broken. You can try running pip install -e . from the incubator-mxnet/python directory.

@troyliu0105
Contributor

@leezu 😂 That's not the issue I encountered. I think you confused me with the person who opened this issue.

@leezu
Contributor

leezu commented Jan 3, 2021

OK. I thought you were talking about the ImportError: cannot import name 'nd' issue described above.

@troyliu0105
Contributor

@szha I built the most recent master branch. Because gluoncv is not compatible with mxnet 2.0 (it fails the version assert), I used the previously exported weights to test, as below:

from gluoncv import model_zoo, data, utils
from matplotlib import pyplot as plt
import mxnet as mx
from mxnet import gluon
deserialized_net = gluon.nn.SymbolBlock.imports("yolo-symbol.json", ['data'], "yolo-0000.params", ctx=mx.gpu())
image = mx.nd.random.normal(shape=(1, 3, 1000, 1000), ctx=mx.gpu())
print(deserialized_net(image))

The issue still occurs.

@nicklhy
Contributor

nicklhy commented Feb 8, 2021

Any updates? I got the same problem with mxnet 1.7.

@Zha0q1
Contributor

Zha0q1 commented Mar 30, 2021

@waytrue17 and I also had this issue with mxnet 1.7

@waytrue17
Contributor

I can confirm that the issue occurs only with yolo3_darknet53_coco, not with yolo3_darknet53_voc.

@shesung
Contributor

shesung commented Dec 1, 2021

This issue still exists in mxnet-cu102==1.8.0.post0 and mxnet-cu102==1.7.0.post1.

It is OK with mxnet-cu102==1.6.0.post0.
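The version reports in this thread can be condensed into a small guard that a script could run before hybridizing the model on GPU. This is only a sketch of what commenters reported (1.6.x working, 1.7.x and 1.8.x failing), not an official compatibility table, and the function name is hypothetical.

```python
import re


def reported_status(version):
    """Classify an mxnet version string according to reports in this thread only."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return "unknown"
    major_minor = (int(match.group(1)), int(match.group(2)))
    if major_minor == (1, 6):
        return "reported working"
    if major_minor in {(1, 7), (1, 8)}:
        return "reported failing"
    # Other versions have no version-tagged report in this thread.
    return "unknown"
```

With mxnet imported as mx, reported_status(mx.__version__) gives the classification for the installed build.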
