
Is it possible to compile ONNX models? #59

Closed
davidas1 opened this issue Jan 8, 2020 · 9 comments


davidas1 commented Jan 8, 2020

There are mentions of this capability in some docs, plus a list of supported ops, but there's no example of how to do it in practice.
I tried compiling a simple pretrained ResNet model from https://github.com/onnx/models/ and it failed with:

01/08/2020 12:51:20 PM ERROR [neuron-cc]: ***************************************************************
01/08/2020 12:51:20 PM ERROR [neuron-cc]:  An Internal Compiler Error has occurred
01/08/2020 12:51:20 PM ERROR [neuron-cc]: ***************************************************************
01/08/2020 12:51:20 PM ERROR [neuron-cc]: 
01/08/2020 12:51:20 PM ERROR [neuron-cc]: Please contact Customer Support and provide the following details.
01/08/2020 12:51:20 PM ERROR [neuron-cc]: 
01/08/2020 12:51:20 PM ERROR [neuron-cc]: Error message:  A process in the process pool was terminated abruptly while the future was running or pending.
01/08/2020 12:51:20 PM ERROR [neuron-cc]: 
01/08/2020 12:51:20 PM ERROR [neuron-cc]: Error location: pipeline.compile.0
01/08/2020 12:51:20 PM ERROR [neuron-cc]: Command line:   /home/ubuntu/anaconda3/envs/aws_neuron_tensorflow_p36/bin/neuron-cc compile --framework ONNX /home/ubuntu/resnet18v1.onnx --output /home/ubuntu/onnx_test/output.neff
01/08/2020 12:51:20 PM ERROR [neuron-cc]: 
01/08/2020 12:51:20 PM ERROR [neuron-cc]: Internal details:
01/08/2020 12:51:20 PM ERROR [neuron-cc]:   File "neuroncc/driver/Job.py", line 207, in neuroncc.driver.Job.runSingleInputFn
01/08/2020 12:51:20 PM ERROR [neuron-cc]:   File "neuroncc/driver/Pipeline.py", line 30, in neuroncc.driver.Pipeline.Pipeline.runSingleInput
01/08/2020 12:51:20 PM ERROR [neuron-cc]:   File "neuroncc/driver/Job.py", line 247, in neuroncc.driver.Job.SingleInputJob.run
01/08/2020 12:51:20 PM ERROR [neuron-cc]:   File "neuroncc/driver/Job.py", line 252, in neuroncc.driver.Job.SingleInputJob.run
01/08/2020 12:51:20 PM ERROR [neuron-cc]:   File "/home/ubuntu/anaconda3/envs/aws_neuron_tensorflow_p36/lib/python3.6/concurrent/futures/_base.py", line 432, in result
01/08/2020 12:51:20 PM ERROR [neuron-cc]:     return self.__get_result()
01/08/2020 12:51:20 PM ERROR [neuron-cc]:   File "/home/ubuntu/anaconda3/envs/aws_neuron_tensorflow_p36/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
01/08/2020 12:51:20 PM ERROR [neuron-cc]:     raise self._exception
01/08/2020 12:51:20 PM ERROR [neuron-cc]: 
01/08/2020 12:51:20 PM ERROR [neuron-cc]: Version information:
01/08/2020 12:51:21 PM ERROR [neuron-cc]:   Neuron Compiler version 1.0.5939.0+5849551057
01/08/2020 12:51:21 PM ERROR [neuron-cc]:   
01/08/2020 12:51:21 PM ERROR [neuron-cc]:   HWM version 1.0.720.0-5848815573
01/08/2020 12:51:21 PM ERROR [neuron-cc]:   NEFF version 0.6
01/08/2020 12:51:21 PM ERROR [neuron-cc]:   TVM version 1.0.1416.0+5849176296
01/08/2020 12:51:21 PM ERROR [neuron-cc]:   NumPy version 1.17.4
01/08/2020 12:51:21 PM ERROR [neuron-cc]:   MXNet not available
01/08/2020 12:51:21 PM ERROR [neuron-cc]:   TF version 1.15.0
01/08/2020 12:51:21 PM ERROR [neuron-cc]: 
01/08/2020 12:51:21 PM ERROR [neuron-cc]: Artifacts stored in: /home/ubuntu/neuroncc-ft4i1tln
@aws-taylor (Contributor)

Hello David,

It is definitely possible to compile ONNX models.

The particular model you are attempting to compile uncovered a few bugs on our end.

Specifically:

  • If the version of ONNX used to train the model is different from the version of ONNX installed, a segfault may occur and you receive this unhelpful error message. I have opened an internal ticket for this issue. Minimally, we will be improving our error messages in a future release.
  • If you omit the `--io-config` flag when attempting to compile, you likewise receive an unhelpful error message. I have opened another internal ticket for this issue, and we will likewise be improving our error messages in a future release.

Beyond these two issues, the particular pre-trained model mentioned may have problems. I'm not sure precisely where you downloaded this model from, but the resnet18v1 model from https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet18v1/resnet18v1.onnx appears to have incorrectly named operators and other issues (#59). Since you mentioned you just picked a random model, I did not spend too much time investigating. If using this specific model is important, could you attach the .onnx model you were using to this issue?

That being said, here’s an example of compilation using resnet50 using the model at https://github.com/onnx/models/tree/master/vision/classification/resnet/resnet50.

neuron-cc compile --framework ONNX resnet50/model.onnx --output /tmp/onnx.neff --io-config '{"inputs":{"gpu_0/data_0":[[1,3,224,224], "float32"]},"outputs":["gpu_0/softmax_1"]}'

Notice how the inputs and outputs are specified. For this model, the github page above conveniently specifies the input and output names and dimensions. For a more general ONNX model, you may find the net_drawer.py script provided by ONNX useful for visualizing the network.

python3 /usr/local/lib/python3.6/dist-packages/onnx/tools/net_drawer.py --input resnet50/model.onnx --output model.dot --embed_docstring
dot -Tpng model.dot -o model.png

Hopefully this helps. Please let us know if you experience any further issues.

Regards,
Taylor

@davidas1 (Author)

Just got around to testing your suggested solution, and I get the same error message with the resnet50 models as well (I tested all the models from the link you gave, opset 3 up to opset 9).

About ONNX versions - I have installed onnx 1.6.0 and onnxruntime 1.1.0
What else can I check in my environment? I'm running DLAMI 26, aws_neuron_tensorflow_p36 conda env, updated as suggested in the DLAMI with Neuron Release Notes
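One quick way to check the relevant package versions in the environment is via the standard library, without importing the packages themselves. A small sketch (the package list is just an example set to inspect):

```python
from importlib.metadata import version, PackageNotFoundError

def pkg_version(name):
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"

# Packages relevant to this thread; adjust the list as needed.
for pkg in ("onnx", "onnxruntime", "neuron-cc", "protobuf"):
    print(f"{pkg}: {pkg_version(pkg)}")
```

Comparing these against the versions in the compiler's error report can help spot mismatches like the one discussed below.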

@aws-taylor (Contributor)

Hello David,

After some debugging, it appears the issue may be related to onnx 1.6.0. I was able to reproduce the issue when using onnx 1.6.0, but compilation works fine when downgrading to 1.5.0.

python3 -m pip install neuron-cc onnx==1.5.0
wget -q https://s3.amazonaws.com/download.onnx/models/opset_9/resnet50.tar.gz
tar xvf resnet50.tar.gz
neuron-cc compile \
  --framework ONNX resnet50/model.onnx \
  --output onnx.neff \
  --io-config '{"inputs":{"gpu_0/data_0":[[1,3,224,224], "float32"]},"outputs":["gpu_0/softmax_1"]}'

ls -la onnx.neff

I'll continue to investigate and try to figure out why onnx 1.6.0 is problematic.

-Taylor

@aws-taylor (Contributor)

Hello again David,

I have some new information - the issue appears to be related to how the ONNX 1.6 binary wheel was compiled and the version of libprotobuf used. Looking at a core file, I see the SEGFAULT coming from:

0x00007f1b44b60a35 in pybind11::enum_<onnx::OpSchema::SupportType>::value(char const*, onnx::OpSchema::SupportType, char const*) ()
   from /usr/local/lib/python3.6/dist-packages/onnx/onnx_cpp2py_export.cpython-36m-x86_64-linux-gnu.so

Notably, this file has a dependency on libprotobuf, and I've found some other GitHub issues that allude to this file being sensitive to the protobuf version.

ldd /usr/local/lib/python3.6/dist-packages/onnx/onnx_cpp2py_export.cpython-36m-x86_64-linux-gnu.so
...
libprotobuf.so.10 => /usr/lib/x86_64-linux-gnu/libprotobuf.so.10 (0x00007f610b038000)

I'm still investigating, but in the meantime, if you do a source install of onnx then you ought to be able to use 1.6:

python3 -m pip install --force-reinstall --no-binary onnx onnx

-Taylor

@aws-taylor (Contributor)

Seems like the same issue: schyun9212/maskrcnn-benchmark#3

@davidas1 (Author)

Thanks, that seems to solve the issue and enables me to run a sanity check of my setup.

The actual model I'm trying to compile includes an Upsample op (which looks to be supported, based on the ONNX supported ops list), and I assume you support opset 9, since Upsample was deprecated in newer ONNX versions.

For some reason the compilation now fails with:
Error message: check_upsampling() takes at least 4 positional arguments (1 given)

I've attached the log and a visualization of one of the Upsample modules in Netron, which is very simple:
neuroncc.log
onnx_upsample

If needed, I can open an issue with AWS support and share additional data (ONNX file, compiler artifacts, etc.)

@aws-taylor (Contributor)

Thanks David,

I have opened an issue internally to track this error. We'll report back once we know more.

Regards,
Taylor

@aws-zejdaj (Contributor)

David, could you please share the model with us? Full or a small version that contains the upsample operator. That will speed up our debug process.

Thank you,
Jindrich

@awsrjh (Contributor)

awsrjh commented Mar 9, 2020

Closing

@awsrjh awsrjh closed this as completed Mar 9, 2020
aws-mesharma pushed a commit that referenced this issue Sep 22, 2020
Release notes for Neuron SDK Release - August 5, 2020