Add support for ONNX-only #4291
Conversation
force-pushed from 7e7e074 to 510027f
@ppwwyyxx @FrancescoMandru splitting up Detectron2 ONNX export into a separate (and clean) PR
detectron2/export/__init__.py (Outdated)
def add_export_config(cfg):
    warnings.warn("add_export_config has been deprecated and behaves as no-op function.")
    return cfg
There's no need to move this function if it's not used at all.
This is still a public API, fully working for detectron2 + pytorch < 1.11, so it's better to give users a grace period before removing it. What do you think?
I'm not suggesting removing it. But I suggest keeping the function where it was; it seems that won't break anyone. So there is no need to move the function into __init__.py.
For PyTorch >= 1.11, the current detectron2 implementation (i.e. keeping add_export_config at detectron2/export/api.py) does break existing training scripts. See a full repro below:
conda create -n torch111_py39cu113 python=3.9 numpy
conda activate torch111_py39cu113
conda install pytorch torchvision cudatoolkit=11.3 -c pytorch # pytorch 1.11
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
python -c "from detectron2.export import add_export_config"
The result is:
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: cannot import name 'add_export_config' from 'detectron2.export' (/anaconda3/envs/torch111_py39cu113/lib/python3.9/site-packages/detectron2/export/__init__.py)
The root cause is that after a "recent" PyTorch backward-compatibility change, libcaffe2 is no longer distributed along with official PyTorch packages. When detectron2 tries to import anything from the detectron2.export.api namespace, it indirectly imports caffe2 code, such as from caffe2.proto import caffe2_pb2.
This change creates an explicit deprecation path by printing a scary warning so that users update their code, while still keeping things working (by moving add_export_config to a caffe2-free spot).
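The grace-period approach boils down to a caffe2-free no-op shim. A minimal standalone sketch (mirroring the deprecated function from the diff above, not the real detectron2 module):

```python
import warnings


def add_export_config(cfg):
    # Deprecated no-op: keep `from detectron2.export import add_export_config`
    # working without pulling in any caffe2 imports, while nudging users to
    # drop the call before it is removed for good.
    warnings.warn(
        "add_export_config has been deprecated and behaves as no-op function."
    )
    return cfg
```

Because the shim lives in a spot with no caffe2 dependencies, importing it no longer triggers `from caffe2.proto import caffe2_pb2`.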
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml", inference_func, batch=2
)

@skipIfUnsupportedMinOpsetVersion(16, STABLE_ONNX_OPSET_VERSION)
@unittest.skipIf(STABLE_ONNX_OPSET_VERSION < 16, "torch version too old")
this should be sufficient.
Also, is the plan that a newer version of pytorch will support a new-enough onnx opset? If that's the case, please add a TODO here so this check can be removed in the future.
Opset 16 is the current ONNX opset version, and the next PyTorch version already implements it.
We need to skip this test for now because detectron2 still uses the older pytorch versions 1.8, 1.9 and 1.10 in the CI.
Please add a TODO here to update the value of STABLE_ONNX_OPSET_VERSION so this check can be removed in the future.
skipIfUnsupportedMinOpsetVersion is used only once, so it's easier to just write @unittest.skipIf(STABLE_ONNX_OPSET_VERSION < 16, "torch version too old").
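The suggested inline skip can be sketched as a self-contained test case (with STABLE_ONNX_OPSET_VERSION hardcoded here for illustration):

```python
import unittest

STABLE_ONNX_OPSET_VERSION = 11  # pinned value from detectron2/export/__init__.py


class TestOpset16Export(unittest.TestCase):
    # Skipped while the pinned opset is below 16; a dedicated helper
    # decorator is unnecessary for a single use site.
    @unittest.skipIf(STABLE_ONNX_OPSET_VERSION < 16, "torch version too old")
    def test_export_with_opset16(self):
        self.assertTrue(True)
```

Running this case reports it as skipped rather than failed, which keeps CI green on the older pinned torch versions.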
force-pushed from c239e5e to c14e3b6
force-pushed from c14e3b6 to 317346a
Updates on this support?
We are looking for the right POCs to review. Stay tuned.
Also, can you please rebase, since there seem to be conflicts? Thanks.
@thiagocrepaldi is the main (and sole) contributor on this feature, so I'm tagging him.
from .flatten import TracingAdapter
from .torchscript import scripting_with_instances, dump_torchscript_IR

STABLE_ONNX_OPSET_VERSION = 11
does it make sense to condition this value based on torch.__version__? e.g. set it to 16 for newer torch versions
Not really; pytorch and ONNX opset versions have a 1:N relationship. Many model authors experiment with performance using different combinations of opset and/or pytorch versions. Many times newer pytorch versions improve the implementation of older opsets; other times, newer opsets have new operators available, which allows a more efficient implementation.
It is up to the authors to pick their preferred (possibly older) opset version on, generally speaking, the latest pytorch version.
When ONNX Runtime is integrated into the detectron2 CI pipeline and ONNX export is thoroughly tested, I will work on updating the supported opset to 16 and see how it compares to opset 11.
Thanks @orionr! I can help with most of the review but will definitely need someone at Meta to help land.
@zhanghang1989 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
force-pushed from 317346a to 2720f3e
@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.
force-pushed from 2720f3e to 793b649
@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.
force-pushed from 793b649 to 99a9ff7
@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.
@orionr Hello Sir, any updates on this PR?
Thanks @zhanghang1989 @orionr @wat3rBro. The internal test failed after 6h. Could you help me with a backtrace or a relevant non-proprietary log so that I can fix it? I would guess this is a false positive, as Caffe2 export is not currently supported by Detectron2 and the files I changed only pertain to Caffe2 export.
@mcimpoi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Re-importing the changes to fbsource. The test failure was a transient infra error on our end. Re-ran the tests.
force-pushed from 27a6209 to 95f05c3
@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.
@mcimpoi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mcimpoi Cheers Mircea, I saw that the linter issue is gone after your tip. THANK YOU. The internal
* The `add_export_config` API is publicly exposed even when caffe2 is not compiled along with PyTorch (the new default behavior on the latest PyTorch). A warning message informing users about its deprecation in future versions is also added
* `tensor.shape[0]` replaces `len(tensor)` and `for idx, img in enumerate(tensors)` replaces `for tmp_var1, tmp_var2 in zip(tensors, batched_imgs)` so that the tracer does not lose the reference to the user input in the graphs
* Added unit tests (`tests/torch_export_onnx.py`) for detectron2 models
* ONNX is added as a dependency so that the CI is able to run the aforementioned tests
* Added custom symbolic functions to allow CI pipelines to succeed. The symbolics are needed because PyTorch 1.8, 1.9 and 1.10, adopted by detectron2, have several bugs. They can be removed when 1.11+ is adopted by detectron2's CI infra
Although facebookresearch#4120 exported a valid ONNX graph, after running it with ONNX Runtime, I realized that all model inputs were exported as constants. The root cause was that in detectron2/structures/image_list.py, the padding of batched images was done on a temporary (copy) variable, so the ONNX tracer could not track the user input, leading to inputs being suppressed and stored as constants during export. Also, len(torch.Tensor) is traced as a constant shape, which poses a problem if the next input has a different shape. This PR fixes that by using torch.Tensor.shape[0] instead.
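The `len(tensor)` pitfall can be reproduced outside detectron2 with a toy traced function (a hedged sketch; the actual fix lives in `detectron2/structures/image_list.py`):

```python
import torch


def batch_size_via_len(x):
    # len(x) yields a plain Python int, so tracing bakes it in as a constant.
    return torch.zeros(len(x))


def batch_size_via_shape(x):
    # x.shape[0] is recorded as a size op in the trace, so it stays dynamic.
    return torch.zeros(x.shape[0])


traced_len = torch.jit.trace(batch_size_via_len, torch.randn(3, 4))
traced_shape = torch.jit.trace(batch_size_via_shape, torch.randn(3, 4))

# Feed a batch of a different size than the one seen at trace time:
bigger = torch.randn(5, 4)
```

`traced_len(bigger)` keeps the trace-time batch size of 3, while `traced_shape(bigger)` follows the new input, which is exactly why the PR swaps `len(tensor)` for `tensor.shape[0]`.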
force-pushed from 95f05c3 to 4bc6262
@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.
force-pushed from 4bc6262 to 2626240
@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.
Please just disregard the "internal Build & Tests failure" on GitHub. To Meta employees: I filed a complaint in the open source support group that internal warnings show up as failures on GitHub.
The reason I rebased this branch was that #4295 was merged (yay!!!!), conflicting with this one (which was expected, as I have been working on several PRs simultaneously to keep things going).
@mcimpoi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks @thiagocrepaldi for fixing the linter errors!
I imported the changes again to fbsource.
Safe to ignore the last internal tools failure -- it was caused because some of the tests were cancelled due to rebasing.
Thank you all for reviewing and accepting this :)
Summary: In order to export `KRCNNConvDeconvUpsampleHead` to ONNX using `torch.jit.script`, changes to both PyTorch and detectron2 are needed:
- PyTorch has a bug which prevents a tensor wrapped in a list from being cast as float. Refer to the required [fix](pytorch/pytorch#81386)
- `detectron2/structures/keypoints.py::heatmaps_to_keypoints` internally does advanced indexing on a `squeeze`d tensor. The aforementioned `squeeze` fails rank inference due to the presence of `onnx::If` in its implementation (to support dynamic dims). The fix is replacing `squeeze` with `reshape`. A fix to `squeeze` on the PyTorch side might be done too (TBD, and it would take some time), but the change proposed here has no consequence for detectron2 while it enables ONNX support with a scriptable `KRCNNConvDeconvUpsampleHead`.

After the proposed changes, `KRCNNConvDeconvUpsampleHead` does include a `Loop` node to represent a for-loop inside the model, and dynamic outputs, as shown below:
![image](https://user-images.githubusercontent.com/5469809/179559001-f60fb8af-ec79-4758-b271-736467b5d96f.png)

This PR has been tested with ONNX Runtime (this [PR](#4205)) to ensure the ONNX output matches PyTorch's for different `gen_input(X, Y)` combinations, and it succeeded. The model was converted to ONNX once with a particular input, then tested with inputs of different shapes and compared for equality with PyTorch's output.

Depends on: pytorch/pytorch#81386 and #4291
Pull Request resolved: #4315
Reviewed By: newstzpz
Differential Revision: D42756423
fbshipit-source-id: dc410df18da07f48c14f4cae9a4a91530a0ec602
This PR is composed of different fixes to enable an end-to-end ONNX export functionality for detectron2 models:

* The `add_export_config` API is publicly exposed even when caffe2 is not compiled along with PyTorch (that is the new default behavior on the latest PyTorch). A warning message informing users about its deprecation in future versions is also added
* `tensor.shape[0]` replaces `len(tensor)` and `for idx, img in enumerate(tensors)` replaces `for tmp_var1, tmp_var2 in zip(tensors, batched_imgs)` so that the tracer does not lose the reference to the user `input` in the graphs (otherwise the input is exported as a model weight instead)
* Added unit tests (`tests/torch_export_onnx.py`) for detectron2 models
* ONNX is added as a dependency so that the CI is able to run the aforementioned tests
* Added custom symbolic functions to allow CI pipelines to succeed. The symbolics are needed because PyTorch 1.8, 1.9 and 1.10, adopted by detectron2, have several bugs. They can be removed when 1.11+ is adopted by detectron2's CI infra
Fixes #3488
Fixes pytorch/pytorch#69674 (PyTorch repo)