
Add support for ONNX-only #4291

Conversation

thiagocrepaldi
Contributor

@thiagocrepaldi thiagocrepaldi commented May 31, 2022

This PR is composed of different fixes to enable end-to-end ONNX export for detectron2 models:

  • The `add_export_config` API is publicly exposed even when caffe2 is not compiled along with PyTorch (which is the new default behavior in the latest PyTorch). A warning message informing users about its deprecation in future versions is also added

  • `tensor.shape[0]` replaces `len(tensor)`, and `for idx, img in enumerate(tensors)` replaces `for tmp_var1, tmp_var2 in zip(tensors, batched_imgs)`, so that the tracer does not lose the reference to the user input in the graph.

    • Before the changes above, the graph (see screenshot below) does not have an actual input. Instead, the input is exported as a model weight
      [screenshot]
    • After the fix, the user images are properly recognized as the model's input during ONNX export (see screenshot below)
      [screenshot]
  • Added unit tests (tests/torch_export_onnx.py) for detectron2 models

  • ONNX is added as a dependency so that the CI can run the aforementioned tests

  • Added custom symbolic functions to allow CI pipelines to succeed. The symbolics are needed because PyTorch 1.8, 1.9, and 1.10, as adopted by detectron2, have several bugs. They can be removed when 1.11+ is adopted by detectron2's CI infra
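The tracer-friendly batching in the second bullet can be sketched roughly as follows. This is a minimal standalone sketch, not detectron2's actual `ImageList` code; `pad_batch` and its parameters are hypothetical:

```python
import torch

def pad_batch(tensors, pad_value=0.0):
    # Pad a list of (C, H, W) images to a common (N, C, Hmax, Wmax) batch.
    max_h = max(t.shape[-2] for t in tensors)
    max_w = max(t.shape[-1] for t in tensors)
    batched = tensors[0].new_full(
        (len(tensors), tensors[0].shape[0], max_h, max_w), pad_value
    )
    # Write each image into the output with enumerate + in-place copy_,
    # instead of zip()-ing over a temporary copy: this keeps the trace
    # connected to the user-supplied inputs. Likewise, tensor.shape[0]
    # is traced symbolically, while len(tensor) would be baked into the
    # graph as a Python constant.
    for idx, img in enumerate(tensors):
        batched[idx, :, : img.shape[-2], : img.shape[-1]].copy_(img)
    return batched
```

Because the loop writes into `batched` in place, the exported graph keeps a real input node for each user image instead of freezing it as a weight.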

Fixes #3488
Fixes pytorch/pytorch#69674 (PyTorch repo)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 31, 2022
@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/add-onnx-export-support branch 3 times, most recently from 7e7e074 to 510027f Compare June 1, 2022 14:57
@thiagocrepaldi thiagocrepaldi marked this pull request as ready for review June 1, 2022 15:33
@thiagocrepaldi
Contributor Author

thiagocrepaldi commented Jun 1, 2022

@ppwwyyxx @FrancescoMandru splitting up Detectron2 ONNX export in a separate (and clean) PR

Comment on lines 21 to 28
def add_export_config(cfg):
    warnings.warn("add_export_config has been deprecated and behaves as no-op function.")
    return cfg


Contributor

There is no need to move this function if it's not used at all.

Contributor Author

This is still a public API and fully working for detectron2 + pytorch < 1.11, so better give users a grace period before removing it. What do you think?

Contributor

I'm not suggesting removing it. But I suggest keeping the function where it was; it seems that won't break anyone, so there is no need to move the function into __init__.py

Contributor Author

For PyTorch >= 1.11, the current detectron2 implementation (i.e. keeping add_export_config at detectron2/export/api.py) does break existing training scripts. See a full repro below:

conda create -n torch111_py39cu113 python=3.9 numpy
conda activate torch111_py39cu113 
conda install pytorch torchvision cudatoolkit=11.3 -c pytorch # pytorch 1.11
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
python -c "from detectron2.export import add_export_config"

The result is:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'add_export_config' from 'detectron2.export' (/anaconda3/envs/torch111_py39cu113/lib/python3.9/site-packages/detectron2/export/__init__.py)

The root cause is that after a "recent" PyTorch BC-breaking change, libcaffe2 is no longer distributed with the official PyTorch packages. When detectron2 tries to import anything from the detectron2.export.api namespace, it will indirectly try to import caffe2 bits, such as from caffe2.proto import caffe2_pb2

This change tries to create an explicit deprecation path by printing a scary warning so that users update their code, while still keeping things working (by moving add_export_config to a caffe2-free spot)

detectron2/export/c10.py (review thread resolved)
tests/test_export_onnx.py (review thread resolved)
"COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml", inference_func, batch=2
)

@skipIfUnsupportedMinOpsetVersion(16, STABLE_ONNX_OPSET_VERSION)
Contributor

@unittest.skipIf(
    STABLE_ONNX_OPSET_VERSION < 16, "torch version too old")

this should be sufficient.

Also, is it the plan that a newer version of pytorch will support a new-enough onnx opset? If that's the case please add a todo here so this check can be removed in the future?

Contributor Author

Opset 16 is the current ONNX opset version, and the next PyTorch version already implements it.
We need to skip this test for now because detectron2 still uses the older PyTorch versions 1.8, 1.9, and 1.10 in the CI

Contributor

Please add a todo here to update the value of STABLE_ONNX_OPSET_VERSION so this check can be removed in the future.

skipIfUnsupportedMinOpsetVersion is used only once, so it's easier to just write @unittest.skipIf( STABLE_ONNX_OPSET_VERSION < 16, "torch version too old")

Contributor Author

It is used once because ONNX export functionality is divided into several PRs.

#4205 uses skipIfUnsupportedMinOpsetVersion at least 2 times (here and here) and possibly more when I add more tests. This PR is on hold until all other 3 are in

setup.py (review thread resolved)
detectron2/export/README.md (review thread resolved)
detectron2/structures/image_list.py (review thread resolved)
detectron2/structures/instances.py (review thread resolved)
@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/add-onnx-export-support branch from c239e5e to c14e3b6 Compare June 3, 2022 15:43
@thiagocrepaldi thiagocrepaldi changed the title Add support for ONNX-only (original PR #4120) Add support for ONNX-only Jun 6, 2022
@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/add-onnx-export-support branch from c14e3b6 to 317346a Compare June 6, 2022 23:00
@FrancescoMandru

Any updates on this support?

@orionr

orionr commented Jun 14, 2022

We are looking for the right POCs to review. Stay tuned.

@orionr

orionr commented Jun 14, 2022

Also, can you please rebase since there seems to be conflicts? Thanks.

@FrancescoMandru

> Also, can you please rebase since there seems to be conflicts? Thanks.

@thiagocrepaldi is the main (sole) contributor on this feature, so I'm tagging him.

from .flatten import TracingAdapter
from .torchscript import scripting_with_instances, dump_torchscript_IR

STABLE_ONNX_OPSET_VERSION = 11
Contributor

does it make sense to condition this value based on torch.__version__? e.g. set to 16 for newer torch versions

Contributor Author

Not really: PyTorch versions and ONNX opset versions have a 1:N relationship. Many model authors experiment with performance using different combinations of opset and/or PyTorch versions. Many times newer PyTorch versions improve the implementation of older opsets; other times, newer opsets have new operators available, which allows a more efficient implementation

It is up to the authors to pick their preferred (possibly older) opset version on, generally speaking, the latest PyTorch version

Contributor Author

When ONNX Runtime is integrated on detectron2 CI pipeline and ONNX export is thoroughly tested, I will work on updating the supported opset to 16 and see how they compare to opset 11

@ppwwyyxx
Contributor

Thanks @orionr ! I can help with most of the review but will definitely need someone at Meta to help land.

@facebook-github-bot
Contributor

@zhanghang1989 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/add-onnx-export-support branch from 317346a to 2720f3e Compare June 23, 2022 22:53
@facebook-github-bot
Contributor

@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.

@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/add-onnx-export-support branch from 2720f3e to 793b649 Compare June 23, 2022 23:31
@facebook-github-bot
Contributor

@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.

@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/add-onnx-export-support branch from 793b649 to 99a9ff7 Compare June 23, 2022 23:31
@facebook-github-bot
Contributor

@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.

@thiagocrepaldi
Contributor Author

@ppwwyyxx I think I have addressed everything, hopefully it is aligned with your ideas

I will jump to #4315 while you take your time with this one

Thanks for taking the time

@FrancescoMandru

@orionr Hello Sir, any updates on this PR?

@thiagocrepaldi
Contributor Author

> @zhanghang1989 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Thanks @zhanghang1989 @orionr @wat3rBro. The internal test failed after 6h. Could you help me with a backtrace or a relevant, non-proprietary log so that I can fix it? I would guess this is a false positive, as Caffe2 export is not currently supported by detectron2 and the files I changed are only pertinent to Caffe2 export

@facebook-github-bot
Contributor

@mcimpoi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mcimpoi

mcimpoi commented Jul 14, 2022

Re-importing the changes to fbsource.

The test failure was a transient infra error on our end. Re-ran the tests.

@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/add-onnx-export-support branch from 27a6209 to 95f05c3 Compare July 14, 2022 18:06
@facebook-github-bot
Contributor

@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@mcimpoi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@thiagocrepaldi
Contributor Author

@mcimpoi Cheers Mircea, I saw that the Linter issue is gone after your tip. THANK YOU

The internal Build & Tests failure persists, though. Would you know whether it is a transient infra issue or an actual problem that I could work on? Thank you again

Although facebookresearch#4120 exported a valid ONNX graph, after running it with ONNX
Runtime, I realized that all model inputs were exported as constants.

The root cause was that in detectron2/structures/image_list.py, the
padding of batched images was done on a temporary (copy) variable, so
the ONNX tracer could not track the user input, leading to inputs being
suppressed and stored as constants during export.

Also, len(torch.Tensor) is traced as a constant shape, which poses a
problem if the next input has a different shape.
This PR fixes that by using torch.Tensor.shape[0] instead
@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/add-onnx-export-support branch from 95f05c3 to 4bc6262 Compare July 15, 2022 17:24
@facebook-github-bot
Contributor

@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.

@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/add-onnx-export-support branch from 4bc6262 to 2626240 Compare July 15, 2022 17:41
@facebook-github-bot
Contributor

@thiagocrepaldi has updated the pull request. You must reimport the pull request before landing.

@ppwwyyxx
Contributor

Please just disregard the "internal Build & Tests failure" on GitHub.

To Meta employee: I had a complaint in the open source support group that internal warnings will show as failures on github.

@thiagocrepaldi
Contributor Author

The reason I have rebased this branch is that #4295 was merged (yay!!!!), conflicting with this one (that was expected, as I have been working on several PRs simultaneously to keep things going)

@facebook-github-bot
Contributor

@mcimpoi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@mcimpoi mcimpoi left a comment


Thanks @thiagocrepaldi for fixing the linter errors!
I imported the changes again to fbsource.

@mcimpoi

mcimpoi commented Jul 15, 2022

Safe to ignore the last internal tools failure -- it was caused by some of the tests being cancelled due to rebasing.

@thiagocrepaldi thiagocrepaldi deleted the thiagofc/add-onnx-export-support branch July 16, 2022 04:31
@thiagocrepaldi
Contributor Author

Thank you all for reviewing and accepting this :)

facebook-github-bot pushed a commit that referenced this pull request Feb 10, 2023
Summary:
In order to export `KRCNNConvDeconvUpsampleHead` to ONNX using `torch.jit.script`, changes to both PyTorch and detectron2 are needed:

- PyTorch has a bug which prevents casting a tensor wrapped in a list to float. Refer to the required [fix](pytorch/pytorch#81386)

- `detectron2/structures/keypoints.py::heatmaps_to_keypoints` internally does advanced indexing on a `squeeze`d tensor. The aforementioned `squeeze` fails rank inference due to the presence of `onnx::If` in its implementation (to support dynamic dims). The fix is replacing `squeeze` with `reshape`. A possible fix to `squeeze` on the PyTorch side might be done too (TBD, and it would take some time), but the proposed change here does not bring any downside to detectron2, while it enables ONNX support with a scriptable `KRCNNConvDeconvUpsampleHead`.
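The `squeeze`-to-`reshape` substitution described in the second bullet can be illustrated like this. This is a hedged sketch with a hypothetical helper, not the actual `heatmaps_to_keypoints` code:

```python
import torch

def drop_channel_dim(scores: torch.Tensor) -> torch.Tensor:
    # scores: (N, 1, H, W). scores.squeeze(1) is exported with a guard for
    # the "dim 1 != 1" case (an onnx::If), which breaks rank inference for
    # the advanced indexing that follows. Reshaping to an explicit rank
    # keeps the output rank static for the exporter.
    n, _, h, w = scores.shape
    return scores.reshape(n, h, w)
```

For a well-formed `(N, 1, H, W)` input, the result is numerically identical to `scores.squeeze(1)`; only the exported graph differs.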

After the proposed changes, `KRCNNConvDeconvUpsampleHead` includes a `Loop` node to represent a for-loop inside the model, as well as dynamic outputs, as shown below:

![image](https://user-images.githubusercontent.com/5469809/179559001-f60fb8af-ec79-4758-b271-736467b5d96f.png)

This PR has been tested with ONNX Runtime (this [PR](#4205)) to ensure the ONNX output matches PyTorch's for different `gen_input(X, Y)` combinations, and it succeeded. The model was converted to ONNX once with a particular input, then tested with inputs of different shapes and compared for equality with PyTorch's output
Depends on: pytorch/pytorch#81386 and #4291

Pull Request resolved: #4315

Reviewed By: newstzpz

Differential Revision: D42756423

fbshipit-source-id: dc410df18da07f48c14f4cae9a4a91530a0ec602