[MultiModal] Fusion model inference acceleration with TensorRT #2836
Conversation
Job PR-2836-99121a5 is done.
Job PR-2836-1d05c3e is done.
Thanks for the change @liangfu. Do you know why the preprocessing logic is generally slower in the TRT execution environment?
Can you add some accuracy metrics as well, just to make sure accuracy isn't compromised after the TRT transformation?
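A minimal sketch of such a check, assuming `y_true` holds the labels from `test_df` and the two prediction arrays come from the stock predictor and the TRT-backed path respectively (all names here are placeholders, not part of this PR):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# y_true: labels from test_df; y_pred_torch / y_pred_trt: predictions from the
# default predictor and from the TensorRT-backed module (placeholders).
acc_torch = accuracy_score(y_true, y_pred_torch)
acc_trt = accuracy_score(y_true, y_pred_trt)
agreement = np.mean(np.asarray(y_pred_torch) == np.asarray(y_pred_trt))
print(f"accuracy: torch={acc_torch:.4f} trt={acc_trt:.4f} agreement={agreement:.4f}")
```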
if not onnx_path:
    onnx_path = os.path.join(self.path, default_onnx_path)

device_type = "cuda" if torch.cuda.is_available() else "cpu"
Wondering why we previously used GPU for onnx_export.
We actually used cpu instead, see the deleted comment below. To quote:
# Perform tracing on cpu, since we're facing an error when tracing with cuda device:
# ERROR: Tensor-valued Constant nodes differed in value across invocations.
# This often indicates that the tracer has encountered untraceable code.
# Comparison exception: The values for attribute 'shape' do not match: torch.Size([]) != torch.Size([384]).
# from https://github.com/rwightman/pytorch-image-models/blob/3aa31f537d5fbf6be8f1aaf5a36f6bbb4a55a726/timm/models/swin_transformer.py#L112
device = "cpu"
num_gpus = 0
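As a rough illustration of that workaround (not the exact code in this PR; `model`, `batch`, the output names, and the opset are placeholders), the export is performed with everything moved to CPU:

```python
import torch

# Trace/export on CPU to avoid the cuda tracing error quoted above.
device = "cpu"
model = model.to(device).eval()
batch = {k: v.to(device) for k, v in batch.items()}

torch.onnx.export(
    model,
    (batch, {}),                      # trailing empty dict: pass `batch` as a positional dict arg
    "fusion_model.onnx",
    input_names=list(batch.keys()),
    output_names=["features", "logits"],
    dynamic_axes={k: {0: "batch_size"} for k in batch},
    opset_version=13,
)
```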
multimodal/src/autogluon/multimodal/models/fusion/fusion_mlp.py (resolved)
pure_model = model.module if isinstance(model, nn.DataParallel) else model
if isinstance(pure_model, OnnxModule):
    for k in batch:
What possible types can batch[k].dtype be when the code runs here? Can we say it is always in [torch.float32, torch.int32, torch.int64]?
This is special handling for converting the data type of token_ids. I'm not sure why the token_ids are in int32, but the inputs are required to be int64.
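A small, hypothetical sketch of that fix-up (the int32/int64 mismatch described above), applied to the batch before it is handed to the ONNX module:

```python
import torch

# Token ids come out of preprocessing as int32, but the exported graph
# declares int64 inputs, so cast them before running the ONNX module.
for k in batch:
    if batch[k].dtype == torch.int32:
        batch[k] = batch[k].to(torch.int64)
```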
Job PR-2836-6cb5a9e is done.
elif "valid_length" in k or k.startswith("numerical") or k.startswith("timm_image"): | ||
dynamic_axes[k] = {0: "batch_size"} |
The shape of images in timm_image's input should be (b, n, c, h, w), where both b and n may be dynamic.
I would suggest avoiding adding image size to the dynamic dimensions, because
- adding an extra dynamic dimension would increase the complexity of model compilation
- a compiled model with too many dynamic dimensions could be suboptimal in terms of performance

The question becomes: do we really need to support dynamic shapes in image data? If yes, what are the lower and upper bounds of the image size?
n is the number of images per sample. (c, h, w) is the fixed shape of one image.
In a pretrained setting, n should be a fixed number, since the input DataFrame should have the same number of image columns, shouldn't it?
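A short sketch of the convention that emerges from this thread (key prefixes mirror the snippet above; `input_names` is a placeholder for the exported input names): only the batch dimension is marked dynamic, while n, c, h, w stay fixed.

```python
# Build dynamic_axes for torch.onnx.export: only dim 0 (batch) is dynamic.
dynamic_axes = {}
for k in input_names:
    if "valid_length" in k or k.startswith("numerical") or k.startswith("timm_image"):
        # timm_image inputs are (b, n, c, h, w); n, c, h, w are treated as fixed.
        dynamic_axes[k] = {0: "batch_size"}
```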
try:
    import tensorrt  # Unused but required by TensorrtExecutionProvider
except ImportError:
    logger.warning(
        "Failed to import tensorrt package. "
        "onnxruntime will fall back to CUDAExecutionProvider instead of using TensorrtExecutionProvider."
    )
Is it easy to install tensorrt for users? If not, consider making this a lazy import to avoid unnecessary warnings for users who don't need it.
Making it a lazy import can also reduce predictor's init time.
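A sketch of the lazy-import pattern being suggested here (the helper name is hypothetical): the tensorrt import is deferred until ONNX inference is actually requested, so the predictor's init path pays no cost.

```python
import logging

logger = logging.getLogger(__name__)


def _select_onnx_providers():
    """Hypothetical helper: import tensorrt lazily, only when building a session."""
    providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    try:
        import tensorrt  # noqa: F401  # unused, but required by TensorrtExecutionProvider
        providers.insert(0, "TensorrtExecutionProvider")
    except ImportError:
        logger.warning(
            "Failed to import tensorrt; onnxruntime will fall back to CUDAExecutionProvider."
        )
    return providers
```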
> Is it easy to install tensorrt for users?

Yes, it's just pip install tensorrt.
Tried lazy import, but it didn't work.
It's kind of weird that lazy import doesn't work. It's fine to keep it here for now; maybe dive into the reason later.
logger.info("Loading ONNX file from path {}...".format(onnx_path))
onnx_model = onnx.load(onnx_path)

trt_module = OnnxModule(onnx_model)
The returned module may not use tensorrt. Maybe moving the tensorrt import warning inside OnnxModule is better?
I tried moving the tensorrt import inside OnnxModule, but that didn't work (onnxruntime.InferenceSession doesn't compile the model with TensorrtExecutionProvider).
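For reference, a minimal sketch of how the provider preference is expressed when building the session (the model path is a placeholder); onnxruntime silently falls back down the list if TensorRT is unavailable:

```python
import onnxruntime as ort

sess = ort.InferenceSession(
    "fusion_model.onnx",
    providers=[
        "TensorrtExecutionProvider",   # preferred, used only if tensorrt is installed
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
print(sess.get_providers())  # shows which providers were actually enabled
```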
Job PR-2836-1e75f6d is done.
Job PR-2836-9c8a787 is done.
Job PR-2836-d37b0ad is done.
input_dict = {k: args[i].cpu().numpy() for i, k in enumerate(self.input_names)}
onnx_outputs = self.sess.run(self.output_names, input_dict)
onnx_outputs = onnx_outputs[:3]
Why do we need only the first 3 model outputs? Don't we need them all?
This comes from the undetermined size of the output dict for a fusion_mlp model, which contains a tuple of (features, logits, multimodal_logits). The variable multimodal_logits contains the logit output from all modalities.

> Don't we need them all?

Good question. In short, multimodal_logits are only used for computing loss, not for inference.
Specifically, get_output_dict() would take outputs from forward(), but in onnxruntime the outputs are flattened, which means the list of tensors in multimodal_logits would be merged into the other outputs (e.g. features, logits). We kind of lose the information about where the extra tensors came from.
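To make the slicing above concrete, a hedged sketch of what the ONNX-backed forward effectively does (names mirror the snippet above; the exact number of kept outputs depends on the model's output layout):

```python
import torch

# onnxruntime returns a flat list of numpy arrays; keep only the leading
# entries that correspond to (features, logits, ...) and drop the flattened
# multimodal_logits, which are needed only for loss computation.
onnx_outputs = sess.run(output_names, input_dict)
kept = tuple(torch.from_numpy(out) for out in onnx_outputs[:3])
```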
# Prediction with default predictor
y_pred = predictor.predict(test_df)

trt_module = predictor.export_tensorrt(path=model_path, data=tail_df, batch_size=batch_size)
Currently, export_onnx and export_tensorrt return either a path or a module. When a user calls export_something(), do they expect it to return something, or do they just expect the model to be saved to disk? If users want to use the saved model, they probably want to load it?
These are excellent questions.
There are several different use cases:
U1: Some users might expect an ONNX file to be exported, so that they can use the ONNX file wherever they want.
U2: Some users might expect faster inference with ONNX. They don't actually care much about where the ONNX file is.
In terms of U1, I think the expected output could be either the location of the ONNX file or the ONNX model itself. This is how export_onnx is defined.
In terms of U2, we should be able to generate a drop-in replacement for torch.nn.Module, so that integration with the existing inference flow is easy. This is how export_tensorrt is defined.
But I would agree, we should have a better name for these public APIs.
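A hedged sketch of the two patterns side by side (the export_tensorrt call mirrors the example earlier in this thread; the export_onnx keyword arguments are assumptions, not the final API):

```python
# U1: export an ONNX artifact that can be consumed anywhere
# (the return value is the ONNX path or model).
onnx_path = predictor.export_onnx(data=sample_df)

# U2: get a drop-in torch.nn.Module replacement for faster in-place inference.
trt_module = predictor.export_tensorrt(path=model_path, data=sample_df, batch_size=2)
```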
        batch[key] = inp.to(device, dtype=dtype)
    else:
        batch[key] = inp.to(device)
self._model.to(device)
Do we need self._model.to(device) here?
I think so. We need to ensure the model parameters are moved to CPU before tracing.
Job PR-2836-dc7fc9b is done.
LGTM! Thanks for supporting TensorRT!
Issue #, if available:
Description of changes:
This PR adds support for accelerating fusion models with TensorRT.
For a fusion model trained on the PetFinder dataset, TensorrtExecutionProvider can boost inference speed by up to 2.7x compared to GPU-based real-time prediction with PyTorch.
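A rough sketch of how such a speedup could be measured (not the benchmark behind the 2.7x number above; `predictor` and `test_df` are assumed from the example earlier in this thread, and the same helper can be pointed at the TRT-backed prediction path for comparison):

```python
import time

def mean_latency(predict_fn, df, n_runs=10):
    """Average wall-clock time per predict call, after one warm-up run."""
    predict_fn(df)                       # warm-up (builds engines, fills caches)
    start = time.perf_counter()
    for _ in range(n_runs):
        predict_fn(df)
    return (time.perf_counter() - start) / n_runs

print(f"PyTorch realtime prediction: {mean_latency(predictor.predict, test_df) * 1000:.1f} ms")
```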
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.