# Tensorflow to TensorRT converter (TF2TRT)
This converter parses a Tensorflow protobuf file or `graph_def` object and creates a TensorRT network from it. Everything is written in Python 3 and requires no additional packages.
There is no support for dynamic shapes. Computations based on the shapes of tensors inside the network at runtime are therefore not possible. Every shape must be fully specified in Tensorflow at construction time.
Tensorflow has a lot of custom-made operations, and not all of them are supported in TensorRT. Right now the converter is missing many operations and attributes. Adding them is quite easy as long as TensorRT supports them.
The layout of the input data fed into the Tensorflow network should be channel-first (NCHW), which makes the conversion easier.
## Why use TF2TRT?
- Easy to extend
- No special 3rd party dependencies
- Pure Python
If you are interested in a full-fledged converter, try out ONNX-TensorRT.
The shortest way to use the converter is just to call:

```python
import tensorrt as trt
from trt_importer import TRTImporter

importer = TRTImporter(trt.Logger.VERBOSE)
network = importer.from_tensorflow_graph_def(
    graph_def, ["input_tensor"], [[1, 3, 224, 224]], ["softmax"])
serialized_engine = importer.optimize_network(
    network, max_workspace_size=4 * (1 << 30))
```
The example also explains how to freeze a Tensorflow graph and run a TensorRT engine.
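For readers unfamiliar with freezing, a minimal sketch using the Tensorflow 1.x API (not necessarily identical to the bundled example; the output node name `softmax` matches the snippet above and should be adapted to your own graph):

```python
import tensorflow as tf

# Sketch (Tensorflow 1.x): assumes the model has already been built in the
# default graph. Baking the variables into constants yields a graph_def that
# can be handed to from_tensorflow_graph_def above.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    graph_def = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["softmax"])
```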
## How to use FP16?
The TensorRT network can be optimized for FP16 computation by adding an `fp16_mode=True` parameter to the `optimize_network(...)` method. Please note that it is important to still provide a normal FP32 graph and not to have any manual FP16 casting operations in it.
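For example, reusing the network from the snippet above (a sketch, not the only valid combination of parameters):

```python
# Sketch: build an FP16-optimized engine from the unchanged FP32 graph.
serialized_engine = importer.optimize_network(
    network, fp16_mode=True, max_workspace_size=4 * (1 << 30))
```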
## Optimize for bigger batch sizes
An additional parameter `max_batch_size=32` of the `optimize_network(...)` method defines the maximum batch size the resulting engine will be able to execute. Optimizing for higher batch sizes might reduce the performance for smaller ones. If you need optimal performance for several batch sizes, you need to create an execution profile for each of them.
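For example (a sketch based on the parameter described above):

```python
# Sketch: allow inference with batches of up to 32 samples.
serialized_engine = importer.optimize_network(network, max_batch_size=32)
```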
## Quality vs. speed of the optimization process
The optimization process is an exhaustive search for the implementation that runs a particular layer in the least amount of time. We can specify how often each implementation should be repeated (`average_find_iterations`) in order to get an average computation time. Our API provides a simplified parameter for that: `importer.optimize_network(..., fast_pass=True)` sets both iteration parameters to 1 when `fast_pass` is True and to 5 otherwise.
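In practice that means something like the following sketch, trading kernel-selection quality for build time:

```python
# Sketch: fast build while iterating on the network, slower but more
# carefully timed build for the final deployment engine.
engine_debug = importer.optimize_network(network, fast_pass=True)
engine_final = importer.optimize_network(network, fast_pass=False)
```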
## Expected memory consumption?
The resulting serialized engine contains the weights and the structure of the network. When executing an inference pass, all the weights are copied into GPU memory, plus some additional space is needed for intermediate calculations. The `serialized_engine.device_memory_size` variable gives an approximation of how much memory will be consumed.
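For example, to print the estimate (assuming `serialized_engine` is the engine returned by `optimize_network(...)` as in the snippet above):

```python
# Sketch: report the approximate GPU memory the engine needs at runtime.
size_mib = serialized_engine.device_memory_size / (1 << 20)
print(f"Approximate device memory required: {size_mib:.1f} MiB")
```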
We love to get in contact with the community. Feel free to e-mail us or use the issue system to suggest new features and ask questions. Pull requests are always welcome, and we try to incorporate them into the master branch as fast as possible. Not sure if that typo is worth a pull request? Do it! We will appreciate it.