v0.7.0: add options for faster diffusion inference: shared variable caching, efficient bias fusion, and TF32 acceleration.
What's Changed
We’re excited to announce the open-source release of Protenix v0.7.0, supported by @yangyanpinghpc, featuring several performance optimizations for diffusion inference. This version introduces three new optional acceleration flags (all enabled by default at inference time) and improved support for batched inference:
- --enable_cache
  Precomputes and caches shared intermediate variables (pair_z, p_lm, c_l) across the N_sample and N_step dimensions.
- --enable_fusion
  Fuses bias transformations and normalization in the 24-layer diffusion transformer blocks at compile time.
- --enable_tf32
  Enables TF32 precision for matrix multiplications when using FP32 computation, trading slight numerical accuracy for speed.
- Batched Diffusion Support (N_sample > 1)
  Shares s_trunk and z_pair across the N_sample dimension during diffusion, reducing memory and compute overhead without affecting results.
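The idea behind caching and sharing across N_sample and N_step can be sketched in a few lines of plain Python. This is only a toy illustration, not Protenix code: `compute_shared`, `denoise_step`, `run_diffusion`, and the constants are hypothetical names, and the "expensive" projection is a stand-in for quantities such as pair_z that do not depend on the sample or the step.

```python
# Hypothetical sketch: names, shapes, and math are illustrative, not Protenix's API.
N_SAMPLE, N_STEP = 5, 10
calls = {"shared": 0}  # counts how often the shared quantity is recomputed


def compute_shared(features):
    """Stand-in for an expensive projection that is independent of sample and step."""
    calls["shared"] += 1
    return [2.0 * f for f in features]


def denoise_step(x, shared, step):
    """Toy per-step update that consumes the shared conditioning."""
    return [xi + 0.1 * si / (step + 1) for xi, si in zip(x, shared)]


def run_diffusion(features, enable_cache):
    # With caching, the shared quantity is computed once, before both loops.
    cache = compute_shared(features) if enable_cache else None
    outputs = []
    for sample in range(N_SAMPLE):
        x = [float(sample)] * len(features)
        for step in range(N_STEP):
            # Without caching, it is recomputed at every (sample, step) pair.
            shared = cache if enable_cache else compute_shared(features)
            x = denoise_step(x, shared, step)
        outputs.append(x)
    return outputs
```

In this toy setup the cached path calls `compute_shared` once instead of N_SAMPLE * N_STEP times, and both paths return identical outputs, which mirrors the claim that sharing reduces compute without affecting results.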
You can run it using the following example command:
(Note: if not specified, --enable_cache, --enable_fusion, and --enable_tf32 default to true.)
protenix predict -i examples/example.json -o ./test_outputs/cmd/output_mini -s 105,106 -n "protenix_mini_default_v0.5.0" --triatt_kernel "torch" --trimul_kernel "torch" --enable_cache true --enable_fusion true --enable_tf32 true
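To get a feel for the accuracy that --enable_tf32 trades away, the following sketch emulates TF32 storage in pure Python. TF32 keeps the 8 exponent bits of FP32 but only 10 mantissa bits. The helper `to_tf32` is a hypothetical name, and it truncates rather than rounds, so it is an approximation of what hardware does, not a description of how Protenix implements the flag. In PyTorch, TF32 matmuls are usually toggled with `torch.backends.cuda.matmul.allow_tf32`; whether Protenix uses exactly that switch is an assumption.

```python
import struct


def to_tf32(x: float) -> float:
    """Emulate TF32 by zeroing the 13 lowest mantissa bits of the FP32 encoding.

    Truncation is used for simplicity; real hardware rounding may differ.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def relative_error(x: float) -> float:
    """Relative error introduced by the emulated TF32 truncation of x."""
    return abs(to_tf32(x) - x) / abs(x)
```

Values that fit in 10 mantissa bits (such as 1.0 or 0.5) pass through unchanged, while a value like 1/3 picks up a relative error below 2**-10, which is the "slight numerical accuracy" cost mentioned above.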
