
v0.7.0: add options for faster diffusion inference: shared variable caching, efficient bias fusion, and TF32 acceleration.


@zhangyuxuann zhangyuxuann released this 05 Nov 07:25
· 78 commits to main since this release

What's Changed

We’re excited to announce the open-source release of Protenix v0.7.0, supported by @yangyanpinghpc, featuring several performance optimizations for diffusion inference. This version introduces three new optional acceleration flags (all enabled by default at inference time) and improved support for batched inference:

  • --enable_cache
    Precomputes and caches shared intermediate variables (pair_z, p_lm, c_l) across the N_sample and N_step dimensions.
  • --enable_fusion
    Fuses bias transformations and normalization in the 24-layer diffusion transformer blocks at compile time.
  • --enable_tf32
    Enables TF32 precision for matrix multiplications when using FP32 computation, trading slight numerical accuracy for speed.
  • Batched Diffusion Support (N_sample > 1)
    Shares s_trunk and z_pair across the N_sample dimension during diffusion, reducing memory and compute overhead without affecting results.
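
The idea behind caching and sharing across N_sample and N_step can be illustrated with a toy loop. This is only a sketch under assumptions: run_diffusion, expensive_shared, and the scalar "update" are hypothetical stand-ins and do not reflect Protenix internals. It shows that a quantity that does not depend on the sample or step index can be computed once and reused without changing the result.

```python
# Hypothetical sketch: names and the scalar update are illustrative only,
# not the Protenix implementation.
from typing import Callable, List

calls = {"count": 0}


def expensive_shared() -> float:
    """Stand-in for a step-invariant quantity (e.g. a conditioning term)."""
    calls["count"] += 1
    return 0.5


def run_diffusion(
    n_sample: int,
    n_step: int,
    compute_shared: Callable[[], float],
    enable_cache: bool,
) -> List[float]:
    """Toy denoising loop that either recomputes or reuses the shared term."""
    cached = compute_shared() if enable_cache else None
    outputs = []
    for _ in range(n_sample):
        x = 0.0
        for _ in range(n_step):
            shared = cached if enable_cache else compute_shared()
            x = x + shared  # stand-in for one denoising update
        outputs.append(x)
    return outputs


calls["count"] = 0
without_cache = run_diffusion(5, 10, expensive_shared, enable_cache=False)
calls_without = calls["count"]  # one call per (sample, step) pair

calls["count"] = 0
with_cache = run_diffusion(5, 10, expensive_shared, enable_cache=True)
calls_with = calls["count"]  # a single call, reused everywhere

print(calls_without, calls_with, with_cache == without_cache)
```

In this toy setting the number of evaluations drops from N_sample x N_step to one, while the outputs stay identical, which mirrors the "without affecting results" claim for the shared variables.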

You can run it using the following example command:
(Note: if not specified, --enable_cache, --enable_fusion, and --enable_tf32 default to true.)

protenix predict -i examples/example.json -o ./test_outputs/cmd/output_mini -s 105,106 -n "protenix_mini_default_v0.5.0" --triatt_kernel "torch" --trimul_kernel "torch" --enable_cache true --enable_fusion true --enable_tf32 true
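
To get a feel for the accuracy trade-off behind --enable_tf32, the following standard-library sketch emulates TF32 inputs by keeping only 10 of the 23 FP32 mantissa bits and compares a dot product against the FP32 inputs. It is an approximation under assumptions: the helper names are hypothetical, truncation is used where hardware may round differently, and accumulation is done in Python floats as a stand-in for FP32 accumulation. It does not call into Protenix or PyTorch.

```python
# Hypothetical emulation of TF32 input precision; not how Protenix or the
# GPU implements it. TF32 keeps FP32's 8 exponent bits but only 10 mantissa bits.
import struct


def to_float32(x: float) -> float:
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x: float) -> float:
    """Convert to FP32, then zero the low 13 mantissa bits (23 -> 10 bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # assumption: truncate instead of round
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(a, b, cast):
    """Dot product with inputs cast to the given precision before multiplying."""
    return sum(cast(x) * cast(y) for x, y in zip(a, b))


a = [1.0 + i * 1e-4 for i in range(256)]
b = [1.0 - i * 1e-4 for i in range(256)]

ref = dot(a, b, to_float32)
approx = dot(a, b, to_tf32)
rel_err = abs(ref - approx) / abs(ref)
print(f"relative error with emulated TF32 inputs: {rel_err:.2e}")
```

With truncation, each input loses less than 2^-10 in relative terms, so the relative error of this dot product stays below 2^-9. Whether that is acceptable for a given prediction task is worth checking against an FP32 run.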

v0.7.0 performance