In this work, we present ERTACache, a principled and efficient caching framework for accelerating diffusion model inference. By decomposing cache-induced degradation into feature shift and step amplification errors, we develop a dual-dimensional correction strategy that combines offline-calibrated reuse scheduling, trajectory-aware timestep adjustment, and closed-form residual rectification. The figure below gives an overview of the ERTACache framework: (1) we first perform offline policy calibration, searching for a globally effective cache schedule via residual error profiling; (2) we then introduce a trajectory-aware timestep adjustment mechanism to mitigate the integration drift caused by reused features; (3) finally, we propose an explicit error rectification that analytically approximates and corrects the additive error introduced by cached outputs, enabling accurate reconstruction with negligible overhead.
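To make the cache-then-rectify idea concrete, here is a minimal toy sketch of a cached denoising loop. All names, shapes, and the decayed-residual rectifier are illustrative assumptions for exposition, not the actual ERTACache implementation or its closed-form rectifier:

```python
import numpy as np

def model_eval(x, t):
    """Stand-in for an expensive diffusion model forward pass (toy dynamics)."""
    return -0.1 * x * (1.0 + 0.01 * t)

def cached_sampling(x0, timesteps, cache_schedule, gamma=0.5):
    """Denoising loop that reuses the last model output on cached steps.

    cache_schedule[i] == True means step i skips the model call and reuses
    the cached output, corrected by a lightweight residual term (here a
    simple decayed copy of the last observed residual; the paper's actual
    rectifier is a closed-form analytical approximation).
    """
    x = x0.copy()
    cached_out = None
    last_residual = 0.0
    n_calls = 0
    for i, t in enumerate(timesteps):
        if cache_schedule[i] and cached_out is not None:
            # Reuse + rectify: cached output plus a decayed residual estimate.
            out = cached_out + gamma * last_residual
        else:
            out = model_eval(x, t)
            n_calls += 1
            if cached_out is not None:
                last_residual = out - cached_out
            cached_out = out
        x = x + out  # Euler-style integration step
    return x, n_calls
```

With an alternating schedule such as `[False, True] * 5`, half of the model calls are skipped while the rectification term keeps the trajectory close to the non-cached one, which is the intuition behind the offline-searched schedule in ERTACache.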
As shown in the figure below, ERTACache preserves fine-grained visual details and frame-to-frame consistency, outperforming TeaCache and matching the non-cached reference. In video generation tasks with CogVideoX, Wan2.1-1.3B, and OpenSora 1.2, ERTACache achieves noticeably better temporal consistency, particularly between the first and last frames. When applied to the FLUX-dev 1.0 image model, it enhances visual richness and detail. These results highlight ERTACache as an effective solution that balances visual quality and computational efficiency for consistent video generation.
Unlike prior heuristics-based methods, ERTACache provides a theoretically grounded yet lightweight solution that significantly reduces redundant computations while maintaining high-fidelity outputs. Empirical results across multiple benchmarks validate its effectiveness and generality, highlighting its potential as a practical solution for efficient generative sampling.
### Text to Video
- ERTACache4Wan2.1
- ERTACache4CogVideoX-2B
- ERTACache4OpenSora1.2

### Text to Image
- ERTACache4FLUX
| Model | Method | LPIPS | SSIM | PSNR | Latency (s) |
|---|---|---|---|---|---|
| OpenSora 1.2 | TeaCache | 0.2511 | 0.7477 | 19.10 | 19.84 |
| OpenSora 1.2 | ERTACache | 0.1659 | 0.8170 | 22.34 | 18.04 |
| CogVideoX-2B | TeaCache | 0.2057 | 0.7614 | 20.97 | 26.88 |
| CogVideoX-2B | ERTACache | 0.1012 | 0.8702 | 26.44 | 26.78 |
| Wan2.1-1.3B | TeaCache | 0.2913 | 0.5685 | 16.17 | 99.5 |
| Wan2.1-1.3B | ERTACache | 0.1095 | 0.8200 | 23.77 | 91.7 |
| FLUX-dev 1.0 | TeaCache | 0.4427 | 0.7445 | 16.47 | 14.21 |
| FLUX-dev 1.0 | ERTACache | 0.3029 | 0.8962 | 20.51 | 14.01 |
The runtime environment setup depends on the specific model. For example, for FLUX, install the FLUX dependencies:

```shell
pip install --upgrade diffusers[torch] transformers protobuf tokenizers sentencepiece
```

For any supported model, enter its folder (for example, `ERTACache4FLUX`), then run the following command; the outputs are saved in the `./sample` folder:

```shell
sh run.sh
```

This repository is built on VideoSys, Diffusers, Open-Sora, CogVideoX, FLUX, and Wan2.1. Thanks for their contributions!

