Ishaan Preetam Chandratreya*,1, David Charatan*,1, Basile Van Hoorick2, Sergey Zakharov2, Vitor Guizilini2, Phillip Isola1, Vincent Sitzmann1
1Massachusetts Institute of Technology 2Toyota Research Institute
*equal contribution
[Project Page] [Paper] [Code] [Models] [Dataset]
TL;DR: We present a long-memory autoregressive framework for video generation. A hierarchical latent space and coarse-to-fine rollout let it stay consistent many times longer than a conventional diffusion model under the same token budget.
Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information—such as scene layout and semantics—while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts compared to existing baselines.
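To make the token-budget argument concrete, here is a minimal sketch of how mixing levels of detail stretches the number of frames that fit in a fixed context. It is only an illustration, not the MilliVid implementation: the per-level token counts, the budget, the function name `context_length`, and the policy of keeping recent frames at the finest level and older frames at the coarsest level are all assumptions made for this example.

```python
# Hypothetical tokens per frame at each hierarchy level, ordered coarse -> fine.
# These numbers are assumptions for illustration, not values from the paper.
LEVEL_TOKENS = [4, 16, 64, 256]


def context_length(budget: int, fine_frames: int) -> int:
    """Count the context frames that fit in `budget` tokens.

    Assumed policy: up to `fine_frames` of the most recent frames are kept at
    the finest level, and the remaining budget is filled with older frames
    kept at the coarsest level only.
    """
    fine_cost = LEVEL_TOKENS[-1]    # tokens for one frame at full detail
    coarse_cost = LEVEL_TOKENS[0]   # tokens for one frame at the coarsest level

    # Never keep more fine frames than the budget can pay for.
    fine = min(fine_frames, budget // fine_cost)
    remaining = budget - fine * fine_cost

    # Spend whatever is left on coarse-only frames.
    coarse = remaining // coarse_cost
    return fine + coarse


if __name__ == "__main__":
    budget = 4096  # hypothetical transformer context size, in tokens
    # Every context frame at full detail:
    print(context_length(budget, fine_frames=16))  # 16
    # Half the budget on fine frames, the rest on coarse frames:
    print(context_length(budget, fine_frames=8))   # 520
```

Under these assumed numbers, giving up full detail on older frames extends the context from 16 frames to 520 frames at the same token cost. The actual rollout schedule and level sizes used by MilliVid may differ.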
conda create python=3.12 -n millivid
conda activate millivid
pip install -r requirements.txt

Use the command below to generate videos with a pretrained MilliVid model. It will automatically download a tiny subset of the Loopcraft dataset and a pretrained MilliVid model.

python3 demo.py

If you want to compare MilliVid's outputs to the baselines' outputs, use python3 demo.py --baselines instead.
- ✅ Pre-trained checkpoints (for main baseline comparisons)
- ✅ Demo/inference script
- ✅ Images and latents for test set uploaded
- ⏳ Images and latents for training set are currently uploading
- ❌ Training/testing instructions not yet written (coming in the next few days)
- ❌ Data generation scripts not yet uploaded (coming soon as well)
TL;DR: If you're very eager to start experimenting with MilliVid, everything you need is here, but there may be some sharp edges. Those will be ironed out in the coming days.
- Experiment Configuration README: Configurations for various models and ablations.
If you find this work useful, please consider citing:
@inproceedings{chandratreya2026millivid,
title = {MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation},
author = {Chandratreya, Ishaan Preetam and Charatan, David and Van Hoorick, Basile and Zakharov, Sergey and Guizilini, Vitor and Isola, Phillip and Sitzmann, Vincent},
booktitle = {arXiv},
year = {2026},
url = {https://davidcharatan.com/millivid},
}

We thank Andrew Song and Hannah Schlueter for their feedback during the process of writing and editing the paper. This work was supported by the Toyota Research Institute (TRI) University 3.0 (URP) program, the National Science Foundation under Grant No. 2211259, by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) under 140D0423C0075, by the Amazon Science Hub, by the MIT-Google Program for Computing Innovation, by AMD via the MIT AI Hardware Program, and by a 2025 MIT Office of Research Computing and Data Seed Grant. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of any other entity.