MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

Ishaan Preetam Chandratreya¹*, David Charatan¹*, Basile Van Hoorick², Sergey Zakharov², Vitor Guizilini², Phillip Isola¹, Vincent Sitzmann¹

¹Massachusetts Institute of Technology ²Toyota Research Institute

*Equal contribution

arXiv 2026

[Project Page] [Paper] [Code] [Models] [Dataset]

TL;DR: We present a long-memory autoregressive framework for video generation. A hierarchical latent space and coarse-to-fine rollout let it stay consistent many times longer than a conventional diffusion model under the same token budget.

Abstract

Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts than existing baselines.
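
To make the token-budget argument concrete, here is a minimal back-of-the-envelope sketch. It is not MilliVid's actual rollout schedule: the sequence length, the per-level token counts, and the function names are all hypothetical values chosen for illustration. It only shows how many context frames fit in a fixed budget when older frames are kept at a coarse level instead of at full detail.

```python
"""Illustrative token-budget arithmetic for mixed-resolution context.

All numbers and names below are hypothetical; they are not MilliVid's
actual settings or API.
"""


def frames_at_full_detail(budget: int, tokens_per_frame: int) -> int:
    """Context frames that fit when every frame uses the finest level."""
    return budget // tokens_per_frame


def frames_with_hierarchy(
    budget: int,
    fine_tokens: int,
    coarse_tokens: int,
    recent_fine_frames: int,
) -> int:
    """Context frames that fit when only the newest frames are fine.

    The `recent_fine_frames` newest frames cost `fine_tokens` each;
    every older frame costs `coarse_tokens`.
    """
    fine_cost = recent_fine_frames * fine_tokens
    if fine_cost > budget:
        # Not even the fine frames fit; fall back to the uniform case.
        return budget // fine_tokens
    return recent_fine_frames + (budget - fine_cost) // coarse_tokens


BUDGET = 8192  # hypothetical transformer sequence length
FINE = 256     # hypothetical tokens per frame at the finest level
COARSE = 4     # hypothetical "handful" of tokens at the coarsest level

uniform = frames_at_full_detail(BUDGET, FINE)
mixed = frames_with_hierarchy(BUDGET, FINE, COARSE, recent_fine_frames=8)
print(uniform, mixed)  # prints: 32 1544
```

Under these assumed numbers, a uniform context holds 32 frames, while keeping only the 8 most recent frames at full detail leaves room for 1536 additional coarse frames. The real trade-off depends on the actual level sizes and on how the rollout mixes levels, which this sketch does not model.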

Quick Start

Setup

1. Create a Conda environment and install dependencies.

conda create python=3.12 -n millivid
conda activate millivid
pip install -r requirements.txt

2. Generate some videos!

Use the command below to generate videos with a pretrained MilliVid model. It will automatically download a tiny subset of the Loopcraft dataset and a pretrained MilliVid model.

python3 demo.py

If you want to compare MilliVid's outputs to the baselines' outputs, use python3 demo.py --baselines instead.

Code Release Progress

✅ Pre-trained checkpoints (for main baseline comparisons)
✅ Demo/inference script
✅ Images and latents for test set uploaded
⏳ Images and latents for training set are currently uploading
❌ Training/testing instructions not yet given (will be done over the next few days)
❌ Data generation scripts not yet uploaded (coming soon as well)

TL;DR: If you're very eager to start experimenting with MilliVid, everything you need is here, but there may be some sharp edges. Those will be ironed out in the coming days.

Useful Information for Extending MilliVid

Experiment Configuration README: Configurations for various models and ablations.

Citation

If you find this work useful, please consider citing:

@inproceedings{chandratreya2026millivid,
  title     = {MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation},
  author    = {Chandratreya, Ishaan Preetam and Charatan, David and Van Hoorick, Basile and Zakharov, Sergey and Guizilini, Vitor and Isola, Phillip and Sitzmann, Vincent},
  booktitle = {arXiv},
  year      = {2026},
  url       = {https://davidcharatan.com/millivid},
}

Acknowledgements

We thank Andrew Song and Hannah Schlueter for their feedback during the process of writing and editing the paper. This work was supported by the Toyota Research Institute (TRI) University 3.0 (URP) program, by the National Science Foundation under Grant No. 2211259, by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) under 140D0423C0075, by the Amazon Science Hub, by the MIT-Google Program for Computing Innovation, by AMD via the MIT AI Hardware Program, and by a 2025 MIT Office of Research Computing and Data Seed Grant. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of any other entity.
