
commaVQ challenge

Source Video        Compressed Video        Future Prediction
source_video.mp4    compressed_video.mp4    generated.mp4

A world model predicts the next state of the world given the observed previous states and actions.

World models are essential for training all kinds of intelligent agents, especially self-driving agents.

commaVQ contains:

  • encoder/decoder models used to heavily compress driving scenes
  • a world model trained on 3,000,000 minutes of driving videos
  • a dataset of 100,000 minutes of compressed driving videos

Task

Lossless compression challenge: make me smaller! $500 challenge

Losslessly compress 5,000 minutes of driving video "tokens". Go to ./compression/ to start.

Prize: $500 for the highest compression rate on 5,000 minutes of driving video (~915MB). The challenge ended July 1st, 2024, 11:59pm AoE.

Submit a single zip file containing the compressed data and a Python script that decompresses it into its original form using the submission form. Top solutions are listed on comma's official leaderboard.

Implementation                                  Compression rate
szabolcs-cs (self-compressing neural network)   3.4
pkourouklidis (arithmetic coding with GPT)      2.6
anonymous (zpaq)                                2.3
rostislav (zpaq)                                2.3
anonymous (zpaq)                                2.2
anonymous (zpaq)                                2.2
0x41head (zpaq)                                 2.2
tillinf (zpaq)                                  2.2
baseline (lzma)                                 1.6

Overview

A VQ-VAE [1,2] was used to heavily compress each video frame into 128 "tokens" of 10 bits each. Each entry of the dataset is a "segment" of compressed driving video, i.e. one minute of frames at 20 FPS. Each file has shape 1200x8x16 (1200 frames = 60 s x 20 FPS; 8x16 = 128 tokens per frame) and is saved as int16.
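To get a feel for the numbers, here is a back-of-the-envelope sketch (pure arithmetic, no repository code assumed); at the true 10 bits per token, 5,000 minutes works out to the ~915MB challenge size quoted above:

tokens_per_segment = 1200 * 8 * 16                 # 153,600 tokens per minute of video
print(tokens_per_segment * 2)                      # 307,200 bytes per segment stored as int16
print(5000 * tokens_per_segment * 10 / 8 / 2**20)  # ~915 MiB for 5,000 min at 10 bits per token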

A world model [3] was trained to predict the next token given a context of past tokens. This world model is a Generative Pre-trained Transformer (GPT) [4] trained on 3,000,000 minutes of driving videos following a similar recipe to [5].
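The actual model and its usage live in ./notebooks/gpt.ipynb; the sketch below only illustrates the autoregressive data flow, with a hypothetical helper (predict_next_token, uniform-random here) standing in for the real GPT:

import numpy as np

TOKENS_PER_FRAME = 8 * 16  # 128 tokens per frame
VOCAB_SIZE = 1024          # 10-bit codebook
rng = np.random.default_rng(0)

def predict_next_token(context):
    # Stand-in for the real model: the GPT would return logits over the
    # 1024-entry codebook given the flattened token history.
    return int(rng.integers(VOCAB_SIZE))

def imagine(past_frames, n_frames):
    # Flatten past frames into one token sequence, then extend it one
    # token at a time until n_frames new frames are complete.
    seq = past_frames.reshape(-1).tolist()
    for _ in range(n_frames * TOKENS_PER_FRAME):
        seq.append(predict_next_token(seq))
    return np.asarray(seq, dtype=np.int16).reshape(-1, 8, 16)

past = rng.integers(VOCAB_SIZE, size=(20, 8, 16), dtype=np.int16)  # 1 s of context
future = imagine(past, n_frames=5)
print(future.shape)  # (25, 8, 16): the 20 context frames plus 5 imagined ones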

Examples

  • ./notebooks/encode.ipynb and ./notebooks/decode.ipynb: how to visualize the dataset using a segment of driving video from comma's drive to Taco Bell.
  • ./notebooks/gpt.ipynb: how to use the world model to imagine future frames.
  • ./compression/compress.py: how to compress the tokens using lzma (see the standalone sketch after this list).
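Along the lines of the lzma baseline, a minimal standalone sketch (this is not the repository's compress.py; the segment here is synthetic noise, so the printed rate will differ from the 1.6 measured on real tokens):

import lzma
import numpy as np

# Synthetic stand-in for one segment; real driving tokens are far more
# structured and therefore compress differently than uniform noise.
tokens = np.random.default_rng(0).integers(0, 1024, size=(1200, 8, 16), dtype=np.int16)

raw = tokens.tobytes()
compressed = lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)
print(len(raw) / len(compressed))  # compression rate

# The challenge is lossless: the round trip must restore the exact bytes.
restored = np.frombuffer(lzma.decompress(compressed), dtype=np.int16).reshape(tokens.shape)
assert np.array_equal(restored, tokens)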

Download the dataset

  • Using huggingface datasets
import numpy as np
from datasets import load_dataset
num_proc = 40 # CPUs go brrrr
ds = load_dataset('commaai/commavq', num_proc=num_proc)
tokens = np.load(ds['0'][0]['path']) # first segment from the first data shard

References

[1] Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." Advances in neural information processing systems 30 (2017).

[2] Esser, Patrick, Robin Rombach, and Bjorn Ommer. "Taming transformers for high-resolution image synthesis." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

[3] Ha, David, and Jürgen Schmidhuber. "World Models." (2018). https://worldmodels.github.io/

[4] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

[5] Micheli, Vincent, Eloi Alonso, and François Fleuret. "Transformers are Sample-Efficient World Models." The Eleventh International Conference on Learning Representations. 2023.
