WavFlow: Audio Generation in Waveform Space

Feiyan Zhou¹,² · Luyuan Wang¹ · Shoufa Chen¹,* · Zhe Wang¹ · Zhiheng Liu¹ · Yuren Cong¹ · Xiaohui Zhang¹ · Fanny Yang¹ · Belinda Zeng¹

¹ Meta AI · ² Northeastern University

🌐 Project Page · 📄 arXiv · 🛠 Training Guide

Overview

WavFlow introduces a paradigm for generating synchronized, high-fidelity audio from video and text inputs directly in raw waveform space, bypassing latent compression entirely. Through waveform patchifying and amplitude lifting, WavFlow enables stable flow matching on raw audio via direct x-prediction. On the VGGSound (VT2A) and AudioCaps (T2A) benchmarks, WavFlow performs on par with established latent-based methods, indicating that end-to-end waveform generation can match traditional frameworks in acoustic richness, fidelity, and synchronization.

Demo

🌳 Forest (natural) forest.mp4	🐸 Frog (animal) frog.mp4
🥁 Drum (music) drum.mp4	🛹 Skateboard (sport) skateboard.mp4

See the Project Page for 24+ samples and side-by-side benchmark comparisons.

Method

Installation

git clone https://github.com/facebookresearch/WavFlow.git
cd WavFlow
bash scripts/setup.sh        # creates conda env 'wavflow' and installs everything
conda activate wavflow

Manual setup

conda create -n wavflow python=3.10 -y
conda activate wavflow
pip install -r requirements.txt
pip install -e . --no-deps
conda install -n wavflow -c conda-forge "ffmpeg<7" -y    # for torio video decoding

All required external weights (CLIP, Synchformer, the empty-string CFG embedding) are downloaded or computed automatically on first run and cached under ~/.cache/wavflow/.

Inference

⚠️ Due to organizational policy constraints, we are currently unable to release the production-trained checkpoints. We are working on a foundation checkpoint trained on fully open-source data. In the meantime, you can train your own by following the training guide.

Once you have a trained checkpoint, run:

bash scripts/launch/predict.sh [--gpu N] [--config PATH]

The default config is wavflow/configs/infer.yaml. The input CSV (data.csv_path) accepts video, text, or both:

video_path,caption,video_exist,text_exist
/abs/path/sample1.mp4,a whistling rocket explodes,1,1   # video + text
/abs/path/sample2.mp4,birds chirping in a forest,1,1    # video + text
,a whistling rocket explodes,0,1                        # text-only
/abs/path/sample3.mp4,,1,0                              # video-only
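
As a minimal sketch of producing such a file, the snippet below writes the four columns shown above with Python's standard `csv` module, which quotes any field containing a comma. The helper names `make_row` and `write_rows` are hypothetical, and deriving the `*_exist` flags from whether a field is empty is a convenience of this sketch, not something the loader is known to require. The trailing `#` annotations in the example above appear to be explanatory, so the sketch does not emit them.

```python
import csv
import io

# Column names are taken from the example above.
FIELDS = ["video_path", "caption", "video_exist", "text_exist"]


def make_row(video_path="", caption=""):
    """Build one input row (hypothetical helper).

    Assumption of this sketch: a flag is 1 exactly when its field is non-empty.
    """
    return {
        "video_path": video_path,
        "caption": caption,
        "video_exist": int(bool(video_path)),
        "text_exist": int(bool(caption)),
    }


def write_rows(rows, handle):
    """Write the header and rows; csv quotes fields that contain commas."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


rows = [
    make_row("/abs/path/sample1.mp4", "a whistling rocket explodes"),  # video + text
    make_row(caption="a dog barks, then a door slams"),                # text-only
    make_row("/abs/path/sample3.mp4"),                                 # video-only
]

buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the `StringIO` buffer.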

Configuration reference

Launcher options

Flag / env	Default	Description
`--gpu N` (or `GPU=N`)	`0`	CUDA device index
`--config PATH` (or `CONFIG_PATH=...`)	`wavflow/configs/infer.yaml`	YAML config to load
`WAVFLOW_ENV`	`wavflow`	conda env name to auto-activate

Any extra positional argument is forwarded to python -m wavflow.infer.
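
If you drive the launcher from Python, for example to sweep over several configs, the command line can be assembled as below. This is only a sketch: `build_predict_command` is a hypothetical helper, the config path is a placeholder, and only the two flags listed in the table are used.

```python
import shlex


def build_predict_command(gpu=None, config=None, extra=()):
    """Assemble the argv list for scripts/launch/predict.sh (hypothetical helper)."""
    cmd = ["bash", "scripts/launch/predict.sh"]
    if gpu is not None:
        cmd += ["--gpu", str(gpu)]
    if config is not None:
        cmd += ["--config", str(config)]
    # Extra positional arguments are passed through unchanged.
    cmd += list(extra)
    return cmd


# Placeholder config path; substitute your own YAML file.
cmd = build_predict_command(gpu=1, config="my_configs/infer_large.yaml")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root. Omitting both arguments leaves the launcher on its documented defaults.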

Key fields in `infer.yaml`

Field	What to set
`data.csv_path`	the input CSV (above)
`model.name`	one of `medium_16k`, `medium_44k`, `large_16k`, `large_44k` (must match the trained ckpt)
`model.ckpt_path`	a `checkpoint_.pth` (full ckpt) or `ema_epoch_.pth` (EMA-only)
`model.use_ema`	`true` to load `model_ema1` from a full ckpt; `false` to use the live `model` weights
`inference.duration_sec` / `target_sample_rate`	output length and SR (must match model arch)
`inference.cfg`, `num_steps`, `noise_scale`, `noise_shift`, `prediction_type`, `seed`	sampling hyperparameters
`inference.batch_size`	rows per ODE batch
`inference.trim_to_duration`	trim output to `duration_sec`
`output.output_dir`	where wavs are written
`output.loudness_norm`, `loudness_target_lufs`	optional `pyloudnorm` post-processing
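
A quick pre-flight check can catch a missing field or a mistyped model name before a long sampling run. The sketch below works on an already-loaded config dictionary. The nesting is inferred from the dotted field names in the table, `check_config` and `lookup` are hypothetical helpers rather than part of WavFlow, and every value in `example` is a placeholder rather than a documented default.

```python
# Model names listed in the table above.
KNOWN_MODELS = {"medium_16k", "medium_44k", "large_16k", "large_44k"}

# Dotted field paths from the table; the nesting is an assumption of this sketch.
REQUIRED_FIELDS = [
    "data.csv_path",
    "model.name",
    "model.ckpt_path",
    "model.use_ema",
    "inference.duration_sec",
    "output.output_dir",
]


def lookup(config, dotted):
    """Follow a dotted path such as 'model.name' through nested dicts."""
    node = config
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(dotted)
        node = node[key]
    return node


def check_config(config):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for field in REQUIRED_FIELDS:
        try:
            lookup(config, field)
        except KeyError:
            problems.append(f"missing field: {field}")
    name = config.get("model", {}).get("name")
    if name is not None and name not in KNOWN_MODELS:
        problems.append(f"unknown model.name: {name}")
    return problems


# Placeholder values for illustration only.
example = {
    "data": {"csv_path": "/abs/path/input.csv"},
    "model": {"name": "medium_16k", "ckpt_path": "/abs/path/checkpoint.pth", "use_ema": False},
    "inference": {"duration_sec": 8.0},
    "output": {"output_dir": "/abs/path/out"},
}
print(check_config(example))
```

If PyYAML is installed, the dictionary can be obtained with `yaml.safe_load` on the opened config file. The check does not verify that the sample rate or duration matches the model architecture, which still has to be confirmed against the training setup.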

CSV semantics

video_exist=0 → uses learned empty CLIP/Sync tokens (no video decode)
text_exist=0 → uses learned empty CLIP-text token (caption ignored)
Optional id column; otherwise the wav file name is derived from Path(video_path).stem, falling back to row_<idx> for text-only rows
Captions with commas must be quoted
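
The rules above can be sketched as a small reader that reports, for each row, which modalities are active and what the output file stem would be. This illustrates the described behavior and is not the loader's actual code: `output_stem` and `modalities` are hypothetical names, and whether `<idx>` counts from 0 or 1 is an assumption (0 here).

```python
import csv
import io
from pathlib import Path


def output_stem(row, index):
    """Pick the wav stem: optional id column, else video file stem, else row_<idx>."""
    row_id = (row.get("id") or "").strip()
    if row_id:
        return row_id
    video_path = (row.get("video_path") or "").strip()
    if video_path:
        return Path(video_path).stem
    return f"row_{index}"  # assumption: zero-based row index


def modalities(row):
    """Report which conditioning inputs the row enables."""
    return {
        "video": row["video_exist"].strip() == "1",
        "text": row["text_exist"].strip() == "1",
    }


# The first caption contains a comma, so it is quoted.
sample = '''video_path,caption,video_exist,text_exist
/abs/path/sample1.mp4,"a rocket whistles, then explodes",1,1
,birds chirping in a forest,0,1
/abs/path/sample3.mp4,,1,0
'''

rows = list(csv.DictReader(io.StringIO(sample)))
stems = [output_stem(row, i) for i, row in enumerate(rows)]
flags = [modalities(row) for row in rows]
for stem, flag in zip(stems, flags):
    print(stem, flag)
```

Because two rows that point at the same video would map to the same stem, adding an explicit `id` column is the safer choice when a video is paired with several captions.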

EMA caveat

The EMA weights stored as model_ema1 are updated with ema_decay = 0.9999 per step, so they change slowly. After only a few hundred or a few thousand steps they are still dominated by random-initialization values and produce noise at inference. When sampling from a short or overfit run, set model.use_ema: false, or pass an ema_epoch_*.pth saved after enough steps.
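
To get a feel for how many steps count as "enough", the arithmetic can be worked out directly. The sketch assumes the common update rule `ema = ema_decay * ema + (1 - ema_decay) * weights` with the EMA initialized from the randomly initialized model. Under that assumption, the initial values keep a weight of `ema_decay ** steps`. The function names are hypothetical.

```python
import math


def init_fraction(steps, ema_decay=0.9999):
    """Weight the initial (random) values still carry after `steps` EMA updates."""
    return ema_decay ** steps


def steps_until(fraction, ema_decay=0.9999):
    """Smallest number of updates after which the initial weight is at most `fraction`."""
    return math.ceil(math.log(fraction) / math.log(ema_decay))


for steps in (500, 1_000, 10_000, 100_000):
    print(f"{steps:>7} steps: {init_fraction(steps):.4f} of the EMA is still random init")

print("steps until init weight falls to 1%:", steps_until(0.01))
```

Under this assumption, roughly 90% of the EMA is still random initialization after 1,000 steps and about 37% after 10,000, which is consistent with the noisy output described above. The initial weight only falls to around 1% after several tens of thousands of steps.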

Training

For feature extraction and training (single-node and multi-node), see TRAINING.md.

Citation

@misc{zhou2026wavflowaudiogenerationwaveform,
      title={WavFlow: Audio Generation in Waveform Space}, 
      author={Feiyan Zhou and Luyuan Wang and Shoufa Chen and Zhe Wang and Zhiheng Liu and Yuren Cong and Xiaohui Zhang and Fanny Yang and Belinda Zeng},
      year={2026},
      eprint={2605.18749},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2605.18749}, 
}

Acknowledgements

WavFlow builds on the open-source community. We gratefully acknowledge:

MMAudio — multimodal audio generation
JiT — Just Image Transformer
Synchformer — audio-visual synchronization

License

The majority of WavFlow is licensed under CC-BY-NC 4.0. Portions of the project are vendored from third-party open source projects under their original license terms (MIT, Apache 2.0, CC BY-NC 4.0, and Stability AI Community License). See NOTICE.txt for the full per-component breakdown and license texts.

