

AV-Flow: Transforming Text to Audio-Visual Human-like Interactions

This is an implementation of AV-Flow.


Requirements

Install requirements: pip install -r requirements.txt

Install DRTK. Follow instructions from: https://github.com/facebookresearch/DRTK/

Install audio2photoreal: https://github.com/facebookresearch/audio2photoreal/

Put audio2photoreal in the same parent directory as this repo, e.g.:

    |-- /home/user/repos/
            |-- audio2photoreal/
            |-- avflow/
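
Presumably the code locates the audio2photoreal checkout relative to this repo. As a minimal sketch of how such a sibling path can be resolved in Python (the variable names are placeholders for illustration, not the repo's actual mechanism):

    import os
    import sys

    # Assumption for illustration: make the sibling audio2photoreal checkout importable.
    REPO_ROOT = os.path.dirname(os.path.abspath(__file__))            # e.g. /home/user/repos/avflow
    A2P_DIR = os.path.join(os.path.dirname(REPO_ROOT), "audio2photoreal")

    if A2P_DIR not in sys.path:
        sys.path.insert(0, A2P_DIR)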

Dataset

We use two datasets in this work.

Download the 50h dataset from https://www.meta.com/emerging-tech/codec-avatars/4d-talking-avatar. It is organized as follows:

    |-- /path/to/dataset/50h_dataset/
            |-- audio/
            |-- expression_codes/
            |-- headpose/
            |-- head_vae/
            |-- wav2vec2_fairseq_base_ls960_asr_ls960_vad100/
            |-- face_topo_small_neck.obj

Move face_topo_small_neck.obj under ./models/.
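
Before training, it can help to confirm that the layout matches the listing above. A small illustrative check (not part of the repo; the paths are placeholders):

    from pathlib import Path

    # Illustrative check (not part of the repo): verify the 50h dataset layout listed above.
    root = Path("/path/to/dataset/50h_dataset")
    expected = [
        "audio",
        "expression_codes",
        "headpose",
        "head_vae",
        "wav2vec2_fairseq_base_ls960_asr_ls960_vad100",
    ]
    for name in expected:
        print(f"{name}: {'ok' if (root / name).exists() else 'MISSING'}")

    # face_topo_small_neck.obj should end up under ./models/ in this repo after the move.
    print("models/face_topo_small_neck.obj:", Path("models/face_topo_small_neck.obj").exists())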

Audio2Photoreal:

Follow the instructions from Audio2Photoreal to download their dataset.

We release the ASR tokens and head VAE that we use in our work.

Place them under the corresponding subject directory:

    |-- /path/to/dataset/audio2photoreal/
            |-- PXB184/
                    |-- scene01_audio.wav
                    |-- scene01_body_pose.npy
                    |-- scene01_face_expression.npy
                    |-- scene01_missing_face_frames.npy
                    |-- ...
                    |-- wav2vec2_fairseq_base_ls960_asr_ls960_vad100/
                            |-- scene01.npy
                            |-- scene02.npy
                            |-- ...
                    |-- head_vae/
                            |-- iter-0500000.pt
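
To verify a subject's files after placing the released ASR tokens and head VAE, the per-scene arrays can be inspected with NumPy. This is an illustrative sketch (the file names follow the listing above; the printed shapes depend on the release):

    import numpy as np

    # Illustrative sketch: inspect per-scene files for one Audio2Photoreal subject.
    subject_dir = "/path/to/dataset/audio2photoreal/PXB184"
    scene = "scene01"

    body_pose = np.load(f"{subject_dir}/{scene}_body_pose.npy")
    face_expr = np.load(f"{subject_dir}/{scene}_face_expression.npy")
    tokens = np.load(f"{subject_dir}/wav2vec2_fairseq_base_ls960_asr_ls960_vad100/{scene}.npy")

    print("body pose:", body_pose.shape)
    print("face expression:", face_expr.shape)
    print("ASR tokens:", tokens.shape)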

Pre-trained Model

We release a pre-trained AV-Flow model for the 50h dataset. It is included in the downloaded dataset as av_flow_ckpt_50h.pt.

Renderer

The pre-trained personalized renderers are included in the downloaded datasets. For more details on Codec Avatars, please see the following repos:

https://github.com/facebookresearch/goliath/

https://github.com/facebookresearch/ca_body/

Vocoder

We use a pre-trained BigVGAN vocoder to get the audio signal from the generated spectrograms. We use the base version (bigvgan_base_22khz_80band). It is included in the downloaded dataset as ./bigvgan_base_22khz_80band/.

You can also download it from the official repo: https://github.com/NVIDIA/BigVGAN/.

We also include a fine-tuned version of BigVGAN on the 50h dataset: ./50h_dataset/bigvgan_base_22khz_80band_finetuned. You can use it by changing the corresponding path under DEFAULT_CONFIG in recipes/flow_matching_audiovisual.py.
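
The recipe picks up the vocoder from the path set in DEFAULT_CONFIG. For standalone experiments, a rough sketch of vocoding a mel spectrogram with the official BigVGAN package is shown below (usage adapted from the BigVGAN README; the Hugging Face model id and the dummy mel shape are assumptions):

    import torch
    import bigvgan  # from the official NVIDIA BigVGAN repo

    # Rough sketch (assumptions noted above): load the base 22 kHz / 80-band model.
    model = bigvgan.BigVGAN.from_pretrained("nvidia/bigvgan_base_22khz_80band", use_cuda_kernel=False)
    model.remove_weight_norm()
    model = model.eval()

    # Replace the dummy tensor with a generated spectrogram: [batch, 80 mel bands, frames].
    mel = torch.randn(1, 80, 200)
    with torch.inference_mode():
        wav = model(mel)  # [batch, 1, samples] at 22.05 kHz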

Training

The main script for training and inference is recipes/flow_matching_audiovisual.py. It includes a DEFAULT_CONFIG; adjust it according to the dataset you are using.

For the 50h dataset, it should be:

    DEFAULT_CONFIG = {
        ...
        # dataset name
        "dataset_name": "50h",
        ...
        # dataset config
        "dataset": {
            ...
            "body": False,
            "input_audio_sample_rate": 48000,
            "input_head_dim": 7,
            ...
        },
        ...
    }

For the Audio2Photoreal dataset, it should be:

    DEFAULT_CONFIG = {
        ...
        # dataset name
        "dataset_name": "Audio2Photoreal",
        "subject": "PXB184",  # subjects: ["PXB184", "RLW104", "TXB805", "GQS883"]
        ...
        # dataset config
        "dataset": {
            ...
            "body": True,
            "input_audio_sample_rate": 48000,
            "input_head_dim": 4,
            ...
        },
        ...
    }
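
Instead of editing the file by hand, the documented overrides can also be applied to a copy of DEFAULT_CONFIG programmatically. A hypothetical helper (not part of the repo) that merges the Audio2Photoreal settings shown above:

    import copy

    # Hypothetical helper (not in the repo): merge overrides into a copy of DEFAULT_CONFIG.
    def with_overrides(default_config, overrides):
        cfg = copy.deepcopy(default_config)
        for key, value in overrides.items():
            if isinstance(value, dict) and isinstance(cfg.get(key), dict):
                cfg[key].update(value)  # shallow merge for nested sections like "dataset"
            else:
                cfg[key] = value
        return cfg

    a2p_overrides = {
        "dataset_name": "Audio2Photoreal",
        "subject": "PXB184",
        "dataset": {"body": True, "input_audio_sample_rate": 48000, "input_head_dim": 4},
    }
    # config = with_overrides(DEFAULT_CONFIG, a2p_overrides)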

Assuming the datasets are stored under /path/to/dataset/, the training command is:

python recipes/flow_matching_audiovisual.py -i /path/to/dataset/50h_dataset/ -o /path/to/output/dir/av_flow_50h/

This will train the basic AV-Flow model on the actor's data. See the "Dyadic Conversations" section for training in the dyadic setting.

Example using Audio2Photoreal dataset:

python recipes/flow_matching_audiovisual.py -i /path/to/dataset/audio2photoreal/PXB184/ -o /path/to/output/dir/av_flow_PXB184/
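
To train one model per Audio2Photoreal subject, the same command can be scripted. An illustrative launcher (not part of the repo; note that "subject" in DEFAULT_CONFIG also needs to match each run):

    import subprocess

    # Illustrative launcher (not part of the repo): train one model per subject.
    subjects = ["PXB184", "RLW104", "TXB805", "GQS883"]
    for subject in subjects:
        subprocess.run(
            [
                "python", "recipes/flow_matching_audiovisual.py",
                "-i", f"/path/to/dataset/audio2photoreal/{subject}/",
                "-o", f"/path/to/output/dir/av_flow_{subject}/",
            ],
            check=True,
        )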

Inference

Inference can be run from audio, text, or tokens.

Example:

python recipes/flow_matching_audiovisual.py -i /path/to/dataset/50h_dataset/ -o /path/to/output/dir/av_flow_50h/ -c /path/to/output/dir/av_flow_50h/checkpoints/iter-1000000.pth --audio /path/to/audio/filename.wav

To infer from audio: --audio /path/to/audio/filename.wav

To infer from text: --text "In the quiet morning, the birds chirp happily, filling the air with their sweet melodies."

To infer from tokens: --tokens /path/to/tokens/filename.npy

The results (3 generated audio samples and corresponding video) are saved under /path/to/output/dir/results/viz/.

If audio is provided, we extract tokens using an ASR model (Wav2Vec2). If you want to save the extracted tokens from multiple audio files, you can use utils/extract_asr.py (see "Extract tokens from audio" below).

If text is provided, we predict tokens from our pre-trained text-to-tokens model.

Extract tokens from audio

Assuming there are .wav files under /path/to/audio/:

python utils/extract_asr.py -i /path/to/audio/

The extracted tokens are saved as .npy files under /path/to/audio/wav2vec2_fairseq_base_ls960_asr_ls960/.

Then you can use the inference command with --tokens /path/to/audio/wav2vec2_fairseq_base_ls960_asr_ls960/filename.npy.
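
Purely as an illustration of what this step does (this is not the repo's utils/extract_asr.py, and the Hugging Face model id below is an assumption), frame-level Wav2Vec2 token ids can be obtained roughly like this:

    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # Hedged sketch: frame-level ASR token ids from a wav file with a Wav2Vec2 CTC model.
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

    wav, sr = torchaudio.load("/path/to/audio/filename.wav")
    wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)  # mono, 16 kHz

    inputs = processor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.inference_mode():
        logits = model(inputs.input_values).logits  # [1, frames, vocab]
    token_ids = logits.argmax(dim=-1)               # per-frame CTC token ids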

Dyadic Conversations

We additionally provide expression codes and head poses extracted from the monocular video of the participant, using SMIRK.

The corresponding features are stored under the downloaded dataset: /path/to/dataset/50h_dataset/participant/.

To train the model in the dyadic setting (conversations between actor and participant), set the following in the DEFAULT_CONFIG in recipes/flow_matching_audiovisual.py:

    DEFAULT_CONFIG = {
        ...
        "participant_cond": True,
        "other_dim": 58 + 29,  # 58 SMIRK features and 29 ASR tokens
        ...
    }

In this way, the model will be additionally conditioned on the participant's tokens, expressions, and head poses, leading to an always-on avatar that actively listens.
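
As a small illustration of the dimensionality implied by other_dim = 58 + 29 (the array names and frame count below are placeholders, not the repo's variable names), the per-frame participant conditioning concatenates the SMIRK features with the ASR token features:

    import numpy as np

    # Illustrative sketch: 58 SMIRK dims + 29 ASR token dims per participant frame.
    num_frames = 100                               # placeholder
    smirk_features = np.zeros((num_frames, 58))    # participant expression codes + head pose (SMIRK)
    asr_tokens = np.zeros((num_frames, 29))        # participant ASR token features
    participant_cond = np.concatenate([smirk_features, asr_tokens], axis=-1)
    assert participant_cond.shape[-1] == 58 + 29   # matches "other_dim" in DEFAULT_CONFIG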

Dyadic Inference

Download the participant images from:

Move them under: /path/to/dataset/50h_dataset/participant/camera01/images/

To visualize the results, run the following:

python recipes/flow_matching_audiovisual.py -i /path/to/dataset/50h_dataset/ -o /path/to/output/dir/av_flow_50h_dyadic/ -c /path/to/output/dir/av_flow_50h_dyadic/checkpoints/iter-1000000.pth --test

The results (3 generated audio samples and corresponding videos) are saved under /path/to/output/dir/test/segment/viz/.

License

The code and dataset are released under the CC-NC 4.0 International license.

Citation

If you find our work useful, please cite our paper:

@InProceedings{Chatziagapi_2025_ICCV,
    author    = {Chatziagapi, Aggelina and Morency, Louis-Philippe and Gong, Hongyu and Zollh\"ofer, Michael and Samaras, Dimitris and Richard, Alexander},
    title     = {AV-Flow: Transforming Text to Audio-Visual Human-like Interactions},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {14270-14282}
}

Acknowledgements

We would like to acknowledge the following repos:
