
Intelligible Lip-to-Speech Synthesis with Speech Units

Official PyTorch implementation for the following paper:

Intelligible Lip-to-Speech Synthesis with Speech Units
Jeongsoo Choi, Minsu Kim, Yong Man Ro
Interspeech 2023
[Paper] [Project]

Installation

conda create -y -n lip2speech python=3.10
conda activate lip2speech

git clone -b main --single-branch https://github.com/choijeongsoo/lip2speech-unit.git
cd lip2speech-unit

pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt

git clone https://github.com/facebookresearch/fairseq.git
cd fairseq
git checkout afc77bd
pip install -e ./
cd ..
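
A quick sanity check of the environment (a minimal sketch; it only verifies that the pinned PyTorch build and the editable fairseq install are importable):

    import torch, torchaudio, fairseq

    # Expect 1.13.1+cu116 / 0.13.1 and True on a CUDA machine
    print(torch.__version__, torchaudio.__version__, torch.cuda.is_available())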

Data Preparation

Video and Audio

  • ${ROOT}/datasets/${DATASET}/audio for processed audio files
  • ${ROOT}/datasets/${DATASET}/video for processed video files
  • ${ROOT}/datasets/${DATASET}/label/*.tsv for training manifests
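
For reference, the expected layout looks roughly like this (the manifest file names under label/ are assumptions; use whatever splits your setup defines):

    ${ROOT}/datasets/${DATASET}/
    ├── audio/          # processed audio files
    ├── video/          # processed video files
    └── label/
        ├── train.tsv   # training manifests (names assumed)
        ├── valid.tsv
        └── test.tsv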

Speech Units

  • Features from the 6th layer of HuBERT Base, quantized with a 200-cluster k-means model (KM200)
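
A minimal sketch of extracting such units with torchaudio's HuBERT Base model and a pre-trained k-means quantizer (the k-means checkpoint and audio paths are placeholders; the repository's own tooling may differ in detail):

    import joblib
    import torch
    import torchaudio

    bundle = torchaudio.pipelines.HUBERT_BASE
    hubert = bundle.get_model().eval()
    kmeans = joblib.load("km200.bin")  # placeholder path to a joblib-saved 200-cluster k-means model

    wav, sr = torchaudio.load("datasets/lrs3/audio/example.wav")  # placeholder file
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

    with torch.inference_mode():
        features, _ = hubert.extract_features(wav, num_layers=6)  # one tensor per transformer layer
        layer6 = features[-1].squeeze(0)  # (T, 768): 6th-layer frame features

    units = kmeans.predict(layer6.numpy())  # (T,) discrete unit ids in [0, 200)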

Speaker Embedding

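The speaker encoder is not documented here. As one illustrative stand-in (not necessarily the method used in this work), an utterance-level d-vector can be extracted with Resemblyzer:

    from resemblyzer import VoiceEncoder, preprocess_wav

    encoder = VoiceEncoder()
    wav = preprocess_wav("datasets/lrs3/audio/example.wav")  # placeholder file
    speaker_embedding = encoder.embed_utterance(wav)  # 256-dim numpy array
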
Mel-spectrogram

  • config

    filter_length: 640
    hop_length: 160
    win_length: 640
    n_mel_channels: 80
    sampling_rate: 16000
    mel_fmin: 0.0
    mel_fmax: 8000.0
    
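A minimal sketch of computing mel-spectrograms with these settings using torchaudio (log compression and normalization may differ from the repository's exact implementation):

    import torch
    import torchaudio

    mel_transform = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000,
        n_fft=640,        # filter_length
        win_length=640,
        hop_length=160,
        f_min=0.0,
        f_max=8000.0,
        n_mels=80,
    )

    wav, sr = torchaudio.load("datasets/lrs3/audio/example.wav")  # placeholder file
    mel = mel_transform(wav)  # (1, 80, T)
    log_mel = torch.log(torch.clamp(mel, min=1e-5))  # common log compression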

We provide sample data in the 'datasets/lrs3' directory.

Model Checkpoints

Lip Reading Sentences 3 (LRS3)

1st stage                             2nd stage                            STOI   ESTOI  PESQ  WER(%)
Multi-target Lip2Speech               Multi-input Vocoder                  0.552  0.354  1.31  50.4
Multi-target Lip2Speech               Multi-input Vocoder + augmentation   0.543  0.351  1.28  50.2
Multi-target Lip2Speech + AV-HuBERT   Multi-input Vocoder + augmentation   0.578  0.393  1.31  29.8

Lip Reading Sentences 2 (LRS2)

1st stage                             2nd stage                            STOI   ESTOI  PESQ  WER(%)
Multi-target Lip2Speech               Multi-input Vocoder
Multi-target Lip2Speech               Multi-input Vocoder + augmentation   0.565  0.395  1.32  44.8
Multi-target Lip2Speech + AV-HuBERT   Multi-input Vocoder + augmentation   0.585  0.412  1.34  35.7

We use the pre-trained AV-HuBERT Large (LRS3 + VoxCeleb2 (En)) model available from here.

For inference, download the checkpoints and place them in the 'checkpoints' directory.

Training

scripts/${DATASET}/train.sh

Run this script from within each of the 'multi_target_lip2speech' (1st stage) and 'multi_input_vocoder' (2nd stage) directories.

Inference

scripts/${DATASET}/inference.sh

Run this script from within each of the 'multi_target_lip2speech' (1st stage) and 'multi_input_vocoder' (2nd stage) directories.

Acknowledgement

This repository is built using Fairseq, AV-HuBERT, ESPnet, and speech-resynthesis. We thank the authors of these projects for open-sourcing their code.

Citation

If our work is useful for your research, please cite the following paper:

@article{choi2023intelligible,
      title={Intelligible Lip-to-Speech Synthesis with Speech Units},
      author={Jeongsoo Choi and Minsu Kim and Yong Man Ro},
      journal={arXiv preprint arXiv:2305.19603},
      year={2023},
}
