
UTUT

Official PyTorch implementation for the following paper:

Textless Unit-to-Unit Pre-training for Many-to-Many Multimodal-to-Speech Machine Translation by Learning Unified Speech and Text Representations
Minsu Kim*, Jeongsoo Choi*, Dahun Kim, Yong Man Ro
[Demo]

Setup

Python >=3.7,<3.11

git clone -b main --single-branch https://github.com/choijeongsoo/utut
cd utut
git submodule init
git submodule update
pip install -e fairseq
pip install -r requirements.txt
apt-get install espeak
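
To sanity-check the installation (an optional step, not part of the official instructions), verify that the editable fairseq package and the espeak binary are visible:

python -c "import fairseq; print(fairseq.__version__)"
espeak --version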

Model Checkpoints

Speech to Unit Quantization (mHuBERT encoder and k-means model, passed via --mhubert-path and --kmeans-path)

Unit to Unit Translation (UTUT) (passed via --utut-path)

Unit to Speech Synthesis (unit-based vocoder, passed via --vocoder-path and --vocoder-cfg-path)

Inference

UTUT is pre-trained on VoxPopuli and mTEDx, where a large portion of the data comes from European Parliament events.
Before using the pre-trained model, please consider whether the data domain you want to apply it to matches this training domain.

Pipeline for Speech-to-Speech Translation (STS)

$ cd utut
$ PYTHONPATH=fairseq python inference_sts.py \
  --in-wav-path samples/en/1.wav samples/en/2.wav samples/en/3.wav \
  --out-wav-path samples/es/1.wav samples/es/2.wav samples/es/3.wav \
  --src-lang en --tgt-lang es \
  --mhubert-path /path/to/mhubert_base_vp_en_es_fr_it3.pt \
  --kmeans-path /path/to/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin \
  --utut-path /path/to/utut_sts.pt \
  --vocoder-path /path/to/vocoder_es.pt \
  --vocoder-cfg-path /path/to/config_es.json
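
For reference, the speech-to-unit stage maps mHuBERT layer-11 features to 1000 discrete units, per the checkpoint filenames above. Below is a minimal sketch of that quantization step, assuming the fairseq convention of a joblib-pickled k-means model (an assumption based on the filename; inference_sts.py handles this internally):

import joblib
import numpy as np

# Assumed joblib-pickled k-means model (1000 clusters over mHuBERT layer-11
# features, per the checkpoint filename); not the repository's actual API.
km = joblib.load("/path/to/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin")

def quantize(features: np.ndarray) -> np.ndarray:
    """Map frame-level features of shape (T, D) to discrete unit IDs of shape (T,)."""
    return km.predict(features)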

Pipeline for Text-to-Speech Synthesis (TTS)

$ cd utut
$ PYTHONPATH=fairseq python inference_tts.py \
  --in-txt-path samples/en/a.txt samples/en/b.txt samples/en/c.txt \
  --out-wav-path samples/en/a.wav samples/en/b.wav samples/en/c.wav \
  --src-lang en --tgt-lang en \
  --utut-path /path/to/utut_tts.pt \
  --vocoder-path /path/to/vocoder_en.pt \
  --vocoder-cfg-path /path/to/config_en.json
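
Text input must be converted to a token sequence before unit-to-unit translation; the espeak dependency in Setup suggests a phoneme front-end. Below is a hedged sketch using the phonemizer package (an illustrative assumption, not necessarily what inference_tts.py does; see that script for the actual text processing):

from phonemizer import phonemize

# Phonemize raw text with the espeak backend (illustrative only; the
# repository's own front-end may tokenize differently).
text = open("samples/en/a.txt").read().strip()
phones = phonemize(text, language="en-us", backend="espeak")
print(phones)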

Pipeline for Text-to-Speech Translation (TTST)

$ cd utut
$ PYTHONPATH=fairseq python inference_tts.py \
  --in-txt-path samples/en/a.txt samples/en/b.txt samples/en/c.txt \
  --out-wav-path samples/es/a.wav samples/es/b.wav samples/es/c.wav \
  --src-lang en --tgt-lang es \
  --utut-path /path/to/utut_ttst.pt \
  --vocoder-path /path/to/vocoder_es.pt \
  --vocoder-cfg-path /path/to/config_es.json

19 source languages: en (English), es (Spanish), fr (French), it (Italian), pt (Portuguese), el (Greek), ru (Russian), cs (Czech), da (Danish), de (German), fi (Finnish), hr (Croatian), hu (Hungarian), lt (Lithuanian), nl (Dutch), pl (Polish), ro (Romanian), sk (Slovak), and sl (Slovene)

6 target languages: en (English), es (Spanish), fr (French), it (Italian), de (German), and nl (Dutch)
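
Any of the 19 source languages can be paired with any of the 6 target languages. Below is a small validity check built from the lists above (the helper is illustrative, not part of the repository):

SRC_LANGS = {"en", "es", "fr", "it", "pt", "el", "ru", "cs", "da", "de",
             "fi", "hr", "hu", "lt", "nl", "pl", "ro", "sk", "sl"}
TGT_LANGS = {"en", "es", "fr", "it", "de", "nl"}

def is_supported(src: str, tgt: str) -> bool:
    return src in SRC_LANGS and tgt in TGT_LANGS

assert is_supported("ru", "es")      # Russian is a valid source, Spanish a valid target
assert not is_supported("en", "ru")  # Russian is not among the 6 target languages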

Acknowledgement

This repository is built upon Fairseq and speech-resynthesis. We appreciate the authors of these projects for open-sourcing their work.

Citation

If our work is useful for your research, please cite the following paper:

@article{kim2023many,
    title={Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation},
    author={Minsu Kim and Jeongsoo Choi and Dahun Kim and Yong Man Ro},
    journal={arXiv preprint arXiv:2308.01831},
    year={2023}
}
