Skip to content

formiel/speech-translation

Repository files navigation

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

This is the codebase for the paper Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation (COLING 2020, Oral presentation).

News

  • 02/11/2020: First release, with training recipes and pre-trained models.

Table of Contents

  1. Pre-trained models
  2. Dependencies
  3. Data preparation
  4. Training
  5. Decoding
  6. References

1. Pre-trained models

Pre-trained models are available for download in the links below. This table shows the performance on the MuST-C test set. To replicate the results, please follow Section 5 Decoding.

type side self src merge epochs WER BLEU de es fr it nl pt ro ru
Inaguma et al. [1] 50 12.0 25.05 22.91 27.96 32.69 23.75 27.43 28.01 21.90 15.75
Gangi et al. [2] - - 17.70 20.90 26.50 18.00 20.00 22.60 - -
Gangi et al. [2] - 17.55 16.50 18.90 24.50 16.20 17.80 20.80 15.90 9.80
Link independent++ 25 11.6 24.60 22.82 27.20 32.11 23.34 26.67 28.98 21.37 14.34
Link par both ✔️ ✔️ concat 25 11.6 25.00 22.74 27.59 32.86 23.50 26.97 29.51 21.94 14.88
Link parR3 both - ✔️ sum 25 11.6 24.87 22.84 27.92 32.12 23.61 27.29 29.48 21.16 14.50
Link par++ both - ✔️ sum 25 11.4 25.62 23.63 28.12 33.45 24.18 27.55 29.95 22.87 15.21

[1] Inaguma et al., 2020. Espnet-st: All-in-one speech translation toolkit. (Bilingual one-to-one models)

[2] Gangi et al., 2019. One-to-many multilingual end-to-end speech translation.

2. Dependencies

You will need PyTorch, Kaldi, and ESPNet. In the sequel, it is assumed that you are already inside a virtual environment with PyTorch installed (together with necessary standard Python packages), and that $WORK is your working directory.

Note that the instructions here are different from the ones on the official ESPNet repo (they install a miniconda virtual environment that will be activated each time you run an ESPNet script).

Kaldi

Clone the Kaldi repo:

cd $WORK
git clone https://github.com/kaldi-asr/kaldi.git

The following commands may require other dependencies, please install them accordingly.

Check and make its dependencies:

cd $WORK/kaldi/tools
bash extras/check_dependencies.sh
touch python/.use_default_python
make -j$(nproc)

Build Kaldi, replace the MKL paths with your system's ones:

cd $WORK/kaldi/src
./configure --shared \
    --use-cuda=no \
    --mkl-root=/some/path/linux/mkl \
    --mkl-libdir=/some/path/linux/mkl/lib/intel64_lin
make depend -j$(nproc)
make -j$(nproc)

Important: After installing Kaldi, make sure there's no kaldi/tools/env.sh and no kaldi/tools/python/python, otherwise there will be an error (no module sentencepiece) when running ESPNet.

ESPNet

Clone this repo:

cd $WORK
git clone https://github.com/formiel/speech-translation.git

Prepare the dependencies:

cd $WORK/speech-translation
ln -s $WORK/kaldi tools/kaldi
pip install .
cd tools
git clone https://github.com/moses-smt/mosesdecoder.git moses

If you prefer to install it in editable mode, then replace the pip install line with

pip install --user . && pip install --user -e .

3. Data preparation

  1. Run the following command to process features and prepare data in json format.
bash run.sh --ngpu 0 \
            --stage 0 \
            --stop-stage 2 \
            --must-c $MUSTC_ROOT \
            --tgt-langs de_es_fr_it_nl_pt_ro_ru \
            --nbpe 8000 \
            --nbpe_src 8000

where MUSTC_ROOT is directory where you save raw MuST-C data.

  1. Create symlinks so that the processed data is saved in the required strutured for training.
python tools/create_symlinks.py --input-dir /path/to/dump \
                                --output-dir ${DATA_DIR} \
                                --use-lid --use-joint-dict

where ${DATA_DIR} is the path to the data folder for training. Its structure is as below.

${DATA_DIR}
└──${tgt_langs}
    └──use_dict1
    |    └──src8000_tgt8000
    |    |   └──train_sp
    |    |        └──en-de.json
    |    |        └──en-es.json
    |    |        └──...
    |    |    └──dev
    |    |        └──en-de.json
    |    |        └──en-es.json
    |    |        └──...
    |    |    └──tst-COMMON
    |    |        └──en-de.json
    |    |        └──en-es.json
    |    |        └──...
    |    |    └──tst-HE
    |    |        └──en-de.json
    |    |        └──en-es.json
    |    |        └──...
    |    |    └──lang_1spm
    |    |        train_sp.en-${tgt_langs}_bpe8000_tc_${suffix}.txt
    |    └──src32000_tgt32000
    |        ....
    └──use_dict2
         └──src8000_tgt8000
         └──src8000_tgt32000
         ....

In which, ${tgt_langs} is the target languages separated by _. For example, for a model trained on 8 languages, ${tgt_langs} is de_es_fr_it_nl_pt_ro_ru.

4. Training

The training configurations are saved in ./conf/training.

Please run the following command to train or resume training. The training will be automatically resumed from the last checkpoints in the exp/${tag}/results folder if this folder exists (and there are checkpoints of the format snapshot.iter.${NUM_ITER} in it), where ${tag} is the name tag of the experiment and ${NUM_ITER} is the iteration number. If exp/${tag}/results folder does not exist, the model will be trained from scratch (the weights is initialized using the pre-trained weights provided).

bash run.sh --stage 4 --stop-stage 4 --ngpu ${ngpu} \
            --preprocess-config ./conf/specaug.yaml \
            --datadir ${DATA_DIR} \
            --tgt-langs ${tgt_langs} \
            --tag ${tag}

where

  • ${ngpu}: number of GPUs to be used for training. Training on multi-node is currently not supported.
  • ${DATA_DIR}: path to the input folder (as described above).
  • ${tag}: name of the training configuration file (without .yaml extension).
  • ${tgt_langs}: the target languages separated by _ (as described above).

The checkpoints are saved in ./exp/${tag}/results, and the tensorboard is saved in ./tensorboard/${tag}.

5. Decoding

The decoding configurations are saved in ./conf/decoding.

Please run the following command for decoding.

bash run.sh --stage 5 --stop-stage 5 --ngpu 0 \
            --preprocess-config ./conf/specaug.yaml \
            --datadir ${DATA_DIR} \
            --tgt-langs ${tgt_langs} \
            --decode-config ${decode_config} \
            --trans-set ${trans_set} \
            --trans-model ${trans_model} \
            --tag ${tag}

where

  • ${DATA_DIR}, ${tgt_langs}, and ${tag} are same parameters as described above.
  • ${trans_set}: datasets to be decoded, seperated by space, e.g. tst-COMMON.en-de.de tst-COMMON.en-fr.fr, etc. If this value is an empty string, then the default datasets for decoding are tst-COMMON and tst-HE sets of all target languages.

6. References

If you find the resources in this repository useful, please cite the following paper:

@inproceedings{le2020dualdecoder,
    title       = {Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation},
    author      = {Le, Hang and Pino, Juan and Wang, Changhan and Gu, Jiatao and Schwab, Didier and Besacier, Laurent},
    booktitle   = {Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020},
    publisher   = {Association for Computational Linguistics}
    year        = {2020}
}

This repo is a fork of ESPNet. You should consider citing their papers as well if you use this code.

About

Multilingual speech translation

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published