Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

This is the codebase for the paper Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation (COLING 2020, Oral presentation).


  • 02/11/2020: First release, with training recipes and pre-trained models.

Table of Contents

  1. Pre-trained models
  2. Dependencies
  3. Data preparation
  4. Training
  5. Decoding
  6. References

1. Pre-trained models

Pre-trained models can be downloaded from the links below. To replicate the results, follow Section 5 (Decoding).

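| | type | side | self | src | merge | epochs | WER | BLEU | de | es | fr | it | nl | pt | ro | ru |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Inaguma et al. [1] | | | | | | 50 | 12.0 | 25.05 | 22.91 | 27.96 | 32.69 | 23.75 | 27.43 | 28.01 | 21.90 | 15.75 |
| Gangi et al. [2] | | | | | | | - | - | 17.70 | 20.90 | 26.50 | 18.00 | 20.00 | 22.60 | - | - |
| Gangi et al. [2] | | | | | | | - | 17.55 | 16.50 | 18.90 | 24.50 | 16.20 | 17.80 | 20.80 | 15.90 | 9.80 |
| Link | independent++ | | | | | 25 | 11.6 | 24.60 | 22.82 | 27.20 | 32.11 | 23.34 | 26.67 | 28.98 | 21.37 | 14.34 |
| Link | par | both | ✔️ | ✔️ | concat | 25 | 11.6 | 25.00 | 22.74 | 27.59 | 32.86 | 23.50 | 26.97 | 29.51 | 21.94 | 14.88 |
| Link | parR3 | both | - | ✔️ | sum | 25 | 11.6 | 24.87 | 22.84 | 27.92 | 32.12 | 23.61 | 27.29 | 29.48 | 21.16 | 14.50 |
| Link | par++ | both | - | ✔️ | sum | 25 | 11.4 | 25.62 | 23.63 | 28.12 | 33.45 | 24.18 | 27.55 | 29.95 | 22.87 | 15.21 |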

[1] Inaguma et al., 2020. ESPnet-ST: All-in-one speech translation toolkit. (Bilingual one-to-one models)

[2] Gangi et al., 2019. One-to-many multilingual end-to-end speech translation.
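
The BLEU column appears to be the unweighted mean of the eight per-language scores. This is an inference from the numbers, not something stated in this README. A minimal Python sketch of that check, using values copied from the results table (the function name `mean_bleu` is only illustrative):

```python
# Per-language BLEU (de, es, fr, it, nl, pt, ro, ru) and the reported BLEU value,
# copied from the results table. Reading "BLEU" as the mean is an assumption.
rows = {
    "Inaguma et al. [1]": ([22.91, 27.96, 32.69, 23.75, 27.43, 28.01, 21.90, 15.75], 25.05),
    "independent++": ([22.82, 27.20, 32.11, 23.34, 26.67, 28.98, 21.37, 14.34], 24.60),
    "par++": ([23.63, 28.12, 33.45, 24.18, 27.55, 29.95, 22.87, 15.21], 25.62),
}


def mean_bleu(scores):
    """Unweighted mean of per-language BLEU scores."""
    return sum(scores) / len(scores)


for name, (scores, reported) in rows.items():
    # Tolerance allows for the two-decimal rounding used in the table.
    assert abs(mean_bleu(scores) - reported) < 0.01, name
    print(f"{name}: mean={mean_bleu(scores):.2f} reported={reported:.2f}")
```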

2. Dependencies

You will need PyTorch, Kaldi, and ESPNet. In what follows, it is assumed that you are already inside a virtual environment with PyTorch installed (together with the necessary standard Python packages), and that $WORK is your working directory.

Note that the instructions here are different from the ones on the official ESPNet repo (they install a miniconda virtual environment that will be activated each time you run an ESPNet script).
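
Before building anything, it can help to confirm that the active interpreter matches the assumption above. The following is a minimal sketch, not part of this repo; the helper names `in_virtualenv` and `has_module` are hypothetical, and the check only recognizes venv/virtualenv environments (a conda environment is not detected this way).

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """True when running inside a venv/virtualenv (conda envs are not detected)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def has_module(name: str) -> bool:
    """True when a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("virtual environment:", in_virtualenv())
    print("torch importable:", has_module("torch"))
```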


Clone the Kaldi repo:

cd $WORK
git clone

The following commands may require additional dependencies; install them as needed.

Check and make its dependencies:

cd $WORK/kaldi/tools
bash extras/
touch python/.use_default_python
make -j$(nproc)

Build Kaldi, replacing the MKL paths with the ones on your system:

cd $WORK/kaldi/src
./configure --shared \
    --use-cuda=no \
    --mkl-root=/some/path/linux/mkl
make depend -j$(nproc)
make -j$(nproc)

Important: After installing Kaldi, make sure there's no kaldi/tools/ and no kaldi/tools/python/python, otherwise there will be an error (no module sentencepiece) when running ESPNet.


Clone this repo:

cd $WORK
git clone

Prepare the dependencies:

cd $WORK/speech-translation
ln -s $WORK/kaldi tools/kaldi
pip install .
cd tools
git clone moses

If you prefer to install it in editable mode, then replace the pip install line with

pip install --user . && pip install --user -e .

3. Data preparation

  1. Run the following command to process features and prepare data in json format.
bash --stage 0 \
            --stop-stage 2 \
            --ngpu 0 \
            --must-c ${must_c}

where ${must_c} is the directory containing the raw MuST-C data.

  2. Create symlinks so that the processed data is saved in the structure required for training.
python --output-dir ${DATA_DIR}

where ${DATA_DIR} is the path to the input folder for training. Its structure is as below.


Here, ${tgt_langs} is the list of target languages joined by underscores (_). For example, for a model trained on 8 languages, ${tgt_langs} is de_es_fr_it_nl_pt_ro_ru.
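
As an illustration of the naming convention, the sketch below builds and parses a ${tgt_langs} value in Python. The helper names are hypothetical, and whether the scripts require a particular language order is not stated here; the order simply follows the example above.

```python
def join_langs(langs):
    """Build a ${tgt_langs} string, e.g. ["de", "es"] -> "de_es"."""
    return "_".join(langs)


def split_langs(tgt_langs):
    """Recover the individual language codes from a ${tgt_langs} string."""
    return tgt_langs.split("_")


eight = ["de", "es", "fr", "it", "nl", "pt", "ro", "ru"]
tgt_langs = join_langs(eight)
print(tgt_langs)
print(split_langs(tgt_langs))
```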

4. Training

The training configurations are saved in ./conf/training.

Run the following command to train or to resume training. If the exp/${tag}/results folder exists and contains checkpoints named snapshot.iter.${NUM_ITER}, training resumes automatically from the last checkpoint, where ${tag} is the name tag of the experiment and ${NUM_ITER} is the iteration number. If the exp/${tag}/results folder does not exist, the model is trained from scratch, with its weights initialized from the pre-trained weights provided.

bash --stage 4 --stop-stage 4 --ngpu ${ngpu} \
            --preprocess-config ./conf/specaug.yaml \
            --datadir ${DATA_DIR} \
            --tgt-langs ${tgt_langs} \
            --tag ${tag}


  • ${ngpu}: number of GPUs to be used for training. Multi-node training is currently not supported.
  • ${DATA_DIR}: path to the input folder (as described above).
  • ${tag}: name of the training configuration file (without .yaml extension).
  • ${tgt_langs}: the target languages separated by _ (as described above).

The checkpoints are saved in ./exp/${tag}/results, and the TensorBoard logs are saved in ./tensorboard/${tag}.
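
The resume rule described in this section can be sketched in Python as follows. This is not the repo's implementation, only an illustration of picking the checkpoint with the highest iteration number; the function names are hypothetical, and the config path ./conf/training/${tag}.yaml is an assumption derived from the description of ${tag}.

```python
import re
from pathlib import Path

# Checkpoint file names follow the pattern snapshot.iter.${NUM_ITER}.
SNAPSHOT_RE = re.compile(r"^snapshot\.iter\.(\d+)$")


def latest_snapshot(results_dir):
    """Return the snapshot with the highest iteration number, or None."""
    results_dir = Path(results_dir)
    if not results_dir.is_dir():
        return None  # no results folder: training starts from scratch
    best = None
    for path in results_dir.iterdir():
        match = SNAPSHOT_RE.match(path.name)
        if match:
            iteration = int(match.group(1))  # numeric compare: 10 > 9
            if best is None or iteration > best[0]:
                best = (iteration, path)
    return None if best is None else best[1]


def experiment_paths(tag, root="."):
    """Paths associated with an experiment tag (config path is an assumption)."""
    root = Path(root)
    return {
        "config": root / "conf" / "training" / f"{tag}.yaml",
        "results": root / "exp" / tag / "results",
        "tensorboard": root / "tensorboard" / tag,
    }
```

Comparing iteration numbers as integers matters here: a plain string sort would rank snapshot.iter.9 after snapshot.iter.10.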

5. Decoding

The decoding configurations are saved in ./conf/decoding.

Please run the following command for decoding.

bash --stage 5 --stop-stage 5 --ngpu 0 \
            --preprocess-config ./conf/specaug.yaml \
            --datadir ${DATA_DIR} \
            --tgt-langs ${tgt_langs} \
            --decode-config ${decode_config} \
            --trans-set ${trans_set} \
            --trans-model ${trans_model} \
            --tag ${tag}


  • ${DATA_DIR}, ${tgt_langs}, and ${tag} are the same parameters as described above.
  • ${trans_set}: datasets to be decoded, separated by spaces. If this value is an empty string, the default datasets for decoding are the tst-COMMON and tst-HE sets of all target languages.
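
To make the default concrete, the sketch below enumerates the (split, language) pairs that would be decoded when ${trans_set} is empty. It is only an illustration: the function name is hypothetical, and the exact dataset names expected by the decoding script are not shown in this README, so no naming format is assumed.

```python
DEFAULT_SPLITS = ("tst-COMMON", "tst-HE")


def default_trans_sets(tgt_langs):
    """Enumerate (split, language) pairs decoded when ${trans_set} is empty."""
    return [
        (split, lang)
        for lang in tgt_langs.split("_")
        for split in DEFAULT_SPLITS
    ]


for split, lang in default_trans_sets("de_es_fr_it_nl_pt_ro_ru"):
    print(split, lang)
```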

6. References

If you find the resources in this repository useful, please cite the following paper:

    title       = {Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation},
    author      = {Le, Hang and Pino, Juan and Wang, Changhan and Gu, Jiatao and Schwab, Didier and Besacier, Laurent},
    booktitle   = {Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020},
    publisher   = {Association for Computational Linguistics},
    year        = {2020}

This repo is a fork of ESPNet. You should consider citing their papers as well if you use this code.