This repository contains the code for the EMNLP 2023 paper End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation by Juan Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, Rohit Paturi, Sundararajan Srinavasan, Prashant Mathur, Brian Thompson, and Marcello Federico.
@inproceedings{zuluaga-gomez-etal-2023-end,
title = "End-to-End Single-Channel Speaker-Turn Aware
Conversational Speech Translation",
author = "Zuluaga-Gomez, Juan and
Huang, Zhaocheng and
Niu, Xing and
Paturi, Rohit and
Srinavasan, Sundararajan and
Mathur, Prashant and
Thompson, Brian and
Federico, Marcello",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2311.00697",
}
mkdir -p ~/stac_st/
cd ~/stac_st/
git clone /this/repo/
cd ~/stac_st/
conda create -n stac_st python=3.11.3
conda activate stac_st
git clone https://github.com/speechbrain/speechbrain.git
cd speechbrain/
(we can use SpeechBrain v0.5.14, above versions may break the training at some point)
pip install -r ../stac-st/requeriments_file.txt
# then install SpeechBrain
pip install --editable .
For some reason, multi-GPU training does not work out of the box in EC2 machines. TO ALLOW THE SYSTEM TO DO DDP, YOU NEED TO DO FOLLOWING
1. vim ~/code/speechbrain/speechbrain/core.py
2. Go to line 321 and add the following line (only last line!):
# local_rank = None
# if "local_rank" in run_opts:
# local_rank = run_opts["local_rank"]
# else:
# if "LOCAL_RANK" in os.environ and os.environ["LOCAL_RANK"] != "":
# local_rank = int(os.environ["LOCAL_RANK"])
# run_opts["local_rank"] = local_rank %<------------- ADD THIS LINE --|
At this stage, you should be able to run SpeechBrain recipes with multiple GPUs
There are 2 aditional scripts/modules that do not come with SpeechBrain, but we provide in stac-st/modules/
:
stac-st/modules/TransformerMultiTask.py
: script to instantiate a Multitask Transformer model. It is simlar to theTransformerASR()
class of SpeechBrain.stac-st/modules/mutitask_decoder.py
: This sequence-to-sequence decoder allows the addition of special tokens (in our case, language tokens) as prompts before starting the decoding phase. This is not part of SpeechBrain, but the script is based on theS2SMultiTaskTransformerBeamSearch()
class.
Our scripts already load these modules directly.
The dataset folder is in datasets/*
. There is one folder per dataset. Feel free to modify them depending on your needs!
% # TODO: probably we need to make one script that prepares all datasets
All the datasets are prepared in JSON format, example below:
"sp_2101-B-061126-061245-st": {
"wav": "{data_root}/callhome-train/wav/sp_2101-B-061126-061245.wav",
"source_lang": "es",
"target_lang": "en",
"duration": 1.19,
"task": "translation",
"transcription": "y hoy no es el cuatro",
"translation_0": "and today is not the fourth",
"transcription_and_translation": "y hoy no es el cuatro\nand today is not the fourth"
},
To prepare this dataset go to datasets/fisher_callhome
. Here, you will find the preparation scripts, but first, download the data!
To donwload the Fisher-CALLHOME corpora, you need to have access to LDC. Download the "speech" data and the "transcripts" (that is why there are two links).
Fisher:
After the extraction, that folders needs to look like this:
/folder/LDC2010T04/fisher_spa/data/speech/
and /LDC2010T04/fisher_spa_tr/data/transcripts/
CALLHOME
After the extraction, that folders needs to look like this:
/folder/LDC96T17/ch_sp/callhome/spanish/speech/
and /folder/LDC96T17/callhome_spanish_trans_970711/transcrp/
When you have the folders structured as above. You can continue with preparation
The data preparation scripts are in:
- Data preparation (single-turn):
run_data_preparation.sh
- Data preparation (multi-turn):
run_data_preparation_turns.sh
- Segmentation with SHAS and WebRCT:
run_shas_segmentation.sh
For single-turn and multi-turn preparation run:
# first, open each file and set:
# ORIGINAL_DATA=/path/to/fisher/folder/
bash run_data_preparation.sh
bash run_data_preparation_turns.sh
For segmentation with SHAS and WebRCT:
# Open the script run_shas_segmentation.sh and set these two
fisher_data="/path/to/fisher/folder/"
callhome_data="/path/to/callhome/folder/"
# then
bash run_shas_segmentation.sh
At this point, you should have a tree like this:
.
├── audio_segmenter/
├── create_json_and_segment.py
├── data/
├── extra_requirements.txt
├── mask_wav_files.py
├── run_data_preparation.sh
├── run_data_preparation_turns.sh
├── run_shas_segmentation.sh
└── st_asr_task
To prepare this dataset go to datasets/common_voice_13
. Here, you will find the preparation scripts.
The following command downloads and prepare the CommonVoice data!
# YOU NEED TO ENTER THESE PATHS!
# data_folder="/path/to/folder/datasets/common_voice_13/cv-corpus-13.0-2023-03-09"
# save_folder="/path/to/folder/datasets/common_voice_13/data"
# then run:
bash run_prepare_commonvoice.sh
To prepare this dataset go to datasets/covost
. Here, you will find the preparation scripts.
The following command downloads and prepare the covost2 data!
IMPORTANT, YOU NEED TO FIRST PREPARE COMMONVOICE AS COVOST2 DEPENDS ON IT
# YOU NEED TO ENTER THESE PATHS!
# common_voice="/path/to/folder/datasets/common_voice_13/cv-corpus-13.0-2023-03-09"
covost_data="/path/to/folder/datasets/covost/splits"
# then run:
bash run_prepare_covost2.sh
To prepare this dataset go to datasets/mslt
. Here, you will find the preparation scripts.
The following command downloads and prepare the MSLT data!
# YOU NEED TO ENTER THESE PATHS!
# data_folder="/path/to/folder/datasets/mslt"
# save_folder="/path/to/folder/datasets/mslt/data"
# then run:
bash run_prepare_mslt.sh
A key part of STAC-ST is that it can use ASR or ST corpora from different datasets. Thus, in this step we merge in one JSON file the ASR and ST datasets i.e., CommonVoice + CoVoST2.
# YOU NEED TO SET THESE PATHS!
# datasets_folder=/path/to/folder/datasets
# mslt_dataset=${datasets_folder}/mslt/data/v_1/
# commonvoice=${datasets_folder}/common_voice_13/data/
# covost=${datasets_folder}/covost/splits/
# fisher_callhome=${datasets_folder}/fisher_callhome_spanish/data_processed/data
# then run:
bash prepare_cross_datasets.sh
The prepare_cross_datasets.sh
prepares:
- CV train/dev/test single- and multi-turn subsets
- CoVoST2 train/dev/test single- and multi-turn subsets
- merges all data, single- and multi-turn, CV and CommonVoice.
When the datasets are ready, we can start training.
Run the default model with:
bash run_default.sh
This trains a 5K BPE Tokenizer (exp/tokenizer_bpe_5k_es_en
) and a STAC-ST model (exp/stact-st/
). We train with single- and multi-turn Fisher-CALLHOME corpora.
Otherwise, you can run directly the following steps:
To train the tokenizer:
DATA_FOLDER="/path/to/dataset/fisher_callhome/data" # needs to be the folder that contains each subset
TOKENIZER="/path/to/tokenizer/"
lang_tokens=$(echo "[ES],[EN]")
python3 stac-st/train_tokenizer.py stac-st/hparams/train_bpe_5k_special_prefix.xx_to_xx.yaml \
--languages "'$lang_tokens'" \
--annotation_read "transcription_and_translation" \
--train_json_file "$DATA_FOLDER/fisher-callhome-train/data-asr-st.json" \
--output_folder "$TOKENIZER"
Train the model with:
DATA_FOLDER="/path/to/dataset/fisher_callhome/data # needs to be the folder that contains each subset"
TOKENIZER="/path/to/tokenizer/"
_TRAIN_SCRIPT="stac-st/train_multitask.py"
_hparams="stac-st/hparams/transformer_multitask.yaml"
# train split
_TRAIN_SPLIT="fisher-callhome-train-and-30s/data-turns-asr-st"
OMP_NUM_THREADS=40 torchrun \
--nproc_per_node=$n_jobs \
--nnodes=1 --node_rank=0 \
${_TRAIN_SCRIPT} ${_hparams} \
--distributed_launch --distributed_backend='nccl' \
--use_xt_token=True --use_turn_token=True \
--encoder_module=transformer \
--attention_type=regularMHA \
--train_splits=$_TRAIN_SPLIT \
--output_folder_name="./exp/stact-st/" \
--max_batch_len=500 \
--max_batch_len_val=250 \
--test_batch_size=1 \
--lr_adam=0.001 \
--transformer_dropout=0.1 \
--grad_accumulation_factor=8 \
--scheduler_step_limit=100000 \
--n_warmup_steps=10000 \
--cooldown=10000 \
--ctc_weight=0.3 \
--d_model=256 \
--d_ffn=1024 \
--nhead=4 \
--num_encoder_layers=12 \
--num_decoder_layers=6 \
--valid_search_interval=100 \
--data_folder "$DATA_FOLDER" \
--tokenizer_file $TOKENIZER/${units}_bpe.model
Training parameters, note that max_batch_len tells how many seconds of audio each batch has 500 == 500s == 8.33mins | i.e, max_batch_len=500.
The effective batch size is max_batch_len x grad_accumulation_factor, STAC-ST uses 500 x 8 (GPUs) == 4000s | i.e., grad_accumulation_factor=8
To train a model with out-of-domain corpora (CoVoST2 + CommonVoice), you need to modify the data paths and the hyperparameters scripts. First, you need to prepare CoVoST2 and CommonVoice corpora and then run datasets/cross_datasets_experiments/prepare_cross_datasets
.
For Tokenizer:
DATA_FOLDER="/path/to/dataset/cross_datasets_experiments/data_covost_mslt"
TOKENIZER="/path/to/tokenizer/with/cv-covost2"
... (remains the same)
For STAC-ST model:
DATA_FOLDER="/path/to/dataset/cross_datasets_experiments/data_covost_mslt"
TOKENIZER="/path/to/tokenizer/with/cv-covost2"
_TRAIN_SCRIPT="stac-st/train_multitask.py"
_hparams="stac-st/hparams/transformer_fisher_cv_xx_to_xx.yaml"
# define source and target locale (languages)
source_locale="es"
target_locale="en"
lang_pair_id="${source_locale}_to_${source_locale}.and.${target_locale}"
# train split (contains single- and multi-turn data)
# The last part of the path can be changed, check the folder in $DATA_FOLDER/${lang_pair_id}.
# We have prepared data with CoVoST2 data only, single-turn data only, for CV+CoVoST2.
# You can use different data configurations
TRAIN_SPLIT="${lang_pair_id}/train-and-30s.${source_locale}_${target_locale}"
... (remains the same)
You can evaluate a model by simply running the following code:
DATA_FOLDER="/path/to/dataset/fisher_callhome/data # needs to be the folder that contains each subset"
TOKENIZER="/path/to/tokenizer/"
_TRAIN_SCRIPT="stac-st/train_multitask.py"
_hparams="stac-st/hparams/transformer_fisher_cv_xx_to_xx.yaml"
# no_eval=False indicates to perform evaluation, (which is True by default, above)
# you need to set use_xt_token/use_turn_token to True/False, depending on what you set while training the model. It should match!
CUDA_VISIBLE_DEVICES=0 python3 \
${_TRAIN_SCRIPT} ${_hparams} \
--no_eval=False --num_workers=12 \
--use_xt_token=True --use_turn_token=True \
--encoder_module=transformer \
--attention_type=regularMHA \
--train_splits=$_TRAIN_SPLIT \
--output_folder_name="./exp/stact-st/" \
--max_batch_len=100 \
--max_batch_len_val=150 \
--test_batch_size=1 \
--lr_adam=0.001 \
--transformer_dropout=0.1 \
--grad_accumulation_factor=1 \
--scheduler_step_limit=100000 \
--n_warmup_steps=10000 \
--cooldown=10000 \
--ctc_weight=0.3 \
--d_model=256 \
--d_ffn=1024 \
--nhead=4 \
--num_encoder_layers=12 \
--num_decoder_layers=6 \
--valid_search_interval=100 \
--data_folder=$DATA_FOLDER \
--tokenizer_file=$TOKENIZER/${units}_bpe.model
Note that this needs to be run after the training has been completed. Defined by scheduler_step_limit
.
IMPORTANT: in _hparams="stac-st/hparams/transformer_fisher_cv_xx_to_xx.yaml"
, the evaluation sets are defined as:
test_splits_1_translations: [
!ref "<join>/covost_<source_locale>_<target_locale>_test-30s",
!ref "<join>/covost_<source_locale>_<target_locale>_dev-30s",
...
]
You can define your own evaluation datasets as:
test_splits_1_translations: [
!ref "<join>/covost_<source_locale>_<target_locale>_test-30s",
...
!ref "/path/to/eval/dataset/my_dummy_eval_set",
...
]
Our model is scaled to different sizes, you can do this by running:
# small size (default)
bash ablations/run_m_and_l_size.sh "small"
# large size
bash ablations/run_m_and_l_size.sh "medium"
# XL size
bash ablations/run_m_and_l_size.sh "large"
Note that this scrips uses CommonVoice and CoVoST2 corpora for training.
Here we list some details on how to perform the other ablations listed in the paper.
To use/remove the special tokens: [turn] and [xt], we have added two flags to the training scripts:
--use_turn_token=True/False
and
--use_xt_token=True/False
These two flags can be set to True or False.
CTC weight can be passed while training by setting
--ctc_weight=x.x
# where x.x goes from --> [0, 1)
You need to change:
- When training Tokenizer, change
lang_tokens=$(echo "[ES],[EN]")
tolang_tokens=$(echo "[DE],[EN]")
in case you want to do German-to-English ASR+ST. - Set
source_locale="de"
andtarget_locale="en"
when training STAC-ST.
For step #2, see more in section: 3. Train STAC-ST with Fisher-CALLHOME + CoVoST2 + CommonVoice
We benchmark STAC-ST in the following folder: evaluations/README.md.
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.