This repo contains the code for our paper "Semantic Mask for Transformer-based End-to-End Speech Recognition".
We provide a prebuilt, runnable Docker image. You can download and start it with:

```
docker run -it --volume-driver=nfs --shm-size=64G j4ckl1u/espnet-py36-img:latest /bin/bash
```
Regarding data preparation, we suggest following the ESPnet instructions. Note that ESPnet does not apply speed perturbation by default, but we strongly recommend it, since it improves performance on the dev-other and test-other sets.
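For reference, speed perturbation simply resamples each utterance at a few rates (typically 0.9x, 1.0x, and 1.1x) to augment the training data. The sketch below is a hypothetical pure-Python illustration of the idea; actual recipes use sox or Kaldi utilities on the audio files.

```python
def speed_perturb(samples, factor):
    """Resample by linear interpolation so playback is `factor`x faster."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# 3-way perturbation: the training set is tripled with slower/faster copies.
audio = [float(i) for i in range(100)]
augmented = {f: speed_perturb(audio, f) for f in (0.9, 1.0, 1.1)}
```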
To enable semantic mask training, you have to align the audio with its transcript at the word level. In our work, we use the alignment results released by this repo, which were obtained with the Montreal Forced Aligner. We put the extracted information in the data directory: start.txt and end.txt record the start and end frame of each word in each utterance.
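To illustrate how such alignments are consumed, the sketch below masks the full frame span of randomly selected words, rather than random time spans as in SpecAugment. The function name and the 15% mask rate are illustrative assumptions, not the exact training code.

```python
import random

def semantic_mask(features, starts, ends, mask_prob=0.15, rng=random):
    """Mask whole-word spans of acoustic features.

    features: list of frame vectors
    starts/ends: per-word start/end frame indices (as in start.txt/end.txt)
    """
    masked = [frame[:] for frame in features]  # copy, keep input intact
    for s, e in zip(starts, ends):
        if rng.random() < mask_prob:
            for t in range(s, e):
                masked[t] = [0.0] * len(masked[t])  # zero out the word's frames
    return masked

# Toy example: 10 frames, two words covering frames [0, 4) and [4, 10).
feats = [[1.0, 1.0] for _ in range(10)]
out = semantic_mask(feats, starts=[0, 4], ends=[4, 10], mask_prob=1.0,
                    rng=random.Random(0))
```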
For training, we upload our training configs into the configs folder, including a base setting and a large setting. Our architecture is similar to ESPnet's, but replaces the position embedding with a CNN in both the encoder and the decoder. The specific code changes can be found here
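The intuition behind replacing the position embedding with a convolution is that a conv kernel mixes each frame with its neighbours, so relative position is encoded implicitly. The pure-Python depthwise sketch below is illustrative only; the real model uses learned multi-channel convolutions.

```python
def conv_pos(x, kernel):
    """Add a same-padded 1-D convolution of x back onto x (one channel).

    x: list of scalars; kernel: odd-length list of weights.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(x):  # zero padding at the edges
                acc += w * x[idx]
        out.append(x[t] + acc)  # residual: input plus positional mixing
    return out

# Identity kernel [0, 1, 0]: each output is x[t] + x[t], i.e. doubled.
doubled = conv_pos([1.0, 2.0, 3.0], [0.0, 1.0, 0.0])
```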
For decoding, please first download the ESPnet pre-trained RNN language model, then run our decoding script to obtain the model output.
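Decoding uses shallow fusion, which scores each hypothesis by log P_am + λ · log P_lm. The sketch below shows the scoring rule on pre-computed log-probabilities; the weight value 0.6 is a hypothetical placeholder, since the actual decoding script sets its own LM weight.

```python
def fused_score(am_logprob, lm_logprob, lm_weight=0.6):
    """Shallow fusion: interpolate acoustic and LM log-probabilities."""
    return am_logprob + lm_weight * lm_logprob

def rescore(hypotheses, lm_weight=0.6):
    """hypotheses: list of (text, am_logprob, lm_logprob); best first."""
    return sorted(hypotheses,
                  key=lambda h: fused_score(h[1], h[2], lm_weight),
                  reverse=True)

# The LM prefers the fluent hypothesis even though its acoustic score is lower.
hyps = [("a cat sat", -3.0, -2.0), ("a cat sad", -2.8, -5.0)]
best = rescore(hyps)[0][0]
```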
We release a base model (12 encoder layers, 6 decoder layers) and a large model (24 encoder layers, 12 decoder layers). They achieve the following results with shallow language model fusion:
Model | dev-clean | dev-other | test-clean | test-other
---|---|---|---|---
Base | 2.07 | 5.06 | 2.31 | 5.21
Large | 2.02 | 4.91 | 2.19 | 5.19