
This repo contains our code for "Semantic Mask for Transformer based End-to-End Speech Recognition".


cywang97/SemanticMask

 
 

ASR_SemanticMask

This repo contains our code for "Semantic Mask for Transformer based End-to-End Speech Recognition".

Preparation

We have already built a runnable Docker image. Run the following command to download and start it:

docker run -it --volume-driver=nfs --shm-size=64G j4ckl1u/espnet-py36-img:latest /bin/bash

For data preparation, I suggest reading the ESPnet instructions. Note that ESPnet doesn't do speed perturbation, but I strongly recommend it, since it gives better performance on the dev-other and test-other sets.

Word Alignment

To enable semantic mask training, you have to align the audio with the words. In our work, we use the alignment results released by this repo, which were obtained using the Montreal Forced Aligner. We put the extracted information in the data directory. start.txt and end.txt record the start and end positions, in frames, of each word in each utterance.
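To illustrate how word-level alignments can drive masking, here is a minimal sketch, not the training code in this repo. It assumes the features are a list of frames (each a list of floats), that word i covers frames `starts[i]` up to but not including `ends[i]`, and that masked words are filled with the utterance mean. The function name `semantic_mask`, the `mask_prob` default, and the exclusive end index are all assumptions made for illustration.

```python
import random


def semantic_mask(features, starts, ends, mask_prob=0.15, seed=None):
    """Return a copy of `features` with whole-word spans replaced by the utterance mean.

    features: list of frames, each a list of floats (T x D).
    starts, ends: per-word frame indices; word i is assumed to cover
        frames [starts[i], ends[i]) (exclusive end is an assumption).
    mask_prob: probability of masking each word (illustrative default).
    """
    rng = random.Random(seed)
    num_frames = len(features)
    dim = len(features[0])
    # Per-dimension mean over the whole utterance, used as the fill value.
    mean = [sum(frame[d] for frame in features) / num_frames for d in range(dim)]
    masked = [list(frame) for frame in features]
    for start, end in zip(starts, ends):
        if rng.random() < mask_prob:
            # Clamp the span so a bad alignment cannot index out of range.
            for t in range(max(start, 0), min(end, num_frames)):
                masked[t] = list(mean)
    return masked
```

Masking whole word spans, rather than random time steps, removes all acoustic evidence for a word, so the model has to rely on the surrounding context to predict it.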

Training and decoding

For training, I have uploaded my training configs to the configs folder, covering the base and large settings. Our architecture is similar to ESPnet's, but replaces the position embedding with a CNN in both the encoder and the decoder. The specific code change can be found here.

For decoding, please first download the ESPnet pre-trained RNN language model, and then run our decoding script to get the model output.

Pre-trained Models

We release a base model (12 encoder layers and 6 decoder layers) and a large model (24 encoder layers and 12 decoder layers). They achieve the following word error rates (%) with shallow language model fusion.

Model   dev-clean   dev-other   test-clean   test-other
Base    2.07        5.06        2.31         5.21
Large   2.02        4.91        2.19         5.19
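
For readers unfamiliar with shallow fusion, the sketch below shows the usual idea: at each decoding step, the ASR model's log-probability for a token is added to a weighted language model log-probability. This is a generic illustration, not the decoding script in this repo; the function names and the `lm_weight` value are hypothetical.

```python
import math


def shallow_fusion_score(asr_log_prob, lm_log_prob, lm_weight=0.5):
    """Combine ASR and LM log-probabilities for one candidate token."""
    return asr_log_prob + lm_weight * lm_log_prob


def pick_next_token(asr_log_probs, lm_log_probs, lm_weight=0.5):
    """Return the token with the highest fused score.

    Both arguments map token -> log-probability over the same vocabulary.
    """
    return max(
        asr_log_probs,
        key=lambda tok: shallow_fusion_score(
            asr_log_probs[tok], lm_log_probs[tok], lm_weight
        ),
    )


# Toy example: the LM prefers "b" strongly, the ASR model prefers "a" slightly.
asr = {"a": math.log(0.6), "b": math.log(0.4)}
lm = {"a": math.log(0.1), "b": math.log(0.9)}
best_without_lm = pick_next_token(asr, lm, lm_weight=0.0)
best_with_lm = pick_next_token(asr, lm, lm_weight=1.0)
```

In practice the fused score is accumulated over a beam of hypotheses rather than a single greedy step, and the LM weight is tuned on a development set.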
