speacher

speacher (speech teacher) is a curriculum learning toolkit for speech recognition, enabling research and development of curricula for training speech recognition models. The framework takes dataset manifests in tab-separated values (.tsv) files as input and generates an output folder with sets sub-sampled according to difficulty criteria. A variety of difficulty scoring and pacing functions are implemented. In addition, it can be used to assess speech synthesis samples with different quality metrics.
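
For reference, a minimal sketch of what an input manifest might look like, assuming fairseq S2T-style columns (the column names besides n_frames are illustrative and may differ for your dataset; fields are tab-separated):

id	audio	n_frames	tgt_text
utt_0001	/data/clips/utt_0001.wav	48000	hello world
utt_0002	/data/clips/utt_0002.wav	112000	good morning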

This framework has been designed to work with other speech recognition frameworks, like fairseq and Flashlight, which we recommend installing to enable speacher's full functionality. Fairseq is easy to install with pip, and Flashlight is best installed via Docker, as a Docker image is provided in its repository.

Install the requirements

Once you have installed fairseq, simply run the following command:

pip install -r requirements.txt

Scoring a dataset

A dataset can be scored and sorted in order of difficulty with different metrics, such as sample length, or an error metric like WER or TER. To evaluate such metrics with a pretrained ASR model, three frameworks can be used: fairseq, HuggingFace and Flashlight.

Using HuggingFace

If you want to use an ASR model from the HuggingFace repository, specify the --huggingface argument and provide the name of the model so it can be downloaded from the model zoo:

python score_dataset.py --manifest <path_to_data_manifest> --out_dir <path_to_output_manifest> --sort_manifest --scoring_function asr --asr_metric wer --asr_download_model facebook/wav2vec2-base-960h --huggingface --batch_size 4
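
Internally, scoring each sample by WER with such a model boils down to transcribing it and comparing against the reference transcript. A minimal sketch of that idea, assuming torchaudio for audio loading and jiwer for the WER computation (an illustration, not speacher's actual code):

import torch
import torchaudio
import jiwer
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Load the pretrained CTC model and its processor from the HuggingFace hub
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def score_sample(wav_path, reference):
    # Read the audio (this checkpoint expects 16 kHz input) and run greedy CTC decoding
    waveform, sample_rate = torchaudio.load(wav_path)
    inputs = processor(waveform.squeeze().numpy(), sampling_rate=sample_rate,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    hypothesis = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
    # The checkpoint emits uppercase text, so the reference is uppercased too;
    # the per-sample WER acts as the difficulty score for sorting
    return jiwer.wer(reference.upper(), hypothesis)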

Inference can be sped up by sorting samples by length, in order to reduce padding, using the "n_frames" column in the manifest TSV. Use the --scoring_sorting argument:

python score_dataset.py --manifest <path_to_data_manifest> --out_dir <path_to_output_manifest> --sort_manifest --scoring_function asr --asr_metric wer --asr_download_model facebook/wav2vec2-base-960h --huggingface --batch_size 4 --scoring_sorting ascending
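
The same length-based pre-sorting can be sketched with pandas, assuming the manifest has an n_frames column (file names here are hypothetical):

import pandas as pd

# Sort the manifest by audio length so batches group similarly sized samples
manifest = pd.read_csv("train.tsv", sep="\t")
manifest.sort_values("n_frames").to_csv("train_sorted.tsv", sep="\t", index=False)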

Currently, models from two HuggingFace users are supported: facebook and speechbrain. Here is an example using a speechbrain model:

python score_dataset.py --manifest <path_to_data_manifest> --out_dir <path_to_output_manifest> --sort_manifest --scoring_function asr --asr_metric wer --asr_download_model speechbrain/asr-wav2vec2-commonvoice-en --huggingface --batch_size 4 --scoring_sorting ascending

Other SpeechBrain models can be found in the HuggingFace repo.

Using fairseq

Use the --fairseq option to select the fairseq toolkit. With it, you will need to set the path to a pretrained model, as well as a data manifest TSV with the path to every audio sample. The fairseq framework will be called to assess those samples:

python score_dataset.py --manifest <path_to_data_manifest> --out_dir <path_to_output_manifest> --sort_manifest --scoring_function asr --asr_metric wer --asr_model <path_to_pretrained_checkpoint> --fairseq

You can also leave the --asr_model argument empty and specify the name of a downloadable model instead. In that case, the script will download a pretrained model from fairseq's repository and use it for evaluation. For instance, to download and use "s2t_transformer_s" from the model zoo:

python score_dataset.py --manifest <path_to_data_manifest> --out_dir <path_to_output_manifest> --sort_manifest --scoring_function asr --asr_metric wer --asr_download_model s2t_transformer_s --fairseq

Using Flashlight

Calling Flashlight directly is not currently supported. However, you can still pass in a log from a Flashlight test run, along with the data manifest TSV, and speacher will return the TSV sorted by the WER/TER found in the Flashlight log. Simply run the following command:

python score_dataset.py --manifest <path_to_data_manifest> --out_dir <path_to_output_manifest> --scoring_function asr --asr_metric wer --flashlight --flashlight_log <path_to_test_log> --sort_manifest

Training with a curriculum

Fixed exponential pacing function

Once you have a manifest sorted by difficulty score, you can run a fairseq training with curriculum learning using the following command:

python train_curriculum.py --manifest <path_to_sorted_manifest> --out_dir <path_to_saved_checkpoints> --starting_percent 0.04 --fairseq --fairseq_yaml <path_to_base_config_yaml> --save_curriculum_yaml --pacing_function fixed_exponential
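
As a rough illustration, a fixed exponential pacing function exposes a growing fraction of the difficulty-sorted training set, starting from --starting_percent and multiplying by a constant factor at fixed step intervals until the full set is used. A minimal sketch (increase_factor and step_length are assumed parameter names, not necessarily speacher's):

def fixed_exponential_pacing(step, starting_percent=0.04,
                             increase_factor=1.9, step_length=2000):
    # Fraction of the sorted dataset available at a given training step:
    # multiplied by increase_factor every step_length steps, capped at 1.0
    fraction = starting_percent * increase_factor ** (step // step_length)
    return min(fraction, 1.0)

# e.g. number of easiest samples allowed at step 5000:
# n_available = int(fixed_exponential_pacing(5000) * len(sorted_manifest))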

Binning pacing function

Instead of ranking individual samples by a single score, you can directly create a distribution of bins over the scoring variable to be used:

python train_curriculum.py --manifest <path_to_sorted_manifest> --out_dir <path_to_saved_checkpoints> --fairseq --fairseq_yaml <path_to_base_config_yaml> --save_curriculum_yaml --pacing_function binning --bin_variable mean_delta_f0 --n_bins 4 --bin_method qcut
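
Conceptually, the qcut method builds quantile-based bins over the chosen column, so each bin holds roughly the same number of samples (pd.cut would give equal-width bins instead). A pandas sketch with hypothetical file names:

import pandas as pd

manifest = pd.read_csv("train_scored.tsv", sep="\t")
# Four quantile-based bins over the pitch-delta variable; labels are bin indices 0-3
manifest["bin"] = pd.qcut(manifest["mean_delta_f0"], q=4, labels=False)
print(manifest["bin"].value_counts().sort_index())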

Some information about the bins will be printed on screen, such as the number of samples per bin. If you want to validate your bins against a metric like WER, in order to check whether the binning is meaningful, you can specify it like this:

python train_curriculum.py --manifest <path_to_sorted_manifest> --out_dir <path_to_saved_checkpoints> --fairseq --fairseq_yaml <path_to_base_config_yaml> --save_curriculum_yaml --pacing_function binning --bin_variable mean_delta_f0 --n_bins 4 --bin_method qcut --bin_validation_metric wer

Here, --bin_validation_metric is the name of the manifest column containing the validation metric. The mean and the standard deviation of the mean will be displayed for each bin. This is useful to check whether the binning is significant with respect to a quality metric like WER.
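
That validation step essentially aggregates the chosen metric per bin; a pandas sketch, assuming a manifest that already carries bin and wer columns (file and column names are illustrative):

import pandas as pd

manifest = pd.read_csv("train_binned.tsv", sep="\t")
# Mean and standard error of the mean of WER for each bin
print(manifest.groupby("bin")["wer"].agg(["mean", "sem"]))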
