
# Getting Started with Automatic Speech Recognition in Flashlight

This tutorial uses the following binaries:

- `fl_asr_tutorial_finetune_ctc`: finetunes a pretrained CTC acoustic model on labeled audio.
- `fl_asr_tutorial_inference_ctc`: runs inference with a CTC-trained acoustic model.
- `fl_asr_align`: force-aligns labeled audio against its transcriptions.

The wav2letter Robust ASR (RASR) recipe contains robust pretrained models and resources for finetuning, some of which are used in the Colab tutorials referenced below.

See the full documentation for more general training or decoding instructions.

## Finetuning a Pretrained Model with Already-Labeled Audio

The outline below describes the end-to-end process of finetuning a pretrained acoustic model in several steps:

  1. Preprocess the audio.

    a. Most audio formats are supported and are automatically detected.

    b. All audio used in training or inference must have the same sample rate. The provided pretrained models were trained on 16 kHz audio and require that sample rate for finetuning, so up/downsample your audio as needed (see the resampling sketch after this list).

  2. Force-align the labeled audio.

    a. Using the existing transcriptions, generate audio-text alignments with the `fl_asr_align` binary. See the full alignment documentation.

    b. Based on the alignments, trim the audio to the sections containing speech. Doing so typically speeds up training.

  3. Generate final list files for the training and validation sets using the trimmed audio and transcriptions. See the list file documentation for more details; a sketch of the format appears after this list.

  4. Use the `fl_asr_tutorial_finetune_ctc` binary to finetune the pretrained model (or train your own from scratch). List files can be passed to the finetuning or inference binaries using the `train`/`valid` or `test` flags, respectively.
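As a sketch for step 1b, either of the following commands converts audio to a 16 kHz sample rate. The file names are placeholders, and any equivalent resampling tool works just as well.

```bash
# Resample to 16 kHz with SoX (input.wav / output.wav are placeholders)
sox input.wav -r 16000 output.wav

# The same conversion with ffmpeg; -ar sets the output sample rate in Hz
ffmpeg -i input.wav -ar 16000 output.wav
```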
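For step 3, a list file describes one sample per line. The exact column semantics are defined in the list file documentation; the sketch below assumes the common layout of sample ID, audio path, duration in milliseconds, and transcription, and every ID, path, duration, and transcript shown is a made-up placeholder.

```
train_0001 /data/audio/train_0001.flac 2400 hello world
train_0002 /data/audio/train_0002.flac 3150 getting started with flashlight
```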

## Inference with a Pretrained CTC Model

See this colab notebook for a step-by-step tutorial.

The `fl_asr_tutorial_inference_ctc` binary provides a way to perform inference with CTC-trained acoustic models. To perform inference, you'll need the following components (with their corresponding flags):

- An acoustic model (AM) (`am_path`)
- A token set with which the AM was trained (`tokens_path`)
- A lexicon (`lexicon_path`)
- A language model for decoding (`lm_path`)

The following parameters are also configurable when performing inference:

- The sample rate of the input audio (`sample_rate`)
- The beam size for decoding (`beam_size`)
- The token beam size for decoding (`beam_size_token`)
- The beam threshold for decoding (`beam_threshold`)
- The language model (LM) weight for decoding (`lm_weight`)
- The word score for decoding (`word_score`)

See the complete ASR app documentation for a more detailed explanation of each of these flags, and the aforementioned Colab tutorial for sensible values used in a demo.
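Putting these flags together, an invocation might look like the sketch below. The flag names are taken from the lists above; every path and numeric value is an illustrative placeholder rather than a recommended setting, so consult the Colab tutorial and the ASR app documentation before choosing real values.

```bash
# Hedged sketch: all paths and numeric values are placeholders.
./fl_asr_tutorial_inference_ctc \
    --am_path=acoustic_model.bin \
    --tokens_path=tokens.txt \
    --lexicon_path=lexicon.txt \
    --lm_path=lm.bin \
    --sample_rate=16000 \
    --beam_size=50 \
    --beam_size_token=30 \
    --beam_threshold=100 \
    --lm_weight=1.5 \
    --word_score=0
```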

## Finetuning with a Pretrained CTC Model

See this colab notebook for a step-by-step tutorial.

The `fl_asr_tutorial_finetune_ctc` binary provides a means of finetuning a pretrained acoustic model on additional labeled audio. Usage of the binary is as follows:

```
./fl_asr_tutorial_finetune_ctc [path to directory containing model] [...flags]
```

To finetune, you'll need the following components (with their corresponding flags):

- An acoustic model (AM) to finetune (the first argument to the binary invocation, e.g. `fl_asr_tutorial_finetune_ctc [path] [...flags]`)
- A token set with which the AM was trained (`tokens`); this must be identical to the token set with which the original AM was trained, and is provided with the AM in recipes/tutorials
- A lexicon (`lexicon`)
- Validation sets to use during finetuning (`valid`)
- Train sets with the data on which to finetune (`train`)
- Other training flags for Flashlight training or audio processing, as per the ASR documentation
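A concrete invocation might look like the sketch below. The flag names are those listed above; the model path and all file paths are placeholders, and the exact value formats accepted by each flag are covered in the ASR documentation.

```bash
# Hedged sketch: [path/to/model] and all file paths are placeholders.
# tokens.txt must be the token set shipped with the pretrained AM.
./fl_asr_tutorial_finetune_ctc [path/to/model] \
    --tokens=tokens.txt \
    --lexicon=lexicon.txt \
    --train=train.lst \
    --valid=valid.lst
```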